Disjoint-Set Operations

The following problem arises in some graph algorithms:

Given a list of edges and vertices (aka "a graph"), find the disjoint sets of edges and vertices.

I think the easiest way to understand this definition is to look at an example and what it means. The following example is taken from Cormen et al, Introduction to Algorithms:

Vertices: a, b, c, d, e, f, g, h, i, j
Edges: (b, d), (e, g), (a, c), (h, i), (a, b), (e, f), (b, c)

If you look at these, then you can see for example that the edges (b, d) and (a, b) share vertices, whereas (b, d) and (h, i) do not. Finding the disjoint sets means finding all those sets that are distinct for all members (aka disjoint). In our example, the result of processing the given input data would be:

    (a, b, c, d)  (e, f, g)  (h, i)  (j)

Note that j is a vertex that is not part of any edge, hence an edgeless graph: so by definition it is disjoint to any other subgraph.

Ask yourself: Do I understand the problem?

The problem looks like a task for an algorithm. But why is it mentioned as a data type?
The reason for this is that the literature defines an abstract data type (ADT) for solving this problem: data and methods, and then proceeds to look at various implementations of this problem - very much like a heap, a stack, or a linked list is really some abstract set of data + methods that can be implemented in various ways.
In the literature, the ADT is defined as a) a set datatype and b) three methods:
1. MAKE-SET(x): create a set out of a single vertex x.
2. FIND-SET(x): find the set that has the member x.
3. UNION(u,v): given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).
Comments:
- Note that MAKE-SET(x) doesn't need to support operations such as "create a set out of two or more vertices").
- Note that this specification does not tell you how sets are represented, it tells you the operations defined on sets, and you choose a datatype best suited to implement those operations, or methods.
- An implicit requirement that is important if you think about relaxing the implementation: you must be able to compare two sets to find out if they are equal. Most definitions in the literature brush over this requirement, but it is there.
For us, input vertices will be integer numbers, and edges will be pairs of integer numbers.

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

Our set members are simply going to be integers, in a range SIZE_OF_DOMAIN.
We are going to define SIZE_OF_EDGES random edges by combining two random elements out of the universe of vertices
We will implement a couple of functions for the same task. As usual, we are going to store them in a dictionary along with their name, so that we can call them, time their performance and print out results
We will verify that all algorithms return the same output for the same input.

Therefor, our test code looks like this:

#! -*- Encoding: Latin-1 -*-

def selftest():
    import time

    # see above: test data parameters
    SIZE_OF_DOMAIN = 10000
    SIZE_OF_EDGES = 2000

    edges = [(random.randrange(SIZE_OF_DOMAIN), random.randrange(SIZE_OF_DOMAIN)) for k in range(SIZE_OF_EDGES)]
    vertices = range(SIZE_OF_DOMAIN)

    functions = {
        'using builtin set datatype' : ds_using_builtin_set_type,
        'using builtin list datatype' : ds_using_lists,
    }

    expected_result = None
    for function_name in functions:
        function = functions[function_name]
        t0 = time.time()
        result = len(function(edges, vertices))
        t0 = time.time() - t0
        print "%32s: %12.2f for result %6d" % (function_name, t0, result, ) 

        if expected_result is None:
            expected_result = result
        else:
            assert result == expected_result

if __name__ == "__main__":
    selftest()

Implementation using Python Sets

Python has builtin set datatypes which makes implementing this ADT really simple, and the code really beautiful. Remember our abstract definition:

MAKE-SET(x): create a set out of a single vertex x.

    def MAKE_SET(x):
        return frozenset([x])

FIND-SET(x): find the set that has the member x.

    def FIND_SET(x):
        for subset in sets:
            if x in subset:
                return subset

UNION(u,v): given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).

    def UNION(set_u, set_v):
        sets.add(frozenset.union(set_u, set_v))
        sets.remove(set_u)
        sets.remove(set_v)

It doesn't get more simple than that. Subsequently, the whole Python code looks like this:

def ds_using_builtin_set_type(edges, vertices):

    # given an element, create a set with only this element as member
    def MAKE_SET(x):
        return frozenset([v])

    # create big set of sets
    sets = set([MAKE_SET(v) for v in vertices])

    # find a set containing element x in a list of all sets
    def FIND_SET(x):
        for subset in sets:
            if x in subset:
                return subset

    # create a combined set containing all elements of both sets, destroy original sets
    def UNION(set_u, set_v):
        sets.add(frozenset.union(set_u, set_v))
        sets.remove(set_u)
        sets.remove(set_v)

    # main algorithm: find all connected components
    for (u, v) in edges:
        set_u = FIND_SET(u)
        set_v = FIND_SET(v)

        if set_u != set_v:
            UNION(set_u, set_v)

    return sets

To give you a feel how this algorithm works, let's look at some sample data.

Edge Processed | Collection of disjoint sets
  Initial Sets | {0}       {1}    {2}            {3}    {4}    {5} {6} {7} {8}
        (5, 4) | {0}       {1}    {2}            {3}    {4, 5}     {6} {7} {8}
        (6, 3) | {0}       {1}    {2}            {3, 6} {4, 5}         {7} {8}
        (6, 2) | {0}       {1}    {2, 3, 6}             {4, 5}         {7} {8}
        (1, 8) | {0}       {1, 8} {2, 3, 6}             {4, 5}         {7}    
        (3, 4) | {0}       {1, 8} {2, 3, 6, 4, 5}                      {7}    
        (1, 0) | {0, 1, 8}        {2, 3, 6, 4, 5}                      {7}

Implementation using Python lists

In much of the literature, the operations are implemented in more basic datastructures - linked lists, arrays and such. For example, we're going to use lists rather than sets to represent sets. UNION will be simple: it will extend one list with the items of the other. But FIND would be slow: we'd have to enumerate all items in all sets, giving us a O(N²) runtime in the worst case. So we'll use an additional hash table to speed up the process of finding elements in the lists. Let's see how this works:

MAKE-SET(x): create a set out of a single vertex x.
```
    def MAKE_SET(x):
        return [x]
```
This is actually even more simple than the version using set datatypes.
FIND-SET(x): find the set that has the member x.
```
    def FIND_SET(x):
        return set_member_lookup[x]
```
Given that set_member_lookup is our lookup dictionary, this implementation is bound to be fast. But it also kind of cheats: because we haven't specified how we keep items of the dictionary up to date.

UNION(u,v): given two sets, create a new set that is a union of both sets (i.e. includes all items of both sets).

    def UNION(set_u, set_v):
        if sets[set_u] is not None:
            if sets[set_v] is not None:
                sets[set_u].extend(sets[set_v])

                for k in array_of_sets[set_v]:
                    set_member_lookup[k] = set_u
                sets[set_v] = None

OK, so this is where the cost of the complexity is hidden: when we join two sets, we need to update all relevant items in the lookup dictionary, and we need to blank out the input set (set it to None)

The complete implementation in Python looks like this:

def ds_using_lists(edges, vertices):

    # given an element, create a set with only this element as member
    def MAKE_SET(x):
        return [x]

    # create big set of sets
    sets = [MAKE_SET(v) for v in vertices]

    # create lookup in sets
    set_member_lookup = {}
    for index, v in enumerate(vertices):
        set_member_lookup[v] = index

    # find a set containing element x in a list of all sets
    def FIND_SET(x):
        return set_member_lookup[x]

    # create a combined set containing all elements of both sets, destroy original sets
    def UNION(set_u, set_v):
        if sets[set_u] is not None:
            if sets[set_v] is not None:
                sets[set_u].extend(sets[set_v])

                for k in sets[set_v]:
                    set_member_lookup[k] = set_u
                sets[set_v] = None

    # main algorithm: find all connected components
    for (u, v) in edges:        
        set_u = FIND_SET(u)
        set_v = FIND_SET(v)

        if set_u != set_v:
            UNION(set_u, set_v)

    return filter(None, sets)

You should be able to see that the main code structure is almost identical, except for a bit of overhead for keeping the dictionary on track.

But how do they actually perform?

For a universe of 5.000 vertices with 5.000 edges, we get these results:

Method	Time
using builtin set datatype	1.36
using builtin list datatype	0.10

The results are much more dramatic if the universe of vertices is a lot bigger than the edges. For example, for 20.000 vertices and 5.000 edges, we get

Method	Time
using builtin set datatype	20.40
using builtin list datatype	0.03

Essentially, this is so because the set operation x in set basically uses linear search, while our find operation using dictionaries is quite unimpressed.

Moral of the story: while the set datatype is nice, sometimes a tuned list-based operation can be a lot faster.