 Disjointset data structure

In computing, a disjointset data structure is a data structure that keeps track of a set of elements partitioned into a number of disjoint (nonoverlapping) subsets. A unionfind algorithm is an algorithm that performs two useful operations on such a data structure:
 Find: Determine which set a particular element is in. Also useful for determining if two elements are in the same set.
 Union: Combine or merge two sets into a single set.
Because it supports these two operations, a disjointset data structure is sometimes called a unionfind data structure or mergefind set. The other important operation, MakeSet, which makes a set containing only a given element (a singleton), is generally trivial. With these three operations, many practical partitioning problems can be solved (see the Applications section).
In order to define these operations more precisely, some way of representing the sets is needed. One common approach is to select a fixed element of each set, called its representative, to represent the set as a whole. Then, Find(x) returns the representative of the set that x belongs to, and Union takes two set representatives as its arguments.
Contents
Disjointset linked lists
A simple approach to creating a disjointset data structure is to create a linked list for each set. The element at the head of each list is chosen as its representative.
MakeSet creates a list of one element. Union appends the two lists, a constanttime operation. The drawback of this implementation is that Find requires Ω(n) or linear time to traverse the list backwards from a given element to the head of the list.
This can be avoided by including in each linked list node a pointer to the head of the list; then Find takes constant time, since this pointer refers directly to the set representative. However, Union now has to update each element of the list being appended to make it point to the head of the new combined list, requiring Ω(n) time.
When the length of each list is tracked, the required time can be improved by always appending the smaller list to the longer. Using this weightedunion heuristic, a sequence of m MakeSet, Union, and Find operations on n elements requires O(m + nlog n) time.^{[1]} For asymptotically faster operations, a different data structure is needed.
Analysis of the naïve approach
We now explain the bound O(nlog(n)) above.
Suppose you have a collection of lists, each node of a list contains an object, the name of the list to which it belongs, and the number of elements in that list. Also assume that the sum of the number of elements in all lists is n (i.e. there are n elements overall). We wish to be able to merge any two of these lists, and update all of their nodes so that they still contain the name of the list to which they belong. The rule for merging the lists A and B is that if A is larger than B then merge the elements of B into A and update the elements that used to belong to B, and vice versa.
Choose an arbitrary element of list L, say x. We wish to count how many times in the worst case will x need to have the name of the list to which it belongs updated. The element x will only have its name updated when the list it belongs to is merged with another list of the same size or of greater size. Each time that happens, the size of the list to which x belongs at least doubles. So finally, the question is "how many times can a number double before it is the size of n?" (then the list containing x will contain all n elements). The answer is exactly log _{2}(n). So for any given element of any given list in the structure described, it will need to be updated log _{2}(n) times in the worst case. Therefore updating a list of n elements stored in this way takes O(nlog(n)) time in the worst case. A find operation can be done in O(1) for this structure because each node contains the name of the list to which it belongs.
A similar argument holds for merging the trees in the data structures discussed below, additionally it helps explain the time analysis of some operations in the binomial heap and Fibonacci heap data structures.
Disjointset forests
Disjointset forests are a data structure where each set is represented by a tree data structure, in which each node holds a reference to its parent node (see spaghetti stack). They were first described by Bernard A. Galler and Michael J. Fischer in 1964,^{[2]} although their precise analysis took years.
In a disjointset forest, the representative of each set is the root of that set's tree. Find follows parent nodes until it reaches the root. Union combines two trees into one by attaching the root of one to the root of the other. One way of implementing these might be:
function MakeSet(x) x.parent := x
function Find(x) if x.parent == x return x else return Find(x.parent)
function Union(x, y) xRoot := Find(x) yRoot := Find(y) xRoot.parent := yRoot
In this naive form, this approach is no better than the linkedlist approach, because the tree it creates can be highly unbalanced; however, it can be enhanced in two ways.
The first way, called union by rank, is to always attach the smaller tree to the root of the larger tree, rather than vice versa. Since it is the depth of the tree that affects the running time, the tree with smaller depth gets added under the root of the deeper tree, which only increases the depth if the depths were equal. In the context of this algorithm, the term rank is used instead of depth since it stops being equal to the depth if path compression (described below) is also used. Oneelement trees are defined to have a rank of zero, and whenever two trees of the same rank r are united, the rank of the result is r+1. Just applying this technique alone yields an amortized runningtime of O(log n) per MakeSet, Union, or Find operation. Pseudocode for the improved
MakeSet
andUnion
:function MakeSet(x) x.parent := x x.rank := 0
function Union(x, y) xRoot := Find(x) yRoot := Find(y) if xRoot == yRoot return // x and y are not already in same set. Merge them. if xRoot.rank < yRoot.rank xRoot.parent := yRoot else if xRoot.rank > yRoot.rank yRoot.parent := xRoot else yRoot.parent := xRoot xRoot.rank := xRoot.rank + 1
The second improvement, called path compression, is a way of flattening the structure of the tree whenever Find is used on it. The idea is that each node visited on the way to a root node may as well be attached directly to the root node; they all share the same representative. To effect this, as
Find
recursively traverses up the tree, it changes each node's parent reference to point to the root that it found. The resulting tree is much flatter, speeding up future operations not only on these elements but on those referencing them, directly or indirectly. Here is the improvedFind
:function Find(x) if x.parent == x return x else x.parent := Find(x.parent) return x.parent
These two techniques complement each other; applied together, the amortized time per operation is only O(α(n)), where α(n) is the inverse of the function f(n) = A(n,n), and A is the extremely quicklygrowing Ackermann function. Since α(n) is the inverse of this function, α(n) is less than 5 for all remotely practical values of n. Thus, the amortized running time per operation is effectively a small constant.
In fact, this is asymptotically optimal: Fredman and Saks showed in 1989 that Ω(α(n)) words must be accessed by any disjointset data structure per operation on average.^{[3]}
Applications
Disjointset data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The UnionFind algorithm is used in highperformance implementations of Unification.^{[4]}
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also used for implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
Note that the implementation as disjointset forests doesn't allow deletion of edges—even without path compression or the rank heuristic.
History
While the ideas used in disjointset forests have long been familiar, Robert Tarjan was the first to prove the upper bound (and a restricted version of the lower bound) in terms of the inverse Ackermann function, in 1975.^{[5]} Until this time the best bound on the time per operation, proven by Hopcroft and Ullman,^{[6]} was O(log^{*} n), the iterated logarithm of n, another slowlygrowing function (but not quite as slow as the inverse Ackermann function).
Tarjan and Van Leeuwen also developed onepass Find algorithms that are more efficient in practice while retaining the same worstcase complexity.^{[7]}
In 2007, Sylvain Conchon and JeanChristophe Filliâtre developed a persistent version of the disjointset forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq.^{[8]}
See also
 Partition refinement, a different data structure for maintaining disjoint sets, with updates that split sets apart rather than merging them together
References
 ^ Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGrawHill, 2001. ISBN 0262032937. Chapter 21: Data structures for Disjoint Sets, pp. 498–524.
 ^ Bernard A. Galler and Michael J. Fischer. An improved equivalence algorithm. Communications of the ACM, Volume 7, Issue 5 (May 1964), pp. 301–303. The paper originating disjointset forests. ACM Digital Library
 ^ M. Fredman and M. Saks. The cell probe complexity of dynamic data structures. Proceedings of the TwentyFirst Annual ACM Symposium on Theory of Computing, pp. 345–354. May 1989. "Theorem 5: Any CPROBE(log n) implementation of the set union problem requires Ω(m α(m, n)) time to execute m Find's and n−1 Union's, beginning with n singleton sets."
 ^ Knight, Kevin (1989). "Unification: A multidisciplinary survey". ACM Computing Surveys 21: 93–124. doi:10.1145/62029.62030. http://portal.acm.org/citation.cfm?id=62030.
 ^ Tarjan, Robert Endre (1975). "Efficiency of a Good But Not Linear Set Union Algorithm". Journal of the ACM 22 (2): 215–225. doi:10.1145/321879.321884. http://portal.acm.org/citation.cfm?id=321884.
 ^ Hopcroft, J. E.; Ullman, J. D. (1973). "Set Merging Algorithms". SIAM Journal on Computing 2 (4): 294–303. doi:10.1137/0202024.
 ^ Robert E. Tarjan and Jan van Leeuwen. Worstcase analysis of set union algorithms. Journal of the ACM, 31(2):245–281, 1984.
 ^ Sylvain Conchon and JeanChristophe Filliâtre. A Persistent UnionFind Data Structure. In ACM SIGPLAN Workshop on ML, Freiburg, Germany, October 2007.
External links
 C++ implementation, part of the Boost C++ libraries
 A Java implementation with an application to color image segmentation, Statistical Region Merging (SRM), IEEE Trans. Pattern Anal. Mach. Intell. 26(11): 14521458 (2004)
 Java applet: A Graphical UnionFind Implementation, by Rory L. P. McGuire
 Waitfree Parallel Algorithms for the UnionFind Problem, a 1994 paper by Richard J. Anderson and Heather Woll describing a parallelized version of UnionFind that never needs to block
 Python implementation
Categories: Data structures
Wikimedia Foundation. 2010.
Look at other dictionaries:
Data structure — In computer science, a data structure is a particular way of storing and organizing data in a computer so that it can be used efficiently.[1][2] Different kinds of data structures are suited to different kinds of applications, and some are highly … Wikipedia
Zipper (data structure) — Zipper is a purely functional data structure used in functional programming to solve some problems in a way using notions like “context” and “hole”. It is related to the generalization of notion “derivative” (for types). The zipper was described… … Wikipedia
Disjoint sets — Two disjoint sets. In mathematics, two sets are said to be disjoint if they have no element in common. For example, {1, 2, 3} and {4, 5, 6} are disjoint sets.[1] Explanation Formally, two sets A and … Wikipedia
Set (computer science) — In computer science, a set is a collection (container) of certain values, without any particular order, and no repeated values. It corresponds with a finite set in mathematics. Disregarding sequence, and the fact that there are no repeated values … Wikipedia
Data type — For other uses, see Data type (disambiguation). In computer programming, a data type is a classification identifying one of various types of data, such as floating point, integer, or Boolean, that determines the possible values for that type; the … Wikipedia
Data stream clustering — In computer science, data stream clustering is defined as the clustering of data that arrive continuously such as telephone records, multimedia data, financial transactions etc. Data stream clustering is usually studied under the data stream… … Wikipedia
List of terms relating to algorithms and data structures — The [http://www.nist.gov/dads/ NIST Dictionary of Algorithms and Data Structures] is a reference work maintained by the U.S. National Institute of Standards and Technology. It defines a large number of terms relating to algorithms and data… … Wikipedia
Algebraic data type — In computer programming, particularly functional programming and type theory, an algebraic data type (sometimes also called a variant type[1]) is a datatype each of whose values is data from other datatypes wrapped in one of the constructors of… … Wikipedia
Array data type — Not to be confused with Array data structure. In computer science, an array type is a data type that is meant to describe a collection of elements (values or variables), each selected by one or more indices that can be computed at run time by the … Wikipedia
Open set — Example: The points (x, y) satisfying x2 + y2 = r2 are colored blue. The points (x, y) satisfying x2 + y2 < r2 are colored red. The red points form an open set. The blue points form a closed set. The union of the red and blue points is a… … Wikipedia