Disjoint-set data structure

In computer science, a disjoint-set data structure, also called a union–find data structure, is a data structure that tracks a set of elements partitioned into a number of disjoint subsets. It provides near-constant-time operations to add new sets, to merge existing sets, and to determine whether elements are in the same set. In addition to many other uses, disjoint-sets play a key role in Kruskal's algorithm for finding the minimum spanning tree of a graph.

History

Disjoint-set forests were first described by Bernard A. Galler and Michael J. Fischer in 1964. In 1973, Hopcroft and Ullman bounded their time complexity by O(log^* n), the iterated logarithm of n. In 1975, Robert Tarjan was the first to prove the O(m α(n)) upper bound on the algorithm's time complexity, where α is the inverse Ackermann function, and, in 1979, showed that this was also the lower bound for a restricted case. In 1989, Fredman and Saks showed that Ω(α(n)) (amortized) words must be accessed by any disjoint-set data structure per operation, thereby proving the optimality of the data structure.
In 1991, Galil and Italiano published a survey of data structures for disjoint-sets.
In 1994, Richard J. Anderson and Heather Woll described a parallelized version of Union–Find that never needs to block.
In 2007, Sylvain Conchon and Jean-Christophe Filliâtre developed a persistent version of the disjoint-set forest data structure, allowing previous versions of the structure to be efficiently retained, and formalized its correctness using the proof assistant Coq. However, the implementation achieves the same asymptotic bounds only if it is used ephemerally or if the same version of the structure is used repeatedly with limited backtracking.

Representation

A disjoint-set forest consists of a number of elements each of which stores an id, a parent pointer, and, in efficient algorithms, either a size or a "rank" value.
The parent pointers of elements are arranged to form one or more trees, each representing a set. If an element's parent pointer points to no other element, then the element is the root of a tree and is the representative member of its set. A set may consist of only a single element. However, if the element has a parent, the element is part of whatever set is identified by following the chain of parents upwards until a representative element is reached at the root of the tree.
Forests can be represented compactly in memory as arrays in which parents are indicated by their array index.
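As a minimal sketch of the array form, assuming the elements are numbered 0 to n − 1 and that a root is marked by pointing to itself (the names parent and root_of are illustrative, not part of the description above), a Python list can hold each element's parent index:

```python
# Array-based forest: parent[i] is the index of element i's parent.
# Assumption: elements are numbered 0..n-1 and a root is its own parent.
parent = [0, 0, 1, 3, 3]  # two trees: {0, 1, 2} rooted at 0 and {3, 4} rooted at 3


def root_of(parent, i):
    """Follow parent indices until a self-parented root is reached."""
    while parent[i] != i:
        i = parent[i]
    return i


# Representative of every element, in index order.
roots = [root_of(parent, i) for i in range(len(parent))]
```

Here elements 0, 1 and 2 share the representative 0, while elements 3 and 4 share the representative 3.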

Operations

MakeSet

Makes a new set by creating a new element with a unique id, a rank of 0, and a parent pointer to itself. The parent pointer to itself indicates that the element is the representative member of its own set.
The MakeSet operation has O(1) time complexity, so initializing n sets has O(n) time complexity.
Pseudocode:
function MakeSet(x) is
    if x is not already present then
        add x to the disjoint-set forest
        x.parent := x
        x.rank := 0
        x.size := 1
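One possible Python rendering of MakeSet is sketched below. It assumes a dictionary-backed forest so that elements can be any hashable value; the class and attribute names are hypothetical.

```python
class DisjointSet:
    """Dictionary-backed forest; class and attribute names are illustrative."""

    def __init__(self):
        self.parent = {}
        self.rank = {}
        self.size = {}

    def make_set(self, x):
        """Add x as a singleton set unless it is already present."""
        if x not in self.parent:
            self.parent[x] = x  # a root is its own parent
            self.rank[x] = 0
            self.size[x] = 1


ds = DisjointSet()
for item in ["a", "b", "a"]:  # the second "a" is ignored
    ds.make_set(item)
```

Each call does a constant amount of dictionary work on average, so the sketch keeps the constant-time behaviour of the pseudocode.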

Find

The Find operation follows the chain of parent pointers from a specified query node x up the tree until it reaches a root element, whose parent is itself. This root element is the representative member of the set to which x belongs, and may be x itself.

Path compression

Path compression flattens the structure of the tree by making every node point to the root whenever Find is used on it. This is valid, since each element visited on the way to a root is part of the same set. The resulting flatter tree speeds up future operations not only on these elements, but also on those referencing them.
Tarjan and Van Leeuwen also developed one-pass Find algorithms that are more efficient in practice while retaining the same worst-case complexity: path splitting and path halving.

Path halving

Path halving makes every other node on the path point to its grandparent.

Path splitting

Path splitting makes every node on the path point to its grandparent.

Pseudocode

Path compression:
function Find(x) is
    if x.parent ≠ x then
        x.parent := Find(x.parent)
    return x.parent

Path halving:
function Find(x) is
    while x.parent ≠ x do
        x.parent := x.parent.parent
        x := x.parent
    return x

Path splitting:
function Find(x) is
    while x.parent ≠ x do
        (x, x.parent) := (x.parent, x.parent.parent)
    return x

Path compression can be implemented using iteration by first finding the root and then updating the parents:
function Find(x) is
    root := x
    while root.parent ≠ root do
        root := root.parent
    while x.parent ≠ root do
        parent := x.parent
        x.parent := root
        x := parent
    return root
Path splitting can be represented without multiple assignment:
function Find(x) is
    while x.parent ≠ x do
        next := x.parent
        x.parent := next.parent
        x := next
    return x
or
function Find(x) is
    while x.parent ≠ x do
        prev := x
        x := x.parent
        prev.parent := x.parent
    return x
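The three strategies can be compared side by side in Python. The sketch below assumes an array-based forest in which parent[i] holds the parent index of element i and roots point to themselves; the function names and the chain helper are illustrative.

```python
def find_compress(parent, x):
    """Two-pass path compression: locate the root, then repoint the whole path."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        nxt = parent[x]
        parent[x] = root
        x = nxt
    return root


def find_halving(parent, x):
    """Path halving: every other node on the path is repointed to its grandparent."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def find_splitting(parent, x):
    """Path splitting: every node on the path is repointed to its grandparent."""
    while parent[x] != x:
        nxt = parent[x]
        parent[x] = parent[nxt]
        x = nxt
    return x


def chain(n):
    """Worst-case path: element i points to i - 1, and 0 is the root."""
    return [max(i - 1, 0) for i in range(n)]


compressed, halved, split = chain(5), chain(5), chain(5)
roots = (
    find_compress(compressed, 4),
    find_halving(halved, 4),
    find_splitting(split, 4),
)
```

All three calls return the root 0. Starting from the chain [0, 0, 1, 2, 3], compression leaves [0, 0, 0, 0, 0], halving leaves [0, 0, 0, 2, 2], and splitting leaves [0, 0, 0, 1, 2].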

Union

The Union operation uses Find to determine the roots of the trees to which x and y belong. If the roots are distinct, the trees are combined by attaching the root of one to the root of the other. If this is done naively, such as by always making x a child of y, the height of the trees can grow as O(n). To prevent this, union by rank or union by size is used.

Union by rank

Union by rank always attaches the shorter tree to the root of the taller tree. Thus, the resulting tree is no taller than the originals unless they were of equal height, in which case the resulting tree is taller by one node.
To implement union by rank, each element is associated with a rank. Initially a set has one element and a rank of zero. If two sets are unioned and have the same rank, the resulting set's rank is one larger; otherwise, if two sets are unioned and have different ranks, the resulting set's rank is the larger of the two. Ranks are used instead of height or depth because path compression will change the trees' heights over time.

Union by size

Union by size always attaches the tree with fewer elements to the root of the tree having more elements.

Pseudocode
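A minimal Python sketch of both linking rules follows. It assumes an array-based forest with separate rank and size lists and uses path halving inside Find; all function and variable names are hypothetical.

```python
def find(parent, x):
    """Find with path halving (any of the Find variants would do here)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union_by_rank(parent, rank, x, y):
    """Attach the root of lower rank under the root of higher rank."""
    x, y = find(parent, x), find(parent, y)
    if x == y:
        return x  # already in the same set
    if rank[x] < rank[y]:
        x, y = y, x  # ensure x has the larger (or equal) rank
    parent[y] = x
    if rank[x] == rank[y]:
        rank[x] += 1  # equal ranks: the merged tree is one level taller
    return x


def union_by_size(parent, size, x, y):
    """Attach the root with fewer elements under the root with more elements."""
    x, y = find(parent, x), find(parent, y)
    if x == y:
        return x
    if size[x] < size[y]:
        x, y = y, x  # ensure x is the larger tree
    parent[y] = x
    size[x] += size[y]
    return x


# Union by rank on four singletons.
p_rank, rank = list(range(4)), [0] * 4
union_by_rank(p_rank, rank, 0, 1)
union_by_rank(p_rank, rank, 2, 3)
union_by_rank(p_rank, rank, 1, 3)

# Union by size on four singletons.
p_size, size = list(range(4)), [1] * 4
union_by_size(p_size, size, 0, 1)
union_by_size(p_size, size, 2, 0)
```

In the rank example the two rank-1 roots 0 and 2 are merged last, so root 0 ends with rank 2. In the size example the singleton 2 is attached under root 0, whose size becomes 3.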

Time complexity

Without path compression, union by rank, or union by size, the height of trees can grow unchecked as O(n), implying that Find and Union operations will take O(n) time.
Using path compression alone gives a worst-case running time of Θ(n + f · (1 + log_{2+f/n} n)) for a sequence of n MakeSet operations (and hence at most n − 1 Union operations) and f Find operations.
Using union by rank alone gives a running time of O(m log₂ n) for m operations of any sort, of which n are MakeSet operations.
Using path compression, splitting, or halving together with union by rank or size ensures that the amortized time per operation is only O(α(n)) for m disjoint-set operations on n elements, which is optimal, where α(n) is the inverse Ackermann function. This function has a value α(n) < 5 for any value of n that can be written in this physical universe, so the disjoint-set operations take place in essentially constant time.
Proof of O(m log^* n) time complexity of Union-Find
Statement: If m operations, either Union or Find, are applied to n elements, the total run time is O(m log^* n), where log^* is the iterated logarithm.
Lemma 1: As the find function follows the path along to the root, the ranks of the nodes it encounters are increasing.
Lemma 2: A node u which is root of a subtree with rank r has at least 2^r nodes.
Lemma 3: The maximum number of nodes of rank r is at most n/2^r.
For convenience, we define "bucket" here: a bucket is a set that contains vertices with particular ranks.
We create some buckets and put vertices into the buckets according to their ranks inductively. That is, vertices with rank 0 go into the zeroth bucket, vertices with rank 1 go into the first bucket, and vertices with ranks 2 and 3 go into the second bucket. If the Bth bucket contains vertices with ranks from the interval [r, 2^r − 1] = [r, R − 1], then the (B+1)st bucket will contain vertices with ranks from the interval [R, 2^R − 1].
We can make two observations about the buckets.

1. The total number of buckets is at most log^* n.
   Proof: When we go from one bucket to the next, we add one more two to the power; that is, the bucket after [B, 2^B − 1] is [2^B, 2^{2^B} − 1].
2. The maximum number of elements in bucket [B, 2^B − 1] is at most 2n/2^B.
   Proof: By Lemma 3, the maximum number of elements in bucket [B, 2^B − 1] is at most n/2^B + n/2^{B+1} + n/2^{B+2} + … + n/2^{2^B − 1} ≤ 2n/2^B.

Let F represent the list of "find" operations performed, and let
T₁ = Σ_F (link to the root),
T₂ = Σ_F (number of links traversed where the buckets are different),
T₃ = Σ_F (number of links traversed where the buckets are the same).
Then the total cost of m finds is T = T₁ + T₂ + T₃.
Since each find operation makes exactly one traversal that leads to a root, we have T₁ = O(m).
Also, from the bound above on the number of buckets, we have T₂ = O(m log^* n).
For T₃, suppose we are traversing an edge from u to v, where u and v have rank in the bucket [B, 2^B − 1] and v is not the root. Fix u and consider the sequence v₁, v₂, ..., v_k of nodes that take the role of v in different find operations. Because of path compression, and because the edge to a root is not counted, this sequence contains only distinct nodes, and by Lemma 1 the ranks of the nodes in this sequence are strictly increasing. Because both nodes are in the same bucket, the length k of the sequence is at most the number of ranks in bucket B, that is, at most 2^B − 1 − B < 2^B.
Therefore, T₃ ≤ Σ_{[B, 2^B − 1]} Σ_u 2^B, where the inner sum ranges over the nodes u whose rank lies in the bucket.
From Observations 1 and 2, we can conclude that T₃ ≤ Σ_B 2^B · (2n/2^B) ≤ 2n log^* n.
Therefore, T = T₁ + T₂ + T₃ = O(m log^* n).
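To give a feel for how slowly the iterated logarithm in this bound grows, the sketch below computes it in Python. It assumes the base-2 convention, in which log^* n is the number of times log₂ must be applied before the value drops to 1 or below; the function name log_star is illustrative.

```python
import math


def log_star(n):
    """Count how many times log2 must be applied until the value is at most 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count


# Towers of twos: 2, 2^2, 2^(2^2), 2^(2^(2^2)).
values = [log_star(n) for n in (1, 2, 4, 16, 65536)]
huge = log_star(10**80)
```

Under this convention the values for 1, 2, 4, 16 and 65536 are 0, 1, 2, 3 and 4, and even an input as large as 10^80 gives only 5.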

Applications

Disjoint-set data structures model the partitioning of a set, for example to keep track of the connected components of an undirected graph. This model can then be used to determine whether two vertices belong to the same component, or whether adding an edge between them would result in a cycle. The Union–Find algorithm is used in high-performance implementations of unification.
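The component and cycle queries described above can be sketched in Python as follows. This is an illustration under stated assumptions: vertices are numbered 0 to n − 1, linking is done naively without the rank or size heuristic to keep the example short, and the helper names are hypothetical.

```python
def find(parent, x):
    """Find with path halving on an array-based forest."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def creates_cycle(parent, u, v):
    """Return True if edge (u, v) would close a cycle; otherwise merge and return False."""
    ru, rv = find(parent, u), find(parent, v)
    if ru == rv:
        return True  # u and v are already connected
    parent[ru] = rv  # naive link, assumed sufficient for a small example
    return False


parent = list(range(5))
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
cycle_edges = [e for e in edges if creates_cycle(parent, *e)]
components = len({find(parent, v) for v in range(5)})
same_component = find(parent, 0) == find(parent, 2)
```

Only the edge (2, 0) is reported as closing a cycle, and the five vertices end up in two components, {0, 1, 2} and {3, 4}.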
This data structure is used by the Boost Graph Library to implement its Incremental Connected Components functionality. It is also a key component in implementing Kruskal's algorithm to find the minimum spanning tree of a graph.
Note that the implementation as disjoint-set forests does not allow the deletion of edges, even without path compression or the rank heuristic.
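A compact Python sketch of Kruskal's algorithm built on a disjoint-set forest is given below. It assumes vertices numbered 0 to n − 1, edges supplied as (weight, u, v) tuples, and union by size with path halving; the function name and the sample graph are made up for illustration.

```python
def kruskal(n, edges):
    """Return (total_weight, chosen_edges) of a minimum spanning forest.

    edges is an iterable of (weight, u, v) tuples over vertices 0..n-1.
    """
    parent = list(range(n))
    size = [1] * n

    def find(x):
        # Path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    total, chosen = 0, []
    for w, u, v in sorted(edges):  # process edges by increasing weight
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge would close a cycle, skip it
        if size[ru] < size[rv]:
            ru, rv = rv, ru  # union by size: attach the smaller tree
        parent[rv] = ru
        size[ru] += size[rv]
        total += w
        chosen.append((u, v, w))
    return total, chosen


sample_edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
total, chosen = kruskal(4, sample_edges)
```

For this sample graph the edges of weight 1, 2 and 3 are kept, giving a spanning tree of total weight 6, while the edges of weight 4 and 7 are skipped because their endpoints are already connected.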
Sharir and Agarwal report connections between the worst-case behavior of disjoint-sets and the length of Davenport–Schinzel sequences, a combinatorial structure from computational geometry.
