Implementing a Union Find Data Structure
When I first started learning clojure one of the topics that puzzled me the most was how to implement immutable datastructures efficiently. In C/Java based languages most (if not all) operations are simply reduced to moving pointers/references around. This is both efficient and straightforward most of the time.
However when your start thinking about immutable datastructures, some operations which are usually trivial, become hard or at least non-intuitive. The purpose of this blog post is to implement a Union Find datastructure and show you how this can be achieved using clojure.
Union & Find
To begin, let’s define what a UnionFind datastructure is. If you are already familiar with UnionFind, feel free to skip this section. I chose UnionFind as an interesting data structure to implement because it’s simple yet most developers (including me) don’t use it on a daily basis so implementing it will not be completely trivial.
The union find (UF) datastructure exposes two operations
find. It is used most notably in
Kruskal’s Minimum Spanning Tree Algorithm. A UF is composed of a collection of sets; given an element
you can quickly determine to which set the element belongs to (
find) and given two sets, you can
merge them (
union). UF also typically exposes a
makeset operation which, given an element, returns
a singleton set with that element in it.
To get a bit of intuition behind UF consider that you have a set of gangsters and a map indicating where was the last time that they were seen in action (possibly in some lucrative venture like stealing cookies from little girl scouts). Your superb crime fighting skills tell you that gangsters that have been spoted close to each other are likely to belong to the same gang. You are asked to identify groups of gangsters that have been spotted in the same area so that you can get an idea of which gang they belong to. Assume you also know beforehand the number of gangs operating in the city.
This classic clustering problem is easy to solve using UF:
- Form a graph with all the gangsters as vertices (a graph G, get it?)
- Find the closest pair of gangsters
(a, b)such that
find(a) != find(b)and call
union(a,b). Add an adge in the graph between
- Repeat n-k times until there are exactly k gangs.
SHOW ME THE CODEZ!
Now that we have a good intuition on why UF is an important and useful datastructure, let’s see how this would look like in Clojure. Let’s being by describing the protocol:
For the purposes of this implementation I have added two additional operations:
This is just to make the code a bit more readable but it’s specific to a
RankedUnionFind which is
one of many ways in which a UnionFind can be implemented.
RankedUnionFind holds a set of trees. Every node in the tree has 2 things: a rank (which is
basically the hight of the tree as measured from that point) and an element (in our case, an evil
parent operation will return, given an element
x the element’s parent in the tree.
find-root then returns the root of the tree and
union will merge to trees together.
Trees!?!?! I thought union find was all about sets
RankedUnionFind represents sets as trees. An element is said to be in the set if it is in the tree.
Two sets are considered the same if their root is the same. This is a neat representation because
it allows to compare sets in constant time and to determine which set an element belongs to in logarithmic
time (the height of the tree).
And now, without further delay, I give you the
Notice that this implementation holds 2 maps:
- The first one
parent-mapis a hashmap that maps elements to their parent. The astute reader will say: “Well, why not use a
Nodewhich holds a reference to their parent instead?” While this sounds very reasonable, you will get into problems when merging to trees and you need to change the root of the tree. Take a look at the implementation of
unionand see why this can happen.
- The secon one
rank-mapholds a reference to the rank of each element in the map. The reason for this is mainly because it makes the code a bit cleaner IMHO. I encourage you to try different ways in which
RankedUnionFindcould be implemented.
Another thing to notice here is that
find-root determines that a node has is a root node when
the node’s parent is itself. I could have also done this be making the node’s parent be the
value, but then the data structure would not support
Finally let’s provide a way for users of the data structure to get a new instance
So that’s a functional immutable implementation of
UnionFind in clojure. Notice that if
you strip out comments there are actually only 30 lines of code (or 40 if you count the protocol)
which just goes to show how concise clojure code can be. Out implementation, although not optimized,
is still fairly efficient. It merges sets in
O(1) and can
find an element in a blazingly fast