Generating Subsets Lexicographically with Binary Numbers and Cyclic Shifts

Last Major Update: March 20, 2011

I've spent a bit of time recently thinking over the connections between different number systems and certain algorithms. Many of my favorite algorithms and data structures are in some way based on number representations. You can generate permutations by using the factoradic number system, build fast priority queues (binomial heaps) out of binary numbers, take logarithms and search sorted ranges quickly by using Fibonacci coding, etc.

The other day, I started wondering whether it would be possible to list all subsets of some given set by using binary numbers. After all, there are 2ⁿ subsets of a set with n elements in it, just as there are 2ⁿ different ways to write out numbers with n bits. However, having seen how you can use factoradic numbers to list permutations in ascending lexicographical order, I was curious to see if you could somehow use binary representations of numbers to list subsets in ascending lexicographical order.

After staring at the problem for a while, I luckily found a solution to this problem based on binary numbers and cyclic shifting. The resulting solution is beautiful, concise, and elegant, though I confess that I don't fully understand exactly why it works. I know that it works, as I have a correctness proof that is described along with the solution, but I have little intution for it. Part of the reason for posting this here at all is to invite the readers of the internet to help shed some light on this.

The Problem: Listing Subsets Lexicographically

The initial problem I began working on is as follows: Given a set S of elements drawn from some totally-ordered universe, list all of the subsets of S in lexicographically increasing order. Of course, sets don't naturally have a notion of "lexicographic order," but if the sets are of elements that are themselves ordered, we can define a lexicographic ordering on those sets according to the natural definition: we just list all the elements of the sets in increasing order, then compare those strings lexicographically. For example, we have that

{a, b, c} < {a, b, d}

and that

{} < {a, b, c, d}

For simplicity, I'll use string notation rather than set notation to encode subsets. For example, I'll write abd instead of {a, b, d} and cd instead of {c, d}. I'll use the symbol ∅ to denote the empty set. In this notation, the subsets of the set abcd are, in lexicographically increasing order:

∅, a, ab, abc, abcd, abd, ac, acd, ad, b, bc, bcd, bd, c, cd, d

Simply listing all subsets in lexicographically increasing order is tricky, but to make the problem harder let's say that we're also interested in this follow-up question: given some number k with 0 ≤ k < 2ⁿ, what is the kth lexicographically ordered subset of S? For example, given the set abcd and the number 11, we should produce bcd. Ideally, we'd like an answer that works in time polynomial in log k, since in the worst case k = O(2ⁿ), or even better independently of k.

Looking at Binary Numbers

Because we're interested in computing the kth subset of some set S, it would be really great if we could find some way of computing that subset by just looking at the properties of the number k. As a starting point, it seems like a good idea to look at the binary representation of k, since there are 2ⁿ subsets of a set S and 2ⁿ binary numbers that have exactly n bits. One idea we might try would be to use the bits of the binary representation as indicators for what elements to pick out of the set. For example, when looking at subsets of abcd, we could have the most significant bit control whether a is in the set, the second-most significant bit control whether b is in the set, etc. If we use this approach, then we would generate the subsets in this order:

This doesn't seem very promising - the elements here are almost completely out of order. However, suppose that we change our encoding scheme so that we treat the bits of the number in reverse. That is, we'll use 0 to encode that we should choose the given number and 1 to encode that we should not choose the number. That is, the number 1001 would be the set bc, not the set ad. If we do this, then we get this table, which is still out of order but looks a bit closer to what it should:

If you'll notice, these elements are still not in lexicographic order, but there's still some semblance of order here. For example, all sets containing a precede all sets that don't contain a, and within the sets starting with a, the sets containing b precede those without b, etc. This may not be perfect, but perhaps we can massage it into lexicographic order by making some local changes to the ordering.

One thing to notice is that the empty set is completely out of position - it's at the very end, when it should be at the very beginning. What if we were to move it to the beginning by cyclically shifting all of the elements forward a step? That is, we'll move every element forward a step, moving the very last element to the front. If we do this to this ordering, we get the following:

In this image and in the images that follow, I'll use a large square brace around a group of elements to indicate that those elements should be cyclically-shifted forward one step. In this picture, since everything was cyclically-shifted one step forward, there's a square brace enclosing the entire sequence.

If you look at this new sequence, you'll notice that there are some other spots where elements seem to be in the wrong order. For example, the singleton set a is the very last set containing an a, though it should come lexicographically first. The same is true for b and c. Perhaps we should cyclically shift those elements up as well! If we do this, we'll get the following:

There are a few things to notice in this diagram. First, note that the cyclic shifts are applied left-to-right, meaning that in this case we cyclically shift the sets starting with a, then the sets starting with b, etc. before we then shift all the elements. Also, note that I've marked that d should be shifted with itself. This is mostly for completeness - a cyclic shift over a single element is a no-op since the only element just wraps back around to where it started.

The ordering that we're getting right now looks a lot better than what we started with, but it's still very much out of order. We're going to need to do some more cyclic shifting. But if you'll notice, the shifts we have so far seem to be part of a recursive structure. The outermost shift shifts all 16 elements so that we can move the last element up to the top, leaving 15 elements that still need to be positioned. Since 15 = 8 + 4 + 2 + 1, we break those fifteen elements up into groups of 8, 4, 2, and 1 and then cyclically shift all those groups. What happens if we continue this recursive pattern? For example, of the group of eight elements, the shift correctly positions the last element, leaving 7 = 4 + 2 + 1 elements behind. What if we cyclically shift those elements, then continue this subdivision again? The shifts that we'd apply this way are shown below:

And amazingly, this construction has somehow rearranged the elements into lexicographically ascending order!

Let's now be a bit more rigorous about what groups we're applying these cyclic shifts to an in what order.

Given a set S of size n, begin by listing all of the binary numbers between 0 and 2ⁿ - 1, then convert each of these binary numbers into a subset by including the kth element of S in the resulting set when the kth most significant bit is set. Then, recursively apply the shifts as follows. As a base case, if n = 0, then cyclically shift the single empty set that's generated. Otherwise:

Break the 2ⁿ - 1 first numbers generated into n - 1 groups of size 2^n-1, 2^n-2, ..., 2¹, and 2⁰, then recursively apply the cyclic shifts to those groups.
Cyclically shift all 2ⁿ elements.

To give some more concrete examples, here are the shifts that you'd perform for n = 1, 2, 3, and 4:

Shifts for n = 1

Shifts for n = 2

Shifts for n = 3

Shifts for n = 4

From Listing All Subsets to Generating One Subset

This recursive definition is nice, but it has one key flaw. When we set out to solve this problem, we were interested in solving the problem "given k, what is the kth lexicographically-ordered subset of S?" That is, we want to be able to compute the proper subset of the master set S without writing out all the subsets and then shifting them. Ideally, we should be able to figure out the result just by looking at the number k, perhaps in binary. The issue we face now is that the recursive formulation we've set up doesn't seem to provide an avenue for determining what shifts end up getting applied to any one number. It looks like we have to compute all the shifts in order to know what shifts need to be applied to any particular number k.

However, this isn't anything to worry about. Suppose, for example, that we're given some subset and are curious about its index in the lexicographic ordering of subsets. If we know what binary number corresponds to that set, then we can just "pump" the set through all of the shifts to determine its final resting place in the overall sequence. In other words, we can use the scaffolding of the cyclic shifts to move single elements around, even if we don't have all of the elements written out.

As an example, suppose that we were curious where the subset abc of the set abcd would appear in the final lexicographic ordering of abcd's subsets. Then we could pump it through the cyclic shifts as follows:

Tracing the path followed by the subset abc through the shift network

With this setup, we can see where abc would have ended up in the final lexicographic order, but the above picture is a bit hand-wavy about how we actually keep track of what index abc is at in each step. To be more precise about how we answer the question "what index does this subset end up at?," we'll repeat the above process. However, at each step, we'll keep track of the index that the current set occupies rather than the contents of the set. If we do this, then we get the following:

Tracing the index of abc through the shift network

From this, it's easier to see that the resulting index is the binary number 0011, and so we see that the subset abc occupies position three (zero-indexed) in the overall sequence.

With this construction, we can map a subset to its position in the lexicogrphic ordering of subsets. By reversing this construction, we can also determine, given an index, what subset is at that index. For example, suppose that we want to find the subset that appears at position one (zero-indexed) in the lexicographic ordering of subsets. We start off by writing out one in binary (0001), then run the cycles backward by shifting in the opposite direction. If we do this, then we get this result:

Tracing the position of 0001 through the shift network

Once we invert the shifts, we see that the set that ends up at position 0001 begins at position 0111. This corresponds to the set a, which is indeed at position one in the lexicographic ordering of sets (following the empty set).

Interpreting Cyclic Shifts in Binary

We now know that if we know what shifts we need to apply to an element, we can convert between indices and subsets. But how can we determine what shifts need to be applied? So far we've relied on having all of the shifts explicitly spelled out for us, but that seems wasteful; why genereate a whole bunch of shifts that we won't end up using? Ideally, we'd like to have some way to figure out what shifts need to be applied just by looking at the number itself.

There's a truly remarkable observation we can make about what elements get cyclically shifted in groups of which size. Below I've reprinted the shifts that get performed for the case where n = 3. In the n = 3 case, there is one shift that cyclically permutes four elements. I've highlighted the first bit of all of those elements:

Shifts for n = 3, highlighting the first bit.

Notice that these four elements, which will eventually have a four-element cyclic shift applied to them, all have a zero in their first bit! Moreover, notice that the elements that are not cyclic shifted in a group of four have a one in this first bit!

What if we applied this same logic to cyclic shifts of two elements? Again, there are four elements that have a two-element cyclic shift applied to them, and I've highlighted below their second bit:

Shifts for n = 3, highlighting the second bit.

Now that's interesting! We have the same pattern here: numbers with their second bit as a zero get shifted in a two-cycle, while numbers with a one in their second bit do not.

This observation suggests a more general trend: the shifts in a number can be determined from the zeros in that number's binary representation. In particular, if the bit for 2^k is zero, then the number is part of a 2^k cycle.

But wait a minute - in our construction, after doing all the local cyclic shifts, we do one more "big" cyclic shift at the end that cycles all the elements. This doesn't seem to be encoded in the binary representations of the numbers anywhere. However, we can easily fix this by writing out the numbers using one more digit that is necessary. In fact, if we try writing out these numbers using n + 1 bits, then the first bit will always be zero. Given our above rule, this means that we will shift everything as part of a 2ⁿ⁺¹-cycle. Here's a sample of the encoding scheme for n = 3 with the extra zero out in front:

Shifts for n = 3, highlighting the zero bit in front.

It now seems that we have an algorithm that, given some starting index, tells us what transformation to apply to the binary representation of the number to see at what index that number will end up in the final permutation. We simply write the number out in binary (using one more bit than is necessary), then apply cyclic shifts whenever we find a zero in the number. Because we always apply smaller shifts before larger ones, we'll work from the right to the left when applying these shifts.

When we have all the elements written out in sequence, the definition of a cyclic shift is graphical and intuitive - we just push everything forward a step and move the last element to the front. But what does a cyclic shift mean at the binary level, when all we have is a numeric representation of the index? Perhaps seeing a few examples will help us figure this out. Let's look at the above diagram for n = 3 and think about what happens when we cyclically shift all the elements as part of the big 8-cycle. In this setup, we have that 0000 goes to 0001, 0001 goes to 0010, etc. The only interesting case is that 0111 goes to 0000. What about the four-cycle? Well, we have that 0000 goes to 0001, which goes to 0010, which in turn goes to 0011, which finally returns to 0000. Finally, let's look at the two-cycles. One of them maps 0000 to 0001 and 0001 to 0000. Another maps 0100 to 0101 and 0101 to 0100.

There are definitely some trends at work here, and they're probably easiest to see if we know what bits to look at. Take the four-cycle, for example. If you'll notice, in the four-cycle, only the lower two bits of the number are affected. Moreover, those bits map as 00 → 01 → 10 → 11 → 00. This looks surprisingly like binary addition modulo four! Similarly, take a look at the two-cycles, which only change the very last bit. That bit cycles 0 → 1 → 0, which is just binary addition modulo two. More generally, we have that a 2^k cyclic shift can be implemented by adding one to the last (k - 1) bits, modulo 2^k.

Let's walk through an example of this. Suppose that we're interested in finding the final resting place of the subset acd of the set abcd. We begin by converting the set into a binary number, remembering that 0 bits mean "present" and 1 bits mean "not present." This gives us 0100, and by prepending an extra 0 to this we get 00100. Now, we scan across the bits from right to left apply shifts as appropriate. The leftmost bit is a zero, which means that we should do a 1-cycle, which is a no-op. The bit before that is also a zero, so we should do a 2-cycle by looking at the last bit, adding one, modulo two. This gives us 00101. As we continue across the bits, we see that the 4's bit is a one, so we don't cycle the number, but the 8's bit is a zero, meaning that we do an eight-cycle. This means that we take the last three bits (101) and add one modulo eight to get 110, giving us the resulting number 00110. Finally, we look at the 16's bit, which is a zero. This means that we need to do a sixteen-cycle, which can be done by adding one to the last four bits (0110) mod sixteen, yielding 00111. This says that the subset acd should end up in position 7 (zero-indexed) lexicographically. If you refer back to the example with n = 4, you'll notice that this is precisely correct.

The problem with this setup is that we've solved the wrong problem! The above logic can be used to answer the question "given a subset, what index is it in the lexicographic ordering of all subsets of S," whereas we're interested in the question "given an index k, what is the kth lexicographically-ordered subset of S?" Fortunately, we can easily remedy this by just reversing the process. Rather than scanning across the bits from right-to-left and doing cyclic forward shifts, we'll scan across the bits from left-to-right and do cyclic backward shifts. Just as cyclic forward shifts can be thought of as adding one modulo some power of two, cyclic shifts backward can be though of as subtracting one modulo some power of two.

To see how this works, let's work out an example. Suppose that we're interested in finding the twelfth subset of abcd. We start off by writing twelve in binary, prepending an extra zero bit, and get 01010. Starting with the leftmost bit (the 16's bit), we notice that it is a zero, so we should do a sixteen-element cyclic shift backwards. This is equivalent to subtracting one from the binary number 1010, modulo sixteen, which yields 1001, and so our number is now 01001. We now examine the 8's bit, which is one, and so we do nothing. Next, we look at the 4's bit, which is zero, and so we do a four-element cyclic shift backwards. This is equivalent to subtracting one from the last two bits (01) modulo four, yielding 00. This gives us 01000. We then examine the 2's bit, which is zero, so we do a two-element cyclic shift backwards. This is equivalent to subtracting one from the last bit (0) modulo two, which yields 1, giving us 01001. Finally, we examine the 1's bit, which is 1, and so we do nothing. We now have the resulting binary number 01001. The first zero is there just to help control the shifts that we use, so we can ignore it. However, the remaining four bits (1001) encode which elements of the set we should pick. In particular, since the second and third bits are zero, we end up with the subset bc. Sure enough, if you look back to the lexicographic ordering of subsets of abcd, bc is in position twelve (zero-indexed, of course).

Proving the Construction Correct

So far all of the examples we've tried have worked out correctly, but there's no reason to believe that this set of weird bit twiddlings and shifts should magically work out and result in the correct lexicographic ordering of all subsets. In order to have certainty that this works, we'll need to prove it correct. Fortunately, there's a very elegant inductive proof of the construction. The proof shows that the cycles suggested by the binary representation correctly rearrange the elements as intended.

Theorem: The above construction correctly lists subsets in lexicographically ascending order for any set S.

Proof: By induction on |S|, the number of elements in S. As a base case, consider the case when |S| = 0, in which case S = ∅ Then the construction says to write out all the numbers in the range 0 to 2⁰-1 using 0 + 1 = 1 bits, and so we write out the number 0. The construction then says that we should cyclically shift one element, which is a no-op. Finally, we scan across all the numbers we have and convert each to a set by discarding the first zero bit and then considering the rest of the bits to determine the elements of the set. However, there are no more bits, and so the set we generate for the number 0 will be the empty set, which is correctly the only subset of ∅. Thus the claim holds for n = 0.

For the inductive step, assume that for any S with |S| = n, the claim holds and consider any set S with size |S| = n + 1. We can then write S = {a} ∪ S', where a is the lexicographically first element of S and S' are the remaining elements of S. Now, suppose that we write out the first 2ⁿ⁺¹ numbers in binary using n + 2 bits. The structure of binary numbers is such that the first 2ⁿ of these and the last 2ⁿ of these will be identical except for the first two bits, which will be 00 in the first half and 01 in the second half. This is shown here graphically:

Bit patterns duplicated in binary counting

The binary numbers from 0 to 2ⁿ⁺¹ - 1 contain two copies of the binary numbers from 0 to 2ⁿ - 1.

Now, let's think about what cyclic shifts will be in place here. Because the binary numbers we have here contain the binary numbers from 0 to 2ⁿ-1 twice, many of the shifts that would occur for the subproblems of size n will still be present. However, we must be careful, because when working with a subproblem of size n, the solutions assume that the shifts are correct when using n + 1 bits, with the first bit being a zero. In our case, this means that the first half of the numbers corresponds exactly to the subproblem with n elements, but the second half of our problem does not. In particular, the topmost shift that would have been applied in the subproblem of size n is no longer being applied, so the elements in the second half are not going to be in lexicographic order.

However, there is one trick we can do to make it easier to reason about the contents of the second half of the elements. Essentially, we can think of those elements as being completely in lexicographic order, but then being reverse-shifted one step. We can do this because we can treat the absence of a top-level shift as identical to the composition of a cyclic shift forward followed by a cyclic shift backwards. From here on, we'll assume (based on the inductive hypothesis) that the elements in the latter half of the array are in the correct order, but have been reverse-shifted one step.

At this point, we have interpreted the binary representations of the numbers as yielding two copies of the shifts from the subproblems, plus one shift in the opposite direction in the latter half. We have not yet applied the final top-level shift that will cycle all the elements forward one step. More importantly, though, we haven't talked about what subsets are going to be produced by the binary numbers we have here after we apply all but the top-level shift. To answer this, we'll look at the recursive structure of binary numbers. Each of the numbers we have here are written out as n+2 bits, of which the first bit is zero. The bit in the (n+1)st position controls whether the lexicographically least element (in this case, a) is present in the set, and the remaining bits control whether the elements of S' are in the set. Since both the first and last half of the numbers contain all possible bit patterns for the last n bits, this means that the first half of the elements are formed by computing all subsets of S', then adding in a, while the latter half of the elements are just all possible subsets of S'.

My claim is that if you ignore the top-level shift and the very last element in the sequence, that the elements will be put into lexicographic order. To see this, consider any two indices i < j in the final ordering (after doing all the shifts except the last). There are three cases to consider:

Case 1: i < 2ⁿ and j < 2ⁿ. Then both sets corresponding to those indices came from the first half of the numbers, so they both contain a. Thus set i comes before set j iff set i compares lexicographically before set j ignoring the element a. By the inductive hypothesis, the set of shifts performed on the remaining n bits will order the elements in the first half of the numbers into lexicographic order, and so set i will compare lexicographically lower than set j.
Case 2: i ≥ 2ⁿ and j ≥ 2ⁿ. Then using similar logic as above, set i lexicographically precedes set j by using the inductive hypothesis. Note that we can ignore the fact that the elements were cyclically shifted here, since by ignoring the last set generated we ignore the only element whose relative position has changed.
Case 3: i < 2ⁿ and j ≥ 2ⁿ. Then the set i contains a and the set j does not. Moreover, set j is not the empty set, because after cyclically shifting the lexicographically-ordered elements of S' backwards one step, the empty set must be the last set generated, and we've stated that set j cannot be this set. Thus set j is a nonempty set that does not contain a, so it lexicographically follows set i.

In other words, of the 2ⁿ⁺¹ sets we've generated, the first 2ⁿ⁺¹-1 of them are lexicographically ordered. The last set, as mentioned above, must be the empty set, and so it should come lexicographically first. Thus if we do apply the final top-level shift, we'll end up with the empty set followed by the nonempty subsets of S ordered lexicographically, and so the resulting sequence is the subsets of S in lexicographically ordered. QED

Analyzing the Algorithm

At this point we have come up with a construction that uses properties of binary numbers to list all of the subsets of some set S in lexicographic order. Moreover, we have the ability to take any subset and return its index in the order, or conversely to take any index and get the subset at this index. All that's left to do now is analyze the runtime complexity of these algorithms. Since one algorithm is the inverse of the other, and since they rely only on simply invertible operations (shifts, addition, and subtraction), I'll just analyze the algorithm from going from a set to an index. The other way around has the same complexity.

Given some subset T ⊆ S, where |S| = n and |T| = k, we can construct the binary indicator representation of T in time O(k lg n) assuming that S is stored in either a binary search tree or sorted array. To do this, for each of the k elements in T, we determine its order (its position in the sorted order) and then set the appropriate bit of our binary number. This lookup can be done in O(lg n) time using a binary search in S.

Once we've built up this binary number, we need to start applying shifts to it. For each of its O(n) bits, if that bit is a zero, we take the remaining bits, then add one modulo the appropriate power of two. For arbitrary-precision words, this takes time at most O(n), since we might have to flip O(n) different bits. This step thus runs in O(n²) time.

Overall, this gives us a runtime of O(k lg n + n²), and since k = O(n), the entire operation is O(n²). We thus have a polynomial-time algorithm for generating the kth subset in lexicographic order, or vice-versa.

Concluding Remarks

Although I understand this construction and its proof of correctness, I have absolutely no intuition as to why it should work. The series of shifts seems pretty much arbitrary, as was the particular mapping from binary numbers to sets. That said, I think this is a beautiful algorithm when presented in its simplest form. It allows you to take a binary number and through the properties of that number alone map it into a subset with particular properties.

If you're interested in an implementation of this algorithm, I have a C++ implementation of the algorithms described here. The logic to generate the kth lexicographic subset was remarkably simple - I was actually able to implement the entire algorithm on my first try - while the code for going the other way around is a bit trickier.

I have not seen this algorithm used elsewhere, though I cannot believe that in all of the years of mathematics and computer science that no one else has thought it up. If anyone has a reference they could point me at that explores this algorithm in any detail, I'd love to read more about it and its properties.

If you have any thoughts or questions about this algorithm or the writeup, please let me know at htiek@cs.stanford.edu. I'd love to hear your thoughts!