-- File: random-access-list.hs
-- Author: Keith Schwarz (htiek@cs.stanford.edu)
--
-- An implementation of the random access list data structure. The random
-- access list is a data structure, invented by Prof. Chris Okasaki, that
-- represents a list of elements that supports random access, appending, and
-- remove-last all in O(log n) time. The data structure is purely functional,
-- meaning that any operation on the data structure produces two separate
-- random access lists - one holding the original list, and the other holding
-- the list formed by applying the particular operation.
--
-- The random access list is based on the skew binary number system. In the
-- skew binary system, numbers are written out using 0s, 1s, and 2s.
-- Each place in the number represents some multiple of 2^{n + 1} - 1. For
-- example, the number 120 would be
--
-- 1 x (2^3 - 1) + 2 x (2^2 - 1) + 0 x (2^1 - 1)
-- = 1 x 7 + 2 x 3 + 0 x 1
-- = 7 + 6
-- = 13
--
-- Unlike many other number systems, it is possible to represent the same
-- number in many ways using the skew binary system. For example, the
-- number 8 could be written either as 101 (1 x 7 + 1 x 1) or as 22 (2 x 3 +
-- 2 x 1). However, there is a unique way of writing out each number in a
-- "canonical" skew binary format. This format is structured as follows: each
-- number has at most one 2 in it. If there is a 2 anywhere in the number,
-- then all digits that follow the 2 must be 0.
--
-- For example, the first few numbers written in canonical skew binary would
-- be
--
-- #: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
-- 0, 1, 2, 10, 11, 12, 20, 100, 101, 102, 110, 111, 112, 120, 200, 1000
--
-- There is a beautiful pattern for incrementing one skew binary number to
-- the next. If the number contains a 2 anywhere, replace that 2 with a 0 and
-- increment the preceding digit. For example:
--
-- 112 (12) + 1 -> 120 (13) + 1 -> 200 (14) + 1 -> 1000 (15)
--
-- This works correctly because if there is a 2 digit somewhere and one is
-- added, the quantity represented by that column goes from
--
-- 2 x (2^{n+1} - 1) + 1 = 2^{n+2} - 2 + 1 = 2^{n + 2} - 1
--
-- which happens to be the value in the next column.
--
-- Otherwise, if the number 2 does not appear in the representation, then the
-- last digit of the number should be incremented (which simulates adding 1 to
-- the total).
--
-- The random access list uses this number system in the following way. Note
-- that 2^{n+1} - 1 is the number of nodes in a complete binary tree of height
-- n. You can see this here:
--
-- *
-- / \
-- * * *
-- / \ / \ / \
-- * * * * * * *
-- height: 0 1 2
-- nodes: 1 3 7
--
-- The random access list works by storing the n nodes in the list in a series
-- of complete binary trees. It does so by choosing a collection of complete
-- binary trees such that the number and ordering of the trees is given by the
-- digits of the n's canonical skew binary representation. For example, if we
-- have 13 nodes (120 in skew binary), we would have 1 tree of height 2, 2
-- trees of height 1, and 0 trees of height 0. We would then order the trees
-- so that the tree of height 2 came first, then the two trees of height 1.
--
-- The advantage of this representation is that it is extremely easy to insert
-- additional nodes into this structure. Suppose that we want to add one more
-- node into the representation. To do so, we can look at the smallest trees
-- in the collection. If there are two of them, then we insert the new node as
-- the root of a tree with the two old trees as subtrees. Otherwise, we insert
-- a new tree of height 0. Note how this parallels the increment operation for
-- skew binary numbers: if there is a 2 (which must be followed purely by 0s),
-- we increment the preceding digit (by adding in a new tree of that height).
-- Otherwise, we insert a new tree of height 0, which corresponds to increment
-- the very last digit.
--
-- This information is, for now, sufficient to support the append and remove-
-- last operations on the random access list. The details of how to get
-- random access are described below.
module RandomAccessList where
-- Type: Tree
--
-- A type representing a binary tree. For simplicity later on, we will cache
-- in each node the total number of nodes in the tree rooted at that node.
data Tree a = Node { value :: a
, size :: Integer
, left :: Tree a
, right :: Tree a}
| Nil deriving (Show)
-- Utility function: makeLeaf
--
-- Creates a leaf node out of the specified value.
makeLeaf :: a -> Tree a
makeLeaf a = Node a 1 Nil Nil
-- Utility function: joinTrees
--
-- Joins two trees together into a single tree, using the value provided as the
-- root of the new tree.
joinTrees :: Tree a -> Tree a -> a -> Tree a
joinTrees lhs rhs root = Node root (1 + size lhs + size rhs) lhs rhs
-- Type: RandomAccessList
--
-- A type representing a random access list. The random access list can be
-- one of the following:
--
-- 1. Empty
-- 2. A single tree of some order, followed by a RandomAccessList, or
-- 3. Two trees of the same order, followed by nothing.
--
-- For simplicity, we will actually store the random access list in reverse,
-- with the first element at the end and the last element at the front. The
-- reason for this is that it gives us O(1) insertion time. Once we know which
-- case we are in, it only requires O(1) operations to update the list. Since
-- all that matters in determining this case is the type and number of trees at
-- the end of the list, this reversal strategy dramatically improves the
-- performance.
type RandomAccessList a = [Tree a]
-- Operation: append
--
-- Appends the specified value to the given RandomAccessList.
append :: RandomAccessList a -> a -> RandomAccessList a
-- Case 1: The list is empty. Then the resulting list is just a single tree.
append [] value = [makeLeaf value]
-- Case 2: The list has just one tree. Then we prepend a tree of order 0 to
-- the list.
append [tree] value = [makeLeaf value, tree]
-- Case 3: The list has two or more trees. Now we split into cases.
append list@(t1@(Node _ s1 _ _) : t2@(Node _ s2 _ _) : ts) value =
if s1 == s2
-- Case 3a: The two trees have the same order. Then we join them together
-- and prepend the result to the resulting list.
then (joinTrees t1 t2 value) : ts
-- Case 3b: The two trees have differing order. Insert a singleton tree of
-- height 0.
else (makeLeaf value) : list
-- Operation: last
--
-- Returns the last element of the random access list.
last :: RandomAccessList a -> a
last ((Node value _ _ _) : _) = value
-- Operation: isEmpty
--
-- Returns whether the given RandomAccessList is empty.
isEmpty :: RandomAccessList a -> Bool
isEmpty list = null list
-- Operation: length
--
-- Returns the length of a RandomAccessList.
length :: RandomAccessList a -> Integer
length list = let addSize (Node _ s _ _) acc = s + acc in
foldr addSize 0 list
-- Operation: removeLast
--
-- Removes the last element of the random access list.
removeLast :: RandomAccessList a -> RandomAccessList a
removeLast ((Node _ s left right) : remainder) =
-- Case 1: The tree has height 0 (and therefore size 1). We just remove the
-- tree.
if (s == 1) then remainder
-- Case 2: The tree has height > 0. We then remove the root of that tree
-- and place the left and right children back into the list.
else left : right : remainder
-- Now, we get to talk about the fun part - how do you do random access on a
-- random access list?
--
-- The basic idea behind this algorithm is the following. Our list is stored
-- as a collection of trees in which new nodes are always added to the trees
-- at the end of the list (in our representation, we store the list backwards,
-- so those new elements are at the front of the list). Consequently, if we
-- want to do random access, need to figure out which tree that element is in,
-- then look up the node within that tree.
--
-- Given that our trees are represented in reverse order, if we want to find
-- the kth element of a list, we actually want to find the (n - k - 1)st
-- element of the reversal of the list. This observation is important for
-- later on.
--
-- We now need to think about where each node will end up. Let's suppose that
-- we try to build up a random access list of 7 values (numbered 0 through 6).
-- The sequence of snapshots of the data structure will look like this:
--
-- (empty)
--
-- 0
--
-- 1 0
--
-- 2
-- / \
-- 1 0
--
-- 2
-- / \
-- 3 1 0
--
-- 2
-- / \
-- 4 3 1 0
--
-- 5 2
-- / \ / \
-- 4 3 1 0
--
-- 6
-- / \
-- 5 2
-- / \ / \
-- 4 3 1 0
--
-- Notice the general structure of each tree. In particular, the very last
-- element inserted as at the root, and the left subtree is newer than the
-- right subtree.
--
-- But remember - we're searching in reverse order! If we want the kth element
-- the sequence, we actually the n - k - 1st element. With that observation,
-- the root element is actually element 0, then the left subtree contains the
-- next range of elements, and finally the right subtree holds the remaining
-- elements of the tree. So let's suppose that there are a total of n nodes
-- in the left subtree. Then we can recursively select the kth newest element
-- in the tree as follows:
--
-- If k = 0, return the root.
-- If 1 <= k <= n, return the (k - 1)st newest element in the left subtree.
-- Otherwise, return the (k - 1 - n)th newest element in the right subtree.
--
-- We can turn that directly into Haskell code with one last observation: since
-- we are working with perfect binary trees, if the overall tree has a total of
-- k nodes in it, the left and right subtrees each have (k - 1) / 2 nodes in
-- them.
kthNewestInTree :: Tree a -> Integer -> a
kthNewestInTree (Node value size left right) k =
if k == 0 then value
else
let subtreeSize = (size - 1) `quot` 2 in
if 1 <= k && k <= subtreeSize then
kthNewestInTree left (k - 1)
else
kthNewestInTree right (k - 1 - subtreeSize)
-- Notice that this algorithm has time complexity proportional to the height of
-- the tree. Since each tree is a perfect binary tree of some height h, the
-- runtime will be O(h), since we descend into exactly one subtree at each
-- level.
--
-- Since we have a total of n elements in our list overall, we can relate h and
-- n. A perfect binary tree with n nodes can have height at most O(log n).
-- In the worst case, all nodes in the random access list belong to a single
-- tree. Consequently, h = O(log n), and so on a list of length n the time
-- required to do a lookup in a single tree is O(log n). Great!
--
-- Now that we have the ability to select the kth newest element from a single
-- tree, we can do random access. To find element number k, we will find the
-- (n - k - 1)st element overall. Since the trees are sorted from newest to
-- oldest (because our list is stored in reverse order), we can just skip from
-- tree to tree until we find the tree in which the given element belongs.
--
-- More specifically, let's suppose that we're looking for the zth *newest*
-- element (remember that this is the (n - k - 1)st overall element in the
-- logical list). Starting at the first tree, we can check if z is less than
-- the total number of elements in this tree (let's suppose there are t total
-- nodes in this tree). If z < t, then we use our above function to select
-- the zth newest element in the tree. Otherwise, we can skip past all of
-- these nodes. We're now looking for the (z - t)th biggest element of the
-- next tree. Recursively repeating this process will eventually find the node
-- we want.
--
-- The Haskell code for this is below; I'll discuss algorithmic efficiency
-- right afterwards.
select :: RandomAccessList a -> Integer -> a
select list k =
let selectNewest (tree@(Node _ size _ _) : rest) k =
if k < size then kthNewestInTree tree k
else selectNewest rest (k - size)
in selectNewest list ((RandomAccessList.length list) - k - 1)
-- So now we need to analyze the efficiency of this algorithm. In the worst
-- case, we have to skip over all the trees in the list, so the worst-case
-- number of trees visited is given by the worst-case number of trees in the
-- list. We also have to do O(log n) work at the end due to the call to
-- kthNewestInTree that's buried in there.
--
-- To determine the maximum number of trees possible, it's useful to note that
-- there are at most two trees of any individual order in a random access list.
-- (It's never possible to have two trees of two or more different orders, but
-- this gives a useful upper bound). In the worst case, we would have trees
-- of all possible orders. Therefore, if there are 2k trees, then there would
-- have to be at least
--
-- 2 x (2^1 - 1 + 2^2 - 1 + 2^3 - 1 + ... + 2^k - 1)
-- = 2 x (2^0 + 2^1 - 1 + 2^2 - 1 + 2^3 - 1 + ... + 2^k - 1 - 2^0)
-- = 2 x (2^0 + 2^1 + ... + 2^k - k - 2^0)
-- = 2 x (2^{k + 1} - 1 - k - 1)
-- = 2^{k + 2} - k - 2
-- < 2^{k + 2}
--
-- nodes in the collection. Therefore, we have that the maximum possible k
-- for which we could have two trees of each order is bounded by the maximum
-- k for which
--
-- 2^{k + 2} <= n
-- k + 2 <= lg n
-- k <= lg n - 2
--
-- Therefore, the maximum possible value for k is O(log n). Consequently, we
-- know that the maximum number of trees we have to look at is O(log n), and so
-- the lookup time in a random access list is given by O(log n) time to find
-- the right tree, then O(log n) time to search the tree, yielding an overall
-- runtime of O(log n). This is excellent performance!
--
-- In short, our data structure supports
--
-- isEmpty in O(1)
-- length in O(log n)
-- append in O(1)
-- removeLast in O(1)
-- select in O(log n)
--
-- This is dramatically better than using just a standard linked list!