CS166 Archive

This page contains archived versions of the Stanford CS166 (Data Structures) webpage in the quarters I've taught it. All internal links should be valid, though external links may no longer be functional.

2018-2019 Academic Year
Spring 2019

As with most classes I taught in the 2018-2019 academic year, this version of CS166 represented a pretty significant overhaul of the course materials, and I think it's been a huge step in the right direction.

The major change I made to the lectures on this iteration was to explore, much more so than usual, where these data structures come from and how to derive them from first principles. The lecture on suffix trees, for example, explored Patricia tries in the abstract and how to arrive at them by optimizing regular tries for fast searching, and talked about the distinction between operational and mechanical descriptions of structures (mechanically, a suffix tree is what you get if you make a trie for all the endmarked suffixes and eliminate non-branching internal nodes; operationally, it's a tree with one internal node per branching word and one leaf per suffix). The lecture on suffix arrays spent more time talking about why it is that suffix arrays don't lose much relative to suffix trees, and actually explored how to construct LCP arrays using Kasai's algorithm (which, as I found, is not very hard to understand!). In doing this, I trimmed out some material on generalized suffix trees and suffix arrays and on longest common extensions, which I was sad to see go. My notes to myself indicate that there's about ten minutes of free time left in those lectures, so perhaps I can put a shout-out to that material back in there. Overall, though, redesigning these slides gave me a much deeper appreciation for how these data structures work, and I'm glad I did this! To get this to fit, I split the lecture on suffix trees and suffix arrays apart into two and cut the lecture on Aho-Corasick automata, which I was initially sad to see go but honestly am not losing any sleep over at this point. I think the topic refocusing worked great!
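
Since Kasai's algorithm keeps coming up, here's a minimal sketch of it in Python for my own reference (the identifiers are placeholders of mine, not anything from the course starter code): given the text and its suffix array, it fills in the LCP array in linear time by reusing the previous suffix's match length.

    def kasai_lcp(text, sa):
        """Build the LCP array from a text and its suffix array in O(n).

        lcp[i] is the length of the longest common prefix of the suffixes
        starting at sa[i] and sa[i - 1]; lcp[0] is 0 by convention."""
        n = len(text)
        rank = [0] * n                    # rank[i] = where suffix i sits in sa
        for i, s in enumerate(sa):
            rank[s] = i
        lcp = [0] * n
        h = 0                             # current match length, reused across iterations
        for i in range(n):                # walk suffixes in text order, not sa order
            if rank[i] > 0:
                j = sa[rank[i] - 1]       # suffix just before suffix i in sorted order
                while i + h < n and j + h < n and text[i + h] == text[j + h]:
                    h += 1
                lcp[rank[i]] = h
                if h > 0:
                    h -= 1                # the next suffix's match can shrink by at most one
            else:
                h = 0
        return lcp

As a quick check, kasai_lcp("banana$", [6, 5, 3, 1, 0, 4, 2]) gives [0, 0, 1, 3, 0, 0, 2], and the fact that h drops by at most one per step is exactly where the linear-time bound comes from.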

I also did a major revision of the lecture on B-trees to show where B-trees come from (how would you build a balanced multiway tree when there's a cap on the number of keys per node?) and how to derive the standard invariants from the basic data structure operations, rather than the other way around. In doing so, I had to slightly scale back the discussion of where red/black trees come from, which is really a shame. In the future, I might cut the discussion of split and join for red/black trees (those are tough topics that rarely make an appearance and can be better done with splay trees) and then put the discussion of red/black trees and tree rotations into the second balanced trees lecture, which would keep that lecture more unified on the theme of tree rotations and how to use red/black trees, rather than spreading those topics across multiple lectures. In the lecture on B-trees, I also tried a new format where I periodically paused and asked students to work through some sample questions on their own (which of these variants would be better and why?), which worked extremely well. That's something I tried to integrate into all future lectures and is something I think really improved student understanding.
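
To pin down what I mean by deriving the invariants from the operations, here's a minimal sketch in Python (the class names and the max_keys parameter are mine, and it handles insertion only): insert into the appropriate leaf, and whenever a node exceeds its key cap, split it around its median and push the median up. The usual "every non-root node is at least half full" invariant then falls out of how splits behave rather than being decreed up front.

    import bisect

    class BTreeNode:
        def __init__(self, leaf=True):
            self.keys = []
            self.children = []
            self.leaf = leaf

    class BTree:
        """Multiway search tree with at most max_keys keys per node."""
        def __init__(self, max_keys=3):
            self.max_keys = max_keys
            self.root = BTreeNode()

        def insert(self, key):
            self._insert(self.root, key)
            if len(self.root.keys) > self.max_keys:     # root overflowed: grow taller
                old_root, self.root = self.root, BTreeNode(leaf=False)
                self.root.children = [old_root]
                self._split_child(self.root, 0)

        def _insert(self, node, key):
            i = bisect.bisect_left(node.keys, key)
            if node.leaf:
                node.keys.insert(i, key)
            else:
                self._insert(node.children[i], key)
                if len(node.children[i].keys) > self.max_keys:
                    self._split_child(node, i)          # fix overflow on the way back up

        def _split_child(self, parent, i):
            child = parent.children[i]
            mid = len(child.keys) // 2
            right = BTreeNode(leaf=child.leaf)
            parent.keys.insert(i, child.keys[mid])      # median moves up a level
            right.keys = child.keys[mid + 1:]
            child.keys = child.keys[:mid]
            if not child.leaf:
                right.children = child.children[mid + 1:]
                child.children = child.children[:mid + 1]
            parent.children.insert(i + 1, right)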

I'm not super thrilled with the presentation of binomial heaps, which is strange because I made only minor touchups to those slides and previously considered them to be some of my best work! I think the issue is that it proceeds too much along the lines of "here's what binomial heaps are and how they work," skips some beautiful derivations (where do you get binomial trees from?), and glosses over the hard part of lazy binomial heaps (how do you compact trees together?). In teaching this lecture, I realized that there's a totally different algorithm for tree compaction that is much easier to understand than the standard one (do a true bucket sort by tossing each tree into a bucket based on its order, then repeatedly compact trees and put them into the next bucket), which I integrated into the Fibonacci heaps lecture and will need to back-patch into the original binomial heaps lecture. That revised Fibonacci heaps lecture seemed to go much better than before by highlighting the pain points from past quarters (what's the order of a tree when trees can be reshaped so easily? how do lazy binomial heaps work, again?), and I'm pretty happy with the result.
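
For my own notes, here's roughly what that bucket-based compaction looks like in Python (the merge helper and the .order field are placeholders for however the trees are actually represented): bucket the trees by order, then sweep the buckets, pairing off trees and carrying each merged result into the next bucket.

    from collections import defaultdict

    def compact_trees(roots, merge):
        """Coalesce heap-ordered trees until at most one root of each order remains.

        roots: iterable of tree roots, each with an .order attribute
        merge: combines two trees of order k into one tree of order k + 1"""
        buckets = defaultdict(list)
        max_order = 0
        for t in roots:                          # a true bucket sort on tree order
            buckets[t.order].append(t)
            max_order = max(max_order, t.order)
        result = []
        order = 0
        while order <= max_order:
            bucket = buckets[order]
            while len(bucket) >= 2:              # pair off trees; the result moves up a bucket
                a, b = bucket.pop(), bucket.pop()
                buckets[order + 1].append(merge(a, b))
                max_order = max(max_order, order + 1)
            result.extend(bucket)                # at most one tree of this order survives
            order += 1
        return result

Each merge reduces the number of trees by one, so the whole sweep costs time proportional to the number of starting trees plus the maximum order, and the usual amortized analysis goes through unchanged.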

Something I didn't expect was what would happen when I went back to revise my splay trees slides. I already thought those slides were in good shape, and I figured that my revisions would mostly be about contextualizing them by looking at what their properties (balance, entropy, working set, dynamic finger) actually mean in practice. Instead, the lecture turned out to be almost 60 minutes on a beyond-worst-case analysis of binary search trees and a sampler of data structures like level-linked finger BSTs and Iacono's working set structure, with only a quick overview of splay trees at the end. That lecture seemed to be really popular with students and there's easily enough content here for two lectures, so I'm planning on splitting this into a two-lecture sequence in the future, spending more time exploring the entropy property (can we actually prove the entropy lower bound?) and Iacono's working set structure in the first lecture and focusing more extensively on splay trees, their properties, the proofs, and the open problems in the second.

When it came time to teach frequency estimators, I realized that I hadn't actually stopped to think about what an (ε, δ)-approximation actually "looked like" in a probabilistic sense or what 2-independent, 3-independent, etc. hash functions "felt like." I redid the lecture to start off with a tour of why we use 2-independent hash functions when we want a nice distribution of elements, focusing on how each part of the definition (uniform distribution and pairwise independence) was necessary to pin down the idea that collisions should be infrequent. I also drew some diagrams of what tuning ε and δ would do to the probability mass of an (ε, δ)-estimator and why we don't need our estimators to be unbiased. I also detailed a multi-step process for building a good estimator, which in retrospect I'm really proud of. The good news is that this went over well with students and really motivated the major ideas from the count-min sketch. The bad news is that this totally overflowed a single lecture and led to me almost entirely scaling back the presentation of linear probing in the next lecture to fit count sketches. I need to think about what to do here, since not everyone has seen count-min sketches on entry to the course, though many have.
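
To make the payoff concrete, here's a minimal count-min sketch in Python; the width and depth follow the usual e/ε and ln(1/δ) recipe, and the seeded hash calls are just a stand-in for a genuine 2-independent family rather than a recommendation.

    import math, random

    class CountMinSketch:
        """Frequency estimator: each estimate overshoots the true count by
        at most eps * N (N = total of all updates) with probability >= 1 - delta."""
        def __init__(self, epsilon, delta):
            self.width = math.ceil(math.e / epsilon)      # columns per row
            self.depth = math.ceil(math.log(1 / delta))   # independent rows
            self.table = [[0] * self.width for _ in range(self.depth)]
            self.seeds = [random.randrange(1 << 61) for _ in range(self.depth)]

        def _slot(self, item, row):
            return hash((self.seeds[row], item)) % self.width

        def increment(self, item, amount=1):
            for row in range(self.depth):
                self.table[row][self._slot(item, row)] += amount

        def estimate(self, item):
            # every row overestimates, so the minimum is the tightest answer
            return min(self.table[row][self._slot(item, row)]
                       for row in range(self.depth))

A single row overshoots by more than εN with probability at most 1/e (that's just Markov's inequality), so with ln(1/δ) independent rows the chance that every row is bad drops to δ, which is exactly the (ε, δ) guarantee.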

As alluded to above, I also revised the slides on linear probing to focus more on the key transferable skill: increasing the degree of independence of a hash function gives tighter bounds on the probability that "too many" elements hash to a specific place in a hash table. This abstracts away the pain points of the previous presentation (all the nuances of overloaded buckets and Thorup's segment tree, the hard math of 4-independence, etc.) while reinforcing the major ideas from before (indicator variables and concentration inequalities). With a bit more time I think I could really make that idea shine, but, alas, this quarter I totally ran out of time to explore this in depth.

I also tuned up the lecture on cuckoo hashing in a few major ways. To motivate the analysis, rather than beginning with the assumption that the number of slots per table needs to be the number of elements times some constant greater than one, I ran some experiments on my computer to generate a plot of the success rate as a function of the table size, which (unsurprisingly) led to the standard phase transition plot showing a rapid shift from "you'll almost certainly fail" to "you'll almost certainly succeed." I did this as well with the number of displacements per insertion, generating a plot that strongly suggested that insertions ran in expected O(1) time. This went over really well with students - I got all sorts of questions about how the data were generated and it showed off how, in practice, these experiments help drive theory. This is something I'd absolutely love to retrofit into previous lectures if at all possible, since it shows that we aren't purely in Theoryland all the time when designing data structures. Another major change in this lecture was the introduction of a different proof that shows the probability of a failed insert is low, which came courtesy of a CS166 project team from last quarter. As expected, these changes ate into the lecture time and we didn't manage to cover everything I was hoping to touch on. I think that can be fixed by replacing the super generic version of the subcritical Galton-Watson process math with a more focused one (we just need it for Binom(n, 1/m) variables) that's heavier on the intuition (we're building a tree with an exponentially-decaying branching factor) than on the math, and possibly by offloading some of the discussion of second-choice hashing and hashing with relocation to the previous lecture on linear probing.
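
The experiment itself is easy to replicate. Here's a rough stand-in for the kind of script I mean (not the one I actually used; it's simplified to one hash function per table and a fixed displacement cap): build a cuckoo table on n random keys for a range of table sizes m and record how often the build succeeds.

    import random

    def cuckoo_build(keys, m, max_displacements=500):
        """Try to place every key into one of two tables of m slots each.
        Returns (succeeded, total number of displacements)."""
        seeds = (random.random(), random.random())
        tables = [[None] * m, [None] * m]
        hashes = [lambda x, s=s: hash((s, x)) % m for s in seeds]
        displacements = 0
        for key in keys:
            side = 0
            for _ in range(max_displacements):
                slot = hashes[side](key)
                key, tables[side][slot] = tables[side][slot], key   # evict the occupant
                if key is None:
                    break                      # landed in an empty slot
                side = 1 - side
                displacements += 1
            else:
                return False, displacements    # chain too long; a rebuild would be needed
        return True, displacements

    # Sweep table sizes to watch the phase transition in the success rate.
    n = 1000
    for m in range(n // 2, 2 * n + 1, 100):
        successes = sum(cuckoo_build(random.sample(range(10**9), n), m)[0]
                        for _ in range(50))
        print(m, successes / 50)

Sweeping m from n/2 up to 2n makes the phase transition plain: the success rate jumps from near zero to near one right around the point where each table has about as many slots as there are keys, which is the load-factor-1/2 threshold from the analysis.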

I added in a new lecture on approximate membership queries this quarter, replacing the vestigial lecture on dynamic connectivity from last time. I set out to cover Bloom filters, quotient filters, and cuckoo filters, though in development I realized that there was no way we were going to get cuckoo filters to fit, and when the lecture actually ran we barely managed to touch quotient filters. This lecture was the first I'd given that touched on information-theoretic lower bounds and on the idea that we might want to count individual bits, and it was a ton of fun to put together! I also think that the key idea I was exploring to bridge Bloom filters and quotient filters - the idea of fingerprinting elements using hashing ("universe reduction," as I later learned it's called) and then storing the fingerprints efficiently - is a really beautiful one that I'd like to spend more time on, so I think I might split this into two lectures (one purely on Bloom filters and Bloom filter variants and a second on fingerprinting, quotient filters, and, yes, cuckoo filters).
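
For reference, here's about the smallest useful Bloom filter sketch in Python, using the standard formulas for the number of bits and hash functions; as before, the seeded hash calls are a stand-in for a proper hash family.

    import math, random

    class BloomFilter:
        """Approximate membership: no false negatives, and a false positive
        rate of roughly fp_rate when holding expected_items elements."""
        def __init__(self, expected_items, fp_rate):
            self.m = math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2)
            self.k = max(1, round(self.m / expected_items * math.log(2)))
            self.bits = [False] * self.m
            self.seeds = [random.randrange(1 << 61) for _ in range(self.k)]

        def add(self, item):
            for s in self.seeds:
                self.bits[hash((s, item)) % self.m] = True

        def __contains__(self, item):
            # a single unset bit rules membership out; all-set bits might be a fluke
            return all(self.bits[hash((s, item)) % self.m] for s in self.seeds)

The fingerprinting idea goes the other direction: instead of setting k scattered bits per element, hash each element down to a small fingerprint and store the fingerprints compactly, which is the hook into quotient filters and cuckoo filters.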

We concluded the lecture series with the treatment of integer structures from last time. Students were surprisingly engaged and actively questioned many of the premises of what we were studying (why can you assume integer operations are free/fast? why do you count multiplication when it's so slow? why are we looking at this if these aren't used in practice?). I'm glad that this wasn't my first go-around with these structures, because I was much more prepared to answer questions now than last time! I really liked the two-lecture series beginning with sardine trees and concluding with fusion trees, and I'd be sad to see them go, especially after all the work that went into them, but given their reception and the fact that there are higher-value topics to hit (splay trees, approximate membership queries, count sketches), I would give good odds that they aren't going to survive to see the next iteration of the course.

On the logistics side, I tuned up the problem sets this quarter. PS1 included a "build the fastest RMQ you can" contest that really got students excited, and I rewrote the starter files in C++ for consistency with the rest of the quarter. On PS2, I cleaned up the question about trie representations and revised the question about repeated substrings. To give students more practice with augmented trees, on PS3 I replaced the older question about red/black versus AVL trees with a new coding question about dynamic point-in-range searches, which I think went pretty well (though it was admittedly a tough question!). On PS4, I removed the discussion of weight-balanced trees, which, while interesting, was not particularly geared toward our standard amortization techniques, and added in some more practice with basic amortization. PS5 was for the most part unchanged, with only slight tweaks to the cardinality estimation question. These problem sets seemed to work out pretty well, though as usual I've got tons of notes on them for next time. I was a bit sad that I didn't have the bandwidth to make more "make the fastest X possible" contests; that's something I would have liked to give students more exposure to.

The biggest policy change this quarter was how we did the final projects. With only three (excellent!) TAs on staff, we required students to work in teams of three. To help keep people more on-track for finishing, we set up a final project checkpoint due at the end of Week 7 that asked students to summarize their preliminary research, answer some questions specific to their topic, and propose an "interesting" component. I'm writing this before we're 100% done reading project papers, but my sense is that both the project papers and the project presentations were much better this quarter than in the past, and I think it's largely due to these changes. The project format still needs some minor tuning (for example, how do we get students to interrogate their topics in more detail?), but this was, I think, a huge improvement from before.

The last major change was the midterm, which I converted to a 48-hour take-home exam instead of a 3-hour sit-down exam. I think this was a good call - people did much better this time around and it seemed like the exam fulfilled its duty as a sanity check on individual student progress.

As usual, I've learned a ton this quarter and am blown away by the final projects and the level of student creativity demonstrated here. This class was a blast to teach and I'm excited to see how it turns out when I run it next year in 2020. My own intuitions for these topics are significantly stronger than before, and I think that's leading to much improved lectures, problem sets, etc. Let's see what things look like when I can take the full accumulated wisdom from this quarter and pump it back into the course one more time. :-)

2017-2018 Academic Year
Spring 2018

This iteration of CS166 was a blast to teach. After many, many hours of reading papers, reverse-engineering intuitions, and creating monster slide decks, I was able to replace coverage of the DC3 suffix array construction algorithm with the newer, faster, but more complicated SA-IS algorithm, which has great theoretical and practical performance. I also introduced fusion trees to the course, replacing an older lecture on disjoint-set forests, and I'm quite proud of how those lectures turned out. I learned a ton in the course of putting those lectures together and hope that those slides hold up well!

I also made a number of major touch-ups to some of the existing lectures, including a mostly redesigned lecture on x-fast and y-fast tries, better visualizations for the Fischer-Heun RMQ structure, and a mathematically streamlined presentation of the analysis of linear probing.

I made a number of revisions to the problem sets and coding assignments from last quarter. I removed the questions from Spring 2016 that asked students to derive KMP as a special case of Aho-Corasick and Fenwick trees as an optimized version of augmented trees, since those questions had the unfortunate side-effect of being answerable with a quick Google search without much conceptual understanding. In their place, I added in a new coding assignment involving SA-IS and suffix array searching and some questions about different types of balanced trees. I'm thinking that in the next offering of CS166 I'll keep the SA-IS assignment (and, in fact, probably soup it up!), but will likely cut the balanced trees question in favor of more practice with tree augmentation (for example, giving students a pluggable augmented tree and asking them to implement various data structures or algorithms with it). On PS4, I added in a new question about a linear-time algorithm for building weight-balanced trees and slightly adjusted the coding question to address it, which I may dial back in future quarters to focus on other topics. One change I'm quite happy with is the introduction of a new question about cardinality estimation on PS5, which with a little bit of tuning I think could become a mainstay!

I have a ton of notes to myself about what to work on for future quarters. I think there's space for a survey lecture on approximate membership query data structures (Bloom filters, quotient filters, and cuckoo filters, possibly with Bloomier filters) given how popular these topics were in the final projects and how nicely they dovetail with the existing randomized structures we've covered. I'd like to introduce coding questions that have a performance optimization component so that students can get a feel for what techniques are out there for speeding data structures up in practice - as well as the gap between theory and practice. My lecture on Fibonacci heaps could use a few fixups with regard to the intuition behind the extract-min procedure, and I think I should move away from the traditional presentation of splay trees (which uses a mixture of the amortized and startup costs) toward a more streamlined model that folds those costs into the construction cost or exposes them separately. The problem sets could probably use a little bit of rebalancing, as well, and I'd like to offer a little more guidance on experimental design for the final projects.

As usual, it was a ton of fun seeing all the final projects from this quarter. One project in particular looks like it may be publishable (a clever analysis of k-d trees), and I was blown away by another project on rotation distances in binary search trees through an isometry between polygon triangulations and BSTs. I also learned several new structures, like majority-inverter graphs, cell lists, hollow heaps, learned index structures, Bloomier filters, area minimum queries, Hopfield networks, relaxed radix-based trees, persistent B-trees, and distributed hash tables. Some of these seem like prime candidates for integration into future offerings of CS166, and I'm looking forward to playing around with them!

A huge thanks to this quarter's CS166 students and staff for making this class so much fun to teach!

2015-2016 Academic Year
Spring 2016

This second iteration of CS166 was a lot of fun! I'd learned a lot about data structures since the last time I taught this course, so there were a few new topics (most notably, the analysis of 5-independence in linear probing), and the problem sets were more interesting and more polished than in the first iteration. The final projects this time around continued to shine. Many are publishable, and some of the explanations we saw were so good that I'm encouraging students to post their findings online somewhere.

I moved a few of the topics around this time, putting string data structures closer to the beginning, where they tie in nicely with RMQ structures, and deferring balanced trees until later. I rebuilt my lecture on Aho-Corasick from scratch (it turned out a lot better than before) and combined the lectures on dynamic graphs and Euler tour trees into a single survey lecture on dynamic graph algorithms. I also cleaned up the presentation of disjoint-set forests, the split and join operations on trees, and splay trees (and I'm particularly proud of that last one). Next time around, I'll see if I can work in some other topics from the final presentations.

2013-2014 Academic Year
Spring 2014

This was my first time teaching CS166 and it was a wonderful experience. The students, the staff, and I all learned a lot and the final project presentations were just wonderful. I'm looking forward to teaching this class again in the future!