This page contains archived versions of the Stanford CS166 (Data Structures) webpage in the quarters I've taught it. All internal links should be valid, though external links may no longer be functional.
This quarter's offering of CS166 represents the biggest quarter-on-quarter delta I've made in years. I moved a bunch of topics around, dropped a few lectures on topics I'm sad to see go, and introduced several new topics that I'm glad I experimented with.
Structurally, the biggest difference between this CS166 offering and ones from the past is the content ordering. To better support students as they worked on their research projects, I decided to move the "fundamental" topics earlier in the quarter, pushing "applications" topics until later. This meant deferring suffix trees and arrays to the end of the quarter to move amortization and randomization earlier. In retrospect, I think it's good that amortization appeared earlier, since that's always a tough topic for folks to wrap their heads around, but it was a bummer to see strings moved so late, especially since they're a nice ramp-up topic for hitting things like isometries, mechanical/operational descriptions, etc. The other main reason I moved it back was that it's a three-lecture series (suffix trees, then suffix arrays, then SA-IS), and moving it earlier would cause the other two-lecture series to be perpetually offset. But maybe the right call here would be to just make the material on SA-IS available as a video or set of lecture slides, instead focusing on suffix arrays and suffix trees as-is.
I decided to cut the unit on word-RAM (integer) structures this quarter, and I also have mixed feelings about this. That topic was always somewhat polarizing - people either loved it (wow! there's a degree of parallelism we can harness that we normally don't!) or hated it (this is completely impractical! no one would ever do this!) - but it's always something that's been near and dear to me, especially because it's a venue to put together so many other course topics in one place. I might resurrect that topic in the future - or, again, I might just make older lecture videos available to students who are curious.
Filling in the gap left by integer structures were two new topics. The first was on 2D computational geometry: layered range trees and persistent data structures for planar point location. My sense is that the first of these was something I found really interesting but which didn't resonate all that much with students, while the second was well-received. I'm considering dropping layered range trees in the future and instead doing Kirkpatrick's structure. That would let us focus more explicitly on properties of plane graphs and Euler's formula (IIRC we don't cover this anywhere else in the curriculum) and would free up some time in the lecture on persistent trees.
The other new unit was on approximate membership queries. I'd taught Bloom filters before and brought back a slimmed-down version of that lecture, introducing the basic Bloom filter and then doing the information-theoretic lower bound as a way of measuring "how much better can we do?" The second lecture was on the cuckoo filter and XOR filter. I have to say - that was a really, really fun lecture to put together, and I learned a ton in the process. I initially was planning on covering Bloomier filters here based on how well that topic had worked as a research project, but in the course of doing my reading I saw a slide in David Eppstein's CS261 course that mentioned the XOR filter as an applied Bloomier filter. That in turn led me down the rabbit hole of hypergraph orientation and hypergraph peeling, which in turn got me reading a paper about spatial coupling in data structures, and led to one of the better lectures I think I've constructed in a long while. I'm so glad I came across this - it's great to modernize the topic coverage and expose students to the ideas.
On that note, I also did a major overhaul of my lecture on cuckoo hashing. One of David Eppstein's slides mentioned that Erdos-Renyi graphs basically gave all the results needed to justify why cuckoo hashing worked - something I'd never seen before. That led me to a deep dive on random graphs and to actually read through the papers about blocked cuckoo hashing and d-ary cuckoo hashing that I'd known about but never checked out in depth. The result was a significantly revised cuckoo hashing lecture. Ironically, after putting in all the work to update that lecture, I actually think the version I gave wasn't as good as it should have been. After doing a bit more reading on XOR filters and their properties, I realized that I should have directly introduced the hypergraph orientation problem in that lecture and talked about how it generalizes Erdos-Renyi, using that as a way of explaining how the cuckoo graph for regular cuckoo hashing generalizes to these other models. I think there's a much cleaner lecture in here just waiting to get out, and in my next iteration of CS166 I'd love to see if I could make random hypergraphs an explicit theme.
In addition to these (major) changes, I made a number of smaller updates and edits throughout the quarter that are worth noting. I added in a section on HyperLogLog, eschewing the (extremely difficult!) analysis in favor of a simpler, more intuitive presentation backed by some empirical analysis. I think I could have done a slightly better job presenting the final HyperLogLog structure, but overall this seemed well-received. After teaching this, I realized that there's a connection between the analysis of the coin-flipping game that powers HyperLogLog and the maximum height of a randomized skiplist - in fact, it's the same math! So I'm now wondering whether it would be a good idea to put skiplists into the main curriculum, show off the math behind it, and then say "and hey, look at this - we already know how to solve this!"
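That coin-flipping connection is easy to sanity-check empirically: the statistic HyperLogLog tracks and the height of a randomized skiplist are both the maximum of n Geometric(1/2) variables, which concentrates around log2(n). Here's a minimal simulation sketch (function names are mine, not from the lecture materials):

```python
import math
import random

def max_coin_run(n, rng):
    """Max number of heads flipped before the first tail, over n players.
    This is (per bucket) the statistic HyperLogLog tracks."""
    best = 0
    for _ in range(n):
        run = 0
        while rng.random() < 0.5:  # heads: keep flipping
            run += 1
        best = max(best, run)
    return best

def skiplist_height(n, rng):
    """Height of a randomized skiplist on n nodes: each node's tower grows
    while a fair coin comes up heads - the exact same random process."""
    return max_coin_run(n, rng)

# Both quantities hover near log2(n):
n = 1 << 16
print(max_coin_run(n, random.Random(166)), "vs. log2(n) =", math.log2(n))
```

Running this a few times with different seeds shows the max run sitting within a few flips of 16 for n = 2^16, which is exactly the "same math" observation above.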
I also did an overhaul on my lecture on amortization, completely removing the banker's method and focusing purely on the potential method as applied to three motivating examples (two-stack queues, dynamic arrays, and building B-trees). That worked better than before, but it's clear that this lecture on its own isn't sufficient to teach the topic. I'm going to need to build out some new materials on amortization, possibly something like a "Guide to Amortization" that walks students through how to do an amortized analysis and how to think with amortization.
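As a concrete instance of the potential-method analysis from that lecture, here's a sketch of the two-stack queue example with the potential function annotated in comments (the class and naming are mine, not taken from the course materials):

```python
class TwoStackQueue:
    """Queue built from two stacks. Take Phi = len(self._in).
    Amortized cost = actual cost + change in Phi:
      - enqueue: 1 push, Phi rises by 1  -> amortized O(1)
      - dequeue: moving k items drops Phi by k, paying for those k pops,
        so dequeue is amortized O(1) even though one call may be slow."""
    def __init__(self):
        self._in = []    # newly enqueued items
        self._out = []   # items staged for dequeue, in reversed order

    def enqueue(self, x):
        self._in.append(x)

    def dequeue(self):
        if not self._out:
            # The expensive step: its cost is covered by the drop in Phi.
            while self._in:
                self._out.append(self._in.pop())
        if not self._out:
            raise IndexError("dequeue from empty queue")
        return self._out.pop()
```

A worked example like this, with the potential tracked step by step, might be a natural centerpiece for the "Guide to Amortization" idea.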
I still think I have a ways to go when it comes to improving the lectures. My lecture on splay trees, for example, is too crowded and doesn't do justice to them. The one before it - about "better-than-balanced-trees" - is still really good, though. My sequence on binomial heaps leading into Fibonacci heaps also could use a little work. The Fibonacci heap lecture focuses a bit too much on deriving the Fibonacci heap and left some students confused about exactly what the final structure looked like. My lecture on Bloom filters definitely could use some cleanup in the presentation - it would probably be better to start with a count-min sketch and go from there to the Bloom filter. And the second lecture on balanced trees - where we derive all the red/black tree rules - is too crowded with content. I'm going to need to figure out how to address this.
Complicating matters, there are now a bunch of other topics I want to integrate into CS166. We had teams presenting on scapegoat trees and quake heaps, and after seeing their work I would love to work those into the quarter. Scapegoat trees would make a great testbed for the ideas in amortization, and quake heaps are simpler than Fibonacci heaps to explain. Interestingly, I think I found a way to simplify quake heaps a bit, removing the global per-level counters and using a scapegoat tree style top-down search to find bad layers. I'd like to investigate this more and see whether that could address the Fibonacci heap lecture concern. I'm also wondering whether we could cut B-trees and red/black trees - and isometries more generally - to instead focus on scapegoat trees. I'd be super sad to see those lectures go - they're among my favorites - but we already spent a lot of time on BSTs and it might be better to show off more amortized-efficient structures. And I'm also thinking that it might be good to do a lecture sequence on Wilber's lower bound on BSTs, then tango trees, and finally the geometry of BSTs, but I have no idea how to fit that in.
Or maybe I should just say that I need to offer a "CS266" course that's a sequel to CS166. That would let me slow down and present a core set of topics in CS166, moving all the "advanced" ones to CS266. But that's a topic for another discussion.
The last radical idea on my mind is whether we should keep the project component to the course. It's been there since the beginning, and it's a great way to help students get more comfortable reading things on their own. The drawback is that it means that we don't have problem set coverage of the topics from the tail end of the quarter, meaning that students are never asked to grapple with them the same way they have to grapple with the topics from earlier. Maybe that's a fair tradeoff, but, then again, maybe it's not. It's certainly more rewarding to have project teams present their work than it is to just grade problem sets. I'll have to give this one some thought between now and next year to see what works best.
Overall, for teaching during a remote quarter when students, the TAs, and I are all somewhat burned out from Zoom U, I think this was a passable offering of the course. I'm excited to see that there's still so much room for this class to grow, and I've got my work cut out for me - given 15 to 20 weeks of content, which topics are so important that they're worth showing off, which ones aren't, how do you present the ones you want to present, and how should you structure the course around it?
This was a tumultuous quarter. We began at the start of the COVID-19 pandemic and ended with nationwide protests following the death of George Floyd. While I usually use these sections to document what changes I made to the course in order to better learn from my decisions later on, I figure it might be good to instead focus on what decisions I made with regards to running a class at what's definitely the most stressful time for the US in at least a decade.
When this quarter started up, most of the US was under shelter-in-place orders. It was unclear just how serious the pandemic would be. We didn't know whether we should expect a large fraction of students to contract COVID-19, nor did we know whether it was likely that the course staff might be out for several weeks were we to fall sick. We didn't know how difficult it was going to be for students to take classes remotely. We didn't know how office hours or remote assessments would work.
This quarter was also offered only on a pass/fail (S/NC) grading basis. I supported the university's decision to move in that direction given the degree of uncertainty around how COVID-19 would play out, and in retrospect I do think it was the right call. That gave us more flexibility to experiment with course design.
On entry to this quarter, I remembered how challenging the last two weeks of CS106B were last quarter and how having so much of the grade determined by a single end-of-quarter exam complicated things. As a result, one of the biggest changes to this quarter's offering of CS166 was splitting the problem sets into two sections - regular problem sets that could be completed in pairs, plus an "individual assessment" that functioned more or less as a weeklong take-home exam. We required that students earn passing grades on all the assignments and all but one of the individual assessments, offering students "resubmits" that they could use to redo an assignment or assessment in the event that it didn't go well. (I think the idea to allow for resubmits came from Cynthia Lee.) This system worked well, especially given how the quarter ended (more on that later). I'm considering, going forward, adopting this model or something like it for future iterations of CS166, since it seemed to work much better than holding a traditional cumulative assessment at the tail end of the quarter in a way that competed with the final project.
I also had to redesign the final project, as asking for a presentation seemed like a tall order given that everything was being conducted over Zoom. Rather than requiring a paper and a presentation, I instead asked students to produce an "explanatory article" that walked the reader on a journey from ignorance to understanding. We then set up two project checkpoints - one based on the version from last quarter, and one to check in on a draft of the final writeup and to offer guidance on the "interesting" component. The TAs and I divided up the project teams and met (virtually) with the teams over Zoom to offer feedback at the two checkpoints, and I think this was definitely a step in the right direction. There's still more work to be done on my end to figure out how to keep the TAs and me consistent in our expectations and level of feedback, and I'll need to sort that out for next time. We also increased the group size from three to four to introduce more redundancy in case students were sick or had to attend to emergencies, which turned out to be a good decision. (More on that later.)
Considering that this quarter was offered pass/fail, that everything was done remotely, and that basically everyone was feeling the effects of shelter-in-place, I was very happy with how the projects turned out. Some of them absolutely dazzled me, and I now have a way better understanding of a bunch of topics I knew comparably little about beforehand. Thanks to all my students for the solid effort!
Probably the most difficult part of the quarter was handling the last two weeks when the George Floyd protests swept the US and put policing and racism under the spotlight. The range of emotions that came out then - despair, anger, resentment, hopelessness, etc. - made it hard for lots of people, including me, to keep focused on academics. I ended up relaxing the course requirements to make one of the problem sets optional and pushed the deadline on the final project as late as I could while still giving time to read and review papers. I wish I could get a better sense of how effective these two steps were at reducing student stress, and once the course evals come out I can hopefully learn more about that. However, I will say that the course design of spreading evaluations across the entire quarter and having multiple project checkpoints to gauge progress really helped here - the staff and I knew where our students were and we felt very comfortable making these adjustments. Going forward with remote teaching, I think I should keep this lesson in mind and, when possible, design for maximum flexibility.
In terms of delivery of material - I opted for a labor-intensive approach of making recorded versions of all of the lectures on my own, then delivering a live version of the lectures where students could ask questions knowing that they weren't being recorded. I have to say, it did take a while to record all the lectures, but I felt that this really improved the quality of the slides and the presentation! Oftentimes when recording something I'd find myself "fighting" the presentation and wanting to talk about things in a new way, and when that happened I could always stop recording, fix things, and rerecord to get a better version. This led to many improvements in the lecture slides, almost as if I'd run two iterations of the class with two chances to take notes and tweak things.
One thing I'm disappointed about is that I didn't get clearance to post the recorded videos online for the public. I was hoping that the recorded versions of the lectures, since they were literally just me talking over the slides and didn't involve students, could go up on YouTube or something like that so that they'd be available to anyone anywhere during shelter-in-place. I unfortunately never had any of my emails answered, so the videos ended up restricted to just current Stanford students. I'm going to try to follow up on this and see if there's any way to get the videos online somewhere, since it would be a shame if all these recordings existed but weren't available more broadly.
On the content side of things - this quarter's offering of CS166 more or less mirrored the Spring 2019 version, with most of the changes being minor touch-ups or improvements rather than radical reenvisionings. Here's a quick rundown of my recollections of what changed and what areas still need improvement.
The unit on RMQ was more or less the same as the previous versions, with a few minor edits here and there. I added some more visuals into the first lecture to show off how the hybrids were structured, how they worked, and why the runtimes were intuitive for the different choices. In the second lecture, I kept the same general presentation, but updated the slides to include slicker and more intuitive animations and a better explanation for things like how to build Cartesian trees quickly and how the stack-based algorithm worked. I also did some general touch-ups on the overall narrative, such as deferring the discussion of the overall runtime until all the pieces were there.
The unit on string data structures went more or less the same as last time. The main change was in the first lecture on string data structures, where I spent more time motivating how tries work, why searches in Patricia tries proceed as they do, etc. That seemed to work well, and given how many teams wanted to present on string data structures for their final projects (FM-indices, suffix tree matrix multiplication, ropes, and backwards DAWG matching), I think these lectures are in pretty good shape! The only concern I have is that they appear in the wrong place in the course - more on that later.
I did make some larger changes to the unit on balanced trees and isometries. I decided to cut the discussion of splitting and joining red/black trees, which I'm sad to see go (it's so cool that you can do this!) but which let me do a much better job explaining how B-trees and red/black trees work. This freed up time in the first lecture on balanced trees to talk more about mechanical versus operational descriptions of B-trees and gave me time in the second lecture to work through how to derive rules for red/black tree insertion in more depth than before. I also cleaned up the presentation of augmented trees to use a more visual/intuitive explanation. Overall, I think that by going into more depth on less surface area, these lectures turned out a lot better.
One of the biggest changes was in the presentation of amortized data structures. I did a pretty serious overhaul of the first lecture on amortization to motivate what amortization "feels" like, using an extended analogy of amortized efficiency as a dishwasher. This allowed me to carry through a narrative thread about "cleaning up messes" that went through the unit on binomial heaps and Fibonacci heaps, and I really liked how that turned out. However, I think I should have done a much deeper dive into how exactly one performs amortized analyses using the banker's and potential methods. Most students figured out how to do this, but a good deal really struggled on the problem set / individual assessment. Going forward, I'm considering completely removing the banker's method and focusing purely on the potential method, doing more examples in the initial lecture on amortization showing off what these analyses look like and how to properly choose a good potential.
Moving on to binomial and Fibonacci heaps - I did some major surgery on the binomial heaps lecture, pulling in the perspective of coalesce I devised last time for the Fibonacci heaps lecture and dropping the analysis of a pure-insert sequence into binomial heaps. This worked only okay, I think, and I was really pressed for time. I'm going to need to figure out how to decompress this lecture for future iterations of the class. The lecture on Fibonacci heaps went well, except that I didn't have time to go into the pointer-juggling details necessary to represent Fibonacci heaps efficiently. One thing I was very happy about, though, was a pair of students who answered a challenge I'd given them (find a good intuitive explanation for where Fibonacci numbers come up in Fibonacci heaps) and presented a clean, simple argument that's so good that I would be remiss if I didn't use it next time! I'm excited to see how that turns out.
After amortization, we moved on to randomization. I started off with count-min sketches and hash independence, as before, and that lecture only had minor touchups. In the next lecture, I presented count sketches, and I was actually really happy with the new version I put together. I followed the same general outline as last time, but did a better job drawing parallels between the analysis of count sketches and the analysis of count-min sketches. I also added a section explaining how to "derive" the count sketch's key difference from count-min sketches by analyzing the weaknesses of the count-min sketch, which turned out well. The problem, though, was that I then tried to cover all of Bloom filters in the remaining time in that lecture, which turned out about as well as you might have expected that it would have. Oops. Next time, I'll see if I can break that topic out into its own lecture. There was an awesome project on Bloomier filters from this quarter that might be worth bringing into this unit as well - it would be interesting to compare how Bloom and Bloomier filters work! I ended up cutting linear probing this quarter, mostly because the analysis of linear probing had already been reduced, more or less, to introducing the fourth moment bound. Plus, if I keep teaching linear probing in CS106B, which I think I'd like to keep doing, there's a bit less urgency there. We wrapped up with cuckoo hashing, which I presented more or less the same as last time with a bit of cleanup (using circles rather than arrowheads in the cuckoo graph, removing the general presentation of Galton-Watson processes in favor of a more direct analysis of sums of binomial variables, etc.).
We then moved on to a new two-lecture section that was based on my (crowded!) splay trees lecture from last year. I gave a full lecture on "better than balanced BSTs" that introduced the balance, entropy, working-set, and dynamic-finger properties and worked some examples of data structures meeting these properties. That one went really, really well, and I think it's one of the best lectures I've put together to date! I pulled the discussion of weight-equalized trees in from PS2 and worked the proof from last year's midterm that they have the entropy property, which I thought was a lot of fun and gave a better sense of what entropy measures. The only concern I have with this lecture is that it ran a bit long, but I think I can fix that. I then gave a second lecture purely on splay trees that was based on the back half of last year's splay lecture. This version motivated where splay trees come from by showing how rotate-to-root fails. I decided to go a bit deeper into the analysis of splay trees this time, working out the choices of weights per node that would give the balance, entropy, and working-set properties and giving a partial analysis of dynamic-finger that built up to the static finger theorem. The only problem I had here was that I was running low on time. Otherwise, this worked pretty well!
We concluded with a three-lecture series on integer data structures. The presentation of x-fast and y-fast tries was more or less the same as last time, but with a revised presentation of how the binary search in x-fast tries worked and a more careful space analysis. The presentation of sardine and fusion trees worked more or less the same as before, with a few edits to improve clarity. Overall, I thought I did a good job presenting these topics, but I wasn't feeling as excited about them as I was the past few times. Maybe that's because this coincided with the George Floyd protests and I kept feeling that there were more important things to be focusing on. But I think it's probably, more realistically, that this unit gives new ways to implement BST replacements, and we already have a bunch of lectures on those topics (red/black trees, B-trees, better-than-balanced trees, and splay trees) and I think it might be good to instead cover something fundamentally different.
Looking to next time, I think I'm going to make a few edits to the syllabus. I'd like to move strings later in the quarter as more of a place where they can serve as a testbed for other concepts we'd covered. That would push amortization earlier, which I think would be a good thing because that way students can get more practice with it. This would also enable me to split the class into two clear units: "preliminaries," which cover generalizable techniques, and "applications," where we see data structures that arise in more particular contexts, and perhaps (?) to make it so that the first half of the class has the problem sets and individual assessments and the back half is just projects. Additionally, I'm thinking about replacing the unit on integer data structures with a unit on geometric data structures, which would explore a problem space that's totally unlike what we see elsewhere. There are, of course, a bunch of other touch-ups to do (revising problem sets, polishing lectures, etc.), but overall I think I'm starting to converge on something that feels well-motivated and logistically well-organized.
This was a difficult quarter for all of us. On an S/NC grading basis, I'd give myself a solid S for "passing work in the face of national crisis." I learned a bunch of lessons, both in terms of pedagogy and teaching strategy, in terms of technical content, and about myself. Here's hoping that future quarters run more smoothly and that I emerge from this a better educator than I entered it.
As with most classes I taught in the 2018-2019 academic year, this version of CS166 represented a pretty significant overhaul to the course materials, and I think that it's been a huge gradient step in the right direction.
The major change I made to the lectures on this iteration was to explore, much more so than usual, where these data structures come from and how to derive them from first principles. The lecture on suffix trees, for example, explored Patricia tries in the abstract and how to arrive at them by optimizing regular tries for fast searching, and talked about the distinction between operational and mechanical descriptions of structures (mechanically, a suffix tree is what you get if you make a trie for all the endmarked suffixes and eliminate non-branching internal nodes; operationally, it's a tree with one internal node per branching word and one leaf per suffix). The lecture on suffix arrays spent more time talking about why it is that suffix arrays don't lose much relative to suffix trees, and actually explored how to construct LCP arrays using Kasai's algorithm (which, as I found, is not very hard to understand!) In doing this, I trimmed out some material on generalized suffix trees and suffix arrays and on longest common extensions, which I was sad to see go. My notes to myself indicate that there's about ten minutes of free time left in those lectures, so perhaps I can put a shout-out in there. Overall, though, redesigning these slides gave me a much deeper appreciation for how these data structures work, and I'm glad I did this! To get this to fit, I split the lecture on suffix trees and suffix arrays apart into two and cut the lecture on Aho-Corasick automata, which I was initially sad to see go but honestly am not losing any sleep over at this point. I think the topic refocusing worked great!
I also did a major revision of the lecture on B-trees to show where B-trees come from (how would you build a balanced multiway tree when there's a cap on the number of keys per node?) and how to derive the standard invariants from the basic data structure operations, rather than the other way around. In doing so, I had to slightly scale back the discussion of where red/black trees come from, which is really a shame. In the future, I might cut the discussion of split and join for red/black trees (those are tough topics that rarely make an appearance and can be better done with splay trees) and then put the discussion of red/black trees and tree rotations into the second balanced trees lecture, which would keep that lecture more unified on the theme of tree rotations and how to use red/black trees, rather than spreading those topics across multiple lectures. In the lecture on B-trees, I also tried a new format where I periodically paused and asked students to work through some sample questions on their own (which of these variants would be better and why?), which worked extremely well. That's something I tried to integrate into all future lectures and is something I think really improved student understanding.
I'm not super thrilled with the presentation of binomial heaps, which is strange because I made only minor touchups to those slides and previously considered them to be some of my best work! I think the issue is that it proceeds too much along the lines of "here's what binomial heaps are and how they work," skips some beautiful derivations (where do you get binomial trees from?), and glosses over the hard part of lazy binomial heaps (how do you compact trees together?). In teaching this lecture, I realized that there's a totally different algorithm for tree compaction that is much easier to understand than the standard one (do a true bucket sort by tossing each tree into a bucket based on its order, then repeatedly compact trees and put them into the next bucket), which I integrated into the Fibonacci heaps lecture and will need to back-patch into the original binomial heaps lecture. That revised Fibonacci heaps lecture seemed to go much better than before by highlighting the pain points from past quarters (what's the order of a tree when trees can be reshaped so easily? how do lazy binomial heaps work, again?), and I'm pretty happy with the result.
Something I didn't expect was what would happen when I went back to revise my splay trees slides. I already thought those slides were in good shape, and I figured that my revisions would mostly be about contextualizing them by looking at what their properties (balance, entropy, working set, dynamic finger) actually mean in practice. Instead, the lecture turned out to be almost 60 minutes on a beyond-worst-case analysis of binary search trees and a sampler of data structures like level-linked finger BSTs and Iacono's working set structure, with only a quick overview of splay trees at the end. That lecture seemed to be really popular with students and there's easily enough content here for two lectures, so I'm planning on splitting this into a two-lecture sequence in the future, spending more time exploring the entropy property (can we actually prove the entropy lower bound?) and Iacono's working set structure in the first lecture and focusing more extensively on splay trees, their properties, the proofs, and the open problems in the second.
When it came time to teach frequency estimators, I realized that I hadn't actually stopped to think about what an (ε, δ)-approximation actually "looked like" in a probabilistic sense or what 2-independent, 3-independent, etc. hash functions "felt like." I redid the lecture to start off with a tour of why we use 2-independent hash functions when we want a nice distribution of elements, focusing on how each part of the definition (uniform distribution and pairwise independence) was necessary to pin down the idea that collisions should be infrequent. I also drew some diagrams of what tuning ε and δ would do to the probability mass of an (ε, δ)-estimator and why we don't need our estimators to be unbiased. I also detailed a multi-step process for building a good estimator, which in retrospect I'm really proud of. The good news is that this went over well with students and really motivated the major ideas from the count-min sketch. The bad news is that this totally overflowed a single lecture and led to me almost entirely scaling back the presentation of linear probing in the next lecture to fit count sketches. I need to think about what to do here, since not everyone has seen count-min sketches on entry to the course, though many have.
As alluded to above, I also revised the slides on linear probing to focus more on the key transferable skill: increasing the degree of independence of a hash function gives tighter bounds on the probability that "too many" elements hash to a specific place in a hash table. This abstracts away the pain points of the previous presentation (all the nuances of overloaded buckets and Thorup's segment tree, the hard math of 4-independence, etc.) while reinforcing the major ideas from before (indicator variables and concentration inequalities). With a bit more time I think I could really make that idea shine, but, alas, this quarter I totally ran out of time to explore this in depth.
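To make that transferable skill concrete, here's a small experiment of my own (not from the lecture) comparing the empirical tail of one slot's load against the Chebyshev bound that pairwise independence buys you via indicator variables:

```python
import random

def slot_load(n, m, slot=0, p=2_147_483_647):
    """Hash keys 0..n-1 into m slots with a freshly chosen 2-independent
    function h(x) = ((a*x + b) mod p) mod m; return the load on `slot`.
    (The double-mod is only approximately uniform, but fine for a sketch.)"""
    a, b = random.randrange(1, p), random.randrange(p)
    return sum(1 for x in range(n) if ((a * x + b) % p) % m == slot)

def tail_estimate(n, m, threshold, trials=300):
    """Empirical Pr[load >= threshold] for one fixed slot."""
    return sum(slot_load(n, m) >= threshold for _ in range(trials)) / trials
```

With n = 1000 keys and m = 100 slots, the expected load is μ = 10, and pairwise independence gives Var[X] ≤ μ, so Chebyshev bounds Pr[X ≥ 20] by 10/10² = 0.1; the empirical tail comes out far smaller, which is exactly the "tighter bounds from more independence" story in miniature.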
I also tuned up the lecture on cuckoo hashing in a few major ways. To motivate the analysis, rather than beginning with the assumption that the number of slots per table needs to be the number of elements times some constant greater than one, I ran some experiments on my computer to generate a plot of the success rate as a function of the table size, which (unsurprisingly) led to the standard phase transition plot showing a rapid shift from "you'll almost certainly fail" to "you'll almost certainly succeed." I did this as well with the number of displacements per insertion, generating a plot that strongly suggested that insertions were expected O(1). This went over really well with students - I got all sorts of questions about how the data were generated and it showed off how, in practice, these experiments help drive theory. This is something I'd absolutely love to retrofit into previous lectures if at all possible, since it shows that we aren't purely in Theoryland all the time when designing data structures. Another major change in this lecture was the introduction of a different proof that shows the probability of a failed insert is low, which came courtesy of a CS166 project team from last quarter. As expected, these changes ate into the lecture time and we didn't manage to cover everything I was hoping to touch on. I think that can be fixed by replacing the super generic version of the subcritical Galton-Watson process math with a more focused one (we just need it for Binom(n, 1/m) variables) that's heavier on the intuition (we're building a tree with an exponentially-decaying branching factor) than on the math, and possibly by offloading some of the discussion of second-choice hashing and hashing with relocation to the previous lecture on linear probing.
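An experiment along these lines is easy to reproduce. Here's my own sketch of such a simulation (the in-class experiments may have differed in details): two tables of size m each, and insertion succeeds when every eviction chain terminates before a kick limit. Sweeping n past m reproduces the phase transition.

```python
import random

def cuckoo_insert_all(keys, m, max_kicks=100):
    """Try to place all (distinct) keys into two size-m tables by cuckoo
    hashing. Returns (success, total_displacements)."""
    t1, t2 = [None] * m, [None] * m
    p = 2_147_483_647
    def make_hash():   # fresh random hash h(x) = ((a*x + b) mod p) mod m
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % m
    h1, h2 = make_hash(), make_hash()
    displacements = 0
    for key in keys:
        x, table, h = key, t1, h1
        for _ in range(max_kicks):
            slot = h(x)
            if table[slot] is None:
                table[slot] = x
                break
            table[slot], x = x, table[slot]   # evict the current occupant
            displacements += 1
            # The evicted element tries its slot in the *other* table.
            table, h = (t2, h2) if table is t1 else (t1, h1)
        else:
            return False, displacements       # chain too long: rebuild needed
    return True, displacements

def success_rate(n, m, trials=50):
    wins = sum(cuckoo_insert_all(random.sample(range(10**9), n), m)[0]
               for _ in range(trials))
    return wins / trials
```

Plotting success_rate(n, m) for n sweeping from well below m to above it is what produces the sharp "almost certainly succeed" to "almost certainly fail" transition near n ≈ m, and summing displacements over insertions gives the expected-O(1) plot.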
I added in a new lecture on approximate membership queries this quarter, replacing the vestigial lecture on dynamic connectivity from last time. I set out to cover Bloom filters, quotient filters, and cuckoo filters, though in development I realized that there was no way we were going to get cuckoo filters to fit, and when the lecture actually ran we barely managed to touch quotient filters. This lecture was the first I'd given that touched on information-theoretic lower bounds and on the idea that we might want to count individual bits, and it was a ton of fun to put together! I also think that the key idea I was exploring to bridge Bloom filters and quotient filters - the idea of fingerprinting elements using hashing ("universe reduction," I later learned this is called) and then storing the fingerprints efficiently - is a really beautiful one that I'd like to spend more time on, so I think I might split this into two lectures (one purely on Bloom filters and Bloom filter variants and a second on fingerprinting, quotient filters, and, yes, cuckoo filters).
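For reference, the Bloom filter half of that story is tiny in code. This is a minimal illustrative version of my own (sizing formulas are the standard ones; the names are made up, and this isn't the in-class code):

```python
import math
import random

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives are possible; false negatives are not."""
    def __init__(self, n_expected, fp_rate):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.m = math.ceil(-n_expected * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_expected * math.log(2)))
        self.bits = [False] * self.m
        self.p = 2_147_483_647
        self.hashes = [(random.randrange(1, self.p), random.randrange(self.p))
                       for _ in range(self.k)]

    def _slots(self, x):
        return [((a * x + b) % self.p) % self.m for a, b in self.hashes]

    def add(self, x):
        for s in self._slots(x):
            self.bits[s] = True

    def __contains__(self, x):
        return all(self.bits[s] for s in self._slots(x))
```

The fingerprinting/universe-reduction idea is what you get by pushing back on this design: instead of ORing k bits together per element, hash each element down to a short fingerprint and store the fingerprints compactly, which is the doorway to quotient and cuckoo filters.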
We concluded the lecture series with the treatment of integer structures from last time. Students were surprisingly engaged and actively questioned many of the premises of what we were studying (why can you assume integer operations are free/fast? why do you count multiplication when it's so slow? why are we looking at this if these aren't used in practice?). I'm glad that this wasn't my first go-around with these structures, because I was much more prepared to answer questions now than last time! I really liked the two-lecture series beginning with sardine trees and concluding with fusion trees, and I'd be sad to see them go, especially after all the work that went into them, but given their reception and the fact that there are higher-value topics to hit (splay trees, approximate membership queries, count sketches), I would give good odds that they aren't going to survive to see the next iteration of the course.
On the logistics side, I tuned up the problem sets this quarter. PS1 included a "build the fastest RMQ you can" contest that really got students excited, and I rewrote the starter files in C++ for consistency with the rest of the quarter. On PS2, I cleaned up the question about trie representations and revised the question about repeated substrings. To give students more practice with augmented trees, on PS3 I replaced the older question about red/black versus AVL trees with a new coding question about dynamic point-in-range searches, which I think went pretty well (though it was admittedly a tough question!). On PS4, I removed the discussion of weight-balanced trees, which, while interesting, was not particularly geared toward our standard amortization techniques, and added in some more practice with basic amortization. PS5 was for the most part unchanged, with only slight tweaks to the cardinality estimation question. These problem sets seemed to work out pretty well, though as usual I've got tons of notes on them for next time. I was a bit sad that I didn't have the bandwidth to make more "make the fastest X possible" contests; that's something I would have liked to give students more exposure to.
The biggest policy change this quarter was how we did the final projects. With only three (excellent!) TAs on staff, we required students to work in teams of three. To help keep people more on-track for finishing, we set up a final project checkpoint due at the end of Week 7 that asked students to summarize their preliminary research, answer some questions specific to their topic, and propose an "interesting" component. I'm writing this before we're 100% done reading project papers, but my sense is that both the project papers and the project presentations were much better this quarter than in the past, and I think it's largely due to these changes. The project format still needs some minor tuning (for example, how do we get students to interrogate their topics in more detail?), but this was, I think, a huge improvement from before.
The last major change was the midterm, which I converted to a 48-hour take-home exam instead of a three-hour sit-down exam. I think this was a good call - people did much better this time around and it seemed like the exam fulfilled its duty as a sanity-check on individual student progress.
As usual, I've learned a ton this quarter and am blown away by the final projects and the level of student creativity demonstrated here. This class was a blast to teach and I'm excited to see how it turns out when I run it next year in 2020. My own intuitions for these topics are significantly stronger than before, and I think that's leading to significantly improved lectures, problem sets, etc. Let's see what things look like when I can take the full accumulated wisdom from this quarter and pump it back into the course one more time. :-)
This iteration of CS166 was a blast to teach. After many, many hours of reading papers, reverse-engineering intuitions, and creating monster slide decks, I was able to replace coverage of the DC3 suffix array construction algorithm with the newer, faster, but more complicated SA-IS algorithm, which has great theoretical and practical performance. I also introduced fusion trees to the course, replacing an older lecture on disjoint-set forests, and I'm quite proud of how those lectures turned out. I learned a ton in the course of putting those lectures together and hope that those slides hold up well!
I also made a number of major touch-ups to some of the existing lectures, including a mostly redesigned lecture on x-fast and y-fast tries, better visualizations for the Fischer-Heun RMQ structure, and a mathematically streamlined presentation of the analysis of linear probing.
I made a number of revisions to the problem sets and coding assignments from last quarter. I removed the questions from Spring 2016 that asked students to derive KMP as a special case of Aho-Corasick and Fenwick trees as an optimized version of augmented trees, since those questions had the unfortunate side-effect of being answerable with a quick Google search without much conceptual understanding. In their place, I added in a new coding assignment involving SA-IS and suffix array searching and some questions about different types of balanced trees. I'm thinking that in the next offering of CS166 I'll keep the SA-IS assignment (and, in fact, probably soup it up!), but will likely cut the balanced trees question in favor of more practice with tree augmentation (for example, giving students a pluggable augmented tree and asking them to implement various data structures or algorithms with it). On PS4, I added in a new question about a linear-time algorithm for building weight-balanced trees and slightly adjusted the coding question to address it, which I may dial back in future quarters to focus on other topics. One change I'm quite happy with is the introduction of a new question about cardinality estimation on PS5, which with a little bit of tuning I think could become a mainstay!
I have a ton of notes to myself of what to work on for future quarters. I think there's space for a survey lecture on approximate membership query data structures (Bloom filters, quotient filters, and cuckoo filters, possibly with Bloomier filters) given how popular these topics were in the final projects and how nicely they dovetail with the existing randomized structures we've covered. I'd like to introduce coding questions that have a performance optimization component so that students can get a feel for what techniques are out there for speeding data structures up in practice - as well as the gap between theory and practice. My lecture on Fibonacci heaps could use a few fixups with regards to the intuition behind the extract-min procedure, and I think I should move away from the traditional presentation of splay trees (which uses a mixture of the amortized and startup costs) toward a more streamlined model that folds those costs into the construction cost or exposes them separately. The problem sets could probably use a little bit of rebalancing, as well, and I'd like to offer a little more guidance on experimental design for the final projects.
As usual, it was a ton of fun seeing all the final projects from this quarter. One project in particular looks like it may be publishable (a clever analysis of k-d trees), and I was blown away by another project on rotation distances in binary search trees through an isometry between polygon triangulations and BSTs. I also learned several new structures, like majority-inverter graphs, cell lists, hollow heaps, learned index structures, Bloomier filters, area minimum queries, Hopfield networks, relaxed radix-based trees, persistent B-trees, and distributed hash tables. Some of these seem like prime candidates for integration into future offerings of CS166, and I'm looking forward to playing around with them!
A huge thanks to this quarter's CS166 students and staff for making this class so much fun to teach!
This second iteration of CS166 was a lot of fun! I'd learned a lot about data structures since the last time I taught this course, so there were a few new topics (namely, the analysis of 5-independence in linear probing) and the problem sets were more interesting and more polished than in the first iteration. The final projects this time around continued to shine. Many are publishable, and some of the explanations we saw were so good that I'm encouraging students to post their findings online somewhere.
I moved a few of the topics around this time, putting string data structures closer to the beginning, where they tie in nicely with RMQ structures, and deferring balanced trees until later. I rebuilt my lecture on Aho-Corasick from scratch (it turned out a lot better than before) and combined the lectures on dynamic graphs and Euler tour trees into a single survey lecture on dynamic graph algorithms. I also cleaned up the presentation of disjoint-set forests, the split and join operations on trees, and splay trees (and I'm particularly proud of that last one). Next time around, I'll see if I can work in some other topics from the final presentations.
This was my first time teaching CS166 and it was a wonderful experience. The students, the staff, and I all learned a lot and the final project presentations were just wonderful. I'm looking forward to teaching this class again in the future!