Data Structures: The Code That Isn't There
Between YOW! 2012 and Strange Loop 2012’s videos slowly coming online, there are a lot of conference talks available to watch.
So tonight, instead of doing the final exam for Algorithms Part 2, I watched Data Structures – The Code That Isn’t There, by Scott Vokes.
The talk’s title comes from this quote from Gordon Bell:
The cheapest, fastest and most reliable components are those that aren’t there.
In his post on how the talk went, Scott provides a couple of links to notes taken by others. Apologies for any inaccuracies in my own notes.
- Skiplists: Like a linked list, but with additional pointers at certain items that jump to items much further down the line (ie you ‘skip’ a bunch of items). The structure ends up looking like a balanced binary tree. Because it’s mutable it won’t stay perfectly balanced over time, but that doesn’t matter: it remains balanced enough. In fact, coin flips can be used to decide how many additional pointers each item gets, and the odds are very good that the end result will be very close to optimal.
- Difference lists: Immutable, but you can append to one, and it will behave in future as though it had always been that way. My reading is that the list facilitates this by including a ‘hole’ at the end which can be filled later (by another list + hole… thus allowing unlimited appends).
- Rolling Hash: Finds matching/overlapping sequences in binary data. Useful when you want to compare two long binary strings (eg DNA). Moves a window over the binary data, hashing the windowed content as it goes. Used in rsync, where it reduces the number of proper hash calculations that need to be done on the destination side, so only the unmatched parts need to be requested – I’m not 100% clear on the reasons why, so won’t write much more!
- Jumprope: Storage for large binary strings. Blocks of the content are addressed by their hash rather than by pointers; this means that duplicate blocks only need to be stored once (and that it can use distributed key → value stores!). Immutable (thus can be cached). Also persistent. Apparently like a git repo, except git is better for lots of small files while this is better for a few large files. (I… won’t pretend I understand how this works internally.)
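The coin-flip idea behind skiplists can be sketched in code. This is my own minimal illustration, not Scott’s implementation: each node gets a random number of forward pointers, with each extra level requiring another successful coin flip, so roughly half the nodes have two levels, a quarter have three, and so on.

```python
import random

class Node:
    def __init__(self, value, level):
        self.value = value
        self.next = [None] * level  # one forward pointer per level

class SkipList:
    MAX_LEVEL = 16

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)  # sentinel head node

    def _random_level(self):
        # Coin flips decide how many skip pointers a node gets.
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1
        return level

    def insert(self, value):
        # Walk down from the top level, recording the last node seen
        # at each level; those become the new node's predecessors.
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.next[i] and node.next[i].value < value:
                node = node.next[i]
            update[i] = node
        new = Node(value, self._random_level())
        for i in range(len(new.next)):
            new.next[i] = update[i].next[i]
            update[i].next[i] = new

    def __contains__(self, value):
        node = self.head
        for i in reversed(range(self.MAX_LEVEL)):
            while node.next[i] and node.next[i].value < value:
                node = node.next[i]
        node = node.next[0]
        return node is not None and node.value == value
```

Searching works the same way as inserting: start at the topmost, sparsest level and drop down a level each time the next pointer would overshoot.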
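My reading of the difference-list ‘hole’ can also be sketched. In functional languages a difference list is often represented as a function from a tail (the hole) to the completed list; the Python below is my own toy version of that idea, with hypothetical names.

```python
def dlist(*items):
    # A difference list is a function that, given a tail (the 'hole'),
    # returns the full list with that tail on the end.
    return lambda hole: list(items) + hole

def append(d1, d2):
    # Appending plugs d2 into d1's hole; d2's own hole stays open,
    # so further appends are still possible. This is O(1) at append time.
    return lambda hole: d1(d2(hole))

def to_list(d):
    # Plug the final hole with the empty list to materialise the result.
    return d([])
```

Nothing is mutated when you append – you just build a new function – which is how the list can be immutable yet behave as though it had always included the appended items.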
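The rolling hash’s windowed hashing can be shown with a small Rabin–Karp-style sketch (my own illustration, not the specific hash rsync uses): after hashing the first window, each step removes the outgoing byte and adds the incoming one in constant time, rather than rehashing the whole window.

```python
def rolling_hashes(data, window, base=256, mod=1_000_000_007):
    # Slide a fixed-size window over the data, updating the hash in
    # O(1) per step. Equal windows always produce equal hashes, so
    # matching hashes flag candidate matches in the data.
    if len(data) < window:
        return []
    h = 0
    for b in data[:window]:
        h = (h * base + b) % mod
    hashes = [h]
    top = pow(base, window - 1, mod)  # weight of the outgoing byte
    for i in range(window, len(data)):
        h = (h - data[i - window] * top) % mod  # drop outgoing byte
        h = (h * base + data[i]) % mod          # bring in new byte
        hashes.append(h)
    return hashes
```

For example, over `b"abracadabra"` with a window of 4, the first and last windows are both `abra`, so their hashes match – which is the kind of repeated-sequence detection the talk describes.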
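The hash-addressed-blocks idea from the jumprope bullet can be sketched as a toy content-addressed store. This is a big simplification of the real structure (it uses naive fixed-size blocks and skips the tree/rope part entirely; all names are hypothetical), but it shows why duplicate blocks are stored only once.

```python
import hashlib

class BlockStore:
    # Toy content-addressed storage: blocks are keyed by their SHA-256
    # hash rather than by pointers, so identical blocks dedupe for free,
    # and the key->value mapping could live in a distributed store.
    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = {}

    def put(self, data):
        # Split into fixed-size blocks, store each under its hash, and
        # return the list of hashes as a handle to the whole value.
        keys = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            key = hashlib.sha256(block).hexdigest()
            self.blocks[key] = block  # a duplicate block just overwrites itself
            keys.append(key)
        return keys

    def get(self, keys):
        return b"".join(self.blocks[k] for k in keys)
```

Because the blocks are immutable and keyed by content, handles can be cached safely, and two large values that share blocks share the underlying storage.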
Other comments:
- Yet another mention of Prolog as something with a lot of useful ideas in it. Scott worked through The Art of Prolog. He mentions Prolog’s logic variables as a way the language lets you model partial information explicitly.
- The difference list discussion in particular brings back happy memories of the Clojure content covered in Brian Marick’s book Functional Programming for the Object-Oriented Programmer.
- Scott also mentions the book Managing Gigabytes, which apparently has a lot of cool theory about data compression.