Thursday, April 21, 2011

Four Data Structures Every Programmer Should Know

I went back to reading Stack Overflow during my convalescence. And saw a bunch of questions and answers that really disturbed me. For those who don't want to click, they're all questions that relate to data structures.

OK, I realize that data structures can be daunting. Knuth took three books just to cover the basics. Cormen is nearly 1,300 pages, and Sedgewick almost 1,000. And I'll be the first to admit that I've not memorized any of these books, or even read significant pieces of them (much less cover-to-cover). They're meant as references: if and when I need to implement an algorithm, I open the book. If you asked me to sit down and implement a red-black tree without a book, I couldn't do it.

But I do know the performance characteristics of binary trees, and I know that java.util.TreeMap uses a binary tree. And as a Java programmer, I know when I should use a TreeMap versus a HashMap, versus an ArrayList that I've sorted. And I think that knowledge is a baseline that every programmer should have.

So, here's a cheat-sheet of the four data structures that every programmer should know, because they'll handle almost all of your in-memory data management needs.

Structure           Insert    Remove    Search    Indexed Access   Example
Hash table          O(1)      O(1)      O(1)      n/a              HashMap
Binary tree         O(logN)   O(logN)   O(logN)   n/a              TreeMap
Linked list         O(1)      O(1)      O(N)      O(N)             LinkedList
Array-backed list   O(N)      O(N)      O(N)      O(1)             ArrayList

Hash table

If you need to maintain a key-value relationship, then a hashed data structure should be your first choice. As I've written elsewhere, a hashed lookup is as fast as you can get: O(1) for insertion, removal, and search. At least, it's O(1) if you have a good hash function; with a bad hash function, performance can degrade to O(N).

There is one downside to hash tables: they add a lot of memory overhead. To avoid collisions, a hash table needs to be larger than the number of entries it holds; by default, the Java HashMap resizes when the table becomes 75% full. There's also per-entry overhead from the linked lists used to chain entries within a bucket: each entry is a distinct object, and in the Sun implementation, on a 32-bit JVM, takes 24 bytes, not counting the key and value that it references.
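To make the resize behavior concrete, here's a minimal sketch using HashMap's capacity-and-load-factor constructor; the sizes are just illustrative:

    import java.util.HashMap;
    import java.util.Map;

    public class HashMapSizing {
        public static void main(String[] args) {
            // a default HashMap starts with 16 buckets and a 0.75 load factor,
            // so it resizes (doubling and rehashing) once it holds 12 entries
            Map<String, Integer> defaults = new HashMap<String, Integer>();
            defaults.put("answer", 42);    // O(1), given a good hash function

            // if we expect ~1,000 entries, pre-sizing avoids those resize
            // passes: 2048 buckets * 0.75 load factor = room for 1536 entries
            Map<String, Integer> sized = new HashMap<String, Integer>(2048, 0.75f);
            for (int i = 0; i < 1000; i++) {
                sized.put("key" + i, i);
            }
            System.out.println(sized.get("key999"));   // prints 999
        }
    }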

Binary tree

Whenever someone mentions O(logN) performance, they're talking about a tree of some form. Trees implement a “divide and conquer” approach to data: starting at a “root” node, you recursively identify the child node that contains the data you need. At each step, you divide the total search space by some constant factor; in the case of a binary tree, that factor is 2: you divide the space in half. There are a bunch of different kinds of trees, and a bunch of data structures that implement them, but as far as the Java collections are concerned, every tree is binary, with two children per node (which means that O(logN) is really O(log2N)).

TreeMap and TreeSet use a tree structure in which the value of a node's “left” child is less than the value of the node, and the value of the node's “right” child is greater. This means that not only do you get O(logN) search performance, you can also easily iterate through the elements in order, from smallest to largest.

And this is (with a few exceedingly rare exceptions) the only reason that you would pick TreeMap over HashMap. For example, if you are extracting word counts from a piece of text, it makes sense to manage the words in sorted order, because you'll want to output them in that form; see the sketch below. Outside of use cases like this, the performance benefit of HashMap makes it the better choice.
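Here's a minimal sketch of that word-count use case; the sample text and the whitespace-only tokenization are just for illustration:

    import java.util.Map;
    import java.util.TreeMap;

    public class WordCount {
        public static void main(String[] args) {
            String text = "the quick brown fox jumps over the lazy dog";

            // TreeMap keeps its keys sorted, so iteration is alphabetical
            Map<String, Integer> counts = new TreeMap<String, Integer>();
            for (String word : text.split("\\s+")) {
                Integer count = counts.get(word);
                counts.put(word, (count == null) ? 1 : count + 1);
            }

            // prints brown=1, dog=1, fox=1, ... the=2, in sorted key order
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                System.out.println(entry.getKey() + "=" + entry.getValue());
            }
        }
    }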

PriorityQueue is another Java class that implements a binary tree structure; however, it does not maintain the order of its elements. Instead, it uses a heap: a tree in which each element is required to have a lower value than either of its children. This means that inserts are O(logN) but searching is O(N); you know that the root is the smallest value in the tree, but you know nothing about the relative order of any of the others.

Priority queues have a very limited set of use cases. They're great when you need to process or initiate time-dependent events, or for graph traversal algorithms where you care about a least-cost path. Because they don't maintain a total ordering over their elements, there's less work to do on each insert and remove, and there's no need for an explicit node object. Even though two data structures can have the same “big O” complexity, constant-factor differences like these can make them very different in real-world performance.
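A quick sketch of that behavior, with arbitrary values:

    import java.util.PriorityQueue;
    import java.util.Queue;

    public class PriorityQueueExample {
        public static void main(String[] args) {
            Queue<Integer> queue = new PriorityQueue<Integer>();
            queue.add(42);    // each insert is O(logN)
            queue.add(7);
            queue.add(19);

            // poll() always removes the smallest remaining element: 7, 19, 42
            while (!queue.isEmpty()) {
                System.out.println(queue.poll());
            }
        }
    }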

I mention priority queues because of one of the SO links above. Because priority queues are binary structures, they take O(logN) time for inserts and removes. That's a really dumb penalty to pay if you're looking for a simple linear queue (i.e., FIFO: first-in, first-out).

Linked list

Which brings me to linked lists, the traditional data structure for queues. As long as you actually hold a reference to a node, insertion and removal are O(1): they're implemented by updating pointers. It's easy to get a reference to either the first or last node (at least in a doubly-linked list, which is what the Java LinkedList class implements). It's not so easy to get at an arbitrary node, because you first have to iterate over its predecessors, an O(N) activity.
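For example, a minimal FIFO sketch that uses LinkedList through its Queue interface (the values are arbitrary):

    import java.util.LinkedList;
    import java.util.Queue;

    public class FifoExample {
        public static void main(String[] args) {
            // LinkedList implements Queue: O(1) at both head and tail
            Queue<String> queue = new LinkedList<String>();
            queue.add("first");     // appended at the tail
            queue.add("second");
            queue.add("third");

            System.out.println(queue.poll());   // "first", removed from the head
            System.out.println(queue.poll());   // "second"
        }
    }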

But linked lists have a dark side of their own: each node is a distinct object that consumes memory (20 bytes on a Sun 32-bit JVM) and must eventually be garbage-collected. And the use case for removing elements from the head or tail of a list tends to be multi-threaded applications, in which case one of the classes in java.util.concurrent will be more appropriate. In fact, the only time that LinkedList would be my first choice is for an application where I need a known, fixed memory cost for each added element.

Array-backed list

For all other generic lists, my first choice is ArrayList. As its name implies, it implements the List interface using a pre-allocated array for storage. This means that you get O(1) access to any value, as long as you know its position in the array. It also means that, while adding or removing the last element is O(1), it's O(N) for any other position, because the other values in the list have to be shifted. In practice, however, most additions are to the end of a list. And even where you need to add or remove values at the start of the list, for a small list (a few thousand entries) the CPU cost of shifting entries is lower than the cost of managing a LinkedList's node objects.
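A small sketch of those costs, with arbitrary values:

    import java.util.ArrayList;
    import java.util.List;

    public class ArrayListOps {
        public static void main(String[] args) {
            List<String> list = new ArrayList<String>();
            list.add("b");              // O(1) amortized: appended at the end
            list.add("c");
            list.add(0, "a");           // O(N): "b" and "c" shift right one slot

            System.out.println(list.get(2));    // O(1): direct array index, prints "c"
            list.remove(0);                     // O(N): remaining entries shift left
        }
    }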

The problem with ArrayList is the way that it resizes the underlying array. As I said, it pre-allocates an array for storage; if you want to add more elements than the array can hold, it must allocate a new, larger array, and copy the existing entries into it. The documentation describes this behavior as “amortized” O(1) performance: although the copy is an O(N) operation, it will only happen every N times you add an element, and O(N) / N == O(1).

More important in practice is that an ArrayList allocates a new array that is some significant percentage larger than the old one (the growth factor is explicitly undocumented, but in the Sun 1.6 JDK it's 50%). If you're pushing the boundaries of available memory and need to expand a large list, it's all too easy to get an OutOfMemoryError, because the old and new arrays must coexist while the entries are copied. There's also the potential for a lot of wasted space, particularly with small lists.
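When you do know roughly how many entries to expect, pre-sizing sidesteps the resize behavior entirely; a minimal sketch, where expectedSize is just an illustrative name:

    import java.util.ArrayList;
    import java.util.List;

    public class PreSizedList {
        public static void main(String[] args) {
            int expectedSize = 100000;

            // sizing the backing array up front avoids repeated allocate-and-copy
            // passes, and the transient spike of holding old and new arrays at once
            List<Integer> values = new ArrayList<Integer>(expectedSize);
            for (int i = 0; i < expectedSize; i++) {
                values.add(i);
            }
            System.out.println(values.size());  // 100000, with no resizes along the way
        }
    }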

Summary

I would say that, 99% of the time, when I need a Java collection class I turn to ArrayList and HashMap. To me, they represent the Platonic ideal for List and Map implementations. But I do that with full understanding of their limitations, and am ready to switch to a different data structure (or implement my own) when I run into those limitations.
