If you need to build a lookup table for some large number of values, what data structure should you choose?
Obviously, you want a hashed data structure. In Java, a `HashMap` or `HashSet`. It will provide O(1) performance, versus O(log N) for a binary search and O(N) for a linear search. And the "Big-O" performance of an algorithm becomes increasingly important the larger the table becomes. Or does it?
To find out, I wrote a micro-benchmark that compared different approaches for varying table sizes. One version used a `HashSet`, another used an `Integer[]` and `Arrays.binarySearch()`, and a third used an `int[]` and a home-grown binary search. For each of them, I probed the lookup table 10,000,000 times with random values, and recorded the execution time in seconds.
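The original benchmark code isn't shown in this post, but the three probe strategies can be sketched as follows. This is a minimal illustration, not the actual benchmark: the class and method names are mine, and the driver just cross-checks that all three strategies agree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

// Sketch of the three lookup strategies compared in the benchmark.
// Names are illustrative; this is not the original benchmark code.
public class LookupCompare {

    // Strategy 1: hashed lookup, amortized O(1)
    static boolean probeHashSet(HashSet<Integer> table, int key) {
        return table.contains(key);
    }

    // Strategy 2: binary search over a sorted Integer[], O(log N), with boxing
    static boolean probeObjectArray(Integer[] sorted, int key) {
        return Arrays.binarySearch(sorted, key) >= 0;
    }

    // Strategy 3: home-grown binary search over a sorted int[], O(log N), no boxing
    static boolean probePrimitiveArray(int[] sorted, int key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;       // unsigned shift avoids int overflow
            if (sorted[mid] < key)       lo = mid + 1;
            else if (sorted[mid] > key)  hi = mid - 1;
            else return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int size = 1_000;
        int[] primitive = new int[size];
        Integer[] boxed = new Integer[size];
        HashSet<Integer> hashed = new HashSet<>();
        for (int i = 0; i < size; i++) {
            primitive[i] = i * 2;            // even numbers, already sorted
            boxed[i] = i * 2;
            hashed.add(i * 2);
        }
        Random rnd = new Random();
        for (int i = 0; i < 10_000; i++) {
            int key = rnd.nextInt(size * 2);
            boolean expected = (key % 2 == 0);   // only even keys are present
            if (probeHashSet(hashed, key) != expected
                    || probeObjectArray(boxed, key) != expected
                    || probePrimitiveArray(primitive, key) != expected) {
                throw new AssertionError("strategies disagree at key " + key);
            }
        }
        System.out.println("all probes agree");
    }
}
```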
| Table Size | `HashSet` | Binary Search, Object | Binary Search, Primitive |
|---:|---:|---:|---:|
| 1,000 | 0.37 | 1.00 | 0.88 |
| 10,000 | 0.43 | 1.99 | 1.19 |
| 100,000 | 0.54 | 3.18 | 1.60 |
| 1,000,000 | 1.00 | 5.75 | 2.33 |
| 10,000,000 | 1.19 | 12.70 | 4.99 |
| 100,000,000 | 1.58 | 20.68 | 8.50 |
| 1,000,000,000 | n/a | n/a | 13.28 |
No surprises: the execution time for the hashed version remained relatively constant, while the binary search time increased logarithmically with the size of the table.* But this table leaves out two important pieces of information.
The first is that real-world applications rarely spend all their time doing lookups. Instead, they do a lookup and then do something else with the results. And while the `HashSet` outperformed the `Integer[]` search by a factor of 13, the absolute performance of the latter is two microseconds per iteration. For a real-world application, this is almost certainly sufficient.
Even so, it seems to make sense to go with the `HashSet`. But there's another factor at work: absolute performance isn't the only way to measure this benchmark. And if you noticed the "n/a" at the bottom of the first two columns, and the inclusion of a primitive version in the last column, you can probably guess where this post is going.**
| Table Size | `HashSet` | Binary Search, Object | Binary Search, Primitive |
|---:|---:|---:|---:|
| 1,000,000 | 178,054K | 142,525K | 126,900K |
| 10,000,000 | 656,742K | 318,307K | 162,057K |
| 100,000,000 | 5,286,458K | 2,077,979K | 513,619K |
| 1,000,000,000 | n/a | n/a | 3,969,678K |
I ran these tests on an Amazon EC2 double-extra-large instance, with a 16 GB Java heap. It costs $1 per hour ($8,766/year) to rent this machine. If my application needed a billion-entry lookup table, I would have two choices: either use a sorted array, or upgrade to a quad-extra-large instance and double my costs. For a single machine, maybe not an issue. But if I'm running a cluster of machines, the cost difference would add up quickly.
The bottom line is that absolute CPU performance is only one factor that goes into picking a data structure. Memory consumption is another, and programmer performance is a third, one that's perhaps more important than the other two put together. A `HashSet` is a well-known, easy-to-use data structure. A sorted array is less so, and becomes increasingly complex if you have to implement a custom `Comparator`. Using one will increase your time to implement, and perhaps your maintenance costs. So it wouldn't be my first choice.
But if I were faced with a limited-memory environment, neither would I be blinded by “Big-O” performance.
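To give a sense of that extra complexity, here is a hedged sketch of what searching a sorted object array with a custom `Comparator` involves. The `Person` record and `BY_NAME` comparator are hypothetical, invented for illustration; the key point is that the array must be sorted and searched with the *same* comparator, a subtle invariant a `HashSet` never asks you to maintain.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical example: binary-searching a sorted object array
// with a custom Comparator (requires Java 16+ for records).
public class ComparatorSearch {
    record Person(String name, int age) {}

    // The array must already be sorted by this same comparator,
    // or binarySearch results are undefined.
    static final Comparator<Person> BY_NAME = Comparator.comparing(Person::name);

    public static void main(String[] args) {
        Person[] people = {
            new Person("carol", 41),
            new Person("alice", 34),
            new Person("bob", 28),
        };
        Arrays.sort(people, BY_NAME);
        // Probe with a key object; only the compared field matters.
        int idx = Arrays.binarySearch(people, new Person("bob", 0), BY_NAME);
        System.out.println(idx >= 0 ? people[idx].age() : -1);   // prints 28
    }
}
```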
\* If the `HashSet` implementation is O(1), why did its performance decrease as the table got larger? There's always a risk, when you write a benchmark, that you'll get some apparently anomalous results and will need to explain them. In this case, I believe the performance difference is an artifact of the processor cache. I ran the tests on a machine with an 8MB cache, which is large enough to hold the entire hash array for 100,000 entries, and probably a large number of the buckets for 1,000 and 10,000 entries. Since the test repeatedly probes the table, the cache is a significant contributor to overall performance. Yet another pitfall in a micro-benchmark …
\*\* You may be wondering why I only show the last four rows of the table. The reason is another pitfall of micro-benchmarks: finding the correct measurement tool. I used `MemoryMXBean.getHeapMemoryUsage()`, along with a couple of calls to `System.gc()`. This should have shown me only the memory used by live objects, but instead I saw a minimum of 100 MB for each run. With the larger table sizes, however, the real memory consumption overwhelmed this baseline "noise."
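The measurement sequence described above might look roughly like this. This is a sketch under my own assumptions, not the post's actual measurement code; in particular, `System.gc()` is only a request, so the "live objects only" reading is best-effort.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sketch of measuring live-object heap usage after building a lookup table.
// Calling System.gc() twice is a common (if unguaranteed) idiom to settle the heap.
public class HeapMeasure {
    public static void main(String[] args) {
        int[] table = new int[1_000_000];    // stand-in for the real lookup table

        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.gc();
        System.gc();
        long usedKb = memory.getHeapMemoryUsage().getUsed() / 1024;
        System.out.println("heap used: " + usedKb + "K");

        // Keep the table strongly reachable so it can't be collected
        // before the measurement is taken.
        if (table.length == 0) throw new IllegalStateException();
    }
}
```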