If you need to build a lookup table for some large number of values, what data structure should you choose?
Obviously, you want a hashed data structure: in Java, a HashMap or HashSet. It will provide O(1) performance, versus O(log N) for a binary search and O(N) for a linear search. And the “Big-O” performance of an algorithm becomes more and more important as the table grows. Or does it?
To find out, I wrote a micro-benchmark that compared different approaches for varying table sizes. One version used a HashSet, another used an Integer[] and Arrays.binarySearch(), and a third used an int[] and a home-grown binary search. For each of them, I probed the lookup table 10,000,000 times with random values, and recorded the execution time in seconds.
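The full harness isn't shown here, but the core of each test looks roughly like the sketch below; the class name, the way the table is populated, and the probing strategy are all simplified for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public class LookupBenchmark
{
    private static final int PROBE_COUNT = 10_000_000;

    public static void main(String[] args)
    {
        int tableSize = Integer.parseInt(args[0]);

        // build all three lookup structures over the same values;
        // filling in ascending order means the arrays are already sorted
        int[] primitives = new int[tableSize];
        Integer[] boxed = new Integer[tableSize];
        Set<Integer> hashed = new HashSet<Integer>();
        for (int i = 0; i < tableSize; i++)
        {
            primitives[i] = i;
            boxed[i] = Integer.valueOf(i);
            hashed.add(Integer.valueOf(i));
        }

        // probe with random values; accumulate hits so the JIT can't
        // discard the lookups as dead code
        Random rnd = new Random();
        long hits = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < PROBE_COUNT; i++)
        {
            int probe = rnd.nextInt(tableSize);

            // enable exactly one of these per run
            if (hashed.contains(probe)) hits++;
            // if (Arrays.binarySearch(boxed, Integer.valueOf(probe)) >= 0) hits++;
            // if (primitiveSearch(primitives, probe) >= 0) hits++;
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println("hits = " + hits + ", elapsed = " + (elapsed / 1000.0) + " sec");
    }

    // home-grown binary search over a sorted int[]: same contract as
    // Arrays.binarySearch(), but no autoboxing of the probe or elements
    private static int primitiveSearch(int[] table, int key)
    {
        int lo = 0;
        int hi = table.length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (table[mid] < key)       lo = mid + 1;
            else if (table[mid] > key)  hi = mid - 1;
            else                        return mid;
        }
        return -(lo + 1);
    }
}
```

Here are the timings, in seconds: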
| Table Size | HashSet | Binary Search, Object | Binary Search, Primitive |
|---|---|---|---|
| 1,000 | 0.37 | 1.00 | 0.88 |
| 10,000 | 0.43 | 1.99 | 1.19 |
| 100,000 | 0.54 | 3.18 | 1.60 |
| 1,000,000 | 1.00 | 5.75 | 2.33 |
| 10,000,000 | 1.19 | 12.70 | 4.99 |
| 100,000,000 | 1.58 | 20.68 | 8.50 |
| 1,000,000,000 | n/a | n/a | 13.28 |
No surprises: the execution time for the hashed version remained relatively constant, while the binary search time increased logarithmically with the size of the table.* But this table leaves out two important pieces of information.
The first is that real-world applications rarely spend all their time doing lookups. Instead, they do a lookup and then do something else with the results. And while the HashSet outperformed the Integer[] search by a factor of 13, the absolute performance of the latter works out to roughly two microseconds per probe (20.68 seconds divided by 10,000,000 probes). For a real-world application, this is almost certainly sufficient.
Even so, it seems to make sense to go with the HashSet. But there's another factor at work: absolute performance isn't the only way to measure this benchmark. And if you noticed the “n/a” at the bottom of the first two columns, and the inclusion of a primitive version in the last column, you can probably guess where this post is going.** The next table shows the heap consumed by each implementation:
| Table Size | HashSet | Binary Search, Object | Binary Search, Primitive |
|---|---|---|---|
| 1,000,000 | 178,054K | 142,525K | 126,900K |
| 10,000,000 | 656,742K | 318,307K | 162,057K |
| 100,000,000 | 5,286,458K | 2,077,979K | 513,619K |
| 1,000,000,000 | n/a | n/a | 3,969,678K |
I ran these tests on an Amazon EC2 double-extra-large instance, with a 16 GB Java heap. It costs $1 per hour ($8,766 per year) to rent this machine. If my application needed a billion-entry lookup table, I would have two choices: either use a sorted array, or upgrade to a quad-extra-large instance and double my costs. For a single machine, maybe not an issue. But if I'm running a cluster of machines, the cost difference adds up quickly.
The bottom line is that absolute CPU performance is only one factor that goes into picking a data structure. Memory consumption is another, and programmer performance is a third, one that's perhaps more important than the other two put together. A HashSet is a well-known, easy-to-use data structure. A sorted array is less so, and becomes increasingly complex if you have to implement a custom Comparator. Using one will increase your time to implement, and perhaps your maintenance costs. So it wouldn't be my first choice.
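To see where that complexity comes from, here's a sketch of a sorted-array lookup over a record type rather than plain integers; the Customer class, its fields, and the find() method are hypothetical, purely for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

public class CustomerLookup
{
    // hypothetical record type; only the id matters for the lookup
    public static class Customer
    {
        public final int id;
        public final String name;

        public Customer(int id, String name)
        {
            this.id = id;
            this.name = name;
        }
    }

    // orders customers by id; the array must be sorted with this same
    // comparator before binarySearch() will work correctly
    private static final Comparator<Customer> BY_ID = new Comparator<Customer>()
    {
        public int compare(Customer c1, Customer c2)
        {
            return Integer.compare(c1.id, c2.id);
        }
    };

    public static Customer find(Customer[] sortedById, int id)
    {
        // binarySearch() needs a probe object of the element type
        Customer probe = new Customer(id, null);
        int idx = Arrays.binarySearch(sortedById, probe, BY_ID);
        return (idx >= 0) ? sortedById[idx] : null;
    }
}
```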
But if I were faced with a limited-memory environment, neither would I be blinded by “Big-O” performance.
* If the HashSet implementation is O(1), why did its performance decrease as the table got larger? There's always a risk, when you write a benchmark, that you'll get some apparently anomalous results and will need to explain them. In this case, I believe the performance difference is an artifact of the processor cache. I ran the tests on a machine with an 8MB cache, which is large enough to hold the entire hash array for 100,000 entries, and probably a large fraction of the buckets for 1,000 and 10,000 entries. Since the test repeatedly probes the table, the cache is a significant contributor to overall performance. Yet another pitfall in a micro-benchmark …
** You may be wondering why I only show the last four rows of the table. The reason is another pitfall of micro-benchmarks: finding the correct measurement tool. I used MemoryMXBean.getHeapMemoryUsage(), along with a couple of calls to System.gc(). This should have shown me only the memory used by live objects, but instead I saw a minimum of 100 MB for each run. With the larger table sizes, however, the real memory consumption overwhelmed this baseline “noise.”
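That measurement approach, in a minimal sketch (not the exact code from my harness), looks like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeter
{
    // returns the heap used by (hopefully) live objects only; System.gc()
    // is just a hint to the JVM, so the result is approximate at best
    public static long usedHeapAfterGC()
    {
        System.gc();
        System.gc();
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        return memoryBean.getHeapMemoryUsage().getUsed();
    }
}
```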