If you need to build a lookup table for some large number of values, what data structure should you choose?
Obviously, you want a hashed data structure. In Java, a `HashMap` or `HashSet`. It will provide O(1) performance, versus O(log N) for a binary search and O(N) for a linear search. And the "Big-O" performance of an algorithm becomes increasingly important the larger the table becomes. Or does it?
To find out, I wrote a micro-benchmark that compared different approaches for varying table sizes. One version used a `HashSet`, another used an `Integer[]` and `Arrays.binarySearch()`, and a third used an `int[]` and a home-grown binary search. For each of them, I probed the lookup table 10,000,000 times with random values, and recorded the execution time in seconds.
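The original benchmark code isn't shown in this post, but the three probe strategies can be sketched as follows. This is a minimal illustration, not the actual benchmark: the class and method names are mine, and the driver just cross-checks that all three strategies agree.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

// Sketch of the three lookup strategies compared in the benchmark.
// Names are illustrative; this is not the original benchmark code.
public class LookupCompare {

    // Strategy 1: hashed lookup, amortized O(1)
    static boolean probeHashSet(HashSet<Integer> table, int key) {
        return table.contains(key);
    }

    // Strategy 2: binary search over a sorted Integer[], O(log N), with boxing
    static boolean probeObjectArray(Integer[] sorted, int key) {
        return Arrays.binarySearch(sorted, key) >= 0;
    }

    // Strategy 3: home-grown binary search over a sorted int[], O(log N), no boxing
    static boolean probePrimitiveArray(int[] sorted, int key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;       // unsigned shift avoids int overflow
            if (sorted[mid] < key)       lo = mid + 1;
            else if (sorted[mid] > key)  hi = mid - 1;
            else return true;
        }
        return false;
    }

    public static void main(String[] args) {
        int size = 1_000;
        int[] primitive = new int[size];
        Integer[] boxed = new Integer[size];
        HashSet<Integer> hashed = new HashSet<>();
        for (int i = 0; i < size; i++) {
            primitive[i] = i * 2;            // even numbers, already sorted
            boxed[i] = i * 2;
            hashed.add(i * 2);
        }
        Random rnd = new Random();
        for (int i = 0; i < 10_000; i++) {
            int key = rnd.nextInt(size * 2);
            boolean expected = (key % 2 == 0);   // only even keys are present
            if (probeHashSet(hashed, key) != expected
                    || probeObjectArray(boxed, key) != expected
                    || probePrimitiveArray(primitive, key) != expected) {
                throw new AssertionError("strategies disagree at key " + key);
            }
        }
        System.out.println("all probes agree");
    }
}
```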
| Table Size | `HashSet` | Binary Search, Object | Binary Search, Primitive |
|---:|---:|---:|---:|
| 1,000 | 0.37 | 1.00 | 0.88 |
| 10,000 | 0.43 | 1.99 | 1.19 |
| 100,000 | 0.54 | 3.18 | 1.60 |
| 1,000,000 | 1.00 | 5.75 | 2.33 |
| 10,000,000 | 1.19 | 12.70 | 4.99 |
| 100,000,000 | 1.58 | 20.68 | 8.50 |
| 1,000,000,000 | n/a | n/a | 13.28 |
No surprises: the execution time for the hashed version remained relatively constant, while the binary search time increased logarithmically with the size of the table.* But this table leaves out two important pieces of information.
The first is that real-world applications rarely spend all their time doing lookups. Instead, they do a lookup and then do something else with the results. And while the `HashSet` outperformed the `Integer[]` search by a factor of 13, the absolute performance of the latter is two microseconds per iteration. For a real-world application, this is almost certainly sufficient.
Even so, it seems to make sense to go with the `HashSet`. But there's another factor at work: absolute performance isn't the only way to measure this benchmark. And if you noticed the "n/a" at the bottom of the first two columns, and the inclusion of a primitive version in the last column, you can probably guess where this post is going.**
| Table Size | `HashSet` | Binary Search, Object | Binary Search, Primitive |
|---:|---:|---:|---:|
| 1,000,000 | 178,054K | 142,525K | 126,900K |
| 10,000,000 | 656,742K | 318,307K | 162,057K |
| 100,000,000 | 5,286,458K | 2,077,979K | 513,619K |
| 1,000,000,000 | n/a | n/a | 3,969,678K |
I ran these tests on an Amazon EC2 double-extra-large instance, with a 16 GB Java heap. It costs $1 per hour ($8,766/year) to rent this machine. If my application needed a billion-entry lookup table, I would have two choices: either use a sorted array, or upgrade to a quad-extra-large instance and double my costs. For a single machine, maybe not an issue. But if I'm running a cluster of machines, the cost difference would add up quickly.
The bottom line is that absolute CPU performance is only one factor that goes into picking a data structure. Memory consumption is another, and programmer performance is a third, one that's perhaps more important than the other two put together. A `HashSet` is a well-known, easy-to-use data structure. A sorted array is less so, and becomes increasingly complex if you have to implement a custom `Comparator`. Using one will increase your time to implement, and perhaps your maintenance costs. So it wouldn't be my first choice.
But if I were faced with a limited-memory environment, neither would I be blinded by “Big-O” performance.
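To give a sense of that extra complexity, here is a hedged sketch of what searching a sorted object array with a custom `Comparator` involves. The `Person` record and `BY_NAME` comparator are hypothetical, invented for illustration; the key point is that the array must be sorted and searched with the *same* comparator, a subtle invariant a `HashSet` never asks you to maintain.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical example: binary-searching a sorted object array
// with a custom Comparator (requires Java 16+ for records).
public class ComparatorSearch {
    record Person(String name, int age) {}

    // The array must already be sorted by this same comparator,
    // or binarySearch results are undefined.
    static final Comparator<Person> BY_NAME = Comparator.comparing(Person::name);

    public static void main(String[] args) {
        Person[] people = {
            new Person("carol", 41),
            new Person("alice", 34),
            new Person("bob", 28),
        };
        Arrays.sort(people, BY_NAME);
        // Probe with a key object; only the compared field matters.
        int idx = Arrays.binarySearch(people, new Person("bob", 0), BY_NAME);
        System.out.println(idx >= 0 ? people[idx].age() : -1);   // prints 28
    }
}
```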
\* If the `HashSet` implementation is O(1), why did its performance decrease as the table got larger? There's always a risk, when you write a benchmark, that you'll get some apparently anomalous results and will need to explain them. In this case, I believe the performance difference is an artifact of the processor cache. I ran the tests on a machine with an 8MB cache, which is large enough to hold the entire hash array for 100,000 entries, and probably a large number of the buckets for 1,000 and 10,000 entries. Since the test repeatedly probes the table, the cache is a significant contributor to overall performance. Yet another pitfall in a micro-benchmark …
\*\* You may be wondering why I only show the last four rows of the table. The reason is another pitfall of micro-benchmarks: finding the correct measurement tool. I used `MemoryMXBean.getHeapMemoryUsage()`, along with a couple of calls to `System.gc()`. This should have shown me only the memory used by live objects, but instead I saw a minimum of 100 MB for each run. With the larger table sizes, however, the real memory consumption overwhelmed this baseline "noise."
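The measurement sequence described above might look roughly like this. This is a sketch under my own assumptions, not the post's actual measurement code; in particular, `System.gc()` is only a request, so the "live objects only" reading is best-effort.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Sketch of measuring live-object heap usage after building a lookup table.
// Calling System.gc() twice is a common (if unguaranteed) idiom to settle the heap.
public class HeapMeasure {
    public static void main(String[] args) {
        int[] table = new int[1_000_000];    // stand-in for the real lookup table

        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        System.gc();
        System.gc();
        long usedKb = memory.getHeapMemoryUsage().getUsed() / 1024;
        System.out.println("heap used: " + usedKb + "K");

        // Keep the table strongly reachable so it can't be collected
        // before the measurement is taken.
        if (table.length == 0) throw new IllegalStateException();
    }
}
```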