If you need to build a lookup table for some large number of values, what data structure should you choose?
Obviously, you want a hashed data structure: in Java, a HashMap or HashSet. It provides O(1) lookups, versus O(log N) for a binary search and O(N) for a linear search. And the Big-O performance of an algorithm only becomes more important as the table grows. Or does it?
To find out, I wrote a micro-benchmark that compared the different approaches at varying table sizes. One version used a HashSet, another used an Integer[] with Arrays.binarySearch(), and a third used an int[] with a home-grown binary search. For each, I probed the lookup table 10,000,000 times with random values and recorded the execution time in seconds.
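The benchmark looked roughly like the following sketch. This is my reconstruction, not the original code: the class and variable names are mine, and the sizes are scaled down so it runs quickly.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;

public class LookupBenchmark {

    // Home-grown binary search over a sorted int[]; avoids all boxing.
    static boolean contains(int[] sorted, int key) {
        int lo = 0, hi = sorted.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;       // unsigned shift avoids overflow
            if (sorted[mid] < key)      lo = mid + 1;
            else if (sorted[mid] > key) hi = mid - 1;
            else return true;
        }
        return false;
    }

    public static void main(String[] args) {
        final int tableSize = 1_000_000;     // scaled down from the article's runs
        final int probes = 1_000_000;

        // Build all three tables from the same (already sorted) values.
        int[] primitives = new int[tableSize];
        Integer[] boxed = new Integer[tableSize];
        HashSet<Integer> hashed = new HashSet<>();
        for (int i = 0; i < tableSize; i++) {
            primitives[i] = i * 2;           // even values: roughly half the probes miss
            boxed[i] = primitives[i];
            hashed.add(primitives[i]);
        }

        Random rnd = new Random(42);
        long start = System.nanoTime();
        for (int i = 0; i < probes; i++) hashed.contains(rnd.nextInt(tableSize * 2));
        System.out.printf("HashSet:   %.3f s%n", (System.nanoTime() - start) / 1e9);

        rnd = new Random(42);                // same probe sequence for each strategy
        start = System.nanoTime();
        for (int i = 0; i < probes; i++) Arrays.binarySearch(boxed, rnd.nextInt(tableSize * 2));
        System.out.printf("Integer[]: %.3f s%n", (System.nanoTime() - start) / 1e9);

        rnd = new Random(42);
        start = System.nanoTime();
        for (int i = 0; i < probes; i++) contains(primitives, rnd.nextInt(tableSize * 2));
        System.out.printf("int[]:     %.3f s%n", (System.nanoTime() - start) / 1e9);
    }
}
```

Reseeding the Random before each loop ensures every strategy sees the same probe sequence, so the comparison isn't skewed by differing hit rates.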
 
| Table Size | HashSet | Binary Search, Object | Binary Search, Primitive | 
|---|---|---|---|
| 1,000 | 0.37 | 1.00 | 0.88 | 
| 10,000 | 0.43 | 1.99 | 1.19 | 
| 100,000 | 0.54 | 3.18 | 1.60 | 
| 1,000,000 | 1.00 | 5.75 | 2.33 | 
| 10,000,000 | 1.19 | 12.70 | 4.99 | 
| 100,000,000 | 1.58 | 20.68 | 8.50 | 
| 1,000,000,000 | n/a | n/a | 13.28 | 
 
 No surprises: the execution time for the hashed version remained relatively constant, while the binary search time increased logarithmically with the size of the table.* But this table leaves out two important pieces of information.
The first is that real-world applications rarely spend all their time doing lookups. Instead, they do a lookup and then do something else with the result. And while the HashSet outperformed the Integer[] search by a factor of 13, the absolute performance of the latter is still only about two microseconds per lookup. For a real-world application, that is almost certainly sufficient.
 Even so, it seems to make sense to go with the HashSet. But there's another factor at work: absolute performance isn't the only way to measure this benchmark. And if you noticed the “n/a” at the bottom of the first two columns, and the inclusion of a primitive version in the last column, you can probably guess where this post is going.**
 
| Table Size | HashSet (heap used, KB) | Binary Search, Object (heap used, KB) | Binary Search, Primitive (heap used, KB) | 
|---|---|---|---|
| 1,000,000 | 178,054K | 142,525K | 126,900K | 
| 10,000,000 | 656,742K | 318,307K | 162,057K | 
| 100,000,000 | 5,286,458K | 2,077,979K | 513,619K | 
| 1,000,000,000 | n/a | n/a | 3,969,678K | 
 
I ran these tests on an Amazon EC2 double-extra-large instance with a 16 GB Java heap. It costs $1 per hour ($8,766 per year) to rent this machine. If my application needed a billion-entry lookup table, I would have two choices: use a sorted array, or upgrade to a quad-extra-large instance and double my costs. For a single machine, maybe not an issue. But if I were running a cluster of machines, the cost difference would add up quickly.
The bottom line is that absolute CPU performance is only one factor in picking a data structure. Memory consumption is another, and programmer performance is a third, one that's perhaps more important than the other two put together. A HashSet is a well-known, easy-to-use data structure. A sorted array is less so, and becomes increasingly complex if you have to implement a custom Comparator. Using one will increase your time to implement, and perhaps your maintenance costs. So it wouldn't be my first choice.
 But if I were faced with a limited-memory environment, neither would I be blinded by “Big-O” performance.
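To illustrate the Comparator wrinkle: with a compound key (the record type below is hypothetical), the same Comparator must be used for both the sort and every search, and getting that wrong fails silently with garbage results rather than an error.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedLookup {
    // Hypothetical compound key: lookups must match on both fields.
    record Account(String region, long id) {}

    // The single source of truth for ordering; sort and search must both use it.
    static final Comparator<Account> BY_KEY =
        Comparator.comparing(Account::region).thenComparingLong(Account::id);

    static boolean contains(Account[] sortedTable, Account probe) {
        return Arrays.binarySearch(sortedTable, probe, BY_KEY) >= 0;
    }

    public static void main(String[] args) {
        Account[] table = {
            new Account("us-east", 17),
            new Account("us-west", 3),
            new Account("eu-west", 99),
        };
        Arrays.sort(table, BY_KEY);   // required before binarySearch is valid

        System.out.println(contains(table, new Account("us-west", 3)));   // true
        System.out.println(contains(table, new Account("us-west", 4)));   // false
    }
}
```

By contrast, a HashSet<Account> would need only a correct equals()/hashCode() pair, which the record provides for free.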
* If the HashSet implementation is O(1), why did its performance decrease as the table got larger? There's always a risk, when you write a benchmark, that you'll get some apparently anomalous results and will need to explain them. In this case, I believe the difference is an artifact of the processor cache. I ran the tests on a machine with an 8 MB cache, which is large enough to hold the entire hash array for 100,000 entries, and probably a large fraction of the buckets for the 1,000- and 10,000-entry tables. Since the test repeatedly probes the table, the cache is a significant contributor to overall performance. Yet another pitfall of micro-benchmarks …
** You may be wondering why I show only the last four rows of the table. The reason is another pitfall of micro-benchmarks: finding the correct measurement tool. I used MemoryMXBean.getHeapMemoryUsage(), along with a couple of calls to System.gc(). This should have shown me only the memory used by live objects, but instead I saw a minimum of 100 MB for each run. With the larger table sizes, however, the real memory consumption overwhelmed this baseline “noise.”
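The measurement described above can be sketched like this (my reconstruction; MemoryMXBean.gc() is documented as equivalent to System.gc(), and both are only hints to the JVM):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeasure {

    // Hint the collector, then read the heap high-water mark for live objects.
    static long usedHeapAfterGc() {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        bean.gc();   // equivalent to System.gc(); a hint, not a guarantee
        bean.gc();   // a second pass often reclaims finalizable stragglers
        return bean.getHeapMemoryUsage().getUsed();
    }

    public static void main(String[] args) {
        long before = usedHeapAfterGc();
        int[] table = new int[10_000_000];   // ~40 MB of int data
        long after = usedHeapAfterGc();
        // Reference the array afterward so the JIT can't discard it early.
        System.out.printf("delta: %,dK for %,d ints%n",
                (after - before) / 1024, table.length);
    }
}
```

Because the JVM is free to ignore the GC hint, and the heap already carries class metadata and benchmark scaffolding, deltas on the order of the baseline (the 100 MB mentioned above) are meaningless; only the large-table rows rise clearly above it.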