Question: when does an O(N²) algorithm outperform an O(N log N) algorithm?
As part of my article on primitives, I decided to implement Heapsort. I needed an in-place sort for primitive integers that used an external comparator, and couldn't find one with a reasonable amount of Googling. I picked Heapsort because it's an easy algorithm to implement, and has O(N log N) behavior for all inputs (unlike Quicksort, which degrades to O(N²) on already-sorted input unless you take care picking the pivot).
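In case it's useful, here's a minimal sketch of that kind of sort: an in-place heapsort over an int[], driven by an external comparator. The IntComparator interface and the class layout are my own illustration, not the article's actual code.

```java
// Sketch: in-place heapsort for primitive ints with an external comparator.
// IntComparator is a hypothetical interface; the JDK has no primitive-int
// comparator, which is the gap the article describes.
public class Heapsort {

    /** Comparator over primitive ints (illustrative name). */
    public interface IntComparator {
        int compare(int a, int b);
    }

    public static void sort(int[] a, IntComparator cmp) {
        int n = a.length;
        // Build a max-heap by sifting down every non-leaf node.
        for (int i = n / 2 - 1; i >= 0; i--) {
            siftDown(a, i, n, cmp);
        }
        // Repeatedly move the max to the end and restore the heap.
        for (int end = n - 1; end > 0; end--) {
            int tmp = a[0]; a[0] = a[end]; a[end] = tmp;
            siftDown(a, 0, end, cmp);
        }
    }

    private static void siftDown(int[] a, int i, int n, IntComparator cmp) {
        while (true) {
            int child = 2 * i + 1;            // left child
            if (child >= n) break;
            if (child + 1 < n && cmp.compare(a[child + 1], a[child]) > 0) {
                child++;                      // right child is larger
            }
            if (cmp.compare(a[child], a[i]) <= 0) break;
            int tmp = a[i]; a[i] = a[child]; a[child] = tmp;
            i = child;
        }
    }

    public static void main(String[] args) {
        int[] data = {5, 1, 4, 2, 3};
        sort(data, Integer::compare);
        System.out.println(java.util.Arrays.toString(data)); // [1, 2, 3, 4, 5]
    }
}
```

Note the siftDown loop: the parent at index i reaches for children at 2i + 1 and 2i + 2, which is the jumping-around access pattern discussed later in the post.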
Naturally, I wanted to see how my implementation performed, vis-à-vis the sorts provided by the JDK — particularly on large datasets, which is the reason I wanted an in-place sort. So I created a benchmark that measured thread CPU time to sort a 10,000,000 element array (which I knew would fit in RAM). Here are the results, the average of 10 executions on my aged desktop (a 2.4 GHz P4 with only 512 KB of cache):
Implementation | Time (sec) | Compares
---|---|---
Arrays.sort(int[]) | 2.55 |
Heapsort.sort(int[]) | 18.08 |
Arrays.sort(Integer[]) | 25.09 | 223,179,141
Heapsort.sort(Integer[]) | 79.91 | 439,021,125
Those are dramatic differences. I knew that Heapsort would make more comparisons than Mergesort (which the JDK will use for Integer[]), so I added a counter to the comparator. But the running time was over three times longer, with only twice the number of comparisons. And Quicksort (used by the JDK for int[]) was over seven times faster. Perhaps some of that difference comes from Heapsort calling a comparator, but clearly, not all O(N log N) algorithms are alike!
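The counter mentioned above can be a thin wrapper around the real comparator. Here's a generic sketch of that technique — the class and method names are mine, not the article's code:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: count comparisons by delegating to the real comparator and
// incrementing a counter on every call.
public class CountingComparator<T> implements Comparator<T> {
    private final Comparator<T> delegate;
    private final AtomicLong count = new AtomicLong();

    public CountingComparator(Comparator<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public int compare(T a, T b) {
        count.incrementAndGet();
        return delegate.compare(a, b);
    }

    public long count() {
        return count.get();
    }

    public static void main(String[] args) {
        Integer[] data = {4, 2, 5, 1, 3};
        CountingComparator<Integer> cmp =
                new CountingComparator<>(Comparator.naturalOrder());
        Arrays.sort(data, cmp);
        System.out.println("comparisons: " + cmp.count());
    }
}
```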
So I decided to look at the JDK source. And found this interesting bit of code:*
```java
private static void sort1(int x[], int off, int len) {
    // Insertion sort on smallest arrays
    if (len < 7) {
        for (int i = off; i < len + off; i++)
            for (int j = i; j > off && x[j-1] > x[j]; j--)
                swap(x, j, j-1);
        return;
    }
```
Insertion sort is an O(N²) algorithm. But it doesn't have a lot of moving parts, unlike Quicksort, which takes almost 40 lines of code in the JDK implementation, not counting the swap and range-swap methods. And when N is small, lines of code are more important than algorithmic efficiency (or, as my coworker Max points out, when N is bounded, you have a constant-time operation).
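To make the hybrid pattern concrete, here's a stripped-down sketch of it: quicksort that hands small subarrays to insertion sort. The cutoff of 7 matches the JDK snippet above; everything else (including the simple last-element pivot — the JDK actually uses a median-of-three or median-of-nine pivot) is my own illustration:

```java
// Sketch: quicksort with an insertion-sort cutoff for small subarrays,
// the pattern the JDK's sort1 snippet illustrates.
public class HybridSort {
    private static final int CUTOFF = 7;

    public static void sort(int[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(int[] a, int lo, int hi) {
        if (hi - lo + 1 < CUTOFF) {
            insertionSort(a, lo, hi);   // small range: few moving parts wins
            return;
        }
        int p = partition(a, lo, hi);
        sort(a, lo, p - 1);
        sort(a, p + 1, hi);
    }

    private static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++)
            for (int j = i; j > lo && a[j - 1] > a[j]; j--)
                swap(a, j, j - 1);
    }

    // Lomuto partition around the last element (illustrative, not the JDK's).
    private static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) swap(a, i++, j);
        swap(a, i, hi);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```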
There's one more performance-enhancing feature of this snippet: all of the data accesses should be within the L1 cache. On Intel processors, the cache is structured as 64-byte lines of memory, enough to hold 16 adjacent int values. While Arrays.sort() makes no attempt to ensure alignment of the insertion-sorted sub-arrays, it's a naturally cache-friendly algorithm. As is Quicksort itself, for that matter. And Heapsort, which jumps around the array comparing elements at positions N and 2 * N, is decidedly cache-unfriendly.
All of which is to say that sometimes, micro-optimizations have a big effect. And that, rather than implementing Heapsort, I should have simply copied the OpenJDK implementation of Quicksort and added a comparator.**
* This optimization isn't specific to the JDK. The GNU C library implementation also switches to insertion sort, albeit at 4 elements, and credits Sedgewick for that and other optimizations.
** In case you're wondering, the updated Quicksort took 4.25 seconds to run, and made approximately 375,000,000 comparisons. If it weren't covered by GPL2, I'd put it in my toolbox in place of the Heapsort implementation.