Thursday, November 19, 2009

How HashMaps Work

This was originally written as an answer to a StackOverflow question where the original poster and several responders seemed to have some misunderstandings of how hashed lookup works. Since it would end up buried under a dozen more-or-less right responses if posted there, I've decided to post here, where it might end up on the front page of Google.

The Hashcode

Hashed lookup starts with an object's hashcode. A hashcode is simply a number that is permanently associated with an object, used only when storing that object in a hashed structure; there's no magic about it. The only rule that governs hashcode values is that, given objects x and y, if x.equals(y) returns true, then both x and y must return the same hashcode. The reverse isn't true: x and y could return the same hashcode, but not be considered equal. Here's an example:

Long x = Long.valueOf(0x600000002L);
Long y = Long.valueOf(0x700000003L);


All objects could return 0 as their hashcode, and that would satisfy the rules. It wouldn't be very useful, however. To be useful, a hash function must evenly distribute values over the 32-bit range of an integer: you should be able to apply the same statistical tests that you would for a random number generator.

There are many ways to compute hashcodes. Integer, for example, simply returns its int value. Long does a shift and xor, to mix its high-order 32 bits with its low-order 32 bits. A common approach is to keep a running sum, multiplying that sum by a small prime number before adding each term. This is how String does it:

for (int i = 0; i < len; i++) {
    h = 31*h + val[off++];

Note that this causes a string's hashcode to be “driven” by its trailing characters; integer overflow during multiplication claims the leading characters. This is OK, as long as the strings end with different characters — which normal text almost always will. As I recall, the original (1.1.x) implementation of String.hashCode() gave greater weight to the leading characters in the string, which turned out to be a very bad thing when processing HTTP URLs (which all start with “http://”).

Some Myths About Hashcodes

A lot of people think that a hashcode is a unique identifier for an object. This is wrong, and the reason should be obvious: a Java hashcode is 32 bits. As soon as you have objects with 33 bits of data, you're guaranteed to have two objects with the same hashcode. Depending on the hash function, you can have collisions with less than 32 bits of object data: the strings “Az” and “C<”, for example.

Then there's the “identity” hashcode returned by Object.hashCode(). The myth is that it returns a pointer to the object, which in a 32-bit JVM would be a unique identifier. Perhaps it does return a pointer, at least on the first call; it is, after all, implemented as a native method. However, objects are permitted to move around in the heap, so if the identity hashcode is indeed a pointer, there is no guarantee that two objects won't (over time) have the same value.


OK, so every object has a hashcode; what's the point? For just a moment, pretend that a hashcode is indeed a unique identifier for the object. If that were the case, then you could create a map around an object array that had 2^32 elements: compute the key's hashcode, and return whatever you found at that spot in the array.

But, as I've said, hashcodes aren't unique identifiers, and an array containing 2^32 objects would be too big to fit in the heap (at least on a 32-bit JVM). But the idea still holds: take an array with 1,000 elements, and divide the hashcode by 1,000, and you can return whatever is at that index.

But then what happens if two objects map to the same index? They don't even have to have the same hashcode: with our thousand element table, one object could have hashcode 1001, and the second could have hashcode 2001. This is known as a hash collision.

There are many ways to resolve a hash collision. A really dumb approach is to find the next empty cell in the array. So if your hashcode mod 1000 is 723, and cell 723 already has a value, you look in cell 724 … then 725, then 726, and so on. As the table gets full, you'll end up doing a linear search of the entire table looking for an empty cell.

A much smarter way is to have each cell (called a "bucket") in your hash table hold a list of entries. So when your hashcode resolves to index 723, you can simply add your object onto the end of that list. And this is just what the Sun HashMap does: it defines an array of buckets:

 * The table, resized as necessary. Length MUST Always be a power of two.
transient Entry[] table;

And an Entry object is a node in a linked list:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;

You'll still run into a problem if many objects have the same hash value, or if they have different hash values that resolve to the same bucket index: you'll traverse a very long linked list.

As long as you have a good hash function, that produces an even distribution of 32 bit numbers from your objects, this is not an issue: if you start to get a long list on a particular bucket, you simply increase the size of the table. Since it's an even distribution, you'll be distributing the objects across more buckets, which means fewer objects to a single bucket.

The Sun implementation uses hash tables that are always a power of two in size (eg: 16 elements, 32, 64, and so on). This is a performance optimization in that you can transform a 32-bit hashcode into the appropriate index by applying a mask (eg, if you have 16 elements in your table, your index is hashcode & 0x0F). However, such tables are very susceptible to bad hash functions; the Sun implementation takes extra steps to rehash supplied values.

The Bottom Line

If you use a HashMap, you need to ensure that the objects that you use as keys have a good hashcode. If you, for example, take a population of 1,000,000 objects, and have a hash function that can produce 1,000 discrete hashcodes, you'll end up with buckets containing 1,000 entries. And this will mean your map, which has an optimal access time of O(1), will turn into a data structure that effectively has O(n) access time.

No comments: