Thursday, November 19, 2009

How HashMaps Work

This was originally written as an answer to a StackOverflow question where the original poster and several responders seemed to have some misunderstandings of how hashed lookup works. Since it would end up buried under a dozen more-or-less right responses if posted there, I've decided to post here, where it might end up on the front page of Google.

The Hashcode

Hashed lookup starts with an object's hashcode. A hashcode is simply a number associated with an object, used when storing or looking up that object in a hashed structure; there's no magic about it. The only rule that governs hashcode values is that, given objects x and y, if x.equals(y) returns true, then both x and y must return the same hashcode. The reverse isn't true: x and y could return the same hashcode yet not be considered equal. Here's an example:

Long x = Long.valueOf(0x600000002L);
Long y = Long.valueOf(0x700000003L);

System.out.println(x.hashCode());   // prints 4
System.out.println(y.hashCode());   // prints 4
System.out.println(x.equals(y));    // prints false

All objects could return 0 as their hashcode, and that would satisfy the rules. It wouldn't be very useful, however. To be useful, a hash function must evenly distribute values over the 32-bit range of an integer: you should be able to apply the same statistical tests that you would for a random number generator.

There are many ways to compute hashcodes. Integer, for example, simply returns its int value. Long does a shift and xor, to mix its high-order 32 bits with its low-order 32 bits. A common approach is to keep a running sum, multiplying that sum by a small prime number before adding each term. This is how String does it:

for (int i = 0; i < len; i++) {
    h = 31*h + val[off++];
}

Note that this causes a string's hashcode to be “driven” by its trailing characters; integer overflow during the repeated multiplication throws away the contribution of the leading characters. This is OK, as long as the strings end with different characters — which normal text almost always will. As I recall, the original (1.1.x) implementation of String.hashCode() gave greater weight to the leading characters in the string, which turned out to be a very bad thing when processing HTTP URLs (which all start with “http://”).
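Pulled out into a standalone method (the class and method names here are mine, not the JDK's), the algorithm is easy to verify against the real thing:

public class StringHashDemo {
    // same algorithm as String.hashCode(): a running sum,
    // multiplied by 31 before each character is added in
    static int hashOf(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] argv) {
        System.out.println(hashOf("hello"));    // prints 99162322
        System.out.println("hello".hashCode()); // prints 99162322
    }
}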

Some Myths About Hashcodes

A lot of people think that a hashcode is a unique identifier for an object. This is wrong, and the reason should be obvious: a Java hashcode is 32 bits. As soon as you have objects with 33 bits of data, you're guaranteed to have two objects with the same hashcode. Depending on the hash function, you can have collisions with far fewer bits of object data: the two-character strings “Az” and “C<”, for example.
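You can check that example yourself; both strings hash to 2137:

System.out.println("Az".hashCode());   // prints 2137
System.out.println("C<".hashCode());   // prints 2137
System.out.println("Az".equals("C<")); // prints false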

Then there's the “identity” hashcode returned by Object.hashCode(). The myth is that it returns a pointer to the object, which in a 32-bit JVM would be a unique identifier. Perhaps it does return a pointer, at least on the first call; it is, after all, implemented as a native method. However, objects are permitted to move around in the heap, so if the identity hashcode is indeed a pointer, there is no guarantee that two objects won't (over time) have the same value.

HashMap

OK, so every object has a hashcode; what's the point? For just a moment, pretend that a hashcode is indeed a unique identifier for the object. If that were the case, then you could create a map around an object array that had 2^32 elements: compute the key's hashcode, and return whatever you found at that spot in the array.

But, as I've said, hashcodes aren't unique identifiers, and an array containing 2^32 objects would be too big to fit in the heap (at least on a 32-bit JVM). But the idea still holds: take an array with 1,000 elements, compute the hashcode modulo 1,000, and use the result as an index into that array.

But then what happens if two objects map to the same index? They don't even have to have the same hashcode: with our thousand-element table, one object could have hashcode 1001, and the second could have hashcode 2001; both resolve to index 1. This is known as a hash collision.
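In code, the bucket computation is a one-liner. One wrinkle that a real implementation must handle: hashcodes can be negative, and Java's % operator preserves sign, so this sketch compensates crudely with Math.abs():

// a sketch: map a 32-bit hashcode onto a 1,000-element table
static int bucketFor(int hashcode) {
    return Math.abs(hashcode % 1000);
}

System.out.println(bucketFor(1001)); // prints 1
System.out.println(bucketFor(2001)); // prints 1 -- different hashcode, same bucket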

There are many ways to resolve a hash collision. A really dumb approach is to find the next empty cell in the array. So if your hashcode mod 1000 is 723, and cell 723 already has a value, you look in cell 724 … then 725, then 726, and so on. As the table gets full, you'll end up doing a linear search of the entire table looking for an empty cell.
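Here's a sketch of that approach (it goes by the name “linear probing”), reduced to finding a slot in an array of keys. The names are mine, and it assumes the table never fills completely:

// a sketch of linear probing: step forward from the natural bucket
// until we find the key or an empty cell (if the table is completely
// full, this loops forever)
static int findSlot(Object[] keys, Object key) {
    int idx = Math.abs(key.hashCode() % keys.length);
    while (keys[idx] != null && !keys[idx].equals(key)) {
        idx = (idx + 1) % keys.length;  // wrap around at the end
    }
    return idx;
}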

A much smarter way is to have each cell (called a "bucket") in your hash table hold a list of entries. So when your hashcode resolves to index 723, you can simply add your object onto the end of that list. And this is just what the Sun HashMap does: it defines an array of buckets:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

And an Entry object is a node in a linked list:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;
    // ... constructor and accessor methods omitted
}

You'll still run into a problem if many objects have the same hash value, or if they have different hash values that resolve to the same bucket index: you'll traverse a very long linked list.
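Lookup, then, means finding the right bucket and walking its chain. Here's a simplified sketch (not the actual Sun code), which compares the cached hash first because that's cheaper than calling equals():

// simplified lookup: find the bucket, then walk the chain
public V get(Object key) {
    int hash = key.hashCode();
    int index = hash & (table.length - 1);  // power-of-two table; see below
    for (Entry<K,V> e = table[index]; e != null; e = e.next) {
        if (e.hash == hash && key.equals(e.key)) {
            return e.value;
        }
    }
    return null;  // no entry for that key
}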

As long as you have a good hash function, one that produces an even distribution of 32-bit numbers from your objects, this is not an issue: if you start to get a long list on a particular bucket, you simply increase the size of the table. Since the distribution is even, you'll spread the objects across more buckets, which means fewer objects in any single bucket.

The Sun implementation uses hash tables that are always a power of two in size (eg, 16 elements, 32, 64, and so on). This is a performance optimization: you can transform a 32-bit hashcode into the appropriate index by applying a mask (eg, if you have 16 elements in your table, your index is hashcode & 0x0F). However, such tables are very susceptible to bad hash functions, so the Sun implementation takes an extra step and rehashes supplied values.
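Here's what those two steps look like; indexFor() is essentially what JDK 1.6 does, as is the supplemental hash() that smears the high-order bits down into the low-order bits that the mask actually uses:

// turn a (rehashed) hashcode into a bucket index; equivalent to
// hash % length when length is a power of two, but much cheaper
static int indexFor(int h, int length) {
    return h & (length - 1);
}

// the supplemental hash applied by the JDK 1.6 implementation, to
// defend against hashcodes that differ only in their high bits
static int hash(int h) {
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}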

The Bottom Line

If you use a HashMap, you need to ensure that the objects that you use as keys have a good hashcode. If, for example, you take a population of 1,000,000 objects and a hash function that can produce only 1,000 discrete hashcodes, you'll end up with buckets containing 1,000 entries apiece. And this will mean that your map, which has an optimal access time of O(1), turns into a data structure that effectively has O(n) access time.
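To see how that happens, here's a contrived key class. It's legal per the hashcode contract, but every instance lands in the same bucket, so a map keyed by it degenerates into a linked list:

public class BadKey {
    private final String id;

    public BadKey(String id) { this.id = id; }

    @Override
    public boolean equals(Object obj) {
        return (obj instanceof BadKey) && id.equals(((BadKey)obj).id);
    }

    @Override
    public int hashCode() {
        return 42;  // satisfies the contract, destroys performance
    }
}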

Tuesday, November 17, 2009

Deploying to Amazon EC2

After talking with Fred Stluka, I started using Amazon EC2 as a demo site. For me, the choice was between Amazon or a Java hosting account with GoDaddy or similar. The economics are a tossup, but I like the control that I get from Amazon: I can turn on a demo box for just as long as I need it, running exactly the software that I want (provided that I install it), and not worry (too much) about someone breaking in and installing malware while I'm not looking.

One of the interesting issues with this setup is deploying new builds. Tomcat has a nice, simple, form-based upload-and-deploy feature in its Manager application. Which falls down pretty quickly when I'm uploading a new WAR over my 40 kbytes/second residential ADSL connection. My current app is under 2 MB, but that's still enough upload time to get another cup of coffee.

So far, I've tried two alternatives, with their own benefits and drawbacks.

The first is to maintain synchronized exploded WAR directories on both my build machine and the EC2 machine, using rsync. The following command will upload only those files that have changed, using compression to reduce their sizes:

rsync -aczv --delete . demo.kdgregory.com:Workspace/demo.war

It works, and the files that get uploaded tend to be small because the included JARs don't change, but it adds manual steps into my deployment: I have to unpack the WAR locally, then repack it remotely. Yes, I know that I could sync the exploded WAR right into Tomcat's webapps directory, but I'm concerned about non-atomic updates.
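For reference, those manual steps amount to something like the following (the paths are illustrative, not the ones I actually use):

# locally: explode the freshly-built WAR before syncing
cd ~/Workspace/demo.war && unzip -o ~/build/demo.war

# remotely: repack into a WAR that can be handed to Tomcat
cd ~/Workspace/demo.war && jar cf /tmp/demo.war .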

The other alternative is to do builds on the EC2 machine. It is, after all, a fully-functional computer with unlimited shell access. I've installed a complete Java toolset, and my applications build with Maven. The only real problem is getting the source code out of my home repository. That starts with installing Subversion itself, so that I can get the source code at all.

EC2 provides many different pre-built machine images. I'm currently using a bare-bones Fedora Core image, which does not come with any development tools. I already had a start script to create my users (their home directories are persisted on S3), so I simply added “yum install subversion” to that script.

Once Subversion is installed, I need to be able to use it. I have an SSH port open on my home machine, and Subversion allows you to tunnel requests through SSH: it starts a new Subversion server for each request. Since I'm not using the standard port, I need to define a custom tunnel scheme in the [tunnels] section of ${HOME}/.subversion/config:

[tunnels]
kdg = ssh -p 12345

Now I can check out my project with the following command:

svn co 'svn+kdg://kdghome/repo/Project/trunk'

That only leaves one step: editing /etc/hosts to map kdghome to whatever public IP Verizon has given me today (and hoping that it doesn't change in the middle of the day). It's a lot of work, but only has to happen once per session. After that, I can run svn update from the remote machine to get any changes, or make edits there and commit back to my home repository.
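The hosts entry itself is a single line (192.0.2.1 is a documentation placeholder, not my real address):

192.0.2.1    kdghome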

Monday, November 16, 2009

Building a Product List Service: HTML Templating

Along with the product list service itself, I built a demo page. The first iteration of this page was very simple: a table to display list entries, a visible form to add new entries, and some invisible forms for generating update/delete requests. Even with this primitive page, however, I ran into the problem of how to turn the JSON data from the service into HTML.

My first approach was “the simplest thing that could possibly work”: I wrote a JavaScript function that built up an HTML string from the JSON, and then inserted this string into a DIV using innerHTML. It worked, and had the benefit that the various event handlers were defined in the same file — if I changed a function's name or interface, I didn't have to update multiple files. But embedding markup in a script is ugly and hard to maintain; just keeping the quotes matched takes a lot of work.

My second approach was to develop a utility library containing functions that build markup based on model objects. This approach was clearly influenced by Swing; in fact, the object passed to my buildTable() function looked a lot like Swing's TableModel. Doing this got the markup out of my main script, and gave me reusable components, which I liked. However, the code defining the model object alone was larger than my original concatenation function.

If my second approach resembled Swing, my first approach resembled early servlets. Keeping with the Java analogies, what I was really looking for was an approach that resembled JSP: all markup within one file, making references to data provided elsewhere.

With some pointers from friends who are adept JavaScript programmers, I started looking at different templating solutions. John Resig's micro-templates even looked like JSP directives (which meant, unfortunately, that they couldn't be used within a JSP-generated page). I tried out a few of the existing solutions, then decided to write my own — it was only a dozen lines of code, using regular expressions.

But while the template approach did the job, and provided a fairly clean break between markup and scripts, I was still uncomfortable, in part because there was now too much of a break between the markup and the scripts that interacted with it: a script in one file would blindly access markup from another file. My JavaScript friends would say that this break is a Good Thing, but I think it ignores the fact that the markup and scripts are tightly coupled by nature — and in fact takes us back to a 1960s view of programs manipulating data. But that's a topic for a future post.