Thursday, November 19, 2009

How HashMaps Work

This was originally written as an answer to a StackOverflow question where the original poster and several responders seemed to have some misunderstandings of how hashed lookup works. Since it would end up buried under a dozen more-or-less right responses if posted there, I've decided to post here, where it might end up on the front page of Google.

The Hashcode

Hashed lookup starts with an object's hashcode. A hashcode is simply a number that is permanently associated with an object, used only when storing that object in a hashed structure; there's no magic about it. The only rule that governs hashcode values is that, given objects x and y, if x.equals(y) returns true, then both x and y must return the same hashcode. The reverse isn't true: x and y could return the same hashcode, but not be considered equal. Here's an example:

Long x = Long.valueOf(0x600000002L);
Long y = Long.valueOf(0x700000003L);

System.out.println(x.hashCode());
System.out.println(y.hashCode());
System.out.println(x.equals(y));

All objects could return 0 as their hashcode, and that would satisfy the rules. It wouldn't be very useful, however. To be useful, a hash function must evenly distribute values over the 32-bit range of an integer: you should be able to apply the same statistical tests that you would for a random number generator.

There are many ways to compute hashcodes. Integer, for example, simply returns its int value. Long does a shift and xor, to mix its high-order 32 bits with its low-order 32 bits. A common approach is to keep a running sum, multiplying that sum by a small prime number before adding each term. This is how String does it:

for (int i = 0; i < len; i++) {
    h = 31*h + val[off++];
}

Note that this causes a string's hashcode to be “driven” by its trailing characters; integer overflow during multiplication claims the leading characters. This is OK, as long as the strings end with different characters — which normal text almost always will. As I recall, the original (1.1.x) implementation of String.hashCode() gave greater weight to the leading characters in the string, which turned out to be a very bad thing when processing HTTP URLs (which all start with “http://”).

Some Myths About Hashcodes

A lot of people think that a hashcode is a unique identifier for an object. This is wrong, and the reason should be obvious: a Java hashcode is 32 bits. As soon as you have objects with 33 bits of data, you're guaranteed to have two objects with the same hashcode. Depending on the hash function, you can have collisions with less than 32 bits of object data: the strings “Az” and “C<”, for example.

Then there's the “identity” hashcode returned by Object.hashCode(). The myth is that it returns a pointer to the object, which in a 32-bit JVM would be a unique identifier. Perhaps it does return a pointer, at least on the first call; it is, after all, implemented as a native method. However, objects are permitted to move around in the heap, so if the identity hashcode is indeed a pointer, there is no guarantee that two objects won't (over time) have the same value.

HashMap

OK, so every object has a hashcode; what's the point? For just a moment, pretend that a hashcode is indeed a unique identifier for the object. If that were the case, then you could create a map around an object array that had 2^32 elements: compute the key's hashcode, and return whatever you found at that spot in the array.

But, as I've said, hashcodes aren't unique identifiers, and an array containing 2^32 objects would be too big to fit in the heap (at least on a 32-bit JVM). But the idea still holds: take an array with 1,000 elements, and divide the hashcode by 1,000, and you can return whatever is at that index.

But then what happens if two objects map to the same index? They don't even have to have the same hashcode: with our thousand element table, one object could have hashcode 1001, and the second could have hashcode 2001. This is known as a hash collision.

There are many ways to resolve a hash collision. A really dumb approach is to find the next empty cell in the array. So if your hashcode mod 1000 is 723, and cell 723 already has a value, you look in cell 724 … then 725, then 726, and so on. As the table gets full, you'll end up doing a linear search of the entire table looking for an empty cell.

A much smarter way is to have each cell (called a "bucket") in your hash table hold a list of entries. So when your hashcode resolves to index 723, you can simply add your object onto the end of that list. And this is just what the Sun HashMap does: it defines an array of buckets:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

And an Entry object is a node in a linked list:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;

You'll still run into a problem if many objects have the same hash value, or if they have different hash values that resolve to the same bucket index: you'll traverse a very long linked list.

As long as you have a good hash function, that produces an even distribution of 32 bit numbers from your objects, this is not an issue: if you start to get a long list on a particular bucket, you simply increase the size of the table. Since it's an even distribution, you'll be distributing the objects across more buckets, which means fewer objects to a single bucket.

The Sun implementation uses hash tables that are always a power of two in size (eg: 16 elements, 32, 64, and so on). This is a performance optimization in that you can transform a 32-bit hashcode into the appropriate index by applying a mask (eg, if you have 16 elements in your table, your index is hashcode & 0x0F). However, such tables are very susceptible to bad hash functions; the Sun implementation takes extra steps to rehash supplied values.

The Bottom Line

If you use a HashMap, you need to ensure that the objects that you use as keys have a good hashcode. If you, for example, take a population of 1,000,000 objects, and have a hash function that can produce 1,000 discrete hashcodes, you'll end up with buckets containing 1,000 entries. And this will mean your map, which has an optimal access time of O(1), will turn into a data structure that effectively has O(n) access time.

Tuesday, November 17, 2009

Deploying to Amazon EC2

After talking with Fred Stluka, I started using Amazon EC2 as a demo site. For me, the choice was between Amazon or a Java hosting account with GoDaddy or similar. The economics are a tossup, but I like the control that I get from Amazon: I can turn on a demo box for just as long as I need it, running exactly the software that I want (provided that I install it), and not worry (too much) about someone breaking in and installing malware while I'm not looking.

One of the interesting issues with this setup is deploying new builds. Tomcat has a nice, simple, form-based upload-and-deploy feature in its Manager application. Which falls down pretty quickly when I'm uploading a new WAR over my 40kbytes/second residential ADSL connection. My current app is under 2 Mb, but it's still enough time to get another cup of coffee.

So far, I've tried two alternatives, with their own benefits and drawbacks.

The first is to maintain synchronized exploded WAR directories on both my build machine and the EC2 machine, using rsync. The following command will upload only those files that have changed, using compression to reduce their sizes:

rsync -aczv --delete . demo.kdgregory.com:Workspace/demo.war

It works, and the files that get uploaded tend to be small because the included JARs don't change, but it adds manual steps into my deployment: I have to unpack the WAR locally, then repack it remotely. Yes, I know that I could sync the exploded WAR right into Tomcat's webapps directory, but I'm concerned about non-atomic updates.

The other alternative is to do builds on the EC2 machine. It is, after all, a fully-functional computer with unlimited shell access. I've installed a complete Java toolset, and my applications build with Maven. The only real problem with this is getting the source code out of my home repository. Including getting Subversion so that I can get the source code.

EC2 provides many different pre-built instance types. I'm currently using a bare-bones Fedora Core instance, which does not come with any development tools. I already had a start script to create my users (their home directories are persisted on S3), so simply added “yum install subversion” to that script.

Once subversion is installed, I need to be able to use it. I have an SSH port open on my home machine, and Subversion allows you to tunnel requests through SSH: it starts a new subversion server for each request. Since I'm not using the standard port, I need to add the following to ${HOME}/.subversion/config:

kdg = ssh -p 12345

Now I can check out my project with the following command:

svn co 'svn+kdg://kdghome/repo/Project/trunk'

That only leaves one step: editing /etc/hosts to map kdghome to whatever public IP Verizon has given me today (and hoping that it doesn't change in the middle of the day). It's a lot of work, but only has to happen once per session. After that, I can run svn update from the remote machine to get any changes, or make edits there and commit back to my home repository.

Monday, November 16, 2009

Building a Product List Service: HTML Templating

Along with the product list service itself, I built a demo page. The first iteration of this page was very simple: a table to display list entries, a visible form to add new entries, and some invisible forms for generating update/delete requests. Even with this primitive page, however, I ran into the problem of how to turn the JSON data from the service into HTML.

My first approach was “the simplest thing that could possibly work”: I wrote a JavaScript function that built up an HTML string from the JSON, and then inserted this string into DIV using innerHTML. It worked, and had the benefit that the various event handlers were defined in the same file — if I changed a function's name or interface, I didn't have to update multiple files. But embedding markup in a script is ugly and hard to maintain; just keeping quotes matched takes a lot of work.

My second approach was to develop a utility library containing functions that build markup based on model objects. This approach was clearly influenced by Swing; in fact, the object passed to my buildTable() function looked a lot like Swing's TableModel. Doing this got the markup out of my main script, and gave me reusable components, which I liked. However, the code defining the model object alone was larger than my original concatenation function.

If my second approach resembled Swing, my first approach resembled early servlets. Keeping with the Java analogies, what I was really looking for was an approach that resembled JSP: all markup within one file, making references to data provided elsewhere.

With some pointers from friends who are adept JavaScript programmers, I started looking at different templating solutions. John Resig's micro-templates even looked like JSP directives (which meant, unfortunately, that they couldn't be used within a JSP-generated page). I tried out a few of the existing solutions, then decided to write my own — it was only a dozen lines of code, using regular expressions.

But while the template approach did the job, and provided a fairly clean break between markup and scripts, I was still uncomfortable. In part because there was now too much of a break between the markup and the scripts that interacted with it: a script in one file would blindly access markup from another file. My JavaScript friends would say that this break is a Good Thing, but I think that it ignores the fact that the markup and scripts are tightly coupled by nature — and in fact takes us back to a 1960s view of programs manipulating data. But hat's a topic for a future post.

Tuesday, November 10, 2009

Refactoring: Evil, or Just Misunderstood

How did refactoring get such a bad reputation? Depending on who you talk to, it introduces bugs, makes projects more expensive, breaks unit tests, and helped the Yankees win the World Series. Now, as a Boston native and Philly resident, I hardly want to help the Yankees, but it seems that claim is as overdone as the others.

Let's start with increased project cost. In my prior job, I often had to create estimates for changes to our current codebase. After working through an estimate I would usually have a conversation with the Business Owner that went like this:

BO: You have an estimate of five days for these page changes; can you give a breakdown?

Me: Sure. These are pages that do essentially the same thing, but have about 3,000 lines of code each. Refactoring accounts for four days, and the new additions take a day.

Usually, I was interrupted in the middle of the word “refactoring,” with a loud “We can't afford that!” And while I understood the reaction, the alternate estimate was for 8 days, allowing the developers time to figure out where to make yet another set of copy-and-paste changes to those enormous, similar-but-not-the-same JSPs.

In my view, refactoring is a tool to reduce duplication and reduce the amount of code that a developer has to keep in his or her head at one time. This viewpoint is derived largely from the use of the word “factor“ in mathematics: the number 12, for example, has the factors 4 and 6, which can be reduced to prime factors 2, 2, and 3. Once you reduce to prime factors, you discover duplication.

In software, duplicate code may not be so obvious. Unless, of course, that software was built using “copy and paste coding,” with short deadlines and no refactoring during development. In that case, it becomes just as obvious as the prime factors of 12. Eliminating duplication reduces the amount of code that has to be maintained and changed, consequently reducing the number of places that bugs can hide. The result should therefore be reduced implementation costs (and in the places where I got away with page refactoring, using the technique described here, those cost reductions did indeed appear in future projects).

Which brings up the second complaint: refactoring introduces bugs. I suppose that's true: every time you edit code, you run the risk of introducing a bug. And if you write bugfree code that never has to be changed, then I fully agree: don't refactor it. For myself, although I've occasionally written (non-trivial) code that is apparently bug-free, I've always had to change it at some point down the road as the business requirements changed.

And as I see it, well-factored code should actually reduce bugs! Consider the following code, something that you might see scattered throughout a web application:

String operation = request.getParameter("operation");
if ((operation != null) && (operation.length() > 0))
{
    // do something with the operation value
}

I look at this code and see two immediate refactorings: replacing the literal value with a constant, and replacing the if condition with a method call like Jakarta Commons' StringUtils.isNotEmpty(). These refactorings would immediately reduce the chance of errors due to typos — including a typo that left out the null check (yes, I've seen this in production code). True, good code would be written like this from the start, but again, that's not code that needs refactoring.

If I were feeling ambitious — and if this were part of a large method or scriptlet I would almost certainly feel that way — I would extract the entire if operation into its own method, with a descriptive name like retrieveOperationData(). Over the long term, this will make the code easier to maintain, because developers can simply skip the implementation if the don't care about operation data. And they're less likely to put some unrelated code inside the if.

But what if my changes introduced a bug? What if the code inside that if already mucked with some unrelated variable? Assuming that the code would actually compile with such a bug, I'd expect my tests to catch it — and if they didn't, I'd write a new test once a person reported the bug.

Which brings up the last complaint (and the one that prompted this article): refactoring breaks unit tests. Before taking on this point, I want to quote Martin Fowler, from the preface to Refactoring:

Refactoring is the process of changing a software system in such a way that it does not alter the external behavior of the code yet improves its internal structure.

If you don't alter the external behavior of the system, how do you break tests? Well, one obvious way is to write tests that depend on the internal structure. It's particularly easy to do this when you use mock-object testing, or when you decompose large objects into smaller “helper” objects. If you've unintentionally made the tests dependent on internal structure, then you fix your tests, and hopefully won't repeat the mistake. But a breaking test might indicate that your mainline code has similar dependencies, in which case it would be a Good Thing to eliminate those dependencies everywhere.

A more likely cause, however, is that you're not really refactoring per the definition above, but restructuring: changing the external behavior of the code as well as its internal structure. In mathematical terms, you've decided that rather than 12, you really want 14; which happens to factor into 2 and 7 and can't be reduced further.

While you might have a perfectly valid reason for wanting this, you can't blame the tests for continuing to assert 12. Instead, ask yourself why you once thought 12 — your original design — was correct, and why it's no longer correct. That's the topic of a future post.

Friday, November 6, 2009

Objectificated

I have an admission to make: it wasn't until October 1999 that I truly “got” object-oriented programming. Perhaps that doesn't seem like a big admission, but consider this: I had been programming in C++ since 1990, and had been reading articles on object-oriented programming since the mid-1980s.

Oh, sure, at one level I understood why objects were better than procedural code. When I read Stroustrup's introduction to The C++ Programming Language, I understood and agreed with his ideas of modularization via objects. By then, structured programming and layered design were in the mainstream, and objects seemed a simple way to accomplish those goals. And, working in the financial industry, I had many examples of polymorphic behavior (all securities have a price, but that price means something very different if you're talking about stocks, bonds, or options).

But it never gelled. Although I replaced struct with class, carefully considered the public/private divide, and compiled using a C++ compiler, I continued to refer to my self as a “C programmer.” In fact, my resume doesn't mention C++ (although that's more because C++ has come a long way since I stopped using it).

In October 1999, I joined an existing Java project — my first real use of the language. And one day, looking at the output from a parser generator, it clicked: polymorphism wasn't about three classes that exposed the same set of methods, it was about 150 classes, each of which used different data to drive the same method. Since that time, I've had a clear division in my mind between classes that provide behavior, and classes that hold (and expose) data.

Most “business” applications, however, revolve around data, not behavior. If you're writing online trading software, your use of polymorphism is limited to calculating the value of a position; something else displays that value. And that something else tends to be written in a procedural style: get this piece of data, do something to it, get the next piece of data.

There have been a lot of approaches to making such processing “more object-oriented,” from the Visitor pattern to mandates that all data be accessed via getters and setters. And to be sure, there are benefits from these techniques, particularly if you come up with a clean way to encapsulate business rules.

I think that “clean” is the real problem. Business rules tend to me more exception than rule, having evolved over time, changing to suit the needs of the moment, and involving inter-locking relationships between different sources of data. In that world, “tell, don't ask” is a lot more difficult than “ask and decide.”

(which, as I'm proofing this, strikes me as an example of Conway's law in action — but that's a topic for another post)

Tuesday, November 3, 2009

Multitasking

Right now I have four IDE windows open: NetBeans, two instances of Eclipse running on my main development machine, and another Eclipse running on a laptop to which I'm connected via Remote Desktop. All are running some mixture of the same projects. My development box runs Ubuntu, and I'm a multiple desktop addict.

I've noted before how I like NetBeans' integrated Tomcat server, but don't particularly like its Java editor or debugger. So when working on a web app, I set up projects in both NetBeans and Eclipse that refer to the same source tree. When switching between them after edits, I do need to kick off a rebuild, but this takes a fraction of a second to complete. It works well enough, in fact, that I don't feel the need for any of the web development plugins that I can get for Eclipse.

The two Eclipse instances are a little trickier to explain. My default Eclipse workspace has pretty much all of the projects that I'm currently working on, some open, most closed. And for the most part, that's a Good Thing: when I edit, say, a utility class, all dependencies in the workspace will pick up the update.

However, I've found that there are times when I want to switch back and forth between projects. For example, I'm working on a mainline project, and think of a series of changes that I want to make to a utility library. I find it's easier to use separate Eclipse workspaces for this, since each workspace can have its own set of open windows. When I switch from one workspace to the other, I maintain the context of where I stopped working.

Unfortunately, switching workspaces within a single Eclipse instance is not a matter of a few seconds, and the new workspace opens with the folder view collapsed. Separate instances can coexist peacefully, however, as long as they're using different workspaces. The JVM has become good enough about sharing portions of virtual memory that there's no noticeable swapping when I switch between instances. The only real drawback to this technique is that changes — particularly refactorings — don't automatically update across workspaces; I have to manually refresh.

That brings me to the Eclipse instance running on my laptop. I've found that, even within a single project, I'll be working on one part of the code and think of a refactoring somewhere else. Or realize that a widely accessed method should change its signature. I tend to be anal about my checkins: the commit message should accurately describe what was changed, and I don't want to check in a lot of files that have changes incidental to the commit.

One solution is to keep a list of such changes, and postpone them until after my current checkin. But that doesn't always work: sometimes it will be hours between checkins. Sometimes an entire day. And while some refactorings don't look so good in the cold light of morning, usually the delay just makes my current coding more difficult.

The laptop solves this problem, by holding a copy of the code as-of the previous checkin. Refactorings or signature changes can happen there, get checked in, then applied to my main directory via svn update. And while this seems like a recipe for disaster, or at least eternal conflicts, I've been surprised at just how easy it is to integrate such changes — they are, after all, incidental to my “current” work.