Showing posts with label data structures.

Monday, March 19, 2012

Synchronizing put() is Not Sufficient

Have you ever seen a class like this, which synchronizes put() but not get()? Have you ever written one?

public class MyConcurrentlyAccessedLookupTable
{
    private Map<String,MyDataObject> data = new HashMap<String,MyDataObject>();
    
    // ...
    
    public MyDataObject get(String key)
    {
        return data.get(key);
    }
    
    public synchronized void put(String key, MyDataObject value)
    {
        data.put(key, value);
    }
}

I've been working over a legacy codebase recently, and have seen a lot of them. And I want to track down the original developer, give him a good shake, and say “this only works because you never put anything into the map after initializing it!” Actually, this is one of the least egregious ways that he (yes, he, I know his name) managed to fail at concurrent programming. Don't get me started on his abuse of static initializers.

OK, it feels good to let that out. I can at least understand why he thought it would work. A HashMap, by its nature, is resilient to incorrectly managed concurrent access. Under normal operation, new items get added to a bucket's linked list; a get() will either see or not see the new entry. With concurrent put()s, however, two new entries might get added to the same bucket list, and if the scheduler switches threads in the middle, one of those entries can simply disappear.

And then there's resize: doubling the size of the hash table to reduce the depth of the bucket chains. And, at least in the Sun implementation, there's no attempt to preserve the chains during this operation (I can't see why any implementation would, but there might be one). So put() could get a chain using the old table, but that chain could be assigned to a different bucket by the time the new entry is attached.

OK, so in both cases get() fails to find an entry that is actually in the map, is that really so bad? Well, is it bad if you spend days trying to figure out why a transaction failed when the data was obviously present? Or if that failed transaction caused your company to lose a sale?

The problem with concurrency bugs is that they aren't easily reproducible. Sometimes they appear once and never again. But they're still bugs, and they're not that difficult to avoid: when in doubt, synchronize all access to shared state. Only if you fully understand your access patterns (eg, you initialize the map once and then only read from it) can you get away without synchronization.
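For reference, here's the minimal fix for the class above: a sketch that keeps the original names. Replacing the HashMap with a java.util.concurrent.ConcurrentHashMap (and dropping the synchronized keyword) would be another reasonable option.

import java.util.HashMap;
import java.util.Map;

public class MyConcurrentlyAccessedLookupTable
{
    private Map<String,MyDataObject> data = new HashMap<String,MyDataObject>();

    // synchronizing the reader as well as the writer means that a get() can
    // never see the map in the middle of a put() or a resize
    public synchronized MyDataObject get(String key)
    {
        return data.get(key);
    }

    public synchronized void put(String key, MyDataObject value)
    {
        data.put(key, value);
    }
}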

But never go halfway.

Monday, May 9, 2011

Big-O Isn't Everything, Part 2

If you need to build a lookup table for some large number of values, what data structure should you choose?

Obviously, you want a hashed data structure. In Java, a HashMap or HashSet. It will provide O(1) performance, versus O(logN) for a binary search and O(N) for a linear search. And the “Big-O” performance of an algorithm becomes increasingly important as the table grows larger. Or does it?

To find out, I wrote a micro-benchmark that compared different approaches for varying table sizes. One version used a HashSet, another used an Integer[] and Arrays.binarySearch(), and a third used int[] and a home-grown binary search. For each of them, I probed the lookup table 10,000,000 times with random values, and recorded the execution time in seconds.
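For reference, the primitive-array variant of the benchmark looked something like this. It's a sketch with class and variable names of my own choosing, not the actual benchmark code; the HashSet and Arrays.binarySearch() variants followed the same pattern.

import java.util.Random;

public class PrimitiveLookupBenchmark
{
    public static void main(String[] argv)
    {
        final int tableSize = 1000000;      // hypothetical; the real benchmark varied this per run
        final int probes = 10000000;

        // a sorted table of arbitrary values
        int[] table = new int[tableSize];
        for (int i = 0 ; i < tableSize ; i++)
            table[i] = i * 2;

        Random rnd = new Random();
        long start = System.currentTimeMillis();
        int found = 0;
        for (int i = 0 ; i < probes ; i++)
        {
            if (binarySearch(table, rnd.nextInt(tableSize * 2)) >= 0)
                found++;
        }
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(found + " hits in " + elapsed + " ms");
    }

    // the home-grown binary search over a sorted int[]
    private static int binarySearch(int[] table, int value)
    {
        int lo = 0;
        int hi = table.length - 1;
        while (lo <= hi)
        {
            int mid = (lo + hi) >>> 1;
            if (table[mid] < value)
                lo = mid + 1;
            else if (table[mid] > value)
                hi = mid - 1;
            else
                return mid;
        }
        return -(lo + 1);
    }
}

The timings, in seconds: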

Table Size       HashSet    Binary Search, Object    Binary Search, Primitive
1,000               0.37                     1.00                        0.88
10,000              0.43                     1.99                        1.19
100,000             0.54                     3.18                        1.60
1,000,000           1.00                     5.75                        2.33
10,000,000          1.19                    12.70                        4.99
100,000,000         1.58                    20.68                        8.50
1,000,000,000        n/a                      n/a                       13.28

No surprises: the execution time for the hashed version remained relatively constant, while the binary search time increased logarithmically with the size of the table.* But this table leaves out two important pieces of information.

The first is that real-world applications rarely spend all their time doing lookups. Instead, they do a lookup and then do something else with the results. And while the HashSet outperformed the Integer[] search by a factor of 13, the absolute performance of the latter is two microseconds per iteration. For a real-world application, this is almost certainly sufficient.

Even so, it seems to make sense to go with the HashSet. But there's another factor at work: absolute performance isn't the only way to measure this benchmark. And if you noticed the “n/a” at the bottom of the first two columns, and the inclusion of a primitive version in the last column, you can probably guess where this post is going.**

Table Size          HashSet    Binary Search, Object    Binary Search, Primitive
1,000,000          178,054K                 142,525K                    126,900K
10,000,000         656,742K                 318,307K                    162,057K
100,000,000      5,286,458K               2,077,979K                    513,619K
1,000,000,000           n/a                      n/a                  3,969,678K

I ran these tests on an Amazon EC2 double-extra-large instance, with a 16 GB Java heap. It costs $1 per hour ($8,766/year) to rent this machine. If my application needed a billion-entry lookup table, I would have two choices: either use a sorted array, or upgrade to a quad-extra-large instance and double my costs. For a single machine, maybe not an issue. But if I'm running a cluster of machines, the cost difference would add up quickly.

The bottom line is that absolute CPU performance is only one factor that goes into picking a data structure. Memory consumption is another, and programmer performance is a third — one that's perhaps more important than the other two put together. A HashSet is a well-known, easy to use data structure. A sorted array is less so, and becomes increasingly complex if you have to implement a custom Comparator. Using one will increase your time to implement, and perhaps your maintenance costs. So it wouldn't be my first choice.

But if I were faced with a limited-memory environment, neither would I be blinded by “Big-O” performance.


* If the HashSet implementation is O(1), why did its performance decrease as the table got larger? There's always a risk, when you write a benchmark, that you'll get some apparently anomalous results and will need to explain them. In this case, I believe the performance difference is an artifact of the processor cache. I ran the tests on a machine with an 8MB cache, which is large enough to hold the entire hash array for 100,000 entries — and probably a large number of the bucket entries for the 1,000- and 10,000-entry tables. Since the test repeatedly probes the table, the cache is a significant contributor to overall performance. Yet another pitfall in a micro-benchmark …

** You may be wondering why I only show the last four rows of the table. The reason is another pitfall of micro-benchmarks: finding the correct measurement tool. I used MemoryMXBean.getHeapMemoryUsage(), along with a couple of calls to System.gc(). This should have shown me only the memory used by live objects, but instead I saw a minimum of 100 MB for each run. With the larger table sizes, however, the real memory consumption overwhelmed this baseline “noise.”
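For what it's worth, the measurement looked roughly like this (a sketch of the approach, not the exact harness):

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

public class HeapMeasurement
{
    // returns an approximation of the heap consumed by live objects; the
    // explicit gc() calls are only a hint to the JVM, which is one reason the
    // smaller runs showed a ~100 MB baseline
    public static long usedHeap()
    {
        MemoryMXBean mx = ManagementFactory.getMemoryMXBean();
        System.gc();
        System.gc();
        return mx.getHeapMemoryUsage().getUsed();
    }
}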

Thursday, April 21, 2011

Four Data Structures Every Programmer Should Know

I went back to reading Stack Overflow during my convalescence. And saw a bunch of questions and answers that really disturbed me. For those who don't want to click, they're all questions that relate to data structures.

OK, I realize that data structures can be daunting. Knuth took three books just to cover the basics. Cormen is nearly 1,300 pages, and Sedgewick almost 1,000. And I'll be the first to admit that I've not memorized any of these books, or even read significant pieces of them (much less cover-to-cover). They're meant as references: if and when I need to implement an algorithm, I open the book. If you asked me to sit down and implement a red-black tree without a book, I couldn't do it.

But I do know the performance characteristics of binary trees, and I know that java.util.TreeMap uses a binary tree. And as a Java programmer, I know when I should use a TreeMap versus a HashMap, versus an ArrayList that I've sorted. And I think that knowledge is a baseline that every programmer should have.

So, here's a cheat-sheet of the four data structures that every programmer should know, because they'll handle almost all of your in-memory data management needs.

Structure            Insert     Remove     Search     Indexed Access    Example
Hash table           O(1)       O(1)       O(1)       n/a               HashMap
Binary tree          O(logN)    O(logN)    O(logN)    n/a               TreeMap
Linked list          O(1)       O(1)       O(N)       O(N)              LinkedList
Array-backed list    O(N)       O(N)       O(N)       O(1)              ArrayList

Hash table

If you need to maintain a key-value relationship, then a hashed data structure should be your first choice. As I've written elsewhere, a hashed lookup is as fast as you can get: O(1) access for all operations. At least, it's O(1) if you have a good hash function. If you have a bad hash function, performance can drop to O(N).

There is one downside to hash tables: they add a lot of memory overhead. To avoid collisions, a hash table needs to be larger than the number of entries; by default, the Java HashMap resizes when the table becomes 75% full. There's also overhead from the linked list used to hold hash buckets: each entry is a distinct object, and in the Sun implementation, on a 32-bit JVM, takes 24 bytes — not counting the value that it holds.

Binary tree

Whenever someone mentions O(logN) performance, they're talking about a tree of some form. Trees implement a “divide and conquer” approach to data: starting at a “root” node, you recursively identify a child node that contains the data you need. At each step, you divide the total search space by some constant value; in the case of a binary tree, that value is 2: you divide the space in half. There are a bunch of different trees, and a bunch of data structures that implement a binary tree, but as far as Java's collection classes are concerned, there are just two, both binary (which means that O(logN) is really O(log2N)).

TreeMap and TreeSet use a tree structure in which the value of a node's “left” child is less than the value of the node, and the value of the node's “right” child is greater. This means that not only do you get O(logN) search performance, but you can also easily iterate through the elements in order, from smallest to largest.

And this is (with a few exceedingly rare exceptions) the only reason that you would pick TreeMap over HashMap. For example, if you are extracting word counts from a piece of text, it makes sense to manage the words in sorted order, because you'll want to output them in that form. Other than this use case, the performance benefit of HashMap makes it the better choice.
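Here's a word-count sketch that relies on TreeMap's ordering; the text-splitting is simplistic, but it shows the point:

import java.util.Map;
import java.util.TreeMap;

public class WordCounter
{
    public static void main(String[] argv)
    {
        String text = "the quick brown fox jumps over the lazy dog the end";

        Map<String,Integer> counts = new TreeMap<String,Integer>();
        for (String word : text.split("\\s+"))
        {
            Integer count = counts.get(word);
            counts.put(word, (count == null) ? 1 : count + 1);
        }

        // iteration order is the natural (alphabetical) order of the keys
        for (Map.Entry<String,Integer> entry : counts.entrySet())
            System.out.println(entry.getKey() + " = " + entry.getValue());
    }
}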

The PriorityQueue is another Java class that implements a binary tree structure; however, it does not maintain a total ordering of its elements. Instead, it uses a tree in which a given element is required to have a lower value than either of its children. This means that inserts are O(logN), but searching is O(N): you know that the root is the smallest value in the tree, but you know nothing about the ordering of the others.

Priority queues have a very limited set of use cases. They're great when you need to process or initiate time-dependent events, or for graph traversal algorithms where you care about a least-cost path. Because they don't maintain a total ordering over their elements, there's less work that has to be done for each N, and there's no need for an explicit node object (even though two algorithms can have the same “big O” complexity, the work that takes place can make them very different in performance).

I mention priority queues because of one of the SO links above. Because priority queues are binary structures, they take O(logN) time for inserts and removes. Which is a really dumb penalty to pay if all you're looking for is a simple linear FIFO (first-in-first-out) queue.
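To make that concrete, here's a small sketch contrasting the two: a PriorityQueue always hands back its smallest element, while a FIFO queue hands back elements in insertion order with O(1) adds and removes. (I'm using LinkedList as the FIFO, to match the next section.)

import java.util.LinkedList;
import java.util.PriorityQueue;
import java.util.Queue;

public class QueueComparison
{
    public static void main(String[] argv)
    {
        // a priority queue re-orders on every insert and remove ...
        Queue<Integer> prio = new PriorityQueue<Integer>();
        prio.add(3); prio.add(1); prio.add(2);
        System.out.println(prio.poll() + " " + prio.poll() + " " + prio.poll());   // 1 2 3

        // ... while a FIFO queue just links new elements onto the tail
        Queue<Integer> fifo = new LinkedList<Integer>();
        fifo.add(3); fifo.add(1); fifo.add(2);
        System.out.println(fifo.poll() + " " + fifo.poll() + " " + fifo.poll());   // 3 1 2
    }
}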

Linked list

Which brings me to linked lists, the traditional data structure for queues. As long as you actually hold a reference to a node, insertion and removal are O(1): they're implemented by updating pointers. It's easy to get a reference to either the first or last node (at least in a doubly-linked list, which is what the Java LinkedList class implements). It's not so easy to get access to an arbitrary node, because you first need to iterate over its predecessors, an O(N) operation.

But linked lists have another dark side: each node is a distinct object that consumes memory (20 bytes on a Sun 32-bit JVM) and must eventually be garbage-collected. And the use case for removing elements from the head or tail of a list tends to be multi-threaded applications, in which case one of the classes in java.util.concurrent will be more appropriate. In fact, the only time that LinkedList would be my first choice is for an application where I need a known per-element (marginal) memory cost.

Array-backed list

For all other generic lists, my first choice is ArrayList. As its name implies, it implements the List interface using a pre-allocated array for storage. This means that you get O(1) access to any value, as long as you know its position in the array. It also means that, while add or remove of the last element is O(1), it's O(N) for any other position, because the other values in the list have to be shifted. However, in practice most additions are to the end of a list. And even where you need to add or remove values from the start of the list, in a small list (a few thousand entries) the CPU cost to do this is lower than the cost to manage the entries of a LinkedList.

The problem with ArrayList is the way that it resizes the underlying array. As I said, it pre-allocates an array for storage; if you want to add more elements than the array can hold, it must allocate a new, larger array, and copy the existing entries into it. The documentation describes this behavior as “amortized” O(1) performance: although the copy is an O(N) operation, it will only happen every N times you add an element, and O(N) / N == O(1).

More important, in practice, is that an ArrayList allocates a new array that is some significant percentage larger than the old array (the factor is explicitly not documented, but in the Sun 1.6 JDK it's 50%). If you're pushing the boundaries of available memory, and need to expand a large list, it's all too easy to get an OutOfMemoryError. There's also the potential for a lot of wasted space, particularly with small lists.
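One mitigation, when you know roughly how many elements you'll be storing, is to presize the list. A sketch (the growth behavior itself remains implementation-specific):

import java.util.ArrayList;
import java.util.List;

public class PresizedList
{
    public static void main(String[] argv)
    {
        int expectedSize = 1000000;

        // allocating the backing array up front avoids the allocate-and-copy
        // cycles (and the transient doubled footprint) of incremental growth
        List<Integer> values = new ArrayList<Integer>(expectedSize);
        for (int i = 0 ; i < expectedSize ; i++)
            values.add(i);

        System.out.println("stored " + values.size() + " values");
    }
}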

Summary

I would say that, 99% of the time, when I need a Java collection class I turn to ArrayList and HashMap. To me, they represent the Platonic ideal for List and Map implementations. But I do that with full understanding of their limitations, and am ready to switch to a different data structure (or implement my own) when I run into those limitations.

Wednesday, October 27, 2010

Living Among Primitives, part 3: Collections

The last post “turned an object inside-out” to eliminate the overhead of a list of discrete objects. While that technique allows you to pack a lot of objects into a relatively small amount of space, your programs will be a lot cleaner if you can pass around Point objects rather than a PointHolder and its index. The java.util.AbstractList class can help you reach a middle ground.

import java.util.AbstractList;

public class PointHolder2
extends AbstractList<Point>
{
    private float[] _x;
    private float[] _y;
    private float[] _z;
    private int _size;

    // constructor will initialize the arrays

    // ...

    @Override
    public int size()
    {
        return _size;
    }

    @Override
    public boolean add(Point point)
    {
        _x[_size] = point.getX();
        _y[_size] = point.getY();
        _z[_size] = point.getZ();
        _size++;
        return true;
    }

    @Override
    public Point get(int index)
    {
        return new Point(_x[index], _y[index], _z[index]);
    }

    @Override
    public Point set(int index, Point point)
    {
        _x[index] = point.getX();
        _y[index] = point.getY();
        _z[index] = point.getZ();
        return point;
    }
}
AbstractList is a tool for implementing lists with an application-specific backing store. Out of the box, you have to implement the get() and size() methods, and the setter methods all throw UnsupportedOperationException. But as you can see, it doesn't take a lot of work to implement these, and the result is nearly identical in terms of performance to an ArrayList<Point>.
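Usage is then identical to any other List<Point>. A sketch, assuming that PointHolder2's (elided) constructor takes a capacity and that Point has an (x, y, z) constructor:

// client code sees an ordinary List<Point>
List<Point> points = new PointHolder2(1000);
points.add(new Point(1.0f, 2.0f, 3.0f));
Point first = points.get(0);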

And the Point class has reappeared, making your application code cleaner. But now they're created as-needed, and will be collected as soon as the program is done with them; you won't pay the memory overhead to keep them around. And assuming that they're relatively short-lived once created, they should rarely if ever make their way out of the heap's young generation, so will be cheap to collect.

There is one important difference between PointHolder and ArrayList<Point>, not immediately apparent from this example: since the Point instances do not actually exist within the list, changes to them won't be reflected in the PointHolder data. As implemented, each Point instance contains its own copies of the primitive values. To make the Point mutable, you'll need to design them with a reference back to the PointHolder (which I won't go into because I don't like mutable data-only objects).

Now that you have a space-efficient List<Point>, you may be thinking of how you'd access it — for example, how to search for a specific point. And you might think of calling Collections.sort(), followed by Collections.binarySearch(). And while this will work, it's not very efficient (because you're constantly creating Point instances), and has hidden memory costs. Those are topics for the next post.

Tuesday, October 19, 2010

Living Among Primitives, part 2: Data Structures

It may seem strange to talk about “living among primitives” and then jump to data structures, but the truth is that most programs work with structured data. So even in a world where you've turned to primitives, you need to think about how you'll organize those primitives.

And the best way, surprise, is to use Java objects — but with an eye toward the overheads that they introduce. In the last post I noted that an Integer adds 8 bytes of overhead to 4 bytes of data. That number came from Java Platform Performance: Strategies and Tactics, and I suspect it's higher for 64-bit JVMs: I recall reading in an early Hotspot doc that the overhead consisted of a reference to the object's class (which would be 4 bytes under 32-bit, 8 under 64-bit) and the object's identity hashcode. Arrays add another 4 bytes of overhead to hold their length.

As a case study for the rest of this post, I'm going to look at a three-dimensional Point class:

public class Point
{
    private float _x;
    private float _y;
    private float _z;
}

The data in this object takes up 12 bytes (float is a 32-bit IEEE-754 floating-point value), and we'll be conservative on overhead and say it's 8 bytes (who knows, perhaps a 64-bit JVM will efficiently encode a reference to the object's class). So a total of 20 bytes per instance, 50 million instances per gigabyte. But you'll need something to manage those 50 million instances:

Point[] points = new Point[50000000];

On a 64-bit JVM, each element in that array is a reference, consuming 8 bytes; that means nearly half a gigabyte just to keep track of your gigabyte of points! And from here on out I'm going to assume that you have data volumes that require a 64-bit JVM.

One way to deal with this is to turn the structure inside-out: rather than an array of points, why not an object that holds arrays of values:

public class PointHolder
{
    private float[] _x;
    private float[] _y;
    private float[] _z;
    
    // constructor will initialize the arrays
    
    public float getX(int index)
    {
        return _x[index];
    }
    
    // and so on
}

This approach may make your code a little more complex, but it eliminates 16 bytes of overhead per point: the 8-byte reference in the original array, plus the 8 bytes of per-object overhead. You can now fit nearly 90 million points in a gigabyte. And it has another benefit: you can change the implementation without changing your code. For example, if you're dealing with truly huge arrays of points, you could move them off the heap entirely, using a memory-mapped file.
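A sketch of that last idea, backing the coordinates with a memory-mapped file (the class name and file layout are my own invention; error handling, file sizing, and resource cleanup are omitted, and each mapped region is limited to 2 GB by FileChannel.map()):

import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;

public class MappedPointHolder
{
    private FloatBuffer _x;
    private FloatBuffer _y;
    private FloatBuffer _z;

    public MappedPointHolder(String filename, int capacity) throws Exception
    {
        long bytesPerCoord = (long)capacity * 4;    // 4 bytes per float

        RandomAccessFile file = new RandomAccessFile(filename, "rw");
        FileChannel channel = file.getChannel();

        // map three consecutive regions of the file, one per coordinate
        _x = channel.map(MapMode.READ_WRITE, 0, bytesPerCoord).asFloatBuffer();
        _y = channel.map(MapMode.READ_WRITE, bytesPerCoord, bytesPerCoord).asFloatBuffer();
        _z = channel.map(MapMode.READ_WRITE, bytesPerCoord * 2, bytesPerCoord).asFloatBuffer();
    }

    public float getX(int index)
    {
        return _x.get(index);
    }

    // and so on
}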

The problem with “turning the object inside-out” is that you have to give careful thought to how you work with the data: for example, how do you sort these points? How do you find a particular point? That will be the topic of the next few posts.

Monday, October 18, 2010

Living Among Primitives, part 1

There are a lot of people who say that Java shouldn't have primitive data types, that everything should be an object. And there is some truth to their arguments. Most business applications don't do intensive numeric operations, and for those that do, double is a poor substitute for a fully-thought-out Money. You could even claim that the presence of primitives leads to a procedural style of code: things like using an index variable rather than an iterator to access a list. And languages such as Ruby and Python follow the “everything's an object” path without apparent ill effect.

But I feel that primitives are in fact one of the strong points of the Java language, and that languages without native primitives often sprout hacks to replace them, because there are several good reasons for primitives to exist.

One is performance. It's true that, for simple expressions, evaluated once, you can't tell the difference. But there's a reason that libraries such as NumPy, which performs matrix math (among other things), are written in C or C++ rather than Python.

There's a primitive at the heart of any numeric object — after all, that's what the CPU works with. So when you invoke a method on a numeric object, not only do you have the overhead of a function call, but you're calling into native code. True, Hotspot could recognize mathematical operations and translate them into inline native code. But that still leaves the cost of allocating and collecting objects that are used only during the evaluation of a function. Unless Hotspot recognized those and transformed them into, well, primitives.

Another, often more important, reason to use primitives is memory consumption: an int takes 4 bytes, but on a 32-bit Sun JVM, an Integer adds 8 bytes of overhead to those 4 bytes of data; there's also the overhead of the reference to that Integer: 4 bytes on a 32-bit JVM, 8 bytes on a 64-bit JVM. Again, that doesn't seem like much, unless your application is dealing with hundreds of millions of numeric values.

And hundreds of millions of numeric values is exactly what I'm dealing with on my latest project … to the point where I recently told a colleague that “this application should be written in C.” The next few posts will dive into some of the issues that you'll face when living among primitives.

Monday, September 20, 2010

BigMemory

Last week Terracotta released BigMemory, which stores objects outside of the Java heap and “defuses the GC time bomb.” Since I've written the top-ranked relevant article when you Google for “non-heap memory” (it's on page 2), my website has seen a spike in volume. Unlike previous spikes, people have been poking around, which is nice to see.

I did some poking around myself, both to Terracotta's blogs, and to some of the Twitter posts and articles that linked to my site. And I saw a lot of comments that disturbed me. They ranged from (paraphrased) “this is just hype, it isn't anything new,” to “I wonder if Oracle will file a patent lawsuit on this,” to “the JVM should be doing this itself,” to the ever-popular “people should just learn to write efficient code.” I'll take on that last one first.

But first some caveats: I've never used EHCache outside of Hibernate, and even then I simply followed a cookbook. My closest contact with EHCache development was attending a Java Users Group meeting with a badly jetlagged Greg Luck, followed by beer with him and a bunch of other people. I can't remember what was discussed around the table, but I don't think it was cache internals. But I do know something about caches, and about the way that the JVM manages memory, so I feel at least qualified to comment on the comments.

And my first comment is: a cache is efficient code … when appropriately used. You can spend days micro-optimizing your code, only to find it all wasted with a single trip to the database.* You use a cache when it's expensive to retrieve or create often-used data; the L1 and L2 caches in your processor exist because memory is relatively slow, so it's expensive to retrieve. Similarly, it's expensive to create a web page (and load its data from the database), so you would use EHCache (or Akamai) to cache the pre-built page (EHCache is cheaper).

The interface to a cache is pretty straightforward: you give it a key, it gives you a value. In the case of the processor's L1 cache, the key is an address, the value is the bytes around that address. In the case of memcached, the key is a string and the value is a blob. In Java, this corresponds to a Map, and you can create a simple cache using a LinkedHashMap.

The reason to use a LinkedHashMap, versus a regular HashMap, is eviction: removing old entries from the map. Although memory is cheap, it's not infinite, and there's no good reason to keep every piece of data ever accessed. For example, consider Stack Overflow: when a question is on the home page, it might get hundreds of views; clearly you don't want to keep reading it from the database. But after it drops off the home page, it might get viewed once a week, so you don't want to keep it in memory.

With LinkedHashMap, you can implement a simple eviction strategy: once the map reaches a certain size, you remove the head entry on each new put(). More sophisticated caches use more sophisticated strategies: for example, weighting the life of a cached object by the amount of work that it takes to load from source. In my opinion, this is why you go with a pre-built solution like EHCache: they've thought through the eviction strategies and neatly packaged them.
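A minimal sketch of such a cache, using the removeEldestEntry() hook that LinkedHashMap provides for exactly this purpose:

import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleCache<K,V>
extends LinkedHashMap<K,V>
{
    private final int maxEntries;

    public SimpleCache(int maxEntries)
    {
        // accessOrder = true makes this an LRU cache; false would evict by
        // insertion order
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K,V> eldest)
    {
        // called after each put(); returning true evicts the eldest entry
        return size() > maxEntries;
    }
}

Note that this sketch isn't thread-safe; in a real application you'd wrap it with Collections.synchronizedMap() or reach for a library cache.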

So that's how a cache works. What's interesting is how it interacts with the garbage collector. The point of a cache is to hold content in memory for as long as it's valuable; for a generational garbage collector like Sun's, this means it will end up in the tenured generation … but eventually get removed. And this causes two issues.

First is the cost to find the garbage. The Hotspot GC is a “mark-sweep” collector: it starts at “root” references and traces through the entire object graph to find live objects; everything else is garbage (this is all covered in my article on reference objects; go back and click the link). If you have tens or hundreds of millions of objects in your heap (and for an eCommerce site, that's not an unreasonable number), it's going to take some time to find them all: a few milliseconds. This is, as I understand it, the price that you pay for any collection, major or minor.

A major collection, however, performs one additional step: it compacts the heap after collecting the garbage. This is a big deal. No matter how fast your CPU and memory, it takes time to move gigabytes of memory around. Particularly since the JVM is updating all the references to the moved objects. But if you've ever written a C application that fragmented its heap, you'll thank the Hotspot team every time you see a major collection take place.

Or maybe you won't. Heap compaction is an artifact of tuning the JVM for turn-of-the-century computers, where 1 GB of main memory was “server class,” and swap (the pagefile) was an important part of making your computer usable for multiple applications. Compacting the heap was a way to reduce the resident set size, and improve the overall performance of an application.**

Today, of course, we have desktop-class machines with multiple gigabytes, running an OS and JVM that allow direct access to all of it (and more). And there's been a lot of work in the past few decades to reduce the risk of fragmentation with C-style memory allocation. So maybe it is time to rethink garbage collection in the JVM, at least for the tenured generation, and introduce freelists (of course, that will inflame the “Java is a memory hog” crowd). Or maybe it's time to introduce a new generation, for large tenured objects. Maybe in JDK 1.8. I'm not holding my breath, and neither, it seems, were the EHCache developers.

So to summarize: EHCache has taken techniques that are well-known, and applied them to the Java world. If I were still working in the eCommerce field, I would consider this a Good Thing, and be glad that they invested the time and effort to make it work. Oh, and as far as patents go, I suspect that IBM invented memory-mapped files sometime in the 1960s, so no need to worry.


* Or, in the words of one of my favorite authors: “No matter how subtle the wizard, a knife between the shoulder blades will seriously cramp his style.”

** If you work with Swing apps, you'll often find a minimize handler that explicitly calls System.gc(); this assumes that the app's pages will be swapped out while minimized, and is an attempt to reduce restore time.

Thursday, April 15, 2010

intern() and equals()

A lot of programmers have the belief that String.intern() can be used to speed up if-else chains. So instead of writing this:

public void dispatch(String command)
{
    if (command.equals("open"))
        doOpenAction();
    else if (command.equals("close"))
        doCloseAction();
    else if (command.equals("calculate"))
        doCalculateAction();
    // and so on
}

they write this:

public void dispatch(String command)
{
    command = command.intern();
    if (command == "open")
        doOpenAction();
    else if (command == "close")
        doCloseAction();
    else if (command == "calculate")
        doCalculateAction();
    // and so on
}

Their logic seems to be that it's faster to compare object references than to compare strings character by character. And while the only way to know for sure is a benchmark, I'm willing to bet that the intern() approach is actually slower (except, of course, for carefully constructed examples), the reason being that string comparisons are quite fast, while intern() has to do a lot of work behind the scenes.

Let's start with a string comparison, since it's available in the src.zip file that comes with the JDK (in this case, JDK 1.6):

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = count;
            if (n == anotherString.count) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = offset;
                int j = anotherString.offset;
                while (n-- != 0) {
                    if (v1[i++] != v2[j++])
                        return false;
                }
                return true;
            }
        }
        return false;
    }

Like all good equals() implementations, this method tests identity equality first. Which means that, intern() or not, there's no reason to get into the bad habit of using == to compare strings (because sooner or later you'll do that with a string that isn't interned, and have a bug to track down).

The next comparison is length: if two strings have different length, they can't be equal. And finally, iteration through the actual characters of the string. Most calls should return from the length check, and in almost all real-world cases the strings will differ in the first few characters. This isn't going to take much time, particularly if the JVM decides to inline the code.

How about intern()? The source code for the JDK is available, and if you trace through method calls, you end up at symbolTable.cpp, which defines the StringTable class. Which is, unsurprisingly, a hash table.

All hash lookups work the same way: first you compute a hashcode, then you use that hashcode to probe a table to find a list of values, then you compare the actual value against every value in that list. In the example above, computing the hashcode for any of the expected strings will take as many memory accesses as all of the comparisons in the if-else chain.

So if intern() doesn't speed up comparisons, what is it good for? The answer is that it conserves memory.

A lot of programs read strings from files and then hold those strings in memory. For example, an XML parser builds up a DOM tree in which each Element instance holds its name, namespace, and perhaps a map of attributes. There's usually a lot of duplication in this data: think of the number of times that <div> and href appear in a typical XHTML file. And since you've read these names from a file, they come into your program as separate String instances; nothing is shared. By passing them through intern(), you're able to eliminate the duplication and hold one canonical instance.

The one drawback to String.intern() is that it stores data in the permgen (at least for the Sun JVM), which is often a scarce resource (particularly in app-servers). To get the benefits of canonicalization without risking an OutOfMemoryError, you can create your own canonicalizing map.
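Here's a sketch of such a map: thread-safe, and holding its canonical values in the normal heap rather than the permgen:

import java.util.concurrent.ConcurrentHashMap;

public class Canonicalizer
{
    private ConcurrentHashMap<String,String> values = new ConcurrentHashMap<String,String>();

    public String intern(String value)
    {
        // putIfAbsent() returns the existing mapping, or null if this call
        // created one; either way, every equal string gets the same instance
        String canonical = values.putIfAbsent(value, value);
        return (canonical != null) ? canonical : value;
    }
}

Unlike String.intern(), the canonical values are garbage-collected along with the map itself.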

And if you'd like to replace long if-else chains with something more efficient, think about a Map of functors.
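For example (a sketch; in pre-lambda Java a “functor” is just an implementation of a one-method interface):

import java.util.HashMap;
import java.util.Map;

public class Dispatcher
{
    private interface Action
    {
        void execute();
    }

    private Map<String,Action> actions = new HashMap<String,Action>();

    public Dispatcher()
    {
        actions.put("open",      new Action() { public void execute() { doOpenAction();      } });
        actions.put("close",     new Action() { public void execute() { doCloseAction();     } });
        actions.put("calculate", new Action() { public void execute() { doCalculateAction(); } });
    }

    public void dispatch(String command)
    {
        Action action = actions.get(command);
        if (action != null)
            action.execute();
    }

    private void doOpenAction()      { /* ... */ }
    private void doCloseAction()     { /* ... */ }
    private void doCalculateAction() { /* ... */ }
}

Lookup is a single hashed probe, and adding a new command doesn't make the dispatch path any longer.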

Thursday, November 19, 2009

How HashMaps Work

This was originally written as an answer to a StackOverflow question where the original poster and several responders seemed to have some misunderstandings of how hashed lookup works. Since it would end up buried under a dozen more-or-less right responses if posted there, I've decided to post here, where it might end up on the front page of Google.

The Hashcode

Hashed lookup starts with an object's hashcode. A hashcode is simply a number that is permanently associated with an object, used only when storing that object in a hashed structure; there's no magic about it. The only rule that governs hashcode values is that, given objects x and y, if x.equals(y) returns true, then both x and y must return the same hashcode. The reverse isn't true: x and y could return the same hashcode, but not be considered equal. Here's an example:

Long x = Long.valueOf(0x600000002L);
Long y = Long.valueOf(0x700000003L);

System.out.println(x.hashCode());
System.out.println(y.hashCode());
System.out.println(x.equals(y));

All objects could return 0 as their hashcode, and that would satisfy the rules. It wouldn't be very useful, however. To be useful, a hash function must evenly distribute values over the 32-bit range of an integer: you should be able to apply the same statistical tests that you would for a random number generator.

There are many ways to compute hashcodes. Integer, for example, simply returns its int value. Long does a shift and xor, to mix its high-order 32 bits with its low-order 32 bits. A common approach is to keep a running sum, multiplying that sum by a small prime number before adding each term. This is how String does it:

for (int i = 0; i < len; i++) {
    h = 31*h + val[off++];
}

Note that this causes a string's hashcode to be “driven” by its trailing characters; integer overflow during the repeated multiplication discards most of the contribution of the leading characters. This is OK, as long as the strings end with different characters — which normal text almost always will. As I recall, the original (1.1.x) implementation of String.hashCode() gave greater weight to the leading characters in the string, which turned out to be a very bad thing when processing HTTP URLs (which all start with “http://”).

Some Myths About Hashcodes

A lot of people think that a hashcode is a unique identifier for an object. This is wrong, and the reason should be obvious: a Java hashcode is 32 bits. As soon as you have objects with 33 bits of data, you're guaranteed to have two objects with the same hashcode. Depending on the hash function, you can have collisions with less than 32 bits of object data: the strings “Az” and “C<”, for example.

Then there's the “identity” hashcode returned by Object.hashCode(). The myth is that it returns a pointer to the object, which in a 32-bit JVM would be a unique identifier. Perhaps it does return a pointer, at least on the first call; it is, after all, implemented as a native method. However, objects are permitted to move around in the heap, so if the identity hashcode is indeed a pointer, there is no guarantee that two objects won't (over time) have the same value.

HashMap

OK, so every object has a hashcode; what's the point? For just a moment, pretend that a hashcode is indeed a unique identifier for the object. If that were the case, then you could create a map around an object array that had 2^32 elements: compute the key's hashcode, and return whatever you found at that spot in the array.

But, as I've said, hashcodes aren't unique identifiers, and an array containing 2^32 objects would be too big to fit in the heap (at least on a 32-bit JVM). But the idea still holds: take an array with 1,000 elements, reduce the hashcode modulo 1,000, and return whatever is at that index.

But then what happens if two objects map to the same index? They don't even have to have the same hashcode: with our thousand element table, one object could have hashcode 1001, and the second could have hashcode 2001. This is known as a hash collision.

There are many ways to resolve a hash collision. A really dumb approach is to find the next empty cell in the array. So if your hashcode mod 1000 is 723, and cell 723 already has a value, you look in cell 724 … then 725, then 726, and so on. As the table gets full, you'll end up doing a linear search of the entire table looking for an empty cell.

A much smarter way is to have each cell (called a “bucket”) in your hash table hold a list of entries. So when your hashcode resolves to index 723, you can simply add your object onto the end of that list. And this is just what the Sun HashMap does: it defines an array of buckets:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;

And an Entry object is a node in a linked list:

static class Entry<K,V> implements Map.Entry<K,V> {
    final K key;
    V value;
    final int hash;
    Entry<K,V> next;

You'll still run into a problem if many objects have the same hash value, or if they have different hash values that resolve to the same bucket index: you'll traverse a very long linked list.

As long as you have a good hash function, that produces an even distribution of 32 bit numbers from your objects, this is not an issue: if you start to get a long list on a particular bucket, you simply increase the size of the table. Since it's an even distribution, you'll be distributing the objects across more buckets, which means fewer objects to a single bucket.

The Sun implementation uses hash tables that are always a power of two in size (eg: 16 elements, 32, 64, and so on). This is a performance optimization in that you can transform a 32-bit hashcode into the appropriate index by applying a mask (eg, if you have 16 elements in your table, your index is hashcode & 0x0F). However, such tables are very susceptible to bad hash functions; the Sun implementation takes extra steps to rehash supplied values.
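In code, the index computation looks something like this (a sketch of the idea, not a quote of the Sun source):

// supplemental hash: smears the high-order bits downward, so that masking
// with a power-of-two length doesn't depend solely on the low bits
static int supplementalHash(int h)
{
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

// with a power-of-two table, masking replaces the modulo operation
static int indexFor(int hash, int tableLength)
{
    return hash & (tableLength - 1);
}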

The Bottom Line

If you use a HashMap, you need to ensure that the objects that you use as keys have a good hashcode. If you, for example, take a population of 1,000,000 objects, and have a hash function that can produce 1,000 discrete hashcodes, you'll end up with buckets containing 1,000 entries. And this will mean your map, which has an optimal access time of O(1), will turn into a data structure that effectively has O(n) access time.