Thursday, April 15, 2010

intern() and equals()

A lot of programmers have the belief that String.intern() can be used to speed up if-else chains. So instead of writing this:

public void dispatch(String command)
{
    if (command.equals("open"))
        doOpenAction();
    else if (command.equals("close"))
        doCloseAction();
    else if (command.equals("calculate"))
        doCalculateAction();
    // and so on
}

they write this:

public void dispatch(String command)
{
    command = command.intern();
    if (command == "open")
        doOpenAction();
    else if (command == "close")
        doCloseAction();
    else if (command == "calculate")
        doCalculateAction();
    // and so on
}

Their logic seems to be that it's faster to compare integers than to compare strings. And while the only way to know for sure is a benchmark, I'm willing to bet that the intern() approach is actually slower (except, of course, for carefully constructed examples). The reason being that string comparisons are quite fast, while intern() has to do a lot of work behind the scenes.

Let's start with a string comparison, since it's available in the src.zip file that comes with the JDK (in this case, JDK 1.6):

    public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = count;
        if (n == anotherString.count) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = offset;
            int j = anotherString.offset;
            while (n-- != 0) {
                if (v1[i++] != v2[j++])
                return false;
            }
            return true;
        }
    }
    return false;
    }

Like all good equals() implementations, this method tests identity equality first. Which means that, intern() or not, there's no reason to get into the bad habit of using == to compare strings (because sooner or later you'll do that with a string that isn't interned, and have a bug to track down).

The next comparison is length: if two strings have different length, they can't be equal. And finally, iteration through the actual characters of the string. Most calls should return from the length check, and in almost all real-world cases the strings will differ in the first few characters. This isn't going to take much time, particularly if the JVM decides to inline the code.

How about intern()? The source code for the JDK is available, and if you trace through method calls, you end up at symbolTable.cpp, which defines the StringTable class. Which is, unsurprisingly, a hash table.

All hash lookups work the same way: first you compute a hashcode, then you use that hashcode to probe a table to find a list of values, then you compare the actual value against every value in that list. In the example above, computing the hashcode for any of the expected strings will take as many memory accesses as all of the comparisons in the if-else chain.

So if intern() doesn't speed up comparisons, what is it good for? The answer is that it conserves memory.

A lot of programs read strings from files and then hold those strings in memory. For example, an XML parser builds up a DOM tree in which each Element instance holds its name, namespace, and perhaps a map of attributes. There's usually a lot of duplication in this data: think of the number of times that <div> and href appear in a typical XHTML file. And since you've read these names from a file, they come into your program as separate String instances; nothing is shared. By passing them through intern(), you're able to eliminate the duplication and hold one canonical instance.

The one drawback to String.intern() is that it stores data in the permgen (at least for the Sun JVM), which is often a scarce resource (particularly in app-servers). To get the benefits of canonicalization without risking an OutOfMemoryError, you can create your own canonicalizing map.

And if you'd like to replace long if-else chains with something more efficient, think about a Map of functors.

No comments: