Thursday, April 15, 2010

intern() and equals()

A lot of programmers have the belief that String.intern() can be used to speed up if-else chains. So instead of writing this:

public void dispatch(String command)
{
    if (command.equals("open"))
        doOpenAction();
    else if (command.equals("close"))
        doCloseAction();
    else if (command.equals("calculate"))
        doCalculateAction();
    // and so on
}

they write this:

public void dispatch(String command)
{
    command = command.intern();
    if (command == "open")
        doOpenAction();
    else if (command == "close")
        doCloseAction();
    else if (command == "calculate")
        doCalculateAction();
    // and so on
}

Their logic seems to be that it's faster to compare references (a single pointer comparison) than to compare strings character by character. And while the only way to know for sure is a benchmark, I'm willing to bet that the intern() approach is actually slower (except, of course, for carefully constructed examples). The reason: string comparisons are quite fast, while intern() has to do a lot of work behind the scenes.

Let's start with a string comparison, since it's available in the src.zip file that comes with the JDK (in this case, JDK 1.6):

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = count;
            if (n == anotherString.count) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = offset;
                int j = anotherString.offset;
                while (n-- != 0) {
                    if (v1[i++] != v2[j++])
                        return false;
                }
                return true;
            }
        }
        return false;
    }

Like all good equals() implementations, this method tests identity first, so interned strings get the fast path even when you call equals(). Which means that, intern() or not, there's no reason to get into the bad habit of using == to compare strings: sooner or later you'll do that with a string that isn't interned, and have a bug to track down.
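
To see why, here's a tiny example (the class name is just for illustration); new String() stands in for any string that arrives from a file, a socket, or user input:

    public class InternTrap
    {
        public static void main(String[] args)
        {
            String a = "open";
            String b = new String("open");      // equal in content, but not interned
            System.out.println(a == b);         // false -- the hard-to-find bug
            System.out.println(a.equals(b));    // true; the identity test at the top of equals() is nearly free
        }
    }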

Back in equals(), the next comparison is length: if two strings have different lengths, they can't be equal. Only then does the method iterate through the actual characters. Many calls will return from the length check alone, and when the lengths do match, real-world strings usually differ within the first few characters. This isn't going to take much time, particularly if the JVM decides to inline the code.

How about intern()? It's a native method, so the answer lives in the JVM source rather than src.zip. That source is also available, and if you trace through the calls, you end up at symbolTable.cpp, which defines the StringTable class. Which is, unsurprisingly, a hash table.

All hash lookups work the same way: first you compute a hashcode, then you use that hashcode to probe a table to find a list of values, then you compare the actual value against every value in that list. In the example above, computing the hashcode for any of the expected strings will take as many memory accesses as all of the comparisons in the if-else chain.
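
To make that concrete, here's a minimal sketch of a string-keyed lookup; the real StringTable is C++ inside the JVM, but the steps are the same:

    // a sketch only: the JVM's StringTable differs in detail, not in the steps
    public class SketchTable
    {
        private final String[][] buckets = new String[1024][];

        public String lookup(String key)
        {
            // step 1: compute the hashcode -- this touches every character of the key
            int h = 0;
            for (int i = 0; i < key.length(); i++)
                h = 31 * h + key.charAt(i);

            // step 2: use the hashcode to pick a bucket
            String[] bucket = buckets[h & (buckets.length - 1)];
            if (bucket == null)
                return null;

            // step 3: compare the key against every entry in that bucket
            for (String candidate : bucket)
                if (key.equals(candidate))
                    return candidate;
            return null;
        }
    }

Step 1 alone reads every character of the command string, which is roughly the work that the entire if-else chain would have done.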

So if intern() doesn't speed up comparisons, what is it good for? The answer is that it conserves memory.

A lot of programs read strings from files and then hold those strings in memory. For example, an XML parser builds up a DOM tree in which each Element instance holds its name, namespace, and perhaps a map of attributes. There's usually a lot of duplication in this data: think of the number of times that <div> and href appear in a typical XHTML file. And since you've read these names from a file, they come into your program as separate String instances; nothing is shared. By passing them through intern(), you're able to eliminate the duplication and hold one canonical instance.
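
As a sketch of where that interning might happen (the handler class here is hypothetical; only intern() itself is standard API), a SAX-based parser could canonicalize names as it sees them:

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // hypothetical handler: canonicalizes element names as the tree is built
    public class TreeBuilder extends DefaultHandler
    {
        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs)
        throws SAXException
        {
            // every <div> in the document now refers to the same String instance
            String name = qName.intern();
            // ... create the Element using the canonical name ...
        }
    }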

The one drawback to String.intern() is that it stores data in the permgen (at least for the Sun JVM), which is often a scarce resource (particularly in app-servers). To get the benefits of canonicalization without risking an OutOfMemoryError, you can create your own canonicalizing map.
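
Here's a minimal sketch of such a map, built on ConcurrentHashMap (the class and method names are mine):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    // stores canonical instances on the normal heap, not in the permgen
    public class Canonicalizer<T>
    {
        private final ConcurrentMap<T,T> instances = new ConcurrentHashMap<T,T>();

        public T intern(T value)
        {
            T existing = instances.putIfAbsent(value, value);
            return (existing != null) ? existing : value;
        }
    }

Note that this version holds its entries for the life of the map; if you need them to be garbage-collected, you'll have to bring weak references into the picture.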

And if you'd like to replace long if-else chains with something more efficient, think about a Map of functors.
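
Here's one possible sketch, using JDK 1.6-era anonymous classes (the class name and the doXxxAction() methods are placeholders from the example above):

    import java.util.HashMap;
    import java.util.Map;

    public class CommandDispatcher
    {
        private final Map<String,Runnable> actions = new HashMap<String,Runnable>();

        public CommandDispatcher()
        {
            actions.put("open",      new Runnable() { public void run() { doOpenAction(); } });
            actions.put("close",     new Runnable() { public void run() { doCloseAction(); } });
            actions.put("calculate", new Runnable() { public void run() { doCalculateAction(); } });
        }

        public void dispatch(String command)
        {
            Runnable action = actions.get(command);
            if (action != null)
                action.run();
            // else: unknown command -- log it, throw, or ignore, as appropriate
        }

        private void doOpenAction()      { /* ... */ }
        private void doCloseAction()     { /* ... */ }
        private void doCalculateAction() { /* ... */ }
    }

For a long chain this scales better: adding a new command means adding a map entry rather than another else-if.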

Tuesday, April 13, 2010

Book Review: Coders at Work / Founders at Work

I just finished reading Coders at Work. I received it as an early Christmas present, and while I've read several other books in the interim, getting to the end was a struggle. This is in sharp contrast to its companion volume, Founders at Work, which I bought last summer and read over the course of a week.

Both books consist of interviews with more-or-less well-known people. In Founders, these range from Steve Wozniak, talking about the early days at Apple, to James Hong, who found himself dealing with the viral growth of Hot or Not. In Coders, the interviews range from Simon Peyton Jones (one of the creators of Haskell) to Donald Knuth (who should need no introduction). All of them have fascinating histories.

So why did I like one book and not the other? At first I thought it was because I was more familiar with the programmers' stories, particularly those of people who entered the field around the same time I did. In comparison, the founders' stories were new to me: the challenges of dealing with a viral website, the hunt for funding. Those stories seemed particularly relevant to me at the time, given that I had just left my full-time job with the thought of founding a software business.

But as I plowed through Coders, I realized that the difference was in the interviewers, not the interviewed. Jessica Livingston, the author of Founders, seemed to let her interviewees go wherever they wished: Woz, for example, took three pages to describe how he got the Apple floppy drive to work. Peter Seibel, by comparison, seemed to have a set list of questions, and forced each interviewee to respond to those questions. At one point I thought of turning “literate programming” into a drinking game, but realized I would be too drunk to ever finish the book.

This approach not only made the interviews unfocused, it made them long. If you put the two books side by side, they appear to be the same size. That's misleading: Founders is 466 pages, while Coders is 617. More important, the former has 33 interviews, the latter only 15. It's easy to read 15 pages in a sitting, but you have to plan for 40+ — or put the book down halfway through and try to regain context when you pick it up.

Bottom line: if you want to learn about the history of our industry, both books are a good choice. If you want to enjoy the process of learning, stick with Founders.

Tuesday, April 6, 2010

The Trouble with Wiki

If the comments don't match the code, they're both wrong.

I've been thinking about this maxim a lot. I recently started a new job, and have been busy learning the codebase. One of the tools that I've been using is the corporate wiki, and once again I'm faced with “wiki rot”: pages that are no longer relevant, or have broken links, or were started with the best of intentions and then abandoned. I suppose I should feel lucky: I know one company that stored status reports in their wiki; the chance of a search returning anything useful at that company was pretty low.

Don't get me wrong. I think that wikis were a great invention. Wikipedia is my first stop for answers on a wide range of topics. And I learned much of what I know about patterns and Agile methodology from Ward's wiki (although by the time I started reading, much of the original community had seemingly vanished, replaced by a lot of insecure whiners).

At its best, a wiki gives you a place to write notes that can be shared within a team. It's a natural transition from the “project notebook” that Brooks recommended in The Mythical Man-Month. But as long-term project documentation? Not so good.

Projects evolve, and all too often wiki pages don't. After all, the people who spend their days working on a project know how it works. They know the environment variables required for a successful build, and the configuration needed for a successful run. And when new people join the project, it's easier (once you realize the wiki is wrong) to just ask someone. And no-one has time to update the wiki …

What's to be done? Wikipedia has one answer: a large community of people who passionately update pages. But the key word there is “large.” In a corporate environment, you simply don't have this community. And if you have a corporate practice of putting everything into the wiki, it's not going to be useful even if you do have a community of dedicated maintainers.

I think that a solution for corporate wikis starts with organization by project or product rather than organizational unit. This change is subtle, meant to tap into the developers' feelings of craftsmanship: “these pages are the public face of my work.” An empty project wiki stares team members in the face in the same way “fixme” comments do in code — admittedly, that doesn't work for all teams.

A second, perhaps more important step is to archive all pages that haven't been edited in the last year. Keep them available, but not easily available: don't show them in basic searches, and highlight any links to them from current pages. In the best case, this will prompt team members to keep their pages up-to-date. In the worst, it will keep casual users from wasting time with incorrect documentation — which is valuable in itself.