Tuesday, September 29, 2009

Intern Isn't Forever ... And Maybe Never Was

Today I managed to hit myself over the head with a cluebat, proof that there's always something to learn and that you should never accept dogma. In this case, the dogma was that interned strings never get garbage-collected. Like all good dogma, it combined a few facts with a leap of faith, and was plausible enough that I never challenged it.

The facts first (actually, only one fact): interned strings are stored in the same pool as literal strings. This is explicitly stated in the JavaDoc for String.intern(), and can be demonstrated with the following code:

String a = "Are we having fun yet?";
String b = new String(a);
System.out.println(a == b);     // false: b is a distinct object on the heap

String c = b.intern();
System.out.println(a == c);     // true: intern() returns the pooled literal

And now the leap of faith: the JVM doesn't clean up the constant pool. Seems plausible: after all, two identical string literals are guaranteed to be the same object. And intern() is a native method, so it must be doing something tricky behind the scenes. And everybody else says you'll cause bugs if you intern too many strings, so …

A skeptic might ask “how can you tell that two string literals aren't the same if you don't have references to both?” I even said as much when I wrote about canonicalizing maps (an article that got some edits today). Once all references to a string go out of scope (including any references within a class definition), then there's no need to keep that string in the pool. And the JVM doesn't — at least, the Sun JVM doesn't.
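The same reasoning applies to a hand-rolled canonicalizing map. What follows is a minimal sketch, assuming a WeakHashMap-based design; the class name StringCanonicalizer is hypothetical, and this is not necessarily the code from the article mentioned above.

```java
import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch of a canonicalizing map: like intern(), it returns one
// shared instance per distinct value, but an entry becomes eligible for
// collection once nothing else holds a strong reference to that instance.
class StringCanonicalizer
{
    // the value must be a weak reference: a strong reference from the value
    // back to its own key would keep the entry alive forever
    private final Map<String,WeakReference<String>> pool = new WeakHashMap<>();

    public synchronized String canonicalize(String str)
    {
        WeakReference<String> ref = pool.get(str);
        String canonical = (ref != null) ? ref.get() : null;
        if (canonical == null)
        {
            // first time this value has been seen (or the old copy is gone)
            canonical = str;
            pool.put(canonical, new WeakReference<>(canonical));
        }
        return canonical;
    }
}
```

Two equal strings passed to canonicalize() come back as the same object, for as long as some caller still holds on to the first one.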

This particular cluebat entered the picture because I'm currently writing an article on out-of-memory errors, and wanted a program to demonstrate permgen failures. So I wrote a loop that interned big, random-content strings … and nothing happened. I killed the program after I finally realized that it wasn't going to die on its own.

But the dogma must have some basis in fact, right? It happens that I have a machine with Sun JVMs from version 1.2 on up. So I ran my test program on each revision, and while the -verbose:gc output changed, the result did not: all of these versions appear to clean up the string pool. Is it possible that a 1.1 release is the source of this dogma? Perhaps, and if someone still has one installed, here's the program:

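// Interns an endless stream of large random strings. If the string pool were
// never cleaned up, this program would eventually run out of memory.
public class InternExhaustion
{
    public static void main(String[] argv)
    throws Exception
    {
        while (true)
        {
            String str = generateRandomString(65536);
            str.intern();   // result discarded: nothing references the pooled string
        }
    }

    // builds a string of random characters in the range ' ' (32) through 127
    private static String generateRandomString(int length)
    {
        char[] chars = new char[length];
        for (int ii = 0 ; ii < length ; ii++)
            chars[ii] = (char)(96 * Math.random() + ' ');
        return new String(chars);
    }
}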

For myself, I have some edits to make. And a lump on the head to remind me to question dogma.

Thursday, September 24, 2009

Je vais m'amuser

If you're a native French speaker, you may be giggling to yourself at the title of this post. If, like me, you had two semesters of French in college, you may see nothing wrong — until your Parisian instructor starts giggling. “I will amuse myself.” The difference is idiom: in America, amusing oneself is harmlessly waiting for another; in France … well, it's still harmless, but that's not the point.

Programming idiomatically means understanding a language at a level deeper than its syntax. In Python, it means using list comprehensions rather than explicit iteration, or a generator rather than a physical list. In Java, it means using enum rather than constants. In any object-oriented language, it means “tell, don't ask.”
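To make the Java case concrete, here is a minimal sketch contrasting the two styles. The names LegacyColors, SignalColor, and describe() are invented for illustration.

```java
// Pre-enum style: the constants are plain ints, so the compiler accepts
// any int where a color is expected.
class LegacyColors
{
    public static final int RED = 0;
    public static final int GREEN = 1;

    public static String describe(int color)
    {
        switch (color)
        {
            case RED:   return "red";
            case GREEN: return "green";
            default:    return "unknown";   // describe(42) compiles and lands here
        }
    }
}

// Idiomatic style: each constant is a typed object that carries its own data,
// so there is no "unknown" case to handle.
enum SignalColor
{
    RED("red"), GREEN("green");

    private final String description;

    SignalColor(String description)
    {
        this.description = description;
    }

    public String describe()
    {
        return description;
    }
}
```

With the enum, a call such as describe(42) has no equivalent: the only values that exist are the ones the enum declares.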

A hallmark of programmers who are new to a language is their lack of idiom. They write the new language as if it were the old, often bemoaning the loss of their familiar constructs (I'm no different: as I was learning Java I wished for C-style function pointers). There's an old saying that “You can write Fortran in any language.”

A danger of idioms is that they can turn into slang, used to add color rather than value, a way to signify membership in a clique. Duff's device is an extremely clever use of C's switch statement, and arguably a brilliant optimization (although I have trouble believing that the cycles saved from unrolling loops weren't immediately lost waiting for the IO register). But when used for copying memory, it loses all cleverness: replacing a CISC hardware instruction with discrete moves, or preventing a RISC compiler from applying its own optimizations.

More dangerous are the programmers who never spend enough time with a language to learn its idioms. They are the ones who code randomly: an ArrayList here, a LinkedList there, reading all rows from a database so that they can do a sum on the client. Exposure to different languages can be a great learning experience, as long as you learn enough to know when someone wants a little time alone.

Monday, September 14, 2009

Short Attention Span Theatre, or, What Happens When Your Page Becomes Delicious

Last week my article on Java reference objects managed to find its way to the “popular Java” page of delicious.com (and, within a day or two, drop off again). Normally, this article gets hit a dozen times a day, usually by people who are Googling for “how to fix out of memory errors.” During the week that it appeared on delicious.com (which was preceded by a lot of tweeting), it was getting hit 150+ times a day, including 49 times in short order from one London-based ISP (which I assume was a proxy).

What was interesting to me, however, was that almost none of the people visiting this page went to any other page on the site — and of the ones that did, more went to the “food” section than to the “programming” section. And they tended to do so quickly: under a minute between initial page hit and moving on. I know that it takes longer than that to read about reference objects.

To an author, this should be a blow to the ego: is my writing so bad that people don't want to read more of it? It's like they picked up my book, read the dust jacket, and put it back down. But I refuse to accept that conclusion. For one thing, I have a strong ego. But more important, enough people bookmarked the page for it to appear as a popular link.

The real answer, I think, is that people in the software industry are very quick to jump on new ideas, but don't have a lot of follow-through. And why should they, when there's a plethora of well-known pundits saying they should explore as many new things as they can, and there's always a new popular link showing up on delicious.com?

The problem with this attitude is that exposure to a lot of ideas doesn't mean that any of them will stick. In fact, I'd say the opposite is true: if you don't take the time to examine an idea and see how it fits with your existing knowledge, you're not going to ever make use of it.

Thursday, September 10, 2009

Building a Wishlist Service: External Libraries

One of the great things about developing Java applications is the wealth of open-source libraries available. The Jakarta Commons libraries have been part of every project that I've done over the past five years; you can find StringUtils.isEmpty() in almost all text manipulation code that I write.

However, the wealth of open-source libraries presents a paradox of choice: which libraries do you use for a project? Each library adds to the memory footprint of your project, either directly as classes are loaded, or indirectly as the JVM memory-maps the library's JAR. External libraries also make dependency management in your build more complex, in some cases forcing you to build the library locally.

More important, every library represents a form of lock-in: once your code is written to conform to the library, it will be expensive to change. And if you discover a bug or missing feature, you'll need to develop a remediation plan. Even if you can code a patch, it will take time to integrate with the mainline code — assuming that it is accepted. In some cases you may find yourself maintaining a private fork of the library over several public releases.

All of which is to say: use open source, but pick your libraries carefully.

In the case of the product list service, one of the places where I considered external libraries was XML management, in particular conversion between XML and Java beans. There are lots of libraries that handle this: XMLBeans and XStream are two that are commonly used, and the JDK provides its own serialization and deserialization classes as part of the java.beans package.

Of these, XStream seemed to be the best choice: XMLBeans requires a separate pre-compile step, while the JDK's serialization format would require a lot of work on the part of any non-Java client. However, I had another alternative: I am the administrator and main developer on Practical XML, an open-source library for XML manipulation. It didn't support XML-object conversion, but I also had some converter classes that I'd written before XStream became popular. I figured that it would take a minimal amount of work to flesh out those classes and integrate them into the library.

I have an incentive to evolve the Practical XML library, and to use it in all of my projects. However, adding this functionality introduced a two-week diversion into my project. In this case the delay didn't matter: I have no hard deadlines on this project. And since I was already using the library in other places, I had the benefit of consistency and reduced footprint. Faced with an unmovable ship date, my decision would have been different.

Friday, September 4, 2009

Building a Wishlist Service: Template Method

When I'm giving an interview, one of my favorite questions is “Are you familiar with Design Patterns? Please describe your favorite pattern, tell me why you like it, and give an example where you've recently used it.” Historically, if I were to be asked that question, my answer would be Strategy; currently, I think it would be Template Method. Which is interesting, because those two patterns basically do the same thing from different directions.

A description first: Template Method is a pattern for building class hierarchies in which the base class of the hierarchy imposes logic on its subclasses. The base class defines public methods that call abstract protected methods, which each subclass must implement. For example, the base class might define a validate() method:

public final void validate()
{
    validateEncryptedParam();
    validateExpiration();
    validateRequiredParams();
    subclassValidation();
}

Historically I've liked Template Method because it avoids bugs caused by subclasses that don't invoke their super's implementation of a method (I take the attitude that “subclasses are like ogres,” which is a topic for a future post). However, as I'm realizing with this service, it also highlights duplicated code, and allows that code to be moved into the superclass. In this example, there's only one method that's actually implemented by the subclass (and its name should make it obvious).
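Here is a sketch of how the pieces around validate() might fit together. The method names come from the example above; the class names, the placeholder bodies, and the recorded list of steps are assumptions added so that the sketch can run on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the base class: validate() is final, so a subclass cannot skip
// or reorder the common checks; it can only fill in the one abstract step.
abstract class ValidatingHandlerSketch
{
    // records the order of calls; exists only to make the sketch observable
    private final List<String> steps = new ArrayList<>();

    public final void validate()
    {
        validateEncryptedParam();
        validateExpiration();
        validateRequiredParams();
        subclassValidation();
    }

    // placeholder bodies: the real checks are not shown in this post
    private void validateEncryptedParam()   { steps.add("encryptedParam"); }
    private void validateExpiration()       { steps.add("expiration"); }
    private void validateRequiredParams()   { steps.add("requiredParams"); }

    // the only step that a subclass implements
    protected abstract void subclassValidation();

    protected void record(String step)      { steps.add(step); }
    public List<String> getSteps()          { return steps; }
}

// Hypothetical subclass: it supplies only the operation-specific step.
class AddItemHandlerSketch extends ValidatingHandlerSketch
{
    @Override
    protected void subclassValidation()
    {
        record("subclass");
    }
}
```

Because the common steps are private and validate() is final, there is no super call for a subclass to forget.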

I think it also pushes you toward moving responsibilities out of the class hierarchy entirely. In this example, validateRequiredParams() was originally intended for the subclass, which knew what parameters it needed. But the actual validation code was common to all subclasses, so I changed the method to simply ask the subclass for a list of its required parameters. A little more thought, and I realized that this isn't really an attribute of the subclass, but of the operation itself. So I added the list of parameters to the enum defining the service's operations (and this spurred writing an article for my website):

public enum Operation
{
    // only the third example has a required parameter

    RetrieveList(HttpMethod.GET, null),
    AddItem(HttpMethod.POST, ProductEntry.class),
    DeleteItem(HttpMethod.POST, null, RequestParameter.itemId),
    // ...
}
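The enum above omits its constructor. One way to fill it in is sketched below, assuming a varargs parameter for the required list. HttpMethod, RequestParameter, and ProductEntry are empty stand-ins for types that this post names but does not show, and the field names and findMissingParams() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Stand-ins for types that the post names but does not show.
enum HttpMethod { GET, POST }
enum RequestParameter { itemId }
class ProductEntry { }

enum Operation
{
    RetrieveList(HttpMethod.GET, null),
    AddItem(HttpMethod.POST, ProductEntry.class),
    DeleteItem(HttpMethod.POST, null, RequestParameter.itemId);

    private final HttpMethod method;
    private final Class<?> requestDataClass;
    private final List<RequestParameter> requiredParams;

    // the varargs parameter lets most operations omit the list entirely
    Operation(HttpMethod method, Class<?> requestDataClass, RequestParameter... requiredParams)
    {
        this.method = method;
        this.requestDataClass = requestDataClass;
        this.requiredParams = Collections.unmodifiableList(Arrays.asList(requiredParams));
    }

    public HttpMethod getMethod()                       { return method; }
    public Class<?> getRequestDataClass()               { return requestDataClass; }
    public List<RequestParameter> getRequiredParams()   { return requiredParams; }

    // returns the required parameters that are absent from the request
    public List<RequestParameter> findMissingParams(Map<String,String> requestParams)
    {
        List<RequestParameter> missing = new ArrayList<>();
        for (RequestParameter param : requiredParams)
        {
            if (!requestParams.containsKey(param.name()))
                missing.add(param);
        }
        return missing;
    }
}
```

With the list attached to the operation, the base class can run one common check such as findMissingParams() and never ask the subclass anything.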

This kind of thinking, taken to an extreme, will take you out of Template Method and into objects constructed according to a configuration file. And that's not a bad thing: if your subclasses don't have behavior, they don't deserve to exist.

Wednesday, September 2, 2009

Building a Wishlist Service: Testing

There are a lot of ways to test a web application: in-container testing using tools such as Cactus, full client-server testing using tools such as HttpUnit, out-of-container unit tests, and even manual tests.

I'm not a fan of any test approach that requires starting the application, because you won't want to do that on every build. So most of the tests for my service are unit tests, meant to run outside of the container. And although I knew that this would happen, I'm still amazed at how such tests push me toward decoupled code, even when I write the tests after the code.

The servlet's dispatch mechanism is a prime example. Going in, I knew that I wanted a RequestHandler class for each operation. My first pass used an interface, with a getRequestHandler() method inside the servlet class. However, the only way to test that method was to make it public or protected, and I'm not willing to violate encapsulation for testing. So RequestHandler became an abstract class, and getRequestHandler() became a static factory method. At the same time, I decided to instantiate a handler object per request, rather than reuse a table of singleton objects: the latter was easier to manage, but the former was easier to test.
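The resulting shape can be sketched roughly as follows. RequestHandler and getRequestHandler() are the names used above; the Op enum, the two concrete handlers, and the handle() method are placeholders invented for this sketch.

```java
// Sketch: RequestHandler as an abstract class with a static factory method,
// so a test can call getRequestHandler() without touching the servlet.
abstract class RequestHandler
{
    // stand-in for the service's real list of operations
    enum Op { RETRIEVE_LIST, ADD_ITEM }

    // creates a new handler for every request; nothing is shared between calls
    public static RequestHandler getRequestHandler(Op op)
    {
        switch (op)
        {
            case RETRIEVE_LIST: return new RetrieveListHandler();
            case ADD_ITEM:      return new AddItemHandler();
            default:
                throw new IllegalArgumentException("unsupported operation: " + op);
        }
    }

    public abstract String handle();
}

class RetrieveListHandler extends RequestHandler
{
    @Override
    public String handle() { return "retrieved list"; }
}

class AddItemHandler extends RequestHandler
{
    @Override
    public String handle() { return "added item"; }
}
```

A unit test can then check the dispatch logic with a plain method call, and per-request instances mean that no state leaks from one test to the next.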

Unit-testing servlet components means a lot of stubbing and mocking, and I decided to create a reflection proxy for the servlet request and response objects. I realize that there are existing mock implementations for these objects, but I figured that I wasn't using many of their methods and wanted to limit the use of third-party libraries (that's the topic of another post).

And that led to another observation: writing your own mock objects tends to reduce the number of places you touch those objects. If I have to mock both getParameter() and getParameterValues(), I'm going to think about why I call both methods, and probably use just one of them. This should translate to a reduced chance of errors, in this case because I'll be cognizant of cases where there may be more than one parameter with the same name.
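A reflection proxy of the kind described above can be sketched with java.lang.reflect.Proxy. To keep the example free of the servlet API, MinimalRequest is a hypothetical stand-in for HttpServletRequest, and RequestMock is an invented name; a real version would pass HttpServletRequest.class to the proxy instead.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Stand-in for the few HttpServletRequest methods a handler might call.
interface MinimalRequest
{
    String getParameter(String name);
    String getMethod();
}

class RequestMock implements InvocationHandler
{
    private final Map<String,String> params = new HashMap<>();

    public void setParameter(String name, String value)
    {
        params.put(name, value);
    }

    @Override
    public Object invoke(Object proxy, Method method, Object[] args)
    {
        if (method.getName().equals("getParameter"))
            return params.get((String)args[0]);

        // any method that isn't mocked fails loudly, which tells the test
        // author that the code under test touches more than expected
        throw new UnsupportedOperationException("not mocked: " + method.getName());
    }

    public MinimalRequest getProxy()
    {
        return (MinimalRequest)Proxy.newProxyInstance(
                    MinimalRequest.class.getClassLoader(),
                    new Class<?>[] { MinimalRequest.class },
                    this);
    }
}
```

Only getParameter() is mocked here, so a call to anything else throws. That failure is the prompt to ask whether the code under test really needs the extra method.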

There's another effect that I'm seeing from writing for testability: I tend to use the Template Method pattern. A lot. Or perhaps it's simply a natural pattern for web services. I'll look at that more closely in the next post.