Tuesday, April 7, 2009

Strings sharing backing arrays

For some reason, I thought this bug was fixed years ago … it was reported in 2001, after all. Yet today I was poking around in the JDK 1.5 source, and what do I find?

return ((beginIndex == 0) && (endIndex == count)) ? this :
    new String(offset + beginIndex, endIndex - beginIndex, value);

This particular string constructor is package-private, and exists solely for creating strings that share an existing backing array:

String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}
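To make the sharing concrete, here's a toy model of the scheme. This is a sketch, not the JDK's code: SharedSlice, compact(), and retainedChars() are names I made up for illustration, and the bounds checks are left out. It keeps the same three fields and the same substring shape as the code quoted above.

```java
import java.util.Arrays;

// Toy model of the sharing scheme described above. SharedSlice is a
// hypothetical class written for illustration; it is not the JDK's String.
final class SharedSlice {
    private final char[] value;  // backing array, possibly shared
    private final int offset;    // index of the first character in value
    private final int count;     // number of characters in this slice

    SharedSlice(String s) {
        this(0, s.length(), s.toCharArray());
    }

    // Mirrors the package-private constructor: no copying, just bookkeeping.
    private SharedSlice(int offset, int count, char[] value) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Same shape as the substring code quoted above (bounds checks omitted).
    SharedSlice substring(int beginIndex, int endIndex) {
        return ((beginIndex == 0) && (endIndex == count)) ? this
            : new SharedSlice(offset + beginIndex, endIndex - beginIndex, value);
    }

    // Copies only the characters this slice needs into a fresh array.
    SharedSlice compact() {
        char[] copy = Arrays.copyOfRange(value, offset, offset + count);
        return new SharedSlice(0, count, copy);
    }

    int length() { return count; }

    // Size of the array this slice keeps reachable.
    int retainedChars() { return value.length; }

    @Override
    public String toString() { return new String(value, offset, count); }
}
```

In this model, a six-character slice taken from a long line still reports the full line's length from retainedChars(), because it holds a reference to the whole array. Only after compact() does the retained size drop to the slice's own length.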

So why is this important? Let's say that you're extracting two fields from each line of a large file of fixed-length lines of text. The file contains a gigabyte of data in millions of lines, but you figure you're OK because you're only saving 11 characters per line:

while ((line = reader.readLine()) != null)
{
    values.put(line.substring(0, 5), line.substring(17, 23));
}

Will you be surprised by the OutOfMemoryError? It happens because those substring calls don't actually extract characters from the large string. Instead, they create new strings that reference the same character array. Once you put them in the map, that character array can no longer be collected, even though the line string that originally owned it can. The solution is one of the few cases where you should call a string constructor directly:

while ((line = reader.readLine()) != null)
{
    String s1 = new String(line.substring(0, 5));
    String s2 = new String(line.substring(17, 23));
    values.put(s1, s2);
}

This constructor, which seems to do nothing other than copy the string, actually has a lot of logic inside: if it determines that the source string has a backing array that's larger than it needs, it creates a new backing array. In this example, the new backing arrays leave nothing referring to the array behind line, so that array becomes unreachable and can be collected. No more OutOfMemoryError.
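For anyone who wants to try the loop, here's a self-contained sketch of it. The class name FieldExtractor, the extract method, and the guard that skips short lines are my own additions for illustration; the column positions are the ones from the example above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper around the loop shown above.
final class FieldExtractor {
    private static final int MIN_LINE_LENGTH = 23; // end of the second field

    // Reads fixed-length lines and keeps two short fields from each one.
    static Map<String, String> extract(Reader in) throws IOException {
        Map<String, String> values = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.length() < MIN_LINE_LENGTH) {
                continue; // assumption: malformed short lines are skipped
            }
            // new String(...) gives each field its own storage, so nothing
            // in the map holds a reference to the characters of the line.
            String s1 = new String(line.substring(0, 5));
            String s2 = new String(line.substring(17, 23));
            values.put(s1, s2);
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample record: key in columns 0-4, value in columns 17-22.
        String data = "K0001" + "------------" + "V00001" + " rest of the record\n";
        System.out.println(extract(new StringReader(data)));
    }
}
```

Passing a Reader rather than a file name keeps the sketch easy to exercise with a StringReader; for a real file you would hand it a FileReader instead.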

This leaves the question: when and where should you use this technique? There are lots of cases where you'll know that you're working with small strings, or will be breaking a large string into pieces and keeping all of the pieces. In these cases, the default behavior makes sense.

But in cases where you don't know what will become of the substring — such as when you're writing a utility method — by all means construct a new string. Object creation, we're told, is extremely fast. Is it worth running out of memory to save a few cycles?
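As a sketch of what such a utility method might look like (Fields and detachedSubstring are hypothetical names, not anything from the JDK):

```java
// Hypothetical helper class for illustration.
final class Fields {
    private Fields() {}

    // Returns the characters in [begin, end) as a new String object, so the
    // caller can store the result without also retaining the source string.
    static String detachedSubstring(String source, int begin, int end) {
        return new String(source.substring(begin, end));
    }
}
```

Callers that know they're keeping every piece of the source can still call substring directly; the helper is for code that can't know what happens to the result.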
