Tuesday, October 19, 2010

Living Among Primitives, part 2: Data Structures

It may seem strange to talk about “living among primitives” and then jump to data structures, but the truth is that most programs work with structured data. So even in a world where you've turned to primitives, you need to think about how you'll organize those primitives.

And the best way, surprise, is to use Java objects — but with an eye toward the overheads that they introduce. In the last post I noted that an Integer adds 8 bytes of overhead to 4 bytes of data. That number came from Java Platform Performance: Strategies and Tactics, and I suspect it's higher for 64-bit JVMs: I recall reading in an early Hotspot doc that the overhead consisted of a reference to the object's class (which would be 4 bytes under 32-bit, 8 under 64-bit) and the object's identity hashcode. Arrays add another 4 bytes of overhead to hold their length.

As a case study for the rest of this post, I'm going to look at a three-dimensional Point class:

public class Point
{
    private float _x;
    private float _y;
    private float _z;
}

The data in this object takes up 12 bytes (a float is a 32-bit IEEE 754 floating-point value), and we'll be conservative on overhead and say it's 8 bytes (who knows, perhaps a 64-bit JVM will efficiently encode a reference to the object's class). So a total of 20 bytes per instance, 50 million instances per gigabyte. But you'll need something to manage those 50 million instances:

Point[] points = new Point[50000000];

On a 64-bit JVM, each element in that array is a reference, consuming 8 bytes; that means nearly half a gigabyte just to keep track of your gigabyte of points! And from here on out I'm going to assume that you have data volumes that require a 64-bit JVM.
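For the record, that arithmetic works out like this (the 8-byte object header is the assumption from above, and the 8-byte reference applies to a 64-bit JVM without compressed references):

```java
public class PointMemoryMath
{
    public static void main(String[] args)
    {
        long dataBytes = 3 * 4;                     // three 4-byte floats per Point
        long headerBytes = 8;                       // assumed per-object overhead
        long perInstance = dataBytes + headerBytes; // 20 bytes per instance

        long perGigabyte = (1L << 30) / perInstance;    // ~53 million points fit in 1 GB
        long refArrayBytes = 50000000L * 8;             // the references alone: 400 MB

        System.out.println(perInstance + " bytes/point, "
                + perGigabyte + " points/GB, "
                + refArrayBytes + " bytes of references");
    }
}
```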

One way to deal with this is to turn the structure inside-out: rather than an array of points, why not an object that holds arrays of values:

public class PointHolder
{
    private float[] _x;
    private float[] _y;
    private float[] _z;

    public PointHolder(int size)
    {
        _x = new float[size];
        _y = new float[size];
        _z = new float[size];
    }

    public float getX(int index)
    {
        return _x[index];
    }

    // and so on
}

This approach may make your code a little more complex, but it eliminates 16 bytes of overhead per point: the 8-byte reference in the original array, plus the 8 bytes of per-object overhead. You can now fit nearly 90 million points in a gigabyte. And it has another benefit: you can change the implementation without changing the code that uses it. For example, if you're dealing with truly huge arrays of points, you could move them off the heap entirely, using a memory-mapped file.
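To make that last idea concrete, here's a minimal sketch of what a file-backed variant might look like, using java.nio memory mapping. The class name, the set() method, and the per-coordinate file names are my own invention, and a real implementation would need more: a single mapping is limited to Integer.MAX_VALUE bytes, so truly huge stores would have to stripe across several mappings, and you'd want to think about flushing and cleanup.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;

// A sketch of an off-heap PointHolder: each coordinate array becomes a
// memory-mapped file, so the heap holds three small buffer objects no
// matter how many points the store contains.
public class MappedPointStore
{
    private final FloatBuffer _x;
    private final FloatBuffer _y;
    private final FloatBuffer _z;

    public MappedPointStore(File dir, int count) throws IOException
    {
        _x = map(new File(dir, "x.dat"), count);
        _y = map(new File(dir, "y.dat"), count);
        _z = map(new File(dir, "z.dat"), count);
    }

    private static FloatBuffer map(File file, int count) throws IOException
    {
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        try
        {
            // 4 bytes per float; the mapping remains valid after the file is closed
            return raf.getChannel()
                      .map(FileChannel.MapMode.READ_WRITE, 0, 4L * count)
                      .asFloatBuffer();
        }
        finally
        {
            raf.close();
        }
    }

    public void set(int index, float x, float y, float z)
    {
        _x.put(index, x);
        _y.put(index, y);
        _z.put(index, z);
    }

    public float getX(int index) { return _x.get(index); }
    public float getY(int index) { return _y.get(index); }
    public float getZ(int index) { return _z.get(index); }
}
```

Because the operating system pages the mapped data in and out as needed, the working set adapts to your access pattern; the JVM heap stays small regardless of how many points you store.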

The problem with “turning the object inside-out” is that you have to give careful thought to how you work with the data: for example, how do you sort these points? How do you find a particular point? That will be the topic of the next few posts.
