Wednesday, October 27, 2010

Living Among Primitives, part 3: Collections

The last post “turned an object inside-out” to eliminate the overhead of a list of discrete objects. While that technique allows you to pack a lot of objects into a relatively small amount of space, your programs will be a lot cleaner if you can pass around Point objects rather than a PointHolder and its index. The java.util.AbstractList class can help you reach a middle ground.

public class PointHolder2
extends AbstractList<Point>
    private int _size;
    // ...

    public int size()
        return _size;

    public boolean add(Point point)
        _x[_size] = point.getX();
        _y[_size] = point.getY();
        _z[_size] = point.getZ();
        return true;

    public Point get(int index)
        return new Point(_x[index], _y[index], _z[index]);

    public Point set(int index, Point point)
        _x[index] = point.getX();
        _y[index] = point.getY();
        _z[index] = point.getZ();
        return point;

AbstractList is a tool for implementing lists with an application-specific backing store. Out of the box, you have to implement the get() and size() methods, and the setter methods all throw UnsupportedOperationException. But as you can see, it doesn't take a lot of work to implement these, and the result is nearly identical in terms of performance to an ArrayList<Point>.

And the Point class has reappeared, making your application code cleaner. But now they're created as-needed, and will be collected as soon as the program is done with them; you won't pay the memory overhead to keep them around. And assuming that they're relatively short-lived once created, they should rarely if ever make their way out of the heap's young generation, so will be cheap to collect.

There is one important difference between PointHolder and ArrayList<Point>, not immediately apparent from this example: since the Point instances do not actually exist within the list, changes to them won't be reflected in the PointHolder data. As implemented, each Point instance contains its own copies of the primitive values. To make the Point mutable, you'll need to design them with a reference back to the PointHolder (which I won't go into because I don't like mutable data-only objects).

Now that you have a space-efficient List<Point>, you may be thinking of how you'd access it — for example, how to search for a specific point. And you might think of calling Collections.sort(), followed by Collections.binarySearch(). And while this will work, it's not very efficient (because you're constantly creating Point instances), and has hidden memory costs. Those are topics for the next post.

Tuesday, October 19, 2010

Living Among Primitives, part 2: Data Structures

It may seem strange to talk about “living among primitives” and then jump to data structures, but the truth is that most programs work with structured data. So even in a world where you've turned to primitives, you need to think about how you'll organize those primitives.

And the best way, surprise, is to use Java objects — but with an eye toward the overheads that they introduce. In the last post I noted that an Integer adds 8 bytes of overhead to 4 bytes of data. That number came from Java Platform Performance: Strategies and Tactics, and I suspect it's higher for 64-bit JVMs: I recall reading in an early Hotspot doc that the overhead consisted of a reference to the object's class (which would be 4 bytes under 32-bit, 8 under 64-bit) and the object's identity hashcode. Arrays add another 4 bytes of overhead to hold their length.

As a case study for the rest of this post, I'm going to look at a three-dimensional Point class:

public class Point
    private float _x;
    private float _y;
    private float _z;

The data in this object takes up 12 bytes (float is a 32-bit IEE-754 floating-point value), and we'll be conservative on overhead and say it's 8 bytes (who knows, perhaps a 64-bit JVM will efficiently encode a reference to the object's class). So a total of 20 bytes per instance, 50 million instances per gigabyte. But you'll need something to manage those 50 million instances:

Point[] points = new Point[50000000];

On a 64-bit JVM, each element in that array is a reference, consuming 8 bytes; that means nearly half a gigabyte just to keep track of your gigabyte of points! And from here on out I'm going to assume that you have data volumes that require a 64-bit JVM.

One way to deal with this is to turn the structure inside-out: rather than an array of points, why not an object that holds arrays of values:

public class PointHolder
    private float[] _x;
    private float[] _y;
    private float[] _z;
    // constructor will initialize the arrays
    public float getX(int index)
        return _x[index];
    // and so on

This approach may make your code a little more complex, but it's eliminated 16 bytes of overhead per point: the references in the original array, plus the per-object overhead. You can now fit nearly 90 million points in a gigabyte. And it has another benefit: you can change the implementation without changing your code. For example, if you're dealing with truly huge arrays of points, you could move them off the heap entirely, using a memory-mapped file.

The problem with “turning the object inside-out” is that you have to give careful thought to how you work with the data: for example, how do you sort these points? How do you find a particular point? That will be the topic of the next few posts.

Monday, October 18, 2010

Living Among Primitives, part 1

There are a lot of people who say that Java shouldn't have primitive data types, that everything should be an object. And there is some truth to their arguments. Most business applications don't do intensive numeric operations, and for those that do, double is a poor substitute for a fully-thought-out Money. You could even claim that the presence of primitives leads to a procedural style of code: things like using an index variable rather than an iterator to access a list. And languages such as Ruby and Python follow the “everything's an object” path without apparent ill effect.

But I feel that primitives are in fact one of the strong points of the Java language, and languages that don't have native primitives often sprout hacks to replace them. Because there are several reasons for primitives to exist.

One is performance. It's true that, for simple expressions, evaluated once, you can't tell the difference. But there's a reason that libraries such as NumPy, which performs matrix math (among other things), are written in C or C++ rather than Python.

There's a primitive at the heart of any numeric object — after all, that's what the CPU works with. So when you invoke a method on a numeric object, not only do you have the overhead of a function call, but you're calling into native code. True, Hotspot could recognize mathematical operations and translate them into inline native code. But that still leaves the cost of allocating and collecting objects that are used only during the evaluation of a function. Unless Hotspot recognized those and transformed them into, well, primitives.

Another, often more important reason to use primitives is memory consumption: an int takes 4 bytes, but on a 32-bit Sun JVM, an Integer adds 8 bytes of overhead, there's also overhead to reference that Integer: 4 bytes on a 32-bit JVM, 8 bytes on a 64-bit JVM. Again, that doesn't seem like much, unless your application is dealing with hundreds of millions of numeric values.

And hundreds of millions of numeric values is exactly what I'm dealing with on my latest project … to the point where I recently told a colleague that “this application should be written in C.” The next few posts will dive into some of the issues that you'll face when living among primitives.