It may seem strange to talk about “living among primitives” and then jump to data structures, but the truth is that most programs work with structured data. So even in a world where you've turned to primitives, you need to think about how you'll organize those primitives.
And the best way, surprise, is to use Java objects — but with an eye toward the overheads that they introduce. In the last post I noted that an Integer adds 8 bytes of overhead to 4 bytes of data. That number came from Java Platform Performance: Strategies and Tactics, and I suspect it's higher for 64-bit JVMs: I recall reading in an early HotSpot doc that the overhead consisted of a reference to the object's class (which would be 4 bytes under 32-bit, 8 under 64-bit) and the object's identity hashcode. Arrays add another 4 bytes of overhead to hold their length.
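You can get a rough empirical check on that per-object overhead by allocating a lot of Integers and watching heap usage. This is my own crude sketch, not from the original post, and the measurement is sensitive to GC timing, so treat the result as an estimate:

```java
public class IntegerOverhead {
    // Crude estimate of bytes consumed per Integer instance.
    // Sensitive to GC timing and JVM settings; an estimate only.
    public static long bytesPerInteger(int count) {
        Runtime rt = Runtime.getRuntime();
        Integer[] values = new Integer[count];      // allocate array before measuring
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();
        for (int i = 0; i < count; i++) {
            values[i] = Integer.valueOf(i);         // values >= 128 allocate a new object
        }
        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        return (after - before) / count;
    }

    public static void main(String[] args) {
        System.out.println("approx bytes per Integer: " + bytesPerInteger(1_000_000));
    }
}
```

On a 64-bit JVM you'd expect a number well above the 4 bytes of actual data, reflecting the object header plus alignment padding.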
As a case study for the rest of this post, I'm going to look at a three-dimensional Point class:
public class Point {
    private float _x;
    private float _y;
    private float _z;
}
The data in this object takes up 12 bytes (float is a 32-bit IEEE-754 floating-point value), and we'll be conservative on overhead and say it's 8 bytes (who knows, perhaps a 64-bit JVM will efficiently encode a reference to the object's class). So a total of 20 bytes per instance, 50 million instances per gigabyte. But you'll need something to manage those 50 million instances:
Point[] points = new Point[50000000];
On a 64-bit JVM, each element in that array is a reference, consuming 8 bytes; that means nearly half a gigabyte just to keep track of your gigabyte of points! And from here on out I'm going to assume that you have data volumes that require a 64-bit JVM.
One way to deal with this is to turn the structure inside-out: rather than an array of points, why not an object that holds arrays of values:
public class PointHolder {
    private float[] _x;
    private float[] _y;
    private float[] _z;

    // constructor will initialize the arrays

    public float getX(int index) {
        return _x[index];
    }

    // and so on
}
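Filling in the elided pieces, a minimal complete version might look like the following. The fixed-capacity constructor and the setPoint helper are my assumptions, not from the original post:

```java
public class PointHolder {
    private final float[] _x;
    private final float[] _y;
    private final float[] _z;

    // fixed-capacity constructor (an assumption; the post elides this)
    public PointHolder(int capacity) {
        _x = new float[capacity];
        _y = new float[capacity];
        _z = new float[capacity];
    }

    // hypothetical helper for storing a point's coordinates
    public void setPoint(int index, float x, float y, float z) {
        _x[index] = x;
        _y[index] = y;
        _z[index] = z;
    }

    public float getX(int index) { return _x[index]; }
    public float getY(int index) { return _y[index]; }
    public float getZ(int index) { return _z[index]; }
}
```

Usage then looks much like the array version, just with an index instead of an object reference:

```java
PointHolder points = new PointHolder(50_000_000);
points.setPoint(0, 1.0f, 2.0f, 3.0f);
float x = points.getX(0);
```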
This approach may make your code a little more complex, but it eliminates 16 bytes of overhead per point: the reference in the original array, plus the per-object overhead. You can now fit nearly 90 million points in a gigabyte. And it has another benefit: you can change the implementation without changing your code. For example, if you're dealing with truly huge arrays of points, you could move them off the heap entirely, using a memory-mapped file.
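As a sketch of that off-heap idea, here is a variant backed by a memory-mapped file rather than heap arrays. The class name, file layout (three contiguous regions, one per coordinate), and method names are my assumptions; the point is that the accessor interface stays the same:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;

public class MappedPointHolder {
    private final FloatBuffer _x;
    private final FloatBuffer _y;
    private final FloatBuffer _z;

    public MappedPointHolder(String filename, int capacity) throws IOException {
        long bytesPerArray = 4L * capacity;   // 4 bytes per float
        try (RandomAccessFile file = new RandomAccessFile(filename, "rw")) {
            FileChannel channel = file.getChannel();
            // three contiguous mapped regions, one per coordinate;
            // the mappings remain valid after the channel is closed
            _x = channel.map(FileChannel.MapMode.READ_WRITE, 0, bytesPerArray).asFloatBuffer();
            _y = channel.map(FileChannel.MapMode.READ_WRITE, bytesPerArray, bytesPerArray).asFloatBuffer();
            _z = channel.map(FileChannel.MapMode.READ_WRITE, 2 * bytesPerArray, bytesPerArray).asFloatBuffer();
        }
    }

    public void setPoint(int index, float x, float y, float z) {
        _x.put(index, x);
        _y.put(index, y);
        _z.put(index, z);
    }

    public float getX(int index) { return _x.get(index); }
    public float getY(int index) { return _y.get(index); }
    public float getZ(int index) { return _z.get(index); }
}
```

Since callers only see getX(index) and friends, you could swap this in for the array-backed holder without touching the code that uses it.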
The problem with “turning the object inside-out” is that you have to give careful thought to how you work with the data: for example, how do you sort these points? How do you find a particular point? That will be the topic of the next few posts.