
Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

public void handler(Object ignored, Context context)
{
    long start = System.currentTimeMillis();
    
    AWSLogs client = AWSLogsClientBuilder.defaultClient();
    
    long clientCreated = System.currentTimeMillis();
    
    client.describeLogGroups();
    
    long apiInvoked = System.currentTimeMillis();
    
    System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
    System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. To get numbers, I created a main() that just invokes the handler function; a minimal sketch, assuming the handler lives in a class that I'm calling LogGroupLambda (use whatever class name your Lambda actually has):
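
public static void main(String[] argv)
{
    LogGroupLambda lambda = new LogGroupLambda();
    lambda.handler(null, null);  // the handler ignores its arguments, so nulls are fine
}

Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes: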

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program were spending all of its time making network calls, the user time would be much less than the real time.

This information alone is enough to make a first pass at optimization.

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a (not always deserved) reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB gets you slightly less than a third of a virtual CPU — one of unspecified performance, at that.

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

                    512 MB   1024 MB   2048 MB   4096 MB
Create client         5298      2493      1272      1019
Invoke API call       3844      2023      1061       613
Billed duration       9213      4555      2349      1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
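
My measurement code was a variant of the following; a minimal sketch using the standard JMX management beans (the helper name is mine):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// returns the cumulative time, in milliseconds, spent in all garbage collectors
// (getCollectionTime() can return -1 if a collector doesn't report timing)
public static long totalGcTime()
{
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans())
    {
        total += gc.getCollectionTime();
    }
    return total;
}

Call it at the start and end of the handler, and the difference is the collection time for that invocation.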

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.
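
You can check the count yourself with the -verbose:class flag; something like this, although the exact output format (and therefore the pattern to count) varies by JDK version:

> java -verbose:class -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar | grep -c 'Loaded'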

The JVM loads classes on demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class requires loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).

But, as I said before, there's no substitute for observation. Here are the numbers using version 2.15.53:

                    512 MB   1024 MB   2048 MB   4096 MB
Create client         4965      2278      1141       959
Invoke API call       4235      2062      1047       661
Billed duration       9237      4357      2204      1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't measure the performance gain from switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

                  512 MB   4096 MB
Shade Plugin        9213      1648
Assembly Plugin     8138      1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this one requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
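
Here's a sketch of that structure, using the same log-groups example (the withLimit(1) call is simply one cheap, read-only way to exercise the request path; adapt to whatever services you actually use):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.DescribeLogGroupsRequest;

public class Handler
{
    private final AWSLogs client;

    public Handler()
    {
        // this runs during initialization, which Provisioned Concurrency performs in advance
        client = AWSLogsClientBuilder.defaultClient();

        // a cheap, read-only call that forces the SDK's request/response classes to load
        client.describeLogGroups(new DescribeLogGroupsRequest().withLimit(1));
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}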

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with an embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Friday, January 14, 2011

Big-O Isn't Everything

Question: when does an O(N²) algorithm outperform an O(N log N) algorithm?

As part of my article on primitives, I decided to implement Heapsort. I needed an in-place sort for primitive integers that used an external comparator, and couldn't find one with a reasonable amount of Googling. I picked Heapsort because it's an easy algorithm to implement, and has O(N log N) behavior for all inputs (unlike Quicksort, which degrades to O(N²) on already-sorted input unless you take care picking the pivot).

Naturally, I wanted to see how my implementation performed, vis-a-vis the sorts provided by the JDK. Particularly on large datasets, which is the reason that I wanted an in-place sort. So I created a benchmark that measured thread CPU time to sort a 10,000,000 element array (which I knew would fit in RAM). Here are the results, the average of 10 executions on my aged desktop (a 2.4 GHz P4 with only 512 KB of cache):

Implementation             Time (sec)      Compares
Arrays.sort(int[])               2.55
Heapsort.sort(int[])            18.08
Arrays.sort(Integer[])          25.09   223,179,141
Heapsort.sort(Integer[])        79.91   439,021,125

Those are dramatic differences. I knew that Heapsort would make more comparisons than Mergesort (which the JDK will use for Integer[]), so I added a counter to the comparator. But the running time was over three times longer, with only twice the number of comparisons. And Quicksort (used by the JDK for int[]) was over seven times faster. Perhaps some of that difference comes from Heapsort calling a comparator, but clearly, not all O(N log N) algorithms are alike!

So I decided to look at the JDK source. And found this interesting bit of code:*

private static void sort1(int x[], int off, int len) {
    // Insertion sort on smallest arrays
    if (len < 7) {
        for (int i=off; i<len+off; i++)
            for (int j=i; j>off && x[j-1]>x[j]; j--)
                swap(x, j, j-1);
        return;
    }

Insertion sort is an O(N²) algorithm. But it doesn't have a lot of moving parts — unlike Quicksort, which takes almost 40 lines of code in the JDK implementation, not counting the swap and range-swap methods. And when N is small, lines of code are more important than algorithmic efficiency (or, as my coworker Max points out, when N is bounded, you have a constant-time operation).

There's one more performance-enhancing feature of this snippet: all of the data accesses should be within the L1 cache. On Intel processors, the cache is structured as 64-byte chunks of memory — enough to hold 8 adjacent int values. While Arrays.sort() makes no attempt to ensure alignment of the insertion-sorted sub-arrays, it's a naturally cache-friendly algorithm. As is Quicksort itself, for that matter. And Heapsort, which jumps around the array comparing elements at position N and 2 * N, is decidedly cache-unfriendly.

All of which is to say that sometimes, micro-optimizations have a big effect. And that, rather than implementing Heapsort, I should have simply copied the OpenJDK implementation of Quicksort and added a comparator.**


* This optimization isn't specific to the JDK. The GNU C library implementation also switches to insertion sort, albeit at 4 elements, and credits Sedgewick for that and other optimizations.

** In case you're wondering, the updated Quicksort took 4.25 seconds to run, and made approximately 375,000,000 comparisons. If it weren't covered by GPL2, I'd put it in my toolbox in place of the Heapsort implementation.

Wednesday, October 27, 2010

Living Among Primitives, part 3: Collections

The last post “turned an object inside-out” to eliminate the overhead of a list of discrete objects. While that technique allows you to pack a lot of objects into a relatively small amount of space, your programs will be a lot cleaner if you can pass around Point objects rather than a PointHolder and its index. The java.util.AbstractList class can help you reach a middle ground.

public class PointHolder2
extends AbstractList<Point>
{
    private float[] _x;   // backing arrays, as in PointHolder from the last post
    private float[] _y;
    private float[] _z;
    private int _size;

    // constructor initializes the arrays

    @Override
    public int size()
    {
        return _size;
    }

    @Override
    public boolean add(Point point)
    {
        _x[_size] = point.getX();
        _y[_size] = point.getY();
        _z[_size] = point.getZ();
        _size++;
        return true;
    }

    @Override
    public Point get(int index)
    {
        return new Point(_x[index], _y[index], _z[index]);
    }

    @Override
    public Point set(int index, Point point)
    {
        _x[index] = point.getX();
        _y[index] = point.getY();
        _z[index] = point.getZ();
        return point;
    }
}

AbstractList is a tool for implementing lists with an application-specific backing store. Out of the box, you only have to implement the get() and size() methods; the mutator methods throw UnsupportedOperationException unless you override them. But as you can see, it doesn't take a lot of work to implement these, and the result is nearly identical in terms of performance to an ArrayList<Point>.

And the Point class has reappeared, making your application code cleaner. But now they're created as-needed, and will be collected as soon as the program is done with them; you won't pay the memory overhead to keep them around. And assuming that they're relatively short-lived once created, they should rarely if ever make their way out of the heap's young generation, so will be cheap to collect.

There is one important difference between PointHolder and ArrayList<Point>, not immediately apparent from this example: since the Point instances do not actually exist within the list, changes to them won't be reflected in the PointHolder data. As implemented, each Point instance contains its own copies of the primitive values. To make the Point mutable, you'll need to design them with a reference back to the PointHolder (which I won't go into because I don't like mutable data-only objects).

Now that you have a space-efficient List<Point>, you may be thinking of how you'd access it — for example, how to search for a specific point. And you might think of calling Collections.sort(), followed by Collections.binarySearch(). And while this will work, it's not very efficient (because you're constantly creating Point instances), and has hidden memory costs. Those are topics for the next post.

Tuesday, October 19, 2010

Living Among Primitives, part 2: Data Structures

It may seem strange to talk about “living among primitives” and then jump to data structures, but the truth is that most programs work with structured data. So even in a world where you've turned to primitives, you need to think about how you'll organize those primitives.

And the best way, surprise, is to use Java objects — but with an eye toward the overheads that they introduce. In the last post I noted that an Integer adds 8 bytes of overhead to 4 bytes of data. That number came from Java Platform Performance: Strategies and Tactics, and I suspect it's higher for 64-bit JVMs: I recall reading in an early Hotspot doc that the overhead consisted of a reference to the object's class (which would be 4 bytes under 32-bit, 8 under 64-bit) and the object's identity hashcode. Arrays add another 4 bytes of overhead to hold their length.

As a case study for the rest of this post, I'm going to look at a three-dimensional Point class:

public class Point
{
    private float _x;
    private float _y;
    private float _z;
}

The data in this object takes up 12 bytes (float is a 32-bit IEEE 754 floating-point value), and we'll be conservative on overhead and say it's 8 bytes (who knows, perhaps a 64-bit JVM will efficiently encode a reference to the object's class). So a total of 20 bytes per instance, 50 million instances per gigabyte. But you'll need something to manage those 50 million instances:

Point[] points = new Point[50000000];

On a 64-bit JVM, each element in that array is a reference, consuming 8 bytes; that means nearly half a gigabyte just to keep track of your gigabyte of points! And from here on out I'm going to assume that you have data volumes that require a 64-bit JVM.

One way to deal with this is to turn the structure inside-out: rather than an array of points, why not an object that holds arrays of values:

public class PointHolder
{
    private float[] _x;
    private float[] _y;
    private float[] _z;
    
    // constructor will initialize the arrays
    
    public float getX(int index)
    {
        return _x[index];
    }
    
    // and so on
}

This approach may make your code a little more complex, but it's eliminated 16 bytes of overhead per point: the references in the original array, plus the per-object overhead. You can now fit nearly 90 million points in a gigabyte. And it has another benefit: you can change the implementation without changing your code. For example, if you're dealing with truly huge arrays of points, you could move them off the heap entirely, using a memory-mapped file.
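
Here's a sketch of what that mapped-file variant might look like for a single coordinate (the class name is mine; note also that a single mapped buffer is limited to 2 GB, so truly huge datasets would need several):

import java.io.RandomAccessFile;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;

public class MappedPointHolder
{
    private FloatBuffer _x;   // and likewise _y, _z

    public MappedPointHolder(String filename, int count)
    throws Exception
    {
        RandomAccessFile file = new RandomAccessFile(filename, "rw");
        // map 'count' floats (4 bytes each) and view the region as a FloatBuffer
        _x = file.getChannel()
                 .map(FileChannel.MapMode.READ_WRITE, 0, 4L * count)
                 .asFloatBuffer();
    }

    public float getX(int index)
    {
        return _x.get(index);
    }
}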

The problem with “turning the object inside-out” is that you have to give careful thought to how you work with the data: for example, how do you sort these points? How do you find a particular point? That will be the topic of the next few posts.

Monday, October 18, 2010

Living Among Primitives, part 1

There are a lot of people who say that Java shouldn't have primitive data types, that everything should be an object. And there is some truth to their arguments. Most business applications don't do intensive numeric operations, and for those that do, double is a poor substitute for a fully-thought-out Money. You could even claim that the presence of primitives leads to a procedural style of code: things like using an index variable rather than an iterator to access a list. And languages such as Ruby and Python follow the “everything's an object” path without apparent ill effect.

But I feel that primitives are in fact one of the strong points of the Java language, and languages that don't have native primitives often sprout hacks to replace them. Because there are several reasons for primitives to exist.

One is performance. It's true that, for simple expressions, evaluated once, you can't tell the difference. But there's a reason that libraries such as NumPy, which perform matrix math (among other things), are written in C or C++ rather than Python.

There's a primitive at the heart of any numeric object — after all, that's what the CPU works with. So when you invoke a method on a numeric object, not only do you have the overhead of a function call, but you're calling into native code. True, Hotspot could recognize mathematical operations and translate them into inline native code. But that still leaves the cost of allocating and collecting objects that are used only during the evaluation of a function. Unless Hotspot recognized those and transformed them into, well, primitives.

Another, often more important reason to use primitives is memory consumption: an int takes 4 bytes, but on a 32-bit Sun JVM, an Integer adds 8 bytes of overhead; there's also overhead to reference that Integer: 4 bytes on a 32-bit JVM, 8 bytes on a 64-bit JVM. Again, that doesn't seem like much, unless your application is dealing with hundreds of millions of numeric values.

And hundreds of millions of numeric values is exactly what I'm dealing with on my latest project … to the point where I recently told a colleague that “this application should be written in C.” The next few posts will dive into some of the issues that you'll face when living among primitives.

Monday, September 20, 2010

BigMemory

Last week Terracotta released BigMemory, which stores objects outside of the Java heap and “defuses the GC time bomb.” Since I've written the top-ranked relevant article when you Google for “non-heap memory” (it's on page 2), my website has seen a spike in volume. Unlike previous spikes, people have been poking around, which is nice to see.

I did some poking around myself, both to Terracotta's blogs, and to some of the Twitter posts and articles that linked to my site. And I saw a lot of comments that disturbed me. They ranged from (paraphrased) “this is just hype, it isn't anything new,” to “I wonder if Oracle will file a patent lawsuit on this,” to “the JVM should be doing this itself,” to the ever-popular “people should just learn to write efficient code.” I'll take on that last one first.

But first some caveats: I've never used EHCache outside of Hibernate, and even then I simply followed a cookbook. My closest contact with EHCache development was attending a Java Users Group meeting with a badly jetlagged Greg Luck, followed by beer with him and a bunch of other people. I can't remember what was discussed around the table, but I don't think it was cache internals. But I do know something about caches, and about the way that the JVM manages memory, so I feel at least qualified to comment on the comments.

And my first comment is: a cache is efficient code … when appropriately used. You can spend days micro-optimizing your code, only to find it all wasted with a single trip to the database.* You use a cache when it's expensive to retrieve or create often-used data; the L1 and L2 caches in your processor exist because memory is relatively slow, so it's expensive to retrieve. Similarly, it's expensive to create a web page (and load its data from the database), so you would use EHCache (or Akamai) to cache the pre-built page (EHCache is cheaper).

The interface to a cache is pretty straightforward: you give it a key, it gives you a value. In the case of the processor's L1 cache, the key is an address, the value is the bytes around that address. In the case of memcached, the key is a string and the value is a blob. In Java, this corresponds to a Map, and you can create a simple cache using a LinkedHashMap.

The reason to use a LinkedHashMap, versus a regular HashMap, is eviction: removing old entries from the map. Although memory is cheap, it's not infinite, and there's no good reason to keep every piece of data ever accessed. For example, consider Stack Overflow: when a question is on the home page, it might get hundreds of views; clearly you don't want to keep reading it from the database. But after it drops off the home page, it might get viewed once a week, so you don't want to keep it in memory.

With LinkedHashMap, you can implement a simple eviction strategy: once the map reaches a certain size, you remove the head entry on each new put(). More sophisticated caches use more sophisticated strategies: for example, weighting the life of a cached object by the amount of work that it takes to load from source. In my opinion, this is why you go with a pre-built solution like EHCache: they've thought through the eviction strategies and neatly packaged them.
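
Here's a minimal sketch of that simple strategy (the class name and size limit are mine); LinkedHashMap provides the removeEldestEntry() hook, which is invoked on each put(), for exactly this purpose:

import java.util.LinkedHashMap;
import java.util.Map;

public class SimpleCache<K,V>
extends LinkedHashMap<K,V>
{
    private final int _maxEntries;

    public SimpleCache(int maxEntries)
    {
        // accessOrder = true means iteration order is least-recently-used first,
        // so eviction discards the LRU entry rather than the oldest insertion
        super(16, 0.75f, true);
        _maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K,V> eldest)
    {
        return size() > _maxEntries;
    }
}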

So that's how a cache works. What's interesting is how it interacts with the garbage collector. The point of a cache is to hold content in memory for as long as it's valuable; for a generational garbage collector like Sun's, this means it will end up in the tenured generation … but eventually get removed. And this causes two issues.

First is the cost to find the garbage. The Hotspot GC is a “mark-sweep” collector: it starts at “root” references and traces through the entire object graph to find live objects; everything else is garbage (this is all covered in my article on reference objects; go back and click the link). If you have tens or hundreds of millions of objects in your heap (and for an eCommerce site, that's not an unreasonable number), it's going to take some time to find them all: a few milliseconds. This is, as I understand it, the price that you pay for any collection, major or minor.

A major collection, however, performs one additional step: it compacts the heap after collecting the garbage. This is a big deal. No matter how fast your CPU and memory, it takes time to move gigabytes of memory around. Particularly since the JVM is updating all the references to the moved objects. But if you've ever written a C application that fragmented its heap, you'll thank the Hotspot team every time you see a major collection take place.

Or maybe you won't. Heap compaction is an artifact of tuning the JVM for turn-of-the-century computers, where 1 GB of main memory was “server class,” and swap (the pagefile) was an important part of making your computer usable for multiple applications. Compacting the heap was a way to reduce the resident set size, and improve the overall performance of an application.**

Today, of course, we have desktop-class machines with multiple gigabytes, running an OS and JVM that allow direct access to all of it (and more). And there's been a lot of work in the past few decades to reduce the risk of fragmentation with C-style memory allocation. So maybe it is time to rethink garbage collection in the JVM, at least for the tenured generation, and introduce freelists (of course, that will inflame the “Java is a memory hog” crowd). Or maybe it's time to introduce a new generation, for large tenured objects. Maybe in JDK 1.8. I'm not holding my breath, and neither, it seems, were the EHCache developers.

So to summarize: EHCache has taken techniques that are well-known, and applied them to the Java world. If I were still working in the eCommerce field, I would consider this a Good Thing, and be glad that they invested the time and effort to make it work. Oh, and as far as patents go, I suspect that IBM invented memory-mapped files sometime in the 1960s, so no need to worry.


* Or, in the words of one of my favorite authors: “No matter how subtle the wizard, a knife between the shoulder blades will seriously cramp his style.”

** If you work with Swing apps, you'll often find a minimize handler that explicitly calls System.gc(); this assumes that the app's pages will be swapped out while minimized, and is an attempt to reduce restore time.

Thursday, April 15, 2010

intern() and equals()

A lot of programmers have the belief that String.intern() can be used to speed up if-else chains. So instead of writing this:

public void dispatch(String command)
{
    if (command.equals("open"))
        doOpenAction();
    else if (command.equals("close"))
        doCloseAction();
    else if (command.equals("calculate"))
        doCalculateAction();
    // and so on
}

they write this:

public void dispatch(String command)
{
    command = command.intern();
    if (command == "open")
        doOpenAction();
    else if (command == "close")
        doCloseAction();
    else if (command == "calculate")
        doCalculateAction();
    // and so on
}

Their logic seems to be that it's faster to compare integers than to compare strings. And while the only way to know for sure is a benchmark, I'm willing to bet that the intern() approach is actually slower (except, of course, for carefully constructed examples). The reason being that string comparisons are quite fast, while intern() has to do a lot of work behind the scenes.

Let's start with a string comparison, since it's available in the src.zip file that comes with the JDK (in this case, JDK 1.6):

    public boolean equals(Object anObject) {
        if (this == anObject) {
            return true;
        }
        if (anObject instanceof String) {
            String anotherString = (String)anObject;
            int n = count;
            if (n == anotherString.count) {
                char v1[] = value;
                char v2[] = anotherString.value;
                int i = offset;
                int j = anotherString.offset;
                while (n-- != 0) {
                    if (v1[i++] != v2[j++])
                        return false;
                }
                return true;
            }
        }
        return false;
    }

Like all good equals() implementations, this method tests identity equality first. Which means that, intern() or not, there's no reason to get into the bad habit of using == to compare strings (because sooner or later you'll do that with a string that isn't interned, and have a bug to track down).

The next comparison is length: if two strings have different length, they can't be equal. And finally, iteration through the actual characters of the string. Most calls should return from the length check, and in almost all real-world cases the strings will differ in the first few characters. This isn't going to take much time, particularly if the JVM decides to inline the code.

How about intern()? The source code for the JDK is available, and if you trace through method calls, you end up at symbolTable.cpp, which defines the StringTable class. Which is, unsurprisingly, a hash table.

All hash lookups work the same way: first you compute a hashcode, then you use that hashcode to probe a table to find a list of values, then you compare the actual value against every value in that list. In the example above, computing the hashcode for any of the expected strings will take as many memory accesses as all of the comparisons in the if-else chain.
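
For reference, here's the classic String hashcode algorithm, paraphrased from the JDK source (the real method also caches its result). Every character is read once, so hashing “calculate” touches as much memory as comparing it against each branch of the if-else chain:

public int hashCode() {
    int h = 0;
    for (int i = 0; i < count; i++) {
        h = 31 * h + value[offset + i];
    }
    return h;
}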

So if intern() doesn't speed up comparisons, what is it good for? The answer is that it conserves memory.

A lot of programs read strings from files and then hold those strings in memory. For example, an XML parser builds up a DOM tree in which each Element instance holds its name, namespace, and perhaps a map of attributes. There's usually a lot of duplication in this data: think of the number of times that <div> and href appear in a typical XHTML file. And since you've read these names from a file, they come into your program as separate String instances; nothing is shared. By passing them through intern(), you're able to eliminate the duplication and hold one canonical instance.

The one drawback to String.intern() is that it stores data in the permgen (at least for the Sun JVM), which is often a scarce resource (particularly in app-servers). To get the benefits of canonicalization without risking an OutOfMemoryError, you can create your own canonicalizing map.
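
A minimal sketch of such a map (the class name is mine; note that this version holds every string forever, so a production version might use weak references to let unused entries be collected):

import java.util.concurrent.ConcurrentHashMap;

public class Canonicalizer
{
    private ConcurrentHashMap<String,String> _instances
            = new ConcurrentHashMap<String,String>();

    public String intern(String str)
    {
        // putIfAbsent() returns the existing value, or null if this is the first occurrence
        String existing = _instances.putIfAbsent(str, str);
        return (existing != null) ? existing : str;
    }
}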

And if you'd like to replace long if-else chains with something more efficient, think about a Map of functors.
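
For example, a sketch that reuses the hypothetical action methods from the code above:

import java.util.HashMap;
import java.util.Map;

public class Dispatcher
{
    private Map<String,Runnable> _actions = new HashMap<String,Runnable>();

    public Dispatcher()
    {
        _actions.put("open",      new Runnable() { public void run() { doOpenAction();      } });
        _actions.put("close",     new Runnable() { public void run() { doCloseAction();     } });
        _actions.put("calculate", new Runnable() { public void run() { doCalculateAction(); } });
    }

    public void dispatch(String command)
    {
        Runnable action = _actions.get(command);
        if (action != null)
            action.run();
    }

    private void doOpenAction()      { /* ... */ }
    private void doCloseAction()     { /* ... */ }
    private void doCalculateAction() { /* ... */ }
}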

Friday, January 29, 2010

Micro-Optimization is Easy and Feels Good

Have you ever seen code like this?

private String foo(String arg1, int arg2)
{
    StringBuilder buf = new StringBuilder();
    buf.append(arg1).append(" = ").append(arg2);
    return buf.toString();
}

If you ask the person who wrote this why s/he used a StringBuilder rather than simply concatenating the strings, you'll probably hear a long lecture about how it's more efficient. If you're a bytecode geek, you can respond by demonstrating that the compiler generates exactly the same code. But then the argument changes: it's more efficient in some cases, and is never less efficient, so is still a Good Thing. At that point it's usually easiest to (snarkily) say “well, you should at least pre-allocate a reasonable buffer size,” and walk away.

These arguments are not caused by the time quantum fallacy. The StringBuilder fan recognizes that different operations take differing amounts of time, and is actively trying to minimize the overall time of his/her code. Never mind that, in context, string concatenation is rarely a significant time sink. The rationale is always “it can't hurt, and might help.”

This is the same attitude that drives curbside recycling. It feels good to put an aluminum can in the bin: the can can be melted down and reused, saving much of the energy needed to smelt the aluminum from ore. Doing so, however, avoids the bigger questions of whether aluminum is the most environmentally sound packaging material in the first place, or whether you should even be drinking the contents of that can.

In the same way, micro-optimizations let you feel good while avoiding the bigger question of what parts of your code should be optimized. Experienced programmers know that not all code should be optimized: there's an 80-20 (or more likely 90-10) rule at work. But finding the 10% that should be optimized is, quite frankly, hard.

It's not that we don't have tools: profilers of one form or another have been around since before I started working with computers. Java has had a built-in profiler since at least JDK 1.2, and a good profiler since (I think) 1.4. For that matter, some carefully placed logging statements can give you a good idea of what parts of your code take the longest (just be sure to use a logging guard so that you're not wasting CPU cycles!).
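
(A logging guard is just a level check that skips expensive message construction when the level is disabled; a sketch in the style of Log4J or Commons Logging, where logger and the variables are stand-ins:)

if (logger.isDebugEnabled())
{
    // the string concatenation only happens when debug output is actually enabled
    logger.debug("processed " + rowCount + " rows in " + elapsedMillis + " ms");
}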

The real barrier to profiling — and by extension, intelligent optimization — is coming up with a representative workload. “Representative” is the key word here: you can write a performance test that spends all of its time doing things that your real users never do. Even when you think that you know what the users will do, it's tempting to take shortcuts: to profile an eCommerce application, you might add 10,000 items to a single cart. But the optimizations that you make after this test could be meaningless if the real world has 10,000 carts containing one item. And even if you get that right, there may be different implications running on an empty database versus one that has lots of data.

So yes, it's hard. But ultimately, the brain cycles that you spend thinking about representative workloads will have a greater payoff than those you spend thinking about micro-optimizations. And the end result feels better too, much like foregoing a can of soda for a glass of water.

Wednesday, January 27, 2010

The Time Quantum Fallacy

So why would someone who's writing a servlet worry about the time taken for a cast? I can't believe it's due to a fear of software bloat, simply because the relative impact is so minimal: a person truly concerned about bloat would think of replacing HTTP with a raw TCP connection. Or maybe UDP, to avoid all the overhead of ack/nack.

However, what if the questioner has a fundamental misunderstanding of how long it takes a computer to do anything? To be more specific, a belief that all operations take the same time, so removing a cast is just as good as removing a database lookup.

Experienced programmers will scoff at that idea. But think about it: on a human scale, all computer operations do seem to take the same amount of time. Google can come up with search suggestions in the time it takes you to type a single character; that involves a round-trip web request and some sort of database call.

Computer science classes reinforce this idea, with their emphasis on “Big O” analysis. The implication of such analysis is that an O(n²) algorithm is worse than an O(n log n) algorithm, regardless of what the code actually does — or the context in which the code is executed.

In the real world, of course, context is everything. If I need to do lookups on thousands of items, a hashed lookup (O(1)) seems far more appropriate than a linear lookup (O(N)). But what if I only have a few items? Say, 10 or 20? Or if I only occasionally do a lookup? In this case, the O(1) versus O(N) analysis just doesn't matter. And a linear structure may be easier to construct and use than a hashed structure.

This brings me back to Mel. Mel's genius was not so much in understanding the instruction set, but understanding overall context. Of knowing how long each instruction took, how quickly the memory drum spun, and most important, when this knowledge made a difference. Perhaps that understanding only comes with experience, but I find that even experienced programmers fall into the micro-optimization trap, so there must be yet another factor at work.

Monday, January 25, 2010

The Urge to Optimize

I've a Servlet filter which performs the following type cast [...] What is the performance impact of such a cast? Is it worth to accept this performance degradation for a better architecture?

That's an exact quote from an online Q&A forum. The first response starts with an incredulous “As compared to handling a HTTP request?” (emphasis as written). To an experienced programmer, the questioner's concern is ridiculous: a cast takes nanoseconds, processing a request takes milliseconds; the difference is six orders of magnitude.

But questions like this aren't rare: on the same day, another user wanted to know the “exact” CPU difference, for allocating byte arrays in Java versus pooling them. Another was concerned about the performance of unsigned versus signed integers in C++. And yet another wondered how best to translate a Python program (not a module, a program) into C, “so that it can be smaller (and faster).”

Most of the answers to such questions involve platitudes about premature optimization; a few suggest profiling. Both of which are valid answers. And more important, well-publicized answers: a Google search on “software profiling” returns 5½ million results. And yet the micro-optimization questions keep coming. Why?

Part of the answer, I think, is that there are also an enormous number of pages devoted to “software bloat” — and an equally large fear that one's own software is bloated. Every culture has its creation myths, and The Story of Mel is a powerful myth for the software industry. Who wouldn't want to be Mel, a person who knew his machine so well that he laid out instructions according to the characteristics of its memory device?

I was one of the “new generation of programmers” that was the audience for that story; I started my professional career at about the same time the story was posted. At the time, I was doing a lot of low-level programming: one job involved inserting NOPs into an 8086 interrupt handler to ensure real-time response characteristics (real-time means predictable, not necessarily fast). Another job involved porting a windowing system that provided Macintosh-level capabilities for the Apple 2e.

To understand the latter project, recognize that this was 1984: the Macintosh had most of its windowing library in ROM, yet developers still struggled to fit a complex program (and user data) into its 128k of RAM. The Apple 2e had 64k of memory, which would have to hold both this windowing library and a real program. The person who developed the library was perhaps the equal of Mel. I was nowhere close, and just porting the library was at the limits of my ability. I came into the project thinking I understood how to cram code into limited memory; I left with a much better understanding of the things I knew, and vague hints that there was a whole realm that I didn't.

I don't remember much of it now. With multiple gigabytes of RAM and megabytes of L2 cache, it just doesn't seem to matter. But one thing that I do remember is that it had nothing whatsoever to do with debates over the performance implications of a cast. You can find that lesson in the Story of Mel, if you look closely enough, yet somehow it always seems to be missed.