Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

public class Handler
{
    public void handler(Object ignored, Context context)
    {
        long start = System.currentTimeMillis();

        AWSLogs client = AWSLogsClientBuilder.defaultClient();

        long clientCreated = System.currentTimeMillis();

        client.describeLogGroups();

        long apiInvoked = System.currentTimeMillis();

        System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
        System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
    }
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes. To get these numbers I created a main() that just invokes the handler function.
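The harness is nothing more than this (a sketch: the TimingMain class is my own invention, and since the handler ignores both of its arguments, nulls are fine):

public class TimingMain
{
    public static void main(String[] argv)
    {
        Handler handler = new Handler();
        handler.handler(null, null);
        // add a second handler.handler(null, null) call here to see the
        // warm-JVM numbers shown below
    }
}

Here's the output: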

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program was spending all of its time making network calls, that user time should be less than the real time.

This information alone is enough to make a first pass at optimization.

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not-always-deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.
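These statistics appear in a REPORT line at the end of each invocation's log output. It looks something like the following; the request ID is made up, and the other values are typical of the 512 MB runs shown below:

REPORT RequestId: 5b442a75-7b53-4c5e-a0c6-2d1f1c8a9e44  Duration: 9212.47 ms  Billed Duration: 9213 ms  Memory Size: 512 MB  Max Memory Used: 150 MB  Init Duration: 524.53 ms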

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB gets you a bit less than a third of one virtual CPU (a virtual CPU itself being of unspecified performance).

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        5298      2493      1272      1019
Invoke API call      3844      2023      1061       613
Billed duration      9213      4555      2349      1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 ms was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
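If you want to check this in your own functions, the JVM exposes cumulative collection times through its management beans; a helper like this sketch (the class and method names are mine) is enough:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GCTimer
{
    // returns the cumulative milliseconds spent in garbage collection since
    // JVM start, summed across all collectors; call it before and after the
    // code you care about and subtract
    public static long totalCollectionTime()
    {
        long total = 0;
        for (GarbageCollectorMXBean mxbean : ManagementFactory.getGarbageCollectorMXBeans())
        {
            // getCollectionTime() returns -1 for collectors that don't report
            total += Math.max(0, mxbean.getCollectionTime());
        }
        return total;
    }
}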

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class require loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.
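You can watch this happen by running the program with the JVM's -verbose:class flag, which logs each class as it's loaded (the exact format of the log lines depends on your JVM version):

> java -verbose:class -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar

Counting those log lines gives you the total.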

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).
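For reference, here's what the example looks like with the v2 SDK, minus the timing code (a sketch; the client class comes from the v2 CloudWatch Logs module):

import com.amazonaws.services.lambda.runtime.Context;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

public class HandlerV2
{
    public void handler(Object ignored, Context context)
    {
        // same two steps as the v1 example: create a client with default
        // configuration, then make the API call
        CloudWatchLogsClient client = CloudWatchLogsClient.create();
        client.describeLogGroups();
    }
}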

But, as I said before, there's no substitute for observation. Here are the numbers using version 2.15.53:

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        4965      2278      1141       959
Invoke API call      4235      2062      1047       661
Billed duration      9237      4357      2204      1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”
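For what it's worth, the swap itself is only a few lines. Here's a sketch that replaces the default Apache client with the JDK's built-in URL-connection client; it assumes the software.amazon.awssdk:url-connection-client artifact is on your classpath:

import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

CloudWatchLogsClient client = CloudWatchLogsClient.builder()
                              .httpClient(UrlConnectionHttpClient.create())
                              .build();

Whether it actually loads fewer classes is something you'd have to measure.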

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't measure the performance gain from switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

                   512 MB   4096 MB
Shade Plugin         9213      1648
Assembly Plugin      8138      1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this one requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
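For the example program, that restructuring looks like the sketch below; describeLogGroups() happens to be a safe read-only call, so it can double as a warm-up:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

public class Handler
{
    private AWSLogs client;

    public Handler()
    {
        // this code runs during initialization, which is what Provisioned
        // Concurrency pre-pays for: creating the client and making one
        // read-only call forces classloading and connection setup up-front
        client = AWSLogsClientBuilder.defaultClient();
        client.describeLogGroups();
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}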

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost (at the published rate of $0.0000041667 per GB-second, 4 GB for a 31-day month works out to just under $45). It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with an embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the HotSpot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.