blog.kdgregory.com

Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

// assumed imports: the v1 CloudWatch Logs SDK and the Lambda runtime interface
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;

// the class name is arbitrary; it just has to match the Lambda's handler configuration
public class Handler
{
    public void handler(Object ignored, Context context)
    {
        long start = System.currentTimeMillis();

        // first timed operation: create the SDK client
        AWSLogs client = AWSLogsClientBuilder.defaultClient();

        long clientCreated = System.currentTimeMillis();

        // second timed operation: make an API call with that client
        client.describeLogGroups();

        long apiInvoked = System.currentTimeMillis();

        System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
        System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));
    }
}

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before and after each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you then think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a main() that just invokes the handler function):

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; here it's more than 60% higher than the real time, which indicates that the CPU is doing a lot of work, on more than one core. If the program were spending all of its time making network calls, the user time would be far less than the real time.

This information alone is enough to make a first pass at optimization.
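
For reference, the laptop numbers above come from a trivial harness that invokes the handler directly; a minimal sketch (the harness class is my own, and it assumes AWS credentials and region are available from the environment):

// hypothetical harness for running the handler outside of Lambda
public class TimingMain
{
    public static void main(String[] argv)
    {
        Handler handler = new Handler();
        handler.handler(null, null);    // the handler ignores both arguments, so nulls are fine
        handler.handler(null, null);    // second call shows the effect of an already-initialized JVM and SDK
    }
}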

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not-always-deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you get for a given memory size, but assuming a linear relationship, 512 MB works out to a bit less than a third of a single vCPU (itself of unspecified performance).

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        5298      2493      1272      1019
Invoke API call      3844      2023      1061       613
Billed duration      9213      4555      2349      1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say, Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.
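
The tracking code itself is nothing fancy; a sketch of the approach, using the standard garbage-collector MXBeans (not necessarily what I ran verbatim):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats
{
    // cumulative time spent in garbage collection, in milliseconds, across all collectors
    public static long totalGcMillis()
    {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans())
        {
            long millis = bean.getCollectionTime();
            if (millis > 0) total += millis;    // a collector may report -1 if it doesn't track time
        }
        return total;
    }
}

Call it at the start and end of the handler, and log the difference.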

Classloading

So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it executes the line that calls defaultClient(). Loading and initializing a class requires loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.
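
If you want to see the number for your own program, the JVM will tell you; a sketch using the class-loading MXBean (running with the -verbose:class flag and counting lines works too):

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassloadStats
{
    public static void report()
    {
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        // classes currently loaded, and the total loaded since JVM start
        System.err.format("currently loaded classes = %6d\n", bean.getLoadedClassCount());
        System.err.format("total loaded classes     = %6d\n", bean.getTotalLoadedClassCount());
    }
}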

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as a “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).

But, as I said before, there's no substitute for observation. Here are the numbers using version 2.15.53:

                   512 MB   1024 MB   2048 MB   4096 MB
Create client        4965      2278      1141       959
Invoke API call      4235      2062      1047       661
Billed duration      9237      4357      2204      1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”
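
For reference, here's what swapping the HTTP implementation looks like; a sketch using the URLConnection-based client, which I haven't benchmarked:

import software.amazon.awssdk.http.urlconnection.UrlConnectionHttpClient;
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;

public class V2ClientFactory
{
    // requires the url-connection-client artifact on the classpath, and excluding the
    // default Apache client from the SDK dependency if you don't want it loaded at all
    public static CloudWatchLogsClient lightweightClient()
    {
        return CloudWatchLogsClient.builder()
                                   .httpClientBuilder(UrlConnectionHttpClient.builder())
                                   .build();
    }
}

The code change is trivial; the work is in the dependency management, and in measuring whether it actually helps.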

For that matter, switching to the v2 SDK requires a significant amount of tedious recoding to change package and method names. And as of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.

Packaging

Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

                   512 MB   4096 MB
Shade Plugin         9213      1648
Assembly Plugin      8138      1358

So, a decent speedup with a small memory allocation (which I hope you wouldn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).
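
Here's a sketch of what that looks like for the example Lambda (the limit(1) request is simply a cheap, read-only way to pull the SDK's request/response machinery through the classloader):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.AWSLogsClientBuilder;
import com.amazonaws.services.logs.model.DescribeLogGroupsRequest;

public class WarmedHandler
{
    private final AWSLogs client;

    public WarmedHandler()
    {
        // runs during initialization, which provisioned concurrency performs ahead of time
        client = AWSLogsClientBuilder.defaultClient();

        // a harmless read-only call that forces classloading and first-use initialization;
        // pick something that can't change your production data
        client.describeLogGroups(new DescribeLogGroupsRequest().withLimit(1));
    }

    public void handler(Object ignored, Context context)
    {
        client.describeLogGroups();
    }
}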

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with an embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?
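
To make that concrete, the combined policy would look something like this (a sketch, assuming the queue module exports a queue_arn output):

# hypothetical: the invoking module builds the policy itself, so it has to know
# the correct SQS actions for every queue that the application writes to
data "aws_iam_policy_document" "application_producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      module.notifications_queue.queue_arn,
      module.reports_queue.queue_arn
    ]
  }
}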

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single application policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The aws_iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but these only accept a single document. Which means that you can build a complex policy document step by step, but not as a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredoc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, are SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"
}

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  receive_count = 1
}

At the core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count
                                })
}

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
}

As you can see, all configuration is via module variables. Rather than show the entire variables.tf file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string
}

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400
}

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json
}

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}
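
The consumer policy is the mirror image; here's a sketch of its statement (the exact action list is my choice of typical read-side permissions), wrapped in an aws_iam_policy resource just like the producer version:

data "aws_iam_policy_document" "consumer_policy" {
  statement {
    actions = [
      "sqs:ChangeMessageVisibility",
      "sqs:DeleteMessage",
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:ReceiveMessage"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}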

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                      "sqs:GetQueueAttributes",
                      "sqs:GetQueueUrl",
                      "sqs:SendMessage",
                      "sqs:SendMessageBatch"
                    ],
                    "Resource" = [
                      aws_sqs_queue.base_queue.arn
                    ]
                  }]
                })
}

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ],
    "Resource" = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Now I could make the policy reference the local, rather than either inline JSON or a data source:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]
                })
}

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in outputs.tf:

output "producer_policy_statement" {
  value = local.producer_policy_statement
}

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat
}

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role = aws_iam_role.application_role.id

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.rendering_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

Now, rather than counting against your quota of policy attachments, these statements count against your quota for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.

But in my opinion, the best feature of this implementation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).

Saturday, July 4, 2020

Packaging Java for AWS Lambda

When I want to package a Java application in a single JAR with all of its dependencies, I normally turn to Maven’s Shade plugin. This plugin builds an “UberJAR” by unpacking all of the files from the project's dependencies and repackaging them into a single JAR.

The Shade plugin is also used by the Lambda Developer's Guide. I was surprised, then, to see this later in the same guide:

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files.

I was surprised because a single JAR containing class files should be the most efficient way to deploy a project. The JVM memory-maps the JAR files on its classpath, which means that it can access arbitrary parts of those files without an explicit call to the OS kernel. And an UberJAR means that the JVM only has to examine a single directory structure.

After some investigation, I discovered the reason: Lambda unpacks the contents of the deployment package into its filesystem. Which means that an UberJAR turns into a lot of little files, each of which must be read using several kernel calls.

OK, so if you’re a Maven user, how do you package your Lambdas in the preferred way? The answer is that instead of using the Shade plugin, you use the Assembly plugin. The stated goal of this plugin is to allow developers to package their project artifact with related information such as a website. But it's a very flexible plugin, using a “deployment descriptor” to specify which files should be included in the final artifact, and where they should go.

This is the descriptor that I used for my Java Lambda example. It's duplicated for each of the Lambdas, and is found at src/assembly/deployment.xml:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <id>deployment</id>
    <includeBaseDirectory>false</includeBaseDirectory>
    <formats>
        <format>zip</format>
    </formats>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/lib</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>false</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

The descriptor is simple: it says that every JAR used by the Lambda — including the one built by the project itself — should be stored in the /lib directory of the final artifact. To use it, I replace the Shade configuration in my POM with this (with the plugin.assembly.version property defined as 3.2.0):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>${plugin.assembly.version}</version>
    <configuration>
        <descriptors>
            <descriptor>src/assembly/deployment.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

And that's it: when I run mvn package I get a file in my target directory with the suffix -deployment.zip.