Monday, December 28, 2020

Diving Into First-Run Times for Lambdas Written in Java

A few weeks ago I answered a Stack Overflow question about the poor performance of Java Lambdas. I gave some general suggestions, but didn't provide any data to back them up. This post is my data.

The question posed a situation where one Lambda invoked another, which did nothing. The asker seemed particularly concerned by the time taken by the first function: 6000+ milliseconds when cold, 300 when warm. Rather than set up an identical situation, I'm going to use a simpler example, retrieving a list of CloudWatch log groups. The execution time, as you'll see, is similar.

Before getting started, I want to clarify terminology. When I use the term “cold,” I refer to a Lambda that's just been deployed. The execution time in this case consists of the time to initialize the Lambda runtime, initialize the Java Virtual Machine (JVM), and execute the Lambda's handler function. By comparison, a “warm” invocation is able to use an existing execution environment, so only includes the time to execute the handler function.

OK, on to Teh Codez:

public void handler(Object ignored, Context context)
    long start = System.currentTimeMillis();
    AWSLogs client = AWSLogsClientBuilder.defaultClient();
    long clientCreated = System.currentTimeMillis();
    long apiInvoked = System.currentTimeMillis();
    System.err.format("time to create SDK client = %6d\n", (clientCreated - start));
    System.err.format("time to make API call     = %6d\n", (apiInvoked - clientCreated));

This is an extremely simple Lambda: it creates an AWS SDK client and then uses that client to invoke an API call. Before each of these actions I retrieve the current system time. How long can it take?

Setting a Baseline

People like to say things are “slow.” But that term is meaningless if you don't have context. And to develop that context, you have to know both what is happening, and the physical constraints that apply. For example, on my 100 Mbit/sec Internet connection, it should take a little over 80 seconds to download a gigabyte of data. That feels like a really long time if you're waiting for the file to download, but it's not slow. If that same file takes 5 minutes to download over the same connection, it's reasonable to say that it's slow, to try to figure out why, and to attempt to correct it.

If you look at the example code, you might think that it should execute instantaneously. After all, it's just two operations. If you have think about the fact that at least the second operation makes a network call, you might say that it should take 100 milliseconds or so.

But observations are always better than guesses. Running on my laptop, with an Intel i7-6500U CPU at 2.5 GHz, here's how long it takes (to get these numbers I created a main() that just invokes the handler function):

time to create SDK client =    744
time to make API call     =    469

That's … much longer than one might expect. Over a second. To make sure it isn't a fluke, you should run the code several times. You should also try some other variants of the code. For example, call the handler function twice in a row:

time to create SDK client =    702
time to make API call     =    522
time to create SDK client =     21
time to make API call     =    151

Hmmmm. The time to create the client dropped dramatically. The time to make the API call also dropped, and is closer to what I'd expect from a network service call. At this point I might also use the Linux time command:

> time java -jar target/sandbox-lambda-java-1.0-SNAPSHOT.jar 
time to create SDK client =    576
time to make API call     =    623

real    0m1.320s
user    0m2.130s
sys     0m0.113s

The “real” value is also known as “wall clock” time: it's what I'm measuring by saving timestamps, and it's pretty close to the timings I print from within the program. The “user” time is the actual CPU time consumed by the program; it's nearly twice the real time, which indicates that the CPU is doing a lot of work. If the program was spending all of its time making network calls, that user time should be less than the real time.

This information alone is enough to make a first pass at optimization.

Memory Allocation

When you create a Java Lambda using the Console, you get a default memory allocation of 512 MB. While Java has a not always deserved reputation for excessive memory consumption, this is more than enough to run the example program. In fact, if you look at the post-run statistics, you'll see that “Max memory used” is around 150 MB.

However, maximum memory consumption only tells part of the story. A much — much — more important fact is that Lambda provides CPU in proportion to allocated memory. To get one virtual CPU, you need to allocate 1,769 MB of memory to your function. The AWS docs don't provide a table of how much CPU you'd get for a given memory size, but assuming a linear relationship, 512 MB would be a third of a CPU — a virtual CPU of unspecified performance.

The table below compares timings for different memory allocations. These are all “cold” times: before each test I uploaded the deployment bundle, which forced Lambda to create a new execution environment. I invoked the Lambda from the Console, providing a dummy event. For consistency with the previous numbers, times are in milliseconds.

  512 MB 1024 MB 2048 MB 4096 MB
Create client 5298 2493 1272 1019
Invoke API call 3844 2023 1061 613
Billed duration 9213 4555 2349 1648

If you add up the numbers, you'll see that the billed duration is slightly larger than the sum of the two recorded times. I believe this corresponds to the time taken to start the JVM and invoke the handler function (much like running on my laptop took 1320 ms total, but only 1199 was accounted for by my timestamps).

These numbers also omit the Lambda initialization time, which was approximately 500 ms. This is the time taken to start the Lambda's container, download the function code onto the container, and start the runtime environment. You aren't billed for this time, but it does affect the response time of the function, and Java Lambdas seem to take much longer than, say Python (where initialization takes around 150 milliseconds).

Based on what we know from the baseline performance test, the numbers make sense: at 4096 MB we have the equivalent of slightly more than two virtual CPUs, and the execution times are in line with what I saw running on my laptop (for what it's worth, increasing the memory size to 8192 MB, which should be 4+ vCPUs, does not significantly affect these timings). This leads to my number one rule for Lambda performance tuning:

The most important thing that you can do to improve Lambda performance is increase memory allocation.

This advice is, of course, subject to caveats. This example program is short, has high CPU usage relative to its runtime, and can exploit multiple virtual CPUs. A long-running Lambda that spends most of its time waiting for network calls may not benefit as much from the CPU boost.

However, almost any Java Lambda will benefit from increased memory allotment as a result of garbage collection: the less memory you give it, the more often collection has to run. I added some code to track garbage collection time, and the 512 MB run consumed nearly 200 milliseconds, versus 30 for the 4096 MB run. Again, this depends on what the program is doing, but in general a larger heap means that more of the program's garbage will never make it out of the “young” generation, which can be collected more efficiently than the “tenured” generation.


So what's consuming all of the CPU time for this Lambda? At least part of the answer is classloading: to run this simple program, the JVM loads 3,329 classes.

The JVM loads classes on-demand. It won't load the AWSLogsClientBuilder class until it it executes the line that calls defaultClient(). Loading and initializing a class require loading any classes that it depends on, and so on. Even though an individual class can be loaded very quickly, the total classloading time adds up.

Unfortunately, there aren't a lot of ways to avoid this cost, especially in a simple program like the example. The tricks available to stand-alone Java programs aren't available in Lambda.

However, for real-world applications you can make architectural choices that minimize the number of classes that need to be loaded. One of the simplest is to avoid large frameworks such as Spring.

A Different SDK

Another possibility might be to replace the standard AWS SDK with something that loads fewer classes. In November 2018 AWS released version 2 of its Java SDK, which is described as “major rewrite of the version 1.x code base.” You'll occasionally see recommendations to use it for improved performance (including in the SDK docs themselves, via a link that doesn't go anywhere).

But, as I said before, there's no substitution for observation. Here are the numbers using version 2.15.53:

  512 MB 1024 MB 2048 MB 4096 MB
Create client 4965 2278 1141 959
Invoke API call 4235 2062 1047 661
Billed duration 9237 4357 2204 1637

No meaningful change. To be fair, I just used the default configuration. The v2 SDK lets you change out the underlying HTTP implementation, so maybe a different one would give better numbers. But that seems like a lot of work for “maybe.”

For that matter, switching to the v2 SDK requires a signficant amount of tedious recoding to change package and method names. And as-of this writing, there are still some features that aren't supported by v2. So I don't recommend making that switch until and unless there's a compelling reason.


Earlier this year I wrote an article about the different ways to package a Java Lambda. In that article I explored why the Lambda documentation recommended against using an “UberJar” produced by the Maven Shade plugin — even though that same documentation uses that plugin for examples. However, I didn't record the performance gained by switching to the Assembly plugin.

Here's that comparison. I've taken the billed duration from the previous table, and compared it to the billed duration when packaged via the assembly plugin. Rather than show all of the memory sizes, I just show the two extremes:

  512 MB 4096 MB
Shade Plugin 9213 1648
Assembly Plugin 8138 1358

So, a decent speedup with a small memory allocation (which I hope you woudn't use after reading this post!), and a minor speedup when you have plenty of memory (and CPU). Still, any speedup is a good speedup, and this requires little effort.

Provisioned Concurrency

On Stack Overflow, the typical answer to people who are concerned about first-run execution time is “use Provisioned Concurrency.” I suspect that the people who say this have never actually used provisioned concurrency, because it's not a silver bullet. In fact, if you enable provisioned concurrency for my example program, you'll see no change in first-run execution time.

The reason is that my example does everything inside its handler function, and so incurs all of the cost of classloading and initialization when that function executes. Provisioned concurrency won't help with that.

To make Provisioned Concurrency help with Java first-start times, you need to move all of the code that triggers classloading and initialization into a constructor (or a static initializer, but trust me, don't go there). This adds to the complexity of your Lambda, because you have to ensure that you fully load the SDK (and other) classes that you need to run, without actually changing anything (you don't want to write bogus data to your production DynamoDB table!).

Assuming that you've gotten everything working, you'll still have the “N + 1” problem: unless you dramatically over-provision, you'll still get the occasional cold start. Perhaps it happens first thing in the morning, when all of your users connect for the first time. Or perhaps it happens when your site gets mentioned on Hacker News. Sooner or later, it will happen.

Finally, there's the matter of cost: with Provisioned Concurrency, you are paying for an always-on machine. In my example, with 4 GB allocated to the Lambda, enabling Provisioned Concurrency would cost approximately $45/month in addition to the per-request cost. It's hard to find an exact equivalent in the EC2 world, but a t3.medium has the same memory and virtual CPU count, and costs a little over $30 per month (both prices are for us-east-2). So if you're planning to replace your “traditional” Java web-app with Lambdas to save costs, you'll get the opposite.

Wrapping Up

I first implemented a Lambda web-app with Java nearly four years ago, and my opinion hasn't changed since then: if you need fast response times, then Java should not be your language of choice. Use Python, or NodeJS, or Go. Or if you must use Java, but want the benefits of Serverless computing, deploy it on ECS Fargate as a traditional J2EE application with embedded Jetty server.

That doesn't mean that Java is always a bad choice for Lambdas. If you have a long-running task, such as processing large files, then startup time pales in comparison to overall runtime. And Java is an especially good choice for CPU-intensive tasks, because the Hotspot engine will optimize performance.

The bottom line is that writing your Lambda functions in Java, like everything else in software engineering, is a tradeoff. Pick your application architecture based on your actual needs, not philosophical stances.

Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single applicaton policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but these only accept a single document. Which means that you can build a complex policy document step by step, but not as a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long a does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  receive_count = 1

The core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout

As you can see, all configuration is via module variables. Rather than show the entire file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${}-consumer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
    resources = [

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                    "Resource" = [

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
    "Resource" = [

Now I could make the policy reference the local, rather than either inline JSON or a datasource:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in

output "producer_policy_statement" {
  value = local.producer_policy_statement

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role =

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [

Now, rather than counting against your quota of policy attachments, these statements count against your quote for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.

But in my opinion, the best feature of this implmentation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).

Saturday, July 4, 2020

Packaging Java for AWS Lambda

When I want to package a Java application in a single JAR with all of its dependencies, I normally turn to Maven’s Shade plugin. This plugin builds an “UberJAR” by unpacking all of the files from the project's dependencies and repackaging them into a single JAR.

The Shade plugin is also used by the Lambda Developer's Guide. I was surprised, then to see this later in the same guide:

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files.

I was surprised because a single JAR containing class files should be the most efficient way to deploy a project. The JVM memory-maps the JAR files on its classpath, which means that it can access arbitrary parts of those files without an explicit call to the OS kernel. And UberJAR means that the JVM only has to examine a single directory structure.

After some investigation, I discovered the reason: Lamdba unpacks the contents of the deployment package into its filesystem. Which means that an UberJAR turns into a lot of little files, each of which must be read using several kernel calls.

OK, so if you’re a Maven user, how do you package your Lambdas in the prefered way? The answer is that instead of using the Shade plugin, you use the Assembly plugin. The stated goal of this plugin is to allow developers to package their project artifact with related information such as a website. But it's a very flexible plugin, using a “deployment descriptor” to specify which files should be included in the final artifact, and where they should go.

This is the descriptor that I used for my Java Lambda example. It's duplicated for each of the Lambdas, and is found at src/assembly/deployment.xml:

<assembly xmlns="" xmlns:xsi=""

The descriptor is simple: it says that every JAR used by the Lambda — including the one built by the project itself — should be stored in the /lib directory of the final artifact. To use it, I replace the Shade configuration in my POM with this (the plugin.assembly.version property defined as 3.2.0):


And that's it: when I run mvn package I get a file in my target directory with the suffix

Sunday, June 21, 2020

A History of Java Logging Frameworks, or, Why Commons-Logging is Still So Common

In the beginning there was System.out.println(). Or, for those of us who were purists (or wanted our output immediately), System.err.println(). And every developer had his or her own preferences for how their log messages would appear, leading to some very messy output as the number of developers on a team increased.

To bring order to this chaos, many projects implemented their own logging framework. I worked on one such project in the late 90s: the framework consisted of a single class that implemented a static log() method. I can't remember what the logging output looked like, but I suspect that it included a consistently-formatted timestamp.

According to this article by Ceki Gülcü, a project that he was working on in 1996 also implemented their own logging framework. But unlike the project I worked on, its framework was released to the public in 1999 as Log4J.

Something else that became public in 1999 was the Jakarta project, a collaboration between Sun and the Apache Software Foundation to produce the Tomcat application server. And of course Tomcat, being a large application with contributions by many people, had its own logging framework (and it still does, although the implementation and purpose has changed over time).

And lastly, 1999 was also the year that JSR 047, the Java Logging API Specification, was released. It turned into the java.util.logging (JUL) package, released as part of JDK 1.4 in 2002.

A plethora of logging frameworks isn't a problem if you're an application developer: you pick one and stick with it. If you're developing a library that might be used by those applications, however, it's a nightmare. If your library uses Log4J and the application uses JUL, then the output becomes a mess and the developers using your library complain.

At the time, the Jakarta project was arguably the largest single producer of libraries for Java application developers, so they added another: Jakarta Commons Logging (since renamed to Apache Commons Logging, but you'll still see the initials "JCL" in the documentation). The idea of Commons Logging was that you would write your code against the JCL API, add the JCL JAR to your dependencies, and it would figure out what actual logging framework you were using.

Although Commons Logging was intended for libraries, application developers adopted it as well. I can't speak for anyone else, but I looked at it as “won't hurt, and means I don't need to keep track of multiple APIs.” Unfortunately, some developers discovered that it could hurt: they were using Tomcat, regularly redeploying their applications, and had memory leaks that would eventually cause Tomcat to stop working.

Looking back, it appears that these leaks were due to missing transitive dependencies in the deployment bundle.* This took place in the days before Maven 2, when developers were responsible for identifying every JAR that went into their application, and ensuring that it somehow got there (which often meant lots of JARs checked into source control). It wouldn't be obvious that a library used Commons Logging, so the application developer wouldn't bother to add it to the deployed WAR. Unfortunately, Tomcat made it available on the system classpath (because it used Commons Logging internally), so the developers never knew they were missing the JAR. And since Commons Logging needed to know about the actual deployed logging framework, it would establish a strong reference to the Log4J implementation that was in the WAR, preventing the classloader from unloading the classes belonging to the WAR.

That problem was rather quickly resolved: Commons Logging version 1.1 was released in 2006, Tomcat 6 moved it off the public classpath (although Tomcat 5 remained “in the wild” for quite some time), and Maven 2 ensured that a WAR would contain all of the dependencies that it needed. But developers have very long memories for things that go wrong, especially things that happened to someone else who blogged about it.**

At the same time, several popular Java frameworks appeared; Hibernate in 2001 and Spring in 2002 are two of the most familiar. These frameworks were complex enough to need logging, but for obvious reasons wouldn't want to be tied to a specific implementation. Commons Logging provided that capability (and thus became an untracked dependency for many builds).

Moving forward, the new millennium saw continued change in the logging world. Log4J became an Apache project. Ceki Gülcü left Apache and developed Logback and SLF4J. And in the early 2010s, the remaining Apache Log4J committers decided that the Log4J 1.x implementation couldn't be extended and completely rewrote it as Log4J 2.x.

Of these, SLF4J is particularly interesting because it was a logging facade, in direct competition with Commons Logging. Unlike Commons Logging, which tried to infer what underlying framework you were using, SLF4J required you to explicitly include “bridge” JARs for your actual implementation. SLF4J also provided additional features, such as formatted log messages, that were very attractive to application developers.

However, adopting SLF4J had its own pain point: if you used Spring (or Hibernate, or any other library that dependent on Commons Logging), Maven would add it to your build as a transitive dependency. Where it might take precedence over the “slf4j-jcl” bridge from SLF4J (it all depended on the order that JARs were given to the classloader). A key feature of Maven POMs from this era are multiple <exclusions> to prevent such transitive dependencies.

So here we are in 2020, and the logging frameworks scene is more complex than ever:

  • Log4J 1.x is still used by many projects, even though it was officially end-of-lifed in 2015. One of its most useful features doesn't work under Java 9 (and, I presume, later versions), so its popularity may fade (although it seems that many people, particularly those using OpenJDK, are quite happy with Java 8).
  • SLF4J/Logback is still used by many developers (including myself), even though new releases seem to have stalled at the 1.3.0-alpha stage (after 25 years of writing logging frameworks, I'm guessing Ceki is in need of a break).
  • Log4J 2.x provides “bridge” JARs that let people use Commons Logging and SLF4J as their API, with Log4J2 as the back-end.
  • Commons Logging still exists, but hasn't seen a release since 2014. Nor has its list of supported frameworks changed: Log4J 1.x, JUL, and Avalon LogKit.

Perhaps counter-intuitively, even with all of these changes, Commons Logging is still used by many libraries. However, it's become less visible. Spring Framework, for example, implements the API internally; as an application developer, you no longer need to explicitly exclude the JAR. And if you use Spring Boot, its a 3,000+ line dependency-management POM will explicitly exclude Commons Logging from the libraries that use it.

If you're developing a library, I think that Commons Logging is still the best choice for internal logging. It provides a consistent interface, and it's reasonable to expect that the consumers of your library already have the bridge JARs that they need (which might mean the internal implementation in Spring Framework). But there are a few best practices to keep your users from cursing you:

  • Mark your dependency as provided. This tells Maven (or Gradle, or any of the other tools that follow the Maven standard) not to resolve the transitive dependency; it will rely instead on an explicitly-referenced JAR to provide the necessary classes.
  • Ensure that you don't establish a transitive dependency via a package that you depend on, like HTTP Components. Take the time to look at your entire dependency tree (using mvn dependency:tree or equivalent), and add an exclusion if anything tries to pull in Commons Logging.
  • Don't implement your own logging facade. It's tempting, I know: you want to protect the people that haven't configured logging into their application. And it seems easy: two classes (a log factory and a logger), with some reflection to pick an appropriate back end. But there's a lot of room for error. If you get it wrong, you'll introduce subtle bugs and performance issues, and your experienced users will curse you (and look for an alternative library). And if you get it right, you'll find that you've re-implemented Commons Logging. And good or bad, it won't actually help inexperienced users: they should learn to use logging rather than deploy a black box and cross their fingers.

Bottom line: if you're writing an application, use whatever logging framework you want. I have a strong preference for SLF4J/Logback, but Log4J 2.x does have some features that it's missing. However, if you're implementing a library, stick with Commons Logging. In almost every case it will Just Work™.

* Open Source Archaeology can be a fascinating experience. It's next to impossible to find a source-code repository that gives a full history for older projects: they typically started on CVS, then moved to SVN, and are now on GIT. In some cases, moving between hosting providers (although it looks like Hibernate is still actively maintained on SourceForge, which makes me feel better about a couple of my older projects). Each move lost information such as version tags (assuming they ever existed).

Maven Central is also less helpful than expected, because many projects changed their group or artifact IDs over their lifetime (plus, who wants to dig through 55 pages of org.springframework JAR listings). And a lot of older versions are “retconned”: they were imported into the repository long after release, with a made-up, minimal POM.

Fortunately, most of the larger projects maintain their own archives, so you can download a particular version and unpack it to see what's there. And if you're looking for dependencies, you can pick a likely class and run javap -c to disassemble it and then look at variable declarations. It's a lot of work, and some might call it obsessive. But that's how I learned that Spring Framework 1.0 didn't log at all, while 1.2 used Commons Logging.

** They're also very vocal about it. I wrote an article on logging techniques in 2008 that referenced Commons Logging (because in my world it wasn't an issue), and constantly received comments telling me that it was horrible and should never be used. I eventually updated the article to reference SLF4J, primarily because it was focused at application logging, but also to keep the critics quiet.

Thursday, April 9, 2020

Stupid Git Tricks: Interactive Rebase

I like to provide a nice, clean history in my pull requests. Reviewers should be able to follow each commit, and see how the functionality is built up. No extranneous commits. Nothing out of order. Everything looking like one smooth path from idea to implementation.

Unfortunately, my development process doesn't quite work that way. For one thing, I commit (and push) frequently — as in every 10-15 minutes when I'm making lots of changes. For another, I'll often realize that there's a small change that should have been introduced several commits previously. For these, and other reasons, I find git rebase -i invaluable.

OK, some of you are probably outraged: “you're changing history!” Settle down. This is for development branches, not master. And I'm willing to adapt in a team setting: if my team members want to see messy commit histories in a pull request, I'm OK with giving that to them. But only if they squash merges.

So, here are a few of the ways that I change history. You're free to avoid them.

Combining commits

Here's one morning's commit history:

commit 6aefd6989ba7712cb047d661b68d34c888badea4 (HEAD -> dev-writing_log4j2, origin/dev-writing_log4j2)
Author: Me 
Date:   Sun Apr 5 12:13:19 2020 -0400

    checkpoint: content updates


commit e8503f01c72618709ac5231a78cfa8549fcfb7b3
Author: Me 
Date:   Sun Apr 5 09:22:51 2020 -0400

    checkpoint: content updates

commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me 
Date:   Sat Apr 4 13:57:27 2020 -0400

    add reference to sample project

Three hours of changes, split up over eight commits, with regular pushes so I wouldn't lose work if my SSD picked today to fail. I really don't want to see all of those in my history.

The solution is to squash those commits down using an interactive rebase:

git rebase -i 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c

When I run this, it starts my editor and shows me the following:

pick e8503f0 checkpoint: content updates
pick f71ddca checkpoint: content updates
pick a8d7a25 checkpoint: content updates
pick 6b87b9b checkpoint: content updates
pick 556a346 checkpoint: content updates
pick 466dd26 checkpoint: content updates
pick 0034657 checkpoint: content updates
pick 6aefd69 checkpoint: content updates

# Rebase 8bdb788..6aefd69 onto 8bdb788 (8 commands)
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
# These lines can be re-ordered; they are executed from top to bottom.
# If you remove a line here THAT COMMIT WILL BE LOST.
# However, if you remove everything, the rebase will be aborted.
# Note that empty commits are commented out

A list of commits, instructions on how to work with them, and a few warnings about what happens if I do something dumb. To squash these commits I update all but the first to be a “fixup”:

pick e8503f0 checkpoint: content updates
f f71ddca checkpoint: content updates
f a8d7a25 checkpoint: content updates
f 6b87b9b checkpoint: content updates
f 556a346 checkpoint: content updates
f 466dd26 checkpoint: content updates
f 0034657 checkpoint: content updates
f 6aefd69 checkpoint: content updates

Save this and exit the editor, and Git applies all of those changes:

Successfully rebased and updated refs/heads/dev-writing_log4j2.

And now when I look at my history, this is what I see:

commit 51f5130422b524603d6249ef40e012aeecde5422 (HEAD -> dev-writing_log4j2)
Author: Me 
Date:   Sun Apr 5 09:22:51 2020 -0400

    checkpoint: content updates

commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me 
Date:   Sat Apr 4 13:57:27 2020 -0400

    add reference to sample project

Note that the last commit hash has changed, and my working HEAD no longer refers to the origin branch. This means that I'm going to need to force-push these changes. But before that, there's one more thing that I want to do:

git commit --amend -m "content updates" --reset-author

This command does two things. First, it updates my commit message: this is no longer a “checkpoint” commit. The second thing it does is update the basic commit info, in this case just the timestamp. If you looked closely at the history above, you saw that all of the commits had been marked with the timestamp of the first; --reset-author makes the history more closely reflect what actually happened (it can also be used to pretend that other people didn't contribute to the commit, but I'll assume you're more honorable than that).

Now the log looks like this:

commit fdef5d6f0a19218784b87a596322816347db2232 (HEAD -> dev-writing_log4j2)
Author: Me 
Date:   Sun Apr 5 12:22:46 2020 -0400

    content updates

commit 8bdb788421c56cb0defe73ce87b9e1ffe4266b0c
Author: Me 
Date:   Sat Apr 4 13:57:27 2020 -0400

    add reference to sample project

Which is what I want to see, so time to force-push and overwrite the previous chain of commits:

> git push -f
Counting objects: 4, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 4.52 KiB | 2.26 MiB/s, done.
Total 4 (delta 3), reused 0 (delta 0)
To ssh://
 + 6aefd69...fdef5d6 dev-writing_log4j2 -> dev-writing_log4j2 (forced update)

I should note here that the previous chain of commits still exists in your repository. If, for some reason, you want to retrieve them, you can explicitly check-out the former head commit:

git checkout -b recovery 6aefd6989ba7712cb047d661b68d34c888badea4

Of course, if you close your terminal window, you might not find that commit hash again, so if you're worried you should write it down it somewhere. When I'm making a large set of changes, I'll often create a temporary branch from the one that's being rebased, just in case (unfortunately, I often forget to switch back to the branch that I want to rebase).

Re-ordering commits

Especially at the start of a new project, I might jump around and edit several different things, resulting in a messy commit history:

pick fc46b5b update docs
pick 0f734fb minor change to feature X
pick 2233c01 update docs
pick fe56f59 another change to feature X
pick d3fb025 related change to feature Y
pick aec87c1 update docs
pick 66ef266 something unrelated to either X or Y
pick 96179b3 changing Y
pick 904a779 update docs

Interactive rebase allows you to move commits around, and then optionally squash those moved commits:

pick fc46b5b update docs
f 2233c01 update docs
f aec87c1 update docs
f 904a779 update docs
pick 0f734fb minor change to feature X
f fe56f59 another change to feature X
pick d3fb025 related change to feature Y
f 96179b3 changing Y
pick 66ef266 something unrelated to either X or Y

There are a couple of gotchas when you do this. First, you need to make sure that you're not changing both X and Y in the same commit. If you do, you can still squash the commits together, but it's pointless to try to track the work in each feature separately.

Second, make sure that you preserve order: in my example, commit 0f734fb happened before fe56f59 the interactive rebase needs to keep them in this order. If you don't, you can end up with merge conflicts that are challenging to resolve.

Lastly, and most important, make sure you have the same number of commits that you started with. If you accidentally delete a commit rather than move it, you will lose that work. For this reason, I tend to use interactive rebase on small pieces of my history, perhaps making several passes over it.

Editing commits

When writing my article about Log4J2 appenders, I saw a comment that I wanted to change in the accompanying example code. Unfortunately, it wasn't the HEAD commit:

commit 8007214ef232abf528baf2968162b51dcd2c09ca
Author: Me 
Date:   Sat Apr 4 09:34:53 2020 -0400


commit 38c610db6a02747d7017dff0a9c2b7ed290e30e1
Author: Me 
Date:   Sat Apr 4 08:34:12 2020 -0400

    stage-10: add tests

commit 5dfd79e3f879038e915fa04c83f8eb9b0f695e35
Author: Me 
Date:   Tue Mar 31 08:38:17 2020 -0400

    stage-9: implement a lookup

There are two ways that I could have approached this. The first would be to create a new commit and then reorder it and turn it into a fixup. The second is to edit the file as part of an interactive rebase, by marking the commit with an "e":

pick 5dfd79e stage-9: implement a lookup
e 38c610d stage-10: add tests
pick 8007214 README

When I do this, git works through the commits, and stops when it reaches the marked one:

Stopped at 38c610d...  stage-10: add tests
You can amend the commit now, with

  git commit --amend 

Once you are satisfied with your changes, run

  git rebase --continue

I can now edit any files in my working tree (they don't have to be part of the original commit). Once I'm done, I do a git add for changed files, followed by both git commit --amend and git rebase --continue. After the rebase completes, I can force-push the changes.

Beware that editing commits can introduce merge conflicts: if a later commit touches the same code, you'll have to stop and resolve the conflict. This is more likely when you edit early commits, or when the edits are wide-ranging. It is far less likely for changes like comments.

Cherry-picking into the middle of a branch

You may be familiar with git cherry-pick, which takes an arbitrary commit and puts it at the HEAD of your current branch. This can be useful when two teams are working on the same general area of the codebase: often one team will incidentally do something that the other team finds valuable.

Interactive rebase is like cherry-picking on steroids: you can insert a commit anywhere in your commit tree. To be honest, I find this more risky than beneficial; instead I would cherry-pick to HEAD and then perhaps use an interactive rebase to move the commit to where it “belongs.” But in the interest of “stupid git tricks,” here we go.

Let's say that you've been working on a branch and have been making changes, starting with changes to the build scripts. Then you talk with a colleague, and learn that she has also made changes to the build. You could cherry-pick her change to the end of your branch and use it moving forward, but you're somewhat OCD and want to keep the build changes together. So you fire up git rebase -i and add your colleague's commit as a new “pick”:

pick ffc954d build scripts
p 1438a13d11d6001de876a034f434a050c09b587d
pick b497403 update 1
pick 18e8415 update 2
pick 33a4e9d update 3

Now when you do a git log, you see something like this:

... skipping two commits

commit 7b62acb8d9100f379a0d43e3227c36ae91c1edd9
Author: Me 
Date:   Fri Mar 27 10:11:01 2020 -0400

    update 1

commit c579ed88403354faed83213da63d4546c5aa13b5
Author: Someone Else 
Date:   Sun Jan 5 09:30:14 2020 -0500

    some build changes

commit ffc954dc41555282ece3e2b7a0197472c0af9f11
Author: Me 
Date:   Mon Jan 6 08:02:30 2020 -0500

    build scripts

Note that the commit hash has changed: from 1438a13d to c579ed88. This is because it's now part of a new branch: a commit hash is based not merely on the content of the commit, but also the commit chain that it's a part of. However, the committer's name and the commit date are unchanged.

Wrapping up: a plea for clean commit histories

A standard Git merge, by preserving the chains of commits that led to the merge point, is both incredibly useful and incredibly annoying. It's useful, in that you can move along the original branch to understand the context of a commit. It's annoying, in that your first view of the log shows the commits intermingled and ordered by date, completely removing context.

I find messy histories to be similar. Software development doesn't happen in a clean, orderly fashion: developers often attach a problem from multiple sides at once. And that can result in commit histories that jump around: instead of “fix foo” you have “add test for bar”, followed by “make test work”, followed by an endless string of “no, really, this time everything works”.

Maybe you find that informative. If not, do your future self (and your coworkers) a favor and clean it up before making it part of the permanent record.

Monday, March 30, 2020

S3 Troubleshooting: when 403 is really 404

It's generally a bad idea to click on links in strange websites, but this one is key to this post. If you were to click this link, you'd see a response like the following, and more importantly, the HTTP response code would be a 403 (Forbidden).

  <Message>Access Denied</Message>

However, that bucket is open to the world: AWS reminds me of this by flagging it as “Public” in bucket listings, and putting a little orange icon on the “Permissions” tab when I look at its properties. And if you click this similar link you'll get a perfectly normal test file.

The difference between the two links, of course, is that the former file doesn't exist in my bucket. But why isn't S3 giving me a 404 (Not Found) error? When I look at the list of S3 error responses I see that there's a NoSuchKey response — indeed, it's the example at the head of that page.

As it turns out, the reason for the 403 is called out in the S3 GetObject API documentation:

  • If you have the s3:ListBucket permission on the bucket, Amazon S3 will return an HTTP status code 404 ("no such key") error.
  • If you don’t have the s3:ListBucket permission, Amazon S3 will return an HTTP status code 403 ("access denied") error.

While my bucket allows public read access to its objects via an access policy, that policy follows the principle of least privilege and only grants s3:GetObject. As do almost all of the IAM policies that I write for things like Lambdas.

Which brings me to the reason for writing this post: Friday afternoon I was puzzling over an error with one of my Lambdas: I was creating dummy files to load-test an S3-triggered Lambda, and it was consistently failing with a 403 error. The files were in the source bucket, and the Lambda had s3:GetObject permission.

I had to literally sleep on the error before realizing what happened. I was generating filenames using the Linux date command, which would produce output like 2020-03-27T15:43:31-04:00. However, S3 notifications url-encode the object key, so the key in my events looked like this: 2020-03-27T15%3A43%3A31-04%3A00-c. Which, when passed back to S3, didn't refer to any object in the bucket. But because my Lambda didn't have s3:ListObjects I was getting the 403 rather than a 404.

So, to summarize the lessons from my experience:

  1. Always url-decode the keys in an S3 notification event. How you do this depends on your language; for Python use the unquote_plus() function from the urllib.parse module.
  2. If you see a 403 error, check your permissions first, but also look to see if the file you're trying to retrieve actually exists.

You'll note that I didn't say “grant s3:ListObjects in your IAM policies.” The principle of least privilege still applies.

Monday, January 6, 2020

The Future of Open Source

The world of open source software seems to be going through a period of soul-searching. On the one hand, individual maintainers have retracted packages, causing disruption for the communities that depended on those packages. On the other, software-as-a-service providers are making more money from some applications than their creators.

This is all happening in a world where businesses depend on open-source to operate. It doesn't matter whether you're an individual launching a startup with PHP and MySQL, or a multi-national replacing your mainframe with a fleet of Linux boxes running Java. Your business depends on the work of people that have their own motivations, and those motivations may not align with yours. I think this is an untenable situation, one that will eventually resolve by changing the nature of open-source.

Before looking at how I think it will resolve, I want to give some historical perspective. This is one person's view; you may not agree with it.

I date the beginning of “professional” open source as March 1985: that was the month that Dr Dobbs published an article by Richard Stallman, an article that would turn into the GNU Manifesto. There was plenty of freely available software published prior to that time; my experience was with the Digital Equipment Corporation User Society (DECUS), which published an annual catalog of programs ranging in complexity from fast fourier transform routines to complete language implementations. These came with source code and no copyright attached (or, at least, no registered copyright, which was an important distinction in the 1970s and early 1980s).

What was different about the GNU Manifesto, and why I refer to it as the start of “professional” open source, was that Stallman set out a vision of how programmers could make money when they gave away their software. In his view, companies would get software for free but then hire programmers to maintain and enhance it.

In 1989, Stallman backed up the ideas of the GNU Manifesto with the Gnu Public License (GPL), which was applied to the software produced by the GNU project. This licence introduced the idea of “copyleft”: a requirement that any “derivative works” also be licensed using the GPL, meaning that software developers could not restrict access to their code. Even though that requirement was relaxed in 1991 with the “library” (now “lesser”) license, meaning that you could use the GNU C compiler to compile your programs without them becoming open source by that act, the GPL scared most corporations away from any use of the GNU tools (as late as 1999, I was met with a look of shock when I suggested that the GNU C compiler could make our multi-platform application easier to manage).

In my opinion, it was the Apache web server, introduced in 1995, that made open-source palatable (or at least acceptable) to the corporate world. In large part, this was due to the Apache license, which essentially said “do what you want, but don't blame us if anything goes wrong.” But also, I think it was because the corporate world was completely unprepared for the web. To give a sense of how quickly things moved: in 1989 I helped set up the DNS infrastructure for a major division of one of the world's largest corporations; I had only a few years of experience with TCP/IP networking, but it was more than the IT team. NCSA Mosaic appeared four years later, and within a year or two after that companies were scrambling to create a web presence. Much like the introduction of PCs ten years earlier, this happened outside of corporate IT; while there were commercial web-servers (including Microsoft and Netscape), “free as in beer” was a strong incentive.

Linux, of course, was a thing in the late 1990s, but in my experience wasn't used outside of a hobbyist community; corporations that wanted UNIX used a commercial distribution. In my view, Linux became popular due to two things: first, Eric Raymond published The Cathedral and the Bazaar in 1997, which made the case that open source was actually better than commercial systems: it has to be good to survive. But also, after the dot-com crash, “free as in beer” became a selling point, especially to the startups that would create “Web 2.0”

Jumping forward 20 years, open-source software is firmly embedded in the corporate world. While I'm an oddity for running Linux on the desktop, all of the companies I've worked with in the last ten or more years used it for their production deployments. And not just Linux; the most popular database systems are open source, as are the tools to provision and manage servers, and even productivity tools such as LibreOffice. And for most of the users of these tools, “free as in beer” is an important consideration.

But stability is (or should be) another important consideration, and I think that many open-source consumers have been lulled into a false sense of stability. The large projects, such as GNU and Apache, have their own repositories and aren't going anywhere. And the early “public” repositories, such as SourceForge and Maven Central, adopted a policy that “once you publish something here, it will never go away.” But newer repositories don't have such a policy, and as we saw with left-pad in 2016 and chef-sugar in 2019, authors are willing and able to pull their work down.

At the same time, companies such as Mongo and Elastic.NV found that releasing their core products as open-source might not have been such a great idea. Software-as-a-service companies such as AWS are able to take those products and host them as a paid service, often making more money from the hosting than the original companies do from the services they offer around the product. And in response, the product companies have changed the license on their software, attempting to cut off that usage (or at least capture a share of it).

Looking at both behaviors, I can't help but think that one of the core tenets of the GNU manifesto has been forgotten: that the developers of open-source software do not have the right to control its use. Indeed, the Manifesto is quite clear on this point: “[programmers] deserve to be punished if they restrict the use of these programs.”

You may or may not agree with that idea. I personally believe that creators have the right to decide how their work is used. But I also believe that releasing your work under an open-source license is a commitment, one that can't be retracted.

Regardless of any philosophical view on the matter, I think there are two practical outcomes.

The first is that companies — or development teams — that depend on open-source software need to ensure their continued access to that software. Nearly eight years ago I wrote about using a local repository server when working with Maven and Java. At the time I was focused on coordination between different development teams at the same company. If I were to rewrite the post today, it would focus on using the local server to ensure that you always have access to your dependencies.

A second, and less happy change, is that I think open-source creators will lose the ability to control their work. One way this will happen is for companies whose products are dependent on open-source to provide their own public repositories — indeed, I'm rather amazed that Chef doesn't offer such a repository (although perhaps they're gun-shy after the reaction to their hamfisted attempt to redistributed chef-sugar).

The other way this will happen is for service-provider companies to fork open-source projects and maintain their own versions. Amazon has already done this, for Elasticsearch and also OpenJDK; I don't expect them to be the only company to do so. While these actions may damage the companies' reputations within a small community of open-source enthusiasts, the much larger community of their clients will applaud those actions. I can't imagine there are many development teams that will say “we're going to self-host Elasticsearch as an act of solidarity”; convenience will always win out.

If you're like me, a person responsible for a few niche open-source projects, this probably won't matter: nobody's going to care about your library (although note that both left-pad and chef-sugar at least started out single-maintainer niche projects). But if you're a company that is planning to release your core product as open-source, you should think long and hard about why you want to do this, and whether your plan to make money is viable. And remember these words from the GNU Manifesto: “programming will not be as lucrative on the new basis as it is now.”