Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single application policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The aws_iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but each of those arguments accepts only a single document. That means you can build a complex policy document step by step, but not within a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.
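For illustration, here's a minimal sketch of what that layering looks like (the data source names and the ARN are made up): every additional statement needs yet another data source wrapping the previous one via source_json.

data "aws_iam_policy_document" "base" {
  statement {
    actions   = ["sqs:SendMessage"]
    resources = ["arn:aws:sqs:us-east-1:123456789012:Example"]
  }
}

data "aws_iam_policy_document" "combined" {
  # pull in the statements from "base", then add one more
  source_json = data.aws_iam_policy_document.base.json

  statement {
    actions   = ["sqs:ReceiveMessage"]
    resources = ["arn:aws:sqs:us-east-1:123456789012:Example"]
  }
}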

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredoc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, are SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"
}

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  retry_count = 1
}

At the core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count
                                })
}

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
}

As you can see, all configuration is via module variables. Rather than show the entire variables.tf file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string
}

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400
}
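
For readers following along, the other two variables referenced by the module would look something like this; the defaults here are my own assumptions, not part of the original module.

variable "visibility_timeout" {
  description = "Time (in seconds) that a consumer has to process a message before it is redelivered"
  type = number
  default = 60
}

variable "retry_count" {
  description = "Number of times a message is delivered before being moved to the dead-letter queue"
  type = number
  default = 3
}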

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json
}

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}
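
The consumer policy isn't shown here, but it follows the same pattern; this is a sketch, and the action list is my guess at what a least-privilege consumer needs:

data "aws_iam_policy_document" "consumer_policy" {
  statement {
    actions = [
      "sqs:ChangeMessageVisibility",
      "sqs:DeleteMessage",
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:ReceiveMessage"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}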

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                      "sqs:GetQueueAttributes",
                      "sqs:GetQueueUrl",
                      "sqs:SendMessage",
                      "sqs:SendMessageBatch"
                    ],
                    "Resource" = [
                      aws_sqs_queue.base_queue.arn
                    ]
                  }]
                })
}

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ],
    "Resource" = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Now I could make the policy reference the local, rather than either inline JSON or a data source:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]
                })
}

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in outputs.tf:

output "producer_policy_statement" {
  value = local.producer_policy_statement
}

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat
}

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role = aws_iam_role.application_role.id

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.rendering_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

Now, rather than counting against your quota of policy attachments, these statements count against your quota for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, plus the inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.
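
If you do bump into the inline limit, the same statement outputs can feed a customer managed policy that you then attach to the role. A sketch (the resource names are mine, and only two of the queues are shown):

resource "aws_iam_policy" "application_queue_policy" {
  name   = "ApplicationQueuePolicy"
  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

resource "aws_iam_role_policy_attachment" "application_queue_policy_attachment" {
  role       = aws_iam_role.application_role.name
  policy_arn = aws_iam_policy.application_queue_policy.arn
}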

But in my opinion, the best feature of this implementation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).

Saturday, July 4, 2020

Packaging Java for AWS Lambda

When I want to package a Java application in a single JAR with all of its dependencies, I normally turn to Maven’s Shade plugin. This plugin builds an “UberJAR” by unpacking all of the files from the project's dependencies and repackaging them into a single JAR.

The Shade plugin is also used by the Lambda Developer's Guide. I was surprised, then, to see this later in the same guide:

Reduce the time it takes Lambda to unpack deployment packages authored in Java by putting your dependency .jar files in a separate /lib directory. This is faster than putting all your function’s code in a single jar with a large number of .class files.

I was surprised because a single JAR containing class files should be the most efficient way to deploy a project. The JVM memory-maps the JAR files on its classpath, which means that it can access arbitrary parts of those files without an explicit call to the OS kernel. And an UberJAR means that the JVM only has to examine a single directory structure.

After some investigation, I discovered the reason: Lambda unpacks the contents of the deployment package into its filesystem. Which means that an UberJAR turns into a lot of little files, each of which must be read using several kernel calls.

OK, so if you’re a Maven user, how do you package your Lambdas in the preferred way? The answer is that instead of using the Shade plugin, you use the Assembly plugin. The stated goal of this plugin is to allow developers to package their project artifact with related information such as a website. But it's a very flexible plugin, using an “assembly descriptor” to specify which files should be included in the final artifact, and where they should go.

This is the descriptor that I used for my Java Lambda example. It's duplicated for each of the Lambdas, and is found at src/assembly/deployment.xml:

<assembly xmlns="http://maven.apache.org/ASSEMBLY/2.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/ASSEMBLY/2.0.0 http://maven.apache.org/xsd/assembly-2.0.0.xsd">
    <id>deployment</id>
    <includeBaseDirectory>false</includeBaseDirectory>
    <formats>
        <format>zip</format>
    </formats>
    <dependencySets>
        <dependencySet>
            <outputDirectory>/lib</outputDirectory>
            <useProjectArtifact>true</useProjectArtifact>
            <unpack>false</unpack>
            <scope>runtime</scope>
        </dependencySet>
    </dependencySets>
</assembly>

The descriptor is simple: it says that every JAR used by the Lambda — including the one built by the project itself — should be stored in the /lib directory of the final artifact. To use it, I replace the Shade configuration in my POM with this (with the plugin.assembly.version property defined as 3.2.0):

<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-assembly-plugin</artifactId>
    <version>${plugin.assembly.version}</version>
    <configuration>
        <descriptors>
            <descriptor>src/assembly/deployment.xml</descriptor>
        </descriptors>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>single</goal>
            </goals>
        </execution>
    </executions>
</plugin>

And that's it: when I run mvn package I get a file in my target directory with the suffix -deployment.zip.
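
To deploy it, you point your function at that artifact; for example, with the AWS CLI (the function name and artifact name here are placeholders):

aws lambda update-function-code \
    --function-name my-function \
    --zip-file fileb://target/my-lambda-1.0.0-deployment.zip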

Sunday, June 21, 2020

A History of Java Logging Frameworks, or, Why Commons-Logging is Still So Common

In the beginning there was System.out.println(). Or, for those of us who were purists (or wanted our output immediately), System.err.println(). And every developer had his or her own preferences for how their log messages would appear, leading to some very messy output as the number of developers on a team increased.

To bring order to this chaos, many projects implemented their own logging framework. I worked on one such project in the late 90s: the framework consisted of a single class that implemented a static log() method. I can't remember what the logging output looked like, but I suspect that it included a consistently-formatted timestamp.

According to this article by Ceki Gülcü, a project that he was working on in 1996 also implemented their own logging framework. But unlike the project I worked on, its framework was released to the public in 1999 as Log4J.

Something else that became public in 1999 was the Jakarta project, a collaboration between Sun and the Apache Software Foundation to produce the Tomcat application server. And of course Tomcat, being a large application with contributions by many people, had its own logging framework (and it still does, although the implementation and purpose have changed over time).

And lastly, 1999 was also the year that JSR 047, the Java Logging API Specification, was submitted. It turned into the java.util.logging (JUL) package, released as part of JDK 1.4 in 2002.

A plethora of logging frameworks isn't a problem if you're an application developer: you pick one and stick with it. If you're developing a library that might be used by those applications, however, it's a nightmare. If your library uses Log4J and the application uses JUL, then the output becomes a mess and the developers using your library complain.

At the time, the Jakarta project was arguably the largest single producer of libraries for Java application developers, so they added another: Jakarta Commons Logging (since renamed to Apache Commons Logging, but you'll still see the initials "JCL" in the documentation). The idea of Commons Logging was that you would write your code against the JCL API, add the JCL JAR to your dependencies, and it would figure out what actual logging framework you were using.
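
In practice, writing against the JCL API looks like this (a minimal example; the class name is arbitrary):

// code depends only on the commons-logging API; the back-end is discovered at runtime
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class OrderService {
    private static final Log logger = LogFactory.getLog(OrderService.class);

    public void placeOrder(String orderId) {
        logger.info("placing order " + orderId);
    }
}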

Although Commons Logging was intended for libraries, application developers adopted it as well. I can't speak for anyone else, but I looked at it as “won't hurt, and means I don't need to keep track of multiple APIs.” Unfortunately, some developers discovered that it could hurt: they were using Tomcat, regularly redeploying their applications, and had memory leaks that would eventually cause Tomcat to stop working.

Looking back, it appears that these leaks were due to missing transitive dependencies in the deployment bundle.* This took place in the days before Maven 2, when developers were responsible for identifying every JAR that went into their application, and ensuring that it somehow got there (which often meant lots of JARs checked into source control). It wouldn't be obvious that a library used Commons Logging, so the application developer wouldn't bother to add it to the deployed WAR. Unfortunately, Tomcat made it available on the system classpath (because it used Commons Logging internally), so the developers never knew they were missing the JAR. And since Commons Logging needed to know about the actual deployed logging framework, it would establish a strong reference to the Log4J implementation that was in the WAR, preventing the classloader from unloading the classes belonging to the WAR.

That problem was rather quickly resolved: Commons Logging version 1.1 was released in 2006, Tomcat 6 moved it off the public classpath (although Tomcat 5 remained “in the wild” for quite some time), and Maven 2 ensured that a WAR would contain all of the dependencies that it needed. But developers have very long memories for things that go wrong, especially things that happened to someone else who blogged about it.**

At the same time, several popular Java frameworks appeared; Hibernate in 2001 and Spring in 2002 are two of the most familiar. These frameworks were complex enough to need logging, but for obvious reasons wouldn't want to be tied to a specific implementation. Commons Logging provided that capability (and thus became an untracked dependency for many builds).

Moving forward, the new millennium saw continued change in the logging world. Log4J became an Apache project. Ceki Gülcü left Apache and developed Logback and SLF4J. And in the early 2010s, the remaining Apache Log4J committers decided that the Log4J 1.x implementation couldn't be extended and completely rewrote it as Log4J 2.x.

Of these, SLF4J is particularly interesting because it was a logging facade, in direct competition with Commons Logging. Unlike Commons Logging, which tried to infer what underlying framework you were using, SLF4J required you to explicitly include “bridge” JARs for your actual implementation. SLF4J also provided additional features, such as formatted log messages, that were very attractive to application developers.
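
Those formatted messages are a good example of the attraction; a minimal sketch (the class and message are mine):

// the {} placeholders are substituted only if the message is actually logged,
// so there's no string concatenation at disabled log levels
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReportGenerator {
    private static final Logger logger = LoggerFactory.getLogger(ReportGenerator.class);

    public void generate(int rowCount, long elapsedMillis) {
        logger.debug("generated report: {} rows in {} ms", rowCount, elapsedMillis);
    }
}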

However, adopting SLF4J had its own pain point: if you used Spring (or Hibernate, or any other library that depended on Commons Logging), Maven would add Commons Logging to your build as a transitive dependency, where it might take precedence over the “jcl-over-slf4j” bridge from SLF4J (it all depended on the order in which JARs were given to the classloader). A key feature of Maven POMs from this era is multiple <exclusions> blocks to prevent such transitive dependencies.

So here we are in 2020, and the logging frameworks scene is more complex than ever:

  • Log4J 1.x is still used by many projects, even though it was officially end-of-lifed in 2015. One of its most useful features doesn't work under Java 9 (and, I presume, later versions), so its popularity may fade (although it seems that many people, particularly those using OpenJDK, are quite happy with Java 8).
  • SLF4J/Logback is still used by many developers (including myself), even though new releases seem to have stalled at the 1.3.0-alpha stage (after 25 years of writing logging frameworks, I'm guessing Ceki is in need of a break).
  • Log4J 2.x provides “bridge” JARs that let people use Commons Logging and SLF4J as their API, with Log4J2 as the back-end.
  • Commons Logging still exists, but hasn't seen a release since 2014. Nor has its list of supported frameworks changed: Log4J 1.x, JUL, and Avalon LogKit.

Perhaps counter-intuitively, even with all of these changes, Commons Logging is still used by many libraries. However, it's become less visible. Spring Framework, for example, implements the API internally; as an application developer, you no longer need to explicitly exclude the JAR. And if you use Spring Boot, its 3,000+ line dependency-management POM will explicitly exclude Commons Logging from the libraries that use it.

If you're developing a library, I think that Commons Logging is still the best choice for internal logging. It provides a consistent interface, and it's reasonable to expect that the consumers of your library already have the bridge JARs that they need (which might mean the internal implementation in Spring Framework). But there are a few best practices to keep your users from cursing you:

  • Mark your dependency as provided. This tells Maven (or Gradle, or any of the other tools that follow the Maven standard) not to resolve the transitive dependency; it will rely instead on an explicitly-referenced JAR to provide the necessary classes (see the POM sketch after this list).
  • Ensure that you don't establish a transitive dependency via a package that you depend on, like HTTP Components. Take the time to look at your entire dependency tree (using mvn dependency:tree or equivalent), and add an exclusion if anything tries to pull in Commons Logging.
  • Don't implement your own logging facade. It's tempting, I know: you want to protect the people that haven't configured logging into their application. And it seems easy: two classes (a log factory and a logger), with some reflection to pick an appropriate back end. But there's a lot of room for error. If you get it wrong, you'll introduce subtle bugs and performance issues, and your experienced users will curse you (and look for an alternative library). And if you get it right, you'll find that you've re-implemented Commons Logging. And good or bad, it won't actually help inexperienced users: they should learn to use logging rather than deploy a black box and cross their fingers.
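
To make the first two items concrete, here's what the relevant POM fragments might look like; the version numbers, and httpclient as the example source of a transitive dependency, are my own choices.

<dependencies>
    <!-- your library's logging API: provided, so it isn't dragged into consumers' builds -->
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.2</version>
        <scope>provided</scope>
    </dependency>

    <!-- a dependency that would otherwise pull in Commons Logging transitively -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.5.13</version>
        <exclusions>
            <exclusion>
                <groupId>commons-logging</groupId>
                <artifactId>commons-logging</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
</dependencies>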

Bottom line: if you're writing an application, use whatever logging framework you want. I have a strong preference for SLF4J/Logback, but Log4J 2.x does have some features that it's missing. However, if you're implementing a library, stick with Commons Logging. In almost every case it will Just Work™.


* Open Source Archaeology can be a fascinating experience. It's next to impossible to find a source-code repository that gives a full history for older projects: they typically started on CVS, then moved to SVN, and are now on Git. In some cases they also moved between hosting providers (although it looks like Hibernate is still actively maintained on SourceForge, which makes me feel better about a couple of my older projects). Each move lost information such as version tags (assuming they ever existed).

Maven Central is also less helpful than expected, because many projects changed their group or artifact IDs over their lifetime (plus, who wants to dig through 55 pages of org.springframework JAR listings). And a lot of older versions are “retconned”: they were imported into the repository long after release, with a made-up, minimal POM.

Fortunately, most of the larger projects maintain their own archives, so you can download a particular version and unpack it to see what's there. And if you're looking for dependencies, you can pick a likely class and run javap -c to disassemble it and then look at variable declarations. It's a lot of work, and some might call it obsessive. But that's how I learned that Spring Framework 1.0 didn't log at all, while 1.2 used Commons Logging.

** They're also very vocal about it. I wrote an article on logging techniques in 2008 that referenced Commons Logging (because in my world it wasn't an issue), and constantly received comments telling me that it was horrible and should never be used. I eventually updated the article to reference SLF4J, primarily because it was focused on application logging, but also to keep the critics quiet.