Tuesday, March 13, 2018

Simplifying AWS Application Deployment with CloudFormation

Application deployment is hard. Well, maybe not so hard when you only have one or two applications, because you can manually provision a server and copy the deployment package onto it. And maybe not when you're at the scale of Google, because you had to solve the hard problems to get there. But at the scale of a dozen or so applications, deployed onto a couple of dozen EC2 instances (ie, a typical micro-service architecture), you might end up in a state where you're spending far too much time deploying, and far too little time developing.

There are many tools that try to solve the problem from different angles: Chef and Puppet for configuring machines, Terraform to bring those (virtual) machines into existence. Ansible to do both. And Docker, which holds the promise that your deployment package can contain everything needed to run the application (but which, in my experience, drastically limits your ability to look inside that application while it's running).

This post examines my experience using CloudFormation and CFNDSL to manage the deployment of several dozen Java applications and supporting infrastructure onto a hundred or so EC2 instances. Each section is a lesson learned.

Generate Your Templates

I think the first thing that everyone realizes about CloudFormation is that its templates are unmaintainable. A simple application deployment with auto-scaling group, alarms, and elastic load balancer needs over 300 lines of pretty-printed JSON. It's not (just) that JSON is verbose: a YAML version is still north of 100 lines. The real problem is that a CloudFormation template has to be explicit about every component that goes into the deployment, from EC2 instance type to how long the load balancer should wait before verifying that an instance is healthy.

For a single application this isn't too bad: once you understand what goes into the various resources, you can make changes fairly easily. But that ability diminishes rapidly once you start adding deployments to the file, even if you're careful about naming and ordering. If you want to, for example, change the maximum number of instances for an application, it's not enough to search for MaxSize: you also need to verify that you're in the right resource definition.

The solution is to write a program to generate the templates, leveraging the DRY (Don't Repeat Yourself) principle. Write one function to generate an auto-scaling group (or whatever) and call that for all of your deployments. I prefer an interactive language such as Ruby for developing such programs because of its quick turnaround time, and was happy to be introduced to CFNDSL, but you can find tools in almost any language.
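Here's a minimal sketch of that idea in plain Ruby (not CFNDSL itself). The helper names, the logical-ID scheme, the two example deployments, and the `SubnetIds` parameter are all hypothetical, and the launch configurations it refers to are assumed to be generated elsewhere, so the output is a template fragment rather than something you could deploy as-is.

```ruby
require 'json'

# Hypothetical helper: builds one auto-scaling group resource, keyed by a
# logical ID derived from the deployment name. The referenced launch
# configuration and SubnetIds parameter are assumed to exist elsewhere.
def auto_scaling_group(name, min_size:, max_size:)
  {
    "#{name}AutoScalingGroup" => {
      'Type' => 'AWS::AutoScaling::AutoScalingGroup',
      'Properties' => {
        'LaunchConfigurationName' => { 'Ref' => "#{name}LaunchConfig" },
        'MinSize' => min_size.to_s,
        'MaxSize' => max_size.to_s,
        'VPCZoneIdentifier' => { 'Ref' => 'SubnetIds' }
      }
    }
  }
end

# Illustrative deployments; the names and sizes are made up.
DEPLOYMENTS = [
  { name: 'OrderService', min_size: 2, max_size: 6 },
  { name: 'BillingService', min_size: 1, max_size: 3 }
].freeze

# Calls the same helper for every deployment and collects the results
# under a single Resources section.
def build_template(deployments)
  resources = deployments.reduce({}) do |acc, d|
    acc.merge(auto_scaling_group(d[:name], min_size: d[:min_size], max_size: d[:max_size]))
  end
  { 'AWSTemplateFormatVersion' => '2010-09-09', 'Resources' => resources }
end

puts JSON.pretty_generate(build_template(DEPLOYMENTS))
```

Changing the maximum size for one application now means editing one entry in the deployment list, rather than hunting for the right MaxSize in the generated output.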

Hide Complexity

CloudFormation templates are hundreds of lines long because they have to specify every detail about a deployment. However, as I learned, real-world deployments depend on only a few parameters; most of the configuration can be defaulted. Here are the items that I used to configure my deployments:

  • A name, to differentiate deployments. This is used to generate CloudFormation logical IDs and export names, so must use a limited character set.
  • A “friendly” name, which is stored in Name tags for all of the resources created by the stack.
  • A pointer to the application's deployment bundle. We used Maven, so this was a group-artifact-version specification.
  • For auto-scaled deployments, the minimum and maximum number of instances and scaling configuration.
  • For deployments that included an Elastic Load Balancer, the ports that were to be exposed, the protocol they would listen to, and the destination port (we used ELB-Classic almost exclusively; one deployment used an Application Load Balancer, which requires more configuration).
  • Any “notification” alarms: for us these were based on passing a maximum CPU level, missing heartbeats on CloudWatch, or excessive messages sitting in a queue. For each of those, the actual configuration amounted to two or three pieces of data (eg: queue name and age of oldest message).
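A per-deployment configuration along these lines might be sketched as a small hash merged over a set of defaults. Everything here is an assumption for illustration: the key names, the default values, and the example Maven coordinates. The name check uses alphanumerics only, since that is what CloudFormation accepts for logical IDs.

```ruby
# Hypothetical defaults for everything a deployment doesn't specify.
# None of these values come from the deployments described in this post.
DEFAULTS = {
  instance_type: 't2.medium',
  min_instances: 1,
  max_instances: 2,
  load_balancer: nil, # nil means "no load balancer for this deployment"
  alarms: []
}.freeze

# Settings that have no sensible default.
REQUIRED = %i[name friendly_name bundle].freeze

# Validates the handful of required settings, then fills in the rest.
def deployment_config(overrides)
  missing = REQUIRED - overrides.keys
  raise ArgumentError, "missing settings: #{missing.join(', ')}" unless missing.empty?

  # The name becomes part of logical IDs, so restrict it to alphanumerics.
  unless overrides[:name].match?(/\A[A-Za-z0-9]+\z/)
    raise ArgumentError, "name must be alphanumeric: #{overrides[:name]}"
  end

  DEFAULTS.merge(overrides)
end

config = deployment_config(
  name: 'OrderService',
  friendly_name: 'Order Service',
  bundle: 'com.example:order-service:1.2.3',
  max_instances: 6
)
puts config
```

The template generator then reads only from the merged configuration, so a deployment's definition stays a few lines long no matter how many properties the resulting resources need.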

The simplicity of this configuration, however, means that there is complexity behind the scenes. The program that generated application templates was around 800 lines of Ruby code and DSL directives (although to be fair, some of that was because it had to support multiple deployment types, a violation of the next rule). But it's still much easier to add features: when I needed to add an alarm based on the oldest message in a queue, writing the code took maybe half an hour, followed by another half hour of testing, followed by an hour or so to update all of our deployments (most of which could have been parallelized).
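To give a feel for the size of such a feature, here is a rough sketch of what a queue-age alarm generator could look like in plain Ruby. It is not the original code: the helper name, logical-ID scheme, five-minute period, and SNS topic ARN are assumptions. The namespace, metric, and dimension names are the ones CloudWatch uses for SQS.

```ruby
require 'json'

# Hypothetical helper: builds a CloudWatch alarm that fires when the oldest
# message in an SQS queue exceeds a maximum age. The caller supplies only
# the queue name, the age limit, and where to send the notification.
def queue_age_alarm(name, queue_name:, max_age_seconds:, topic_arn:)
  {
    "#{name}QueueAgeAlarm" => {
      'Type' => 'AWS::CloudWatch::Alarm',
      'Properties' => {
        'AlarmDescription' => "Oldest message in #{queue_name} is older than #{max_age_seconds} seconds",
        'Namespace' => 'AWS/SQS',
        'MetricName' => 'ApproximateAgeOfOldestMessage',
        'Dimensions' => [{ 'Name' => 'QueueName', 'Value' => queue_name }],
        'Statistic' => 'Maximum',
        'Period' => 300,          # assumed evaluation window, in seconds
        'EvaluationPeriods' => 1,
        'Threshold' => max_age_seconds,
        'ComparisonOperator' => 'GreaterThanThreshold',
        'AlarmActions' => [topic_arn]
      }
    }
  }
end

# Example values are made up, including the account ID in the ARN.
alarm = queue_age_alarm(
  'OrderService',
  queue_name: 'orders',
  max_age_seconds: 600,
  topic_arn: 'arn:aws:sns:us-east-1:123456789012:ops-alerts'
)
puts JSON.pretty_generate(alarm)
```

The deployment's configuration carries just the queue name and maximum age; the rest is defaulted inside the generator.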

Maintain Consistency

Manual deployments tend to be unique snowflakes, with different software packages, different directories for deployment and configuration, and perhaps different logins. This is fine as long as everybody on your team knows the details of every deployment. But that knowledge quickly breaks down as the number of deployments increases. And sooner or later you'll be the person trying to diagnose a problem in a server deployed by someone else, and have no idea where to even start.

There's a strong temptation, when you're generating templates programmatically, to support different deployment types. This temptation is especially strong if you're trying to support existing deployments. Resist this temptation, as it just makes your template generator more complex.

There are many practices to help maintain consistency. Two that I try to follow are the Twelve-Factor methodology, particularly regarding configuration, and the Linux filesystem hierarchy. It really is half the battle to know that you'll always find an application's configuration in `/etc/opt/mycompany` and its logs at `/var/log/mycompany`.

Use a Pre-Built AMI

One of the best ways that I know of to maintain consistency is to pre-build your AMIs with all of the tools and support software that you need. This is also one of the best ways to improve your startup times, versus running Chef or similar tools once the instance starts.

That isn't to say that you can't or shouldn't use tools to create the AMI in the first place. You should definitely do that, with your build scripts checked into source control. This is especially important because you'll need to re-create those AMIs on a regular basis in order to get the latest patches.

Provide Debugging Tools

I've occasionally said that I can track down any Java bug with a thread dump and a heap dump. That's not true, but the converse is: without being able to see what's happening inside a running JVM, you have almost no chance of figuring out problems (especially if your logging isn't that great). But, amazingly, the default Java installation for Amazon Linux is the JRE, not the JDK, so you don't have debugging tools available.
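For reference, the two dumps are usually taken with `jstack` and `jmap`, both of which ship with the JDK rather than the JRE. The sketch below just assembles the command lines in Ruby; the helper names, the process ID, and the output path are hypothetical.

```ruby
# Hypothetical helpers that build the argument lists for the two JDK tools.
# Returning arrays lets the caller pass them to Kernel#system without
# going through a shell.

# jstack prints a thread dump for the given JVM process ID.
def thread_dump_command(pid)
  ['jstack', pid.to_s]
end

# jmap writes a binary heap dump of live objects to the given file.
def heap_dump_command(pid, file)
  ['jmap', "-dump:live,format=b,file=#{file}", pid.to_s]
end

# Example values only: 1234 stands in for the real JVM process ID.
puts thread_dump_command(1234).join(' ')
puts heap_dump_command(1234, '/tmp/heap.hprof').join(' ')
```

To actually run one of them, you could call something like `system(*thread_dump_command(pid))` as the same user that owns the JVM process.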

If you build your own AMI, this is your opportunity to include whatever tools you think you might need when production goes down late at night. Don't skimp.