Monday, November 9, 2020

Building Modular Inline Policies with Terraform

Modules are the primary mechanism for reusability in Terraform. They're especially useful in cases where you need to stamp out a bunch of related resources multiple times. For example, an SQS queue along with its dead-letter queue and “least privilege” access policies.

The problem with creating a separate managed policy for each module invocation is that role/policy attachments are a limited resource: by default, you're limited to ten. You can increase that by opening a support ticket, but the limit is still low (20 unless you can show special circumstances), and it has to be done in every account where you'll create the role.

Alternatively, you can construct a combined policy in the invoking module as long as the child module outputs the queue ARNs. However, this eliminates much of the benefit of reusability, as you have to explicitly build the combined policy everywhere that you use the queue module. It also means that the person writing the combined policy has to understand the IAM permissions that are appropriate to a queue consumer or producer. At that point, why not just explicitly create your resources?

In this post, I look at a different alternative: the queue module exports the JSON for the producer and consumer statements. The invoking module then combines the statements for each of its queues into a single applicaton policy.

Wait, doesn't Terraform allow you to combine policies?

Not in the way that I want.

The iam_policy_document data source allows you to “layer” policies using its source_json and override_json arguments, but these only accept a single document. Which means that you can build a complex policy document step by step, but not as a single data source. In a real-world configuration, I think that would quickly turn into an unmaintainable mess.

The policy documents guide doesn't give any other alternatives: it recommends the data source or a multi-line “heredc”. Surprisingly, it doesn't mention using jsonencode() at all.

Creating an SQS queue module

A modern web-application is a complex beast: it often consists of a (relatively) simple API server, with background processes to handle long-running tasks such as sending notifications, rendering images, or generating reports. And linking these together, at least on AWS, SQS queues.

An SQS queue, although it seems simple, actually has a lot of hidden complexity:

  • Each queue should have a companion dead-letter queue, along with a "redrive policy" that determines how many times a consumer can fail to process a message before it goes to the dead-letter queue.
  • How long a does a consumer have to process a message before it's retried (the visibility timeout).
  • How long can a message stay in the queue without being picked up (the retention period, used to prevent queues from growing without bound when there are no/insufficient consumers).
  • You should also have “least-privilege” policies for the queue's producers and consumers.

This is a perfect use for Terraform modules: rather than repeat all of the resource definitions (as you have to do with CloudFormation), you simply create a module and reference it from your main configuration:

module "notifications_queue" {
  source = "./modules/sqs"
  queue_name = "Notifications"
}

module "reports_queue" {
  source = "./modules/sqs"
  queue_name = "ReportGeneration"
  visibility_timeout = 600
  receive_count = 1
}

The core of this module are the following two resource definitions, for the primary queue and its companion dead-letter queue (every primary should have a DLQ; trust me, you'll want it sooner or later):

resource "aws_sqs_queue" "base_queue" {
  name                        = var.queue_name
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
  redrive_policy              = jsonencode({
                                    "deadLetterTargetArn" = aws_sqs_queue.deadletter_queue.arn,
                                    "maxReceiveCount" = var.retry_count
                                })
}

resource "aws_sqs_queue" "deadletter_queue" {
  name                        = "${var.queue_name}-DLQ"
  message_retention_seconds   = var.retention_period
  visibility_timeout_seconds  = var.visibility_timeout
}

As you can see, all configuration is via module variables. Rather than show the entire variables.tf file, I'll use just two to give you a sense of my philosophy. The first, of course, is queue_name: this is the name of the queue, and can't be defaulted. The second, retention_period, is something that most consumers don't care about, but some might. So I picked a valid default (in my experience most messages have no value after 24 hours), and the invoking module can override if needed.

variable "queue_name" {
  description = "The name of the queue. Used as a prefix for related resource names."
  type = string
}

variable "retention_period" {
  description = "Time (in seconds) that messages will remain in queue before being purged"
  type = number
  default = 86400
}

So far, nothing unusual. But what I'm not showing (yet) are the policy documents that let your applications read and write the queue.

If you were just building a proof-of-concept, you might consider using the pre-existing AmazonSQSFullAccess managed policy. However, that policy grants all operations on all queues. Not just read and write, but the ability to create and delete them as well. For a real-world application, you want a “least privilege” policy that only allows actions on the queue that you've just created. And really, you want separate policies for the programs that write to the queue and those that read from the queue. To support this, my module also creates two managed policies; here's the producer policy:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-consumer_policy"
  description = "Attach this policy to producers for ${var.queue_name} SQS queue"
  policy      = data.aws_iam_policy_document.producer_policy.json
}

data "aws_iam_policy_document" "producer_policy" {
  statement {
    actions = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ]
    resources = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Simple, but as I said above, there's a limit on the number of policies that you can attach to a single role. In a real-world micro-service application, you might have a half-dozen queues. Or more.

Extracting the Policy Statement

I originally created this example for a user group talk, so while it was based on real-world experience it wasn't actually used in the real world. I recently had to implement an SQS module for the real world, and decided to tackle the “too many policies” problem (ironically, I don't think it will be a problem for that specific use case).

After reviewing the various documents linked above, I realized that none of them mentioned the jsonencode function. This seemed like a rather strange omission, especially given the inclusion of approaches based on string literals (although not nearly as bad as XML, JSON still has its share of gotchas for those who wish to write it using strings).

As it turns out, creating policy documents with jsonencode() instead of using a data source works quite well:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [{
                    "Effect" = "Allow",
                    "Action" = [
                      "sqs:GetQueueAttributes",
                      "sqs:GetQueueUrl",
                      "sqs:SendMessage",
                      "sqs:SendMessageBatch"
                    ],
                    "Resource" = [
                      aws_sqs_queue.base_queue.arn
                    ]
                  }]
                })
}

That was a start, but it wasn't really where I wanted to go. The next step was to move the JSON into a local:

locals {
  producer_policy_statement = {
    "Effect" = "Allow",
    "Action" = [
      "sqs:GetQueueAttributes",
      "sqs:GetQueueUrl",
      "sqs:SendMessage",
      "sqs:SendMessageBatch"
    ],
    "Resource" = [
      aws_sqs_queue.base_queue.arn
    ]
  }
}

Now I could make the policy reference the local, rather than either inline JSON or a datasource:

resource "aws_iam_policy" "producer_policy" {
  name        = "SQS-${var.queue_name}-${data.aws_region.current.name}-producer"
  description = "Attach this policy to producers for ${var.queue_name}"
  policy      = jsonencode({
                  "Version" = "2012-10-17",
                  "Statement" = [local.producer_policy_statement]
                })
}

Constructing an application role

That's all well and good, but how does it provide any benefit over the former self-contained module? The answer is that you can reference the JSON in outputs.tf:

output "producer_policy_statement" {
  value = local.producer_policy_statement
}

And that enables the following change in the invoking module: rather than a bunch of role/policy attachments, the application role combines these snippets of JSON into an inline policy.

resource "aws_iam_role" "application_role" {
  name = "ApplicationRole"

  assume_role_policy = # you saw this above, no need to repeat
}

resource "aws_iam_role_policy" "application_role_queue_policy" {
  name = "application_role_queue_policy"
  role = aws_iam_role.application_role.id

  policy = jsonencode({
    "Version"   = "2012-10-17",
    "Statement" = [
        module.notifications_queue.producer_policy_statement,
        module.rendering_queue.producer_policy_statement,
        module.reports_queue.producer_policy_statement
    ]
  })
}

Now, rather than counting against your quota of policy attachments, these statements count against your quote for inline policy size. Which, for role policies, is 10,240 characters. If that's too low, you can create managed policies composed of individual statements, for a total of nearly 70k per role (10 policies at 6k each, and one inline policy). If your application needs a role that's more complex than that, you might want to think about breaking it into micro-services.

But in my opinion, the best feature of this implmentation is that it keeps the module as the arbiter of its resource's permissions. Which means there's less of a risk that a poorly crafted policy will open your system to an attacker.

If you'd like to try this out, I've saved both the original and the modular versions as GitHub Gists. One thing to note: Gists don't have any conception of sub-directories, and don't allow slashes in filenames, so I've used dashes to indicate directory boundaries (this should be clear once you look at the files).

No comments: