blog.kdgregory.com

Sunday, February 18, 2018

Cleaning up AWS ElasticSearch indexes with Lambda

The Amazon ElasticSearch Service is a great solution for in-house logging: it's an easily-configurable search engine with a built-in Kibana service to explore your log messages. If you use my Log4J AWS appenders to route application logging into a Kinesis Data Stream and thence through Kinesis Firehose to ElasticSearch, you can have a complete logging framework up and running within a few hours.

But there's one problem: there's no automated purge of old messages. True, you can use Curator, which is the standard management tool for ElasticSearch, but you need to run it from somewhere. That means your personal PC, or an EC2 instance brought up just for system management, either of which is a step away from the "managed for you" promise of AWS.

This post presents an alternative: invoking the ElasticSearch Indices API via an AWS Lambda function that's triggered by a CloudWatch Scheduled Event.

The Lambda Function

Warning: this code is destructive. While I do not believe that it contains any bugs, I do not make a warranty of any kind, and do not accept any responsibility for your deployment. I strongly recommend that you fully understand and test it before using it in a production environment. Use at your own risk.

OK, with that out of the way, the Lambda function is quite simple: it reads the list of indexes from the ElasticSearch cluster, discards those with names that don't match a configured prefix, sorts the rest (relying on the datestamp that Firehose appends to each index), and deletes all but the desired number. Both retrieve and delete are HTTP requests, which is why you do not want to expose your cluster to the open Internet.

Most of the people that I've talked with protect their ElasticSearch cluster with an IP-address-based access policy. Unfortunately, such a policy blocks Lambda, which receives a new IP for every instantiation. You can work around that restriction with signed requests, but that makes the deployment significantly more complex.

I've provided one version of the code that uses signed requests and one that doesn't; pick the one that's most relevant to your environment.

Version 1: unsigned requests

If you have a VPC with a NAT, and your ElasticSearch cluster allows unrestricted requests from the NAT, then this is the version for you. It would also work if your ElasticSearch cluster was running inside the VPC, but as of the time of this writing Firehose can't talk to an in-VPC ElasticSearch cluster, so that case is irrelevant (at least for a Firehose-based logging pipeline).

Under these conditions, we can use the HTTP connection code that's built into the Python runtime. And because of that, we can enter the script directly into the AWS Console.

import http.client
import json
import os

def lambda_handler(event, context):
    endpoint = os.environ['ELASTIC_SEARCH_ENDPOINT']
    numIndexesToKeep = int(os.environ['NUM_INDEXES_TO_KEEP'])
    indexPrefix = os.environ['INDEX_PREFIX']
    
    cxt = http.client.HTTPConnection(endpoint)

    cxt.request('GET', '/*')
    indexResponse = cxt.getresponse()
    indexResponseBody = indexResponse.read().decode("utf-8")
    if indexResponse.status != 200:
        raise Exception('failed to retrieve indexes: ' + indexResponseBody)

    indexData = json.loads(indexResponseBody)
    indexNames = sorted([x for x in indexData.keys() if x.startswith(indexPrefix)])
    indexesToDelete = indexNames[0 : max(0, len(indexNames) - numIndexesToKeep)]

    for idx in indexesToDelete:
        cxt.request('DELETE', "/" + idx)
        deleteResponse = cxt.getresponse()
        deleteResponseBody = deleteResponse.read().decode("utf-8")
        if deleteResponse.status == 200:
            print("deleted " + idx)
        else:
            raise Exception("failed to delete " + idx + ": " + deleteResponseBody)

I'm going to assume that you're comfortable creating a Lambda function (if not, go through the tutorial). Here are the key points for setting up your function; an equivalent CLI sketch follows the list:

  • Pick the "Python 3.6" runtime.
  • Configure the environment variables described below.
  • You can leave the default memory allotment, but increase the timeout to 30 seconds (HTTP calls may take a long time).
  • You will need a role that has the AWS-provided "AWSLambdaVPCAccessExecutionRole" policy. I recommend creating a new role just for the cleanup Lambdas.
  • Ensure that the Lambda is configured to run inside the VPC, on a private subnet that routes outbound requests through the NAT.
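
If you prefer to script the setup rather than click through the console, something like the following AWS CLI commands should work. The CLI requires a deployment package even for a single-file function, so this assumes you've saved the code above as lambda_function.py; the function name, role ARN, subnet and security group IDs, and endpoint value are all placeholders for your own:

zip /tmp/escleanup.zip lambda_function.py

aws lambda create-function \
    --function-name es-cleanup \
    --runtime python3.6 \
    --handler lambda_function.lambda_handler \
    --role arn:aws:iam::123456789012:role/ESCleanup \
    --timeout 30 \
    --zip-file fileb:///tmp/escleanup.zip \
    --vpc-config SubnetIds=subnet-11111111,SecurityGroupIds=sg-22222222 \
    --environment "Variables={ELASTIC_SEARCH_ENDPOINT=search-example-redacted.us-east-1.es.amazonaws.com,NUM_INDEXES_TO_KEEP=31,INDEX_PREFIX=example-}"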

Version 2: signed requests

If your ElasticSearch cluster limits access by IP and does not permit access from a NAT, you'll need to use this version. It makes arbitrary signed HTTP requests, a feature that is not currently supported by the AWS Python SDK (Boto3). So instead, I use the aws-requests-auth and requests libraries, which means that we have to create a deployment package rather than simply pasting the source code into the AWS Console. And we have to ensure that the Lambda function has permission to update the ElasticSearch cluster. As I said, significantly more complex.

To start, you'll need to create a directory and install dependencies (I'm assuming that you're working on Linux and have Python 3.6 and PIP already installed).

mkdir escleanup

cd escleanup

pip install aws-requests-auth -t `pwd`

Next, the code. Save this in the file lambda_function.py:

import json
import os
import requests

from aws_requests_auth.aws_auth import AWSRequestsAuth

def lambda_handler(event, context):
    endpoint = os.environ['ELASTIC_SEARCH_ENDPOINT']
    numIndexesToKeep = int(os.environ['NUM_INDEXES_TO_KEEP'])
    indexPrefix = os.environ['INDEX_PREFIX']
    
    auth = AWSRequestsAuth(aws_access_key=os.environ['AWS_ACCESS_KEY_ID'],
                           aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'],
                           aws_token=os.environ['AWS_SESSION_TOKEN'],
                           aws_region=os.environ['AWS_REGION'],
                           aws_service='es',
                           aws_host=endpoint)

    indexResponse = requests.get('https://' + endpoint + '/*', auth=auth)
    if indexResponse.status_code != 200:
        raise Exception('failed to retrieve indexes: ' + indexResponse.text)
        
    indexData = indexResponse.json()
    indexNames = sorted([x for x in indexData.keys() if x.startswith(indexPrefix)])
    indexesToDelete = indexNames[0 : max(0, len(indexNames) - numIndexesToKeep)]
    
    for idx in indexesToDelete:
        deleteResponse = requests.delete('https://' + endpoint + '/' + idx, auth=auth)
        if deleteResponse.status_code == 200:
            print("deleted " + idx)
        else:
            raise Exception("failed to delete " + idx + ": " + deleteResponse.text)

This has to be turned into a ZIP file, along with all of its dependencies:

zip -r /tmp/escleanup.zip .

Now you can create your Lambda function. As above, we use the "Python 3.6" runtime, and start with the default Lambda execution role. On the second page of the creation wizard you will upload the zipfile and set environment variables as below (you can also decide to run in a public subnet of your VPC, but it's fine to leave the Lambda outside your VPC).
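
If you later change the code, you don't have to repeat the upload through the console; rebuilding the zipfile and pushing it from the command line should work (assuming you named the function es-cleanup-signed):

zip -r /tmp/escleanup.zip .

aws lambda update-function-code --function-name es-cleanup-signed --zip-file fileb:///tmp/escleanup.zip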

The big change with this version is that you should create a new role rather than reuse an existing one, because we're going to grant permissions to that role in the ElasticSearch cluster. If you've configured your cluster to allow IP-based access, then it probably has an access policy that looks like this (only with more IPs):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/example/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "54.85.66.156",
            "52.3.98.34"
          ]
        }
      }
    }
  ]
}

You will need to add a statement that allows access from the role (replacing the ARNs shown here with those for your role and ElasticSearch domain):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:role/ESCleanup"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/example/*"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "*"
      },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/example/*",
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "54.85.66.156",
            "52.3.98.34"
          ]
        }
      }
    }
  ]
}
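
You can paste the updated policy into the ElasticSearch console, or apply it from the command line; something like this should work, assuming you've saved the policy as policy.json and your domain is named example:

aws es update-elasticsearch-domain-config --domain-name example --access-policies file://policy.json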

Configuration

Like all good Lambda functions, these are configured via environment variables (note: the signed version also uses variables provided by Lambda itself); a CLI sketch for updating them follows the list:

  • ELASTIC_SEARCH_ENDPOINT is the endpoint of your cluster, copied from the “Overview” tab on the AWS console.
  • NUM_INDEXES_TO_KEEP is the number of indexes that you want to keep. This is easier than date arithmetic: rather than “keep the last month,” you keep the last 31 days.
  • INDEX_PREFIX identifies the indexes that should be considered for deletion: an ElasticSearch cluster may be used for multiple purposes, and you don't want to destroy another project's data (or the internal .kibana index). Assuming you're populating your ElasticSearch cluster via Kinesis Firehose, use the IndexName from the firehose destination configuration.
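
If you need to change these values after creating the function, you can do so from the console or with a call along these lines (the function name and values are placeholders; note that this call replaces the entire set of variables, so include all three):

aws lambda update-function-configuration \
    --function-name es-cleanup \
    --environment "Variables={ELASTIC_SEARCH_ENDPOINT=search-example-redacted.us-east-1.es.amazonaws.com,NUM_INDEXES_TO_KEEP=31,INDEX_PREFIX=example-}"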

CloudWatch Event Trigger

So you've got the function, but how do you invoke it? This is where CloudWatch Events comes in: in addition to tracking changes to your AWS environment, it can generate scheduled events, which can be used as a trigger for our Lambda.

Use the AWS Console to create a new rule and assign it to your Lambda on a single page. I recommend using a cron expression rather than a fixed rate, so that you can ensure that the cleanup happens when there's not much logging; here's an example that runs at 1 AM EST:

0 6 * * ? *
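
If you'd rather script the trigger, something along these lines should work (the rule name, function name, account ID, and region are placeholders):

aws events put-rule --name es-cleanup-schedule --schedule-expression "cron(0 6 * * ? *)"

aws lambda add-permission --function-name es-cleanup --statement-id es-cleanup-schedule \
    --action lambda:InvokeFunction --principal events.amazonaws.com \
    --source-arn arn:aws:events:us-east-1:123456789012:rule/es-cleanup-schedule

aws events put-targets --rule es-cleanup-schedule \
    --targets "Id=1,Arn=arn:aws:lambda:us-east-1:123456789012:function:es-cleanup"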

Testing

As I said earlier, you should test this code before moving to production. If you already have a test cluster that you don't mind playing with, that's the best solution. Alternatively, you could create a copy of your production cluster, although this may be impractical if you have a multi-terabyte cluster (and the instructions take a few tries to get right — I got very good at migrating clusters while writing this post).

Another alternative is to create a small test cluster (a single t2.small.elasticsearch instance, which costs under 4 cents an hour) and manually create empty indexes using curl:

curl -XPUT 'https://search-example-redacted.us-east-1.es.amazonaws.com/example-18-02-18'
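
To create several days' worth of empty indexes at once, a small loop does the trick (this uses the same hypothetical endpoint and index prefix as above):

for d in 18-02-14 18-02-15 18-02-16 18-02-17 18-02-18; do
    curl -XPUT "https://search-example-redacted.us-east-1.es.amazonaws.com/example-$d"
done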

As long as the indexes that you create can be sorted and follow a standard naming scheme, the Lambda function will do its thing. You can also adjust the parameters of the CloudWatch Event to perform more frequent deletions, and watch the function execute.

Licensing

Normally I don't mention licensing for code snippets, but I'm planning to use this in my day job and my employment contract requires an explicit license for any not-for-hire code that I use. So, it's licensed under the Apache License 2.0. That said, I really don't want to put a dozen lines of license boilerplate on a 30-line program, and the license says that it can be "attached to" the work. If you use the code, please copy the following boilerplate:

Copyright 2018 Keith D Gregory

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Source and documentation is available at http://blog.kdgregory.com/2018/02/cleaning-up-aws-elasticsearch-indexes.html

Contains example code from https://github.com/DavidMuller/aws-requests-auth

Monday, November 20, 2017

NAT Instance versus NAT Gateway

One of the typical configurations for an Amazon Web Services (AWS) Virtual Private Cloud (VPC) is dividing it into public and private subnets. Your application servers and databases run in the private part of the VPC, inaccessible from the Internet. The only things living in the public part of the cloud are your load balancers and a bastion host.

The problem with this architecture is that servers on the private subnet are truly isolated: they won't receive connections from the open Internet, but neither can they make such connections. So you won't be able to connect to an external data provider, load updates from an external source, or even connect to most of Amazon's services. To let your servers connect to the Internet (but still prevent the Internet from connecting to them) you need a NAT.

Amazon gives you not one, but two options for your NATting: NAT Gateways and NAT Instances. NAT Instances (I'll capitalize in this post) have been around since VPCs became available: they're simply EC2 instances with specially configured routing tables. NAT Gateways were introduced in October 2015; they are part of the VPC infrastructure, like the routers that let your subnets communicate with each other.
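
To make the difference concrete, here's roughly what each looks like from the CLI (all IDs are placeholders). A NAT Instance is an ordinary EC2 instance with source/destination checking turned off that becomes the target of the private subnet's default route; a NAT Gateway is created as its own resource and then referenced by that route:

# NAT Instance: disable source/dest check, then route the private subnet through it
aws ec2 modify-instance-attribute --instance-id i-0abc1234567890def --no-source-dest-check
aws ec2 create-route --route-table-id rtb-11111111 --destination-cidr-block 0.0.0.0/0 --instance-id i-0abc1234567890def

# NAT Gateway: create it in a public subnet with an Elastic IP, then route through it
aws ec2 create-nat-gateway --subnet-id subnet-22222222 --allocation-id eipalloc-33333333
aws ec2 create-route --route-table-id rtb-11111111 --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-44444444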

Amazon provides a comparison table to help you decide which you should pick. Reading through that table it seems like a no-brainer: pick the NAT Gateway. But in the real world the decision is a little more complex, which is no doubt why Amazon still maintains AMIs for NAT Instances. The rest of this post walks through some of the things that you should consider when configuring your network (although, if in doubt at the end, pick the NAT Gateway).

Do you even need a NAT?

If you're running a self-contained application consisting of web-servers, databases, and/or other home-grown infrastructure, you might not need a NAT. This will require some discipline on your part: for example, any non-AWS-supplied software has to be installed onto an AMI, which is then launched by your auto-scaling group(s). Same for updates: you can't patch an existing server, you must build a new AMI. But this is, arguably, how you should be deploying for the cloud anyway.

A bigger problem with this strategy is that it severely limits the AWS services that you can use, because they're also accessed via public IP addresses. So if you use SQS queues to connect worker instances, you'll need a NAT. Ditto for Kinesis, CloudWatch, and almost any other AWS service. There are two exceptions as of the time I'm writing: Amazon provides VPC endpoints for S3 and DynamoDB. These endpoints are essentially routing rules that direct traffic to those services through Amazon's network rather than over the Internet. I hope that Amazon makes more services available this way, but am not holding my breath: the S3 endpoint appeared in early 2015, but DynamoDB wasn't supported until late 2016.
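
Creating one of these gateway endpoints is a one-time operation; a sketch with the CLI, using placeholder VPC and route table IDs, looks like this:

aws ec2 create-vpc-endpoint --vpc-id vpc-11111111 \
    --service-name com.amazonaws.us-east-1.s3 \
    --route-table-ids rtb-22222222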

How much traffic will be going through the NAT?

As I said, NAT Gateways seem to be a no-brainer choice, so that's what I did when first moving our third-party integrations into the VPC. And then at the end of the first month, our CFO asked about this new, large charge on the AWS bill. The reason, as I discovered after looking at flow logs, was that we pushed about a terabyte of data through the NAT each day (which is way too much, but that's a project for another day).

Amazon's data transfer pricing rules are, in a word, Byzantine, and NAT Gateways add another layer to the model. But the bottom line is that you'll pay 4½¢ per gigabyte for traffic through the NAT. Which doesn't seem like much, but translates to $45/terabyte; if you're pushing a terabyte a day, it adds up quickly. By comparison, if you use a NAT Instance, you'll pay only the basic bandwidth charges, just as if your server had a public IP.

Beware, however, that one NAT Instance isn't enough: you'll want at least two for redundancy. And AWS will charge you for cross-AZ traffic within your VPC, so you'll probably want one per availability zone. But if you're pushing enough traffic, the cost of the NAT Instances will be less than the cost of a NAT Gateway. You can also adjust the instance size based on your traffic, although I wouldn't recommend one of the smaller T2 instances.

Are you willing to care for a pet?

If you're not familiar with the “cattle vs pets” metaphor, take a look at this slide deck. To summarize: you spend a lot of time and money caring for a family pet, but cattle barons turn their herds out to graze in the summer, round them up in the winter, and if one dies that's too bad for it. Or, as it applies to deployments, cloud applications don't get patched, they get terminated and rebuilt.

But you can't think of a NAT Instance that way. It is a critical part of your infrastructure, and if it ever goes down the machines on your private subnet(s) won't be able to talk to the Internet.

And unfortunately, there are a lot of reasons for the instance to go down.

One reason is that the underlying hardware suffers a problem. For this there is no warning, so you'll need a pager alarm to let you know when the NAT is down, and a plan for bringing it back up (typically by re-assigning an elastic IP to another instance). Along the same lines, AWS may schedule maintenance on an instance; the difference here is that you'll have days to plan or preempt the outage. And lastly, there will be times when you need to restart the NAT yourself, as a result of applying kernel-level patches.

Regardless of how the NAT goes out of service, it will take all of the open connections with it. Your applications must be written to handle network timeouts and reconnect, or to gracefully recover when shut down (in which case it's often easiest to just reboot all machines in the availability zone and let them establish new connections).

For us, the cost of running and caring for four NAT Instances remains less than the cost of the NAT Gateway. If we can bring down our traffic to third-party sites that decision will be re-evaluated.

Thursday, November 2, 2017

Ghost Traffic on AWS

We're reconfiguring some of our network endpoints at work, and yesterday, the day after Halloween, I see this traffic appear in our local access logs:

127.0.0.1 - - [01/Nov/2017:15:09:43 -0400] "GET /grave/Charles-Karlson/16427428 HTTP/1.1" 404 967

Line after line. Every few seconds someone — or something — is asking about another grave.

This environment was running under Amazon's Elastic Beanstalk service, which assigns public DNS addresses based on the environment name, and I figured we probably weren't the only people to come up with the name tracker-test-two. Which is true, and leads to one of the take-aways that I'll list below. But the truth, as revealed from the ELB access logs, was stranger:

2017-11-01T19:09:43.168404Z awseb-e-f-AWSEBLoa-1FG2EMDGMS77G 46.229.168.73:33060 10.2.187.173:80 0.000045 0.003369 0.00002 404 404 0 967 "GET https://hu.billiongraves.com:443/grave/Charles-Karlson/16427428 HTTP/1.1" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +http://www.semrush.com/bot.html)" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

That's a real site: if you go to it, you'll see an image of a gravestone and information about the person buried under that stone. Useful, I'm sure, to people who are tracking their ancestry. But why was the traffic going to our web server?

The answer is DNS caching: to reduce load on the domain name system, intermediate nameservers are allowed to cache the results of a DNS lookup. That generally works pretty well, and billiongraves.com has a 60 second timeout, which should have stopped traffic to that IP address fairly quickly (if you haven't already guessed, they also run on AWS servers, and recently reconfigured their environments). However, some programs and platforms do their own caching, and don't pay attention to the timeout — Java is particularly bad at this. So “sembot” is using its cached address for the hostname, and will continue to make those requests until restarted.
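
If you want to see what TTL a domain is advertising, dig will show it as the second field of each answer line (keep in mind that if your local resolver has the record cached, you'll see the remaining time rather than the configured value):

dig +nocmd +noall +answer billiongraves.com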

There was another interesting artifact in our logs, although not quite as spooky:

2017-11-02T11:49:26.343413Z awseb-e-h-AWSEBLoa-1SXOP8PJPZS8R 108.244.128.203:58237 10.230.25.177:80 0.000116 0.013641 0.000074 200 200 0 7520 "GET https://54.243.172.96:443/action/Login/view HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

This traffic was appearing every few seconds in the logs for our old server group. This isn't a DNS caching issue: it's hitting an explicit IP address. And it doesn't look like a script kiddie trying to find vulnerabilities in our servers: it's always the same endpoint, always from the same source IP (which is probably somewhere in the midwest), and happens every 60 seconds. As far as I can tell, it's a homegrown health-check program.

Which means that someone is going to panic when I shut down that environment later this morning.

To wrap up, here are some take-aways that you might find useful:

  1. Someone had that IP address before you. Expect unexpected traffic, and think about how you will react if it's excessive.
  2. Don't write your own health checks. If you're using an ELB, you can configure it to run a health check and raise an alarm when the health check fails.
  3. If you feel you must write your own health check, use a hostname rather than an explicit IP. I know you think you'll have that IP forever, but you probably won't.
  4. Pick an appropriate time-to-live for your DNS entries. The Internet Police might yell at me for suggesting 60 seconds, but it makes your life a lot easier if you need to transition environments. Interestingly, Google (which hosts this blog) uses 78 seconds.
  5. If you're using Elastic Beanstalk, use URLs that contain random text (the environment ID is a good choice). Remember that it's a shared namespace.
  6. If you're writing an application that repeatedly connects to remote hosts, understand how your platform deals with DNS caching, and try to honor the server's intentions.
  7. If you run billiongraves.com — or any site with similar content — don't change your IPs on Halloween.