Monday, November 20, 2017

NAT Instance versus NAT Gateway

One of the typical configurations for an Amazon Web Services (AWS) Virtual Private Cloud (VPC) is dividing it into public and private subnets. Your application servers and databases run in the private part of the VPC, inaccessible from the Internet. The only things living in the public part of the cloud are your load balancers and a bastion host.

The problem with this architecture is that servers on the private subnet are truly isolated: they won't receive connections form the open Internet, but neither can they make such connections. So you won't be able to connect to an external data provider, load updates from an external source, or even connect to most of Amazon's services. To let your servers connect to the Internet (but still prevent the Internet from connecting to them) you need a NAT.

Amazon gives you not one, but two options for your NATting: NAT Gateways and NAT Instances. NAT Instances (I'll capitalize in this post) have been around since VPCs became available: they're simply EC2 instances with specially configured routing tables. NAT Gateways were introduced in October 2015; they are part of the VPC infrastructure, like the routers that let your subnets communicate with each other.

Amazon provides a comparison table to help you decide which you should pick. Reading through that table it seems like a no-brainer: pick the NAT Gateway. But in the real world the decision is a little more complex, which is no doubt why Amazon still maintains AMIs for NAT Instances. The rest of this post walks through some of the things that you should consider when configuring your network (although, if in doubt at the end, pick the NAT Gateway).

Do you even need a NAT?

If you're running a self-contained application consisting of web-servers, databases, and/or other home-grown infrastructure, you might not need a NAT. This will require some discipline on your part: for example, any non-AWS-supplied software has to be installed onto an AMI, which is then launched by your auto-scaling group(s). Same for updates: you can't patch an existing server, you must build a new AMI. But this is, arguably, how you should be deploying for the cloud anyway.

A bigger problem with this strategy is that it severely limits the AWS services that you can use, because they're also accessed via public IP addresses. So if you use SQS queues to connect worker instances, you'll need a NAT. Ditto for Kinesis, CloudWatch, and almost any other AWS service. There are two exceptions as of the time I'm writing: Amazon provides VPC endpoints for S3 and DynamoDB. These endpoints are essentially routing rules that direct traffic to those services through Amazon's network rather than over the Internet. I hope that Amazon makes more services available this way, but am not holding my breath: the S3 endpoint appeared in early 2015, but DynamoDB wasn't supported until late 2016.

How much traffic will be going through the NAT?

As I said, NAT Gateways seem to be a no-brainer choice, so that's what I did when first moving our third-party integrations into the VPC. And then at the end of the first month, our CFO asked about this new, large charge on the AWS bill. The reason, as I discovered after looking at flow logs, was that we pushed about a terabyte of data through the NAT each day (which is way too much, but that's a project for another day).

Amazon's data transfer pricing rules are, in a word, Byzantine, and NAT Gateways add another layer to the model. But the bottom line is that you'll pay 4½¢ per gigabyte for traffic through the NAT. Which doesn't seem like much, but translates to $45/terabyte; if you're pushing a terabyte a day, it adds up quickly. By comparison, if you use a NAT Instance, you'll be paying the basic bandwidth charges, just like your server had a public IP.

Beware, however, that one NAT Instance isn't enough: you'll want at least two for redundancy. And AWS will charge you for cross-AZ traffic within your VPC, so you'll probably want one per availability zone. But if you're pushing enough traffic, the cost of the NAT Instances will be less than the cost of a NAT Gateway. You can also adjust the instance size based on your traffic, although I wouldn't recommend one of the smaller T2 instances.

Are you willing to care for a pet?

If you're not familiar with the “cattle vs pets” metaphor, take a look at this slide deck. To summarize: you spend a lot of time and money caring for a family pet, but cattle barons turn their herds out to graze in the summer, round them up in the winter, and if one dies that's too bad for it. Or, as it applies to deployments, cloud applications don't get patched, they get terminated and rebuilt.

But you can't think of a NAT Instance that way. It is a critical part of your infrastructure, and if it ever goes down the machines on your private subnet(s) won't be able to talk to the Internet.

And unfortunately, there are a lot of reasons for the instance to go down.

One reason is that the underlying hardware suffers a problem. For this there is no warning, so you'll need a pager alarm to let you know when the NAT is down, and a plan for bringing it back up (typically by re-assigning an elastic IP to another instance). Along the same lines, AWS may schedule maintenance on an instance; the difference here is that you'll have days to plan or preempt the outage. And lastly, there will be times when you need to restart the NAT yourself, as a result of applying kernel-level patches.

Regardless of how the NAT goes out of service, it will take all of the open connections with it. Your applications must be written to handle network timeouts and reconnect, or to gracefully recover when shut down (in which case it's often easiest to just reboot all machines in the availability zone and let them establish new connections).

For us, the cost of running and caring for four NAT Instances remains less than the cost of the NAT Gateway. If we can bring down our traffic to third-party sites that decision will be re-evaluated.

Thursday, November 2, 2017

Ghost Traffic on AWS

We're reconfiguring some of our network endpoints at work, and yesterday, the day after Halloween, I see this traffic appear in our local access logs: - - [01/Nov/2017:15:09:43 -0400] "GET /grave/Charles-Karlson/16427428 HTTP/1.1" 404 967

Line after line. Every few seconds someone — or something — is asking about another grave.

This environment was running under Amazon's Elastic Beanstalk service, which assigns public DNS addresses based on the environment name, and I figured we probably weren't the only people to come up with the name tracker-test-two. Which is true, and leads to one of the take-aways that I'll list below. But the truth, as revealed from the ELB access logs, was stranger:

2017-11-01T19:09:43.168404Z awseb-e-f-AWSEBLoa-1FG2EMDGMS77G 0.000045 0.003369 0.00002 404 404 0 967 "GET HTTP/1.1" "Mozilla/5.0 (compatible; SemrushBot/1.2~bl; +" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

That's a real site: if you go to it, you'll see an image of a gravestone and information about the person buried under that stone. Useful, I'm sure, to people who are tracking their ancestry. But why was the traffic going to our web server?

The answer is DNS caching: to reduce load on the domain name system, intermediate nameservers are allowed to cache the results of a DNS lookup. That generally works pretty well, and has a 60 second timeout, which should have stopped traffic to that IP address fairly quickly (if you haven't already guessed, they also run on AWS servers, and recently reconfigured their environments). However, some programs and platforms do their own caching, and don't pay attention to the timeout — Java is particularly bad at this. So “sembot” is using its cached value for the URL, and will continue to make those requests until restarted.

There was another interesting artifact in our logs, although not quite as spooky:

2017-11-02T11:49:26.343413Z awseb-e-h-AWSEBLoa-1SXOP8PJPZS8R 0.000116 0.013641 0.000074 200 200 0 7520 "GET HTTP/1.1" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2

This traffic was appearing every few seconds in the logs for our old server group. This isn't a DNS caching issue: it's hitting an explicit IP address. And it doesn't look like a script kiddie trying to find vulnerabilities in our servers: it's always the same endpoint, always from the same source IP (which is probably somewhere in the midwest), and happens every 60 seconds. As far as I can tell, it's a homegrown health-check program.

Which means that someone is going to panic when I shut down that environment later this morning.

To wrap up, here are some take-aways that you might find useful:

  1. Someone had that IP address before you. Expect unexpected traffic, and think about how you will react if it's excessive.
  2. Don't write your own health checks. If you're using an ELB, you can configure it to run a health check and raise an alarm when the health check fails.
  3. If you feel you must write your own health check, use a hostname rather than an explicit IP. I know you think you'll have that IP forever, but you probably won't.
  4. Pick an appropriate time-to-live for your DNS entries. The Internet Police might yell at me for suggesting 60 seconds, but it makes your life a lot easier if you need to transition environments. Interestingly, Google (which hosts this blog) uses 78 seconds.
  5. If you're using Elastic Beanstalk, use URLs that contain random text (the environment ID is a good choice). Remember that it's a shared namespace.
  6. If you're writing an application that repeatedly connects to remote hosts, understand how your platform deals with DNS caching, and try to honor the server's intentions.
  7. If you run — or any site with similar content — don't change your IPs on Halloween.

Saturday, August 12, 2017

Announcing log4j-aws-appenders

A few months ago I started a "weekend project" to enable logging from my application to CloudWatch. I had used the AWS-provided log appender when working with AWS Lambda, and liked its convenience. For applications running on EC2 instances, however, the CloudWatch Logs Agent was the recommended way to go. I looked around, but all I found was an appender for Log4J 2.0 (I assumed that the Lambda appender uses some Lambda-specific features).

So, as I said, weekend project. Except that I started adding features and refining how the appender worked, based on my use with a semi-production project (runs 24/7, but not business-critical at the moment). At this point it's been running apparently bug-free for weeks, and I can't think of any features that I want to add, so it's time to release.

The JAR is available on Maven Central, so you can simply add it to your project POM:


Then you need to add the appender to your Log4J config:

log4j.rootLogger=WARN, console, cloudwatch

log4j.appender.console.layout.ConversionPattern=%d [%t] %-5p %c %x - %m%n
log4j.appender.cloudwatch.layout.ConversionPattern=%d [%t] %-5p %c %x - %m%n


Note that I create a default ConsoleAppender, and only attach the CloudWatchAppender to my program's package (com.example.log4j). You may prefer to send everything to CloudWatch, but if you do beware that the AWS SDK does its own logging; you won't want to use DEBUG level for it or the Apache HTTP client:

Second thing to note is the logStream configuration parameter: it (and logGroup) can use substitution variables. Here I'm writing a new stream for each application run, rotated daily, with a sequence number to keep track of the different streams.

For more information, head over to the project on GitHub. Feel free to submit issues if you find problems or want an enhancement; I can't guarantee turnaround time for enhancements, but will try to get bugs fixed within a few days.

Next up: an appender for Kinesis Firehose in order to use Kibana with ElasticSearch.

Wednesday, August 9, 2017

Managing Secrets with KMS

Update, April 2018: Amazon just introduced AWS Secrets Manager, which allows you to securely store, retrieve, and version secrets with attached metadata. It also provides direct integration with RDS (MySQL, Postgres, and Aurora only), allowing you to rotate passwords for these services without a code update. Unless you really want to track your secrets in source control, this is a better solution than what's described in this post.

Managing secrets — database passwords, webservice logins, and the like — is one of the more painful parts of software development and deployment. You don't want them to appear in plaintext, because that's a security hole waiting to be exploited. Yet you want them to be stored in your source control system, so that you can track changes. In order to use the secrets you need to decrypt them, but storing a decryption key in source control is equivalent to storing the secrets themselves in plaintext.

I've seen several ways to solve this problem, and liked none of them. On one end of the spectrum are tools like BlackBox, which try to federate user-held keys so that those users can encrypt or decrypt a file of secrets. It was the primary form of secret-sharing at one company I worked at, and I found it quite brittle, with users quietly losing the ability to decrypt files (I suspect as a result of a bad merge).

On the other end of the spectrum is something like HashiCorp Vault, which is a service that provides encrypted secret storage as one of its many capabilities. But you have to manage the Vault server(s) yourself, and you still need to manage the secret that's used to authenticate yourself (or your application) with the service.

One alternative that I do like is Amazon's Key Management Service (KMS). With KMS, you create a “master key” that is stored in Amazon's data center and never leaves. Encryption and decryption are web services, with access controlled via the Amazon Identity and Access Management service. The big benefit of this service is that you can assign the decryption role to an EC2 instance or Lambda function, so that you never need to store physical credentials. The chief drawback is that service-based encryption is limited to 4k worth of data, and you'll pay for each request; for managing configuration secrets, this shouldn't be an issue.

In this post I'm going to show two examples of using KMS. The first is simple command-line encryption and decryption, useful for exchanging secrets between coworkers over an untrusted medium like email. The second shows how KMS can be used for application configuration. To follow along you'll need to create a key, which will cost you $1 for each month or fraction thereof that the key exists, plus a negligible amount per request.

Command-line Encryption and Decryption

Security professionals may suffer angina at the very idea of sharing passwords, but there are times when it's the easiest way to accomplish a task. However, the actual process of sharing is a challenge. The most secure way is to write the password on a piece of paper (with a felt-tip pen so that you don't leave an imprint on the sheet below), physically hand that sheet to the recipient, and expect her to burn it after use. But I can't read my own writing, much less expect others to do so. And physical sharing only works when the people are colocated, otherwise the delays become annoying and you have to trust the person carrying the message.

Sending plaintext passwords over your instant messaging service is a bad idea. I trust Slack as much as anybody, but it saves everything that you send, meaning that you're one data breach away from being forced to change all of your passwords. You could use GPG to encrypt the secret, but that requires that the recipient have your public key (mine is here, but do you have faith that I will have control over that server when “I” send you a message?).

If both of you have access to the same KMS master key, however, there is a simple solution:

> aws kms encrypt --key-id alias/example --output text --query CiphertextBlob --plaintext "Hello, world!"

That string is a Base64-encoded blob of ciphertext that encrypts both the plaintext and the key identifier. You can paste it into an email or instant messenger window, and the person at the other end simply needs to decode the Base64 and decrypt it.

> echo "encrypted string goes here" | base64 -d > /tmp/cipherblob

> aws kms decrypt --ciphertext-blob fileb:///tmp/cipherblob
    "Plaintext": "SGVsbG8sIHdvcmxkIQ==",
    "KeyId": "arn:aws:kms:us-east-1:717623742438:key/dc46c8c3-2269-49ef-befd-b244c7f364af"

Well, that's almost correct. I showed the complete output to highlight that while encrypt and decrypt work with binary data, AWS uses JSON as its transport container. That means that the plaintext remains Base64-encoded. To actually decrypt, you would use the following command, which extracts the Base64-encoded plaintext from the response and pipes it through the Base64 decoder:

> aws kms decrypt --ciphertext-blob fileb:///tmp/cipherblob --output text --query Plaintext | base64 -d
Hello, world!

Secrets Management

The previous example assumed the user that was allowed to use the key for both encryption and decryption. But the ability to encrypt does not imply the ability to decrypt: you control access to the key using IAM policies, and can grant encryption to one set of users (developers) and decryption to another (your applications).

Let's start with the policy that controls encryption:

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [

For your own policy you would replace 1234567890 with your AWS account ID, and dc46… with the UUID of your own key. Note that you have to use a UUID rather than an alias (ie: alias/example from the command-line above); you can get this UUID from the AWS Console.

Attach this policy to the users (or better, groups) that are allowed to encrypt secrets (typically your entire developer group).

The decryption policy is almost identical; only the action has changed.

    "Version": "2012-10-17",
    "Statement": [
            "Effect": "Allow",
            "Action": [
            "Resource": [

However, rather than attaching this policy to a user or group, attach it to an EC2 instance role. Then, inside your application, use this code to do the decryption:

    private String decodeSecret(AWSKMS kmsClient, String secret) {
        byte[] encryptedBytes = BinaryUtils.fromBase64(secret);
        ByteBuffer encryptedBuffer = ByteBuffer.wrap(encryptedBytes);

        DecryptRequest request = new DecryptRequest().withCiphertextBlob(encryptedBuffer);
        DecryptResult response = kmsClient.decrypt(request);

        byte[] plaintextBytes = BinaryUtils.copyAllBytesFrom(response.getPlaintext());
        try {
            return new String(plaintextBytes, "UTF-8");
        catch (UnsupportedEncodingException ex) {
            throw new RuntimeException("UTF-8 encoding not supported; JVM may be corrupted", ex);

Note that you pass in the KMS client object: like other AWS client objects, these are intended to be shared. If you use a dependency-injection framework, you can create a singleton instance and inject it where needed. As with any other AWS client, you should always use the client's default constructor (or, for newer AWS SDK releases, the default client-builder), which uses the default provider chain to find actual credentials.

This Doesn't Work!

There are a few things that can trip you up. Start debugging by verifying that you've assigned the correct permissions to the users/groups/roles that you think you have: open the user (or group, or role) in the AWS Console and click the “Access Advisor” tab. It's always worth clicking through to the policy, to ensure that you haven't accidentally assigned the encrypt policy to the decrypt user and vice-versa.

Amazon also provides a policy simulator that lets you verify explicit commands against a user/role/group. If use it, remember that you have to explicitly reference the ARN of the resource that you're testing (in this case, the key); by default the Policy Simulator uses a wildcard (“*”), which will be rejected by a well-written policy.

When making changes to policies, remember that AWS is a distributed system, so changes may take a short amount of time to propagate (the IAM FAQ uses the term “almost immediately” several times).

And lastly, be aware that KMS keys have their own policies, and that the key's own policy must grant access to the AWS account for any IAM roles to be valid. If you create your KMS key via the console it will have an appropriate default policy; this may not be the case if you create it via the SDK or command line (although the docs indicate that, even then, there's a default policy that grants access to the account, so this problem is unlikely).

But I don't want to be locked in to AWS!

I've heard several people raise this issue, about various AWS services. It's not an issue that I think much about: the last few companies that I've worked for were running 100% in AWS. We're already locked in, making use of multiple AWS services; KMS is just one more. If you're also running entirely in AWS, I think that you should embrace the services available, and not worry about lock-in; moving to another cloud provider isn't a task to be undertaken lightly even if you're just using compute services.

For those running hybrid deployments (part in the cloud part in a datacenter), or who use cloud services from multiple vendors, the concern is perhaps more relevant. If only because your operations become dependent on the network connection between you and AWS.

The cost-benefit analysis in that case becomes a little more complex. I still think that KMS — or a similar service provided by other clouds, like Azure Key Vault — is worthwhile, simply because it applies rigor to your secrets management. With a little thought to your configuration management, you should be able to keep running even if the cloud is unavailable.

Wednesday, August 2, 2017

What I Look For When Evaluating Coding Challenges

Like many companies, my employer uses a coding challenge as part of the interview process. You may or may not like coding challenges, or think they're fair or representative, but I think it's a useful tool as part of the interview process. This post explains why, and how I look at the challenge responses. Other companies and people may do things differently.

Our challenge is the third step in the interview process. The process begins with several developers reading the candidate's resume and voting. If a candidate receives a net positive vote, we proceed to the second stage, which is a conversation with our HR representative to go over the position. If the candidate considers him- or herself a fit, we send the coding challenge, at a date and time of the candidate's choosing.

The challenge is taken almost verbatim from one of the many coding interview books, a fact that one candidate pointed out to us with reference to the exact book and page — and followed that with an answer to a completely different question. We allow the candidate to take as much time as desired, with the suggestion that it should only require an hour; the implication is that if you're still struggling after several hours it's time to give up. Originally we had a hard one-hour time limit, but I found that candidates were making silly mistakes and pushed for the relaxed deadline in the hope that they'd take time for polishing. Unfortunately, candidates still make silly mistakes.

The question is relatively simple, but involves relationships between data structures; something that seems to give people trouble. It could be implemented using a six-line SQL query with a self-join, and if anyone ever submits that I will give him or her an immediate thumbs-up.

But, barring that, here are the things that I'm evaluating, in order of importance:

  • It's gotta run
    Seems self-evident, no? But it's amazing to me how many submissions have glaring bugs that indicate the candidate hasn't actually run the code. We used to give a pass if we could fix the bug (most of them were one-liners) and the algorithm was reasonable, but we stopped doing that: if you can't submit running code for your interview, why should we expect you to submit running code as part of your daily work?
  • It's gotta do what we asked
    The acceptance criteria are simple: “print X.” But it's amazing how many people do that only coincidentally. Again, we used to give a pass if you printed X along with Y and Z, but doing so raises the same issue as above: will you suddenly become less sloppy if we hire you? I'm betting no.
  • Use appropriate data structures
    This is a somewhat fuzzy criterion, but an example might help. If you use a List and then write deduplication logic, I'm going to wonder if you know what a Set is. That's not a show-stopper for an entry- or mid-level position, but it's a serious negative for a lead. That said, so far I haven't seen a submission that used inappropriate data structures and didn't have other significant problems.

That's it. I'm not looking for a particular answer or a particular style of coding; there are several equally valid approaches. If you chose one of them (or something completely different that passes our test cases) you'll be invited to the next stage of the interview process, the technical phone screen. And that brings up my final expectation:

Be prepared to discuss your work

One of my personal annoyances with coding challenges is when they appear to go into the bitbucket: you don't know whether the company even looked at it, or just assigned it as a way to weed out people who weren't committed enough to submit anything. So when I do a phone interview, I open up the code and ask the candidate questions about it (they've been told that this will happen, so have the opportunity to prepare — it's still surprising to me how many don't).

I present the discussion as a code review: I want to see how the candidate responds to critique. And there's almost always something to criticize: the one-hour suggested time limit usually leads to corners being cut (although I did have one candidate whose code was almost perfectly written; he did coding challenges as a hobby). I believe a question like “why did you iterate this structure twice?” can lead to useful insights about candidates; at the least, it shows whether they can look at their own code dispassionately.

Other topics of my phone interview include asking the candidate to evaluate the code him- or herself, asking the candidate to compare his or her approach with the other standard approach to the problem (it's interesting: most candidates “see” just one approach or the other), and finally, asking what tests would be appropriate for the code.

Does all of this lead to a better candidate? To be honest, I think our false-positive rate is still too high: we get people who pass the coding challenge but then fail the in-person interview (which has design questions and a “do I want to work with this person” focus).

But compared to my experience at a former company that let HR do all the screening, it's a lot better: I've never had the experience of sitting down in a face-to-face interview with someone who has a long resume but no competence.

Saturday, May 20, 2017


I recently sold three calculators on eBay: an HP15C, an HP16C, and a TI59. Got slightly over $300 for them, which was less than I paid 30-plus years ago (especially adjusted for inflation), but they had no value sitting on a shelf in my office. Along the way, I realized two things.

First, I was, and may still be, a calculator geek.

I was never one of those people that walked around with his calculator on his belt (indeed, only one of the three even had a case with a belt loop). But, let's be honest, owning a dozen calculators just ain't normal (no, I didn't sell all of my old calculators; sue me). Nor is it normal for a 15-year-old to save his allowance and odd-job money so that he could buy a programmable calculator (a TI-55, which I still have; again, sue me).

The second thing I realized is just how irrelevant calculators are today.

These three calculators were the pinnacle of late-70s/early-80s calculator design. All three were programmable. The HP15 was designed for engineers: it could work with complex numbers, and could compute integrals and roots based on your programmed function. The TI-59, with nearly 1 KB (!) of memory and a magnetic card reader, was a viable competitor for then-current home computers, at 1/10th the price (and was touted as such in contemporary reviews). Certainly I used it as such: looking over my personal program card library, I saw one card that I used to balance my checkbook, tracking credits and debits.

But as desktop computers became more prevalent, nobody needed programmable calculators. In the early 90s I bought what I still consider to be HP's best calculator, the HP27S. It wasn't programmable, but you could still find the root of a function: you simply entered it as a formula and the calculator was smart enough to either solve it algebraically or numerically. It also combined functionality from the financial and computing domains with the basic scientific functions. And it used algebraic notation, which I'm sure upset a lot of HP purists.

I first got a hint of how irrelevant the physical devices had become a few years ago, when I was doing some bit manipulation operations and took out my 16C. When I learned that a new set of batteries would cost $15, I looked for something similar that I could run on my phone. And found apps that were based on the actual HP ROMs (they weren't copyrighted), running on an emulator. For less than the cost of batteries.

So the actual calculator remained on a shelf, unused. And really, I don't use the apps that much either (I also downloaded an HP11 emulation, so plural). When sitting at the computer it's easier to start up a Python shell or a spreadsheet. Really, the only place that I use the calculator is when I'm at Home Depot trying to figure out how many bags of fertilizer I need.

Perhaps high school and college students still use calculators, but I suspect not. Laptops are pervasive and inexpensive; why carry two devices. Which makes me wonder: will the calculators I sold sit unused on the shelf of someone who's more of a calculator geek than me?

Sunday, March 5, 2017

MySQL: The problem with synthetic primary keys

One of the standard practices in database design is to use a “synthetic” primary key — often an auto-increment integer — rather than a “natural” key relevant to the business domain. This is a Good Thing: natural keys are often not unique or ever-present, even when you think they are. Social security number is a perfect example: aside from the fact that you're not legally allowed to use it for non-tax-related identification, not everybody has one, and some people (either maliciously or accidentally) give you the wrong one.

So this post is not against synthetic primary keys per se. Rather, it's about the performance penalty that you'll pay if you create a single-column auto-increment primary key for every table in your database.

As an example, I'm going to use the USERS and EVENTS tables from my previous post. The former has 100,000 rows, the latter 10,000,000. As before, only 25% of events have associated users, so it's not a strict parent-child relationship, but we'll get to that.

create table USERS
    ID          integer not null auto_increment,
    CREATED_AT  timestamp not null default CURRENT_TIMESTAMP,
    VALUE       varchar(255),

    primary key (ID),
    key         (CREATED_AT)

create table EVENTS
    ID          integer not null auto_increment,
    CREATED_AT  timestamp not null default CURRENT_TIMESTAMP,
    TYPE        smallint not null,
    USER_ID     integer null,
    VALUE       varchar(255),

    primary key (ID),
    key         EVENTS_USER_IDX (USER_ID)

Both of these tables have synthetic primary keys, an auto-increment ID column. The EVENTS table has a foreign key relationship to USERS via USER_ID, and a secondary index to (we hope!) improve join times. Here's a typical query, with its execution plan:

select  count(*)
from    USERS U
join    EVENTS E on E.USER_ID = U.ID
where   U.CREATED_AT > subdate(CURRENT_TIMESTAMP(), 7)
and     E.TYPE = 7;
| id | select_type | table | partitions | type  | possible_keys      | key             | key_len | ref          | rows | filtered | Extra                    |
|  1 | SIMPLE      | U     | NULL       | range | PRIMARY,CREATED_AT | CREATED_AT      | 4       | NULL         | 3891 |   100.00 | Using where; Using index |
|  1 | SIMPLE      | E     | NULL       | ref   | EVENTS_USER_IDX    | EVENTS_USER_IDX | 5       | example.U.ID |   98 |    10.00 | Using where              |

That looks perfectly reasonable: the query selects rows from USERS using the index on CREATED_AT, and then joins to EVENTS using the index on USER_ID. With my test database it returns slightly under 10,000 rows.

But why does it take 12 seconds the first time I run it, and only 1 second for subsequent runs?

The problem is that MySQL manages primary keys as a clustered index: it orders the physical rows in the table according to the primary key value. What this means for the join between USERS and EVENTS is that the events for a given user will be sprinkled all over the disk: each new row will be assigned a primary key value that's physically located at the end of the current data. As a result, if you select all of the rows for a particular user, you may have to access a large number of disk blocks. And to access a block you have to read it from disk.

That's not so bad if you have enough RAM to cache your entire database — which is why my subsequent queries ran so quickly. But if you don't, or queries against the EVENTS table are infrequent, you'll be hitting the disk constantly. Worse, you'll be evicting pages from the disk cache, so all your other queries will slow down because they also have to read data from the disk.

So what's the alternative?

create table NEW_EVENTS
  ID          int         not null auto_increment,
  CREATED_AT  timestamp   not null default CURRENT_TIMESTAMP on update CURRENT_TIMESTAMP,
  TYPE        smallint    not null,
  USER_ID     int         not null,
  VALUE       varchar(255),


With this new table, my example query has the same execution plan. But the primary key on the new table organizes its data such that rows for the same user are collocated: assuming that your conditions don't select all users, it should read fewer blocks. On my laptop — after restarting MySQL and flushing the disk cache — it took just under half a second for the first query, and 0.04 seconds for subsequent queries.

So clearly the theory is borne out in practice, but can we gain more insight into what's happening? The answer is yes, provided that you have the performance schema enabled for your database (and you really should, even in production!). You can retrieve information about the number of disk operations and the time they take by querying TABLE_IO_WAITS_SUMMARY_BY_INDEX_USAGE:

from    performance_schema.table_io_waits_summary_by_index_usage
where   OBJECT_NAME in ('USERS', 'EVENTS');

Like other tables in the performance schema, the counters are not tied to a query; you need to ensure that nothing else is happening and truncate the table between your test queries. Arguably better is to restart your server and flush the disk caches, as this will show you the worst-case timings.

Here are the stats for the first query:

| example       | EVENTS      | PRIMARY         |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | EVENTS      | EVENTS_USER_IDX |      97669 | 12350283888004 |           0 |               0 |       97669 |  12350283888004 |
| example       | USERS       | PRIMARY         |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | CREATED_AT      |       3892 |     5101149804 |           0 |               0 |        3892 |      5101149804 |

And for the second:

| example       | NEW_EVENTS  | PRIMARY           |      97669 |   455075854140 |           0 |               0 |       97669 |    455075854140 |
| example       | NEW_EVENTS  | NEW_EVENTS_ID_KEY |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | PRIMARY           |          0 |              0 |           0 |               0 |           0 |               0 |
| example       | USERS       | CREATED_AT        |       3892 |     2637773460 |           0 |               0 |        3892 |      2637773460 |

The number of reads are identical, but look at the difference in the read timer. This value is measured in trillionths of a second, so after adding a decimal point and dropping digits from the end we see that accesses to the EVENTS table took 12.350 seconds for the first query and 0.455 seconds for the second, with the same number of reads in each. The difference is that, to find all of the relevant events for a given user the first query had to read (and wait for) far more blocks to come from the disk.

You will also notice that the time taken to access the USERS table via the CREATED_AT index has decreased as well. This can't be explained by data clustering — indeed, in creating the test data for this post, I ensured that users would not be clustered by creation date. However, I believe that these numbers show the overall benefit of reducing disk operations: by reading fewer blocks in the NEW_EVENTS table, there's a higher likelihood that blocks from USERS will remain in the disk cache.

There are, of course, tradeoffs to reindexing your tables. The most obvious one is that EVENTS.USER_ID can no longer be nullable, because it's part of the primary key. This will cause some pain when dealing with anonymous users: you must insert an explicit value rather than leave the column NULL (I will note that some authorities, including C. J. Date, will say you shouldn't use nulls anyway, so this isn't necessarily a hardship).

There's also a price to pay with inserts: in order to keep the rows ordered by primary key, MySQL may need to split blocks on insert, in order to make room for the new rows. I don't think this is a major issue: for active users, you'll be inserting new blocks into the table rather than splitting existing blocks, and for inactive users you won't often pay the price of a split.

And finally, beware that this technique this isn't appropriate everywhere. It works best with large tables, where physical row data has to be read to perform a query (versus a covering index). As with all performance optimizations, there's no substitute for actual measurement.

Sunday, February 26, 2017

Warming up RDS Snapshots

I learned something last week: when you start an RDS database instance from a snapshot, the instance's volume isn't copied all at once. Instead, it's lazily paged in from the snapshot, which is stored in S3. I suppose that shouldn't surprise me; after all, it's documented behavior (albeit for EC2, not RDS).

What did surprise me was just how much of a performance hit it causes. For example, a table scan of one of our large, rarely accessed (so not in cache) tables took about 10 minutes on an instance that had been running for several months. The same query, on a new but otherwise identical replica took nearly an hour.

Read replicas are one of the places where this behavior is particularly painful: one of the primary reasons for using a replica is to move long-running queries away from your primary machine. But with lazy-loading, for the first few hours (or days!) of a replica's life, it's significantly slower than running against the master. Worse, if you ever need to restore from a snapshot: your whole system will run slower while the blocks are loaded.

So what can you do?

You need to “warm up” the replica, by reading data from it before you put it into active use. And there are several ways to do this: one is to write queries that touch every table, preferably with a table-scan. An alternative is to take a backup of the database, and immediately throw it away:

mysqldump --all-databases --host HOST --user USER --password > /dev/null

I'm not convinced this is sufficient: it will touch all of the data blocks, but may leave the index blocks behind. I welcome any better approaches.

Saturday, February 11, 2017

MySQL: when NOT IN is not equal to NOT EXISTS

When you want to perform a difference operation between two tables, you have a choice: NOT EXISTS with a correlated subquery, or NOT IN. The latter is arguably simpler to write and makes the intent of the query more obvious. And modern database systems will optimize the two queries into similar execution plans, handling the correlation between outer and inner queries (I say “modern” because when I was working with Oracle 7.3 in the mid-90s I learned the hard way that it did not).

There is one key difference between the two constructs: if the subquery returns a NULL in its results then the NOT IN condition will fail, because null is neither equal-to nor not-equal-to any other value. But if you guard against that, they should be equivalent — indeed, some sources will tell you that NOT IN is faster and therefore preferred.

This post is about one case where it's dramatically slower, and nulls are to blame.

Consider the following two tables, which might be used to track clickstream data. Since we track both anonymous and registered users, EVENTS.USER_ID is nullable. However, when the user is not null, the secondary index has high cardinality.

create table USERS
  ID    integer auto_increment primary key,

create table EVENTS
  ID      integer auto_increment primary key,
  TYPE    smallint not null,
  USER_ID integer


OK, now let's use these tables: from a small set of users, we want to find the ones that haven't had a particular event. Using a NOT IN, and ensuring that nulls don't appear in the inner results, the query looks like this:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     ID not in
        select  USER_ID
        from    EVENTS
        where   TYPE = 7
        and     USER_ID is not null

For my test dataset, the USERS table has 100,000 rows and the EVENTS table has 10,000,000, of which approximately 75% have a null USER_ID. I'm running on my laptop, which has a Core i7 processor, 12 GB of RAM, and an SSD.

And I consistently get runtimes of around 2 minutes, which is … wow.

Let's replace that NOT IN with a NOT EXISTS and correlated subuery:

select  ID
from    USERS
where   ID in (1, 7, 2431, 87142, 32768)
and     not exists
        select  1
        from    EVENTS
        where   USER_ID = USERS.ID
        and     TYPE = 7

This version runs in 0.01 seconds, which is more what I expected.

Time to compare execution plans. The first plan is from the NOT IN query, the second is from the NOT EXISTS.

| id | select_type        | table  | partitions | type           | possible_keys   | key             | key_len | ref  | rows | filtered | Extra                    |
|  1 | PRIMARY            | USERS  | NULL       | range          | PRIMARY         | PRIMARY         | 4       | NULL |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | index_subquery | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | func |  195 |    10.00 | Using where              |
| id | select_type        | table  | partitions | type  | possible_keys   | key             | key_len | ref              | rows | filtered | Extra                    |
|  1 | PRIMARY            | USERS  | NULL       | range | PRIMARY         | PRIMARY         | 4       | NULL             |    5 |   100.00 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | EVENTS | NULL       | ref   | EVENTS_USER_IDX | EVENTS_USER_IDX | 5       | example.USERS.ID |   97 |    10.00 | Using where              |

Almost identical: both select rows from the USERS table, and then use a nested-loops join (“dependent subquery&rquo;) to retrieve rows from the EVENTS table. Both claim to use EVENTS_USER_IDX to select rows in the subquery. And they estimate similar numbers of rows at each step.

But look more closely at the join types. The NOT IN version uses index_subquery, while the NOT EXISTS version uses ref. Also look at the ref column: the NOT EXISTS version uses an explicit reference to the outer column, while the NOT IN uses a function. What's going on here?

The index_subquery join type indicates that MySQL will scan the index to find relevant rows for the subquery. Could that be the problem? I don't think so, because the EVENTS_USER_IDX is “narrow”: it only has one column, so the engine should not have to read a lot of blocks to find rows corresponding to the IDs from the outer query (indeed, I've tried a variety of queries to exercise this index, and all run in a few hundredths of a second).

For more information, I turned to the “extended” execution plan. To see this plan, prefix the query with explain extended, and follow it with show warnings. Here's what you get from the NOT IN query (reformatted for clarity):

/* select#1 */  select `example`.`USERS`.`ID` AS `ID` 
                from    `example`.`USERS` 
                where   ((`example`.`USERS`.`ID` in (1,7,2431,87142,32768)) 
                        and (not((`example`.`USERS`.`ID`,
                            (((`example`.`USERS`.`ID`) in EVENTS 
                            on      EVENTS_USER_IDX checking NULL
                            where   ((`example`.`EVENTS`.`TYPE` = 7) 
                                    and (`example`.`EVENTS`.`USER_ID` is not null))
                            having (`example`.`EVENTS`.`USER_ID`)))))))

I haven't been able to find an explanation of “on EVENTS_USER_IDX checking NULL” but here's what I think is happening: the optimizer believes that it is executing an IN query that can have NULL in the results; it does not consider the null check in the where clause when making this decision. As a result, it will examine the 7.5 million rows where USER_ID is null, along with the few dozen rows where it matches the values from the outer query. And by “examine,” I mean that it will read the table row and only then apply the is not null condition. Moreover, based on the time taken to run the query, I think it's doing this for every one of the candidate values in the outer query.

So, bottom line: whenever you're thinking of using an IN or NOT IN subquery on a nullable column, rethink and use EXISTS or NOT EXISTS instead.

Friday, January 27, 2017

Trusting the Internet: Picking Third-Party Libraries

Many applications today are like the human body:* a relatively small proportion of “in-house” code, leveraged by dozens if not hundreds of third-party libraries — everything from object-relational mappers to a single function that left-pads a string. And that leads to a conundrum: how do you pick the libraries that you include in your project? Or in other words, is it OK to download something from the Internet and make it a fundamental part of your business?

Sometimes, of course, you don't have a choice. If you use the JUnit testing framework, for example, you are going to get the Hamcrest library along with it (and maybe you'll feel some concern that the domain is no longer registered). But what criteria did you use to pick JUnit?

I was recently faced with that very question, in looking for a library to parse and validate JSON Web Tokens in Java. There are several libraries to choose from; these are the criteria that I used to pick one, from most important to least.

  1. Following the crowd
    If 100,000 projects use a particular library without issue, chances are good that you can too. But how do you know how many projects use a library? For JavaScript projects, npm gives you numbers of downloads; ditto for Ruby projects and the gems they use. Java projects don't have it so easy: while Maven Central does keep statistics on downloads (they're available to package maintainers), that information isn't available to consumers (other than a listing of the top 10 downloads).

    One interesting technique for following the crowd is looking in your local repository, to see if the package is already there as a dependency of another library. If you can create a dependency tree for your project, look at where the candidate lives in the tree: is it close to the root or deep in the weeds? Is it included by multiple other libraries or just one? These are all signals of how the rest of the world views the library.

  2. Documentation
    I believe that care in documentation is a good proxy for care in implementation. Things I want to see are complete JavaDoc and examples. For projects hosted on GitHub, I should be able to understand the library based solely on the README (and there's no reason that a non-GitHub project should omit a README, although I admit to being guilty in that regard).

  3. Author Credibility
    This can be difficult, especially where the “author” is a corporation (although large corporations tend to do their own vetting before letting projects out under the corporate name). In the case of a sole maintainer, I Google the person's name and see what comes up. I'd like to see web pages that demonstrate deep knowledge of the subject (especially for security-related libraries). Even better are slides from a conference, because that implies tha the author has at least some recognition in the community.

  4. Issue Handling
    Every library has issues. Does the maintainer respond to them in a reasonable timeframe? A large number of outstanding issues should raise a red flag, as should a maintainer that responds in a non-professional manner. You wouldn't accept that from a coworker (I hope), and by using a package you make the maintainer your coworker.

Once I have decided on a candidate library (or small number of candidates) I try it out for my use case. If it looks good, it becomes part of my application. One thing that I do not do is dig into the library source code.

The promise of open-source software is that you can download the sources and inspect them. The reality is that nobody every does that — and nobody could, because it would be more than a full-time job. So we choose as best we can, and hope that there isn't a dependency-of-a-dependency-of-a-dependency that's going to hurt us.

* The reference is to the amount of bacteria and other organisms that don't share your DNA but live on or in your body. You'll often find a 10:1 (bacteria:human) ratio quoted, but see this article for commentary on the history and validity of that ratio (tl;dr, it's more like 60:40).