blog.kdgregory.com

Wednesday, December 28, 2016

Server-side Authentication with Amazon Cognito IDP

I have to say this up-front: I wouldn't choose Cognito for a production project.

Its feature-set is truly compelling: a user database with customizable attributes, authentication with tokens for authorization, automatic verification of both email and SMS, multi-factor authentication, federation with other authentication services such as Google and Facebook, synchronization of user data between devices; I'm sure there's something that I missed. What's not to like?

Well, let's start with its documentation. It is far and away the worst-documented of any AWS service that I've ever used (and I've used a lot of them): the documentation neglects to describe parameters; there are no examples for server-side code, and the client-side examples are snippets that use variables assigned elsewhere; there's no description of the algorithms or techniques used (and enough acronyms that searching is challenging). Moving on, the API is wonky as well, with inconsistent request creation, exceptions used to signal non-exceptional conditions, and a lack of granularity in responses. Plus, the Java API doesn't support all of the actions in the API reference. And there is functionality that's documented but doesn't actually work.

I originally thought that Cognito was an internal Amazon API exposed for external use. Perhaps the API used for authentication by the AWS Console. As such, I expected it to be well-used and rock solid. After working with it, I realized that it is at best a beta-quality product, and appears to be completely separate from anything that Amazon is currently exposing in the realm of authentication/authorization.

So why am I writing this post?

Partly, it's because I'm trying to recover sunk costs. I originally investigated Cognito as part of a presentation on serverless Java apps with AWS Lambda. Authentication was a (small) part of the app, and I wanted to stick with AWS technologies. But getting Cognito working turned out to be a huge investment of time and effort: reading, re-reading, and cross-checking the documentation; reading the Android SDK source code to fill in the holes in the documentation; and writing lots of experiments. I figured that I should have something to show for that effort.

Another reason is that there aren't any examples of server-side code available — or my Google-fu isn't that good. So, this is a public service for anyone else thinking of an all-AWS solution. But it comes with a caveat: the Cognito team may consider my approach completely wrong. If so, I welcome their well-written examples of how to do it right.

And finally, well, the feature-set really is compelling. I hope that the powers-that-be at Amazon realize that Cognito is a beta-quality product and invest in improving it. The best-case scenario is that Cognito becomes the user database for amazon.com, or at least for the AWS web console.

Overview: What to expect in the rest of this post

My example program has three functions: signing up for a new account, signing in to an existing account, and verifying that a user has signed in. The example program is built on Java servlets; I intentionally avoided frameworks such as Spring in order to focus on behavior. Similarly, the browser side uses simple HTML pages, with jQuery to send POST requests to the server.

All servlets return 200 for all requests unless there's an uncaught exception. The response body is a string that indicates the result of the operation. Depending on the result, the client-side code will either show an alert (for bad inputs) or move to the next step in the flow.

On successful sign-in the servlet stores two cookies, ACCESS_TOKEN and REFRESH_TOKEN, which are used to authorize subsequent requests. These cookies are marked “httpOnly” in order to prevent cross-site scripting attacks. The example code does not make any attempt to prevent cross-site request forgery attacks, as such prevention generally relies on data passed as page content.

Also on the topic of security: all communication is sent in clear-text, on the assumption that a real-world application will use HTTPS to secure all communications. Cognito provides a client-side library that exchanges secrets in a secure manner, but I'm not using it (because this is intended as a server-side example).

For those that want to follow along at home, the source code is here.

Usernames

I strongly believe in using an email address as the primary account identifier; I get annoyed every time I'm told that “kdgregory” is already in use and that I must guess at a username that's not in use. Email addresses are a unique identifier that won't change (well, usually: my Starwood Preferred Guest account seems to be irrevocably tied to my address at a former employer, even though I attempt to change it every time I check in).

Cognito also has strong opinions about email addresses: they're secondary to the actual username. It does support the ability to validate an email address and use it in place of the username, but the validation process requires the actual username. While we could generate random usernames on the server, that adds cognitive load on the user — something we all want to avoid.

Fortunately, the rules governing legal usernames allow the use of email addresses. And Cognito allows you to generate an initial password and send it via email, which prevents hijacking of an account by a user that doesn't own that address. Cognito doesn't consider this email to provide validation, which leads to some pain; I'll talk more about that later on.

Creating a User Pool

It's easy to create a user pool, but there are a few gotchas. The following points are ordered by the steps in the current documentation. At the time of writing, you can't use CloudFormation to create pools or clients, so for the example code I provide a shell script that creates a pool and client that match my needs; a sketch of the equivalent SDK calls appears after the list.

  • Require the email attribute but do not mark it as an alias

    Cognito allows users to have alias identifiers that work in place of their username. However, as I mentioned above, these aliases only work if they pass Cognito's validation process. Since we'll be using the email address as the primary identifier, there's no need to mark it as an alias. But we do want to send email to the user, so we must save it as the email attribute in addition to the username.

  • Do not enable automatic verification of email addresses

    If you enable this feature, Cognito sends your users an email with a random number, and you'll have to provide a page/servlet where they enter this number. Note, however, that by skipping this feature you currently lose the ability to send the user a reset-password email.

  • Do not create a client secret

    When you create a client, you have the option for Cognito to create a secret hash in addition to the client ID. The documentation does not describe how to pass this hash in a request, and Cognito will throw an exception if you don't provide it. Moreover, the JavaScript SDK doesn't support client secrets; they're only used by the Android/iOS SDKs.
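
As promised, here's a hedged sketch of the equivalent SDK calls. The pool and client names are placeholders, and the real script does more than this (for example, marking the email attribute as required):

CreateUserPoolRequest poolRequest = new CreateUserPoolRequest()
        .withPoolName("example-pool");
String poolId = cognitoClient.createUserPool(poolRequest)
                             .getUserPool().getId();

// generating a secret is off by default, but it's worth being explicit (see the last bullet)
CreateUserPoolClientRequest clientRequest = new CreateUserPoolClientRequest()
        .withUserPoolId(poolId)
        .withClientName("example-client")
        .withGenerateSecret(Boolean.FALSE)
        .withExplicitAuthFlows(ExplicitAuthFlowsType.ADMIN_NO_SRP_AUTH);
cognitoClient.createUserPoolClient(clientRequest);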

Creating a User (sign-up)

The sign-up servlet calls the adminCreateUser() function.

try
{
    AdminCreateUserRequest cognitoRequest = new AdminCreateUserRequest()
            .withUserPoolId(cognitoPoolId())
            .withUsername(emailAddress)
            .withUserAttributes(
                    new AttributeType()
                        .withName("email")
                        .withValue(emailAddress),
                    new AttributeType()
                        .withName("email_verified")
                        .withValue("true"))
            .withDesiredDeliveryMediums(DeliveryMediumType.EMAIL)
            .withForceAliasCreation(Boolean.FALSE);

    cognitoClient.adminCreateUser(cognitoRequest);
    reportResult(response, Constants.ResponseMessages.USER_CREATED);
}
catch (UsernameExistsException ex)
{
    logger.debug("user already exists: {}", emailAddress);
    reportResult(response, Constants.ResponseMessages.USER_ALREADY_EXISTS);
}
catch (TooManyRequestsException ex)
{
    logger.warn("caught TooManyRequestsException, delaying then retrying");
    ThreadUtil.sleepQuietly(250);
    doPost(request, response);
}

There are a few variables and functions that are used by this snippet but set elsewhere (a sketch of that wiring follows the list):

  • The cognitoClient variable is defined in an abstract superclass; it holds an instance of AWSCognitoIdentityProviderClient. Like other AWS “client” classes, this class is threadsafe and, I assume, holds a persistent connection to the service. As such, you're encouraged to create a single client and reuse it.
  • The emailAddress variable is populated from a request parameter. I don't want to clutter my examples with boilerplate code to retrieve parameters, so you can assume in future snippets that any variable not explicitly described comes from a parameter.
  • I have two functions in the abstract superclass that retrieve configuration values from the servlet context (which is loaded from the web.xml file). Here I call cognitoPoolId(), which returns the AWS-assigned pool ID. Later you'll see cognitoClientId().
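
To make that wiring concrete, here's a sketch of what the superclass might look like. This is my reconstruction, not a verbatim extract from the example project, and the context-parameter names are invented; newer SDK versions provide the client builder shown here, while older ones construct the client directly.

public abstract class AbstractCognitoServlet extends HttpServlet
{
    // a single threadsafe client, shared by all servlets
    protected static final AWSCognitoIdentityProvider cognitoClient =
            AWSCognitoIdentityProviderClientBuilder.defaultClient();

    // pool and client IDs are context parameters defined in web.xml
    protected String cognitoPoolId()
    {
        return getServletContext().getInitParameter("cognito_pool_id");
    }

    protected String cognitoClientId()
    {
        return getServletContext().getInitParameter("cognito_client_id");
    }
}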

Moving on to the code itself: most Cognito client functions take a request object and return a response object. The request objects are constructed using the Builder pattern: each “with” function adds something to the request and returns the request object so that calls can be chained.

Most of the information I provide in the request has to do with the email address. It's the username, as I said above. But I also have to explicitly store it in the email attribute, or Cognito won't be able to send mail to the user. And we want that, so that Cognito will generate and send a temporary password.

I'm also setting the email_verified attribute. This attribute is normally set by Cognito itself when the user performs the verification flow, and it's required to send follow-up emails such as a password reset message. However the documentation states that it can be forcibly set at the time the user is created, and that fits with my desired user flow. Unfortunately, even though this seems to work (the AWS console shows additional actions when it's set), subsequent attempts to reset the password fail with an “unverified email” error. I'm leaving the code here in the hopes that the Cognito team will make it work for real.

Exception handling is the other big part of this snippet. As I said before, Cognito uses a mix of exceptions and return codes; the latter signal a “normal” flow, while exceptions are for things that break the flow — even if they're a normal part of user authentication. If you look at the documentation for adminCreateUser() you'll see an almost overwhelming number of possible exceptions. However, many of these are irrelevant to the way that I'm creating users. For example, since I let Cognito generate temporary passwords, there's no need to handle an InvalidPasswordException.

For this example, the only important “creation” exception is thrown when there's already a user with that email address. My example responds to this by showing an alert on the client, but a real application should initiate a “forgotten password” flow.

The second exception that I'm handling, TooManyRequestsException, can be thrown by any operation; you always need to handle it. The documentation isn't clear on the purpose of this exception, but I'm assuming that it's thrown when you exceed the AWS rate limit for requests (versus a security measure specific to repeated signup attempts). My example uses a rather naive solution: retry the operation after sleeping for 250 milliseconds. If you're under heavy load, this could delay the response to the user for several seconds, which could cause them to think your site is down; you might prefer to simply tell them to try again later.

A successful call to adminCreateUser() is only the first part of signing up a new user. The second step is for the user to log in using the temporary password that Cognito sends to their email address. My example page responds to the “user created” response with a client-side redirect to a confirmation page, which has fields for the email address, temporary password, and permanent password. These values are then submitted to the confirmation servlet.

As far as Cognito is concerned, there's no code-level difference between signing in with the temporary password and the final password: you call the adminInitiateAuth() method. The difference is in the response: when you sign in with the temporary password you'll be challenged to provide the final password.

This ends up being a fairly large chunk of code; I've split it into two pieces. The first chunk is straightforward; it handles the initial authentication attempt.

Map<String,String> initialParams = new HashMap<>();
initialParams.put("USERNAME", emailAddress);
initialParams.put("PASSWORD", tempPassword);

AdminInitiateAuthRequest initialRequest = new AdminInitiateAuthRequest()
        .withAuthFlow(AuthFlowType.ADMIN_NO_SRP_AUTH)
        .withAuthParameters(initialParams)
        .withClientId(cognitoClientId())
        .withUserPoolId(cognitoPoolId());

AdminInitiateAuthResult initialResponse = cognitoClient.adminInitiateAuth(initialRequest);

The key thing to note here is AuthFlowType.ADMIN_NO_SRP_AUTH. Cognito supports several authentication flows; later we'll use the same function to refresh the access token. Client SDKs use the Secure Remote Password (SRP) flow; on the server, where we can secure the credentials, we use the ADMIN_NO_SRP_AUTH flow.

As with the previous operation, we need the pool ID. We also need the client ID; you can create multiple clients per pool, and track which user uses which client (although it's still a single pool of users). You must create at least one client (known in the console as an app). As I noted earlier, both IDs are configured in web.xml and retrieved via functions defined in the abstract superclass.

As I said, the difference between initial signup and normal signup is in the response. In the case of a normal sign-in, which we'll see later, the response contains credentials. In the case of an initial signin, it contains a challenge:

if (! ChallengeNameType.NEW_PASSWORD_REQUIRED.name().equals(initialResponse.getChallengeName()))
{
    throw new RuntimeException("unexpected challenge: " + initialResponse.getChallengeName());
}

Here I expect the “new password required” challenge, and am not prepared for anything else (since the user should only arrive here after a password change). In a real-world application I'd use a nicer error response rather than throwing an exception.

We respond to this challenge with adminRespondToAuthChallenge(), providing the temporary and final passwords. One thing to note is withSession(): Cognito needs to link the challenge response with the challenge request, and this is how it does that.

Map<String,String> challengeResponses = new HashMap<>();
challengeResponses.put("USERNAME", emailAddress);
challengeResponses.put("PASSWORD", tempPassword);
challengeResponses.put("NEW_PASSWORD", finalPassword);

AdminRespondToAuthChallengeRequest finalRequest = new AdminRespondToAuthChallengeRequest()
        .withChallengeName(ChallengeNameType.NEW_PASSWORD_REQUIRED)
        .withChallengeResponses(challengeResponses)
        .withClientId(cognitoClientId())
        .withUserPoolId(cognitoPoolId())
        .withSession(initialResponse.getSession());

AdminRespondToAuthChallengeResult challengeResponse = cognitoClient.adminRespondToAuthChallenge(finalRequest);
if (StringUtil.isBlank(challengeResponse.getChallengeName()))
{
    updateCredentialCookies(response, challengeResponse.getAuthenticationResult());
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
}
else
{
    throw new RuntimeException("unexpected challenge: " + challengeResponse.getChallengeName());
}

Assuming that the provided password was acceptable, and there were no other errors (see below), we should get a response that (1) has a blank challenge, and (2) has valid credentials. I don't handle the case where we get a new challenge (which in a real-world app might be for multi-factor authentication). I store the returned credentials as cookies (another method in the abstract superclass), and return a message indicating that the user is logged in.

Now for the “other errors.” Unlike the initial signup servlet, this operation has a bunch of exceptions that might apply. TooManyRequestsException, of course, is possible for any call, but here are the ones specific to my flow (with the handling sketched after the list):

  • InvalidPasswordException if you've set rules for passwords and the user's permanent password doesn't satisfy those rules. Cognito lets you require a combination of uppercase letters, lowercase letters, numbers, and special characters, along with a minimum length.
  • UserNotFoundException if the user enters a bogus email address. This could be an honest accident, or it could be a fishing expedition. A security-conscious site should attempt to discourage such attacks; one simple approach is to delay the response after every failed request (but note that could lead to a denial-of-service attack against your site!).
  • NotAuthorizedException if the user provides an incorrect temporary password. Again, this could be an honest mistake or an attack; do not give the caller any indication that they have a valid user but invalid password (I return the same “no such user” message for this exception and the previous one).
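
Here's a sketch of the corresponding catch blocks; the response constants are placeholders in the style of my example code. Note that the last two deliberately report the same result:

catch (InvalidPasswordException ex)
{
    reportResult(response, Constants.ResponseMessages.INVALID_PASSWORD);
}
catch (UserNotFoundException ex)
{
    reportResult(response, Constants.ResponseMessages.NO_SUCH_USER);
}
catch (NotAuthorizedException ex)
{
    // same message as above: don't confirm that the username exists
    reportResult(response, Constants.ResponseMessages.NO_SUCH_USER);
}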

Before wrapping up this section, I want to point out that there are cases where Cognito throws an exception but the user has been created. There's no good way to recover from this, other than to provide the user with a “lost password” flow.

Authentication (sign-in)

You've already seen the sign-in code, as part of sign-up confirmation. Here I want to focus on the response handling.

Map<String,String> authParams = new HashMap<>();
authParams.put("USERNAME", emailAddress);
authParams.put("PASSWORD", password);

AdminInitiateAuthRequest authRequest = new AdminInitiateAuthRequest()
        .withAuthFlow(AuthFlowType.ADMIN_NO_SRP_AUTH)
        .withAuthParameters(authParams)
        .withClientId(cognitoClientId())
        .withUserPoolId(cognitoPoolId());

AdminInitiateAuthResult authResponse = cognitoClient.adminInitiateAuth(authRequest);
if (StringUtil.isBlank(authResponse.getChallengeName()))
{
    updateCredentialCookies(response, authResponse.getAuthenticationResult());
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
    return;
}
else if (ChallengeNameType.NEW_PASSWORD_REQUIRED.name().equals(authResponse.getChallengeName()))
{
    logger.debug("{} attempted to sign in with temporary password", emailAddress);
    reportResult(response, Constants.ResponseMessages.FORCE_PASSWORD_CHANGE);
}
else
{
    throw new RuntimeException("unexpected challenge on signin: " + authResponse.getChallengeName());
}

With my example pool configuration, once the user has completed signup there shouldn't be any additional challenges. However, we have to handle the “new password required” flow for two reasons: first, to support a forced password change; and second, because the user might not complete signup in one sitting, and might instead attempt to log in with her temporary password via the normal sign-in page. So we return a code for that case, and let the sign-in page redirect to the confirmation page.

Exception handling is identical to the signup confirmation code, with the exception of InvalidPasswordException (since we don't change the password here).

Authorization (token validation)

You've seen that updateCredentialCookies() is called whenever authentication is successful; it takes the authentication result and stores the relevant tokens as cookies, so that they'll be provided on every request (a sketch of this helper follows the list). There are several tokens in the result; I care about two of them:

  • The access token represents a signed-in user, and will expire an hour after sign-in.
  • The refresh token allows the application to generate a new access token without forcing the user to re-authenticate. The lifetime of refresh tokens is measured in days or years (configurable, up to a maximum of 10 years).
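
Here's a minimal sketch of that helper, using the servlet Cookie API; the real version lives in the example's abstract superclass:

protected void updateCredentialCookies(HttpServletResponse response, AuthenticationResultType authResult)
{
    Cookie accessToken = new Cookie("ACCESS_TOKEN", authResult.getAccessToken());
    accessToken.setHttpOnly(true);      // not readable by JavaScript, per the security notes above
    response.addCookie(accessToken);

    // on a token refresh, getRefreshToken() may be null; a real version should guard against that
    Cookie refreshToken = new Cookie("REFRESH_TOKEN", authResult.getRefreshToken());
    refreshToken.setHttpOnly(true);
    response.addCookie(refreshToken);
}
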
These tokens aren't simply random strings; they're JSON Web Tokens, which include a base64-encoded JSON blob that describes the user:
{
 "sub": "1127b8bd-c828-4a00-92ad-40a786cac946",
 "token_use": "access",
 "scope": "aws.cognito.signin.user.admin",
 "iss": "https:\/\/cognito-idp.us-east-1.amazonaws.com\/us-east-1_rCQ6gAd1Q",
 "exp": 1482239852,
 "iat": 1482236252,
 "jti": "96732ef7-fc62-4265-843e-343a43b6caf7",
 "client_id": "5co5s8e43krcdps2lrp4fo301i",
 "username": "test0716@mailinator.com"
}

You could use the token as the sole indication of whether the user is logged in, by comparing the exp field to the current timestamp (note that exp is seconds since the epoch, while System.currentTimeMillis() is milliseconds, so multiply the former by 1000 before comparing). Each token is signed; you verify this signature using a third-party library and keys downloaded from https://cognito-idp.{region}.amazonaws.com/{userPoolId}/.well-known/jwks.json (there's also an API call to retrieve the keys, but it is not currently supported by the Java SDK).
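
As an illustration of the first approach, here's a sketch of the expiration check, with signature verification omitted; it assumes Jackson for JSON parsing, and an accessToken string already retrieved from the request:

// the payload is the second dot-separated field, URL-safe base64 without padding
String[] parts = accessToken.split("\\.");
byte[] payload = Base64.getUrlDecoder().decode(parts[1]);

// extract the "exp" claim and convert from seconds to milliseconds
JsonNode claims = new ObjectMapper().readTree(payload);
long expiresAtMillis = claims.get("exp").asLong() * 1000;

boolean expired = expiresAtMillis < System.currentTimeMillis();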

One significant limitation of authorizing the user based on the token content is that there's no way to force logout before the token expires; you never confirm the user's status with AWS. For many applications, that's fine: tokens expire after an hour, and there's rarely a need to force logout before that time.

I'm going to show a different approach to authorization: asking AWS to validate the token as part of retrieving information about the user. You'll find this code in the ValidatedAction servlet. In a normal application, it would be common code that's called by any servlet that needs validation.

try
{
    GetUserRequest authRequest = new GetUserRequest().withAccessToken(accessToken);
    GetUserResult authResponse = cognitoClient.getUser(authRequest);

    logger.debug("successful validation for {}", authResponse.getUsername());
    tokenCache.addToken(accessToken);
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
}
catch (NotAuthorizedException ex)
{
    if (ex.getErrorMessage().equals("Access Token has expired"))
    {
        attemptRefresh(refreshToken, response);
    }
    else
    {
        logger.warn("exception during validation: {}", ex.getMessage());
        reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);
    }
}
catch (TooManyRequestsException ex)
{
    logger.warn("caught TooManyRequestsException, delaying then retrying");
    ThreadUtil.sleepQuietly(250);
    doPost(request, response);
}

Before calling this code I retrieve accessToken and refreshToken from their cookies. Ignore tokenCache for now; I'll talk about it below.

The response from getUser() includes all of the user's attributes; you could use them to personalize your web page, or provide profile information. Here, all I care about is whether the request was successful; if it was, I return the “logged in” status.
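
If you do want the attributes, they come back as a list of name/value pairs; for example:

for (AttributeType attribute : authResponse.getUserAttributes())
{
    logger.debug("{} = {}", attribute.getName(), attribute.getValue());
}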

As before, we have to catch exceptions, and use the “delay and retry” technique if we get TooManyRequestsException. NotAuthorizedException is the one that we need to think about. Unfortunately, it can be thrown for a variety of reasons, ranging from an expired token to one that's completely bogus. More unfortunately, in order to tell the difference we have to look at the actual error message — not something that I like to do, but Amazon didn't provide different exception classes for the different causes.

If the access token has expired, we need to move on to the refresh operation (you'll also need to do this if you're validating tokens based on their contents and the access token has expired).

private void attemptRefresh(String refreshToken, HttpServletResponse response)
throws ServletException, IOException
{
    try
    {
        Map<String,String> authParams = new HashMap<>();
        authParams.put("REFRESH_TOKEN", refreshToken);

        AdminInitiateAuthRequest refreshRequest = new AdminInitiateAuthRequest()
                                          .withAuthFlow(AuthFlowType.REFRESH_TOKEN)
                                          .withAuthParameters(authParams)
                                          .withClientId(cognitoClientId())
                                          .withUserPoolId(cognitoPoolId());

        AdminInitiateAuthResult refreshResponse = cognitoClient.adminInitiateAuth(refreshRequest);
        if (StringUtil.isBlank(refreshResponse.getChallengeName()))
        {
            logger.debug("successfully refreshed token");
            updateCredentialCookies(response, refreshResponse.getAuthenticationResult());
            reportResult(response, Constants.ResponseMessages.LOGGED_IN);
        }
        else
        {
            logger.warn("unexpected challenge when refreshing token: {}", refreshResponse.getChallengeName());
            reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);
        }
    }
    catch (TooManyRequestsException ex)
    {
        logger.warn("caught TooManyRequestsException, delaying then retrying");
        ThreadUtil.sleepQuietly(250);
        attemptRefresh(refreshToken, response);
    }
    catch (AWSCognitoIdentityProviderException ex)
    {
        logger.debug("exception during token refresh: {}", ex.getMessage());
        reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);
    }
}

Note that refreshing a token uses the same function — adminInitiateAuth() — as signin. The difference is that here we use AuthFlowType.REFRESH_TOKEN as the type of authentication, and pass REFRESH_TOKEN as an auth parameter. As before, we have to be prepared for a challenge, although we don't expect any (in a real application, it's possible that the user could request a password change while still logged in, so there may be real challenges).

We do the usual handling of TooManyRequestsException, and consider any other exception to be an error. Assuming that the refresh succeeds, we save the new access token in the response cookies.

All well and good, but let's return to TooManyRequestsException. If we were to authenticate every user action by going to AWS then we'd be sure to hit a request limit. Validating credentials based on their content solves this problem, but I've taken a different approach: I maintain a cache of tokens, associated with expiration dates. Rather than the one hour expiration provided by AWS, I use a shorter time interval; this allows me to check for a forced logout.

You'll find the code in CredentialsCache; I'm going to skip over a detailed description here, because in a real-world application, I would probably just accept the one-hour timeout for access tokens and validate based on their contents; the intent of the example code is to show calls to AWS.
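
That said, the core idea is small enough to sketch. This is a simplification, not the actual CredentialsCache:

public class TokenCache
{
    private final ConcurrentHashMap<String,Long> tokens = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public TokenCache(long ttlMillis)
    {
        this.ttlMillis = ttlMillis;
    }

    public void addToken(String accessToken)
    {
        tokens.put(accessToken, System.currentTimeMillis() + ttlMillis);
    }

    // true if we validated this token with AWS recently enough to trust it
    public boolean isValid(String accessToken)
    {
        Long expiresAt = tokens.get(accessToken);
        return (expiresAt != null) && (expiresAt > System.currentTimeMillis());
    }
}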

Additional Features

If you've been following along at home, you now have a basic authentication system for your website, without having to manage the users yourself. However, there's plenty of room for improvement, and here are a few things to consider.

Password reset

Users forget their passwords, and will expect you to reset them so that they can log in. Cognito provides the adminResetUserPassword() function to force-reset passwords, but there are some caveats regarding its use. The first of these is that Cognito requires a verified email address to execute the function, and my demo code doesn't (at present) result in such an address. You could wait for Amazon to implement the documented functionality, or accept their validation flow (with two emails to the user), or you could simply delete and re-create the user (remembering to preserve all custom attributes).
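
For reference, the call itself is simple; a hedged sketch, reusing the helpers from earlier:

AdminResetUserPasswordRequest resetRequest = new AdminResetUserPasswordRequest()
        .withUserPoolId(cognitoPoolId())
        .withUsername(emailAddress);

// currently fails with an "unverified email" error unless email_verified is actually set
cognitoClient.adminResetUserPassword(resetRequest);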

A bigger caveat has to do with the process around password resets, and that's under your control. You don't want to simply reset a password just because someone on the Internet clicked a button. Instead, you should send the user an email that allows her to confirm the password reset, typically by using a time-limited link that triggers the reset. Please don't redirect the user to a sign-in page as a result of this link; doing so conditions your users to be vulnerable to phishing attacks. Instead, let the link reset the password (which will generate an email with a new temporary password) and tell your user to log in normally once they receive that password.

Tracking last login

I personally like it when a site shows me when and where I last logged in: if I don't remember logging in, I can change my password or start investigating. It's also useful for the site: if a person just logged in an hour ago, they shouldn't be clicking “forgot password” now.

Unfortunately, Amazon doesn't track this information by default; it just tracks the dates that an account was created and last updated. However, you can add your own custom attributes to each user, and retrieve those with the getUser() call.
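
For example, assuming that you defined a custom attribute named custom:last_login when creating the pool (the name is my invention), the sign-in servlet could update it like this:

AdminUpdateUserAttributesRequest updateRequest = new AdminUpdateUserAttributesRequest()
        .withUserPoolId(cognitoPoolId())
        .withUsername(emailAddress)
        .withUserAttributes(
                new AttributeType()
                    .withName("custom:last_login")
                    .withValue(String.valueOf(System.currentTimeMillis())));

cognitoClient.adminUpdateUserAttributes(updateRequest);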

Multi-factor authentication (MFA)

Multi-factor authentication requires a user to present credentials based on something that she knows (ie, a password) as well as something that she has. One common approach for the latter requirement is a time-based token, either via a dedicated device or an app like Google Authenticator. These devices hold a secret key that's shared with the server, and generate a unique numeric code based on the current time (typically changing every 30 seconds). As long as the user has physical possession of the device, you know that it's her logging in.

Unfortunately, this isn't how Cognito does MFA (even though it is how the AWS Console works). Instead, Cognito sends a code via SMS to the user's cellphone. This means that you must require the user's phone number as an attribute, and verify that phone number when the user signs up.

Assuming that you do this, the response from adminInitiateAuth() will be a challenge of type SMS_MFA. Cognito will send the user a text message with a secret code, and you need a page to accept the secret code and provide it in the challenge response along with the username. I haven't implemented this, but you can see the general process in the Android SDK function CognitoUser.respondToMfaChallenge().
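
Based on the documentation and that SDK source, the server-side response would look something like this sketch (untested, since I haven't implemented MFA; codeFromUser is the value the user typed in):

Map<String,String> mfaResponses = new HashMap<>();
mfaResponses.put("USERNAME", emailAddress);
mfaResponses.put("SMS_MFA_CODE", codeFromUser);

AdminRespondToAuthChallengeRequest mfaRequest = new AdminRespondToAuthChallengeRequest()
        .withChallengeName(ChallengeNameType.SMS_MFA)
        .withChallengeResponses(mfaResponses)
        .withClientId(cognitoClientId())
        .withUserPoolId(cognitoPoolId())
        .withSession(authResponse.getSession());

cognitoClient.adminRespondToAuthChallenge(mfaRequest);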

Federation with other identity providers

Jeff Atwood calls OpenID the driver's license of the Internet. If you haven't used an OpenID-enabled site, the basic premise is this: you already have credentials with a known Internet presence such as Google, so why not delegate authentication to them? In the years since Stack Overflow adopted OpenID as its primary authentication mechanism, many other sites have followed suit; for example, if I want to 3D print something at Shapeways, I log in with my Google account.

However, federated identities are a completely different product than the Cognito Identity Provider (IDP) API that I've been describing. You can't create a user in Cognito IDP and then delegate authentication to another provider.

Instead, Cognito federated identities are a way to let users establish their own identities, which takes the form of a unique identifier that is associated with their third-party login (and in this case, Cognito IDP is considered a third party). You can use this identifier as-is, or you can associate an AWS role with the identity pool. Given that you have no control over who belongs to the pool, you don't want to grant many permissions via this role — access to Cognito Sync is one valid use, as is (perhaps!) read-only access to an S3 bucket.

Wednesday, December 21, 2016

Hacking the VPC: ELB as Bastion

A common deployment structure for Amazon Virtual Private Clouds (VPCs) is to separate your servers into public and private subnets. For example, you put your webservers into the public subnet, and database servers in the private subnet. Or for more security you put all of your servers in the private subnet, with an Elastic Load Balancer (ELB) in the public subnet as the only point-of-contact with the open Internet.

The problem with this second architecture is that you have no way to get to those servers for troubleshooting: the definition of a private subnet is that it does not expose servers to the Internet.*

The standard solution involves a “bastion” host: a separate EC2 instance that runs on the public subnet and exposes a limited number of ports to the outside world. For a Linux-centric distribution, it might expose port 22 (SSH), usually restricted to a limited number of source IP addresses. In order to access a host on the private network, you first connect to the bastion host and then from there connect to the private host (although there's a neat trick with netcat that lets you connect via the bastion without an explicit login).
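
One common form of that trick uses SSH's ProxyCommand with netcat. This is a sketch, assuming the bastion is reachable as “bastion” and your private subnet is 10.0.0.0/16; with this stanza in ~/.ssh/config, ssh 10.0.1.23 tunnels through the bastion transparently:

Host 10.0.*
    ProxyCommand ssh -q bastion nc %h %p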

The problem with a bastion host — or, for Windows users, an RD Gateway — is that it costs money. Not much, to be sure: ssh forwarding doesn't require much in the way of resources, so a t2.nano instance is sufficient. But still …

It turns out that you've already got a bastion host in your public subnet: the ELB. You might think of your ELB as just a front-end for your webservers: it accepts requests and forwards them to one of a fleet of servers. If you get fancy, maybe you enable session stickiness, or do HTTPS termination at the load balancer. But what you may not realize is that an ELB can forward any TCP port.**

So, let's say that you're running some Windows servers in the private subnet. To expose them to the Internet, go into your ELB config and forward traffic from port 3389:

Of course, you don't really want to expose those servers to the Internet, you want to expose them to your office network. That's controlled by the security group that's attached to the ELB; add an inbound rule that just allows access from your home/office network (yeah, I'm not showing my real IP here):

Lastly, if you use an explicit security group to control traffic from the ELB to the servers, you'll also need to open the port on it. Personally, I like the idea of a “default” security group that allows all components of an application within the VPC to talk with each other.

You should now be able to fire up your favorite rdesktop client and connect to a server.

> xfreerdp --plugin cliprdr -u Administrator 52.15.40.131
loading plugin cliprdr
connected to 52.15.40.131:3389
Password: 
...

The big drawback, of course, is that you have no control over which server you connect to. But for many troubleshooting tasks, that doesn't matter: any server in the load balancer's list will show the same behavior. And in development, where you often have only one server, this technique lets you avoid creating special configuration that won't run in production.


* Actually, the definition of a public subnet is that it routes non-VPC traffic to an Internet Gateway, which is a precondition for exposing servers to the Internet. However, this isn't a sufficient condition: even if you have an Internet Gateway you can prevent access to a host by not giving it a public IP. But such pedantic distinctions are not really relevant to the point of this post; for practical purposes, a private subnet doesn't allow any access from the Internet to its hosts, while a public subnet might.

** I should clarify: the Classic Load Balancer can forward any port; an Application Load Balancer just handles HTTP and HTTPS, but has highly configurable routing. See the docs for more details.

Friday, November 11, 2016

Git Behind The Curtain: what happens when you commit, branch, and merge

This is the script for a talk that I'm planning to give tomorrow at BarCamp Philly. The talk is a “live committing” exercise, and this post contains all the information needed to follow along. With some links to relevant source material and minus my pauses, typos, and attempts at humor.


Object Management

I think the first thing to understand about Git is that it's not strictly a source control system; it's more like a versioned filesystem that happens to be good at source control. Traditionally, source control systems focused on the evolution of files. For example, RCS (and its successor CVS) maintain a separate file in the repository for each source file; these repository files hold the entire history of the file, as a sequence of diffs that allow the tool to reconstruct any version. Subversion applies the idea of diffs to the entire repository, allowing it to track files as they move between directories.

Git takes a different approach: rather than constructing the state of the repository via diffs, it maintains snapshots of the repository and constructs diffs from those (if you don't believe this, read on). This allows very efficient comparisons between any two points in history, but does consume more disk space. I think the key insight is not just that disk is cheap and programmer time expensive, but that real-world software projects don't have a lot of large files, and those files don't experience a lot of churn.

To see Git in action, we'll create a temporary directory, initialize it as a repository, and create a couple of files. I should note here that I'm using bash on Linux; if you're running Windows you're on your own re commands. Anything that starts with “>” is a command that I typed; anything else is the response from the system.

> mkdir /tmp/$$

> cd /tmp/$$

> git init
Initialized empty Git repository in /tmp/13914/.git/

> touch foo.txt

> mkdir bar

> touch bar/baz.txt

> git add *

> git commit -m "initial revision"
[master (root-commit) 37a649d] initial revision
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 bar/baz.txt
 create mode 100644 foo.txt

Running git log shows you this commit, identified by its SHA-1 hash.

> git log
commit 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
Author: Keith Gregory 
Date:   Thu Nov 3 09:20:29 2016 -0400

    initial revision

What you might not realize is that a commit is a physical object in the Git repository, and the SHA-1 hash is actually the hash of its contents. Git has several different types of objects, and each object is uniquely identified by the SHA-1 hash of its contents. Git stores these objects under the directory .git/objects, and the find command will help you explore this directory. Here I sort the results by timestamp and then filename, to simplify tracing the changes to the repository.

> find .git/objects -type f -ls | sort -k 10,11
13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391

OK, one commit created four objects. To understand what they are, it helps to have a picture:

Each commit represents a snapshot of the project: from the commit you can access all of the files and directories in the project as they appeared at the time of the commit. Each commit contains three pieces of information: metadata about the commit (who made it, when it happened, and the message), a list of parent commits (which is empty for the first commit but has at least one entry for every other commit), and a reference to the “tree” object holding the root of the project directory.

Tree objects are like directories in a filesystem: they contain a list of names and references to the content for each name. In the case of Git, a name may either reference another tree object (in this example, the “bar” sub-directory), or a “blob” object that holds the content of a regular file.

As I said above, an object's SHA-1 is built from the object's content. That's why we have two files in the project but only one blob: because they're both empty files, the content is identical and therefore the SHA-1 is identical.

You can use the cat-file command to look at the objects in the repository. Starting with the commit, here are the four objects from this commit (the blob, being empty, doesn't have any output from this command):

> git cat-file -p 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
tree 2d2de60b0620e7ac574fa8050997a48efa469f5d
author Keith Gregory  1478179229 -0400
committer Keith Gregory  1478179229 -0400

initial revision
> git cat-file -p 2d2de60b0620e7ac574fa8050997a48efa469f5d
040000 tree 34707b133d819e3505b31c17fe67b1c6eacda817 bar
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 foo.txt
> git cat-file -p 34707b133d819e3505b31c17fe67b1c6eacda817
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 baz.txt
> git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Now let's create some content. I'm generating random text with what I think is a neat hack: you get a stream of random bytes from /dev/urandom, then use sed to throw away anything that you don't want. Since the random bytes include a newline every 256 bytes (on average), you get a file that can be edited just like typical source code. One thing that's non-obvious: Linux by default would interpret the stream of bytes as UTF-8, meaning a lot of invalid characters from the random source; explicitly setting the LANG variable to an 8-bit encoding solves this.

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -10000 > foo.txt
> ls -l foo.txt 
-rw-rw-r-- 1 kgregory kgregory 642217 Nov  3 10:05 foo.txt
> head -4 foo.txt 
U4UIESn4HN61l6Is epQXlHSVaLpJGt
N8opIkrSt5NQsWnqYYmt9BBmEWBVaaSVjzTTFJCXHT2vay2CoDT7J rm3f7CWefdgicOdfs0tUdgx
OvqjwOykmKToWkd8nxWNtCCCkUi cxn3Bn5gN4Im38y cfS6IdXgIj9O6gBEGgBW6BcZJ 
2BluWmwQYgyNFIHP8RUL8m2aAjM1FwcY8ZX9fvmvJi30p9sBEkq6giuoRvJSWRW8PLCsrEWfSXeZXxO2HK2IS3MFNpviKRagug3HE96I

When we commit this change, our object directory gets three new entries:

13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e

In the diagram, you can see what happened: adding content to the file created a new blob. Since this had a different SHA-1 than the original file content, it meant that we got a new tree to reference it. And of course, we have a new commit that references that tree. Since baz.txt wasn't changed, it continues to point to the original blob; in turn that means that the directory bar hasn't changed, so it can be represented by the same tree object.

One interesting thing is to compare the size of the original file, 642,217 bytes, with the 496,994 bytes stored in the repository. Git compresses all of its objects (so you can't just cat the files). I generated random data for foo.txt, which is normally incompressible, but limiting it to alphanumeric characters means that each character fits in 6 bits rather than 8; compressing the file therefore saves roughly 25% of its space.

I'm going to diverge from the diagram for the next example, because I think it's important to understand that Git stores full objects rather than diffs. So, we'll add a few new lines to the file:

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

When we look at the objects, we see that there is a new blob that is slightly larger than the old one, and that the commit object (c620754c) is nowhere near large enough to hold the change. Clearly, the commit does not encapsulate a diff.

13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e
13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9

Before moving on, I want to call out Git's two-level directory structure. Filesystem directories are typically a linear list of files, and searching for a specific filename becomes a significant cost once you have more than a few hundred files. Even a small repository, however, may have 10,000 or more objects. A two-level filesystem is the first step to solving this problem: the top level consists of sub-directories with two-character names, representing the first byte of the object's SHA-1 hash. Each sub-directory only holds those objects whose hashes start with that byte, thereby partitioning the total search space.

Large repositories would still be expensive to search, so Git also uses “pack” files, stored in .git/objects/pack; each pack file contains some large number of commits, indexed for efficient access. You can trigger this compression using git gc, although that only affects your local repository. Pack files are also used when retrieving objects from a remote repository, so your initial clone gives you a pre-packed object directory.

Branches

OK, you've seen how Git stores objects, what happens when you create a branch?

> git checkout -b my-branch
Switched to a new branch 'my-branch'

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

> git commit -m "add some content in a branch" foo.txt 
[my-branch 1791626] add some content in a branch
 1 file changed, 100 insertions(+)

If you look in .git/objects, you'll see another three objects, and at this point I assume that you know what they are.

13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9
13510395    4 -r--r--r--   1 kgregory kgregory      175 Nov  3 11:50 .git/objects/17/91626ece8dce3c713261e470a60049553d5411
13377993  496 -r--r--r--   1 kgregory kgregory   506883 Nov  3 11:50 .git/objects/57/c091e77d909f2f20e985150ea922c3ec303ab6
13377996    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 11:50 .git/objects/f2/c824ab04c8f902ab195901e922ab802d8bc37b

So how does Git know that this commit belongs to a branch? The answer is that Git stores that information elsewhere:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 10:18 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 
c620754c351f5462bf149a93bdf0d3d51b7d91a9

> cat .git/refs/heads/my-branch
1791626ece8dce3c713261e470a60049553d5411

That's it: text files that hold the SHA-1 of a commit. There is one additional file, .git/HEAD, which says which of those text files represents the “working” branch:

> cat .git/HEAD 
ref: refs/heads/my-branch

This isn't quite the entire story. For one thing, it omits tags, stored in .git/refs/tags. Or the “detached HEAD” state, when .git/HEAD holds a commit rather than a ref. And most important, remote branches and their relationship to local branches. But this post (and the talk) are long enough as it is.

Merges

You've been making changes on a development branch, and now it's time to merge those changes into master (or an integration branch). It's useful to know what happens when you type git merge.

Fast-Forward Merges

The simplest type of merge — indeed, you could argue that it's not really a merge at all, because it doesn't create new commits — is a fast-forward merge. This is the sort of merge that we'd get if we merged our example project branch into master.

> git checkout master
Switched to branch 'master'

> git merge my-branch
Updating c620754..1791626
Fast-forward
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

As I said, it's not really a merge at all; if you look in .git/objects you'll see the same list of files that were there after the last commit. What has changed are the refs:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 12:15 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 
1791626ece8dce3c713261e470a60049553d5411

> cat .git/refs/heads/my-branch
1791626ece8dce3c713261e470a60049553d5411

Here's a diagram of what happened, showing the repository state pre- and post-merge:

Topologically, the “branch” represents a straight line, extending the last commit on master. Therefore, “merging” the branch is as simple as re-pointing the master reference to that commit.

Squashed Merges

I'll start this section with a rant: one of the things that I hate, when looking at the history of a project, is to see a series of commits like this: “added foo”; “added unit test for foo”; “added another testcase for foo”; “fixed foo to cover new testcase”… Really. I don't care what you did to make “foo” work, I just care that you did it. And if your commits are interspersed with those of the person working on “bar” the evolution of the project becomes almost impossible to follow (I'll return to this in the next section).

Fortunately, this can be resolved easily with a squash merge:

> git merge --squash my-branch
Updating c620754..1791626
Fast-forward
Squash commit -- not updating HEAD
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

The “not updating HEAD” part of this message is important: rather than commit the changes, a squash merge leaves them staged and lets you commit (with a summary message) once you've verified the changes. The diagram looks like this:

The thing to understand about squashed commits is that they're not actually merges: there's no connection between the branch and master. As a result, “git branch -d my-branch” will fail, warning you that it's an unmerged branch; you need to force deletion by replacing -d with -D.

To wrap up this section: I don't think you need to squash all merges, just the ones that merge a feature onto master or an integration branch. Use normal merges when pulling an integration branch onto master, or when back-porting changes from the integration branch to the development branch (I do, however, recommend squashing back-ports from master to integration).

“Normal” Merges, with or without conflicts

To understand what I dislike about “normal” merges, we need to do one. For this we'll create a completely new repository, one where we'll compile our favorite quotes from Lewis Carroll. We start by creating the file carroll.txt in master; for this section I'll just show changes to the file, not the actual commits.

Jabberwocky

Walrus and Carpenter

In order to work independently, one person will make the changes to Jabberwocky on a branch named “jabber”:

Jabberwocky

Twas brillig, and the slithy toves
Did gire and gimbal in the wabe
All mimsy were the borogroves
And the mome raths outgrabe

Walrus and Carpenter

After that's checked in, someone else modifies the file on master:

Jabberwocky

Walrus and Carpenter

The time has come, the Walrus said,
To talk of many things:
Of shoes, and ships, and sealing wax
Of cabbages, and kings
And why the sea is boiling hot
And whether pigs have wings

Back on branch jabber, someone has found the actual text and corrected mistakes:

Jabberwocky

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
All mimsy were the borogoves
And the mome raths outgrabe

Walrus and Carpenter

At this point we have two commits on jabber, and one on master after the branch was made. If you look at the commit log for jabber you see this:

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

And on master, this is the commit log:

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

Now the question is: what happens when you merge jabber onto master? In my experience, most people think that the branch commits are added into master based on when they occurred. This is certainly reinforced by the post-merge commit log:

commit 8ba1b90996b4fb67529a2964836f8524a220f2d8
Merge: 5269b07 2e68e9b
Author: Keith Gregory 
Date:   Thu Nov 3 13:03:09 2016 -0400

    Merge branch 'jabber'

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

But take a closer look at those commit hashes, and compare them to the hashes from the separate branches. They're the same, which means that the commits are unchanged. In fact, git log walks the parent references of both branches, making it appear that commits are on one branch when they aren't. The diagram actually looks like this (yes, I know, there are two commits too many):

The merge created a new commit, which has two parent references. It also created a new blob to hold the combined file, along with a modified tree. If we run cat-file on the merge commit, this is what we get:

tree 589187777a672868e50aefc491ed640125d1e3ed
parent 5269b0742e49a3b201ddc0255c50c37b1318aa9c
parent 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
author Keith Gregory  1478192589 -0400
committer Keith Gregory  1478192589 -0400

Merge branch 'jabber'

So, what does it mean that git log produces the illusion of a series of merged commits? Consider what happens when you check out one of the commits in the list, for example 2e68e9b9.

This was a commit that was made on the branch. If you check out that commit and look at the commit log from that point, you'll see that commit 5269b074 no longer appears. It was made on master, in a completely different chain of commits.

In a complex series of merges (say, multiple development branches onto an integration branch, and several integration branches onto a feature branch) you can completely lose track of where and why a change was made. If you try to diff your way through the commit history, you'll find that the code changes dramatically between commits, and appears to flip-flop; you're simply seeing the code state on different branches.

Wrapping up: how safe is SHA-1?

“Everybody knows” that SHA-1 is a “broken” hash, so why is it the basis for storing objects in Git?

The answer to that question has two parts. The first is that SHA-1 is “broken” in terms of an attacker being able to create a false message that has the same SHA-1 hash as a real message: it takes fewer than the expected number of attempts (although still a lot!). This is a problem if the message that you're hashing is involved in validating a server certificate. It's not a problem in the case of Git, because at worst the attacker would be able to replace a single object — you might lose one file within a commit, or need to manually rebuild a directory.

That's an active attacker, but what about accidental collisions: let's say that you have a commit with a particular hash, and it just so happens that you create a blob with the same hash. It could happen, although the chances are vanishingly small. And if it does happen, Git ignores the later file.

So don't worry, be happy, and remember to squash features from development branches.