Wednesday, December 28, 2016

Server-side Authentication with Amazon Cognito IDP

This post was written at the end of 2016. Today, rather than interacting with Cognito directly, I would use the Hosted UI with an Application Load Balancer.

When I first wrote this post, I opened with the caveat that I would not use Cognito in a production application, along with the hope that Amazon would invest in improving it. My comments made their way to the Cognito product manager at Amazon, and we spent an hour on the phone discussing my concerns. I came away with the belief that Amazon is working to improve the product and its documentation, including server-side code. And as I was looking at the documentation in preparation for that call, I saw that it has been improved since I started on this project in October.

He also pointed out some areas where Amazon recommends a different approach than I chose. I have updated this post to either incorporate his suggestions, or note them as sidebars where I prefer my original approach.

So where does that leave me?

Well, as I said in the original post, Cognito offers a compelling feature set. And in general, user management is a distraction from actual application development. So if I can offload that task, I will. But I think that there are still some very rough edges to Cognito; if you choose it for your application, be prepared to jump through some hoops.

Overview: What to expect in the rest of this post

In this post I build a simple authentication framework for a web application. It has three functions: signing up for a new account, signing in to an existing account, and verifying that a user has signed in. The example is built using Java servlets; I intentionally avoided frameworks such as Spring in order to focus on behavior. Similarly, the browser side uses simple HTML pages, with JQuery to send POST requests to the server.

All servlets return 200 for every request unless there's an uncaught exception at the server. The response body is a string that indicates the result of the operation. Depending on the result, the client-side code will either show an alert (for bad inputs) or move to the next step in the flow.

On successful sign-in the servlet stores two cookies, ACCESS_TOKEN and REFRESH_TOKEN, which are used to authorize subsequent requests. These cookies are marked “httpOnly” in order to prevent cross-site scripting attacks. The example code does not make any attempt to prevent cross-site request forgery attacks, as such prevention generally relies data passed as page content.

Also on the topic of security: all communication is sent in clear-text, on the assumption that real-world application will use HTTPS to secure all communications. Cognito provides a client-side library that exchanges secrets in a secure manner, but I'm not using it (because this is intended as a server-side example).

For those that want to follow along at home, the source code is here.


I strongly believe in using an email address as the primary account identifier; I get annoyed every time I'm told that “kdgregory” is already in use and that I must guess at a username that's not in use. Email addresses are a unique identifier that won't change (well, usually: my Starwood Preferred Guest account seems to be irrevocably tied to my address at a former employer, even though I attempt to change it every time I check in).

Cognito also has strong opinions about email addresses: they're secondary to the actual username. It does support the ability to validate an email address and use it in place of the username, but the validation process requires the actual username. While we could generate random usernames on the server and store them in a cookie during the signup process, that's more effort than I want to expend.

Fortunately, the rules governing legal usernames allow the use of email addresses. And Cognito allows you to generate an initial password and send it via email, which prevents hijacking of an account by a user that doesn't own that address. Cognito doesn't consider this email to provide validation, which leads to some pain; I'll talk more about that later on.

Creating a User Pool

It's easy to create a user pool, but there are a few gotchas. The following points are ordered by the steps in the current documentation. At the time of writing, you can't use CloudFormation to create pools or clients, so for the example code I provide a shell script that creates a pool and client that match my needs.

  • Require the email attribute but do not mark it as an alias

    Cognito allows users to have alias identifiers that work in place of their username. However, as I mentioned above, these aliases only work if they pass Cognito's validation process. Since we'll be using the email address as the primary identifier, there's no need to mark it as an alias. But we do want to send email to the user so must save it as the email attribute in addition to the username.

  • Do not enable automatic verification of email addresses

    If you enable this feature, Cognito sends your users an email with a random number, and you'll have to provide a page/servlet where they enter this number. Note, however, that by skipping this feature you currently lose the ability to send the user a reset-password email.

  • Do not create a client secret

    When you create a client, you have the option for Cognito to create a secret hash in addition to the client ID. The documentation does not describe how to pass this hash in a request, and Cognito will throw an exception if you don't. Moreover, the JavaScript SDK doesn't support client secrets; they're only used by the Android/iOS SDKs.

Creating a User (sign-up)

Note: Amazon considers adminCreateUser() to be intended for systems where administrators add users, while signUp() is the preferred function for user-initiated signups. The flow for the two functions is quite different: the signUP() flow lets the user pick her own password, and sends a verification email. Personally, I prefer the ability to generate temporary passwords with adminCreateUser().

The sign-up servlet calls the adminCreateUser() function.

    AdminCreateUserRequest cognitoRequest = new AdminCreateUserRequest()
                    new AttributeType()
                    new AttributeType()

    reportResult(response, Constants.ResponseMessages.USER_CREATED);
catch (UsernameExistsException ex)
    logger.debug("user already exists: {}", emailAddress);
    reportResult(response, Constants.ResponseMessages.USER_ALREADY_EXISTS);
catch (TooManyRequestsException ex)
    logger.warn("caught TooManyRequestsException, delaying then retrying");
    doPost(request, response);

There are a couple of variables and functions that are used by this snippet but set elsewhere:

  • The cognitoClient variable is defined in an abstract superclass; it holds an instance of AWSCognitoIdentityProviderClient. Like other AWS “client” classes, this class is threadsafe and, I assume, holds a persistent connection to the service. As such, you're encouraged to create a single client and reuse it.
  • The emailAddress variable is populated from a request parameter. I don't want to clutter my examples with boilerplate code to retrieve parameters, so you can assume in future snippets that any variable not explicitly described comes from a parameter.
  • I have two functions in the abstract superclass that retrieve configuration values from the servlet context (which is loaded from the web.xml file). Here I call cognitoPoolId(), which returns the AWS-assigned pool ID. Later you'll see cognitoClientId().

Moving on to the the code itself: most Cognito client functions take a request object and return a response object. The request objects are constructed using the Builder pattern: each “with” function adds something to the request and returns the updated object so that calls can be chained.

Most of the information I provide in the request has to do with the email address. It's the username, as I said above. But I also have to explicitly store it in the email attribute, or Cognito won't be able to send mail to the user. And we want that, so that Cognito will generate and send a temporary password.

I'm also setting the email_verified attribute. This attribute is normally set by Cognito itself when the user performs the verification flow, and it's required to send follow-up emails such as a password reset message. Unfortunately, at this point Cognito doesn't consider the temporary password email to be a suitable verification of the address. So instead I forcibly mark the address as verified when creating the account. If the address doesn't actually exist, the user won't receive her temporary password, and therefore won't be able to sign in.

Exception handling is the other big part of this snippet. As I said before, Cognito uses a mix of exceptions and return codes; the latter signal a “normal” flow, while exceptions are for things that break the flow — even if they're a normal part of user authentication. If you look at the documentation for adminCreateUser() you'll see an almost overwhelming number of possible exceptions. However, many of these are irrelevant to the way that I'm creating users. For example, since I let Cognito generate temporary passwords, there's no need to handle an InvalidPasswordException.

For this example, the only important “creation” exception is thrown when there's already a user with that email address. My example responds to this by showing an alert on the client, but a real application should initiate a “forgotten password” flow.

The second exception that I'm handling, TooManyRequestsException, can be thrown by any operation; you always need to handle it. The documentation isn't clear on the purpose of this exception, but I'm assuming that it's thrown when you exceed the AWS rate limit for requests (versus a security measure specific to repeated signup attempts). My example uses a rather naive solution: retry the operation after sleeping for 250 milliseconds. If you're under heavy load, this could delay the response to the user for several seconds, which could cause them to think your site is down; you might prefer to simply tell them to try again later.

A successful call to adminCreateUser() is only the first part of signing up a new user. The second step is for the user to log in using the temporary password that Cognito sends to their email address. My example page responds to the “user created” response with a client-side redirect to a confirmation page, which has fields for the email address, temporary password, and permanent password. These values are then submitted to the confirmation servlet.

As far as Cognito is concerned, there's no code-level difference between signing in with the temporary password and the final password: you call the adminInitiateAuth() method. The difference is in the response: when you sign in with the temporary password you'll be challenged to provide the final password.

This ends up being a fairly large chunk of code; I've split it into two pieces. The first chunk is straightforward; it handles the initial authentication attempt.

Map initialParams = new HashMap();
initialParams.put("USERNAME", emailAddress);
initialParams.put("PASSWORD", tempPassword);

AdminInitiateAuthRequest initialRequest = new AdminInitiateAuthRequest()

AdminInitiateAuthResult initialResponse = cognitoClient.adminInitiateAuth(initialRequest);

The key thing to here is AuthFlowType.ADMIN_NO_SRP_AUTH. Cognito supports several authentication flows; later we'll use the same function to refresh the access token. Client SDKs use the Secure Remote Password (SRP) flow; on the server, where we can secure the credentials, we use the ADMIN_NO_SRP_AUTH flow.

As with the previous operation, we need the pool ID. We also need the client ID; you can create multiple clients per pool, and track which user uses which client (although it's still a single pool of users). You must create at least one client (known in the console as an app). As I noted earlier, both IDs are configured in web.xml and retrieved via functions defined in the abstract superclass.

As I said, the difference between initial signup and normal signup is in the response. In the case of a normal sign-in, which we'll see later, the response contains credentials. In the case of an initial signin, it contains a challenge:

if (!
    throw new RuntimeException("unexpected challenge: " + initialResponse.getChallengeName());

Here I expect the “new password required” challenge, and am not prepared for anything else (since the user should only arrive here after a password change). In a real-world application I'd use a nicer error response rather than throwing an exception.

We respond to this challenge with adminRespondToAuthChallenge(), providing the temporary and final passwords. One thing to note is withSession(): Cognito needs to link the challenge response with the challenge request, and this is how it does that.

Map challengeResponses = new HashMap();
challengeResponses.put("USERNAME", emailAddress);
challengeResponses.put("PASSWORD", tempPassword);
challengeResponses.put("NEW_PASSWORD", finalPassword);

AdminRespondToAuthChallengeRequest finalRequest = new AdminRespondToAuthChallengeRequest()

AdminRespondToAuthChallengeResult challengeResponse = cognitoClient.adminRespondToAuthChallenge(finalRequest);
if (StringUtil.isBlank(challengeResponse.getChallengeName()))
    updateCredentialCookies(response, challengeResponse.getAuthenticationResult());
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
    throw new RuntimeException("unexpected challenge: " + challengeResponse.getChallengeName());

Assuming that the provided password was acceptable, and there were no other errors (see below), then we should get a response that (1) has a blank challenge, and (2) has valid credentials. I don't handle the case where we get a new challenge (which in a real-world app might be for multi-factor authentication). I store the returned credentials as cookies (another method in the abstract superclass), and return a message indicating the the user is logged in.

Now for the “other errors.” Unlike the initial signup servlet, there are a bunch of exceptions that might apply to this operation. TooManyRequestsException, of course, is possible for any call, but here are the ones specific to my flow:

  • InvalidPasswordException if you've set rules for passwords and the user's permanent password doesn't satisfy those rules. Cognito lets you require a combination of uppercase letters, lowercase letters, numbers, and special characters, along with a minimum length.
  • UserNotFoundException if the user enters bogus email address. This could be an honest accident, or it could be a fishing expedition. A security-conscious site should attempt to discourage such attacks; one simple approach is to delay the response after every failed request (but note that could lead to a denial-of-service attack against your site!).
  • NotAuthorizedException if the user provides an incorrect temporary password. Again, this could be an honest mistake or an attack; do not give the caller any indication that they have a valid user but invalid password (I return the same “no such user” message for this exception and the previous one).

Before wrapping up this section, I want to point out that there are cases where Cognito throws an exception but the user has been created. There's no good way to recover from this, other than to provide the user with a “lost password” flow.

Authentication (sign-in)

You've already seen the sign-in code, as part of sign-up confirmation. Here I want to focus on the response handling.

Map authParams = new HashMap();
authParams.put("USERNAME", emailAddress);
authParams.put("PASSWORD", password);

AdminInitiateAuthRequest authRequest = new AdminInitiateAuthRequest()

AdminInitiateAuthResult authResponse = cognitoClient.adminInitiateAuth(authRequest);
if (StringUtil.isBlank(authResponse.getChallengeName()))
    updateCredentialCookies(response, authResponse.getAuthenticationResult());
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
else if (
    logger.debug("{} attempted to sign in with temporary password", emailAddress);
    reportResult(response, Constants.ResponseMessages.FORCE_PASSWORD_CHANGE);
    throw new RuntimeException("unexpected challenge on signin: " + authResponse.getChallengeName());

With my example pool configuration, once the user has completed signup there shouldn't be any additional challenges. However, we have to handle the “new password required” flow, because the user might not complete signup in one sitting. If she instead attempts to login with her temporary password via the normal sign-in page. So we return a code for that case, and let the sign-in page redirect to the confirmation page.

Exception handling is identical to the signup confirmation code, with the exception of InvalidPasswordException (since we don't change the password here).

Authorization (token validation)

You've seen that updateCredentialCookies() called whenever authentication is successful; it takes the authentication result and stores the relevant tokens as cookies (so that they'll be provided on every request). There are several tokens in the result; I care about two of them:

  • The access token represents a signed-in user, and will expire an hour after sign-in.
  • The refresh token allows the application to generate a new access token without forcing the user to re-authenticate. The lifetime of refresh tokens is measured in days or years (by default, 30 days).
These tokens aren't simply random strings; they're JSON Web Tokens, which include a base64-encoded JSON blob that describes the user:
 "sub": "1127b8bd-c828-4a00-92ad-40a786cac946",
 "token_use": "access",
 "scope": "aws.cognito.signin.user.admin",
 "iss": "https:\/\/\/us-east-1_rCQ6gAd1Q",
 "exp": 1482239852,
 "iat": 1482236252,
 "jti": "96732ef7-fc62-4265-843e-343a43b6caf7",
 "client_id": "5co5s8e43krcdps2lrp4fo301i",
 "username": ""

You could use the token as the sole indication of whether the user is logged in, by comparing the exp field to the current timestamp (note that exp is seconds since the epoch, while System.currentTimeMillis() is milliseconds, so multiply the former by 1000 before comparing). Each token is signed; you verify this signature using a third-party library and keys downloaded from https://cognito-idp.{region}{userPoolId}/.well-known/jwks.json.

However, there are a couple of significant limitations to using the token as sole source of authorization, both caused by the fixed one hour expiration. The first is that there's no way to force logout before the token expires. Cognito provides a function to invalidate tokens, adminUserGlobalSignOut(), but it's only relevant if you request token validation from AWS. The second is that one hour is excessively short for some purposes, and you'll be forced to refresh the token.

I'm going to show a different approach to authorization: asking AWS to validate the token as part of retrieving information about the user. You'll find this code in the ValidatedAction servlet. In a normal application, it would be common code that's called by any servlet that needs validation.

    GetUserRequest authRequest = new GetUserRequest().withAccessToken(accessToken);
    GetUserResult authResponse = cognitoClient.getUser(authRequest);

    logger.debug("successful validation for {}", authResponse.getUsername());
    reportResult(response, Constants.ResponseMessages.LOGGED_IN);
catch (NotAuthorizedException ex)
    if (ex.getErrorMessage().equals("Access Token has expired"))
        attemptRefresh(refreshToken, response);
        logger.warn("exception during validation: {}", ex.getMessage());
        reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);
catch (TooManyRequestsException ex)
    logger.warn("caught TooManyRequestsException, delaying then retrying");
    doPost(request, response);

Before calling this code I retrieve accessToken and refreshToken from their cookies. Ignore tokenCache for now; I'll talk about it below.

The response from getUser() includes all of the user's attributes; you could use them to personalize your web page, or provide profile information. Here, all I care about is whether the request was successful; if it was, I return the “logged in” status.

As before, we have to catch exceptions, and use the “delay and retry” technique if we get TooManyRequestsException. NotAuthorizedException is the one that we need to think about. Unfortunately, it can be thrown for a variety of reasons, ranging from an expired token to one that's completely bogus. More unfortunately, in order to tell the difference we have to look at the actual error message — not something that I like to do, but Amazon didn't provide different exception classes for the different causes.

If the access token has expired, we need to move on to the refresh operation (you'll also need to do this if you're validating tokens based on their contents and the access token has expired).

private void attemptRefresh(String refreshToken, HttpServletResponse response)
throws ServletException, IOException
        Map authParams = new HashMap();
        authParams.put("REFRESH_TOKEN", refreshToken);

        AdminInitiateAuthRequest refreshRequest = new AdminInitiateAuthRequest()

        AdminInitiateAuthResult refreshResponse = cognitoClient.adminInitiateAuth(refreshRequest);
        if (StringUtil.isBlank(refreshResponse.getChallengeName()))
            logger.debug("successfully refreshed token");
            updateCredentialCookies(response, refreshResponse.getAuthenticationResult());
            reportResult(response, Constants.ResponseMessages.LOGGED_IN);
            logger.warn("unexpected challenge when refreshing token: {}", refreshResponse.getChallengeName());
            reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);
    catch (TooManyRequestsException ex)
        logger.warn("caught TooManyRequestsException, delaying then retrying");
        attemptRefresh(refreshToken, response);
    catch (AWSCognitoIdentityProviderException ex)
        logger.debug("exception during token refresh: {}", ex.getMessage());
        reportResult(response, Constants.ResponseMessages.NOT_LOGGED_IN);

Note that refreshing a token uses the same function — adminInitiateAuth() — as signin. The difference is that here we use AuthFlowType.REFRESH_TOKEN as the type of authentication, and pass REFRESH_TOKEN as an auth parameter. As before, we have to be prepared for a challenge, although we don't expect any (in a real application, it's possible that the user could request a password change while still logged in, so there may be real challenges).

We do the usual handling of TooManyRequestsException, and consider any other exception to be an error. Assuming that the refresh succeeds, we save the new access token in the response cookies.

All well and good, but let's return to TooManyRequestsException. If we were to authenticate every user action by going to going to AWS then we'd be sure to hit a request limit. Validating credentials based on their content solves this problem, but I've taken a different approach: I maintain a cache of tokens, associated with expiration dates. Rather than the one hour expiration provided by AWS, I use a shorter time interval; this allows me to check for a forced logout.

You'll find the code in CredentialsCache; I'm going to skip over a detailed description here, because in a real-world application, I would probably just accept the one-hour timeout for access tokens and validate based on their contents; the intent of the example code is to show calls to AWS.

Additional Features

If you've been following along at home, you now have a basic authentication system for your website, without having to manage the users yourself. However, there's plenty of room for improvement, and here are a few things to consider.

Password reset

Users forget their passwords, and will expect you to reset it so that they can log in. Cognito provides the adminResetUserPassword() function to force-reset passwords, as well as forgotPassword(). The former sends the user a new temporary password, and subsequent attempts to login will receive a challenge. The latter sends an email with a confirmation code, but (at least at this time) does not prevent the user from logging in with the original credentials.

The bigger concern is not which function you use, but the entire process around password resets. You don't want to simply reset a password just because someone on the Internet clicked a button. Instead, you should send the user an email that allows her to confirm the password reset, typically by using a time-limited link that triggers the reset. Please don't redirect the user to a sign-in page as a result of this link; doing so conditions your users to be vulnerable to phishing attacks. Instead, let the link reset the password (which will generate an email with a new temporary password) and tell your user to log in normally once they receive that link.

Multi-factor authentication (MFA)

Multi-factor authentication requires a user to present credentials based on something that she knows (ie, a password) as well as something that she has. One common approach for the latter requirement is a time-based token, either via a dedicated device or an app like Google Authenticator. These devices hold a secret key that's shared with the server, and generate a unique numeric code based on the current time (typically changing every 30 seconds). As long as the user has physical possession of the device, you know that it's her logging in.

Unfortunately, this isn't how Cognito does MFA (even though it is how the AWS Console works). Instead, Cognito sends a code via SMS to the user's cellphone. This means that you must require the user's phone number as an attribute, and verify that phone number when the user signs up.

Assuming that you do this, the response from adminInitiateAuth() will be a challenge of type SMS_MFA. Cognito will send the user a text message with a secret code, and you need a page to accept the secret code and provide it in the challenge response along with the username. I haven't implemented this, but you can see the general process in the Android SDK function CognitoUser.respondToMfaChallenge().

Federation with other identify providers

Jeff Atwood calls OpenID the driver's license of the Internet. If you haven't used an OpenID-enabled site, the basic premise is this: you already have credentials with a known Internet presense such as Google, so why not delegate authentication to them? In the years since Stack Overflow adopted OpenID as its primary authentication mechanism, many other sites have followed suit; for example, if I want to 3D print something at Shapeways, I log in with my Google account.

However, federated identities are a completely different product than the Cognito Identity Provider (IDP) API that I've been describing. You can't create a user in Cognito IDP and then delegate authentication to another provider.

Instead, Cognito federated identities are a way to let users establish their own identities, which takes the form of a unique identifier that is associated with their third-party login (and in this case, Cognito IDP is considered a third party). You can use this identifier as-is, or you can associate an AWS role with the identity pool. Given that you have no control over who belongs to the pool, you don't want to grant many permissions via this role — access to Cognito Sync is one valid use, as is (perhaps!) read-only access to an S3 bucket.

Wednesday, December 21, 2016

Hacking the VPC: ELB as Bastion

A common deployment structure for Amazon Virtual Private Clouds (VPCs) is to separate your servers into public and private subnets. For example, you put your webservers into the public subnet, and database servers in the private subnet. Or for more security you put all of your servers in the private subnet, with an Elastic Load Balancer (ELB) in the public subnet as the only point-of-contact with the open Internet.

The problem with this second architecture is that you have no way to get to those servers for troubleshooting: the definition of a private subnet is that it does not expose servers to the Internet.*

The standard solution involves a “bastion” host: a separate EC2 instance that runs on the public subnet and exposes a limited number of ports to the outside world. For a Linux-centric distribution, it might expose port 22 (SSH), usually restricted to a limited number of source IP addresses. In order to access a host on the private network, you first connect to the bastion host and then from there connect to the private host (although there's a neat trick with netcat that lets you connect via the bastion without an explicit login).

The problem with a bastion host — or, for Windows users, an RD Gateway — is that it costs money. Not much, to be sure: ssh forwarding doesn't require much in the way of resources, so a t2.nano instance is sufficient. But still …

It turns out that you've already got a bastion host in your public subnet: the ELB. You might think of your ELB as just front-end for your webservers: it accepts requests and forwards them to one of a fleet of servers. If you get fancy, maybe you enable session stickiness, or do HTTPS termination at the load balancer. But what you may not realize is that an ELB can forward any TCP port.**

So, let's say that you're running some Windows servers in the private subnet. To expose them to the Internet, go into your ELB config and forward traffic from port 3389:

Of course, you don't really want to expose those servers to the Internet, you want to expose them to your office network. That's controlled by the security group that's attached to the ELB; add an inbound rule that just allows access from your home/office network (yeah, I'm not showing my real IP here):

Lastly, if you use an explicit security group to control traffic from the ELB to the servers, you'll also need to open the port on it. Personally, I like the idea of a “default” security group that allows all components of an application within the VPC to talk with each other.

You should now be able to fire up your favorite rdesktop client and connect to a server.

> xfreerdp --plugin cliprdr -u Administrator
loading plugin cliprdr
connected to

The big drawback, of course, is that you have to control over which server you connect to. But for many troubleshooting tasks, that doesn't matter: any server in the load balancer's list will show the same behavior. And in development, where you often have only one server, this technique lets you avoid creating special configuration that won't run in production.

* Actually, the definition of a public subnet is that it routes non-VPC traffic to an Internet Gateway, which is a precondition for exposing servers to the Internet. However, this isn't a sufficient condition: even if you have an Internet Gateway you can prevent access to a host by not giving it a public IP. But such pedantic distinctions are not really relevant to the point of this post; for practical purposes, a private subnet doesn't allow any access from the Internet to its hosts, while a public subnet might.

** I should clarify: the Classic Load Balancer can forward any port; an Application load balancer just handles HTTP and HTTPS, but has highly configurable routing. See the docs for more details.

Friday, November 11, 2016

Git Behind The Curtain: what happens when you commit, branch, and merge

This is the script for a talk that I'm planning to give tomorrow at BarCamp Philly. The talk is a “live committing” exercise, and this post contains all the information needed to follow along. With some links to relevant source material and minus my pauses, typos, and attempts at humor.

Object Management

I think the first thing to understand about Git is that it's not strictly a source control system; it's more like a versioned filesystem that happens to be good at source control. Traditionally, source control systems focused on the evolution of files. For example, RCS (and its successor CVS) maintain a separate file in the repository for each source file; these repository files hold the entire history of the file, as a sequence of diffs that allow the tool to reconstruct any version. Subversion applies the idea of diffs to the entire repository, allowing it to track files as they move between directories.

Git takes a different approach: rather than constructing the state of the repository via diffs, it maintains snapshots of the repository and constructs diffs from those (if you don't believe this, read on). This allows very efficient comparisons between any two points in history, but does consume more disk space. I think the key insight is not just that disk is cheap and programmer time expensive, but that real-world software projects don't have a lot of large files, and those files don't experience a lot of churn.

To see Git in action, we'll create a temporary directory, initialize it as a repository, and create a couple of files. I should note here that I'm using bash on Linux; if you're running Windows you're on your own re commands. Anything that starts with “>” is a command that I typed; anything else is the response from the system.

> mkdir /tmp/$$

> cd /tmp/$$

> git init
Initialized empty Git repository in /tmp/13914/.git/

> touch foo.txt

> mkdir bar

> touch bar/baz.txt

> git add *

> git commit -m "initial revision"
[master (root-commit) 37a649d] initial revision
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 bar/baz.txt
 create mode 100644 foo.txt

Running git log shows you this commit, identified by its SHA-1 hash.

> git log
commit 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
Author: Keith Gregory 
Date:   Thu Nov 3 09:20:29 2016 -0400

    initial revision

What you might not realize is that a commit is a physical object in the Git repository, and the SHA-1 hash is actually the hash of its contents. Git has several different types of objects, and each object is uniquely identified by the SHA-1 hash of its contents. Git stores these objects under the directory .git/objects, and the find command will help you explore this directory. Here I sort the results by timestamp and then filename, to simplify tracing the changes to the repository.

> find .git/objects -type f -ls | sort -k 10,11
13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391

OK, one commit created four objects. To understand what they are, it helps to have a picture:

Each commit represents a snapshot of the project: from the commit you can access all of the files and directories in the project as they appeared at the time of the commit. Each commit contains three pieces of information: metadata about the commit (who made it, when it happened, and the message), a list of parent commits (which is empty for the first commit but has at least one entry for every other commit), and a reference to the “tree” object holding the root of the project directory.

Tree objects are like directories in a filesystem: they contain a list of names and references to the content for each name. In the case of Git, a name may either reference another tree object (in this example, the “bar” sub-directory), or a “blob” object that holds the content of a regular file.

As I said above, an object's SHA-1 is built from the object's content. That's why we have two files in the project but only one blob: because they're both empty files, the content is identical and therefore the SHA-1 is identical.

You can use the cat-file command to look at the objects in the repository. Starting with the commit, here are the four objects from this commit (the blob, being empty, doesn't have any output from this command):

> git cat-file -p 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
tree 2d2de60b0620e7ac574fa8050997a48efa469f5d
author Keith Gregory  1478179229 -0400
committer Keith Gregory  1478179229 -0400

initial revision
> git cat-file -p 2d2de60b0620e7ac574fa8050997a48efa469f5d
040000 tree 34707b133d819e3505b31c17fe67b1c6eacda817 bar
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 foo.txt
> git cat-file -p 34707b133d819e3505b31c17fe67b1c6eacda817
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 baz.txt
> git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Now let's create some content. I'm generating random text with that I think is a neat hack: you get a stream of random bytes from /dev/urandom, then use sed to throw away anything that you don't want. Since the random bytes includes a newline every 256 bytes (on average), you get a file that can be edited just like typical source code. One thing that's non-obvious: Linux by default would interpret the stream of bytes as UTF-8, meaning a lot of invalid characters from the random source; explicitly setting the LANG variable to an 8-bit encoding solves this.

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -10000 > foo.txt
> ls -l foo.txt 
-rw-rw-r-- 1 kgregory kgregory 642217 Nov  3 10:05 foo.txt
> head -4 foo.txt 
N8opIkrSt5NQsWnqYYmt9BBmEWBVaaSVjzTTFJCXHT2vay2CoDT7J rm3f7CWefdgicOdfs0tUdgx
OvqjwOykmKToWkd8nxWNtCCCkUi cxn3Bn5gN4Im38y cfS6IdXgIj9O6gBEGgBW6BcZJ 

When we commit this change, our object directory gets three new entries:

13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e

In the diagram, you can see what happened: adding content to the file created a new blob. Since this had a different SHA-1 than the original file content, it meant that we got a new tree to reference it. And of course, we have a new commit that references that tree. Since baz.txt wasn't changed, it continues to point to the original blob; in turn that means that the directory bar hasn't changed, so it can be represented by same tree object.

One interesting thing is to compare the size of the original file, 642,217 bytes, with the 496,994 bytes stored in the repository. Git compresses all of its objects (so you can't just cat the files). I generated random data for foo.txt, which is normally uncompressible, but limiting it to alphanumeric characters meant that each character only takes fit in 6 bits rather than 8; compressing the file therefore saves roughly 25% of its space.

I'm going to diverge from the diagram for the next example, because I think it's important to understand that Git stores full objects rather than diffs. So, we'll add a few new lines to the file:

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

When we look at the objects, we see that there is a new blob that is slightly larger than the old one, and that the commit object (c620754c) is nowhere near large enough to hold the change. Clearly, the commit does not encapsulate a diff.

13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e
13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9

Before moving on, I want to call out Git's two-level directory structure. Filesystem directories are typically a linear list of files, and searching for a specific filename becomes a significant cost once you have more than a few hundred files. Even a small repository, however, may have 10,000 or more objects. A two-level filesystem is the first step to solving this problem: the top level consists of sub-directories with two-character names, representing the first byte of the object's SHA-1 hash. Each sub-directory only holds those objects whose hashes start with that byte, thereby partitioning the total search space.

Large repositories would still be expensive to search, so Git also uses “pack” files, stored in .git/objects/pack; each pack file contains some large number of commits, indexed for efficient access. You can trigger this compression using git gc, although that only affects your local repository. Pack files are also used when retrieving objects from a remote repository, so your initial clone gives you a pre-packed object directory.


OK, you've seen how Git stores objects, what happens when you create a branch?

> git checkout -b my-branch
Switched to a new branch 'my-branch'

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

> git commit -m "add some content in a branch" foo.txt 
[my-branch 1791626] add some content in a branch
 1 file changed, 100 insertions(+)

If you look in .git/objects, you'll see another three objects, and at this point I assume that you know what they are.

13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9
13510395    4 -r--r--r--   1 kgregory kgregory      175 Nov  3 11:50 .git/objects/17/91626ece8dce3c713261e470a60049553d5411
13377993  496 -r--r--r--   1 kgregory kgregory   506883 Nov  3 11:50 .git/objects/57/c091e77d909f2f20e985150ea922c3ec303ab6
13377996    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 11:50 .git/objects/f2/c824ab04c8f902ab195901e922ab802d8bc37b

So how does Git know that this commit belongs to a branch? The answer is that Git stores that information elsewhere:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 10:18 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 

> cat .git/refs/heads/my-branch

That's it: text files that hold the SHA-1 of a commit. There is one additional file, .git/HEAD, which says which of those text files represent the “working” branch:

> cat .git/HEAD 
ref: refs/heads/my-branch

This isn't quite the entire story. For one thing, it omits tags, stored in .git/refs/tags. Or “unattached HEAD” state, when .git/HEAD holds a commit rather than a ref. And most important, remote branches and their relationship to local branches. But this post (and the talk) are long enough as it is.


You've been making changes on a development branch, now it's time to merge those changes into master (or an integration branch). It's useful to know what happens when you type git merge,

Fast-Forward Merges

The simplest type of merge — indeed, you could argue that it's not really a merge at all, because it doesn't create new commits — is a fast-forward merge. This is the sort of merge that we'd get if we merged our example project branch into master.

> git checkout master
Switched to branch 'master'

> git merge my-branch
Updating c620754..1791626
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

As I said, it's not really a merge at all; if you look in .git/objects you'll see the same list of files that were there after the last commit. What has changed are the refs:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 12:15 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 

> cat .git/refs/heads/my-branch

Here's a diagram of what happened, showing the repository state pre- and post-merge:

Topologically, the “branch” represents a straight line, extending the last commit on master. Therefore, “merging” the branch is as simple as re-pointing the master reference to that commit.

Squashed Merges

I'll start this section with a rant: one of the things that I hate, when looking at the history of a project, is to see a series of commits like this: “added foo”; “added unit test for foo”; “added another testcase for foo”; “fixed foo to cover new testcase”… Really. I don't care what you did to make “foo” work, I just care that you did it. And if your commits are interspersed with those of the person working on “bar” the evolution of the project becomes almost impossible to follow (I'll return to this in the next section).

Fortunately, this can be resolved easily with a squash merge:

> git merge --squash my-branch
Updating c620754..1791626
Squash commit -- not updating HEAD
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

The “not updting HEAD” part of this message is important: rather than commit the changes, a squash merge leaves them staged and lets you commit (with a summary message) once you've verified the changes. The diagram looks like this:

The thing to understand about squashed commits is that they're not actually merges: there's no connection between the branch and master. As a result, “git branch -d my-branch” will fail, warning you that it's an unmerged branch; you need to force deletion by replacing -d with -D.

To wrap up this section: I don't think you need to squash all merges, just the ones that merge a feature onto master or an integration branch. Use normal merges when pulling an integration branch onto master, or when back-porting changes from the integration branch to the development branch (I do, however, recommend squashing back-ports from master to integration).

“Normal” Merges, with or without conflicts

To understand what I dislike about “normal” merges, we need to do one. For this we'll create a completely new repository, one where we'll compile our favorite quotes from Lewis Carroll. We start by creating the file carroll.txt in master; for this section I'll just show changes to the file, not the actual commits.


Walrus and Carpenter

In order to work independently, one person will make the changes to Jabberwocky on a branch named “jabber”:


Twas brillig, and the slithy toves
Did gire and gimbal in the wabe
All mimsy were the borogroves
And the mome raths outgrabe

Walrus and Carpenter

After that's checked in, someone else modifies the file on master:


Walrus and Carpenter

The time has come, the Walrus said,
To talk of many things:
Of shoes, and ships, and sealing wax
Of cabbages, and kings
And why the sea is boiling hot
And whether pigs have wings

Back on branch jabber, someone has found the actual text and corrected mistakes:


'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
All mimsy were the borogoves
And the mome raths outgrabe

Walrus and Carpenter

At this point we have two commits on jabber, and one on master after the branch was made. If you look at the commit log for jabber you see this:

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

And on master, this is the commit log:

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

Now the question is: what happens when you merge jabber onto master? In my experience, most people think that the branch commits are added into master based on when they occurred. This is certainly reinforced by the post-merge commit log:

commit 8ba1b90996b4fb67529a2964836f8524a220f2d8
Merge: 5269b07 2e68e9b
Author: Keith Gregory 
Date:   Thu Nov 3 13:03:09 2016 -0400

    Merge branch 'jabber'

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

But take a closer look at those commit hashes, and compare them to the hashes from the separate branches. They're the same, which means that the commits are unchanged. In fact, git log walks the parent references of both branches, making it appear that commits are on one branch when they aren't. The diagram actually looks like this (yes, I know, there are two commits too many):

The merge created a new commit, which has two parent references. It also created a new blob to hold the combined file, along with a modified tree. If we run cat-file on the merge commit, this is what we get:

tree 589187777a672868e50aefc491ed640125d1e3ed
parent 5269b0742e49a3b201ddc0255c50c37b1318aa9c
parent 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
author Keith Gregory  1478192589 -0400
committer Keith Gregory  1478192589 -0400

Merge branch 'jabber'

So, what does it mean that git log produces the illusion of a series of merged commits? Consider what happens when you check out one of the commits in the list, for example 2e68e9b9.

This was a commit that was made on the branch. If you check out that commit and look at the commit log from that point, you'll see that commit 5269b074 no longer appears. It was made on master, in a completely different chain of commits.

In a complex series of merges (say, multiple development branches onto an integration branch, and several integration branches onto a feature branch) you can completely lose track of where and why a change was made. If you try to diff your way through the commit history, you'll find that the code changes dramatically between commits, and appears to flip-flop; you're simply seeing the code state on different branches.

Wrapping up: how safe is SHA-1?

“Everybody knows” that SHA-1 is a “broken” hash, so why is it the basis for storing objects in Git?

The answer to that question has two parts. The first is that SHA-1 is “broken” in terms of an attacker being able to create a false message that has the same SHA-1 hash as a real message: it takes fewer than the expected number of attempts (although still a lot!). This is problem if the message that you're hashing is involved in validating a server certificate. It's not a problem in the case of Git, because at worst the attacher would be able to replace a single object — you might lose one file within a commit, or need to manually rebuild a directory.

That's an active attacker, but what about accidental collisions: let's say that you have a commit with a particular hash, and it just so happens that you create a blob with the same hash. It could happen, although the chances are vanishingly small. And if it does happen, Git ignores the later file.

So don't worry, be happy, and remember to squash features from development branches.

Tuesday, September 13, 2016

Incremental Development is Like a Savings Account

In January of 1996 I opened a mutual fund account, investing in an S&P index fund. My initial deposit was $1,000, and I set up an automatic transfer of $200/month from my checking account. This month I looked at my statement and the account had crossed the $100,000 mark.

You can take that as an example of compound returns, and why you should establish a savings account early in life. But it reminded me of a conversation that I had with a project manager a few years ago, after our company had introduced Scrum.

His complaint was “But we could just decide to stop working on the project at any time!”

My response was a customer-value-oriented “Right, if we no longer provide incremental business value, then we should stop.”

But I now think that was the wrong answer, because in my experience most Agile projects don't stop. They continue to be enhanced for years because there's always an increment of value to be had, worth more than the cost of providing it. By comparison, “big push” projects do stop, because there's always another big project to consume the resources. So companies hop from one big push to another, pay for the team(s) learn the environment, and end up with something that's often less than useful to the client.

Returning to my mutual fund: if I had invested $1,000 and stopped, my investment would be worth approximately $5,000 today — the S&P has returned about 8% annually, even with the 37% downturn in 2008. But the reason that I'm at $100k today is because of the $200 added to the account every month for the past 20 years. Something I could “just stop” at any time.

Thursday, July 14, 2016

Advice for Young Programmers

I am sometimes asked if I have any words of advice for young people. Well, here are a few simple admonitions.

–William S Burroughs

It's easier to solve problems if you don't panic. But panic may be a mandatory team activity.

There's more than one way to skin a cat. The way you choose determines whether you end up with a clean pelt or a tattered bloody mess.*

The difference between retrospective and second-guessing is that the former tries to reconcile benefits and consequences, while the latter says “this time we'll get it right.”

Experience doesn't make you smarter than anyone else. But once you've seen and made enough mistakes it sometimes seems that way. Unless you didn't learn from them.

Occam's razor is your most important tool. But if you're careless it will cut you.

Unit testing with mock objects lets you verify that your incorrect assumptions are internally consistent.**

Startups are a lot of fun, but eventually they'll burn you out. Join one for the challenge and excitement, not the stock options. You'll be long gone before they become worthless.

By all means spend some time in management. You might like it. You might even be good at it. But always remember: the role of a manager is to satisfy infinite desires with limited resources. And there's only a little stigma to saying “I don't like this” and returning to the front lines.

Everybody has a brand; it's the sum of everything that everybody else perceives about them. The secret to personal and professional success is to understand your brand, actively change what you don't like, and think carefully before changing what you do.

If you interview at a company that refers to people as “resources,” get up and leave. It doesn't matter if the person interviewing you is a nice person, or that your potential manager says the right things. Upper management doesn't value you, and won't think twice about replacing you with someone cheaper.

The world is not divisible into two groups of people, but three: thinkers, non-thinkers, and dogmatic non-thinkers. Non-thinkers will turn into thinkers if they value the outcome. Dogmatic non-thinkers won't, and will hate you for trying to enlighten them.

Whatever you call them, fools are unavoidable. The important thing is to avoid becoming one yourself.

* Stolen from Mike Boldi

** Stolen from Drew Sudell

Saturday, June 4, 2016

Target Fixation

Motorcyclists have a saying: you go where you look. If an animal runs out in front of you, or you there's a patch of sand in the middle of the corner, or a Corvette is coming the other way, your first response has to be to look elsewhere. If not, you'll almost certainly hit whatever it is that you didn't want to hit.

Another name for this phenomena is target fixation, and that name was driven home to me quite literally — and painfully — in a paintball game many years ago. I was slowly and carefully positioning myself to shoot one of the other players, when all of a sudden I felt a paintball hit the middle of my back. I was so fixated on my target that I stopped paying attention to what was around me.

I suspect that target fixation was an enormous help to our hunter-gatherer ancestors stalking their dinner. They would only get one chance to bring down their quarry, and didn't have the benefit of high-powered rifles and telescopic sights. To a modern human, surrounded by opportunities to fixate on the wrong thing, it's not so great.

Physical dangers are one thing, but we're also faced with intellectual dangers. If focus too closely on the scary thing that's right in front of you, you'll ignore all the pitfalls that lie just beyond. This is a particular concern for software developers, who may adopt and implement a particular design without taking the time to think of the ways that it can fail — or of alternative designs that are simpler and more robust.

For example, you might implement a web application that requires shared state, and become so fixated on transactional access to that state that you don't think about contention … until you start running at scale, and discover the delays that synchronization introduces. If you weren't fixated on concurrent access, you might have thought of better ways to share the state without the need for transactions.

So, how to avoid becoming fixated? In the physical world, where fixation has potentially deadly consequences, training programs focus on prevention via ritual. For motorcyclists, the ritual is “SEE”: search, evaluate, execute. For pilots, there are many rituals, but one that was burned into my brain is aviate, navigate, communicate.

For software development, I think that a preemptive “five whys” exercise is a useful way to avoid design fixation. This exercise is usually used after a problem occurs, to identify the root cause of the problem and potential solutions: you keep asking “why did this happen” until there are no more answers. Recast as a pre-emptive exercise, it is meant to challenge — and ultimately validate — the assumptions that underly your design.

Returning to the concurrency example, the first question might be “why do I want to prevent concurrent access?” One possible answer is ”this is inventory data, and we don't want two customers to buy the last item.” That could lead to several other questions, such as “do I need to use a database transaction?” and “do I need to make that guarantee at this point in the process?”

The chief danger in this exercise is “analysis paralysis,” which is itself a form of target fixation. To move forward, you must accept that you are making assumptions, and be comfortable that they're valid assumptions. If you fixate on the possibility that your assumptions are invalid, you'll never move.

You also need to recognize that, while target fixation is often dangerous, it can have a positive side: preventing you from paying attention to irrelevant details.

I had a real-world experience of this sort a few weeks ago, while riding my motorcycle on a twisting country road: I saw a pickup truck coming the other way and not keeping to his lane. With a closing speed in excess of 100 miles per hour there wasn't much time to make a decision, and not many good decisions to make. I could continue as I was going, assume that the driver would see me and be able to keep within his lane; if I was wrong in that assumption, my trip would be over. I could get on the brakes hard, now, but would come to a stop at the exact point where the pickup would leave his lane while exiting the corner.

My best option was to stop just past the apex of the corner, which would be where the pickup was most likely to be within his lane. I fixated on that spot, and let the muscle memory of 100,000+ miles balance the braking and turning forces necessary to get me there. I have no idea how close the truck came to hitting me; my riding partner said that it was an “oh shit” moment. But once I picked my destination, the pickup truck and everything around me simply disappeared.

Which leads me to think that there might be another name for the phenomena: “flow.”

Monday, May 23, 2016

How (and When) Clojure Compiles Your Code

When I started working with Clojure, one of the challenges I faced was understanding exactly what Clojure was doing with my code. I was intimately familiar with how Java and the JVM works, and tried to slot Clojure into that mental model — but found too many cases where my model didn't quite represent reality. The docs weren't much help: they talked about the features of the language (including compilation), but didn't provide detail on what “automatically compiled to JVM bytecode on the fly” actually meant.

I think that detail is important, especially if you come to Clojure from Java or are running your Clojure code within a Java-centric framework. And I see enough questions on the Internet to realize that not a lot of people actually understand how Clojure works. So this post demonstrates a few key points that are the basis of my new mental model.

I'm using Clojure 1.8 for examples, but I believe that everything that I say is correct for versions as early as 1.6 and probably before.

Clojure is a scripting language

I'll start with some definitions:
  • Compiled languages translate source code into an artifact that is then loaded, unchanged into the runtime. Any decisions in the code rely on state maintained within the executing program.
  • Scripting languages load the source code into the runtime, and execute that code as it is loading. Scripts may make decisions based on state that they manage, as well as global state that has been set by other scripts.

Clojure is very much a scripting language, even though it compiles its scripts into JVM bytecode. All Clojure source files are processed one expression at a time: the reader reads characters from the source until it finds a valid expression (a list of symbols delimited by balanced parentheses), expands any macros, compiles the result to bytecode, then executes that bytecode.

It doesn't matter whether the source code is entered by hand into the REPL, or read from a file as part of a (require ...) form. It's processed a single top-level expression at a time.

Note my term: “top-level” expression. You won't find this term in the Clojure docs; they refer to “expressions” and “forms” more-or-less interchangeably, but don't differentiate between expressions that are nested within other expressions. The reason that I do will become apparent later on.

In my opinion, this form of evaluation is what gives macros their power (the oft-proclaimed homoiconicity of the language simply means that they're easier to write). A macro is able to use any information that has already been loaded into the runtime, including variables that have been created earlier by the same or a different script.

Top-level expressions turn into classes

Here's an interesting experiment: start up a Clojure REPL with the following command (using the correct path to the Clojure distribution JAR):

java -XX:+TraceClassLoading -jar clojure-1.8.0.jar

You'll see a long list of classloading messages flash by before you end up at the REPL. Now enter a simple expression, such as (+ 1 2); you'll see more messages, as the Clojure runtime loads the classes that it needs. Enter that same expression again, and you'll see something like this:

user=> (+ 1 2)
[Loaded user$eval3 from __JVM_DefineClass__]

This message indicates that Clojure compiled that expression to bytecode and then loaded the newly-created class to execute it. The class definition is still in memory, and you can inspect it. For example, you can look at its superclass (I've removed the now-distracting classloader messages):

user=> (.getSuperclass (Class/forName "user$eval3"))

AFunction is a class within the Clojure runtime; it is a subclass of AFn, which implements the invoke() method. With this knowledge, it's apparent that the evaluation of this simple expression has four steps:

  1. Parse the expression (including macro expansion, which doesn't apply to this case) and generate the bytes that correspond to a Java .class file.
  2. Pass these bytes to a classloader, getting a Java Class back.
  3. Instantiate this class.
  4. Call invoke on the resulting object.

You can, in fact, do all of this by hand, provided that you know the classname:

user=> (.invoke (.newInstance (Class/forName "user$eval3")))

OK, so far so good. Now I want to show why earlier I called out a distinction between “top-level” expressions and nested expressions:

user=> (* 3 (+ 1 2))
[Loaded user$eval5 from __JVM_DefineClass__]

Here we have an expression that contains a nested expression. However, note that only one class was generated as a result. In theory, every expression could turn into its own class. Clojure takes a more pragmatic approach, which is a good thing for our memory footprint.

Variables are wrapped in objects

To outward appearances, a Clojure variable is similar to a final Java variable: you assign it (once) with def, and retrieve its value simply by inserting the variable in an expression:

user=> (def x 10)

user=> (class x)

user=> (+ x 2)

A hint of the truth can be seen if you attempt to use an unbound var in an expression:

user=> (def y)

user=> (+ 2 y)

ClassCastException clojure.lang.Var$Unbound cannot be cast to java.lang.Number  clojure.lang.Numbers.add (

In fact, variables are instances of clojure.lang.Var, which provides functions to get and set the variable's value. When you reference a variable within an expression, that reference translates into a method call that retrieves the actual value.

This allows a great deal of flexibility, including the ability to redefine variables. Application code can do this on a per-thread basis using binding and set!, or within a call tree using with-redefs. The Clojure runtime does much more, such as redefining all variables when you reload a namespace.

A namespace is not a class

For someone coming from a Java background, this is perhaps the hardest thing to grasp. A namespace definition certainly looks like a class definition: you have a dot-delimited namespace identifier, which corresponds to the path where you save the source code. And when you invoke a function from a namespace, you use the same syntax that you would to invoke a static method from a Java class.

The first hint that the two aren't equivalent is that the ns macro doesn't enclose the definitons within the namespace. Another is that you can switch between namespaces at will and add new definitions to each:

user=> (ns foo)
foo=> (def x 123)
foo=> (ns bar)
bar=> (def x 456)
bar=> (ns foo)
foo=> (def y 987)

You could take the above code snippets, save them in an arbitrary file, and then use the load-file function to execute that file as a script. In fact, you could write your entire application, with namespaces, as a single script.

But most (sane) people don't do that. Instead, they create one source file per namespace, store that file in a directory derived from the namespace name, and use the require function to load it (or more often, a :require directive in some other ns declaration).

Loading code from the classpath: require and load

The :require directive is another point of confusion for a Java developer starting Clojure. It certainly looks like the import statement that we already know, especially when it's used in an ns invocation:

(ns example.main
    :require [example.utils :as utils])

In reality, :require is almost, but not quite, entirely unlike import. The Java compiler uses import to load definitions from an already-compiled class so that they can be referenced by the class that's currently being compiled. On a superficial level, the Clojure runtime does the same thing when it sees :require, but it does this by loading (and compiling) the source code for that namespace.

OK, there are some caveats to that statement. First is that require only loads a namespace once, unless you specify the :reload option. So if the required namespace has already been loaded, it won't be loaded again. And if the namespace has already been compiled, and the source file is older than the compiled files, then the runtime loads the already compiled form. But still, there's a lot of stuff happening as the result of a seemingly simple directive.

So, let's dig into the behavior of require, along with its step-brother load. Earlier I wrote about using load-file to load an arbitrary file into the REPL. Here's that file, followed by the command to load and run it:

(ns foo)
(def x 123)

(ns bar)
(def x 456)

(ns user)
(do (println "myscript!") (+ foo/x bar/x))
user=> (load-file "src/example/myscript.clj")

When you load the file, it creates definitions within the two namespaces, then invokes an expression to add them. After loading the file, you can access those variables from the REPL:

user=> (* foo/x bar/x)

The load function is similar, but loads files relative to the classpath. It also assumes a .clj extension. I'm using Leiningen, so my classpath is everything under src; therefore, I can load the same file like so:

user=> (load "example/myscript")

Wait a second, what happened to the expression at the end of the script? It was still evaluated — the println executed — but the result was discarded and load returned nil.

Now let's try loading this same script with require:

user=> (require 'example.myscript :reload)

Different syntax, same result. The two variables are defined in their respective namespaces, and the stand-alone expression was evaluated. So what's the difference?

The first difference is that require gives you a bunch of options. For example, you can use :as to create a short alias for the namespace, so that you don't have to reference its vars with fully-qualified names. The way that the runtime uses these flags is probably worthy of a post of its own.

Another difference is that require is a little smarter about loading scripts: it only loads (and compiles) a script if it hasn't already done so — unless, of course, you use the :reload or :reload-all options, like I did here. Omitting that option, we see that a second require doesn't invoke the println.

user=> (require 'example.myscript)

Compiling your code (or, :gen-class doesn't do what you might think)

As you've seen above, the Clojure runtime normally compiles your code when it's loaded, producing the bytes of a .class file but not writing them to the filesystem. However, there are times that you want a real, on-disk class. For example, so that you can invoke that class from Java (note that you'll still need the Clojure JAR on your classpath). Or so that you can reduce startup time for a Clojure application, by avoiding load-time compilation (although I think this is probably premature optimization).

The compile function turns Clojure scripts into classes:

user=> (compile '

That was simple enough. Note, however, that I was running in lein repl, which sets the *compile-path* runtime global to a directory that it knows exists. If you try to execute this function from the clojure.main REPL, it will fail unless you create the directory classes.

Here's the example file that I compiled:


(def x 123)

(defn what [] "I'm compiled!")

(defn add2 [x] (+ 2 x))

And here are the classes that it produced:

-rw-rw-r--   1 kgregory kgregory     3008 May  7 09:29 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:29 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:29 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r--   1 kgregory kgregory     1503 May  7 09:29 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:29 target/base+system+user+dev/classes/example/foo$what.class

If you're a bytecode geek like me, you'll of course run javap -c on those files to see what they contain (especially fn__1194, which doesn't appear anywhere in the source!). Have at it. For everyone else, here are the two things I think are important:

  • Every function turns into its own class. If you've been reading along, you aready knew that.
  • The foo__init class is responsible for pulling all of the other classes into memory, creating instances of those classes, and assigning them to vars in the namespace.

If you use Leiningen, you've probably noted that it adds a :gen-class directive to the main class of any “app” project that it creates. If you skim the docs for gen-class you might think this will produce a Java class that exposes all of your namespace's functions. Let's see what really happens, by adding a :gen-class directive to the example script:


When you compile, the list of classes now looks like this:

-rw-rw-r--   1 kgregory kgregory     1823 May  7 09:31 target/base+system+user+dev/classes/example/foo.class
-rw-rw-r--   1 kgregory kgregory     3009 May  7 09:31 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:31 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:31 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r--   1 kgregory kgregory     1505 May  7 09:31 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:31 target/base+system+user+dev/classes/example/foo$what.class

Everything's the same, except that we now have foo.class. Looking at this class with javap, we find that it contains overrides of the basic Object methods: equals(), hashCode(), toString(), and clone(). It also creates a Java-standard main() function, which looks for the Clojure-standard -main (which doesn't exist for our script, so will fail if invoked). But it doesn't expose any of your functions.

Reading the doc more closely, if you want to use :gen-class to expose your functions, you need to specify the exposed functions in the directive itself — and use a specified naming format that separates the Clojure method implementations from the names exposed to Java.

Pitfalls of compiling your code

Let's change the namespace declaration on foo, so that it requires bar:

  (:require [ :as bar]))

This results in the the expected classes for foo, but also several for bar (which doesn't define any functions):

-rw-rw-r--   1 kgregory kgregory     2219 May  7 09:39 target/base+system+user+dev/classes/example/bar__init.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:39 target/base+system+user+dev/classes/example/bar$fn__1196.class
-rw-rw-r--   1 kgregory kgregory     1503 May  7 09:39 target/base+system+user+dev/classes/example/bar$loading__5569__auto____1194.class
-rw-rw-r--   1 kgregory kgregory     3009 May  7 09:39 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:39 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:39 target/base+system+user+dev/classes/example/foo$fn__1198.class
-rw-rw-r--   1 kgregory kgregory     1891 May  7 09:39 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:39 target/base+system+user+dev/classes/example/foo$what.class

This makes perfect sense: ff you want to ahead-of-time compile one namespace, you probably don't want its dependencies to be compiled at runtime. But recognize that the tree of dependencies can run very deep, and will include any third-party libraries that you use (poking around Clojars, there aren't a lot of libraries that come precompiled).

There is one other detail of compilation that may cause concern: require loads a namespace from the file(s) with the latest modification time. If you have both source and compiled classes on your classpath, this could mean that you're not loading what you think you are. Fortunately, in practice this primarily affects work in the REPL: Leiningen removes the target directory as part of the jar and uberjar tasks, so you won't produce an artifact with a source/class mismatch.


This has been a long post, so I'll wrap up with what I consider the main points.

  • Startup times for Clojure applications will be longer than for normal Java applications, because of the additional step of compiling and evaluating each expression. This isn't going to be an issue if you've written a long-running server in Clojure, but it does add significant overhead to short-running programs (so Clojure is even less appropriate for small command-line utilities than Java).
  • Pay attention to the Clojure version used by your dependencies, because they might rely on functions from a newer version than your application; this problem manifests itself as an “Unable to resolve symbol” runtime error. While this is a general issue with transitive dependencies, I've found that third-party libraries tend to be at the latest version, while corporate applications tend to use whatever was current when they were begun.
  • As far as I can tell, the Clojure runtime doesn't ever unload the classes that it creates. This means that — on pre-1.8 JVMs — you can fill the permgen space. Not a big problem in development, but be careful if you use a REPL when connected to a production instance.
  • Every script that you load adds to the global state of the runtime. Be aware that the behavior of your scripts may be dependent on the order that they're loaded.

Saturday, April 30, 2016

Taming Maven: Transitive Dependency Pitfalls

Like much of Maven, transitive dependencies are a huge benefit that brings with them the potential for pain. And while I titled this piece “Taming Maven,” the same issues apply to any build tool that uses the Maven dependency mechanism, including Gradle and Leiningen.

Let's start with definitions: direct dependencies are those listed in the <dependencies> section of your POM. Transitive dependencies are the dependencies needed to support those direct dependencies, recursively. You can display the entire dependency tree with mvn dependency:tree; here's the output for a simple Spring servlet:

[INFO] com.kdgregory.pathfinder:pathfinder-testdata-spring-dispatch-1:war:1.0-SNAPSHOT
[INFO] +- javax.servlet:servlet-api:jar:2.4:provided
[INFO] +- javax.servlet:jstl:jar:1.1.1:compile
[INFO] +- taglibs:standard:jar:1.1.1:compile
[INFO] +- org.springframework:spring-core:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-asm:jar:3.1.1.RELEASE:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.springframework:spring-beans:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-context:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-aop:jar:3.1.1.RELEASE:compile
[INFO] |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  \- org.springframework:spring-expression:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-webmvc:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-context-support:jar:3.1.1.RELEASE:compile
[INFO] |  \- org.springframework:spring-web:jar:3.1.1.RELEASE:compile
[INFO] \- junit:junit:jar:4.10:test
[INFO]    \- org.hamcrest:hamcrest-core:jar:1.1:test

The direct dependencies of this project include servlet-api version 2.4 and :spring-core version 3.1.1.RELEASE. The latter has a dependency on spring-asm, which in turn has a dependency on commons-logging.

In a real-world application, the dependency tree may include hundreds of JARfiles with many levels of transitive dependencies. And it's not a simple tree, but a directed acyclic graph: many JARs will share the same dependencies — although possibly with differing versions.

So, how does this cause you pain?

The first (and easiest to resolve) pain is that you might end up with dependencies that you don't want. For example, commons-logging. I don't subscribe to the fear that commons-logging causes memory leaks, but I also use SLF4J, and don't want two logging facades in my application. Fortunately, it's (relatively) easy to exclude individual dependecial's, as I described in a previous “Taming Maven” post.

The second pain point, harder to resolve, is what, exactly, is the classpath?

A project's dependency tree is the project's classpath. Actually, “the” classpath is a bit misleading: there are separate classpaths for build, test, and runtime, depending on the <scope> specifications in the POM(s). Each plugin can define its own classpath, and some provide a goal that lets you see the classpath they use; mvn dependency:build-classpath will show you the classpath used to compile your code.

This tool lists dependencies in alphabetical order. But if you look at a generated WAR, they're in a different order (which seems to bear no relationship to how they're listed in the POM). If you're using a “shaded” JAR, you'll get a different order. Worse, since a shaded JAR flattens all classes into a single tree, you might end up with one JAR that overwrites classes from another (for example, SLF4J provides the jcl-over-slf4j artifact, which contains re-implemented classes from commons-logging).

Compounding classpath ordering, there is the possibility of version conflicts. This isn't an issue for the simple example above, but for real-world applications that have deep dependency trees, there are bound to be cases where dependencies-of-dependencies have different versions. For example, the Jenkins CI server has four different versions of commons-collections in its dependency tree, ranging from 2.1 to 3.2.1 — along with 20 other version conflicts.

Maven has rules for resolving such conflicts. The only one that matters is that direct dependencies take precedence over transitive. Yes, there are other rules regarding depth of transitive dependencies and ordering, but those are only valid to discover why you're getting the wrong version; they won't help you fix the problem.

The only sure fix is to lock down the version, either via a direct dependency, or a dependency-management section. This, however, carries its own risk: if one of your transitive dependencies requires a newer version than the one you've chosen, you'll have to update your POM. And, let's be honest, the whole point of transitive dependencies was to keep you from explicitly tracking every dependency that your app needs, so this solution is decidedly sub-optimal.

A final problem — and the one that I consider the most insidious — is directly relying on a transitive dependency.

As an example, I'm going to use the excellent XML manipulation library known as Practical XML. This library makes use of the equally excellent utility library KDGCommons. Having discovered the former, you might also start using the latter — deciding, for example, that its implementation of parallel map is far superior to others.

However, if you never updated your POM with a direct reference to KDGCommons, then when the author of PracticalXML decides that he can use functions from Jakarta commons-lang rather than KDGCommons, you've got a problem. Specifically, your build breaks, because the transitive depenedency has disappeared.

You might think that this is a uncommon situation, but it was actually what prompted this post: a colleague changed one of his application's direct dependencies, and his build started failing. After comparing dependencies between the old and new versions we discovered a transitive depenency that disappeared. Adding it back as a direct dependency fixed the build.

To wrap up, here are the important take-aways:

  • Pay attention to transitive dependency versions: whenever you change your direct dependencies, you should run mvn dependency:tree to see what's changed with your transitives. Pay particular attention to transitives that are omitted due to version conflicts.
  • If your code calls it, it should be a direct dependency. Plugging another of my creations, the PomUtil dependency tool can help you discover those.