Monday, March 19, 2018

Moving from EC2-Classic to VPC

If you're a long-time user of Amazon Web Services, chances are good that your application runs in “EC2-Classic”: the original AWS offering, where each server has a public IP on the open Internet. In 2009 Amazon introduced Virtual Private Cloud (VPC), in which your instances live on an isolated network using non-routable (private) IP addresses. At the end of 2013, Amazon made VPC the default deployment environment; if you created your AWS account in 2014 or later, you can stop reading now.

So, let's assume that you have applications deployed in EC2-Classic. Why would you want to move them to the VPC? Here are a few of the reasons that are important to me:

  • Minimize attack surface
    In EC2-Classic your instances have a public IP, and rely on security groups to control access. And while I'm not aware of any case where a properly configured security group has failed to prevent access, it's far too easy to accidentally open ports. On a private subnet, there is no direct route from the Internet to the instance, so opening a port has less potential for harm (although it's still to be avoided).
  • Control user access
    In addition to access by the general public, a VPC with private IPs and a Bastion host gives you greater control over access by your own people — or, more correctly, by their computers. If one of your employees loses his or her laptop, it's easier to disable that laptop's private key on a single bastion host rather than dozens or hundreds of externally-exposed application servers.
  • Control application access -- separate test and prod
    If you're running in EC2-Classic, the only thing that keeps a test server from updating a production database is its configuration. If test and prod have their own VPC, misconfiguration doesn't matter (unless you intentionally link the two VPCs).
  • Reduce cost of communication
    As I've said elsewhere, Amazon's data transfer pricing is complex, depending on where the traffic originates or terminates. But in general, communication via public IP is more expensive than via private IP. The savings may be only pennies per gigabyte, but they do add up.
  • Simplify whitelisting with external providers
    This was a big issue for my last company: several of our providers would only accept traffic from whitelisted IPs. That was solvable with Elastic IPs, but meant that we had to acquire an unattached IP from our pool when starting an instance (which meant that there had to be unused IPs in the pool, and unused Elastic IPs have a per-hour charge). With the VPC, we run everything through NATs with permanently assigned public IPs. To add another provider that requires whitelisting is a matter of telling them the existing NAT IPs; there's no need to provision additional Elastic IPs.

So, assuming these arguments are persuasive, how do you proceed?

Step 0: Pick a tool

My last post was about the benefits of using generated templates for application deployment. I think that the same arguments apply to infrastructure deployments, although there is less benefit to writing a program to generate the deployment script. There is still some benefit: you'll want to create production and test environments with slight variations, and you can play some bit-mapping tricks to associate availability zones with CIDRs (see below). But a lot of VPC configuration consists of one-off resource definitions, and you won't be running your VPC script more than a few times.

I think that the main reason to use a tool — at least, one that supports comments — is to provide documentation, which includes version-control history. Although your VPC won't change much over its lifetime, it will change.

Step 1: Architect your VPC

Amazon provides you with a default VPC in each region, and a lot of people start using that VPC without giving it a lot of thought. Unfortunately, that default VPC is almost certainly not optimal for your deployments; for one thing, all of the subnets are public, which means that you either assign a public IP to the instances running there or they can't talk to the outside world. A better approach, in my opinion, is to think about your workloads and plan the network accordingly before building. Here are a few of the things that I think about:

  • What CIDR block should you use?

    Amazon recommends creating VPCs that use non-routable IP addresses from one of the defined private subnet ranges. They allow you to create VPCs that cover public, routable IP addresses, but I think the only reason that you'd want to do that is to incorporate the VPC into your existing network. Don't do it unless you know that you need to.

    A more interesting question is which of the private address ranges you should use. To a large extent, I think this depends on what you're already using as a corporate network, because you want addresses on the VPC to either (1) be completely different than the corporate network, or (2) occupy a defined range of the corporate network. So, if you have a corporate network that uses 10.x.x.x addresses, you should either ask your IT department for a block of those addresses, or configure the VPC to use addresses from the 172.16.0.0/12 range (172.16.x.x through 172.31.x.x). For the sake of the people who work from home, please don't be tempted to use a 192.168.x.x address: that's the range most home routers use.

    While you can define the VPC to use less than a /16 network address, the only reason to do so is to fit into an existing corporate standard. If that's not the case, then don't limit yourself unnecessarily.

  • How many availability zones do you need?

    Amazon does not actually define what, exactly, an availability zone is; the conventional understanding is that each availability zone corresponds to a data center. What they are clear about is that an availability zone can become unavailable, and that has happened multiple times in Amazon's history. To prevent a single-AZ event from taking down your applications, you want to deploy into at least two zones.

    But should you configure your VPC with more than two zones? There have been multi-AZ failures in the past, but the only real way to keep running through those is to adopt multi-region redundancy. A bigger issue is one of capacity: while you can probably expect to get as many t2.small instances as you want, some of the larger instance types may have temporary or longer-term shortages in a particular AZ; if you have a limited number of AZs this may prevent you from deploying a workload. I've experienced this with compute-intensive elastic map-reduce clusters, and you can see the difference if you look at spot price history: on any given day you might see a 20 or 30 cent difference in the price of a c5.4xlarge instance between AZs.

    On the other hand, more AZs may mean a more complex or more expensive infrastructure. For example, if you use one NAT per AZ, your costs rise linearly with number of AZs; if you share a NAT, you have to manage the separate routing tables for AZs with and without NATs (and pay for cross-AZ traffic).

  • How many subnets do you need, and how big should they be?

    I'm going to assume that you'll have at least two subnets per availability zone: one public and one private. But do you need more? Some people like to group applications by subnet, and will divide up their address space with one subnet per application. I think that's at least partly a historical artifact of physical networking constraints: when computers are connected by wires and switches, you isolate traffic using subnets. In the AWS world traffic management isn't your concern: routing is handled by the AWS infrastructure and you have no idea how individual instances are physically connected.

    My personal belief is to give your subnets as much space as they can get. For example, you can divide a /16 VPC into four /18 subnets, each of which will support 16,379 hosts (why not 16,384? the answer is that AWS reserves five addresses from each subnet — another reason to not use small subnets). Since you want a public and private subnet for each AZ, you could further subdivide one of the /18 address spaces, giving four /20 address subnets (this makes sense because public subnets will have far fewer hosts than private).

    This sort of division makes programmatic generation of subnets easy: starting with a /16 base address, you use the high-order bits of the third byte to encode availability zone and public/private. For example, in the us-east-1 region you assign the number 0 to us-east-1a, 1 to us-east-1b, and 2 to us-east-1c. Each private subnet is a /18 whose third byte starts with the two-bit zone number; each public subnet is a /20 whose third byte starts with 11 followed by the zone number. So the public subnet in us-east-1a would have a third byte 1100xxxx, while the private subnet in that availability zone would use 00xxxxxx; for us-east-1b, public would be 1101xxxx and private 01xxxxxx; for us-east-1c, 1110xxxx and 10xxxxxx (and yes, that leaves an unused /20 subnet with 1111xxxx).

  • NAT Instance or NAT Gateway?

    I've covered this pretty deeply here. If you don't know that you're going to be pushing a lot of data over the NAT, then I think it's best to start with a NAT Gateway and let Amazon manage it for you. For redundancy you should have one NAT per availability zone, but if you're willing to accept downtime (and cross-AZ data charges) you can save money with a single NAT.
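
The address planning described above can be sketched with Python's standard ipaddress module. This is a minimal illustration rather than a finished tool: the function names (overlaps_any, plan_subnets, usable_hosts) and the sample CIDRs are made up for this example, and it assumes a /16 VPC with at most three availability zones, as in the layout above.

```python
import ipaddress

# Per the text above: AWS reserves five addresses in every subnet.
AWS_RESERVED_PER_SUBNET = 5


def overlaps_any(vpc_cidr, other_cidrs):
    """Return True if the proposed VPC block overlaps any of the other networks."""
    vpc = ipaddress.ip_network(vpc_cidr)
    return any(vpc.overlaps(ipaddress.ip_network(other)) for other in other_cidrs)


def usable_hosts(subnet):
    """Number of addresses left in a subnet after the AWS reservation."""
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


def plan_subnets(vpc_cidr, zones):
    """Assign one private /18 and one public /20 to each zone.

    Zone number n (0, 1, or 2) is encoded in the third byte of the address:
      private: nnxxxxxx   (a /18)
      public:  11nnxxxx   (a /20 carved out of the last /18)
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    if vpc.prefixlen != 16:
        raise ValueError("this layout assumes a /16 VPC")
    if len(zones) > 3:
        raise ValueError("this layout has room for at most three zones")

    base = int(vpc.network_address)
    plan = {}
    for n, zone in enumerate(zones):
        private_third_byte = n << 6
        public_third_byte = 0b11000000 | (n << 4)
        plan[zone] = {
            "private": ipaddress.IPv4Network((base + (private_third_byte << 8), 18)),
            "public": ipaddress.IPv4Network((base + (public_third_byte << 8), 20)),
        }
    return plan


if __name__ == "__main__":
    # Hypothetical corporate network; substitute whatever your IT department uses.
    print("overlaps corporate network:", overlaps_any("172.20.0.0/16", ["10.0.0.0/8"]))
    zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
    for zone, nets in plan_subnets("172.20.0.0/16", zones).items():
        for kind, net in nets.items():
            print(zone, kind, net, usable_hosts(net), "usable addresses")
```

The generated CIDRs can then be fed into whatever deployment tool you picked in Step 0.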

Step 2: Move your applications

At this point you have a lot of resources in EC2-Classic and an empty VPC. If you can afford downtime you can do a mass move and be done with it. Most of us can't afford the downtime, so we have to move our resources in phases. And you'll be faced with a question: do you move the app-servers first or the databases? The answer to this question largely depends on how many app-servers you have, and what other pieces of infrastructure they need to connect to.

The problem is security groups: most deployments are configured to allow inbound traffic based on security group IDs. So, for example, you might have one security group per application, assigned to the EC2 instances, load balancer, and database for that application, which has a rule that allows traffic from itself. This is a nice clean way to control access, but it has one problem: the security groups that you define in your VPC can't reference security groups that you've defined in EC2-Classic.

There are two solutions. The first is ClassicLink: you explicitly associate the EC2-Classic instance(s) with the VPC. With a large number of instances this becomes a pain point, even though you can link an auto-scaling group rather than individual instances.

The other solution is to run your EC2 instances within the private subnet(s) and enable access to the infrastructure resources using the IP addresses assigned to the NAT. This does mean that you'll be paying for traffic over the NAT, which can add up for a busy system, but shouldn't be a long-term cost.
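
As a rough sketch of the NAT-based approach, the following builds an ingress rule that admits only the NAT's public IPs and applies it with boto3's authorize_security_group_ingress. The function names, the security group ID, the port, and the IP addresses are all placeholders for illustration, and it assumes the resource is protected by an EC2 security group (an RDS instance in EC2-Classic uses DB security groups, which have a separate API).

```python
import ipaddress


def nat_ingress_permissions(nat_ips, port, description="VPC NAT"):
    """Build an IpPermissions list that allows TCP traffic on one port
    from each NAT public IP (as a /32)."""
    ranges = []
    for ip in nat_ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        ranges.append({"CidrIp": f"{ip}/32", "Description": description})
    return [{
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": ranges,
    }]


def allow_nat_access(group_id, nat_ips, port):
    """Apply the rule to an existing security group.

    boto3 is imported here so that the rule builder above can be used
    (and tested) without AWS credentials or the SDK installed.
    """
    import boto3

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=nat_ingress_permissions(nat_ips, port),
    )


if __name__ == "__main__":
    # Placeholder values: documentation-range IPs and the default Redis port.
    print(nat_ingress_permissions(["203.0.113.10", "203.0.113.11"], 6379))
```

Because the rule is keyed to the NAT addresses rather than to individual instances, it doesn't need to change as instances inside the VPC come and go; remember to remove it once the resource itself has moved.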

Step 3: Move your non-database infrastructure

Non-database infrastructure includes things like Redis caches or SOLR/ElasticSearch services. These can be (relatively) easily copied from an external server to an internal server, either as a fresh deployment or by making a snapshot of the original server's volume(s).

One of the things to consider at this time is whether or not you should continue to run these services on servers that you manage yourself, rather than, for example, replacing Redis with ElastiCache. You will pay more to run as a managed service, but in many cases it makes economic sense to eliminate the responsibility of managing the servers.

Step 4: Move your database(s)

I leave the databases for last because they're the most work, and the rest of the migration can take place without them: if you choose, you can continue to run the database outside of the VPC. The pain largely depends on the size of your database and whether or not you can afford downtime.

If you can take your systems down while you back up and restore a database snapshot, then that's by far the preferable solution. It may be possible to do this for some of your databases but not all; take advantage of it where you can. The thing to remember is that once you flip the switch there's no going back without data loss.

The alternative — the only alternative if you can't afford downtime — is to create a read replica inside the VPC and promote it when you're ready. There's still some downtime with this approach, but it's measured in minutes rather than hours.

Unfortunately, as of this writing RDS does not support creating an in-VPC read replica for an out-of-VPC master database. So you have to do it the old-fashioned way, using the tools provided by your DBMS.

For MySQL I followed the process here: you create an RDS read replica that you then clone to create the in-VPC replica. It's a fairly complex process, and you'll probably have to redo it a couple of times due to mistakes. In the hope of minimizing those mistakes, here are a few of the things that I learned the hard way:

  • Give yourself plenty of time when you set the backup retention period. This will allow you to temporarily shut down replication to do things like optimize tables and indexes in the new database. And if you've been running your applications since EC2-Classic days, this is probably very necessary.
  • Remember to create a security group on your out-of-VPC database that allows access from inside the VPC. This is easier if you put the in-VPC database on a private subnet that has a NAT. For various reasons I needed to leave our large database publicly accessible.
  • Rebooting your replica will change its IP address. Not an issue if you're connecting via NAT, but a huge issue if you've configured the replica as publicly accessible. If you plan to reboot (for example, to change replica configuration) turn off replication first, and make sure that you have the new replica IP in your security group before turning it back on.
  • The instructions tell you to save the value of Read_Master_Log_Pos from the initial slave. What they don't say is that Exec_Master_Log_Pos must have the same value. If the two differ, you'll end up with a replication failure because the slave relay log contains transactions that haven't completed. I found that you could enable and disable replication during a time of low database activity to bring these two values into sync.
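
The check in that last item is easy to script. Here is a minimal sketch, assuming you've already fetched the row returned by SHOW SLAVE STATUS as a dictionary (for example, with a dictionary-style cursor from your MySQL client library). The function name is made up for illustration. The comparison of Master_Log_File with Relay_Master_Log_File is my own addition, not part of the instructions referenced above: the two positions are only comparable when both threads are working in the same binlog file.

```python
def replica_caught_up(slave_status):
    """Return True if the SQL thread has executed everything the I/O thread
    has read, so the saved position is safe to use for the new replica.

    `slave_status` is one row of SHOW SLAVE STATUS as a dict.
    """
    same_file = (
        slave_status["Master_Log_File"] == slave_status["Relay_Master_Log_File"]
    )
    same_position = (
        int(slave_status["Read_Master_Log_Pos"])
        == int(slave_status["Exec_Master_Log_Pos"])
    )
    return same_file and same_position


if __name__ == "__main__":
    # Made-up values, shaped like a SHOW SLAVE STATUS row.
    status = {
        "Master_Log_File": "mysql-bin-changelog.000123",
        "Relay_Master_Log_File": "mysql-bin-changelog.000123",
        "Read_Master_Log_Pos": 4242,
        "Exec_Master_Log_Pos": 4100,
    }
    if not replica_caught_up(status):
        print("Positions differ: restart replication briefly and check again")
```

If the check fails, toggling replication during a quiet period (as described above) and re-running it is cheaper than discovering the mismatch after you've built the in-VPC replica.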

One last thing about migrating a database: when you restore an RDS database from snapshot it needs to “warm up”: the blocks from the snapshot are actually stored on S3 and will be read on an as-needed basis. This is a general issue with EBS volumes based on snapshots, and AWS provides instructions for doing this with EC2. Unfortunately, you can't do the same thing with RDS. I've experimented with different approaches to warming up an RDS instance; depending on the size of your database you might find one of them useful. You could also use the in-VPC read replica to run production queries, as long as you're OK with possible replication lag.

Step 5: Profit!

OK, profit is never guaranteed, but at least you'll be out of EC2-Classic. And hopefully the process of moving has pushed you toward improving your automated deployments, as well as giving you a chance to clean up some of the cruftier parts of your systems.
