Monday, June 29, 2015

The Encryption Tax

Last year I wrote a post comparing build times using an SSD and a traditional hard disk. One of the conditions that I tested was an encrypted home directory, and I was surprised to find that encryption added 11% to my build times, about the same as using a “spinning rust” drive.

That was Xubuntu 12.04, which used eCryptfs, a stacked cryptographic filesystem, to handle encryption. At the time, I posited that the buffer cache was holding encrypted data blocks, so you paid a “decryption tax” every time you accessed one. Xubuntu 14.04, however, provides an option to enable full-disk encryption as part of the installation. Since this hooks more tightly into the kernel's block layer, I wondered whether the decryption tax was still present. This called for another day of testing.

For this series of tests, I'm using my HP ProBook 640-G1:

  • Intel i5-4300M @ 2.6 GHz, 2 cores, hyperthreaded
  • 800 MHz FSB
  • 8 GB RAM, 3 MB L3 cache
  • Samsung 840 SSD
  • Xubuntu 14.04 LTS, 64-bit

And as a test subject, I'm building the Spring Framework 3.2.0.RELEASE. In my last test, it showed the biggest difference between SSD and platter, as well as the biggest difference between encrypted and non-encrypted access.

I tested three scenarios:

  • Standard Xubuntu 14.04 install, no encryption.
  • Home directory encryption (eCryptfs, a stacked filesystem that encrypts file by file).
  • Full-disk encryption (dm-crypt, which encrypts at the block-device layer).

As before, I pre-fetched all dependencies and disconnected from the network. Each timing is the average of two runs. Between runs I cleaned the build directory, TRIMed the filesystem, and rebooted to clear the buffer cache.

Here, then, are the numbers. Execution time is wall-clock time in seconds, as reported by Gradle, because that's what we really care about.

                              Execution Time   Difference from baseline
  Baseline                         288
  Encrypted home directory         300                 +4.2%
  Encrypted filesystem             295                 +2.5%

I'm not sure why the penalty of an encrypted home directory was lower this time around, but it's still there. Full-disk encryption carries a lower penalty, as expected, but it hasn't disappeared either. So you still have to trade performance for protection; given these numbers, though, protection should win.

At the end of my earlier post, I suggested that there's no substitute for RAM and the ability to maintain a large buffer cache. I'm not sure that argument applies to a project as big as Spring. After each timed build, I cleaned the directory and ran a second build, figuring that 8 GB should be enough RAM to keep everything in the buffer cache. The numbers, however, tell a different story (differences are still measured against the first table's baseline):

                              Execution Time   Difference from baseline
  Baseline                         284                 -1.4%
  Encrypted home directory         294                 +2.1%
  Encrypted filesystem             289                 +0.3%

Yes, a slight improvement over the previous runs, but not much. I think this is because compilers produce disk blocks rather than consume them; reading existing blocks is a small part of the overall execution time.

Sunday, March 1, 2015

Developing in the Clouds

Deploying to the cloud — whether to a platform-as-a-service provider such as Heroku, or an infrastructure-as-a-service provider such as Amazon EC2 — is commonplace, particularly for startups that don't want to invest in infrastructure, or established companies that need resources to handle operational spikes. But not much has been written about using a cloud-based host as a development platform. I just wrapped up a six-month project where my primary development machine was an Amazon EC2 instance, and here are a few of my thoughts.

I'm going to start with the negatives:

  • Latency

    I live in Philadelphia and was working on an Amazon EC2 instance in Ashburn, Virginia, a distance of approximately 150 miles; at the speed of light, roughly a millisecond. However, there are multiple network hops between me and the server, all of which add up to a round-trip time (via ping) of roughly 40 milliseconds. If you confine yourself to text, that's nearly unnoticeable. If you run the X Window System, it's unbearable. If your workflow is GUI-intensive, cloud-based development might not be a good choice (although I consider VNC quite acceptable when using an IDE for development). If you want to measure the round trip for yourself, see the sketch after this list.

  • Capability

    My desktop runs an Ivy Bridge Core i7 with 32 GB of RAM. The t2.medium instance that I used for remote development has baseline performance of roughly 40% of an undisclosed Xeon and only 4 GB of RAM. As it turns out, that's sufficient for many development tasks, especially with a rapid-turnaround platform such as Node.js. If you have big compiles, you can always fire up a c4.8xlarge with a Haswell Xeon, 60 GB of RAM, and disk throughput that's far better than your desktop SSD.

  • Cost

    Mind you, that c4.8xlarge will cost you: as of this date, $1.68 per hour or fraction thereof. On another project, a colleague fired up a cluster of these instances and forgot to shut them down when he left the company. A month later the IT department gave me a call to ask if we really needed them, because they were costing us $5,000 a month. By comparison, the t2.medium instance costs $0.052 per hour, or $456 per year. More than a decent developer desktop on a three-year depreciation schedule, but not that bad in the larger scheme.

  • Security

    This is the big one: if you're going to run machines in the cloud, you need to have at least a baseline knowledge of Internet security (or hire someone who does). Put simply, you will be attacked. To apply some numbers to that statement, I started an EC2 instance that exposed SSH and HTTP, and left it running for 24 hours. The first attempt to break into SSH happened within a half hour; there were 39 attempts over the course of the test. Yandex started exploring the site within 12 hours, followed by other web scrapers.*

    Basic security rules will get you a long way: don't use dictionary passwords for any exposed service (and for SSH, don't use passwords at all), and don't expose any unprotected services to the outside world. Use a firewall that checks origin IP. If you're running on AWS, this feature is built into security groups. If you need to share access to your instance, or access it from locations that you don't know in advance, consider a VPN.

    This is also a case where I think security by obscurity is useful — at least as a first line of defense. Most scannerbots and web crawlers look at well-known ports: 22 for SSH; 80, 8000, and 8080 for HTTP. Exposing your prototype website on port 21498 isn't going to stop a dedicated attacker (and there are bulk port scanners out there), but it will prevent your site's content from showing up in a search index before you're ready.
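
To put a number on the latency item above: opening a TCP connection costs roughly one round trip, so timing a connect gives a figure comparable to ping. Here's a minimal sketch; the hostname is a placeholder for your own instance, and any exposed port will do in place of SSH's port 22.

import java.net.InetSocketAddress;
import java.net.Socket;

public class LatencyProbe
{
    public static void main(String[] argv) throws Exception
    {
        // placeholder: substitute your instance's public DNS name or IP
        String host = (argv.length > 0) ? argv[0] : "ec2-203-0-113-10.compute-1.amazonaws.com";

        for (int ii = 0 ; ii < 5 ; ii++)
        {
            Socket socket = new Socket();
            long start = System.nanoTime();

            // a TCP connect completes in roughly one round trip
            socket.connect(new InetSocketAddress(host, 22), 5000);

            long elapsedMillis = (System.nanoTime() - start) / 1000000;
            socket.close();

            System.out.println("connect " + ii + ": " + elapsedMillis + " ms");
        }
    }
}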

And now, the positives:
  • Availability

    The ability to access a cloud-based host from anywhere, at any time, gives you an enormous amount of flexibility in how you do work. There's no need to lug a laptop home every night, and if the roads are snow-covered and filled with stopped traffic, you can easily work from home. With tools like screen or VNC, you can have a session that's always set up just how you want it, and which can run programs while you're not connected. Plus, it's easy to collaborate: unlike a personal computer, a cloud instance can be shared by multiple users.

  • Consistency

    I don't know about you, but after a few years my computers all accumulate a significant amount of cruft: libraries or entire applications that I installed for a particular project and no longer need, along with various versions of standard tools, some of which are expected by the OS itself. Multiply that cruft by the number of members on your team, and add a fudge factor for different shell configuration files. None of it matches your production (or even test) environment. It's a mess.

    To me, this is the biggest benefit of developing in the cloud: you can be certain that all machines are configured alike, or at least start out that way. Tools such as Chef and Puppet will take you from a base image to a fully configured server in one step. With Amazon, after you've configured the server once, you can create a private AMI and stamp out as many instances as you want (the sketch after this list shows the idea).

  • Disposability

    The flip side of starting servers quickly is disposing of them when they're no longer needed. There's no reason to patch or update your machine; that just accumulates cruft. Operations teams have long embraced this mantra: I was out riding with a friend last fall when his beeper went off; we pulled over, he discovered that one of his production servers was having trouble, shut it down, and started a replacement. Analysis could wait for later; there was no need to attempt an emergency patch.
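
Here's the AMI approach from the consistency item, sketched with the AWS SDK for Java. Treat it as a sketch rather than a recipe: the instance ID is a placeholder, and a real version would wait for the new image to become available before launching from it.

import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.CreateImageRequest;
import com.amazonaws.services.ec2.model.RunInstancesRequest;

public class StampInstances
{
    public static void main(String[] argv) throws Exception
    {
        // credentials come from the SDK's default provider chain
        AmazonEC2Client ec2 = new AmazonEC2Client();

        // snapshot the configured dev box as a private AMI; the instance ID is a placeholder
        String amiId = ec2.createImage(new CreateImageRequest()
                                       .withInstanceId("i-12345678")
                                       .withName("dev-base"))
                          .getImageId();

        // once the image is available (poll describeImages; not shown),
        // stamp out as many copies as you need
        ec2.runInstances(new RunInstancesRequest()
                         .withImageId(amiId)
                         .withInstanceType("t2.medium")
                         .withMinCount(1)
                         .withMaxCount(4));
    }
}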

Finally, lessons learned:
  • Experience Helps — a Lot

    I've been using AWS since 2009, and have been responsible for administering my personal Linux machine since well before that. However, the limits of my knowledge became apparent when the company hired Eric, a “cloud architect” who lived up to the name. He quickly had us running in an isolated virtual private cloud (VPC), with separate VPCs for our test, integration, and production environments, OpenVPN to secure access, LDAP to hold credentials, and automatic deployments from our CI server. If you can find such a person, hire him or her; it will save a lot of time.

  • Disposability Changes How You Work

    I got a hint of this several years ago, when using AWS to test a distributed application: we would prepare a test, start up a dozen servers, run the test, then shut them down. It allowed a form of testing that was quite simply impossible just a few years earlier; no company that I've worked for had a spare closet full of machines, and even if they did, configuration would require hours.

    As developers, we set up our machines just the way we want them; it takes a long time, and no two developers have the same configuration. But if you have the base development image preconfigured, firing up a new machine becomes a process of copying your personal configuration files and checking out your workspace. You learn to let go of “my machine.”


* I've written elsewhere about the questionable practices of web crawlers. If you have a site that's not ready for primetime, don't expose it to the Internet on a well-known port.

Monday, October 27, 2014

JDK Bug (Maybe): Namespaced XPath Variables

I'm revising my XPath article, and stumbled onto some strange behavior with namespaced variables.

For those not intimately familiar with the XPath spec, a VariableReference is a dollar sign followed by a qualified name. A qualified name may include a namespace prefix, and the spec requires that prefixes be associated with namespace declarations.

Here are two example paths, one without a namespace prefix and one with:

    //baz[@name=$myname]
    //baz[@name=$ns:myname]

Seems simple enough; let's look at some example code. SimpleNamespaceResolver is a class from the Practical XML library that manages a single-entry namespace context. I use it here because a complete and correct implementation of NamespaceContext would be a distraction.

import java.io.StringReader;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;

import org.w3c.dom.Document;

import org.xml.sax.InputSource;

import net.sf.practicalxml.xpath.SimpleNamespaceResolver;


public class VariableExample
{
    public static void main(String[] argv) throws Exception
    {
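        // a small test document; note the embedded whitespace, which shows up in the selected content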
        String xml = "<foo>"
                   + "    <bar name='argle'>"
                   +          "Argle"
                   + "    </bar>"
                   + "    <bar name='bargle'>"
                   +          "Bargle"
                   + "        <baz>Baz</baz>"
                   + "    </bar>"
                   + "</foo>";
        
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
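        // associate the prefix "ns" with the namespace URI "foo"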
        xpath.setNamespaceContext(new SimpleNamespaceResolver("ns", "foo"));
        xpath.setXPathVariableResolver(new XPathVariableResolver()
        {
            @Override
            public Object resolveVariable(QName var)
            {
                System.out.println("prefix    = " + var.getPrefix());
                System.out.println("namespace = " + var.getNamespaceURI());
                System.out.println("localName = " + var.getLocalPart());
                if (var.getLocalPart().equals("name"))
                    return "argle";
                else
                    return "";
            }
        });        
        
        String result = xpath.evaluate("//bar[@name=$ns:name]", dom);
        
        System.out.println("selected content: \"" + result + "\"");
    }
}

When we run this, it produces the following output:

prefix    = 
namespace = foo
localName = name
selected content: "Argle    "

The prefix has disappeared, but that's OK: we have the namespace. However, what if we comment out the setNamespaceContext() call?

prefix    = 
namespace = ns
localName = name
selected content: "Argle    "

The prefix has now become the namespace, without an exception being thrown.

Is this a real problem? I searched the Java Bug Parade and the Apache bug database and didn't see anyone reporting it as an issue, so I have to assume the answer is “no.”

Perhaps nobody uses namespaced variable references in the real world. I think avoiding them is a good idea on principle: if you have so many variables that you need namespaces to keep them straight, your XPath expressions are probably too complex.

And given the possibility that a misspelling will cause your expressions to silently fail, I think it's a good idea in practice as well.
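
If you do use them anyway, you can at least make a misspelling fail loudly: resolve only the variables you expect, and return null for everything else. With the JDK's built-in processor, a null from the resolver should cause evaluate() to throw an XPathExpressionException instead of quietly comparing against an empty string. Here's a minimal sketch, using the namespace URI and variable from the example above:

import javax.xml.namespace.QName;
import javax.xml.xpath.XPathVariableResolver;

public class StrictVariableResolver implements XPathVariableResolver
{
    // the only variable we expect; QName.equals() compares namespace URI
    // and local name, ignoring the prefix
    private static final QName EXPECTED = new QName("foo", "name");

    @Override
    public Object resolveVariable(QName var)
    {
        if (EXPECTED.equals(var))
            return "argle";

        // unknown variable: returning null makes evaluation fail rather
        // than silently comparing against an empty string
        return null;
    }
}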