blog.kdgregory.com

Sunday, March 1, 2015

Developing in the Clouds

Deploying to the cloud — whether to a platform-as-a-service provider such as Heroku, or an infrastructure-as-a-service provider such as Amazon EC2 — is commonplace, particularly for startups that don't want to invest in infrastructure, or established companies that need resources to handle operational spikes. But not much has been written about using a cloud-based host as a development platform. I just wrapped up a six-month project where my primary development machine was an Amazon EC2 instance, and here are a few of my thoughts.

I'm going to start with the negatives:

Latency
I live in Philadelphia and was working on an Amazon EC2 instance in Ashburn, Virginia, a distance of approximately 150 miles (at the speed of light, roughly a millisecond). However, there are multiple network hops between me and the server, all of which add up to a round-trip time (via ping) of roughly 40 milliseconds. If you confine yourself to text, that's nearly unnoticeable. If you run the X Window System, it's unbearable. If your workflow is GUI-intensive, cloud-based development might not be a good choice (although I consider VNC quite acceptable when using an IDE for development).
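The speed-of-light figure is easy to check. Here is a back-of-the-envelope sketch; the class and method names are illustrative, and the fiber number assumes signals travel at roughly two-thirds of c in glass:

```java
// Back-of-the-envelope propagation delay; all names here are illustrative only.
class PropagationDelay
{
    static final double KM_PER_MILE = 1.609344;
    static final double LIGHT_KM_PER_SEC = 299_792.458;   // speed of light in a vacuum

    // One-way delay in milliseconds over the given distance.
    // velocityFactor is the fraction of c at which the signal travels (1.0 = vacuum).
    static double oneWayMillis(double miles, double velocityFactor)
    {
        double km = miles * KM_PER_MILE;
        return km / (LIGHT_KM_PER_SEC * velocityFactor) * 1000.0;
    }

    public static void main(String[] argv)
    {
        System.out.printf("vacuum, one way:   %.2f ms%n", oneWayMillis(150, 1.0));
        // assumption: roughly 2/3 of c in optical fiber
        System.out.printf("fiber, round trip: %.2f ms%n", 2 * oneWayMillis(150, 2.0 / 3.0));
    }
}
```

Under those assumptions, propagation accounts for less than three of the 40 milliseconds.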
Capability
My desktop runs an Ivy Bridge Core i7 with 32 GB of RAM. The t2.medium instance that I used for remote development has baseline performance of roughly 40% of an undisclosed Xeon and only 4 GB of RAM. As it turns out, that's sufficient for many development tasks, especially with a rapid-turnaround platform such as Node.js. If you have big compiles, you can always fire up a c4.8xlarge with a Haswell Xeon, 60 GB of RAM, and disk throughput that's far better than your desktop SSD.
Cost
Mind you, that c4.8xlarge will cost you: as of this date, $1.68 per hour or fraction thereof. On another project, a colleague fired up a cluster of these instances and forgot to shut them down when he left the company. A month later the IT department gave me a call to ask if we really needed them, because they were costing us $5,000 a month. By comparison, the t2.medium instance costs $0.052 per hour, or $456 per year if left running around the clock. That's more than a decent developer desktop on a three-year depreciation schedule, but not that bad in the larger scheme.
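The arithmetic behind these figures can be sketched as follows. The class and method names are illustrative, and the 30-day month and four-instance cluster are assumptions chosen to land near the $5,000 figure, not details from the project:

```java
// Rough EC2 cost arithmetic using the on-demand prices quoted in this post.
class InstanceCost
{
    static final int HOURS_PER_YEAR = 24 * 365;
    static final int HOURS_PER_MONTH = 24 * 30;     // assumption: a 30-day month

    // Cost of leaving one instance running around the clock for a year.
    static double annualCost(double hourlyRate)
    {
        return hourlyRate * HOURS_PER_YEAR;
    }

    // Cost of leaving several instances running around the clock for a month.
    static double monthlyCost(double hourlyRate, int instances)
    {
        return hourlyRate * HOURS_PER_MONTH * instances;
    }

    public static void main(String[] argv)
    {
        System.out.printf("t2.medium, one year:       $%.2f%n", annualCost(0.052));
        // assumption: a cluster of four instances
        System.out.printf("c4.8xlarge x 4, one month: $%.2f%n", monthlyCost(1.68, 4));
    }
}
```

The per-hour price looks small; it's the multiplication by hours and instances that produces the surprise on the bill.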
Security
This is the big one: if you're going to run machines in the cloud, you need to have at least a baseline knowledge of Internet security (or hire someone who does). Put simply, you will be attacked. To apply some numbers to that statement, I started an EC2 instance that exposed SSH and HTTP, and left it running for 24 hours. The first attempt to break into SSH happened within a half hour; there were 39 attempts over the course of the test. Yandex started exploring the site within 12 hours, followed by other web scrapers.*
Basic security rules will get you a long way: don't use dictionary passwords for any exposed service (and for SSH, don't use passwords at all), and don't expose any unprotected services to the outside world. Use a firewall that checks origin IP. If you're running on AWS, this feature is built into security groups. If you need to share access to your instance, or access it from locations that you don't know in advance, consider a VPN.
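An origin check boils down to comparing the caller's address against an allowed CIDR block, which is the form that security group rules take. The following is a minimal sketch for IPv4 only; OriginFilter and its methods are made-up names, and the addresses come from the ranges reserved for documentation:

```java
// Illustrative IPv4 origin check; a security group or firewall does this for you.
class OriginFilter
{
    // Converts a dotted-quad literal such as "203.0.113.7" to a 32-bit value.
    static int toInt(String ipv4)
    {
        String[] octets = ipv4.trim().split("\\.");
        if (octets.length != 4)
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);

        int value = 0;
        for (String octet : octets)
        {
            int n = Integer.parseInt(octet);
            if ((n < 0) || (n > 255))
                throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
            value = (value << 8) | n;
        }
        return value;
    }

    // Returns true if the address falls inside the CIDR block (e.g. "203.0.113.0/24").
    static boolean isAllowed(String address, String cidr)
    {
        String[] parts = cidr.split("/");
        if (parts.length != 2)
            throw new IllegalArgumentException("not a CIDR block: " + cidr);

        int prefixLength = Integer.parseInt(parts[1]);
        if ((prefixLength < 0) || (prefixLength > 32))
            throw new IllegalArgumentException("invalid prefix length: " + cidr);

        // Java shifts are taken modulo 32, so a zero-length prefix needs its own case
        int mask = (prefixLength == 0) ? 0 : (-1 << (32 - prefixLength));
        return (toInt(address) & mask) == (toInt(parts[0]) & mask);
    }

    public static void main(String[] argv)
    {
        System.out.println(isAllowed("203.0.113.7", "203.0.113.0/24"));    // inside the block
        System.out.println(isAllowed("198.51.100.1", "203.0.113.0/24"));   // outside the block
    }
}
```

In practice you'd leave this to the security group rather than application code; the sketch only shows what the rule means.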
This is also a case where I think security by obscurity is useful — at least as a first line of defense. Most scannerbots and web crawlers look at well-known ports: 22 for SSH; 80, 8000, and 8080 for HTTP. Exposing your prototype website on port 21498 isn't going to stop a dedicated attacker (and there are bulk port scanners out there), but it will prevent your site's content from showing up in a search index before you're ready.

And now, the positives:

Availability
The ability to access a cloud-based host from anywhere, at any time, gives you an enormous amount of flexibility in how you do work. There's no need to lug a laptop home every night, and if the roads are snow-covered and filled with stopped traffic, you can easily work from home. With tools like screen or VNC, you can have a session that's always set up just how you want it, and which can run programs while you're not connected. Plus, it's easy to collaborate: unlike a personal computer, a cloud instance can be shared by multiple users.
Consistency
I don't know about you, but after a few years my computers all accumulate a significant amount of cruft: libraries or entire applications that I installed for a particular project and no longer need, along with various versions of standard tools, some of which are expected by the OS itself. Multiply that cruft by the number of members on your team, and add a fudge factor for different shell configuration files. None of which matches your production (or even test) environment. It's a mess.
To me, this is the biggest benefit of developing in the cloud: you can be certain that all machines are configured alike — or at least start out that way. Tools such as Chef and Puppet will take you from a base image to a fully configured server in one step. With Amazon, after you've configured the server once, you can create a private AMI and stamp out as many instances as you want.
Disposability
The flip side of starting servers quickly is disposing of them when no longer needed. There's no reason to patch or update your machine; that just accumulates cruft. This mantra has long been used by operations teams: I was out riding with a friend last fall when his beeper went off; we pulled over, he discovered that one of his production servers was having trouble, shut it down, and started a replacement. Analysis could wait for later; there was no need to try to make an emergency patch.

Finally, lessons learned:

Experience Helps — a Lot
I've been using AWS since 2009, and have been responsible for administering my personal Linux machine since well before that. However, the limits of my knowledge became apparent when the company hired Eric, a “cloud architect” who lived up to the name. He quickly had us running in an isolated virtual private cloud (VPC), with separate VPCs for our test, integration, and production environments, OpenVPN to secure access, LDAP to hold credentials, and automatic deployments from our CI server. If you can find such a person, hire him or her; it will save a lot of time.
Disposability Changes How You Work
I got a hint of this several years ago, when using AWS to test a distributed application: we would prepare a test, start up a dozen servers, run the test, then shut them down. It allowed a form of testing that was quite simply impossible just a few years earlier; no company that I've worked for had a spare closet full of machines, and even if they did, configuration would require hours.
As developers, we set up our machines just the way we want them; it takes a long time, and no two developers have the same configuration. But if you have the base development image preconfigured, firing up a new machine becomes a process of copying your personal configuration files and checking out your workspace. You learn to let go of “my machine.”

* I've written elsewhere about the questionable practices of web crawlers. If you have a site that's not ready for primetime, don't expose it to the Internet on a well-known port.

Monday, October 27, 2014

JDK Bug (Maybe): Namespaced XPath Variables

I'm revising my XPath article, and stumbled onto some strange behavior with namespaced variables.

For those not intimately familiar with the XPath spec, a VariableReference is a dollar sign followed by a qualified name. A qualified name may include a namespace prefix, and the spec requires that prefixes be associated with namespace declarations.

Here are two example paths, one with a namespace prefix and one without.

    //baz[@name=$myname]
    //baz[@name=$ns:myname]

Seems simple enough; let's see some example code. SimpleNamespaceResolver is a class from the Practical XML library that manages a single-entry namespace context. I use it here because a complete and correct implementation of NamespaceContext would be a distraction.

import java.io.StringReader;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;

import org.w3c.dom.Document;

import org.xml.sax.InputSource;

import net.sf.practicalxml.xpath.SimpleNamespaceResolver;


public class VariableExample
{
    public static void main(String[] argv) throws Exception
    {
        String xml = "<foo>"
                   + "    <bar name='argle'>"
                   +          "Argle"
                   + "    </bar>"
                   + "    <bar name='bargle'>"
                   +          "Bargle"
                   + "        <baz>Baz</baz>"
                   + "    </bar>"
                   + "</foo>";
        
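        // parse the sample document from the in-memory string
        Document dom = DocumentBuilderFactory.newInstance()
                       .newDocumentBuilder()
                       .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();

        // bind the prefix "ns" to the namespace URI "foo"
        xpath.setNamespaceContext(new SimpleNamespaceResolver("ns", "foo"));

        // print the variable's name as the resolver receives it, then resolve by local part alone
        xpath.setXPathVariableResolver(new XPathVariableResolver()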
        {
            @Override
            public Object resolveVariable(QName var)
            {
                System.out.println("prefix    = " + var.getPrefix());
                System.out.println("namespace = " + var.getNamespaceURI());
                System.out.println("localName = " + var.getLocalPart());
                if (var.getLocalPart().equals("name"))
                    return "argle";
                else
                    return "";
            }
        });        
        
        String result = xpath.evaluate("//bar[@name=$ns:name]", dom);
        
        System.out.println("selected content: \"" + result + "\"");
    }
}

When we run this, it produces the following output:

prefix    = 
namespace = foo
localName = name
selected content: "Argle    "

The prefix has disappeared, but that's OK: we have the namespace. However, what if we comment out the setNamespaceContext() call?

prefix    = 
namespace = ns
localName = name
selected content: "Argle    "

The prefix has now become the namespace, without an exception being thrown.

Is this a real problem? I searched the Java Bug Parade and Apache Bug Database and didn't see anyone reporting it as an issue, so I have to assume the answer is “no.”

Perhaps nobody uses namespaced variable references in the real world. I think avoiding them is a good idea on principle: if you have so many variables that you need namespaces, your XPath expressions are probably too complex.

And given the possibility that a misspelling will cause your expressions to silently fail, I think it's a good idea in practice as well.
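If you do need namespaced variables, one way to make the failure loud is a resolver that matches on the full qualified name rather than the local part alone. This is a sketch, not part of the JDK or Practical XML: StrictVariableResolver and its bind() method are hypothetical names, and throwing on an unknown name is a design choice of this example.

```java
import java.util.HashMap;
import java.util.Map;

import javax.xml.namespace.QName;
import javax.xml.xpath.XPathVariableResolver;

// Hypothetical resolver that refuses to answer for names it wasn't told about.
class StrictVariableResolver implements XPathVariableResolver
{
    private final Map<QName,Object> variables = new HashMap<QName,Object>();

    // Registers a value under a fully qualified name (namespace URI plus local part).
    public StrictVariableResolver bind(String namespaceUri, String localPart, Object value)
    {
        variables.put(new QName(namespaceUri, localPart), value);
        return this;
    }

    @Override
    public Object resolveVariable(QName var)
    {
        // QName equality compares namespace URI and local part; the prefix is ignored
        if (!variables.containsKey(var))
            throw new IllegalArgumentException("unknown XPath variable: " + var);
        return variables.get(var);
    }

    public static void main(String[] argv)
    {
        StrictVariableResolver resolver = new StrictVariableResolver().bind("foo", "name", "argle");

        // the name as it arrives when the "ns" prefix is bound to "foo"
        System.out.println(resolver.resolveVariable(new QName("foo", "name")));

        // the name as it arrived in the second run above, with the prefix as the namespace
        try
        {
            resolver.resolveVariable(new QName("ns", "name"));
        }
        catch (IllegalArgumentException ex)
        {
            System.out.println("rejected: " + ex.getMessage());
        }
    }
}
```

You would install it with xpath.setXPathVariableResolver(), in place of the anonymous class in the example. Because the lookup includes the namespace URI, an unbound or misspelled prefix no longer resolves to the same value as the correct one.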

Tuesday, August 12, 2014

Scalarific FizzBuzz

This started as a lunchtime conversation about interviewing. I'm a big fan of FizzBuzz as a “screening” question: it weeds out the people who shouldn't be allowed near a computer (and, after conducting several hundred interviews, I can say that there's a depressingly large number of them, especially at a company without a preliminary phone screen).

For a Scala developer, what constitutes a good FizzBuzz? Clearly, it should be based around higher-order functions, such as a map() (and, as I don't consider myself a Scala developer, I'll leave the parentheses and dots in place):

(1 to 20).map(???)

A simple implementation might use an if expression:

def fbi(x: Int): String = {
    if ((x % 15) == 0) "fizzbuzz"
    else if ((x % 3) == 0) "fizz"
    else if ((x % 5) == 0) "buzz"
    else s"$x"
}

It gets the job done, but looks too much like Java. We need to add some Scala-specific syntax:

def fbm(x: Int): String = x match {
    case n if ((n % 15) == 0) => "fizzbuzz"
    case n if ((n % 3) == 0) => "fizz"
    case n if ((n % 5) == 0) => "buzz"
    case n => s"$n"
}

You can argue whether this is better or worse. It seems to me that it just wraps the previous if with more cruft.* As our conversation devolved, though, it led to the following implementation, which is about as far from Java as I can imagine:

object Fizz {
    def unapply(x: Int): Boolean = ((x % 3) == 0)
}
  
object Buzz {
    def unapply(x: Int): Boolean = ((x % 5) == 0)
}
  
object FizzBuzz {
    def unapply(x: Int): Boolean = ((x % 15) == 0)
}
 
def fbme(x: Int): String = x match {
    case FizzBuzz() => "fizzbuzz"
    case Fizz() => "fizz"
    case Buzz() => "buzz"
    case n => s"$n"
}

Mind you, I don't think I'd want to hire someone who implemented FizzBuzz this way.

* You'll find a nicer match-based implementation at RosettaCode. Along with some versions that make my extractor-based implementation look sane.