Monday, October 27, 2014

JDK Bug (Maybe): Namespaced XPath Variables

I'm revising my XPath article, and stumbled onto some strange behavior with namespaced variables.

For those not intimately familiar with the XPath spec, a VariableReference is a dollar sign followed by a qualified name. A qualified name may include a namespace prefix, and the spec requires that prefixes be associated with namespace declarations.

Here are two example paths, one with a namespace prefix and one without.

    //baz[@name=$myname]
    //baz[@name=$ns:myname]

Seems simple enough; let's see some example code. SimpleNamespaceResolver is a class from the Practical XML library that manages a single-entry namespace context. I use it here because a complete and correct implementation of NamespaceContext would be a distraction.

import java.io.StringReader;

import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;

import org.w3c.dom.Document;

import org.xml.sax.InputSource;

import net.sf.practicalxml.xpath.SimpleNamespaceResolver;


public class VariableExample
{
    public static void main(String[] argv) throws Exception
    {
        String xml = "<foo>"
                   + "    <bar name='argle'>"
                   +          "Argle"
                   + "    </bar>"
                   + "    <bar name='bargle'>"
                   +          "Bargle"
                   + "        <baz>Baz</baz>"
                   + "    </bar>"
                   + "</foo>";
        
        Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new SimpleNamespaceResolver("ns", "foo"));
        xpath.setXPathVariableResolver(new XPathVariableResolver()
        {
            @Override
            public Object resolveVariable(QName var)
            {
                System.out.println("prefix    = " + var.getPrefix());
                System.out.println("namespace = " + var.getNamespaceURI());
                System.out.println("localName = " + var.getLocalPart());
                if (var.getLocalPart().equals("name"))
                    return "argle";
                else
                    return "";
            }
        });        
        
        String result = xpath.evaluate("//bar[@name=$ns:name]", dom);
        
        System.out.println("selected content: \"" + result + "\"");
    }
}

When we run this, it produces the following output:

prefix    = 
namespace = foo
localName = name
selected content: "Argle    "

The prefix has disappeared, but that's OK: we have the namespace. However, what if we comment out the setNamespaceContext() call?

prefix    = 
namespace = ns
localName = name
selected content: "Argle    "

The prefix has now become the namespace, without an exception being thrown.

Is this a real problem? I searched the Java Bug Parade and Apache Bug Database and didn't see anyone reporting it as an issue, so I have to assume the answer is “no.”

Perhaps nobody uses namespaced variable references in the real world. I think avoiding them is a good idea on principle: if you have so many variables that you need namespaces, your XPath expressions are probably too complex.

And given the possibility that a misspelling will cause your expressions to silently fail, I think it's a good idea in practice as well.

Tuesday, August 12, 2014

Scalarific FizzBuzz

This started as a lunchtime conversation about interviewing. I'm a big fan of FizzBuzz as a “screening” question: it weeds out the people that shouldn't be allowed near a computer (and, after conducting several hundred interviews, I can say that there's a depressingly large number of them, especially at a company without a preliminary phone screen).

For a Scala developer, what constitutes a good FizzBuzz? Clearly, it should be based around higher-order functions, such as map() (and, as I don't consider myself a Scala developer, I'll leave the parentheses and dots in place):

(1 to 20).map(???)

A simple implementation might use an if expression:

def fbi(x: Int): String = {
    if ((x % 15) == 0) "fizzbuzz"
    else if ((x % 3) == 0) "fizz"
    else if ((x % 5) == 0) "buzz"
    else s"$x"
}

It gets the job done, but looks too much like Java. We need to add some Scala-specific syntax:

def fbm(x: Int): String = x match {
    case n if ((n % 15) == 0) => "fizzbuzz"
    case n if ((n % 3) == 0) => "fizz"
    case n if ((n % 5) == 0) => "buzz"
    case n => s"$n"
}

You can argue whether this is better or worse. It seems to me that it just wraps the previous if with more cruft.* As our conversation devolved, though, it led to the following implementation, which is about as far from Java as I can imagine:

object Fizz {
    def unapply(x: Int): Boolean = ((x % 3) == 0)
}
  
object Buzz {
    def unapply(x: Int): Boolean = ((x % 5) == 0)
}
  
object FizzBuzz {
    def unapply(x: Int): Boolean = ((x % 15) == 0)
}
 
def fbme(x: Int): String = x match {
    case FizzBuzz() => "fizzbuzz"
    case Fizz() => "fizz"
    case Buzz() => "buzz"
    case n => s"$n"
}    

Mind you, I don't think I'd want to hire someone who implemented FizzBuzz this way.


* You'll find a nicer match-based implementation at RosettaCode. Along with some versions that make my extractor-based implementation look sane.

Friday, July 18, 2014

XPath String Comparisons

I recently learned something about XPath: relational operations don't compare strings. Quoting the XPath 1.0 spec:

When neither object to be compared is a node-set and the operator is <=, <, >= or >, then the objects are compared by converting both objects to numbers and comparing the numbers according to IEEE 754.

Here's an example program that demonstrates the problem: given a set of records, select those where a specific code is in the desired range:*

public class XPathRange1
{
    private final static String xml = "<root>"
                                    +     "<entry id='1' code='A'>foo</entry>"
                                    +     "<entry id='2' code='A'>bar</entry>"
                                    +     "<entry id='3' code='C'>baz</entry>"
                                    +     "<entry id='4' code='E'>argle</entry>"
                                    +     "<entry id='4' code='E'>bargle</entry>" 
                                    + "</root>";
    

    public static void main(String[] argv) throws Exception
    {
        Document dom = ParseUtil.parse(xml);
        
        List<Node> selection1 = new XPathWrapper("//entry[@code='C']").evaluate(dom);
        System.out.println("nodes selected by exact match = " + selection1.size());
        
        List<Node> selection2 = new XPathWrapper("//entry[(@code >= 'B') and (@code <= 'D')]").evaluate(dom);
        System.out.println("nodes selected by comparison  = " + selection2.size());
    }
}

When you run this, the equality test in the first path returns a single element, the entry with id 3. The second expression, which uses a range predicate, selects nothing — even though the expected code is within the specified range. The reason: for the relational operators, both operands are converted to numbers; a code like 'C' converts to NaN, and every comparison involving NaN is false.
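
You can see the conversion for yourself with a couple of one-off evaluations. Here's a quick sketch using the plain JAXP API against a dummy document (this snippet is mine, not part of the original example):

import java.io.StringReader;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

import org.xml.sax.InputSource;

public class StringCompareDemo
{
    public static void main(String[] argv) throws Exception
    {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // 'B' isn't a number, so it converts to NaN ...
        System.out.println(xpath.evaluate("number('B')", new InputSource(new StringReader("<x/>"))));   // prints NaN

        // ... and any comparison involving NaN is false
        System.out.println(xpath.evaluate("'B' < 'C'", new InputSource(new StringReader("<x/>"))));     // prints false
    }
}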

How to solve this problem? One approach is to use an XPath 2.0 implementation, such as Saxon. But, assuming that you're stuck with XPath 1.0, this is where a user-defined function will do the job:

public class XPathRange2
{
    private final static String xml = "<root>"
                                    +     "<entry id='1' code='A'>foo</entry>"
                                    +     "<entry id='2' code='A'>bar</entry>"
                                    +     "<entry id='3' code='C'>baz</entry>"
                                    +     "<entry id='4' code='E'>argle</entry>"
                                    +     "<entry id='4' code='E'>bargle</entry>" 
                                    + "</root>";
    
    private final static String FUNCTION_NS = "urn:uuid:" + UUID.randomUUID();
    
    private static class XPathRange implements XPathFunction
    {
        @Override
        @SuppressWarnings("rawtypes")
        public Object evaluate(List args) throws XPathFunctionException
        {
            try
            {
                java.util.Iterator argItx = args.iterator();
                NodeList selection = (NodeList)argItx.next();
                String lowerBound = (String)argItx.next();
                String upperBound = (String)argItx.next();
                
                String value = (selection.getLength() > 0) ? selection.item(0).getNodeValue() : "";
                return (value.compareTo(lowerBound) >= 0) && (value.compareTo(upperBound) <= 0);
            }
            catch (Exception ex)
            {
                throw new XPathFunctionException(ex);
            }
        }
    }
    

    public static void main(String[] argv) throws Exception
    {
        Document dom = ParseUtil.parse(xml);
        
        List<Node> selection1 = new XPathWrapper("//entry[@code='C']").evaluate(dom);
        System.out.println("nodes selected by exact match = " + selection1.size());
        
        List<Node> selection2 = new XPathWrapper("//entry[fn:inRange(@code, 'B', 'D')]")
                                .bindNamespace("fn", FUNCTION_NS)
                                .bindFunction(new QName(FUNCTION_NS, "inRange"), new XPathRange(), 3)
                                .evaluate(dom);
        System.out.println("nodes selected by function    = " + selection2.size());
    }
}
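
XPathWrapper's bindNamespace() and bindFunction() calls hide the standard JAXP plumbing. If you're curious what that plumbing looks like, here's a rough equivalent of main() written against the plain javax.xml.xpath API (a sketch of mine; it reuses the xml, FUNCTION_NS, and XPathRange members from the listing above, plus SimpleNamespaceResolver from Practical XML):

public static void main(String[] argv) throws Exception
{
    Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                   .parse(new InputSource(new StringReader(xml)));

    XPath xpath = XPathFactory.newInstance().newXPath();

    // the prefix "fn" must resolve to the same URI that the resolver checks below
    xpath.setNamespaceContext(new SimpleNamespaceResolver("fn", FUNCTION_NS));
    xpath.setXPathFunctionResolver(new XPathFunctionResolver()
    {
        @Override
        public XPathFunction resolveFunction(QName functionName, int arity)
        {
            if (functionName.equals(new QName(FUNCTION_NS, "inRange")) && (arity == 3))
                return new XPathRange();
            return null;
        }
    });

    NodeList selection = (NodeList)xpath.evaluate("//entry[fn:inRange(@code, 'B', 'D')]", dom, XPathConstants.NODESET);
    System.out.println("nodes selected by function    = " + selection.getLength());
}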

* All examples in this post use the Practical XML library to reduce boilerplate code.

Saturday, July 12, 2014

Hotspot In Action

Yesterday I was writing some code that used the output of System.currentTimeMillis() as a database key. Yeah, yeah, that's a really bad idea, I know. But I figured that it was going to be a low-volume operation, and I added retry logic, and … I started to wonder about the resolution of the system clock, as returned by System.currentTimeMillis().

Returns the current time in milliseconds. Note that while the unit of time of the return value is a millisecond, the granularity of the value depends on the underlying operating system and may be larger. For example, many operating systems measure time in units of tens of milliseconds.

So, having some time to fill during a build, I wrote a program to determine the granularity of my operating system(s).

public static void main(String[] argv)
throws Exception
{
    ConcurrentHashMap<Long,AtomicLong> counters = new ConcurrentHashMap<Long,AtomicLong>();
    long finish = System.currentTimeMillis() + 1000;
    long current = 0;
    while ((current = System.currentTimeMillis()) < finish)
    {
        incrementCounter(counters, current);
    }

    System.out.println("number of counters = " + counters.size());
    TreeMap<Long,AtomicLong> displayMap = new TreeMap<Long,AtomicLong>(counters);
    for (Map.Entry<Long,AtomicLong> entry : displayMap.entrySet())
    {
        System.out.println(entry.getKey() + " = " + entry.getValue().longValue());
    }
}

private static void incrementCounter(ConcurrentHashMap<Long,AtomicLong> counters, long current)
{
    Long key = new Long(current);
    AtomicLong counter = counters.get(key);
    if (counter == null)
    {
        counters.putIfAbsent(key, new AtomicLong());
        counter = counters.get(key);
    }
    counter.incrementAndGet();
}

I'm not sure why I decided to use a Map, rather than simply storing the distinct timestamps in a Set (and yes, rather than roll my own counter map, I should have used an existing implementation). But the map gave me some interesting results:

number of counters = 1000
1405105663691 = 805
1405105663692 = 775
1405105663693 = 808
1405105663694 = 796
1405105663695 = 799
1405105663696 = 789
1405105663697 = 821
1405105663698 = 820
1405105663699 = 818
1405105663700 = 792
1405105663701 = 788
1405105663702 = 821
1405105663703 = 740
1405105663704 = 1103
1405105663705 = 1246
1405105663706 = 1921
1405105663707 = 2959
1405105663708 = 5960
1405105663709 = 6242
1405105663710 = 12867
1405105663711 = 13331
1405105663712 = 12883
1405105663713 = 13436
1405105663714 = 13457
1405105663715 = 19267
1405105663716 = 20521
1405105663717 = 21728

OK, first thing that jumps out is that Linux provides 1ms resolution (Windows is 10). But far more interesting is the jump in the number of counts captured: from approximately 800 in the first dozen measurements, then increasing in a few sharp jumps, ending up in the 20,000 range and staying there for the rest of the samples. This is a dramatic demonstration of the power of Hotspot to optimize code on the fly. If you want to see what it's really doing, here's a version with diagnostics turned on:

> java -cp target/classes/ -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintCompilation experiments.ClockResolution
     45   1       java.lang.Object::<init> (1 bytes)
     46   2       java.util.concurrent.ConcurrentHashMap::hash (49 bytes)
---   n   java.lang.System::currentTimeMillis (static)
---   n   sun.misc.Unsafe::compareAndSwapLong
     47   3       java.util.concurrent.ConcurrentHashMap::segmentFor (17 bytes)
     47   4       java.lang.Long::hashCode (14 bytes)
     47   5       java.lang.Number::<init> (5 bytes)
     47   6       java.util.concurrent.ConcurrentHashMap$Segment::get (66 bytes)
     48   7       java.util.concurrent.ConcurrentHashMap::get (19 bytes)
Inlining intrinsic _hashCode (virtual) at bci:1 in java.util.concurrent.ConcurrentHashMap::get (19 bytes)
     49   8       java.util.concurrent.ConcurrentHashMap$Segment::getFirst (14 bytes)
     49   9       experiments.ClockResolution::incrementCounter (54 bytes)
     50  10       java.lang.Long::<init> (10 bytes)
     50  11       java.lang.Long::equals (30 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
     51  12       java.lang.Long::longValue (5 bytes)
     51  13       java.util.concurrent.atomic.AtomicLong::incrementAndGet (23 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
     51  14       java.util.concurrent.atomic.AtomicLong::get (5 bytes)
     51  15       java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
     52   1%      experiments.ClockResolution::main @ 22 (159 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _currentTimeMillis at bci:28 in experiments.ClockResolution::main (159 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
Inlining intrinsic _compareAndSwapLong at bci:9 in java.util.concurrent.atomic.AtomicLong::compareAndSet (13 bytes)
   1033   1%     made not entrant  experiments.ClockResolution::main @ -2 (159 bytes)
number of counters = 1000
...

Friday, July 11, 2014

Namespaces Aren't Just Attributes

Here's some code that builds an XML document and prints it (you'll find OutputUtil in the Practical XML library; it's a lot easier than configuring a serializer):

private static Document createDom() throws Exception
{
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    Document dom = dbf.newDocumentBuilder().newDocument();
    
    Element root = dom.createElement("data:root");
    dom.appendChild(root);
    root.setAttribute("xmlns:data", "http://www.example.com");
    root.setTextContent("foo");
    
    return dom;
}

public static void main(String[] argv)
throws Exception
{
    Document dom = createDom();
    System.out.println(OutputUtil.compactString(dom));
}

If you run this code, it produces the following output:

<data:root xmlns:data="http://www.example.com">foo</data:root>

Looks good; if you were using this output to debug a program, you might think that createDom() was doing the right thing. And you might be very confused to discover that your XPath expressions weren't returning anything (again, I'm using Practical XML to reduce boilerplate):

public static void main(String[] argv)
throws Exception
{
    Document dom = createDom();
    XPathWrapper xpath = new XPathWrapper("//data:root").bindNamespace("data", "http://www.example.com");
    System.out.println(xpath.evaluateAsString(dom));
}

At this point you might blame the library, and use the XPath implementation provided by the JDK. And you'd get the same result. To debug this problem, you need to be more creative:

public static void main(String[] argv)
throws Exception
{
    Document dom = createDom();
    Element root = dom.getDocumentElement();
    System.out.println("nodeName      = " + root.getNodeName());
    System.out.println("namespace URI = " + root.getNamespaceURI());
    System.out.println("prefix        = " + root.getPrefix());
    System.out.println("localname     = " + root.getLocalName());
}

Run this and you'll see the following output:

nodeName      = data:root
namespace URI = null
prefix        = null
localname     = null

What's happening here is that creating an xmlns attribute doesn't affect a node's namespace. The DOM — in Java and other language bindings — requires you to create the element with an explicit namespace-aware method:

Element root = dom.createElementNS("http://www.example.com", "data:root");
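
Put together, a corrected createDom() might look something like the sketch below (mine, not the original; the explicit declaration is written with setAttributeNS(), since xmlns attributes live in their own namespace, although many serializers will emit the declaration for you even without it):

private static Document createDom() throws Exception
{
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    Document dom = dbf.newDocumentBuilder().newDocument();

    // the element now belongs to the namespace; its qualified name is still data:root
    Element root = dom.createElementNS("http://www.example.com", "data:root");
    dom.appendChild(root);

    // the declaration itself is a namespaced attribute, in the xmlns namespace
    root.setAttributeNS("http://www.w3.org/2000/xmlns/", "xmlns:data", "http://www.example.com");

    root.setTextContent("foo");
    return dom;
}

Run the diagnostic snippet above against this version and you'll see the namespace URI, prefix, and localname filled in.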

But what do you do if the createDom() method was written by some other team, and they don't want to listen to you telling them they're doing things wrong? If you can't get them fired, you can always serialize and reparse the XML:

public static void main(String[] argv)
throws Exception
{
    Document dom1 = createDom();
    String xml = OutputUtil.compactString(dom1);
    Document dom2 = ParseUtil.parse(xml);
    
    Element root = dom2.getDocumentElement();
    System.out.println("nodeName      = " + root.getNodeName());
    System.out.println("namespace URI = " + root.getNamespaceURI());
    System.out.println("prefix        = " + root.getPrefix());
    System.out.println("localname     = " + root.getLocalName());
}

Monday, June 16, 2014

Finalizers

I recently saw a comment on my reference objects article that said, in effect, “he mentioned finalizers but didn't say DON'T USE THEM EVER, so I stopped reading.” The capitalized words are taken exactly from the original comment; the rest is a paraphrase to tone down the rhetoric. The comment didn't bother me — if you stop reading so easily, you probably won't gain much from my articles anyway — but the attitude did. Because finalizers do indeed have a role to play.

They compensate for sloppy programming.

Good Java programmers know that non-memory resource allocation has to take place in a try / catch / finally construct. Just like good C programmers know that every malloc() must be matched by a free(). Sloppy programmers either don't know or don't care. And good programmers sometimes forget.

The question then becomes: what should the program do?

One alternative is to just explode. And this is not necessarily a bad thing: not closing your resources is a bug, and it's better to expose bugs early in the development process. In this view, forgetting to close your resource is no different than dereferencing a null pointer.

The problem with this view is that null pointer exceptions tend to show up early, the first time through the affected code. By comparison, leaked resources tend to hide until production, because you never generate sufficient load during development. Leaked file handles are a great example: a typical developer environment allows 4,096 open files. It will take a long time to run out of them, especially if you constantly restart the app. A typical server environment might allow twice or four times that much, but if you never shut down the app you'll eventually run out. Probably at your time of highest load.

And that leads to an alternative solution: protect the programmers from themselves, by checking for open resources at the time the owning object is garbage collected. This isn't perfect: in a server with lots of memory, the garbage collection interval might exceed the time taken to exhaust whatever resource you're leaking.*

Once you accept the need for some alternative way to clean up non-memory resources, the only remaining question is how. Java provides two mechanisms: finalizers and (since 1.2) phantom references.

So which should you use? Before answering that question, I want to point out something to the DON'T USE THEM EVER crowd: finalizers and phantom references are invoked under exactly the same conditions: when the garbage collector decides that an object is eligible for collection. The difference is what happens afterward. With phantom references the object gets collected right away, and the reference is put on a queue for later processing by the application. With finalizers, the object is passed to a JVM-managed thread that runs the finalize() method; actual collection happens once the finalizer runs.
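
To make that difference concrete, here's roughly what the phantom-reference approach involves (a sketch of mine, not production code: ResourceTracker is a hypothetical class, and a real implementation has to decide when and where to drain the queue):

import java.io.Closeable;
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.util.concurrent.ConcurrentHashMap;

public class ResourceTracker
{
    private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
    private final ConcurrentHashMap<Reference<?>,Closeable> cleanups
            = new ConcurrentHashMap<Reference<?>,Closeable>();

    // call when the owning object acquires its resource
    public void track(Object owner, Closeable resource)
    {
        cleanups.put(new PhantomReference<Object>(owner, queue), resource);
    }

    // call periodically, or from a dedicated thread: a reference showing up on
    // the queue means its owner was collected without releasing the resource
    public void reap()
    {
        Reference<?> ref;
        while ((ref = queue.poll()) != null)
        {
            Closeable resource = cleanups.remove(ref);
            try
            {
                if (resource != null)
                    resource.close();
            }
            catch (Exception ex)
            {
                // a real implementation would at least log the failure
            }
        }
    }
}

Compare that with a finalizer, which is a half-dozen lines inside the class that owns the resource.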

Given the choice between a finalizer and a phantom reference, I will pick the finalizer 95% of the time. Phantom references require a lot of work to implement correctly. In the normal case, the only real benefit they provide is to avoid out-of-memory errors from large objects with long-running finalizers.**

A better answer is to avoid both. Far too many people think of finalizers as equivalent to C++ destructors, a place to put important cleanup code such as transaction commit/rollback. This is simply wrong. But it's just as wrong to invoke that code in a phantom reference.

Rather than dogmatically insist “no finalizers!”, I think the better approach is to adopt some development practices to prevent their invocation. One approach is to identify resources that are allocated but not released, using an analysis tool like FindBugs.

But if you want to recover from mistakes in the field, and can do so quickly and cleanly (so that you don't block the finalization thread), don't be afraid of finalizers.


* This is unlikely: if you're running through resources quickly, the associated objects will almost certainly be collected while still in the young generation. At this point, a pedantic pinhead will say “but the JLS doesn't guarantee that the garbage collector will ever run.” My best response to that: “well, then you've got bigger problems.”

** A bigger benefit, and this is the 5% where I would use them, is to provide leaked-object tracking. For example, in a database connection pool. But I don't consider that the “normal” case.

Monday, June 2, 2014

Is that SSD Really Helping Your Build Times?

Update: I ran similar tests with full-disk encryption.

As developers, we want the biggest bad-ass machine that we can get, because waiting for the computer is so last century. And part of a bad-ass machine is having a solid-state drive, with sub-millisecond latency. Spinning platters covered in rust are not just last-century, they're reminiscent of the industrial age. But do we really benefit from an SSD?

This post emerged from a conversation with a co-worker: he was surprised that I encrypted my home directory, because of the penalty it caused to disk performance. My response was that I expected most of my files to be living in RAM, unencrypted, in the disk buffer. That led to a discussion about whether an SSD provided any significant benefit, given enough RAM to keep your workspace in the buffer cache. Turns out I was wrong about unencrypted data in the cache, but not about the SSD.

I was confident about the latter because a year ago, when I built my then-seriously-badass home computer (32 GB RAM — because I could), I ran some performance comparisons against my then-seriously-pathetic vintage 2002 machine. The new machine blew away the old, but much of the performance gain seemed to come from CPU-related items: faster clock speed, faster memory, huge L1 and L2 caches, and so on. Once CPU time was deducted, the difference between spinning rust and SSD wasn't that big.

I started to write a post at that time, but went down a rathole of trying to create a C program that could highlight the difference in L1/L2 cache. Then the old machine suffered an “accident,” and that was the end of the experiments.

Now, however, I have a far simpler task: quantify the difference that an SSD makes to a developer's workload. Which can be rephrased as “will buying an SSD speed up my compile times?” This is particularly important to me right now, because I'm on a project where single-module compiles take around a minute, and full builds are over 30.

Here's the experimental protocol:

Hardware:

  • Thinkpad W520: 4-core Intel Core i7-2860QM CPU @ 2.50GHz, 800MHz FSB. A bad-ass laptop when I got it (I just wish it wasn't so heavy).
  • 8 GB RAM, 8 MB L2 cache
  • Intel “320 Series” SSD, 160 GB, formatted as ext4. This is not a terribly fast drive, but with an average access time of 0.2 ms, and an average read rate of 270 MB/sec (as measured by the Gnome disk utility), it blows away anything with a platter.
  • Western Digital WD2500BMVU, 250 GB, 5400 RPM, formatted as ext4, accessed via USB 2.0. This is a spare backup drive; I don't think I own anything slower unless I were to reformat an old Apple SCSI drive and run it over a USB-SCSI connector (and yes, I have both). Average access time: 17.0 ms; average read rate: 35 MB/sec.
  • Xubuntu 12.04, 3.2.0-63-generic #95-Ubuntu SMP.

Workload:

  • Spring Framework 3.2.0.RELEASE. A large Java project with lots of dependencies, this should be the most disk-intensive of the three sample workloads. The build script is Gradle, which downloads and caches all dependencies.*
  • Scala 2.11.1. I'm currently working on a Scala project, and the Scala compiler itself seemed like a good sample. The main difference between Scala and Java (from a workload perspective) is that the Scala compiler does a lot more CPU-intensive work; in the office I can tell who's compiling because their CPU fan sounds like a jet engine spooling up. The build script is Ant, using Ivy to download and cache dependencies.**
  • GNU C Compiler 4.8.3. Added because not everyone uses the JVM. I didn't look closely at the makefile, but I'll assume that it has optimization turned up. Disk operations should be confined to reading source files, and repeated reads of header files.

Test conditions:

General configuration:

  • Each test is conducted as a distinct user, with its own home directory, to ensure that there aren't unexpected cross-filesystem accesses.
  • Each build is run once to configure (gcc) and/or download dependencies.
  • Timed builds are run from normal desktop environment, but without any other user programs (eg: browser) active.
  • Timed builds run with network (wifi and wired) disconnected.
  • The Spring and Scala times are an average of three runs. The gcc time is from a single run (I didn't have the patience to do repeated multi-hour builds, just to improve accuracy by a few seconds).

Per-test sequence:

  • Clean build directory (depends on build tool).
  • Sync any dirty blocks to disk (sync).
  • SSD TRIM (fstrim -v /)
  • Clear buffer cache (echo 3 > /proc/sys/vm/drop_caches)
  • Execute build, using time.

And now, the results. Each entry in the table contains the output from the Unix time command, formatted real / user / sys. I've converted all times to seconds, and rounded to the nearest second. The only number that really matters is the first, “real”: it's the time that you have to wait until the build is done. “User” is user-mode CPU time; it's primarily of interest as a measure of how parallel your build is (note that the JVM-based builds are parallel, the gcc build isn't). “Sys” is kernel-mode CPU time; it's included mostly for completeness, but notice the difference between encrypted and non-encrypted builds.

                          Spring Framework     Scala               GCC
  Unencrypted SSD         273 / 527 / 10       471 / 1039 / 13     6355 / 5608 / 311
  Encrypted SSD           303 / 534 / 38       491 / 1039 / 29     6558 / 5682 / 400
  USB Hard Drive          304 / 525 / 11       477 / 1035 / 14     6462 / 5612 / 311
  Encryption Penalty      11 %                 4 %                 3 %
  Spinning Rust Penalty   11 %                 1 %                 2 %

Do the numbers surprise you? I have to admit, they surprised me: I didn't realize that the penalty for encryption was quite so high. I haven't investigated, but it appears that ecryptfs does not maintain decrypted block buffers. Instead, the buffered data is cached in encrypted form and has to be decrypted on access. This explains the significantly higher sys numbers. Of course, losing my laptop with unencrypted client data has its own penalty, so I'm willing to pay the encryption tax.

As for the difference between the SSD and hard drive: if you look at your drive indicator light while compiling, you'll see that it really doesn't flash much. Most of a compiler's work is manipulating data in-memory, not reading and writing. So the benefit that you'll get from those sub-millisecond access times is just noise.

On the other hand, if you're doing data analysis with large datasets, I expect the numbers would look very different. I would have killed for an SSD 3-4 years ago, when I was working with files that were tens of gigabytes in length (and using a 32 GB Java heap to process them).

Finally, to borrow an adage from drag racers: there's no substitute for RAM. With 8 GB, my machine can spare a lot of memory for the buffer cache: free indicated 750 MB after the Scala build, and several gigabytes after the gcc build. Each block in the cache is a block that doesn't have to be read from the disk, and developers tend to hit the same blocks over and over again: source code, the compiler executable, and libraries. If you have enough RAM, you could conceivably load your entire development environment with the first build Monday morning, and not have to reload it all week.

At least, that's what I told myself to justify 32 GB in my home computer.


* I picked this particular version tag because it mostly builds: it fails while building spring-context, due to a missing dependency. However, it spends enough time up to that point that I consider it a reasonable example of a “large Java app.” I also tried building the latest 3.X tag, but it fails right away due to a too-long classname. That may be due to the version of Groovy that I have installed, but this experience has shaken my faith in the Spring framework as a whole.

** Scala has issues with long classnames as well, which means that the build will crash if you run it as-is on an encrypted filesystem (because encryption makes filenames longer). Fortunately, there's an option to tell the compiler to use shorter names: ant -Dscalac.args='-Xmax-classfile-name 140'

What if you don't have a lot of memory to spare? I also tried running the Spring build on my anemic-when-I-bought-it netbook: an Intel Atom N450 @ 1.66 GHz with 1 GB of RAM and a 512 KB L2 cache. The stock hard drive is a Fujitsu MJA2250BH: 250 GB, 5400 RPM, an average read rate of 72 MB/sec, and an average access time of 18 ms. I also have a Samsung 840 SSD that I bought when I realized just how anemic this machine was, thinking that, if nothing else, it would act as a fast swap device. However, it doesn't help much: with the stock hard drive, Spring builds in 44 minutes; with the SSD, 39. An 11% improvement, but damn! If you look at the specs, that netbook is a more powerful machine than a Cray-1. But it's completely unusable as a modern development platform.

Friday, May 23, 2014

The Uncanny Valley of Programming Languages

The uncanny valley was a term coined by Masahiro Mori to describe human interaction with robots. He posited that, “in climbing toward the goal of making robots appear human, our affinity for them increases until we come to a valley […]”. Things in this valley (robots, but also artificial limbs), by being humanlike but cold and unmoving, evoke a visceral negative reaction. If you found The Polar Express creepy, you've experienced this valley.

I've been thinking about how the idea of an uncanny valley might apply to programming languages. On the one side of the valley, you have the (unattainable?) perfect language. On the other, you have an array of languages: some are missing features, some have inconvenient structure, some have inconsistent behavior. But we accept those quirks, find them charming even. It's easy to imagine a Lisp programmer on this side of the valley, surrounded by adorable little parentheses.

And in the middle of the valley are those languages that come close to perfect but aren't quite there. If you read my last post, I think you know at least one of the languages that I see in that valley. It has many features that I desire, but they're … not … quite … right.

Monday, May 5, 2014

Thoughts on Scala After Using It for Six Months

I started working with Scala professionally last September. At the time, I wrote a post about the features of the language that I liked and didn't like, from the perspective of an experienced developer new to the language. My plan was to write a retrospective after several months of use, looking at how my feelings about those features had changed over time.

I wrote several drafts that followed that theme, but liked none of them. While I certainly built up a list of likes and dislikes, I found that those were overwhelmed by two general themes: I find the language clumsy, and it requires too much mental effort — effort that is taken away from whatever problem I'm trying to solve.

With that sentence, you may think the rest of this post is a (long) rant. That isn't my intent. I'm not trying to convince anyone that Scala is a horrible language; in fact, there are many parts of it that I quite like. It's simply one person's experience and opinion, with illustrative examples. Feel free to disagree with my points and with the examples I chose.

I'll start with clumsiness. The example that comes to mind most quickly is the special syntax needed for one function with a variable argument list to call another with the same argument list:

  def foo(xs: Int*) = ???
  
  def bar(xs: Int*) = {
    foo(xs)     // this won't compile
    foo(xs:_*)  // this will
  } 

I have no doubt that there is a theoretical reason for this behavior. From the practical perspective, however, it's annoying: I can't take a parameter and pass it as an argument, even though the values are declared identically. It's not a big annoyance (and I find that I use varargs far less in Scala than in Java), but every time that I run into it — or any other syntactical oddity — I have to stop and think about why the compiler is complaining.*

And that is part of my second issue: high mental effort, because I need to think about what the compiler is doing rather than what my code is doing. Two examples of this are type inference and for comprehensions.

My issues with Scala type inference surprised me. I've worked with duck-typed languages, and never felt that their complete absence of type information impeded my understanding of code. With Scala, however, I don't know what I'd do without my IDE showing me type info on hover (and, unfortunately, it doesn't do that very well).

I think that the reason is that Scala functions tend to be more complex, with chains of higher-order function calls that may themselves require type inferencing from other functions. Such code usually makes perfect sense, but requires the programmer to carefully piece together what's happening at each step. I've adopted the habit of adding a return type specification to every function and minimizing the complexity of anonymous functions, and find these go a long way toward demystifying such constructs.

Something that I find harder to resolve are for-comprehensions. I first worked with for-comprehensions (aka list-comprehensions) in Python, and the mental model from that experience was reinforced when I learned Erlang. In both of these languages, a for-comprehension translates into nested iteration (with access to enclosing scope). **

Scala, by comparison, translates a for-comprehension into map() and flatMap() calls. On the one hand, this lets you do cool things like stringing together operations that return an Option: if any of the operations return None, the operation short-circuits and returns None. On the other hand, it makes you dependent on how a particular class implements those methods.

Here's a somewhat contrived example that represents a common use of for-comprehensions: flattening a hierarchical data structure.

val data = List(
            "foo" -> List("foo", "bar", "baz"),
            "argle" -> List("argle", "bargle", "wargle"))

val result = for {
  (key, values) <- data
  value <- values
} yield (key, value) 

The result is a list of tuples: ("foo", "foo"), ("foo", "bar"), and so on. Now a slight change:

val data = Map(
            "foo" -> List("foo", "bar", "baz"),
            "argle" -> List("argle", "bargle", "wargle"))

val result = for {
  (key, values) <- data
  value <- values
} yield (key, value) 

This comprehension translates into an identical sequence of calls, but now the outermost flatMap() is called on Map rather than List. This means that every tuple produced by yield is added to the map, with its first member as the key. And that means that the result contains only two entries, as repeated tuples with the same key are discarded.

You can look at this, say that it's not something you're likely to do, and moreover, that programmers should know the types of their data. But consider the case where data is actually a function call that's defined in some other module. That function originally returned the List, but then some developer noted that the keys are all unique, needed a Map for some other piece of code, and made the change. At that point, your for-comprehension has silently broken.

I'm going to give one more for-comprehension example; this one doesn't compile.

val data = Map(
            "foo" -> List("foo", "bar", "baz"),
            "argle" -> List("argle", "bargle", "wargle"))

val key = "foo"
val result = for {
  values <- data.get(key)
  value <- values
} yield (key, value)

The problem here is that Map.get() returns an Option, and Option.flatMap() expects a function that returns an Option. But the generator value <- values returns a List. To make this compile, you need to turn the Option into a sequence:

val result = for {
  values <- data.get(key).toSeq
  value <- values
} yield (key, value)

The overall issue is one of mental models. In Erlang, for-comprehensions represent a very simple mental model that can be applied identically in all cases. In Scala, the mental model seems equally simple at first, but in reality it changes depending on runtime data types. To use a Scala for-comprehension effectively, the programmer has to spend time thinking about how a particular class implements map() and flatMap() — and hope that nobody else mucks with the data.

How much mental effort? That's hard to quantify, but subjectively I feel that I take twice as long to do a task in Scala, even when the task is one that's most naturally implemented in a functional style.

Perhaps this is an indictment of my mental capacity, rather than the language. Or perhaps six months just isn't enough time to become productive with Scala. Either of those cases, however, raises the question of whether Scala is an appropriate language for an average development team. Because, regardless of whatever nice features a language provides, you don't want to choose a language that reduces productivity.


* My comment when I ran into this issue: “Whatever would Scala programmers do without an underbar to smooth over the rough spots?”

** To me, coming from a database background, the Erlang approach is very natural: it's equivalent to a query, with joins and predicates. In Scala, as long as every term produces a Seq, the behavior is identical.

Tuesday, April 29, 2014

Mock Objects are a Testing Smell

I like mock objects, at least in principle. I think that well-designed object-oriented programs consist of collaborating objects, and mocks do a good job of capturing that collaboration: the general idea is “push on A and verify something happens at B.”

But I feel uncomfortable when I see tests that use multiple mocks, or set many expectations. These tests have moved beyond testing behavior to testing implementation, and are tightly coupled to the mainline code. These tests are brittle, breaking at the slightest change to the mainline code. And on the flip side, the mainline code becomes much harder to refactor, because changes will break the tests.

This is a particular problem for service-level objects. To see what I mean, consider a simple service to manage bank accounts:

public interface BankService {
    BigDecimal getBalance(String accountId) throws BankException;
    List<Transaction> getTransactions(String accountId) throws BankException;

    void deposit(String accountId, BigDecimal amount) throws BankException;
    void withdraw(String accountId, BigDecimal amount) throws BankException;
}

To support this service, we need a data-access object, which can be equally simple:

public interface AccountRepository {
    Account find(String accountId);

    void update(Account account);
}

This seems to be a good use for mock objects: to test deposit(), you create a mock for AccountRepository and set expectations on find() and update(). Actually making this work, however, is not quite as easy as describing it. Here's an implementation using EasyMock:

public class TestAccountOperations {
    private static class AccountBalanceMatcher
    implements IArgumentMatcher {
        BigDecimal expected;

        public AccountBalanceMatcher(BigDecimal expected) {
            this.expected = expected;
        }

        @Override
        public boolean matches(Object argument) {
            Account account = (Account)argument;
            return account.getBalance().equals(expected);
        }

        @Override
        public void appendTo(StringBuffer sb) {
            sb.append(expected);
        }
    }

    public static Account eqBalance(BigDecimal expected) {
        EasyMock.reportMatcher(new AccountBalanceMatcher(expected));
        return null;
    }

    private final String accountId = "test-account-1234";
    
    private Account account = new Account(accountId);
    private AccountRepository mockAccountRepo = createMock(AccountRepository.class);
    private BankService service = new BankServiceImpl(mockAccountRepo);

    @Test
    public void testDepositHappyPath() throws Exception {
        final BigDecimal initialBalance = new BigDecimal("300.00");
        final BigDecimal depositAmount = new BigDecimal("200.00");
        final BigDecimal expectedBalance = initialBalance.add(depositAmount);
        
        account.setBalance(initialBalance);

        expect(mockAccountRepo.find(accountId)).andReturn(account);
        mockAccountRepo.update(eqBalance(expectedBalance));
        replay(mockAccountRepo);

        service.deposit(accountId, depositAmount);
        
        verify(mockAccountRepo);
    }
}

That's a lot of code. More than half of it is infrastructure, needed so that EasyMock can validate the account balance in the call to update(). It is, however, relatively readable, and the mock object allows us to easily test “sad path” cases such as an invalid account number simply by replacing andReturn() with andThrow().
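
A sad-path test might look something like this (a sketch: it assumes the repository signals a bad account id with an unchecked exception, and that BankServiceImpl translates it into a BankException; the real classes may behave differently):

@Test(expected=BankException.class)
public void testDepositInvalidAccount() throws Exception {
    // assumption: BankServiceImpl wraps repository failures in BankException
    expect(mockAccountRepo.find(accountId)).andThrow(new IllegalArgumentException("no such account: " + accountId));
    replay(mockAccountRepo);

    service.deposit(accountId, new BigDecimal("100.00"));
}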

However, the happy-path test is missing something important: it doesn't validate that we update the transaction log. How do you discover this? You can't rely on a coverage report, because you're not actually exercising mainline code. You just have to know.

Fortunately, it's easy to add another mock (and it's also time to refactor the EasyMock extensions into their own class):

public class TestAccountOperations {
    private final String accountId = "test-account-1234";
    
    private Account account = new Account(accountId);
    private AccountRepository mockAccountRepo = createMock(AccountRepository.class);
    private TransactionRepository mockTxRepo = createMock(TransactionRepository.class);
    private BankService service = new BankServiceImpl(mockAccountRepo, mockTxRepo);

    @Test
    public void testDepositHappyPath() throws Exception {
        final BigDecimal initialBalance = new BigDecimal("300.00");
        final BigDecimal depositAmount = new BigDecimal("200.00");
        final BigDecimal expectedBalance = initialBalance.add(depositAmount);
        final Transaction expectedTx = new Transaction(accountId, TransactionType.DEPOSIT, depositAmount);
        
        account.setBalance(initialBalance);

        expect(mockAccountRepo.find(accountId)).andReturn(account);
        mockAccountRepo.update(EasyMockSupport.eqBalance(expectedBalance));
        mockTxRepo.store(EasyMockSupport.eqTransaction(expectedTx));
        replay(mockAccountRepo, mockTxRepo);

        service.deposit(accountId, depositAmount);
        
        verify(mockAccountRepo, mockTxRepo);
    }
}

We're done, although to my eye it's cluttered: you don't immediately see what is being tested, because you're distracted by how it's being tested.

Time marches on, and another department introduces a fraud-prevention service. We need to integrate with it, and verify that we reject transactions when the fraud service throws an exception. This is a case where mock-object testing is at its best: rather than understand the conditions that trigger a fraud exception, we just need to tell the mock to create one:

@Test(expected=BankException.class)
public void testWithdrawalFraudFailure() throws Exception {
    final BigDecimal initialBalance = new BigDecimal("300.00");
    final BigDecimal withdrawalAmount = new BigDecimal("200.00");
    
    account.setBalance(initialBalance);
    
    expect(mockAccountRepo.find(accountId)).andReturn(account);
    mockFraudService.check(accountId, FraudService.TransactionType.WITHDRAWAL, withdrawalAmount);
    expectLastCall().andThrow(new FraudException(accountId, "test of fraudulent transaction"));
    replay(mockAccountRepo, mockTxRepo, mockFraudService);

    service.withdraw(accountId, withdrawalAmount);
    
    verify(mockAccountRepo, mockTxRepo, mockFraudService);
}

As I said, this is one of the true benefits of mocks, but it comes at a cost: we need to update all of our tests to include an expectation on the service. When version 2.0 of the fraud service gets released, with an all-new API that doesn't use exceptions, the update to our mainline code is simple; our test library, now consisting of dozens of tests, not so much.

So here you are, spending hours updating your tests, while your manager is wondering why the team “wastes so much effort” on something that “clearly adds little value.” Perhaps you start to feel the same way, especially after you've spent a few hours trying to understand a test that hides its purpose within a mass of expectations.

But what's the alternative? Do you throw away the tests and hope that everything just works?

In my opinion, the only valid test for a service-layer object is an integration test, backed by a real (albeit preferably in-memory) database. Once you get rid of the mocks, not only do your tests become less brittle, but they clearly demonstrate the behavior that you're trying to test.

@Test
public void testDepositHappyPath() throws Exception {
    final BigDecimal initialBalance = new BigDecimal("300.00");
    final BigDecimal depositAmount = new BigDecimal("200.00");

    final BigDecimal expectedBalance = initialBalance.add(depositAmount);
    final Transaction expectedTx = new Transaction(accountId, TransactionType.DEPOSIT, depositAmount);
    
    // create account as part of @Before
    List<Transaction> initialTxs = service.getTransactions(accountId);
    service.deposit(accountId, depositAmount);
    List<Transaction> finalTxs = service.getTransactions(accountId);
     
    assertEquals("balance reflects deposit", expectedBalance, service.getBalance(accountId));
    assertEquals("transaction added by deposit", initialTxs.size() + 1, finalTxs.size());
    assertEquals("correct transaction", expectedTx, finalTxs.get(finalTxs.size() - 1));
}

That's not to say that you need to wire together the entire system for every test. As I said above, the fraud service is a good use for a stand-in. But that stand-in doesn't need to be an expectation-setting mock, it can be a behavior-delivering stub.
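
For the fraud service, such a stub might look something like this (a sketch; the FraudService interface is inferred from the tests above, so treat the signatures as assumptions to be adjusted):

public class StubFraudService implements FraudService {
    private final String fraudulentAccountId;

    public StubFraudService(String fraudulentAccountId) {
        this.fraudulentAccountId = fraudulentAccountId;
    }

    @Override
    public void check(String accountId, TransactionType type, BigDecimal amount) throws FraudException {
        // behave like the real service: reject the one account designated as fraudulent
        if (accountId.equals(fraudulentAccountId)) {
            throw new FraudException(accountId, "stubbed fraudulent transaction");
        }
    }
}

Happy-path tests construct it with an account id that will never match; the fraud-rejection test constructs it with the account under test. Either way, the test reads as behavior, not as a list of expectations.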

Wednesday, February 26, 2014

Nosy Web Spiders

The other day I was Googling for some technical detail that I'd written about (you think I remember where I put stuff?), and saw some surprising results: in addition to the public web page that I wanted, I also saw the PHP “fragment” that I use to build the public page. This was surprising because there aren't any links to that page on my site; the only reference to it is in a table that associates the page name to a filename.

Looking at the structure of my site, I realized what had happened. In my original site design, I had put all of the page-related content in the same directory as the fragment file. A few years ago I re-organized the files, but left one presentation in place; it was linked by a couple of sites, and I had no way at that time to do a 301 redirect. While that decision seemed harmless, it left me open to the following:

66.249.73.164 - - [30/Aug/2013:15:21:34 -0700] "GET /programming/ HTTP/1.1" 200 1542 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.164 - - [30/Aug/2013:15:23:03 -0700] "GET /programming/intro.frag HTTP/1.1" 200 2355 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.73.164 - - [30/Aug/2013:15:23:57 -0700] "GET /programming/scm.git.frag HTTP/1.1" 200 34073 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

To explain: they took the URL for the presentation, stripped off the filename component, and retrieved a directory listing. Then they crawled each of the files returned from that listing, and added them to their index. This is, quite frankly, a script-kiddie hack. In fact, the first time I saw this behavior it was a script-kiddie exploring my site.

I don't really care about script-kiddies: the directories that I want to keep private have directory listings disabled, and also have a dummy index.html that redirects back to my home page. If someone wants to learn my madd skillz with PHP, fine, have at it (but don't feel compelled to comment if you don't find my skillz that madd).

But why does Google (and Bing, and Baidu, and many others) feel the need to do this?

I try to follow their rules: I have a sitemap that lists the “interesting” pages on my site; I use noindex and nofollow meta tags (and canonical, which is a whole ‘nother rant); I have a robots.txt that lists directories that aren't covered by the meta tags. In short, I have presented them with my website as I want it to appear in their index.

But they choose to ignore that, poke around my website, and pick up whatever droppings they can find. Which ultimately makes their index, and the Internet as a whole, a worse place.

I suppose it's time to update robots.txt.

Monday, February 3, 2014

Coder vs Engineer

Stack Overflow is a fabulous resource for programmers. When I have programming questions, the first page of Google results is filled with links to its pages, and they usually have the answers I need. So why do I often feel depressed after browsing its questions?

The answer came to me this weekend: it's a hangout for coders, not engineers.

The question that prompted this revelation was yet another request for help with premature optimization. The program in question was tracking lap times for race cars, and the OP (original poster, for those not familiar with the acronym) was worried that he (she?) was extracting the list of cars and sorting it after every update. He saw this as a performance and garbage-collection hit, that would happen “thousands of times a second.”

That last line raised a red flag for me: I'm not a huge race fan, but I can't imagine why you would expect to update lap times so frequently. The Daytona 500, for example, has approximately 40 cars, each of which takes approximately a minute per lap. Even if they draft, you have a maximum of 40 updates per second, for a rather small set of objects.

To me, this is one of the key differences between a coder and an engineer: not attempting to bound the problem. Those interview questions about counting gas stations in Manhattan are all about this. You don't have to be exact, but if you can't set a bound to a problem, you can't find an effective solution. Sure, updating and sorting thousands of cars, thousands of times a second, that might have performance issues. But that's not how real-world races work.

Another difference is that, having failed to bound the problem (indeed, even to identify whether there is a problem), the coder immediately jumps to writing code. And that was the case for the people who answered this particular question. Creating a variety of solutions that all solved some interpretation of the OP's problem.

And I think that's what really bothers me: that coders will interpret a question in whatever way makes their coding easiest. This was driven home by a recent DZone puzzle. The question was how to remove duplicates from a linked list, “without using a buffer.” It's a rather poorly-worded question: what constitutes a buffer?

There were a few people who raised that question, but by far the majority started writing code. And some of the implementations were quite inventive in their interpretation of the question. The very first response limited the input list to integers, and used a bitset to track duplicates (at the worst, that would consume nearly 300 MB of RAM — a “buffer” seems modest by comparison). Another respondent seemed to believe that a Java ArrayList satisfied the “linked list” criterion.

Enough with the rant. Bottom line is that this industry needs to replace coders by engineers: people who take the time to understand the problems that they're tasked to solve. Before writing code.

Monday, January 20, 2014

Thirty Years

Thirty years ago this week I started my first full-time, paid position as a software developer. I'm not sure of the actual day; when I moved to Philadelphia I purged a lot of old papers, and the offer letter for this job was one of them. I was a nineteen-year-old who had left college (that's another story), and whose father had been very clear about the consequences. Fortunately, the winter of 1983/84 was a good year to be looking for software jobs in the Boston area.

I landed at Think Technologies, the creators of Macintosh Pascal, as an entry-level programmer. My chief contribution to that product was a demo program that ended up as the box artwork (also shown in the link above).

I started the same week that the Macintosh was released. As a result, I didn't see either it or the product while I was interviewing. What I did see was the Apple Lisa, with its mouse and graphical UI, and the sense of “this is something big” moved Think to the top of my list.

In retrospect, Think may have been one of the most valuable jobs of my career, and not just for getting in on GUIs at the beginning. I learned a lot about startups, and about the pain of real-world software development. I also learned to keep my ego in check: the company was filled with bright people, and Mel, the person whose ideas were the base of the product, combined a particularly creative career with a reserved, self-effacing exterior.

And I learned that no job lasts forever. Eventually, you reach an intellectual or perceptual plateau, and it's time to move on to something new. As a professional, you must of course balance your personal desire for growth against your responsibilities to see your work in production. But it's easy to get trapped on the plateau, and that's something that I quickly learned to avoid.

The intervening years have seen a lot of change in the industry: graphical user interfaces are now the standard, as are integrated development environments (Macintosh Pascal wasn't the first IDE that I used, but it was the first good one); computer networks are pervasive, as is the resulting knowledge-web of Internet-hosted documentation and anonymous helpful strangers; web-apps have taken us back to a central server with dumb terminals, although client-side JavaScript is an echo of the micro-computer revolution; and lastly, the machines are almost infinitely more powerful.

But programming is still about moving bits from one place to another without dropping any.