Thursday, April 28, 2011

JDK Bug 6337981: Writing XML to OutputStream Doesn't Indent

I was recently bitten by JDK bug #6337981. Since it took me a while to narrow down the problem, and longer to find the official bug report, I'm posting it here to hopefully increase its Googleability. And so that I can respond to any complaints about how Practical XML does output.

Here's the code:

public class XmlIndentExample
{
    public static void main(String[] argv)
    throws Exception
    {
        String src = "<root><child>text</child></root>";

        ByteArrayInputStream in1 = new ByteArrayInputStream(src.getBytes("UTF-8"));
        StringWriter out1 = new StringWriter();
        transform(new StreamSource(in1), new StreamResult(out1));
        String result1 = out1.toString();
        System.out.println("writer:\n" + result1);

        ByteArrayInputStream in2 = new ByteArrayInputStream(src.getBytes("UTF-8"));
        ByteArrayOutputStream out2 = new ByteArrayOutputStream();
        transform(new StreamSource(in2), new StreamResult(out2));
        String result2 = new String(out2.toByteArray(), "UTF-8");
        System.out.println("stream:\n" + result2);
    }


    private static void transform(Source source, Result result)
    throws Exception
    {
        TransformerFactory fact = TransformerFactory.newInstance();

        // this is a work-around bug #6296446; only needed on JDK 1.5
        fact.setAttribute("indent-number", Integer.valueOf(4));

        Transformer xform = fact.newTransformer();

        xform.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        xform.setOutputProperty(OutputKeys.INDENT, "yes");

        // since we set the "indent-number" attribute on the factory, we
        // don't need to set the indent amount here; uncomment if you
        // think it will make a difference

//        xform.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");

        xform.transform(source, result);
    }
}

When you run this, the version that writes to a ByteArrayOutputStream isn't indented, while the version that writes to a StringWriter is.

The bug is marked as low priority, and has been open since 2005. And I was unable to find any mention of a similar problem in the Xerces bug database. All of which means that it's unlikely to get fixed any time soon.

There is a work-around, mentioned in the bug report: wrap the OutputStream in an OutputStreamWriter. That works, but you need to pay attention to encoding. Always — always — tell the OutputStreamWriter to use UTF-8:

OutputStreamWriter wrapped = new OutputStreamWriter(out, "UTF-8");

Friday, April 22, 2011

I Can Be Replaced

Returning briefly to the “fear” theme: a lot of people have a fear that they can be replaced. That if they don't do exactly what the boss says, he'll find someone who does. I think this fear drives a lot of the negative behavior in programming shops, from writing unmaintainable code to hoarding information. If someone is the only person who can solve a problem, clearly they're irreplaceable.

Not only is this fear misplaced, but the very idea of being irreplaceable is not something that I would wish on anyone.

I have a friend who is irreplaceable. He didn't try to be, but he's smart and manages to consistently meet deadlines in a chaotic environment. He's been in the same job for six or seven years now, and there's no way out. Even when he was “promoted,” it was a new title and more money, but he ended up doing the same work. For him, work has become a living hell.

And the worst part is that the world moved on, his skills have become focused on the one job, and it would be next to impossible for him to move (at least if he wants anything near the same level of compensation).

With another twenty years or more left in my working life, I try very hard to be replaceable.

Thursday, April 21, 2011

Four Data Structures Every Programmer Should Know

I went back to reading Stack Overflow during my convalescence. And saw a bunch of questions and answers that really disturbed me. For those who don't want to click, they're all questions that relate to data structures.

OK, I realize that data structures can be daunting. Knuth took three books just to cover the basics. Cormen is nearly 1,300 pages, and Sedgewick almost 1,000. And I'll be the first to admit that I've not memorized any of these books, or even read significant pieces of them (much less cover-to-cover). They're meant as references: if and when I need to implement an algorithm, I open the book. If you asked me to sit down and implement a red-black tree without a book, I couldn't do it.

But I do know the performance characteristics of binary trees, and I know that java.util.TreeMap uses a binary tree. And as a Java programmer, I know when I should use a TreeMap versus a HashMap, versus an ArrayList that I've sorted. And I think that knowledge is a baseline that every programmer should have.

So, here's a cheat-sheet of the four data structures that every programmer should know, because they'll handle almost all of your in-memory data management needs.

Structure Insert Remove Search Indexed Access Example
Hash table O(1) O(1) O(1) n/a HashMap
Binary tree O(logN) O(logN) O(logN) n/a TreeMap
Linked list O(1) O(1) O(N) O(N) LinkedList
Array-backed list O(N) O(N) O(N) O(1) ArrayList

Hash table

If you need to maintain a key-value relationship, then a hashed data structure should be your first choice. As I've written elsewhere, a hashed lookup is as fast as you can get: O(1) access for all operations. At least, it's O(1) if you have a good hash function. If you have a bad hash function, performance can drop to O(N).

There is one downside to hash tables: they add a lot of memory overhead. To avoid collisions, a hash table needs to be larger than the number of entries; by default, the Java HashMap resizes when the table becomes 75% full. There's also overhead from the linked list used to hold hash buckets: each entry is a distinct object, and in the Sun implementation, on a 32-bit JVM, takes 24 bytes — not counting the value that it holds.

Binary tree

Whenever someone mentions O(logN) performance, they're talking about a tree of some form. Trees implement a “divide and conquer” approach to data: starting at a “root” node, you recursively identify a child node that contains the data you need. At each step, you divide the total search space by some constant value; in the case of a binary tree, that value is 2, you divide the space in half. There are a bunch of different trees, and a bunch of data structures that implement a binary tree, but as far as Java is concerned, there are just two (which means that O(logN) is really O(log2N)).

TreeMap and TreeSet use a tree structure in which the value of a node's “left” child is less than the value of the node, and the value of the node's ”right” child is greater. This means that, not only do you have O(logN) search performance, you can easily iterate through the elements in order from smallest to largest.

And this is (with a few exceedingly rare exceptions), the only reason that you would pick TreeMap over HashMap. For example, if you are extracting word counts from a piece of text, it makes sense to manage the words in sorted order, because you'll want to output them in that form. Other than this use case, the performance benefit of HashMap means it's a better choice.

The PriorityQueue is another Java class that implements a binary tree structure, however it does not maintain the order of its elements. Instead, it uses a tree where a given element is required to be lower value than either of its children. This means that inserts are O(logN), but searching is O(N); you know that the root is the smallest value in the tree, but you know nothing about any of the others.

Priority queues have a very limited set of use cases. They're great when you need to process or initiate time-dependent events, or for graph traversal algorithms where you care about a least-cost path. Because they don't maintain a total ordering over their elements, there's less work that has to be done for each N, and there's no need for an explicit node object (even though two algorithms can have the same “big O” complexity, the work that takes place can make them very different in performance).

I mention priority queues because of one of the SO links above. Because priority queues are binary structures, they take O(logN) time for inserts and removes. Which is a really dumb penalty to pay if you're looking for a simple linear (eg: fifo, or first-in-first-out) queue.

Linked list

Which brings me to linked lists, the traditional data structure for queues. As long as you actually hold a reference to a node, insertion and removal are O(1): they're implemented by updating pointers. It's easy to get a reference to either the first or last node (at least in a doubly-linked list, which is what the Java LinkedList class implenents). It's not so easy to get access to an arbitrary node, because you need to iterate the predecessor nodes first, an O(N) activity.

But linked lists have another dark side: each of the nodes is a distinct object, that consumes memory and must eventually be garbage-collected (20 bytes on a Sun 32-bit JVM). And the use-case for removing elements from the head or tail of a list tends to be multi-threaded applications, in which case one of the classes in java.util.concurrent will be more appropriate. In fact, the only time that LinkedList would be my first choice is for an application where I need known marginal memory consumption.

Array-backed list

For all other generic lists, my first choice is ArrayList. As its name implies, it implements the List interface using a pre-allocated array for storage. This means that you get O(1) access to any value, as long as you know its position in the array. It also means that, while add or remove of the last element is O(1), it's O(N) for any other position, because the other values in the list have to be shifted. However, in practice most additions are to the end of a list. And even where you need to add or remove values from the start of the list, in a small list (a few thousand entries) the CPU cost to do this is lower than the cost to manage the entries of a LinkedList.

The problem with ArrayList is the way that it resizes the underlying array. As I said, it pre-allocates an array for storage; if you want to add more elements than the array can hold, it must allocate a new, larger array, and copy the existing entries into it. The documentation describes this behavior as “amortized” O(1) performance: although the copy is an O(N) operation, it will only happen every N times you add an element, and O(N) / N == O(1).

More important, in practice, is that an ArrayList allocates a new array that is some significant percentage larger than the old array (it's explicitly not documented, but in the Sun 1.6 JDK is 50%). If you're pushing the boundaries of available memory, and need to expand a large list, it's all too easy to get an OutOfMemoryError. There's also the potential for a lot of wasted space, particularly with small lists.

Summary

I would say that, 99% of the time, when I need a Java collection class I turn to ArrayList and HashMap. To me, they represent the Platonic ideal for List and Map implementations. But I do that with full understanding of their limitations, and am ready to switch to a different data structure (or implement my own) when I run into those limitations.

Thursday, April 14, 2011

There's more than one way ...

Words of wisdom from a former coworker
There's more than one way to skin a cat. And just between you and me, it's a lot easier when you don't care how it looks when you're done.

Friday, April 8, 2011

Aviate, Navigate, Communicate

Houston, we've had a problem

It's the tone of voice, not simply the words. True, the crew of Apollo 13 didn't know how serious their problem was when they calmly notified Mission Control. The crew of US Airways Flight 1549, by comparison, knew exactly how serious their problem was. Yet the transcripts and audio feature the same calm, “let's take care of business” tone.

Compare this to the way that a development organization responds to a problem. In the cases I've seen, there will be a half-dozen people clustered around a desk, maybe more on a conference call. Everyone is talking, everyone is trying to formulate a plan. And surprisingly, most of these people are managers. There may be a single developer. And while she's trying to figure out what went wrong, she has to contend with a barrage of questions from outside.

I've been in that position. Probably most of you have. And when I'm the person at the keyboard, I try to remember the best advice I got from my flight instructor.

Aviate

It's amazing how far an airplane can go without an engine. In the case of a light single-engine plane like a Cessna 172, as long as you fly straight and level at “best glide speed,” you'll travel about a mile horizontally for every 1,000 feet of altitude you lose. The first response to an engine failure is to level the wings and trim for this speed — a response regularly reinforced by your flight instructor, who pulls the throttle to idle while you're in the middle of something else.

The important thing is that the first steps in any crisis should be planned in advance. There's no hesitation when your engine goes silent: you level the wings and trim for airspeed. But the response for an engine fire is completely different: you shut off the fuel supply (causing the engine to stop) and then increase your speed to put out the fire. Once the fire is out you can think about trimming for glide.

I've never known an IT organization that planned for failure — other than disasters, of course. And while it's important to have a plan when your data center goes down, it's just as important to have a plan for when your web server starts sending out “403 - Forbidden” messages to all of your clients. And, just like the procedure for engine fire, sometimes you have to trade one emergency for another.

Navigate

“Keep the airplane flying” is good advice even when you don't have engine problems. And only after you've done that, then it's time to figure out what happened and how to resolve the problem.

Far too many crashes (primarily of private aircraft) happen because pilots try to resolve the problem first. The crash of Eastern 401 is an example: all three people in the cockpit were focused on diagnosing a burnt-out lightbulb while the aircraft descended into the Everglades.

By comparison, the crew of Flight 1549 divided up their tasks: Captain Sullenberger immediately took over the controls, while First Officer Skiles walked through the emergency checklist. This division of responsibilities has a name: Cockpit Resource Management, and it's part of the training for commercial aircrews (in large part due to accidents such as Eastern 401).

Back to an IT organization. In my experience, division of labor is ad hoc: the of the “first responders” tend to brainstorm ideas and then go off and research them. Meanwhile, who is minding the system?

Communicate

If you watched the NTSB video that I linked above, you may have noticed something: before the emergency, air traffic control initiated all communication and the crew responded very quickly — the New York airspace is one of the busiest in the country, and the controllers have little patience. But after calling the mayday, the communication changed: the crew would initiate requests for information, and ignore any requests initiated by the controller.

Of the three steps, this might be the most important: communicate only when necessary. It's also the most difficult: managers need to communicate; it's how they add value. The trick is to balance their need for communication against your need to avoid distractions. A pre-agreed communication plan is best (and this is how air traffic works: an aircraft with a declared emergency takes precedence over everything else in the sky). But if you don't have one, it's easy enough to create one: say what you're doing and when you expect to have an update. And then ignore interruptions (managers that truly add value via communication will understand).

And above all, keep the tone of your voice level, even if you feel like panicking. Because it's the tone of voice that people remember.

Tuesday, April 5, 2011

Deadlines

When a man knows he is to be hanged in a fortnight, it concentrates his mind wonderfully

I doubt that Samuel Johnson had first-hand knowledge of staring down a gallows, but he perfectly captures the effect of an approaching deadline. Unfortunately, all too often the mind is concentrated on the deadline itself, rather on the problem(s) to be solved.

Deadlines are a fact of life in software development. And I've been lucky enough to have worked at several companies where the deadline is just another day: it arrives, the software is released, everybody leaves work early for beer. At other companies, the deadline is a source of panic, sometimes starting at the project kickoff.

The difference, as everyone knows, is that the latter companies set unrealistic deadlines: too much work, too little time.

Except that, as I look back at the projects where I've experienced deadline pressure, I don't remember that the schedule or work effort were that extreme. More often, the problems came from confusion: bugs (particularly performance) that weren't discovered until late in the process, last minute questions about how a feature should behave (with different team members having different strongly-held positions), the undiscovered critical feature, or process delays (QA has to sign off, but the QA lead is sick).

There are lots of techniques to attack these problems, with Agile project management attempting to take on several at once. And I think that any or all can work, even those embodying heavyweight process. The only question is whether they reduce the confusion or add to it.