Friday, April 8, 2011

Aviate, Navigate, Communicate

Houston, we've had a problem

It's the tone of voice, not simply the words. True, the crew of Apollo 13 didn't know how serious their problem was when they calmly notified Mission Control. The crew of US Airways Flight 1549, by comparison, knew exactly how serious their problem was. Yet the transcripts and audio feature the same calm, “let's take care of business” tone.

Compare this to the way that a development organization responds to a problem. In the cases I've seen, there will be a half-dozen people clustered around a desk, maybe more on a conference call. Everyone is talking, everyone is trying to formulate a plan. And surprisingly, most of these people are managers. There may be a single developer. And while she's trying to figure out what went wrong, she has to contend with a barrage of questions from outside.

I've been in that position. Probably most of you have. And when I'm the person at the keyboard, I try to remember the best advice I got from my flight instructor.

Aviate


It's amazing how far an airplane can go without an engine. In the case of a light single-engine plane like a Cessna 172, as long as you fly straight and level at “best glide speed,” you'll travel about a mile horizontally for every 1,000 feet of altitude you lose. The first response to an engine failure is to level the wings and trim for this speed — a response regularly reinforced by your flight instructor, who pulls the throttle to idle while you're in the middle of something else.

The important thing is that the first steps in any crisis should be planned in advance. There's no hesitation when your engine goes silent: you level the wings and trim for airspeed. But the response for an engine fire is completely different: you shut off the fuel supply (causing the engine to stop) and then increase your speed to put out the fire. Once the fire is out you can think about trimming for glide.

I've never known an IT organization that planned for failure (other than disasters, of course). And while it's important to have a plan for when your data center goes down, it's just as important to have a plan for when your web server starts sending out “403 - Forbidden” messages to all of your clients. And, just like the procedure for an engine fire, sometimes you have to trade one emergency for another.

Navigate


“Keep the airplane flying” is good advice even when you don't have engine problems. Only after you've done that is it time to figure out what happened and how to resolve the problem.

Far too many crashes (primarily of private aircraft) happen because pilots try to resolve the problem first. The crash of Eastern 401 is an example: all three people in the cockpit were focused on diagnosing a burnt-out lightbulb while the aircraft descended into the Everglades.

By comparison, the crew of Flight 1549 divided up their tasks: Captain Sullenberger immediately took over the controls, while First Officer Skiles walked through the emergency checklist. This division of responsibilities has a name: Cockpit Resource Management, and it's part of the training for commercial aircrews (in large part due to accidents such as Eastern 401).

Back to an IT organization. In my experience, division of labor is ad hoc: the “first responders” tend to brainstorm ideas and then go off and research them. Meanwhile, who is minding the system?

Communicate


If you watched the NTSB video that I linked above, you may have noticed something: before the emergency, air traffic control initiated all communication and the crew responded very quickly — the New York airspace is one of the busiest in the country, and the controllers have little patience. But after calling the mayday, the communication changed: the crew would initiate requests for information, and ignore any requests initiated by the controller.

Of the three steps, this might be the most important: communicate only when necessary. It's also the most difficult: managers need to communicate; it's how they add value. The trick is to balance their need for communication against your need to avoid distractions. A pre-agreed communication plan is best (and this is how air traffic works: an aircraft with a declared emergency takes precedence over everything else in the sky). But if you don't have one, it's easy enough to create one on the spot: say what you're doing and when you expect to have an update. And then ignore interruptions (managers who truly add value via communication will understand).

And above all, keep the tone of your voice level, even if you feel like panicking. Because it's the tone of voice that people remember.
