Tuesday, December 6, 2011

Actron CP9580: How Not To Do An Update

The Actron CP9580 is an automotive scantool. For those who aren't DIY mechanics, it connects to your car's on-board computer and reports on engine operation and trouble codes (i.e., why your “check engine” light is on). My car has passed 100,000 miles, and received its first (hopefully spurious) trouble code a few weeks ago; the $200 investment seemed worthwhile.

Except that right now, the tool is an expensive doorstop, sitting in the manufacturer's repair shop, and I wasted a couple of hours last week. All because I ran the manufacturer-supplied update, which failed catastrophically. As I look back on the experience, I see several problems with their update process, some of which are rooted in a 1990-vintage design mentality, but all of which represent some fundamental failing that every developer should avoid.

#1: They used their own protocol to communicate with the device

In 1990, most embedded devices used an RS-232 serial port to communicate with the outside world. Manufacturers had no choice but to develop their own communications protocol, using something like X-Modem for file transfers.

But the CP9580 has a USB port. And I'm betting that it has flash memory to store its data files. Both of which mean that a custom protocol doesn't make sense. Instead, expose the flash memory as a removable drive and let the operating system — any operating system — manage the movement of data back and forth. Doing so should actually reduce development costs, because it would leverage existing components. And it would make user-level debugging possible: simply look at what files are present.
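To make that concrete, here is a minimal sketch of what a host-side updater could look like if the device simply showed up as a mounted drive. Everything named here is an assumption for illustration: the `updates` directory, the function names, and the idea that the device mounts at some `device_root` path. I have no knowledge of how the CP9580 actually lays out its storage.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_update(update_file: Path, device_root: Path) -> Path:
    """Copy an update onto the mounted device and verify the copy.

    The "updates" directory is a hypothetical convention, not
    anything the real device is known to use.
    """
    incoming = device_root / "updates"
    incoming.mkdir(parents=True, exist_ok=True)
    target = incoming / update_file.name
    shutil.copyfile(update_file, target)
    # Read the copy back and compare digests before trusting it.
    if sha256_of(target) != sha256_of(update_file):
        target.unlink()
        raise IOError(f"copy of {update_file.name} does not match the original")
    return target


def list_device_files(device_root: Path) -> list[str]:
    """User-level debugging: list every file present on the device."""
    return sorted(
        p.relative_to(device_root).as_posix()
        for p in device_root.rglob("*")
        if p.is_file()
    )
```

The transfer itself is one call to `shutil.copyfile`; the operating system does the work that a custom protocol would otherwise have to do, and `list_device_files` is the "simply look at what files are present" step.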

#2: They deleted the old firmware before installing the new

Again, a vestige of 1990, when devices used limited-size EEPROMs for their internal storage. Not only was the amount of space severely limited, but so was the number of times you could rewrite the chip before it failed. Better to simply clear the whole thing and start fresh.

This is another case where flash memory and a filesystem-based design change the game entirely. Consumer demand for memory cards has pushed the price of flash memory to the point where it's not cost-effective to use anything less than a gigabyte. And with a filesystem, version management is as simple as creating a new directory.
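As a rough sketch of the "new directory per version" idea, assuming a filesystem where a rename is atomic: each version is copied into its own directory, and a small `current` file is switched over only after the copy has finished. The layout (`versions/`, `current`) and the function names are hypothetical, chosen for this example.

```python
import os
import shutil
from pathlib import Path
from typing import Optional


def current_version(store: Path) -> Optional[str]:
    """Return the active version name, or None if nothing is installed."""
    pointer = store / "current"
    return pointer.read_text().strip() if pointer.exists() else None


def activate(store: Path, version: str) -> None:
    """Point "current" at an already-installed version."""
    if not (store / "versions" / version).is_dir():
        raise ValueError(f"version {version} is not installed")
    tmp = store / "current.tmp"
    tmp.write_text(version)
    # os.replace is an atomic rename on POSIX filesystems; a real device
    # would need the same guarantee from whatever filesystem it uses.
    os.replace(tmp, store / "current")


def install_version(store: Path, version: str, payload: Path) -> Optional[str]:
    """Install a new version next to the old one; return the previous version.

    The old version's directory is never touched. If the copy fails
    partway, "current" still points at the old version.
    """
    previous = current_version(store)
    # copytree raises FileExistsError if this version is already installed.
    shutil.copytree(payload, store / "versions" / version)
    activate(store, version)
    return previous
```

Rolling back is then just `activate(store, previous)`: nothing was deleted, so there is nothing to restore.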

It's also a case where the game changed and the system designers half-changed. In the old days, communications code was in permanent ROM. If an update failed, no problem: you could try again (or reload the previous version). However, it seems that the CP9580 stores everything in flash memory, including the loader program (at least, that's what I interpret from the tech support person's comments, but maybe he was just being lazy).

The iPhone is a great example of how to do updates right: you can install a new revision of iOS, run it for months, and then decide that you want to roll back; the old version is still there. But it's not alone; even an Internet radio is smart enough to hold onto its old software while installing an update.

#3: They kept the update on their website, even though they'd had reports of similar failures

The previous two failings can be attributed to engineers doing things the same way they always have, even when the technology has moved forward. This last failure runs a little deeper. When I called tech support for the second time after the update failed, I was told that “we've had several cases where this happened.” Yet the software was still on the website, without a warning that it might cause problems. And it's still there today.

The exhortation to “do no harm” is popularly associated with the Hippocratic Oath. Programmers don't have to swear a similar oath, but I think they should, if only to themselves. Too often we look at the technical side of a problem, forgetting that there's a human side. Sometimes the result ends up on The Daily WTF, but more often it ends up quietly causing pain to the people who use our software.
