Friday, August 12, 2011

"Text File" is an Oxymoron

Back in the early 1990s, life was easy. If you worked in the United States, “text” meant ASCII. If you worked in Canada or Europe, it might mean ISO-8859-1 or windows-1252, but they were almost the same thing … unless you dealt with currency and needed to display the new Euro symbol. There were a few specialists who thought of text as wchar_t, but they were rare. Companies hired them as contractors rather than full-time employees.

This US-centric view of text is pervasive: any MIME Content-Type that begins with “text” is presumed to be US-ASCII unless it has an explicit charset parameter. That often trips up people who create XML, which presumes UTF-8 in the absence of an explicit encoding declaration (solution: use application/xml rather than text/xml).

This was the world that Java entered, and it left an indelible imprint. Internally, Java looked to the future, managing strings as Unicode (originally UCS-2, now UTF-16). But in the IO package, it was firmly rooted in the past, relying on a “default encoding” when converting those 16-bit Unicode characters into bytes. Even today, in JDK 7, FileReader and FileWriter don't support explicit encodings.

The trouble with a default encoding is that it changes from machine to machine. On my Linux machines, it's UTF-8; on my Windows XP machine at home, it's windows-1252; on my Windows XP machine at work, it's ISO-8859-1. Which means that I can only move “text” files between these boxes if they're limited to US-ASCII characters. Not a problem for me, personally, but I work with people from all over the world.
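To make the problem concrete, here is a minimal sketch of what happens when bytes written under one encoding are read under another. The class name DefaultEncodingDemo and the helper decode are made up for illustration; the sample word is arbitrary.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class DefaultEncodingDemo {

    // Decode the same bytes under whichever charset the caller names.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // This is what FileReader and FileWriter silently rely on.
        System.out.println("Default charset here: " + Charset.defaultCharset());

        // "café", written with a Unicode escape so that the source file
        // itself stays pure US-ASCII.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        // Read back as UTF-8: the original four characters.
        String asUtf8 = decode(utf8Bytes, StandardCharsets.UTF_8);

        // Read back as ISO-8859-1: five characters, because the two bytes
        // that encode the accented letter each become a separate character.
        String asLatin1 = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length() + " chars as UTF-8");
        System.out.println(asLatin1.length() + " chars as ISO-8859-1");
    }
}
```

Only the first line of output should differ between machines. The two decodes name their charsets explicitly, so they behave the same everywhere, which is the point.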

At this point in time, I think the whole idea of “text” is obsolete. There's just a stream of bytes with some encoding applied. To read that stream in Java, wrap it in an InputStreamReader with an explicit encoding; to write it, use an OutputStreamWriter with an explicit encoding. Or, if you have a library that manages encoding for you, stick with byte streams and let the library do the conversion.
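A minimal sketch of that advice, assuming Java 7 (for try-with-resources and StandardCharsets). The class ExplicitEncodingIO and its two helpers are names I made up for this example, not part of any library.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class ExplicitEncodingIO {

    // Write characters to a file as bytes in the named charset.
    static void writeText(File file, String text, Charset charset) throws IOException {
        try (Writer out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(file), charset))) {
            out.write(text);
        }
    }

    // Read a file's bytes back into characters using the named charset.
    static String readText(File file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();

        // Accented letters and the Euro sign, written as Unicode escapes
        // so the source file itself is pure US-ASCII.
        String original = "na\u00efve caf\u00e9 \u20ac5";

        writeText(file, original, StandardCharsets.UTF_8);
        String roundTrip = readText(file, StandardCharsets.UTF_8);

        System.out.println("Round trip preserved the text: " + original.equals(roundTrip));
    }
}
```

Neither helper consults the platform default, so a file written this way on one of my boxes should read back identically on any of the others.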

If you're not doing that, you're doing it wrong. And if you aren't using UTF-8 as the encoding, in my opinion you're doing it poorly.
