Back in the early 1990s, life was easy. If you worked in the United States, “text” meant ASCII. If you worked in Canada or Europe, it might mean ISO-8859-1 or windows-1252, but they were almost the same thing … unless you dealt with currency and needed to display the new Euro symbol. There were a few specialists who thought of text as wchar_t, but they were rare. Companies hired them as contractors rather than full-time employees.
This US-centric view of text is pervasive: any MIME Content-Type that begins with “text” is presumed to be US-ASCII unless it has an explicit character-set specifier. That often trips up people who create XML, which presumes UTF-8 in the absence of an explicit encoding (solution: use application/xml rather than text/xml).
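If you're producing XML and shipping it over HTTP, the fix is mostly a matter of saying what you mean. Here's a minimal sketch (the URL and payload are placeholders, not a real endpoint) that declares both the type and the charset so nobody has to guess:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PostXml {
    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><greeting>hello</greeting>";

        // http://example.com/api is a stand-in for whatever you're actually talking to
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/api").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        // Say both the type and the charset explicitly
        conn.setRequestProperty("Content-Type", "application/xml; charset=UTF-8");

        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Response: " + conn.getResponseCode());
    }
}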
This was the world that Java entered, and it left an indelible imprint. Internally, Java looked to the future, managing strings as Unicode (UCS-2 at the time, UTF-16 today). But the IO package was firmly rooted in the past, relying on the “default encoding” when converting those two-byte Unicode characters into bytes. Even today, in JDK 7, FileReader and FileWriter don't support explicit encodings.
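To make that concrete, here's a small sketch (the file name is made up) of what FileReader is doing under the covers as of JDK 7: it's just an InputStreamReader wrapped around a FileInputStream, with the charset picked for you.

import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

public class DefaultEncodingDemo {
    public static void main(String[] args) throws IOException {
        // FileReader gives you no say in the matter: it always uses the
        // platform default charset (true as of JDK 7).
        Reader implicit = new FileReader("notes.txt");

        // It behaves like this more verbose form, which at least makes the
        // reliance on the default visible.
        Reader spelledOut = new InputStreamReader(
                new FileInputStream("notes.txt"), Charset.defaultCharset());

        implicit.close();
        spelledOut.close();
    }
}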
The trouble with a default encoding is that it changes from machine to machine. On my Linux machines, it's UTF-8; on my Windows XP machine at home, it's windows-1252; on my Windows XP machine at work, it's iso-8859-1. That means I can only move “text” files between these boxes if they're limited to US-ASCII characters. Not a problem for me, personally, but I work with people from all over the world.
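If you want to see what your own machine considers the default, a couple of lines will tell you; run it on a few boxes and compare the output, which is exactly the problem.

import java.nio.charset.Charset;

public class ShowDefaultCharset {
    public static void main(String[] args) {
        // Both of these report the JVM's idea of the "default encoding"
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());
    }
}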
At this point in time, I think the whole idea of “text” is obsolete. There's just a stream of bytes with some encoding applied. To read that stream in Java, use InputStreamReader with an explicit encoding; to write it, use OutputStreamWriter. Or, if you have a library that manages encoding for you, stick with streams.
If you're not doing that, you're doing it wrong. And if you aren't using UTF-8 as the encoding, in my opinion you're doing it poorly.
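To put it in code, here's a minimal round trip, using a scratch file name of my own invention: write a few non-ASCII characters as UTF-8, then read them back, naming the encoding explicitly on both sides.

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        // Write a few characters that won't survive a charset mismatch
        // (spelled as Unicode escapes so this source file's own encoding doesn't matter)
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("greeting.txt"), StandardCharsets.UTF_8)) {
            out.write("Gr\u00fc\u00dfe, \u4e16\u754c, \u20ac\n");
        }

        // Read it back, again naming the encoding explicitly
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("greeting.txt"), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}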