Following last week's posting about “text files,” I wanted to look at one of the most common ways to deliver text: the web. The HTTP protocol defines a Content-Type header, which specifies how a user agent (read: browser) should interpret the response body. The content type of an HTML document is
text/html; breaking from other “text” types, its default character set is ISO-8859-1. However, you can specify the document's encoding as part of the
Content-Type, and most websites do.
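As a minimal sketch of that rule, here is how a client might pick the character set from a Content-Type value. The class name ContentTypeCharset and its charsetOf method are hypothetical, and the fallback assumes the value describes text/html:

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks the charset a client would assume for a text/html Content-Type value. */
final class ContentTypeCharset {
    // Matches a charset parameter such as `; charset=UTF-8` or `;charset="utf-8"`.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile(";\\s*charset\\s*=\\s*\"?([^\\s;\"]+)\"?", Pattern.CASE_INSENSITIVE);

    static String charsetOf(String contentType) {
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (m.find()) {
            // Charset names are case-insensitive; normalize for easy comparison.
            return m.group(1).toUpperCase(Locale.ROOT);
        }
        // No explicit charset: fall back to the text/html default described above.
        return "ISO-8859-1";
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8")); // UTF-8
        System.out.println(charsetOf("text/html"));                // ISO-8859-1
    }
}
```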
All well and good, except that an HTML document can specify its own encoding, using the
http-equiv meta tag:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="fr" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Wikipédia, l'encyclopédie libre</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Wikipedia does “meta”
Content-Type about as well as you can: the page is delivered with a
Content-Type header specifying UTF-8, and it's an XHTML document (which implies UTF-8 encoding in the absence of a prologue). The only questionable practice with this page is the location of the
<title> tag: it contains UTF-8 content, but appears before the in-document
Content-Type. But in this case the in-document content type specification is superfluous.
Not all non-English pages do as well. The Montreal Craigslist page, for example, specifies ISO-8859-1 in the HTTP response, but UTF-8 in the
meta tag.* It is a testament to browser developers adhering to Postel's Law that you can read the site at all.
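To see what a mismatch looks like, here is a small sketch that assumes (hypothetically) the page bytes really are UTF-8 while the client believes an ISO-8859-1 label. MismatchDemo and misread are illustrative names only:

```java
import java.nio.charset.StandardCharsets;

final class MismatchDemo {
    /** Decodes UTF-8 bytes as if they were ISO-8859-1, the way a client trusting the wrong label would. */
    static String misread(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Unicode escape keeps this source file pure ASCII; \u00e9 is e-acute.
        System.out.println(misread("Montr\u00e9al"));
    }
}
```

Under that assumption, misread("Montréal") yields "MontrÃ©al": the two bytes that encode the é are read as two separate characters.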
From a “layered architecture” perspective, the embedded content-type declaration is ugly. You could argue that it self-describes a stand-alone document, much like the prologue in an XML document. But there's an important difference: the bytes of an XML prologue are rigidly specified; the parser doesn't need to know the encoding to read them. The
<meta> tag can appear anywhere in the
<head> of an HTML document, including (as Wikipedia shows) after content that requires knowledge of the encoding.
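One way a parser could cope is to scan the first bytes of the body as if they were a single-byte encoding and look for the declaration. This is only a sketch, not a description of how any particular browser works: MetaCharsetSniffer, sniff, and the 1024-byte cutoff are assumptions of mine, and the approach only works for ASCII-compatible encodings:

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {
    private static final int PRESCAN_LIMIT = 1024; // arbitrary cutoff chosen for this sketch

    // Looks for charset=... inside a <meta ...> tag; deliberately loose, not a real HTML parser.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    static Optional<String> sniff(byte[] body) {
        int length = Math.min(body.length, PRESCAN_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so decoding never fails and
        // ASCII markup survives even when the real encoding is something else.
        String head = new String(body, 0, length, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Note that the title in the Wikipedia example would be garbled by this first pass; the parser has to start over once it knows the declared encoding.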
While writing this post, I did a quick search for a history of the embedded
Content-Type specification. I turned up a W3C page that recommended always using it, but did not give a rationale. And I found a page that claimed specifying a character set in the HTTP response would “break older browsers.” As the page did not list those browsers, and did not appear to be written by someone involved in browser development, I'm not sure that I believe it.
For my personal website, I rely on the HTTP header, and don't use the
meta tag. But I also limit myself to US-ASCII text, with HTML or numeric entities for anything that isn't ASCII. I'm not going to suggest that you remove the tag from your website (who knows, your biggest customer might have an “older browser”). But if you do use it, it should be the first thing in your <head>.
More important than whether the
<meta> tag is present is that you actually get the encoding right, both in the page and in the HTTP headers.
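With servlets, it's easy: the first line of your service method should be a call to ServletResponse.setContentType(), with the character set included in the content type (for example, "text/html; charset=UTF-8").
This will set the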
Content-Type header and also configure the object returned by
ServletResponse.getWriter(). Don't, under any circumstances, write HTML data via the object returned by
ServletResponse.getOutputStream(); it exists for servlets that produce binary content.
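The reason the writer matters is that it, not your code, turns characters into bytes. The servlet API is not part of the JDK, so the sketch below uses a plain OutputStreamWriter as a stand-in for the writer the container configures; WriterEncodingDemo and encode are hypothetical names:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class WriterEncodingDemo {
    /** Encodes text through a Writer bound to the given charset (stand-in for the response writer). */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(sink, charset)) {
            out.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory sink
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String title = "Wikip\u00e9dia"; // \u00e9 is e-acute
        System.out.println(encode(title, StandardCharsets.UTF_8).length);      // 10: the e-acute takes two bytes
        System.out.println(encode(title, StandardCharsets.ISO_8859_1).length); // 9: one byte per character
    }
}
```

The same nine characters produce different bytes depending on the writer's charset, which is why the charset has to be settled before you ask for the writer.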
With JSP, put the following two directives at the top of each page.
<%@page contentType="text/html"%>
<%@page pageEncoding="UTF-8"%>
These are translated into a call to
ServletResponse.setContentType(), and are also used by the JSP container itself to parse the page. If, after reading this posting, you don't feel comfortable writing self-describing files, you can also use a JSP property group in your web.xml.
One final thing: if you do choose to specify content type via
http-equiv, make sure that it matches what your server is putting in the HTTP response. Otherwise, you risk having your site used as an example by someone writing about encodings.
* The Paris Craigslist omits the <meta> declaration, but retains ISO-8859-1 in the HTTP response, which explains why all of the ads say “EUR” rather than using the euro sign.