Wednesday, August 31, 2011

Using a Local Repository Server with Gradle

I've been doing a little work with Gradle recently. And one of the things that I find “less than optimal” is that the build script holds far too much knowledge about its environment, which means that you have to jump through some hoops to make those scripts portable. Not a huge problem for in-house development, but if you're publishing an open-source library, you can't expect everyone else to reconfigure their world to match yours.

One particular problem is how to find your dependencies. A typical build script has a repositories section that lists all the places to look. Here's a simple example that looks first in the local Maven repository, then in Maven Central:

repositories {
    mavenLocal()
    mavenCentral()
}

This is a portable build script — although I have no idea how dependencies might find their way to the local Maven repository, since Gradle uses its own dependency cache. A better build script might want to use a local repository server rather than constantly hitting Maven Central:

repositories {
    mavenRepo urls: 'http://intranet.example.com/repository'
}

That works, but now you can't share the build script with anybody else, unless they edit the script to use their own repository server (assuming they have one), and remember not to check in their changes. The solution that I came up with is to store the repository URL in $HOME/.gradle/gradle.properties, which is loaded for every build.

internalRepositoryUrl: http://repo.traffic.com:8081/nexus/content/groups/public/

Then, the build script is configured to add the local server only if the property is defined:

repositories {
    mavenLocal()

    if (project.hasProperty('internalRepositoryUrl'))
        mavenRepo urls: project.internalRepositoryUrl
    else
        mavenCentral()
}
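As a bonus, someone who doesn't keep the property in gradle.properties can still point the build at a repository for a one-off run, because Gradle lets you set project properties from the command line (the URL here is a placeholder):

gradle -PinternalRepositoryUrl=http://intranet.example.com/repository build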

It's portable, but it's ugly. When searching for solutions, I saw a couple of postings indicating that gradle.properties will eventually be allowed to contain expressions as well as properties. That day can't come soon enough.

Wednesday, August 17, 2011

Meta Content-Type is a Bad Idea

Following last week's posting about “text files,” I wanted to look at one of the most common ways to deliver text: the web. The HTTP protocol defines a Content-Type header, which specifies how a user agent (read: browser) should interpret the response body. The content type of an HTML document is text/html; breaking from the MIME convention that “text” types default to US-ASCII, its default character set is ISO-8859-1. However, you can specify the document's encoding as part of the Content-Type, and most websites do.
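For example, here's what an explicit header looks like; the charset parameter is the part that names the encoding:

Content-Type: text/html; charset=UTF-8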

All well and good, except that an HTML document can specify its own encoding, using the http-equiv meta tag:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="fr" dir="ltr" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Wikipédia, l'encyclopédie libre</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Wikipedia does “meta” Content-Type about as well as you can: the page is delivered with a Content-Type header specifying UTF-8, and it's an XHTML document (which implies UTF-8 encoding in the absence of a prologue). The only questionable practice with this page is the location of the <title> tag: it contains UTF-8 content, but appears before the in-document Content-Type. But in this case the in-document content type specification is superfluous.

Not all non-English pages do as well. The Montreal Craigslist page, for example, specifies ISO-8859-1 in the HTTP response, but UTF-8 in the meta tag.* It is a testament to browser developers adhering to Postel's Law that you can read the site at all.

From a “layered architecture” perspective, the embedded content-type declaration is ugly. You could argue that it self-describes a stand-alone document, much like the prologue in an XML document. But there's an important difference: the bytes of an XML prologue are rigidly specified; the parser doesn't need to know the encoding to read them. The <meta> tag can appear anywhere in the <head> of an HTML document. Including, as shown by Wikipedia, after content that requires knowledge of the encoding.
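For comparison, here's a typical XML prologue. Every one of those characters is ASCII, so a parser can read the encoding declaration without already knowing the encoding:

<?xml version="1.0" encoding="UTF-8"?>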

While writing this post, I did a quick search for a history of the embedded Content-Type specification. I turned up a W3C page that recommended always using it, but did not give a rationale. And I found a page that claimed specifying a character set in the HTTP response would “break older browsers.” As the page did not list those browsers, and did not appear to be written by someone involved in browser development, I'm not sure that I believe it.

For my personal website, I rely on the HTTP header, and don't use the meta tag. But I also limit myself to US-ASCII text, with HTML or numeric entities for anything that isn't ASCII. I'm not going to suggest that you remove the tag from your website (who knows, your biggest customer might have an “older browser”). But if you do use it, it should be the first thing in your <head>.

More important than whether the <meta> tag is present is that you actually get the encoding right, both in the page and in the HTTP headers.

With servlets, it's easy: the first line of your service method should be a call to ServletResponse.setContentType().

response.setContentType("text/html;charset=UTF-8");

This will set the Content-Type header and also configure the object returned by ServletResponse.getWriter(). Don't, under any circumstances, write HTML data via the object returned by ServletResponse.getOutputStream(); it exists for servlets that produce binary content.
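Putting that together, here's a minimal sketch of the pattern; the class name and markup are my own illustration, not from any real application:

import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ExampleServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws ServletException, IOException {
        // set the content type before retrieving the writer, so that
        // getWriter() returns a writer configured for UTF-8
        response.setContentType("text/html;charset=UTF-8");

        PrintWriter out = response.getWriter();
        out.println("<html><head><title>Exemple</title></head>");
        out.println("<body><p>caf\u00e9 \u20ac</p></body></html>");
    }
}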

With JSP, put the following two directives at the top of each page.

<%@page contentType="text/html"%>
<%@page pageEncoding="UTF-8"%>

These are translated into a call to ServletResponse.setContentType(), and are also used by the JSP container itself to parse the page. If, after reading this posting, you don't feel comfortable writing self-describing files, you can also use a JSP property group in your web.xml.
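If you go that route, here's a sketch of the property group; the url-pattern is an assumption, so match it to your own pages:

<jsp-config>
    <jsp-property-group>
        <url-pattern>*.jsp</url-pattern>
        <page-encoding>UTF-8</page-encoding>
    </jsp-property-group>
</jsp-config>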

One final thing: if you do choose to specify content type via http-equiv, make sure that it matches what your server is putting in the HTTP response. Otherwise, you risk having your site used as an example by someone writing about encodings.


* The Paris Craigslist omits the <meta> declaration, but retains ISO-8859-1 in the HTTP response. Which explains why all of the ads say “EUR” rather than €.

Friday, August 12, 2011

"Text File" is an Oxymoron

Back in the early 1990s, life was easy. If you worked in the United States, “text” meant ASCII. If you worked in Canada or Europe, it might mean ISO-8859-1 or windows-1252, but they were almost the same thing … unless you dealt with currency and needed to display the new Euro symbol. There were a few specialists who thought of text as wchar_t, but they were rare. Companies hired them as contractors rather than full-time employees.

This US-centric view of text is pervasive: any MIME Content-Type that begins with “text” is presumed to be US-ASCII unless it has an explicit character set specifier. Which often trips up people who create XML: XML presumes UTF-8 in the absence of an explicit encoding (solution: use application/xml rather than text/xml).

This was the world that Java entered, and it left an indelible imprint. Internally, Java looked to the future, managing strings as Unicode (UCS-2 at the time, UTF-16 today). But in the IO package, it was firmly rooted in the past, relying on the platform's “default encoding” when converting those two-byte Unicode characters into bytes. Even today, in JDK 7, FileReader and FileWriter don't support explicit encodings.

The trouble with a default encoding is that it changes from machine to machine. On my Linux machines, it's UTF-8; on my Windows XP machine at home, it's windows-1252; on my Windows XP machine from work, it's ISO-8859-1. Which means that I can move “text” files between these boxes only if they're limited to US-ASCII characters. Not a problem for me, personally, but I work with people from all over the world.
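If you're curious what your own JVM will use, it's a one-liner; this is a throwaway check, not production code:

import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    public static void main(String[] args) {
        // the charset that FileReader, FileWriter, and friends will use
        System.out.println(Charset.defaultCharset());
    }
}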

At this point in time, I think the whole idea of “text” is obsolete. There's just a stream of bytes with some encoding applied. To read that stream in Java, use InputStreamReader with an explicit encoding; to write it, use OutputStreamWriter. Or, if you have a library that manages encoding for you, stick with streams.
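Here's what that advice looks like in practice: a minimal sketch that writes and then reads a file with an explicit UTF-8 encoding (the file name and content are placeholders):

import java.io.*;

public class ExplicitEncoding {
    public static void main(String[] args) throws IOException {
        File file = new File("example.txt");

        // write: wrap the byte stream in a writer with an explicit encoding
        Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            out.write("caf\u00e9 \u20ac\n");
        } finally {
            out.close();
        }

        // read: same idea; the platform default never comes into play
        BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"));
        try {
            System.out.println(in.readLine());
        } finally {
            in.close();
        }
    }
}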

If you're not doing that, you're doing it wrong. And if you aren't using UTF-8 as the encoding, in my opinion you're doing it poorly.