Monday, October 8, 2012

Scriptlets Are Not (Inherently) Evil

JSP scriptlets have a bad reputation; everyone can point to their favorite example of scriptlet abuse. In my case, it was a 12,000 line (yes, three zeros) monstrosity that held a “mere” 1,000 lines or so of markup. The rest was a big if-else construct with plenty of embedded logic. But that example, and ones like it, are examples of bad programming, not indictments of scriptlets as a programming medium. They're no different than Java classes that have a single method with thousands of lines.

Monstrosities aside, I don't think there's a valid argument against scriptlets. Actually, when I Googled for “why are scriptlets bad,” most of the top-ranked pages defended them. As far as I can tell, the main arguments are that scriptlets encourage bad programmers to put business logic in the page, that they make the page untestable, that they limit reuse, and that web developers won't understand pages with embedded Java code. All of which seem to me like red, rotted herrings.

Don't get me wrong, I believe in the separation of concerns that underlies the MVC model. Actually, I believe in the separation of concerns found in what I call the “CSV” model: a lightweight Controller that interacts with business logic via a Service layer, and passes any returned data to a View for rendering. But after working with several alternative technologies and languages, I'm convinced that scriptlets are the best way to implement any programmatic constructs involved in view rendering.

And some amount of programmatic rendering resides in almost every view. One common example is populating a <select> element. On the surface this is an easy task: iterate over a list of values, and emit <option> elements for each. In the real world, it's more complex: the option value will come from a different field than the option text, you probably have a (possibly null) value that should be selected, and maybe you'll decorate different options with different classes or IDs. To handle this, you need a language that, if not Turing-complete, is very close.

I'm going to work through just such an example, comparing scriptlets and JSTL. Both examples use the same Spring controller, which stores two values in the request context: a list of employees in allEmployees, and the desired selection in selectedEmployee (which may be null). If you want to walk through the code, it's available here, and builds with Maven.

First up is the JSTL version. Nothing complex here: a forEach over the list of employees, and some EL to extract the bean values. Perhaps the worst thing about this code is that the HTML it generates looks ugly: line breaks carry through from the source JSP, so the <option> element gets split over three lines. You can always tell a JSP-generated page by the amount of incongruous whitespace it contains (if you're security-conscious, that might be an issue).

<select name="employeeSelector">
<c:forEach var="emp" items="${employees}">
    <option value="${emp.id}"
        <c:if test="${!(empty selectedEmployee) and (selectedEmployee eq emp)}"> selected </c:if>
    >${emp}</option>
</c:forEach>
</select>

Now for the scriptlet version. It's a few lines longer than the JSTL version, partly because I have to retrieve the two variables from the request context (something EL does for me automatically). I also chose to create a temporary variable to hold the “selected” flag; I could have used an inline “if” that matched the JSTL version, or I could have changed the JSTL version to use a temporary variable; I find that I have different programming styles depending on language.

<select name="employeeSelector">
    <%
    List<Employee> allEmployees = (List<Employee>)request.getAttribute(Constants.PARAM_ALL_EMPLOYEES);
    Employee selectedEmployee  = (Employee)request.getAttribute(Constants.PARAM_SELECTED_EMPLOYEE);
    for (Employee employee : allEmployees)
    {
        String selected = ObjectUtils.equals(employee, selectedEmployee) ? "selected" : "";
        %>
        <option value="<%=employee.getId()%>" <%=selected%>> <%=employee%> </option>
        <%
    }
    %>
</select>

Even though the scriptlet version has more lines of code, it seems cleaner to me: the code and markup are clearly delineated. Perhaps there's a web developer who will be confused by a Java “for” statement in the middle of the page, but is the JSTL “forEach” any better? Assuming that your company separates the roles of web developer and Java developer, it seems like a better approach to say “anything with angle brackets is yours, everything else is mine.”

Putting subjective measures aside, I think there are a few objective reasons that the scriptlet is better than JSTL. The first of these is how I retrieve data from the request context: I use constants to identify the objects, the same constants that I use in the controller to store them. While I have no inherent opposition to duck typing, there's a certain comfort level from knowing that my page won't compile if I misspell a parameter. Sometimes those misspellings are obvious, but if you happen to write your JSTL with “selectedEmploye” it will take rigorous testing (either manual or automated) to find the problem.

Another benefit to using scriptlets is that you can transparently call out to Java. Here, for example, I use Jakarta Commons to do a null-safe equality check.

A better example would be formatting: here I rely on Employee.toString() to produce something reasonable (and if you look at the example you may question my choice). But let's say the users want to see a different format, like “Last, First”; or they want different formats in different places. In JSTL, you'll be required to manually extract and format the fields, and that code will be copy/pasted everywhere that you want to display the employee name. Then you'll have to track down all of that copy/paste code when the users inevitably change their mind about what they want to see. Or you could add different formatting functions to the Employee object itself, breaking the separation of model and view. With a scriptlet, you can call a method like EmployeeFormattingUtils.selectionName().

OK, before I get comments, I should note that there are easier ways to implement the JSTL version, using 3rd-party tag libraries. The Spring Form Tags, for example, would reduce my code to a single line, assuming that my controller followed the Spring “command” pattern; But those libraries limit you to common cases, restricting your ability to customize your HTML. And although you can write your own taglibs, I haven't seen many projects that do that (in fact, the only ones I've seen are ones where I did it).

I want to finish with a punchline: everything I've just written is moot, because the web-app world is transitioning from server-side view rendering to client-side, using technologies such as Backbone and Spine. That change will come with its own perils, and history tends to repeat itself. I worry that five years from now we'll be dealing with multi-thousand-line JavaScript monstrosities, mixing business logic with view rendering.

Friday, August 31, 2012

If A Field Isn't Referenced, Does It Exist?

In my last post I said that, pre-annotations, finding all the classes referenced by a given class was a simple matter of scanning the constant pool for CONSTANT_Class entries. Turns out that it isn't quite that simple (which makes the last dependency analyzer I wrote not quite correct). Consider the following class:

public class ClassLoadExample
{
    private BigDecimal imNotReferenced;
    
    public void foo()
    {
//        imNotReferenced = new BigDecimal("123.45");
    }
    
    public static void main(String[] argv)
    throws Exception
    {
        System.err.println("main started");
        new ClassLoadExample().foo();
        System.err.println("foo called, main done");
    }
}

If you compile this and walk its constant pool, you won't find a CONSTANT_Class_info entry for BigDecimal. What you will find are two CONSTANT_Utf8_info entries, containing the name of the variable and its type. And you'll find an entry in the field list that references these constants.

Explicitly set the variable to null, and the constant pool gets two additional entries: a CONSTANT_Fieldref_info, and an associated CONSTANT_NameAndType_info entry that links the existing entries for the field's name and type. Initialize the field via the BigDecimal constructor, or invoke a method on the instance, and the expected CONSTANT_Class_info appears.

At first glance, this behavior seems like a WTF: you're clearly referencing BigDecimal, so why doesn't the constant pool reflect that fact? But if you think about how the JVM loads classes, it seems less a WTF and more a premature optimization … and also a reminder that every computer system carries with it the constraints present at its birth. Java was born in 1996, in a world where 256 Mb of RAM was a “workstation-class” machine, and CPU cycles were precious. Today, of course, there's more CPU/RAM in an obsolete smartphone.

To preserve memory and cycles, the JVM loads classes on an as-needed basis, nominally starting with the class holding your main() method (but see below). In order to initialize your class, the JVM will have to load its superclass and any interfaces, as well as any classes referenced in static initializers. But it doesn't have to load classes that are only referenced by member variables, because those classes won't get used until the member variable is first accessed — which might not ever happen. Even if you construct an instance of the class, there's no reason to load the class: the member variable is simply a few bytes in the instance that has been initialized to null. You don't need to load the class until you actually invoke a method on it.*

I said a premature optimization, but that's not right. The JDK has a lot of built-in classes: over 17,000 in rt.jar for JDK 1.6. You don't want to load all of them for every program, because only a relative few of them will ever be used. But one would think that, as machines became more capable, the Java compiler might add CONSTANT_Class_info entries for every referenced class, and the JVM might choose to preload those classes; the JVM spec doesn't say they couldn't. The JVM development team took a somewhat different approach, however, and decided to preload a bunch of “commonly used” classes. From a performance perspective, that no doubt makes more sense.

But for someone writing a dependency analyzer, it's a royal pain, unless you confine yourself to dependencies that are actually used.


* If you want to see classloading in action, start the JVM with the -XX:+TraceClassLoading flag. If you do this with the example program, you'll see the pre-loads, with the program class near the end. If you uncomment the assignment statement, you'll see that BigDecimal is loaded during the call to method foo().

Monday, August 13, 2012

How Annotations are Stored in the Classfile ... WTF?!?

This weekend I made some changes to BCELX, my library of enhancements for Apache BCEL. These changes were prompted not by a need to access different types of annotations, but because I'm currently working on a tool to find hidden and unnecessary dependency references in Maven projects. Why does this have anything to do with annotation processing? Read on.

Prior to JDK 1.5, the Java class file was a rather simple beast. Every referenced class had a CONSTANT_Class_info entry in the constant pool. This structure actually references another entry in the constant pool, which holds the actual class name, but BCEL provides the ConstantClass object so you don't have to chase this reference. It's very easy to find all the external classes that your program references: walk the constant pool and pull out the ConstantClass values.

That functionality is exactly what I needed to cross-check project dependencies. But when I wrote a testcase to check my dependency-extraction method, it failed. I had used the test class itself as my target, and just by chance I picked the @Test annotation as one of my assertions. As far as my dependency-extraction code was concerned, I didn't have a reference to the annotation class.

I figured that there must be some flag in the CONSTANT_Class_info structure that was confusing BCEL — its released version hasn't been updated for JDK 1.5. So I turned to the JDK 1.5 classfile doc, and it slowly dawned on me: annotation classes aren't referenced in the constant pool. Instead, you have to walk through all of the annotation attributes, and get their names out of the constant pool. OK, I should have realized this sooner; after all, it wasn't so long ago that I'd written the annotation parsing code in BCELX.

Of course, this meant that I now had to add support for parameter and field-level annotations to BCELX (I was going to have to parameters anyway, to support another project). While doing this, I discovered something else interesting: the API docs say that you can apply annotations to packages and local variables, but the classfile docs give no indication that this is actually supported.

There are a couple of things that I take from this experience. The first is that it's another piece of evidence that JDK 1.5 represented a changing of the guard at Sun. Annotations have often had a “tacked on” feel to me — right down to the @interface keyword (they broke backwards compatibility for enum, would it have been so bad to add annotation?). I'm sure there was a reason for not treating an annotation reference as just another class, but quite frankly I can't see it.

The other thing I learned is to beware testcases built around spot checks. If I had written my testcase to look for org.junit.Assert rather than org.junit.Test, I never would have found the issue — until it turned up when using the utility. But there are lots of cases where exhaustive checks aren't cost-effective. Including this one: should I write a test that verifies every possible annotation reference? I'll need to, if I want 100% coverage, but really: it's a tiny part of the overall project.

One that could have been far easier if the JVM team had cleanly integrated their changes, and followed the existing model. I suppose that's the real take-away: if you're evolving a design, don't simply tack on the changes.

Thursday, August 2, 2012

Taming Maven: A Local Repository Server

While Maven's automatic dependency retrieval is a great feature, it does have limitations. And one of the biggest of those limitations is Maven's ability to access local projects. Or, really, any projects that aren't found in the Maven Central repository.

Returning to the archetypal IT department of my previous posts, it's really painful if, before you can start working on your own project, you first have to check-out and build multiple dependent projects. Worse is if those dependencies are under development, and you have to rebuild on a regular basis.

The first part of eliminating that pain is to run a local repository server. This server is a place to deploy your local builds, including any third-party software or local patches you've made to open source software. It can also act as a proxy for Maven Central and other external repositories, protecting you from Internet outages and keeping your POMs free of <repositories> entries.

You can create a Maven repository server using a bare-bones Apache web-server: all you need to do is make sure that its document root follows the repository format. However, there are better options: Nexus and Artifactory are both purpose-built servers for managing Maven artifacts, and both come in open-source variants that you can run for free. If you don't have a local machine, or don't want the hassle of administering it, Artifactory provides cloud hosting of your repository (for a fee). Sonatype doesn't go quite that far, instead providing a pre-built EC2 image (hopefully updated since that post).

Once you've got the repository server running, you need to configure Maven to access it. The simplest approach is to add your local server as a mirror for Maven Central, as described here. Note that you can not simply add a <repositories> entry to your parent POM, as you need to deploy that POM to the repository server.

Now you face the question of how to deploy your builds. Both Nexus and Artifactory give you a web interface to manually upload artifacts, but it's far easier to use the Maven deploy goal to deploy directly from your build (using an HTTP connection to the server). Of course, that raises the issue of credentials: do you give each developer his/her own credentials (which are stored in $HOME/.m2/settings.xml), or use a single set of credentials for all?

I'm in favor of the latter: use one set of credentials, stored either in each user's personal settings file, or in the global settings file. While that may make some people cringe, the security risk is non-existent: the repository server is write-only, and it will control where you write. As long as you don't pass out the actual admin login, or use SCP to deploy, the worst a disgruntled ex-employee can do is upload new builds.

And even that that minor risk can be eliminated if your developers never have to type “deploy” — and they'd be happier too. Instead, set up a continuous integration server that examines your source repository for changes and automatically builds and deploys the changed projects. At least for snapshot builds, this ensures that all of your developers will be using the latest codebase, without any manual intervention whatsoever.

Wednesday, August 1, 2012

Taming Maven: Dependency Management

Once you have a parent POM, you can add a <dependencyManagement> section. In my view, it's often more trouble than its worth (in fact, this blog series started out as a single post suggesting properties were usually a better choice). In the worst case, a dependency management section can prevent your child builds from seeing new dependencies. There are, however, some cases where it is useful to prevent Maven from using unwanted transitive dependencies.

For those who haven't seen <dependencyManagement> in use, here's an abbreviated example:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-webmvc</artifactId>
            <version>${springframework.version}</version>
        </dependency>
    </dependencies>
</dependencyManagement>

With the dependency version specified in the parent's dependency management section, it can be omitted in the child's <dependencies>:

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-webmvc</artifactId>
    </dependency>
</dependencies>

If that was all there was to it, there would be very little reason to use a dependency management section, versus defining version properties in the parent — it might actually be a net increase in the size of your POMs. But dependency management goes a bit deeper: it will override the transitive dependencies associated with your direct dependencies. This bears further examination.

If you use the Spring framework, you know that it's broken into lots of pieces. You don't need to specify all of these as direct dependencies, because each component brings along transitive dependencies (I haven't verified this, but I think spring-mvc the only direct dependency you need for a Spring web-app).

Now consider the case where the parent POM has a dependency management section that lists all of the Spring components, and gives them version 3.0.4.RELEASE, while a child uses version 3.1.1.RELEASE, and just specifies spring-webmvc as a direct dependency. When Maven builds the child and retrieves transitive dependencies, it will ignore the 3.1.1.RELEASE implied by the direct dependency, and instead load the 3.0.4.RELEASE versions specified by the parent.

This is rarely a Good Thing. Sometimes it won't cause an actual bug: if you are using features of the new version that haven't changed since the old version, you have nothing to worry about. But more often, you'll get a NoSuchMethodError thrown at runtime. Or worse, the method is present but does something unexpected. These sorts of bugs can be incredibly painful to track down.

Version properties, of course, go a long way toward keeping these errors from occurring. But some projects will need to specify their own version properties, often because they're trying out some new functionality.

There is, however, one case where a dependency management section is useful: excluding transitive dependencies. Again using Spring as an example: it will use either commons-logging or SLF4J for its internal logging; at runtime, it figures out which is available. However, as of this writing (version 3.1.2.RELEASE), spring-core has a non-optional transitive dependency on commons-logging. Which means that your program will also have a transitive dependency on commons-logging — and if your program is a web-app, you'll find commons-logging in its deployed WAR whether you want it or not.

Perhaps some day the Spring developers will change the scope of this dependency to provided. Until then, if you don't want commons-logging you need to manually break the transitive dependency with an exclusion:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${springframework.version}</version>
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Without a dependency management section in the parent POM, you would have to repeat that exclusion in every project POM. Miss just one, and the transitive dependency appears. Move the exclusion into the parent's dependency management section, and it applies to all children. Of course, this locks down the version number; any child projects that need a different version must specify a direct dependency on spring-core.

Bottom line: don't use a <dependencyManagement> section unless you absolutely have to. And even then, keep it as small as you possibly can.

Tuesday, July 31, 2012

Taming Maven: Parent POMs

When faced with a development environment that has dozens or hundreds of distinct projects, version properties are only a first step. Even if you only have to look at one place in each POM, it's a headache to update versions. As a solution, you can specify all of your common dependencies and configuration in a “parent” POM, which is then referenced by each project.

Before continuing, I want to clear up a misconception: parent POMs are not the same as multi-module projects, even though they're described together in the Maven documentation. True, the two are often seen together: a multi-module project almost always uses a parent POM to bring order to its modules. But the parent need not be tied to the modules; an example is the Sonatype parent POM, which is used by every open-source project that deploys to Maven Central via the Sonatype repository.

A parent POM looks like a normal POM, but specifies a packaging type of “pom

<groupId>org.sonatype.oss</groupId>
<artifactId>oss-parent</artifactId>
<version>7</version>
<packaging>pom</packaging>

The children of this POM then reference it via a <parent> element:

<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
</parent>

<groupId>net.sf.kdgcommons</groupId>
<artifactId>kdgcommons</artifactId>
<version>1.0.7-SNAPSHOT</version>

So what goes into the parent POM? Version properties, of course; one of the main reasons for using a parent POM is to ensure that all projects use the same set of dependencies. Also common plugin configuration, such as the compiler, test runner, and any reporting plugins. Finally, any common environment configuration, such as repositories and deployment configuration.

What shouldn't go in the parent POM is an actual <dependencies> section, because that will cause all of your projects to have the same set of dependencies, whether they need them or not. Nor should you add plugins that only run for one or a few projects (although by all means specify the plugin versions). And finally, if your projects use an <scm> section, it needs to go in the individual project POMs — I learned the hard way that Maven won't substitute project-specific values into a section defined by the parent.

The biggest complaint that I've heard about parent POMs is “if we change a dependency, then we have to update all the projects that use that parent!” That's true: the parent is a released artifact, just like the projects themselves; a child specifies a particular version of its parent, and is not automagically updated when the parent changes (unless you use snapshot versions for the parents).

My answer to this complaint is “either it matters or it doesn't, and either way the parent helps you.” There are times when changes don't matter: for example, if you move to a new library version that's backwards compatible. In that case, projects that use the new parent get the new version, as do any projects that link with them, via transitive dependencies. Projects that don't need the new functionality don't need to be updated. Over time, you can migrate these projects to the new POM as you make changes to them.

On the other hand, sometimes the change matters: for example you've modified your database schema, and need to update all projects that use the affected business objects. In this case, the parent again makes your life easier: once you update the dependency property in the parent, it's a simple matter of grepping for that property to find children that need to be updated and re-released.

Monday, July 30, 2012

Taming Maven: Version Properties

Getting started with Maven is easy, and once you use its dependency management feature, you'll wonder why you waited so long. For simple web-apps or single-module projects, it Just Works.

However, most software developers aren't working on simple, one-module projects. We work in organizations that manage many projects, often depending on one-another. And in this situation, the basic Maven project breaks down: you find that every project has a differing set of dependencies, some of which are incompatible. This is the first in a series of postings about taming a multi-project development environment.

To start, replace all of your hardcoded dependency versions with properties.

Projects accumulate dependencies over time: you might start out with a few of the core Spring packages, then add a few of the Apache Commons projects, then a few more Spring projects, then some libraries that another part of your organization maintains. Pretty soon you'll have dozens of dependencies, completely unordered. Just finding a dependency in the mess becomes difficult, even if you have a tool like m2eclipse. And it becomes very easy to have two related dependencies — or even duplicate dependencies — with different versions. Maven can resolve most of these problems automagically, but when it fails, you're in for a long and painful diagnosis session.

But, if you use properties for your dependencies, and adopt a consistent naming strategy for those properties, you may not be able to find your dependency references, but at least the versions will be defined in one place. Start by adding a <properties> section to your POM; I generally place it near the top of the POM, before the <build> and <dependencies> sections (both of these tend to be long).

<properties>
    <org.springframework.version>3.1.1.RELEASE</sorg.pringframework.version>
    <!-- and so on, for all of your dependencies -->
</properties>

Each property is its own element, and the element name is the property name. You can name your properties anything you want (as long as it's a legal XML element name), but for version properties I think that GROUPID.version makes the most sense. Or use GROUPID.ARTIFACTID.version if there are different artifacts for the same group that don't have a common version (for example, Spring Core and Spring Security).

Next, update the dependency to use that property, rather than a hardcoded version number.

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-webmvc</artifactId>
    <version>${org.springframework.version}</version>
</dependency>

Once all of your POMs use version properties, you can start to organize dependencies across projects. This can be as simple as running grep to find all projects that use a particular property. But Maven gives you a better solution, which will be the topic of tomorrow's post.


This series of posts was prompted by several recent projects where I worked with development organizations that had large project bases built on Maven. I was originally planning to show some of the Vim macros that I used to clean up POMs, but decided instead to start work on a tool to clean up POMs.

Wednesday, July 25, 2012

Business Logic and Interactive Web Applications

By the mid-2000s, the structure of an MVC web-app had gelled: business logic belonged in the model (which was usually divided into a service layer and persistent objects), a thin controller would invoke model methods and select data to be shown to the user, and a view held markup and the minimal amount of code needed to generate repeating or optional content. There might be some client-side validation written in JavaScript, but it was confined to verifying that fields were filled in with something that looked like the right value. All “real” validation took place on the server, to ensure that only “good” data got persisted.

Then along came client-side JavaScript frameworks like JQuery. Not only could you make AJAX calls, but they let you easily access field values and modify the rendered markup. A few lines of JavaScript, and you have an intelligent form: if option “foo” is selected, hide fields argle and bargle, and make fields biff and bizzle read-only.

The problem with intelligent front-ends is that they almost always duplicate the logic found in the back end. Which means that, inevitably, the two will get out of sync and the users will get upset: bargle was cleared but they expected it to hold the concatenated values of biff and bizzle.

There's no good solution to this problem, although it's been faced over and over. The standard solution with a “thick client” application was layered MVC: each component had its own model, which would advertise its changes to the rest of the app via events. These events would be picked up by an application-level controller, which would initiate changes in an application-level model, which would in turn send out events that could be processed by the components. If you were fastidious, you could completely separate the business logic of both models from the GUI code that rendered those models.

I don't think that approach would work with a web-app. The main reason is that the front-end and back-end code are maintained separately, using different languages. There's simply no way to look one place and see all the logic that applies to bizzle.

Another problem is validation. The layered approach assumes that each component sends data that's already been validated; there's no need for re-validation at the lower levels. That may be acceptable for internal applications, but certainly not for something that's Internet-facing.

One alternative is that every operation returns the annotated state of the application model: every field, its value, and a status code — which might be as simple as used/not-used. The front-end code can walk that list and determine how to change the rendered view. But this means contacting the server after every field change; again, maybe not a problem on an internal network, but not something for the Internet.

Another alternative is to write all your code in one language and translate for the front end. I think the popularity of GWT says enough about this approach.

I don't have an answer, but I'm seeing enough twisted code that I think it's an important topic to think about.

Monday, July 9, 2012

Introducing Pathfinder

I've been on a new assignment for the last couple of months, located in Center City Philadelphia. On the one hand, the commute is great: I can walk to the local train station, and have a quiet 25 minutes to read or work on the train. On the other hand, orienting my morning around the train schedule has thrown a monkey wrench into my blog posts. I have a dozen or more half-written ideas waiting to be cleaned up and published. I never realized how much post-production I normally do: adding links, tweaking the HTML once it's on Blogger, and whatnot. Did I mention that Septa doesn't have wifi on their trains? (score another point for Boston)

Instead, I've been working on Pathfinder, a tool to examine Java web apps and tell you the URLs that they handle. It was inspired by rake routes, a tool from the Ruby/Rails world. My current job has me enhancing legacy web-apps, and I think that knowing the classes associated with a URL is a good way to start learning a codebase. If I just had to deal with Spring, I could rely on STS; my goal for Pathfinder is to (eventually) handle all web frameworks, obsolete or not.

It's got a way to go. Right now it handles servlets, JSPs, and Spring apps, and the latter must use either SimpleUrlHandlerMapping or a component scan with @RequestMapping. But I think the basic design is solid and extensible, and will be updating it as I run into things that it can't handle.

I've learned an enormous amount about Spring while developing it. For a framework that espouses convention over configuration, I needed to handle an enormous number of special cases. For example, there are at least two default locations where you can save your Spring context files. And did you know that @RequestMapping actually takes an array of URLs?

Speaking of which, I also learned a lot about how annotations are stored in the classfile. It's easy to use reflection to access annotations on classes that are already loaded. But because of some code that I'd seen in a legacy Spring app, I didn't want to actually load the web-app. No problem, I thought, there are a bunch of libraries that already exist for working with classfiles.

As it turned out, not so much. My old standby, BCEL, has some code in the trunk to deal with annotations. But its last released version — from 2006 — doesn't handle them; they're just an “unknown” attribute type. The “new hotness,” ASM, does support annotations, but you wish it didn't: you have to decipher each annotation using a big if-else chain filled with instanceof.

Which brings me to the second introduction of this post: BCELX. The name stands for “BCEL extensions,” and it's built on top of BCEL. Right now it just handles class and method annotations; I'll add parameter annotations when I need them. And it doesn't handle nested annotations — I can't find an example of a nested annotation to build a testcase.

BCELX may expand beyond annotation parsing: it seems that there's a need out there for a simple tree-structured view of a Java classfile, and all the existing libraries are on the visitor wagon. I just have to figure out how to schedule it into my 25 minutes of quiet time each morning.

Wednesday, March 21, 2012

Static Variables Considered Harmful

In my last post, I mentioned that a particular programmer had abused static initializers. I've been thinking more about this, and decided that the real problem was that this programmer actually abused static variables. The initializer abuse was simply a side effect, an attempt to make a bad design work.

To start, I'm going to give you my thoughts on static initializers: they're a code smell in and of themselves. If you're not familiar with a static initializer, here's an example: we have a static Map used for a lookup table, and want need to populate it before use. The static initializer is run at the time the class is loaded, before any code is allowed to use the class, so we can guarantee that the map will be populated.

public class MyClassWithALookupTable
{
    private static Map<String,String> states = new HashMap<String,String>();
    
    static
    {
        states.put("AK", "Alaska");
        states.put("AL", "Alabama");
        // ...
    }

This sort of code is, in my opinion, the only valid use for a static initializer: initializing an object that will never change, using code that is guaranteed never to throw. In effect, creating a constant.

The “guaranteed never to throw” part is important: static initializers run before any code is allowed to use the class; what happens if they do throw? The language spec says that static initializers are not permitted to throw checked exceptions, but they are allowed to throw unchecked exceptions (although a smart compiler will make you hide the exception in a called method). If the initializer does throw, you'll get an ExceptionInInitializerError the first time you access the class.

How do you debug such errors? You can't use a debugger, because the class has to be loaded to do set a breakpoint. You can insert a logging statement — assuming that your logging framework is able to initialize (one of the more painful things I've ever debugged was with a homegrown addition to Log4J that had a static initializer that threw). Even if you log the exception, you have to find where it's being thrown: people who abuse static initializers tend to do it a lot, so you might have to work through a long list of classes, adding exception handlers to each, before you find the one that's broken.

OK, so static initializers are a code smell. How do I make the leap from there to “static variables considered harmful?”

Going back to the legacy application in question, the original programmer used a static initializer to populate a lookup map. But the values to populate that map came from Hibernate. And since Hibernate was accessed through a DAO loaded as a Spring bean, he had to create a way to access the application context in a static manner. So he created a new class, with a static variable to hold the app context. And he wired this class into the DispatcherServlet configuration.

It's a long list of hacks, but here's the thing: every one of them logically proceeds from the one before it. And it comes down to the lookup table whose contents are allowed to vary.

There are certainly other solutions that he could have used. For example, I replaced the static initializer with a Spring post-construct from the associated DAO. But this is only one example of how static variables (versus constants) can lead you astray. More common are concurrency bugs, or “how did that value get there?” bugs. It seems a lot smarter to simply say “don't do this,” with the caveat for experts of “don't do this (yet).”

Monday, March 19, 2012

Synchronizing put() is Not Sufficient

Have you ever seen a class like this, which synchronizes put() but not get()? Have you ever written one?

public class MyConcurrentlyAccessedLookupTable
{
    private Map<String,MyDataObject> data = new HashMap<String,MyDataObject>();
    
    // ...
    
    public MyDataObject get(String key)
    {
        return data.get(key);
    }
    
    public synchronized void put(String key, MyDataObject value)
    {
        data.put(key, value);
    }
}

I've been working over a legacy codebase recently, and have seen a lot of them. And I want to track down the original developer, give him a good shake, and say “this only works bacause you never put anything into the map after initializing it!” Actually, this is one of the least egregious ways that he (yes, he, I know his name) managed to fail at concurrent programming. Don't get me started on his abuse of static initializers.

OK, it feels good to let that out. I can at least understand why he thought it would work. A HashMap, by its nature is resilient to incorrectly managed concurrent access. Under normal operation, new items get added to a linked list; a get() should either see or not see the new entry. With put(), because two new entries might get added to the same bucket list and the scheduler might decide to switch threads in the middle, one of the entries might disappear.

And then there's resize: doubling the size of the hash table to reduce the depth of the bucket chains. And, at least in the Sun implementation, there's no attempt to preserve the chains during this operation (I can't see why any implementation would, but there might be one). So put() could get a chain using the old table, but that chain could be assigned to a different bucket by the time the new entry is attached.

OK, so in both cases get() fails to find an entry that is actually in the map, is that really so bad? Well, is it bad if you spend days trying to figure out why a transaction failed when the data was obviously present? Or if that failed transaction caused your company to lose a sale?

The problem with concurrency bugs is that they aren't easily reproducible. Sometimes they appear once and never again. But they're still bugs, and they're not that difficult to avoid: when in doubt, synchronize all access to shared state. If you understand your access patterns (eg, you always initialize and then only read), don't synchronize at all.

But never go halfway.

Wednesday, March 7, 2012

A Month with Ruby and Rails

My current project is moving slowly, so I've been making changes to our corporate website. It uses Ruby on Rails, so has given me the opportunity to learn both in the context of a real project. At this point I think I'm comfortable enough with both Ruby and Rails to make some comments, albeit still from a novice's perspective. And my first comment is that I don't love Ruby, but I don't hate it either. I was told that I'd go down one of those two paths, but it hasn't happened yet.

There's a lot to like about Ruby, and my favorite is that Nil is an object, allowing constructs like foo.present?. This doesn't eliminate null-pointer errors, but it reduces the number of times that you have to worry about them. For example, the conditional expression foo == "bar" doesn't throw, it returns false (which is just what you'd want).

I also like the way that Ruby allows any method to take an optional block. This is the base for standard Ruby constructs like each, also allows for a form of “progressive enhancement.” A nice example of this is in the event_calendar gem, where you can pass in a block that decorates each event (in the README, it's used to wrap each event in a link). If you don't pass that block, you get a basic calendar.

OK, now it's time for my dislikes, and chief of these is that objects and hashes are two different things. One of the nice features of JavaScript is that foo["bar"] and foo.bar do the same thing: they retrieve a member from an object. Ruby, however, is surprisingly formal in its approach to member access — indeed, it's much like Java in that member variables are private and you have to create accessor methods for them. And while Ruby gives you some syntactic sugar to make the process easier, it's still annoying boilerplate, especially when you have to manually extract hash arguments in initialize.

I can understand the desire to limit access to member variables. But then Active Record throws it's own wrench into the mix, by defining a [] method. Which some libraries (like event_calendar) use because it has the nice property of returning Nil if the member doesn't exist. Which in turn means that, if you want to pass your non-Model DTO to those libraries, you have to implement the following reflective code:

def [](attr_name)
  instance_variable_get("@#{attr_name}")
end

That brings me to my second dislike, which is focused more on Rails than Ruby: the “DSL” approach to programming can quickly become uncontrolled. When I was first exposed to Rails, I really liked the way that it leveraged method_missing to provide convention-driven operations such as find_person_by_last_name. After working with it for a while, I've come to the conclusion that it's really 10,000 helper methods to cover corner cases where convention doesn't work. And they're all defined by modules that are transparently mixed into your classes.

This wouldn't be so bad, except the documentation is truly horrendous. Take ActiveRecord: it's a module that consists solely of other modules, so its documentation page just lists those modules. There's no index, so finding anything becomes a treasure hunt, often involving the method method. Even then, I find myself reading source code. For example, the [] method is defined by ActiveRecord::Base, but there's no mention of it in the API docs. Google truly has become my friend.

I'll close with StringInquirer. This is a class that lets you write foo.value? rather than foo == "value". When I discovered this (trying to understand Rails.env), I sent an email to a friend that described it as “either a brilliant use of method_missing or an utter perversion of same.” I suspect that my final answer to that question will determine whether I love or hate Ruby.

Friday, March 2, 2012

Rise of the (64 bit) Machines

In my bytebuffer presentation, I note that 64-bit machines have become inexpensive and ubiquitous, and ask the rhetorical question “what are we doing with all this memory?” This is the setup to a fake graph that proclaims that it's being used to play FarmVille™.

A joke, yes, but it points to a bigger truth: when a consumer-class machine has multiple cores and 8Gb of memory, there's a lot of computing power that isn't being used. And these machines represent the next round of corporate desktop purchases. Five years from now, a typical office user will have four cores, eight gig of RAM, and a terabyte of disk. You don't need that to run Outlook.

I think this is the next opportunity for virtualization technology. To date, virtualization has been used to carve a large machine into smaller pieces. In the future I see it being used to carve up desktop machines: the person sitting in front of the computer will get only a fraction of the capacity of his or her machine. The IT department will get the rest.

There are a lot of problems to be solved before a business runs its critical infrastructure on the desktops of its employees. For one thing, the power buttons have to be disabled — or at least repurposed to shut down only the virtual machine. And virtualization software will have to evolve, to be hidden from the desktop user. At the least, the local display, keyboard, and mouse have to act like they always have.

But more important, our ways of thinking about enterprise-scale software will also have to change. We must write software that is completely unaware of where it's running. And which can recover even if the computer is unplugged.

Today's “cloud,” be it EC2 or a corporate data center, is our chance to experiment. Think of it as training wheels for a fully distributed future.

Thursday, February 23, 2012

What Does Namespace URI Tell You?

When working with XML, you'll often see namespace definitions like this, and wonder what significance the “http” URI might have.

<beans xmlns="http://www.springframework.org/schema/beans"
             xmlns:int="http://www.springframework.org/schema/integration"
             xmlns:ctx="http://www.springframework.org/schema/context"
             …

The answer: absolutely none. It's simply a string that the parser associates with each namespaced element or attribute. As far as the parser is concerned, it could be “foo”.

I can already hear people saying “no, you're wrong! put that URL into your browser!” And, in the case of Spring namespaces, the URL will take you to a page where you can see the relevant schemas. But this is simply a convenience provided by SpringSource, a way to make developers' lives easier.

It's not mandated by specification, nor suggested by convention. And most of us don't have the control over our corporate website necessary to make it happen. If you do, great; if not, then don't feel that you need to use a “http” namespace URL.

I personally prefer a URN — a Uniform Resource Name, rather than a Uniform Resource Locator — because it doesn't give the impression that there might be something that could be located.

The rules for constructing a URN are simple:

<URN> ::= "urn:" <NID> ":" <NSS>

The NID portion is the “namespace identifier,” and there are also rules for registering your namespace with the Internet Assigned Numbers Authority. But those same rules allow “experimental” namespaces, which are anything starting with “X-”. So you can construct perfectly legal URNs using your company's domain name (reversed here, in keeping with the Java package naming convention).

urn:X-com.mycompany:NSS

The NSS portion is the “namespace specific string,” and can be anything you want. If you have corporate coding standards, you can follow them; otherwise, something like “PROJECT:PURPOSE” works well:

urn:X-com.mycompany:frobulator:schemas

And that's it. You have a namespace URI that won't confuse anybody as to its purpose: it's used only to uniquely group elements and attributes.

Friday, February 17, 2012

Setting up a Subversion Repository for a Small Team

Most organizations that use Subversion for source control set up an Apache server in front of the repository. For a small team, this can be daunting, especially if nobody on the team has experience running and configuring Apache. Even in a larger organization, with experienced Apache admins, I question the benefit. Especially if you rely on Apache's built-in authentication mechanism, which involves keeping a private list of Subversion users, separate from your organization's main user list. One that's unlikely to get updated every time someone joins or leaves the organization.

There is an alternative: Subversion comes with the svnserve server, which listens on port 3690, and speaks it's own protocol with Subversion clients. It's deceptively easy to use:

# on the server
svnserve -d -r /opt/svn/repo/

# and on the client ...
svn co svn://localhost workdir

But there are several problems with running svnserve like this. First, you manually started the server, and you have to restart it if the system ever goes down. While you can configure inetd to automatically start the server, that adds a task for your sysadmins. And unless your developers are also your sysadmins, this will take time and may be a point of contention.

A bigger problem is that running svnserve in this way means that all commits are anonymous. The output of svn annotate will be meaningless, but worse, anybody that can physically access your repository will be able to update it. Clearly, you wouldn't want to expose this repo through the corporate firewall. And while svnserve provides several options for authentication, they take you back to a second user database.

But there's an alternative: identify the repository using a svn+ssh URL. This tells a client to open an SSH connection to the server, and run svnserve over that connection.

svn co svn+ssh://localhost/opt/svn/repo workdir

There is a lot to recommend this approach. First, it uses the server's own user database for authentication and authorization; no more anonymous commits. Second, system administrators are a lot more willing to open the SSH port to the outside world, rather than some custom server that has unknown security holes. Third, an experienced user can set up public key authentication, making the whole login process transparent.

The biggest drawback to running Subversion in this way is that, because each user connects as him or herself, all users need to be able to physically write files in the Subversion repository. In a small team, where everybody knows everybody else, I don't see this as a big issue. However, if you're concerned about security, you can take the following steps:

  • Use a dedicated host for the Subversion repository, and configure SSH to limit “ordinary” users to only run svnserve (you'll still need full access for admins).
  • Create a “subversion” user group — or, for larger teams, a project group “ and assign it to the people who are to have access.

Once you've created the repository, you need to grant write permissions on some of the files in it. Assuming that you're creating repositories in the /opt/svn directory, and have a subversion group, here are the commands to set up your repository (run by root):

cd /opt/svn
svnadmin create repo

# there are some files in the repository that get created from the first commit;
# having root create the standard trunk/tags/branches layout gets you that commit

svn co file://`pwd`/repo working
cd working
svn mkdir trunk
svn mkdir tags
svn mkdir branches
svn commit -m "create initial structure"
cd ..
rm -rf working

# now we can modify the actual repository 

chgrp -R subversion repo
cd repo/db
chmod g+w .
chmod -R g+w current min-unpacked-rev rep-cache.db revprops revs transactions txn-* write-lock

If you've changed all the files correctly, running ls -al on the /opt/svn/repo/db directory should show you this:

total 56
drwxrwsr-x 6 root subversion 4096 Feb 11 22:02 .
drwxr-xr-x 6 root subversion 4096 Feb 11 21:53 ..
-rw-rw-r-- 1 u1   subversion    2 Feb 11 22:02 current
-r--r--r-- 1 root subversion   22 Feb 11 21:53 format
-rw-r--r-- 1 root subversion 1920 Feb 11 21:53 fsfs.conf
-rw-r--r-- 1 root subversion    5 Feb 11 21:53 fs-type
-rw-rw-r-- 1 root subversion    2 Feb 11 21:53 min-unpacked-rev
-rw-rw-r-- 1 root subversion 4096 Feb 11 21:54 rep-cache.db
drwxrwsr-x 3 root subversion 4096 Feb 11 21:53 revprops
drwxrwsr-x 3 root subversion 4096 Feb 11 21:53 revs
drwxrwsr-x 2 root subversion 4096 Feb 11 22:02 transactions
-rw-rw-r-- 1 u1   subversion    2 Feb 11 22:02 txn-current
-rw-rw-r-- 1 root subversion    0 Feb 11 21:53 txn-current-lock
drwxrwsr-x 2 root subversion 4096 Feb 11 22:02 txn-protorevs
-rw-r--r-- 1 root subversion   37 Feb 11 21:53 uuid
-rw-rw-r-- 1 root subversion    0 Feb 11 21:53 write-lock

Wednesday, February 15, 2012

The Role of a Manager

Joel Spolsky recently wrote a guest post for the VC funding StackExchange. It's similar to many of his posts on management, following the theme that management “exists to move the furniture around so that [the people with knowledge] can make the hard decisions.“

It's a theme that has been repeated many times, by many writers, and I generally agree with it. But Joel tries to support this theme with an example that I emphatically disagree with.

When two engineers get into an argument about whether to use one big Flash SSD drive or several small SSD drives, do you really think the CEO is going to know better than the two line engineers, who have just spent three days arguing and researching and testing?

The easy answer is “no, the CEO won't know better.” But that's not the point. If you have two engineers that can't come to a consensus after three days, then you need a CEO — or line manager — who can step in an break the logjam. Who can listen to the arguments and discern what actually matters to the business.

Or who can say “I can't tell the difference but we can't argue forever. I'll flip a coin and if we're wrong we'll revisit the decision.”

The best manager that I ever had was non-technical, and knew it. He also had a great bullshit detector, and generally knew when to let his subordinates fight. But when the time came to make a decision, he didn't hesitate to make one.

Too many managers don't have the skills to mediate between equally good options. Or they aren't willing to take responsibility for decisions. In which case, they serve no higher role than moving the furniture around. But the world doesn't need managers like that.

Tuesday, February 14, 2012

A Bash Function to Create Git Branches

I'm currently doing some work on our corporate website, working through a long list of generally minor issues and learning Rails in the process. I'm still at the point where I want to code-review every change, and I also have the philosophy that every issue should be independent. So I'm creating a lot of branches.

After accidentally creating a couple of those branches from other branches rather than master, I added the following function to my .bashrc:

function branch { 
    if git branch | grep -q " $1\$" ; then
        git checkout $1
    else
        git checkout master
        git checkout -b $1
        git push origin $1
    fi ; }

Invoke this as branch branch_name. If there's already a branch with that name, it runs git checkout to switch to that branch — which will fail if you have unstashed changes on the current branch.

Otherwise, it switches to master and then creates the new branch. In both the local and origin repository (we use GitHub at work, so everyone on a project has his or her own fork). If you don't have a similar setup, delete that push — or change it to point to a backup repository.

Monday, January 9, 2012

Marking Time

How long does it take your program to do a task? It's not a question that you usually ask — at least not until that task takes “too long.” And while it seems like a simple question, there are two ways to measure program execution time, and the difference between those two measurements is important. For one thing, it can tell you whether you have any chance of making the task take less time.

The typical measurement is “wall clock” time: the time you'd get by watching a physical clock on the wall. Although in the real world, wall-clock measurements usually come from log output:

2012-01-04 17:21:51,434 [main] DEBUG IMAPReader - connecting to imap.gmail.com
2012-01-04 17:21:53,164 [main] DEBUG IMAPReader - login successful
2012-01-04 17:21:53,164 [main] DEBUG IMAPReader - retrieving mailbox list
2012-01-04 17:21:53,286 [main] DEBUG IMAPReader - selecting mailbox: INBOX
2012-01-04 17:21:53,392 [main] DEBUG IMAPReader - executing SEARCH for all messages
2012-01-04 17:21:53,525 [main] DEBUG IMAPReader - fetching message 1
2012-01-04 17:21:53,639 [main] DEBUG IMAPReader - disconnected

This program makes a connection to Google's IMAP mail service and downloads a single message. It takes slightly more than two seconds to do this, which exceeds the one second limit required to maintain the user's train of thought. You might be tempted to fire up a profiler to see where the code is spending its time, and perhaps rewrite some of it.

A better — easier, faster — option is to measure the program's CPU time.

CPU time is the time that the processor actually runs the process. And unless CPU time is a significant fraction of wall-clock time, then any attempts at performance optimization will be wasted. Java lets you measure CPU time — with some caveats — using the ThreadMXBean class from the java.lang.management package.

ThreadMXBean threadMX = ManagementFactory.getThreadMXBean();

long start = threadMX.getCurrentThreadUserTime();
// time-consuming operation here
long finish = threadMX.getCurrentThreadUserTime();

The getCurrentThreadUserTime() method returns the number of nanoseconds that have been spent executing the current thread's code; you'll have to divide by 1000000 to convert to milliseconds. It's user-mode time, so doesn't include time spent by the kernel on behalf of the thread. Since you can only optimize your own code, in my opinion that's the best measurement.

But, as I said, there's a caveat: this method only works if the JVM and OS support thread-level time measurement; if they don't, it will throw UnsupportedOperationException. As long as you're using the Sun JVM, on recent Windows or Linux systems, you shouldn't have a problem. And if you're worried, you can call ThreadMXBean.isCurrentThreadCpuTimeSupported() before getting the CPU time. Also call isThreadCpuTimeEnabled() ; even if the JVM supports thread CPU time, it might not be enabled.

Adding CPU-time logging into the program, we get the following:

2012-01-06 07:28:37,238 [main] DEBUG JakartaIMAPReader - connecting to imap.gmail.com (current cpu = 230000000 nanos)
...
2012-01-06 07:28:39,205 [main] DEBUG JakartaIMAPReader - disconnected (current cpu = 800000000 nanos)

The difference is 570000000 nanoseconds; 570 milliseconds, or about 25% of the overall runtime. Even if you could optimize this code to zero, you'd still have 1.5 seconds of wall-clock time left; the overall speedup would be minimal. While a profiler would have told me this as well, it was a lot faster for me to add a couple of lines of code and edit two logging statements, than to run the profiler and navigate through its results.

In this case, I didn't expect much CPU consumption: I'm connecting to a remote email server, and network calls always take time, during which the program is blocked inside the OS kernel, waiting for data to arrive. And even if I didn't know that network connections behave that way, simply looking at a tool like top or the Windows Task Manager would have told me that my program wasn't consuming much CPU.

A more interesting case is when one thread is taking a lot of wall-clock time but very little CPU time, and the program as a whole is CPU-bound. In this case, the other threads in the program are affecting the thread that you need to optimize. Usually, you'll find some sort of a synchronization barrier is the cause: the thread that you're tracking is waiting for data to be delivered from one or more other threads, which are taking a long time to create that data. By measuring CPU time, you know that you have to (recursively) examine those other threads.

There's one final case that I want to mention, and that's where the CPU is being consumed by the garbage collector. This happens with message-processing applications, where the incoming messages are relatively large (especially large byte[]s that go directly into the tenured generation), and/or generate a lot of temporary objects during processing. Your best tool in this situation is JConsole to see just how much time is being spent in garbage collection.

The bottom line: before thinking about optimizations, first verify whether there's anything you can optimize.