Friday, August 31, 2012

If A Field Isn't Referenced, Does It Exist?

In my last post I said that, pre-annotations, finding all the classes referenced by a given class was a simple matter of scanning the constant pool for CONSTANT_Class entries. Turns out that it isn't quite that simple (which makes the last dependency analyzer I wrote not quite correct). Consider the following class:

import java.math.BigDecimal;

public class ClassLoadExample
{
    private BigDecimal imNotReferenced;
    
    public void foo()
    {
//        imNotReferenced = new BigDecimal("123.45");
    }
    
    public static void main(String[] argv)
    throws Exception
    {
        System.err.println("main started");
        new ClassLoadExample().foo();
        System.err.println("foo called, main done");
    }
}

If you compile this and walk its constant pool, you won't find a CONSTANT_Class_info entry for BigDecimal. What you will find are two CONSTANT_Utf8_info entries, containing the name of the variable and its type. And you'll find an entry in the field list that references these constants.

Explicitly set the variable to null, and the constant pool gets two additional entries: a CONSTANT_Fieldref_info, and an associated CONSTANT_NameAndType_info entry that links the existing entries for the field's name and type. Initialize the field via the BigDecimal constructor, or invoke a method on the instance, and the expected CONSTANT_Class_info appears.
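To see this for yourself without BCEL, here is a minimal sketch that reads just enough of a classfile to compare the two places a type name can appear: the CONSTANT_Class entries in the constant pool, and the descriptors in the field list. The class name FieldTypeScan and its two public sets are my own inventions for illustration; the layout it follows is the one in the JVM spec, and it stops reading after the field list.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper (not part of BCEL or BCELX): constant pool vs. field list. */
public class FieldTypeScan
{
    public final Set<String> classEntries = new TreeSet<>();
    public final Set<String> fieldDescriptors = new TreeSet<>();

    public FieldTypeScan(InputStream in) throws IOException
    {
        DataInputStream data = new DataInputStream(in);
        if (data.readInt() != 0xCAFEBABE)
            throw new IOException("not a classfile");
        data.readInt();                                 // minor + major version
        int count = data.readUnsignedShort();           // constant_pool_count
        String[] utf8 = new String[count];
        int[] classNameIndex = new int[count];
        for (int i = 1; i < count; i++)                 // slot 0 is unused
        {
            int tag = data.readUnsignedByte();
            switch (tag)
            {
                case 1:                                 // Utf8: u2 length + bytes
                    utf8[i] = data.readUTF();
                    break;
                case 7:                                 // Class: index of its name
                    classNameIndex[i] = data.readUnsignedShort();
                    break;
                case 5: case 6:                         // Long, Double: two slots
                    data.readLong();
                    i++;
                    break;
                case 8: case 16: case 19: case 20:      // entries with a u2 payload
                    data.readUnsignedShort();
                    break;
                case 15:                                // MethodHandle: u1 + u2
                    data.readUnsignedByte();
                    data.readUnsignedShort();
                    break;
                case 3: case 4: case 9: case 10:        // entries with 4 bytes of payload
                case 11: case 12: case 17: case 18:
                    data.readInt();
                    break;
                default:
                    throw new IOException("unknown constant tag " + tag);
            }
        }
        for (int idx : classNameIndex)
            if (idx != 0)
                classEntries.add(utf8[idx]);

        data.readUnsignedShort();                       // access_flags
        data.readUnsignedShort();                       // this_class
        data.readUnsignedShort();                       // super_class
        int interfaces = data.readUnsignedShort();
        for (int i = 0; i < interfaces; i++)
            data.readUnsignedShort();

        int fields = data.readUnsignedShort();
        for (int i = 0; i < fields; i++)
        {
            data.readUnsignedShort();                   // access_flags
            data.readUnsignedShort();                   // name_index
            fieldDescriptors.add(utf8[data.readUnsignedShort()]);
            int attrs = data.readUnsignedShort();
            for (int a = 0; a < attrs; a++)
            {
                data.readUnsignedShort();               // attribute_name_index
                data.readFully(new byte[data.readInt()]);   // skip attribute body
            }
        }
    }

    public static void main(String[] argv) throws IOException
    {
        try (InputStream in = new FileInputStream(argv[0]))
        {
            FieldTypeScan scan = new FieldTypeScan(in);
            System.out.println("CONSTANT_Class entries: " + scan.classEntries);
            System.out.println("field descriptors:      " + scan.fieldDescriptors);
        }
    }
}
```

Run it against ClassLoadExample.class and compare the two lines of output: per the description above, the field's type should show up among the descriptors but not among the CONSTANT_Class entries.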

At first glance, this behavior seems like a WTF: you're clearly referencing BigDecimal, so why doesn't the constant pool reflect that fact? But if you think about how the JVM loads classes, it seems less a WTF and more a premature optimization … and also a reminder that every computer system carries with it the constraints present at its birth. Java was born in 1996, in a world where 256 MB of RAM was a “workstation-class” machine, and CPU cycles were precious. Today, of course, there's more CPU/RAM in an obsolete smartphone.

To preserve memory and cycles, the JVM loads classes on an as-needed basis, nominally starting with the class holding your main() method (but see below). In order to initialize your class, the JVM will have to load its superclass and any interfaces, as well as any classes referenced in static initializers. But it doesn't have to load classes that are only referenced by member variables, because those classes won't get used until the member variable is first accessed — which might not ever happen. Even if you construct an instance of the class, there's no reason to load the class: the member variable is simply a few bytes in the instance that has been initialized to null. You don't need to load the class until you actually invoke a method on it.*
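A rough way to watch this laziness from inside a program, without any JVM flags, is to put a static initializer on the field's class. One caveat: this sketch observes class initialization, which is related to loading but not the same thing (the JVM is free to load a class before it initializes it). The names LazyInitDemo, Heavy, and Holder are made up for the example.

```java
public class LazyInitDemo
{
    // records when Heavy's static initializer runs
    static final StringBuilder log = new StringBuilder();

    static class Heavy
    {
        static { log.append("Heavy initialized"); }
    }

    static class Holder
    {
        private Heavy neverAssigned;            // just a null reference in the instance

        void use()
        {
            neverAssigned = new Heavy();        // first active use of Heavy
        }
    }

    public static void main(String[] argv)
    {
        Holder holder = new Holder();
        System.err.println("after construction: [" + log + "]");
        holder.use();
        System.err.println("after use():        [" + log + "]");
    }
}
```

Constructing a Holder leaves the log empty; only the call to use(), which actually creates a Heavy, triggers the initializer.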

I said a premature optimization, but that's not right. The JDK has a lot of built-in classes: over 17,000 in rt.jar for JDK 1.6. You don't want to load all of them for every program, because only a relative few of them will ever be used. But one would think that, as machines became more capable, the Java compiler might add CONSTANT_Class_info entries for every referenced class, and the JVM might choose to preload those classes; the JVM spec doesn't say they couldn't. The JVM development team took a somewhat different approach, however, and decided to preload a bunch of “commonly used” classes. From a performance perspective, that no doubt makes more sense.

But for someone writing a dependency analyzer, it's a royal pain, unless you confine yourself to dependencies that are actually used.


* If you want to see classloading in action, start the JVM with the -XX:+TraceClassLoading flag. If you do this with the example program, you'll see the pre-loads, with the program class near the end. If you uncomment the assignment statement, you'll see that BigDecimal is loaded during the call to method foo().

Monday, August 13, 2012

How Annotations are Stored in the Classfile ... WTF?!?

This weekend I made some changes to BCELX, my library of enhancements for Apache BCEL. These changes were prompted not by a need to access different types of annotations, but because I'm currently working on a tool to find hidden and unnecessary dependency references in Maven projects. Why does this have anything to do with annotation processing? Read on.

Prior to JDK 1.5, the Java class file was a rather simple beast. Every referenced class had a CONSTANT_Class_info entry in the constant pool. This structure actually references another entry in the constant pool, which holds the actual class name, but BCEL provides the ConstantClass object so you don't have to chase this reference. It's very easy to find all the external classes that your program references: walk the constant pool and pull out the ConstantClass values.

That functionality is exactly what I needed to cross-check project dependencies. But when I wrote a testcase to check my dependency-extraction method, it failed. I had used the test class itself as my target, and just by chance I picked the @Test annotation as one of my assertions. As far as my dependency-extraction code was concerned, I didn't have a reference to the annotation class.

I figured that there must be some flag in the CONSTANT_Class_info structure that was confusing BCEL — its released version hasn't been updated for JDK 1.5. So I turned to the JDK 1.5 classfile doc, and it slowly dawned on me: annotation classes aren't referenced in the constant pool. Instead, you have to walk through all of the annotation attributes, and get their names out of the constant pool. OK, I should have realized this sooner; after all, it wasn't so long ago that I'd written the annotation parsing code in BCELX.
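For the curious: an annotation attribute identifies its annotation class by pointing at a CONSTANT_Utf8_info entry that holds a field descriptor, such as Lorg/junit/Test;. Field and method descriptors use the same notation, so a dependency extractor needs some way to turn descriptors into class names. Here is a minimal sketch of that step; DescriptorNames is a hypothetical helper, not the code in BCELX, and it handles plain descriptors only (generic Signature attributes would need a real parser).

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper: pulls class names out of a field or method descriptor. */
public class DescriptorNames
{
    public static Set<String> classNames(String descriptor)
    {
        Set<String> names = new LinkedHashSet<>();
        for (int i = 0; i < descriptor.length(); i++)
        {
            // in a valid descriptor, everything outside "L...;" is a one-character
            // primitive code, an array bracket, or a parenthesis
            if (descriptor.charAt(i) != 'L')
                continue;
            int end = descriptor.indexOf(';', i);
            if (end < 0)
                break;                          // malformed: stop rather than guess
            names.add(descriptor.substring(i + 1, end).replace('/', '.'));
            i = end;                            // resume after the semicolon
        }
        return names;
    }

    public static void main(String[] argv)
    {
        System.out.println(classNames("Lorg/junit/Test;"));
        System.out.println(classNames("(Ljava/lang/String;I)[Ljava/util/List;"));
    }
}
```

The two calls in main() print [org.junit.Test] and [java.lang.String, java.util.List] respectively. Applying this blindly to every Utf8 entry in the constant pool would also match string literals that happen to look like descriptors, so it's safer to apply it only to entries that an attribute, field, or method actually points at.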

Of course, this meant that I now had to add support for parameter and field-level annotations to BCELX (I was going to have to add parameters anyway, to support another project). While doing this, I discovered something else interesting: the API docs say that you can apply annotations to packages and local variables, but the classfile docs give no indication that this is actually supported.

There are a couple of things that I take from this experience. The first is that it's another piece of evidence that JDK 1.5 represented a changing of the guard at Sun. Annotations have often had a “tacked on” feel to me — right down to the @interface keyword (they broke backwards compatibility for enum, would it have been so bad to add annotation?). I'm sure there was a reason for not treating an annotation reference as just another class, but quite frankly I can't see it.

The other thing I learned is to beware testcases built around spot checks. If I had written my testcase to look for org.junit.Assert rather than org.junit.Test, I never would have found the issue — until it turned up when using the utility. But there are lots of cases where exhaustive checks aren't cost-effective. Including this one: should I write a test that verifies every possible annotation reference? I'll need to, if I want 100% coverage, but really: it's a tiny part of the overall project.

One that could have been far easier if the JVM team had cleanly integrated their changes, and followed the existing model. I suppose that's the real take-away: if you're evolving a design, don't simply tack on the changes.

Thursday, August 2, 2012

Taming Maven: A Local Repository Server

While Maven's automatic dependency retrieval is a great feature, it does have limitations. And one of the biggest of those limitations is in Maven's ability to access local projects. Or, really, any projects that aren't found in the Maven Central repository.

Returning to the archetypal IT department of my previous posts, it's really painful if, before you can start working on your own project, you first have to check-out and build multiple dependent projects. Worse is if those dependencies are under development, and you have to rebuild on a regular basis.

The first part of eliminating that pain is to run a local repository server. This server is a place to deploy your local builds, including any third-party software or local patches you've made to open source software. It can also act as a proxy for Maven Central and other external repositories, protecting you from Internet outages and keeping your POMs free of <repositories> entries.

You can create a Maven repository server using a bare-bones Apache web-server: all you need to do is make sure that its document root follows the repository format. However, there are better options: Nexus and Artifactory are both purpose-built servers for managing Maven artifacts, and both come in open-source variants that you can run for free. If you don't have a local machine, or don't want the hassle of administering it, Artifactory provides cloud hosting of your repository (for a fee). Sonatype doesn't go quite that far, instead providing a pre-built EC2 image (hopefully updated since that post).

Once you've got the repository server running, you need to configure Maven to access it. The simplest approach is to add your local server as a mirror for Maven Central, as described here. Note that you cannot simply add a <repositories> entry to your parent POM, as you need to deploy that POM to the repository server.

Now you face the question of how to deploy your builds. Both Nexus and Artifactory give you a web interface to manually upload artifacts, but it's far easier to use the Maven deploy goal to deploy directly from your build (using an HTTP connection to the server). Of course, that raises the issue of credentials: do you give each developer his/her own credentials (which are stored in $HOME/.m2/settings.xml), or use a single set of credentials for all?

I'm in favor of the latter: use one set of credentials, stored either in each user's personal settings file, or in the global settings file. While that may make some people cringe, the security risk is non-existent: the repository server is write-only, and it will control where you write. As long as you don't pass out the actual admin login, or use SCP to deploy, the worst a disgruntled ex-employee can do is upload new builds.

And even that minor risk can be eliminated if your developers never have to type “deploy” — and they'd be happier too. Instead, set up a continuous integration server that examines your source repository for changes and automatically builds and deploys the changed projects. At least for snapshot builds, this ensures that all of your developers will be using the latest codebase, without any manual intervention whatsoever.

Wednesday, August 1, 2012

Taming Maven: Dependency Management

Once you have a parent POM, you can add a <dependencyManagement> section. In my view, it's often more trouble than it's worth (in fact, this blog series started out as a single post suggesting properties were usually a better choice). In the worst case, a dependency management section can prevent your child builds from seeing new dependencies. There are, however, some cases where it is useful to prevent Maven from using unwanted transitive dependencies.

For those who haven't seen <dependencyManagement> in use, here's an abbreviated example:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-webmvc</artifactId>
            <version>${springframework.version}</version>
        </dependency>
    </dependencies>
</dependencyManagement>

With the dependency version specified in the parent's dependency management section, it can be omitted in the child's <dependencies>:

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-webmvc</artifactId>
    </dependency>
</dependencies>

If that was all there was to it, there would be very little reason to use a dependency management section, versus defining version properties in the parent — it might actually be a net increase in the size of your POMs. But dependency management goes a bit deeper: it will override the transitive dependencies associated with your direct dependencies. This bears further examination.

If you use the Spring framework, you know that it's broken into lots of pieces. You don't need to specify all of these as direct dependencies, because each component brings along transitive dependencies (I haven't verified this, but I think spring-webmvc is the only direct dependency you need for a Spring web-app).

Now consider the case where the parent POM has a dependency management section that lists all of the Spring components, and gives them version 3.0.4.RELEASE, while a child uses version 3.1.1.RELEASE, and just specifies spring-webmvc as a direct dependency. When Maven builds the child and retrieves transitive dependencies, it will ignore the 3.1.1.RELEASE implied by the direct dependency, and instead load the 3.0.4.RELEASE versions specified by the parent.

This is rarely a Good Thing. Sometimes it won't cause an actual bug: if you are using features of the new version that haven't changed since the old version, you have nothing to worry about. But more often, you'll get a NoSuchMethodError thrown at runtime. Or worse, the method is present but does something unexpected. These sorts of bugs can be incredibly painful to track down.

Version properties, of course, go a long way toward keeping these errors from occurring. But some projects will need to specify their own version properties, often because they're trying out some new functionality.

There is, however, one case where a dependency management section is useful: excluding transitive dependencies. Again using Spring as an example: it will use either commons-logging or SLF4J for its internal logging; at runtime, it figures out which is available. However, as of this writing (version 3.1.2.RELEASE), spring-core has a non-optional transitive dependency on commons-logging. Which means that your program will also have a transitive dependency on commons-logging — and if your program is a web-app, you'll find commons-logging in its deployed WAR whether you want it or not.

Perhaps some day the Spring developers will change the scope of this dependency to provided. Until then, if you don't want commons-logging you need to manually break the transitive dependency with an exclusion:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${springframework.version}</version>
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Without a dependency management section in the parent POM, you would have to repeat that exclusion in every project POM. Miss just one, and the transitive dependency appears. Move the exclusion into the parent's dependency management section, and it applies to all children. Of course, this locks down the version number; any child projects that need a different version must specify a direct dependency on spring-core.

Bottom line: don't use a <dependencyManagement> section unless you absolutely have to. And even then, keep it as small as you possibly can.