Monday, May 23, 2016

How (and When) Clojure Compiles Your Code

When I started working with Clojure, one of the challenges I faced was understanding exactly what Clojure was doing with my code. I was intimately familiar with how Java and the JVM work, and tried to slot Clojure into that mental model, but found too many cases where my model didn't quite represent reality. The docs weren't much help: they talked about the features of the language (including compilation), but didn't provide detail on what “automatically compiled to JVM bytecode on the fly” actually meant.

I think that detail is important, especially if you come to Clojure from Java or are running your Clojure code within a Java-centric framework. And I see enough questions on the Internet to realize that not a lot of people actually understand how Clojure works. So this post demonstrates a few key points that are the basis of my new mental model.

I'm using Clojure 1.8 for examples, but I believe that everything that I say is correct for versions as early as 1.6 and probably before.

Clojure is a scripting language

I'll start with some definitions:
  • Compiled languages translate source code into an artifact that is then loaded, unchanged, into the runtime. Any decisions in the code rely on state maintained within the executing program.
  • Scripting languages load the source code into the runtime, and execute that code as it is loading. Scripts may make decisions based on state that they manage, as well as global state that has been set by other scripts.

Clojure is very much a scripting language, even though it compiles its scripts into JVM bytecode. All Clojure source files are processed one expression at a time: the reader reads characters from the source until it finds a complete expression (typically a list delimited by balanced parentheses), expands any macros, compiles the result to bytecode, then executes that bytecode.

It doesn't matter whether the source code is entered by hand into the REPL, or read from a file as part of a (require ...) form. It's processed a single top-level expression at a time.
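To make that concrete, here's a minimal sketch of the same read-then-evaluate loop, written in Clojure itself. The eval-all function is my own illustrative name, not part of the runtime, and the real loader does more (tracking line numbers and source paths, for example). The point is that the last form can only be compiled because the forms before it have already been executed.

```clojure
(import '(java.io PushbackReader StringReader))

;; Hypothetical helper: reads one top-level form at a time from a string,
;; evaluates it, and returns the value of the last form.
(defn eval-all [source]
  (let [rdr (PushbackReader. (StringReader. source))
        eof (Object.)]                      ; unique marker for end of input
    (loop [result nil]
      (let [form (read rdr false eof)]      ; read exactly one form
        (if (identical? form eof)
          result
          (recur (eval form)))))))          ; compile and execute it, then continue

;; `a` exists by the time the second form is compiled,
;; and `b` exists by the time the third one is.
(eval-all "(def a 1) (def b (+ a 1)) (* a b)")
```

If the whole string were compiled before anything ran, the reference to a in the second form would be an unresolved symbol.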

Note my term: “top-level” expression. You won't find this term in the Clojure docs; they refer to “expressions” and “forms” more-or-less interchangeably, but don't distinguish a top-level expression from one that is nested within another expression. The reason that I do will become apparent later on.

In my opinion, this form of evaluation is what gives macros their power (the oft-proclaimed homoiconicity of the language simply means that they're easier to write). A macro is able to use any information that has already been loaded into the runtime, including variables that have been created earlier by the same or a different script.
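Here's a small sketch of a macro that depends on state created by an earlier expression. All of the names are made up for illustration; evaluate the forms in order, either at the REPL or from a file.

```clojure
(def greeting "hello")            ; executed before the next form is even read

(defmacro greeting-length []
  ;; This body runs at macro-expansion time. It can read `greeting`
  ;; because the def above has already been executed.
  (count greeting))

(defn len []
  (greeting-length))              ; expands to the literal 5 when this form is compiled

(def greeting "a much longer greeting")   ; redefining the var afterward...

(len)                             ; ...doesn't change the already-compiled function
```

The macro captured the value that existed when len was compiled, which is exactly the sort of load-order dependency that a compile-everything-first language wouldn't allow.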

Top-level expressions turn into classes

Here's an interesting experiment: start up a Clojure REPL with the following command (using the correct path to the Clojure distribution JAR):

java -XX:+TraceClassLoading -jar clojure-1.8.0.jar

You'll see a long list of classloading messages flash by before you end up at the REPL. Now enter a simple expression, such as (+ 1 2); you'll see more messages, as the Clojure runtime loads the classes that it needs. Enter that same expression again, and you'll see something like this:

user=> (+ 1 2)
[Loaded user$eval3 from __JVM_DefineClass__]
3

This message indicates that Clojure compiled that expression to bytecode and then loaded the newly-created class to execute it. The class definition is still in memory, and you can inspect it. For example, you can look at its superclass (I've removed the now-distracting classloader messages):

user=> (.getSuperclass (Class/forName "user$eval3"))
clojure.lang.AFunction

AFunction is a class within the Clojure runtime; it is a subclass of AFn, which implements the invoke() method. With this knowledge, it's apparent that the evaluation of this simple expression has four steps:

  1. Parse the expression (including macro expansion, which doesn't apply to this case) and generate the bytes that correspond to a Java .class file.
  2. Pass these bytes to a classloader, getting a Java Class back.
  3. Instantiate this class.
  4. Call invoke on the resulting object.

You can, in fact, do all of this by hand, provided that you know the classname:

user=> (.invoke (.newInstance (Class/forName "user$eval3")))
3
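You don't need to know a generated classname to see the same machinery. As a sketch (add-fn is just an illustrative name), any function value is an instance of one of these generated classes, and calling it is a call to invoke:

```clojure
(def add-fn (fn [a b] (+ a b)))

(.getName (class add-fn))               ; a generated name; the exact value depends on your namespace

(.getSuperclass (class add-fn))         ; clojure.lang.AFunction

;; The type hint avoids a reflective lookup; IFn declares invoke().
(.invoke ^clojure.lang.IFn add-fn 1 2)  ; same result as (add-fn 1 2)
```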

OK, so far so good. Now I want to show why earlier I called out a distinction between “top-level” expressions and nested expressions:

user=> (* 3 (+ 1 2))
[Loaded user$eval5 from __JVM_DefineClass__]
9

Here we have an expression that contains a nested expression. However, note that only one class was generated as a result. In theory, every expression could turn into its own class. Clojure takes a more pragmatic approach, which is a good thing for our memory footprint.

Variables are wrapped in objects

To outward appearances, a Clojure variable is similar to a final Java variable: you assign it (once) with def, and retrieve its value simply by inserting the variable in an expression:

user=> (def x 10)
#'user/x

user=> (class x)
java.lang.Long

user=> (+ x 2)
12

A hint of the truth can be seen if you attempt to use an unbound var in an expression:

user=> (def y)
#'user/y

user=> (+ 2 y)

ClassCastException clojure.lang.Var$Unbound cannot be cast to java.lang.Number  clojure.lang.Numbers.add (Numbers.java:128)

In fact, variables are instances of clojure.lang.Var, which provides functions to get and set the variable's value. When you reference a variable within an expression, that reference translates into a method call that retrieves the actual value.

This allows a great deal of flexibility, including the ability to redefine variables. Application code can do this on a per-thread basis using binding and set!, or within a call tree using with-redefs. The Clojure runtime does much more, such as redefining all variables when you reload a namespace.
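A short sketch of those pieces, using names that I made up for the example. Note that binding only works on vars that were declared ^:dynamic.

```clojure
(def x 10)

(def x-var (var x))         ; the Var object itself; #'x is shorthand for (var x)
(class x-var)               ; clojure.lang.Var
(deref x-var)               ; the value the var currently holds, also written @x-var

;; Per-thread rebinding requires a dynamic var.
(def ^:dynamic *level* 1)
(defn level [] *level*)

(binding [*level* 2]
  (level))                  ; sees 2 inside the binding form
(level)                     ; back to 1 outside of it

;; with-redefs temporarily replaces a var's root value, then restores it.
(defn answer [] 42)

(with-redefs [answer (fn [] 0)]
  (answer))                 ; calls the replacement
(answer)                    ; calls the original again
```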

A namespace is not a class

For someone coming from a Java background, this is perhaps the hardest thing to grasp. A namespace definition certainly looks like a class definition: you have a dot-delimited namespace identifier, which corresponds to the path where you save the source code. And when you invoke a function from a namespace, you use the same syntax that you would to invoke a static method from a Java class.

The first hint that the two aren't equivalent is that the ns macro doesn't enclose the definitions within the namespace. Another is that you can switch between namespaces at will and add new definitions to each:

user=> (ns foo)
nil
foo=> (def x 123)
#'foo/x
foo=> (ns bar)
nil
bar=> (def x 456)
#'bar/x
bar=> (ns foo)
nil
foo=> (def y 987)
#'foo/y

You could take the above code snippets, save them in an arbitrary file, and then use the load-file function to execute that file as a script. In fact, you could write your entire application, with namespaces, as a single script.

But most (sane) people don't do that. Instead, they create one source file per namespace, store that file in a directory derived from the namespace name, and use the require function to load it (or more often, a :require directive in some other ns declaration).
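One way to convince yourself that a namespace is a runtime object rather than a class is to build one without ns or def at all. This is only a sketch, and demo.ns is an arbitrary name that I picked for it:

```clojure
(create-ns 'demo.ns)              ; make an empty namespace at runtime
(intern 'demo.ns 'x 123)          ; add a var to it, much as (def x 123) would

(class (find-ns 'demo.ns))        ; clojure.lang.Namespace, an ordinary object
(keys (ns-interns 'demo.ns))      ; the symbols that it currently maps to vars

(deref (resolve 'demo.ns/x))      ; look the var up by name and get its value
```

A namespace is essentially a mutable mapping from symbols to vars, which is why you can keep adding definitions to it from anywhere.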

Loading code from the classpath: require and load

The :require directive is another point of confusion for a Java developer starting Clojure. It certainly looks like the import statement that we already know, especially when it's used in an ns invocation:

(ns example.main
    (:require [example.utils :as utils]))

In reality, :require is almost, but not quite, entirely unlike import. The Java compiler uses import to load definitions from an already-compiled class so that they can be referenced by the class that's currently being compiled. On a superficial level, the Clojure runtime does the same thing when it sees :require, but it does this by loading (and compiling) the source code for that namespace.

OK, there are some caveats to that statement. First is that require only loads a namespace once, unless you specify the :reload option. So if the required namespace has already been loaded, it won't be loaded again. And if the namespace has already been compiled, and the source file is older than the compiled files, then the runtime loads the already compiled form. But still, there's a lot of stuff happening as the result of a seemingly simple directive.
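You can see that :require is a runtime function call, rather than a compile-time declaration, by expanding the ns macro. This sketch just expands the form; it doesn't load anything, so example.utils doesn't need to exist. The exact expansion differs between Clojure versions, so I only check that it contains a call to the require function.

```clojure
(def expansion
  (macroexpand-1 '(ns example.main
                    (:require [example.utils :as utils]))))

;; Somewhere in the expansion is an ordinary call to clojure.core/require.
(some #{'clojure.core/require} (flatten expansion))
```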

So, let's dig into the behavior of require, along with its step-brother load. Earlier I wrote about using load-file to load an arbitrary file into the REPL. Here's that file, followed by the command to load and run it:

(ns foo)
(def x 123)

(ns bar)
(def x 456)

(ns user)
(do (println "myscript!") (+ foo/x bar/x))

user=> (load-file "src/example/myscript.clj")
myscript!
579

When you load the file, it creates definitions within the two namespaces, then invokes an expression to add them. After loading the file, you can access those variables from the REPL:

user=> (* foo/x bar/x)
56088

The load function is similar, but loads files relative to the classpath. It also assumes a .clj extension. I'm using Leiningen, so my classpath is everything under src; therefore, I can load the same file like so:

user=> (load "example/myscript")
myscript!
nil

Wait a second, what happened to the expression at the end of the script? It was still evaluated — the println executed — but the result was discarded and load returned nil.

Now let's try loading this same script with require:

user=> (require 'example.myscript :reload)
myscript!
nil

Different syntax, same result. The two variables are defined in their respective namespaces, and the stand-alone expression was evaluated. So what's the difference?

The first difference is that require gives you a bunch of options. For example, you can use :as to create a short alias for the namespace, so that you don't have to reference its vars with fully-qualified names. The way that the runtime uses these flags is probably worthy of a post of its own.

Another difference is that require is a little smarter about loading scripts: it only loads (and compiles) a script if it hasn't already done so — unless, of course, you use the :reload or :reload-all options, like I did here. Omitting that option, we see that a second require doesn't invoke the println.

user=> (require 'example.myscript)
nil
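The runtime keeps track of what it has already loaded, and you can inspect that list with the loaded-libs function. As a sketch, using clojure.set because it ships with Clojure:

```clojure
(require 'clojure.set)                   ; loads the namespace if it isn't already loaded

(contains? (loaded-libs) 'clojure.set)   ; it's now in the set of loaded libs

(require 'clojure.set)                   ; already loaded, so this does nothing
(require 'clojure.set :reload)           ; forces the namespace to be loaded again
```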

Compiling your code (or, :gen-class doesn't do what you might think)

As you've seen above, the Clojure runtime normally compiles your code when it's loaded, producing the bytes of a .class file but not writing them to the filesystem. However, there are times that you want a real, on-disk class: for example, so that you can invoke that class from Java (note that you'll still need the Clojure JAR on your classpath), or so that you can reduce startup time for a Clojure application by avoiding load-time compilation (although I think this is probably premature optimization).

The compile function turns Clojure scripts into classes:

user=> (compile 'example.foo)
example.foo

That was simple enough. Note, however, that I was running in lein repl, which sets the *compile-path* runtime global to a directory that it knows exists. If you try to execute this function from the clojure.main REPL, it will fail unless you first create a directory named classes, which is the default value of *compile-path*.

Here's the example file that I compiled:

(ns example.foo)

(def x 123)

(defn what [] "I'm compiled!")

(defn add2 [x] (+ 2 x))

And here are the classes that it produced:

-rw-rw-r--   1 kgregory kgregory     3008 May  7 09:29 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:29 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:29 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r--   1 kgregory kgregory     1503 May  7 09:29 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:29 target/base+system+user+dev/classes/example/foo$what.class

If you're a bytecode geek like me, you'll of course run javap -c on those files to see what they contain (especially fn__1194, which doesn't appear anywhere in the source!). Have at it. For everyone else, here are the two things I think are important:

  • Every function turns into its own class. If you've been reading along, you already knew that.
  • The foo__init class is responsible for pulling all of the other classes into memory, creating instances of those classes, and assigning them to vars in the namespace.
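You can see the same naming scheme without compiling to disk. In this sketch I define the function in whatever namespace is current, so the prefix of the classname will vary (at the REPL it would be user):

```clojure
(defn add2 [x] (+ 2 x))

(.getName (class add2))                      ; the namespace name, then $add2
(.endsWith (.getName (class add2)) "$add2")  ; true regardless of namespace

(.getSuperclass (class add2))                ; clojure.lang.AFunction, as before
```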

If you use Leiningen, you've probably noted that it adds a :gen-class directive to the main class of any “app” project that it creates. If you skim the docs for gen-class you might think this will produce a Java class that exposes all of your namespace's functions. Let's see what really happens, by adding a :gen-class directive to the example script:

(ns example.foo
  (:gen-class))

When you compile, the list of classes now looks like this:

-rw-rw-r--   1 kgregory kgregory     1823 May  7 09:31 target/base+system+user+dev/classes/example/foo.class
-rw-rw-r--   1 kgregory kgregory     3009 May  7 09:31 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:31 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:31 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r--   1 kgregory kgregory     1505 May  7 09:31 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:31 target/base+system+user+dev/classes/example/foo$what.class

Everything's the same, except that we now have foo.class. Looking at this class with javap, we find that it contains overrides of the basic Object methods: equals(), hashCode(), toString(), and clone(). It also creates a Java-standard main() function, which looks for the Clojure-standard -main (which doesn't exist for our script, so will fail if invoked). But it doesn't expose any of your functions.

Reading the doc more closely, if you want to use :gen-class to expose your functions, you need to specify the exposed functions in the directive itself — and use a specified naming format that separates the Clojure method implementations from the names exposed to Java.
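As a sketch of what that looks like, here's the example namespace exposing add2 as a static method. I'm assuming the default "-" prefix, which means that the Java-visible method add2 is implemented by a Clojure function named -add2. The directive only has an effect when the namespace is compiled ahead of time; when the file is simply loaded, it's ignored.

```clojure
(ns example.foo
  (:gen-class
    ;; Each entry is [method-name [parameter-types] return-type].
    :methods [^:static [add2 [long] long]]))

;; The "-" prefix marks this as the implementation of the exposed add2 method.
;; A static method has no `this` argument.
(defn -add2 [x] (+ 2 x))
```

After compiling, I'd expect Java code to be able to call example.foo.add2(3), while Clojure code continues to call the function as (example.foo/-add2 3).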

Pitfalls of compiling your code

Let's change the namespace declaration on foo, so that it requires bar:

(ns example.foo
  (:require [example.bar :as bar]))

This results in the expected classes for foo, but also several for bar (which doesn't define any functions):

-rw-rw-r--   1 kgregory kgregory     2219 May  7 09:39 target/base+system+user+dev/classes/example/bar__init.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:39 target/base+system+user+dev/classes/example/bar$fn__1196.class
-rw-rw-r--   1 kgregory kgregory     1503 May  7 09:39 target/base+system+user+dev/classes/example/bar$loading__5569__auto____1194.class
-rw-rw-r--   1 kgregory kgregory     3009 May  7 09:39 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r--   1 kgregory kgregory      683 May  7 09:39 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r--   1 kgregory kgregory     1320 May  7 09:39 target/base+system+user+dev/classes/example/foo$fn__1198.class
-rw-rw-r--   1 kgregory kgregory     1891 May  7 09:39 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r--   1 kgregory kgregory      513 May  7 09:39 target/base+system+user+dev/classes/example/foo$what.class

This makes perfect sense: if you want to ahead-of-time compile one namespace, you probably don't want its dependencies to be compiled at runtime. But recognize that the tree of dependencies can run very deep, and will include any third-party libraries that you use (poking around Clojars, there aren't a lot of libraries that come precompiled).

There is one other detail of compilation that may cause concern: require loads a namespace from the file(s) with the latest modification time. If you have both source and compiled classes on your classpath, this could mean that you're not loading what you think you are. Fortunately, in practice this primarily affects work in the REPL: Leiningen removes the target directory as part of the jar and uberjar tasks, so you won't produce an artifact with a source/class mismatch.

Wrap-up

This has been a long post, so I'll wrap up with what I consider the main points.

  • Startup times for Clojure applications will be longer than for normal Java applications, because of the additional step of compiling and evaluating each expression. This isn't going to be an issue if you've written a long-running server in Clojure, but it does add significant overhead to short-running programs (so Clojure is even less appropriate for small command-line utilities than Java).
  • Pay attention to the Clojure version used by your dependencies, because they might rely on functions from a newer version than your application; this problem manifests itself as an “Unable to resolve symbol” runtime error. While this is a general issue with transitive dependencies, I've found that third-party libraries tend to be at the latest version, while corporate applications tend to use whatever was current when they were begun.
  • As far as I can tell, the Clojure runtime doesn't ever unload the classes that it creates. This means that, on JVMs older than Java 8 (1.8), you can fill the permgen space. Not a big problem in development, but be careful if you use a REPL when connected to a production instance.
  • Every script that you load adds to the global state of the runtime. Be aware that the behavior of your scripts may be dependent on the order that they're loaded.
