When I started working with Clojure, one of the challenges I faced was understanding exactly what Clojure was doing with my code. I was intimately familiar with how Java and the JVM work, and tried to slot Clojure into that mental model, but found too many cases where my model didn't quite represent reality. The docs weren't much help: they talked about the features of the language (including compilation), but didn't provide detail on what “automatically compiled to JVM bytecode on the fly” actually meant.
I think that detail is important, especially if you come to Clojure from Java or are running your Clojure code within a Java-centric framework. And I see enough questions on the Internet to realize that not a lot of people actually understand how Clojure works. So this post demonstrates a few key points that are the basis of my new mental model.
I'm using Clojure 1.8 for examples, but I believe that everything that I say is correct for versions as early as 1.6 and probably before.
Clojure is a scripting language
I'll start with some definitions:
- Compiled languages translate source code into an artifact that is then loaded, unchanged, into the runtime. Any decisions in the code rely on state maintained within the executing program.
- Scripting languages load the source code into the runtime, and execute that code as it is loading. Scripts may make decisions based on state that they manage, as well as global state that has been set by other scripts.
Clojure is very much a scripting language, even though it compiles its scripts into JVM bytecode. All Clojure source files are processed one expression at a time: the reader reads characters from the source until it finds a valid expression (a list of symbols delimited by balanced parentheses), expands any macros, compiles the result to bytecode, then executes that bytecode.
It doesn't matter whether the source code is entered by hand into the REPL, or read from a file as part of a (require ...) form. It's processed a single top-level expression at a time.
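A small sketch of what "one top-level expression at a time" means in practice, using load-string to stand in for a source file. The names loaded-first, loaded-last, and undefined-fn are made up for this illustration. Each form is compiled and executed before the next one is read, so the first def takes effect even though the second form fails:

```clojure
;; Three top-level forms in one "file"; the second cannot be compiled
;; because undefined-fn (a made-up name) does not exist.
(def script "(def loaded-first 1) (undefined-fn) (def loaded-last 2)")

(def outcome
  (try
    (load-string script)
    :completed
    (catch Exception e
      ;; The failure surfaces only when the loader reaches the bad form.
      :failed)))

(println outcome)
(println (resolve 'loaded-first))   ; a var: the first form already ran
(println (resolve 'loaded-last))    ; nil: the third form was never reached
```

A compiler that processed the whole file as a unit would have rejected it before defining anything.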
Note my term: “top-level” expression. You won't find this term in the Clojure docs; they refer to “expressions” and “forms” more-or-less interchangeably, but don't differentiate between expressions that are nested within other expressions. The reason that I do will become apparent later on.
In my opinion, this form of evaluation is what gives macros their power (the oft-proclaimed homoiconicity of the language simply means that they're easier to write). A macro is able to use any information that has already been loaded into the runtime, including variables that have been created earlier by the same or a different script.
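As a minimal sketch of that point, here is a macro whose expansion depends on a var defined by an earlier top-level form. The names debug? and debug-log are hypothetical, chosen only for this example:

```clojure
;; debug? and debug-log are hypothetical names for this sketch.
(def debug? true)

(defmacro debug-log [msg]
  ;; This body runs at macro-expansion time, so it sees the value that the
  ;; earlier top-level form stored in debug?.
  (when debug?
    `(println "DEBUG:" ~msg)))

;; Expands to a println call because debug? was true when the macro ran.
(println (macroexpand '(debug-log "starting")))
```

If debug? had been defined as false, the same call would expand to nil and no logging code would be compiled at all.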
Top-level expressions turn into classes
Here's an interesting experiment: start up a Clojure REPL with the following command (using the correct path to the Clojure distribution JAR):
java -XX:+TraceClassLoading -jar clojure-1.8.0.jar
You'll see a long list of classloading messages flash by before you end up at the REPL. Now enter a simple expression, such as (+ 1 2); you'll see more messages, as the Clojure runtime loads the classes that it needs. Enter that same expression again, and you'll see something like this:
user=> (+ 1 2)
[Loaded user$eval3 from __JVM_DefineClass__]
3
This message indicates that Clojure compiled that expression to bytecode and then loaded the newly-created class to execute it. The class definition is still in memory, and you can inspect it. For example, you can look at its superclass (I've removed the now-distracting classloader messages):
user=> (.getSuperclass (Class/forName "user$eval3"))
clojure.lang.AFunction
AFunction is a class within the Clojure runtime; it is a subclass of AFn, which implements the invoke() method. With this knowledge, it's apparent that the evaluation of this simple expression has four steps:
- Parse the expression (including macro expansion, which doesn't apply to this case) and generate the bytes that correspond to a Java .class file.
- Pass these bytes to a classloader, getting a Java Class back.
- Instantiate this class.
- Call invoke on the resulting object.
You can, in fact, do all of this by hand, provided that you know the classname:
user=> (.invoke (.newInstance (Class/forName "user$eval3")))
3
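The generated eval class names change from session to session, so here is a variation that should work without knowing the name. It is only a sketch: it holds on to a function object (the name f is arbitrary) and asks it for its class, then repeats the instantiate-and-invoke steps by hand.

```clojure
;; A no-argument function compiles to a class with a no-argument constructor.
(def f (fn [] (+ 1 2)))

(println (class f))                             ; generated name varies
(println (instance? clojure.lang.AFunction f))  ; true
;; Create a fresh instance of the generated class and invoke it by hand.
(println (.invoke (.newInstance (class f))))    ; 3
```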
OK, so far so good. Now I want to show why earlier I called out a distinction between “top-level” expressions and nested expressions:
user=> (* 3 (+ 1 2))
[Loaded user$eval5 from __JVM_DefineClass__]
9
Here we have an expression that contains a nested expression. However, note that only one class was generated as a result. In theory, every expression could turn into its own class. Clojure takes a more pragmatic approach, which is a good thing for our memory footprint.
Variables are wrapped in objects
To outward appearances, a Clojure variable is similar to a final Java variable: you assign it (once) with def, and retrieve its value simply by inserting the variable in an expression:
user=> (def x 10)
#'user/x
user=> (class x)
java.lang.Long
user=> (+ x 2)
12
A hint of the truth can be seen if you attempt to use an unbound var in an expression:
user=> (def y)
#'user/y
user=> (+ 2 y)
ClassCastException clojure.lang.Var$Unbound cannot be cast to java.lang.Number  clojure.lang.Numbers.add (Numbers.java:128)
In fact, variables are instances of clojure.lang.Var, which provides functions to get and set the variable's value. When you reference a variable within an expression, that reference translates into a method call that retrieves the actual value.
This allows a great deal of flexibility, including the ability to redefine variables. Application code can do this on a per-thread basis using binding and set! (for vars declared dynamic), or temporarily for the duration of a block of code using with-redefs (which changes the value seen by all threads until the block exits). The Clojure runtime does much more, such as redefining all variables when you reload a namespace.
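The following sketch pulls those pieces together. The names *level* and greet are invented for the example; everything else is from clojure.core.

```clojure
(def x 10)
(println (class #'x))   ; clojure.lang.Var
(println (deref #'x))   ; 10, the same lookup that a bare x performs

;; binding and set! only work on vars declared ^:dynamic.
(def ^:dynamic *level* 1)
(println (binding [*level* 2]
           (set! *level* 3)
           *level*))     ; 3, visible only on this thread
(println *level*)        ; 1, the root value is untouched

;; with-redefs swaps the root value and restores it afterwards.
(defn greet [] "hello")
(println (with-redefs [greet (fn [] "redefined")]
           (greet)))     ; redefined
(println (greet))        ; hello
```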
A namespace is not a class
For someone coming from a Java background, this is perhaps the hardest thing to grasp. A namespace definition certainly looks like a class definition: you have a dot-delimited namespace identifier, which corresponds to the path where you save the source code. And when you invoke a function from a namespace, you use the same syntax that you would to invoke a static method from a Java class.
The first hint that the two aren't equivalent is that the ns macro doesn't enclose the definitions within the namespace. Another is that you can switch between namespaces at will and add new definitions to each:
user=> (ns foo)
nil
foo=> (def x 123)
#'foo/x
foo=> (ns bar)
nil
bar=> (def x 456)
#'bar/x
bar=> (ns foo)
nil
foo=> (def y 987)
#'foo/y
You could take the above code snippets, save them in an arbitrary file, and then use the load-file function to execute that file as a script. In fact, you could write your entire application, with namespaces, as a single script. But most (sane) people don't do that. Instead, they create one source file per namespace, store that file in a directory derived from the namespace name, and use the require function to load it (or more often, a :require directive in some other ns declaration).
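To make the "not a class" point concrete, here is a sketch that builds a namespace by hand, without ns and without a source file. The namespace name demo.foo is hypothetical. A namespace turns out to be an ordinary runtime object that maps symbols to vars, and no JVM class of that name exists:

```clojure
;; demo.foo is a hypothetical namespace name used only for this sketch.
(def demo-ns (create-ns 'demo.foo))
(intern demo-ns 'x 123)

(println (class demo-ns))                ; clojure.lang.Namespace
(println (keys (ns-interns 'demo.foo)))  ; (x)
(println @(ns-resolve 'demo.foo 'x))     ; 123

;; No JVM class named demo.foo exists.
(println (try
           (Class/forName "demo.foo")
           (catch ClassNotFoundException e :no-such-class)))
```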
Loading code from the classpath: require and load
The :require directive is another point of confusion for a Java developer starting Clojure. It certainly looks like the import statement that we already know, especially when it's used in an ns invocation:
(ns example.main (:require [example.utils :as utils]))
In reality, :require is almost, but not quite, entirely unlike import. The Java compiler uses import to load definitions from an already-compiled class so that they can be referenced by the class that's currently being compiled. On a superficial level, the Clojure runtime does the same thing when it sees :require, but it does this by loading (and compiling) the source code for that namespace.
OK, there are some caveats to that statement. First is that require only loads a namespace once, unless you specify the :reload option. So if the required namespace has already been loaded, it won't be loaded again. And if the namespace has already been compiled, and the source file is older than the compiled files, then the runtime loads the already compiled form. But still, there's a lot of stuff happening as the result of a seemingly simple directive.
So, let's dig into the behavior of require, along with its step-brother load. Earlier I wrote about using load-file to load an arbitrary file into the REPL. Here's that file, followed by the command to load and run it:
(ns foo)
(def x 123)
(ns bar)
(def x 456)
(ns user)
(do (println "myscript!")
    (+ foo/x bar/x))
user=> (load-file "src/example/myscript.clj")
myscript!
579
When you load the file, it creates definitions within the two namespaces, then invokes an expression to add them. After loading the file, you can access those variables from the REPL:
user=> (* foo/x bar/x)
56088
The load function is similar, but loads files relative to the classpath. It also assumes a .clj extension. I'm using Leiningen, so my classpath is everything under src; therefore, I can load the same file like so:
user=> (load "example/myscript")
myscript!
nil
Wait a second, what happened to the expression at the end of the script? It was still evaluated (the println executed), but the result was discarded and load returned nil.
Now let's try loading this same script with require:
user=> (require 'example.myscript :reload)
myscript!
nil
Different syntax, same result. The two variables are defined in their respective namespaces, and the stand-alone expression was evaluated. So what's the difference?
The first difference is that require gives you a bunch of options. For example, you can use :as to create a short alias for the namespace, so that you don't have to reference its vars with fully-qualified names. The way that the runtime uses these flags is probably worthy of a post of its own.
Another difference is that require is a little smarter about loading scripts: it only loads (and compiles) a script if it hasn't already done so, unless, of course, you use the :reload or :reload-all options, like I did here. Omitting that option, we see that a second require doesn't invoke the println.
user=> (require 'example.myscript)
nil
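Both differences can be observed from the runtime itself. This sketch uses clojure.set, which ships with Clojure, so that it doesn't depend on the example project; the alias cset is an arbitrary choice.

```clojure
;; clojure.set ships with Clojure; the alias cset is an arbitrary choice.
(require '[clojure.set :as cset])

;; The alias is recorded in the current namespace...
(println (get (ns-aliases *ns*) 'cset))
;; ...and the library is recorded as loaded, which is why a second plain
;; require does nothing.
(println (contains? (loaded-libs) 'clojure.set))   ; true
(println (cset/union #{1} #{2}))
```

Note that the cset/union form only compiles because the require form before it has already been executed, which is the one-expression-at-a-time behavior again.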
Compiling your code (or, :gen-class doesn't do what you might think)
As you've seen above, the Clojure runtime normally compiles your code when it's loaded, producing the bytes of a .class file but not writing them to the filesystem. However, there are times that you want a real, on-disk class: for example, so that you can invoke that class from Java (note that you'll still need the Clojure JAR on your classpath), or so that you can reduce startup time for a Clojure application by avoiding load-time compilation (although I think this is probably premature optimization).
The compile function turns Clojure scripts into classes:
user=> (compile 'example.foo)
example.foo
That was simple enough. Note, however, that I was running in lein repl, which sets the *compile-path* runtime global to a directory that it knows exists. If you try to execute this function from the clojure.main REPL, it will fail unless you create the directory classes.
Here's the example file that I compiled:
(ns example.foo)
(def x 123)
(defn what [] "I'm compiled!")
(defn add2 [x] (+ 2 x))
And here are the classes that it produced:
-rw-rw-r-- 1 kgregory kgregory 3008 May 7 09:29 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r-- 1 kgregory kgregory 683 May 7 09:29 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r-- 1 kgregory kgregory 1320 May 7 09:29 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r-- 1 kgregory kgregory 1503 May 7 09:29 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r-- 1 kgregory kgregory 513 May 7 09:29 target/base+system+user+dev/classes/example/foo$what.class
If you're a bytecode geek like me, you'll of course run javap -c on those files to see what they contain (especially fn__1194, which doesn't appear anywhere in the source!). Have at it. For everyone else, here are the two things I think are important:
- Every function turns into its own class. If you've been reading along, you already knew that.
- The foo__init class is responsible for pulling all of the other classes into memory, creating instances of those classes, and assigning them to vars in the namespace.
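The first point can also be checked at the REPL without compiling to disk. In this sketch, make-adder is a hypothetical function that returns another function: the defn and the inner fn each get a class, while the nested arithmetic inside the fn does not.

```clojure
;; make-adder is a hypothetical function that returns another function.
(defn make-adder [n]
  (fn [x] (* 3 (+ n x))))   ; the nested arithmetic gets no class of its own

(println (class make-adder))       ; the class for the defn
(println (class (make-adder 1)))   ; a second class, for the inner fn
;; Two closures created by the same fn expression share one class.
(println (= (class (make-adder 1)) (class (make-adder 2))))   ; true
```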
If you use Leiningen, you've probably noted that it adds a :gen-class directive to the main class of any “app” project that it creates. If you skim the docs for gen-class you might think this will produce a Java class that exposes all of your namespace's functions. Let's see what really happens, by adding a :gen-class directive to the example script:
(ns example.foo (:gen-class))
When you compile, the list of classes now looks like this:
-rw-rw-r-- 1 kgregory kgregory 1823 May 7 09:31 target/base+system+user+dev/classes/example/foo.class
-rw-rw-r-- 1 kgregory kgregory 3009 May 7 09:31 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r-- 1 kgregory kgregory 683 May 7 09:31 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r-- 1 kgregory kgregory 1320 May 7 09:31 target/base+system+user+dev/classes/example/foo$fn__1194.class
-rw-rw-r-- 1 kgregory kgregory 1505 May 7 09:31 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r-- 1 kgregory kgregory 513 May 7 09:31 target/base+system+user+dev/classes/example/foo$what.class
Everything's the same, except that we now have foo.class. Looking at this class with javap, we find that it contains overrides of the basic Object methods: equals(), hashCode(), toString(), and clone(). It also creates a Java-standard main() function, which looks for the Clojure-standard -main (which doesn't exist for our script, so will fail if invoked). But it doesn't expose any of your functions.
Reading the doc more closely, if you want to use :gen-class to expose your functions, you need to specify the exposed functions in the directive itself, and use a specified naming format that separates the Clojure method implementations from the names exposed to Java.
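As a rough sketch of what that looks like (not something I compiled for this post), the directive below asks for a class with one static method. The class name example.Foo is a hypothetical choice, and the example relies on the default implementation prefix, which is a single dash:

```clojure
(ns example.foo
  (:gen-class
    :name example.Foo                            ; hypothetical class name
    :methods [^{:static true} [add2 [long] long]]))

;; With the default prefix "-", the static method add2 is implemented
;; by the function -add2.
(defn -add2 [x]
  (+ 2 x))
```

After ahead-of-time compilation, Java code should be able to call example.Foo.add2(40). When the file is merely loaded rather than compiled, the :gen-class directive does nothing and -add2 is just another function in the namespace.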
Pitfalls of compiling your code
Let's change the namespace declaration on foo, so that it requires bar:
(ns example.foo (:require [example.bar :as bar]))
This results in the expected classes for foo, but also several for bar (which doesn't define any functions):
-rw-rw-r-- 1 kgregory kgregory 2219 May 7 09:39 target/base+system+user+dev/classes/example/bar__init.class
-rw-rw-r-- 1 kgregory kgregory 1320 May 7 09:39 target/base+system+user+dev/classes/example/bar$fn__1196.class
-rw-rw-r-- 1 kgregory kgregory 1503 May 7 09:39 target/base+system+user+dev/classes/example/bar$loading__5569__auto____1194.class
-rw-rw-r-- 1 kgregory kgregory 3009 May 7 09:39 target/base+system+user+dev/classes/example/foo__init.class
-rw-rw-r-- 1 kgregory kgregory 683 May 7 09:39 target/base+system+user+dev/classes/example/foo$add2.class
-rw-rw-r-- 1 kgregory kgregory 1320 May 7 09:39 target/base+system+user+dev/classes/example/foo$fn__1198.class
-rw-rw-r-- 1 kgregory kgregory 1891 May 7 09:39 target/base+system+user+dev/classes/example/foo$loading__5569__auto____1192.class
-rw-rw-r-- 1 kgregory kgregory 513 May 7 09:39 target/base+system+user+dev/classes/example/foo$what.class
This makes perfect sense: if you want to ahead-of-time compile one namespace, you probably don't want its dependencies to be compiled at runtime. But recognize that the tree of dependencies can run very deep, and will include any third-party libraries that you use (poking around Clojars, there aren't a lot of libraries that come precompiled).
There is one other detail of compilation that may cause concern: require loads a namespace from the file(s) with the latest modification time. If you have both source and compiled classes on your classpath, this could mean that you're not loading what you think you are. Fortunately, in practice this primarily affects work in the REPL: Leiningen removes the target directory as part of the jar and uberjar tasks, so you won't produce an artifact with a source/class mismatch.
Wrap-up
This has been a long post, so I'll wrap up with what I consider the main points.
- Startup times for Clojure applications will be longer than for normal Java applications, because of the additional step of compiling and evaluating each expression. This isn't going to be an issue if you've written a long-running server in Clojure, but it does add significant overhead to short-running programs (so Clojure is even less appropriate for small command-line utilities than Java).
- Pay attention to the Clojure version used by your dependencies, because they might rely on functions from a newer version than your application; this problem manifests itself as an “Unable to resolve symbol” runtime error. While this is a general issue with transitive dependencies, I've found that third-party libraries tend to be at the latest version, while corporate applications tend to use whatever was current when they were begun.
- As far as I can tell, the Clojure runtime doesn't ever unload the classes that it creates. This means that — on pre-1.8 JVMs — you can fill the permgen space. Not a big problem in development, but be careful if you use a REPL when connected to a production instance.
- Every script that you load adds to the global state of the runtime. Be aware that the behavior of your scripts may be dependent on the order that they're loaded.