Saturday, April 30, 2016

Taming Maven: Transitive Dependency Pitfalls

Like much of Maven, transitive dependencies are a huge benefit that brings with it the potential for pain. And while I titled this piece “Taming Maven,” the same issues apply to any build tool that uses the Maven dependency mechanism, including Gradle and Leiningen.

Let's start with definitions: direct dependencies are those listed in the <dependencies> section of your POM. Transitive dependencies are the dependencies needed to support those direct dependencies, recursively. You can display the entire dependency tree with mvn dependency:tree; here's the output for a simple Spring servlet:

[INFO] com.kdgregory.pathfinder:pathfinder-testdata-spring-dispatch-1:war:1.0-SNAPSHOT
[INFO] +- javax.servlet:servlet-api:jar:2.4:provided
[INFO] +- javax.servlet:jstl:jar:1.1.1:compile
[INFO] +- taglibs:standard:jar:1.1.1:compile
[INFO] +- org.springframework:spring-core:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-asm:jar:3.1.1.RELEASE:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.springframework:spring-beans:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-context:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-aop:jar:3.1.1.RELEASE:compile
[INFO] |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  \- org.springframework:spring-expression:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-webmvc:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-context-support:jar:3.1.1.RELEASE:compile
[INFO] |  \- org.springframework:spring-web:jar:3.1.1.RELEASE:compile
[INFO] \- junit:junit:jar:4.10:test
[INFO]    \- org.hamcrest:hamcrest-core:jar:1.1:test

The direct dependencies of this project include servlet-api version 2.4 and spring-core version 3.1.1.RELEASE. The latter has its own dependencies on spring-asm and commons-logging, which makes both of them transitive dependencies of this project.

In a real-world application, the dependency tree may include hundreds of JAR files with many levels of transitive dependencies. And it's not a simple tree, but a directed acyclic graph: many JARs will share the same dependencies, although possibly with differing versions.

So, how does this cause you pain?

The first (and easiest to resolve) pain is that you might end up with dependencies that you don't want. For example, commons-logging. I don't subscribe to the fear that commons-logging causes memory leaks, but I also use SLF4J, and don't want two logging facades in my application. Fortunately, it's (relatively) easy to exclude individual dependencies, as I described in a previous “Taming Maven” post.
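Before you can exclude an unwanted JAR, you have to know which direct dependency drags it in. As a rough sketch, the shell function below reads the output of mvn dependency:tree and prints the chain that leads to any matching artifact. The name dependency_path is made up for this example, and the function assumes the default text layout: an “[INFO] ” prefix and three characters of indentation per tree level.

```shell
# Hypothetical helper: reads `mvn dependency:tree` output on stdin and, for
# every dependency whose coordinates contain the given text, prints the chain
# of dependencies leading to it (direct dependency first).
dependency_path() {
    awk -v target="$1" '
        {
            sub(/^\[INFO\] /, "")             # drop the log prefix
            pos = match($0, /[A-Za-z0-9]/)    # first character of the coordinates
            if (pos == 0) next
            depth = int((pos - 1) / 3)        # assumes 3 characters of indent per level
            stack[depth] = substr($0, pos)
            if (depth >= 1 && index(stack[depth], target) > 0) {
                path = stack[1]
                for (i = 2; i <= depth; i++) path = path " -> " stack[i]
                print path
            }
        }'
}
```

Run against the tree shown earlier, mvn dependency:tree | dependency_path commons-logging prints the spring-core coordinates, an arrow, and the commons-logging coordinates, which tells you where the exclusion belongs.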

The second pain point, harder to resolve, is what, exactly, is the classpath?

A project's dependency tree is the project's classpath. Actually, “the” classpath is a bit misleading: there are separate classpaths for build, test, and runtime, depending on the <scope> specifications in the POM(s). Each plugin can define its own classpath, and some provide a goal that lets you see the classpath they use; mvn dependency:build-classpath will show you the classpath used to compile your code.

This tool lists dependencies in alphabetical order. But if you look at a generated WAR, they're in a different order (which seems to bear no relationship to how they're listed in the POM). If you're using a “shaded” JAR, you'll get a different order. Worse, since a shaded JAR flattens all classes into a single tree, you might end up with one JAR that overwrites classes from another (for example, SLF4J provides the jcl-over-slf4j artifact, which contains re-implemented classes from commons-logging).
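If you want to compare orderings yourself, it helps to see the classpath one entry per line. The function below is a minimal sketch: classpath_entries is an invented name, and it assumes a colon-separated classpath (as on Linux or OS X) arriving on standard input.

```shell
# Hypothetical helper: splits a colon-separated classpath read from stdin
# into one entry per line, keeping only the file name of each entry and
# preserving the original order.
classpath_entries() {
    tr ':' '\n' | sed -e 's|.*/||'
}
```

One way to feed it is to have the plugin write the classpath to a file (cp.txt here is an arbitrary name): mvn -q dependency:build-classpath -Dmdep.outputFile=cp.txt, followed by classpath_entries < cp.txt.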

Compounding classpath ordering, there is the possibility of version conflicts. This isn't an issue for the simple example above, but for real-world applications that have deep dependency trees, there are bound to be cases where dependencies-of-dependencies have different versions. For example, the Jenkins CI server has four different versions of commons-collections in its dependency tree, ranging from 2.1 to 3.2.1 — along with 20 other version conflicts.
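To get a feel for how many conflicts your own project has, you can post-process the dependency tree. The following is a sketch rather than a polished tool: conflicting_versions is an invented name, it only looks at artifacts with jar packaging, and it will misread coordinates that include a classifier. Depending on the version of the dependency plugin, you may also need to run the tree with -Dverbose before omitted duplicates show up in the output at all; treat that as an assumption to check against your own build.

```shell
# Hypothetical helper: reads `mvn dependency:tree` output on stdin and prints
# each groupId:artifactId that appears with more than one version, followed
# by the versions seen.
conflicting_versions() {
    grep -o '[A-Za-z0-9_.-]*:[A-Za-z0-9_.-]*:jar:[A-Za-z0-9_.-]*' |
    awk -F: '{ print $1 ":" $2 " " $4 }' |      # "group:artifact version"
    sort -u |
    awk '{ count[$1]++; versions[$1] = versions[$1] " " $2 }
         END { for (a in count) if (count[a] > 1) print a ":" versions[a] }' |
    sort
}
```

Usage is a simple pipe, for example mvn dependency:tree -Dverbose | conflicting_versions. Each line of output names one artifact along with every version that appeared somewhere in the tree.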

Maven has rules for resolving such conflicts. The only one that matters is that direct dependencies take precedence over transitive. Yes, there are other rules regarding depth of transitive dependencies and ordering, but those are only useful for discovering why you're getting the wrong version; they won't help you fix the problem.

The only sure fix is to lock down the version, either via a direct dependency, or a dependency-management section. This, however, carries its own risk: if one of your transitive dependencies requires a newer version than the one you've chosen, you'll have to update your POM. And, let's be honest, the whole point of transitive dependencies was to keep you from explicitly tracking every dependency that your app needs, so this solution is decidedly sub-optimal.

A final problem — and the one that I consider the most insidious — is directly relying on a transitive dependency.

As an example, I'm going to use the excellent XML manipulation library known as Practical XML. This library makes use of the equally excellent utility library KDGCommons. Having discovered the former, you might also start using the latter — deciding, for example, that its implementation of parallel map is far superior to others.

However, if you never updated your POM with a direct reference to KDGCommons, then when the author of Practical XML decides that he can use functions from Jakarta commons-lang rather than KDGCommons, you've got a problem. Specifically, your build breaks, because the transitive dependency has disappeared.

You might think that this is an uncommon situation, but it was actually what prompted this post: a colleague changed one of his application's direct dependencies, and his build started failing. After comparing dependencies between the old and new versions, we discovered a transitive dependency that had disappeared. Adding it back as a direct dependency fixed the build.

To wrap up, here are the important take-aways:

  • Pay attention to transitive dependency versions: whenever you change your direct dependencies, you should run mvn dependency:tree to see what's changed with your transitives. Pay particular attention to transitives that are omitted due to version conflicts.
  • If your code calls it, it should be a direct dependency. Plugging another of my creations, the PomUtil dependency tool can help you discover those.
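Maven's own dependency plugin can also point at code that leans on a transitive dependency: the dependency:analyze goal reports “Used undeclared dependencies.” The function below is a sketch that pulls just that list out of the build log. The name used_undeclared is invented, and the parsing assumes the report is printed as [WARNING] lines under a “Used undeclared dependencies found” heading, which may differ between plugin versions.

```shell
# Hypothetical helper: reads `mvn dependency:analyze` output on stdin and
# prints the coordinates listed under "Used undeclared dependencies found".
used_undeclared() {
    awk '
        /Used undeclared dependencies found/       { in_section = 1; next }
        $1 != "[WARNING]" || /dependencies found/  { in_section = 0 }   # next heading or end of report
        in_section                                 { print $2 }'
}
```

A typical invocation would be mvn dependency:analyze | used_undeclared. Anything it prints is a candidate for promotion to a direct dependency in your POM.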

Saturday, April 23, 2016

Technical Debt

Like many software developers in the 21st century, I use the term “technical debt” in a negative way: it's the ever-accumulating cruft in your system that stands in the way of adding new features. As technical debt increases, the work takes ever longer, until you reach a point where forward progress ceases.

This view of technical debt equates it to a credit card: unless you pay your balance in full each month, you're charged interest. If you only make the minimum payment, that interest accrues and it will take you years to pay off the card. If you make the minimum payment and keep charging more, you may never get out of debt. Eventually, after maxing out several cards, you'll have to declare bankruptcy.

But that's a very puritanical view of debt, and it's not a view shared by everyone.

For a person with a business-school background, debt is a tool: if you can float a bond at 5% to build a factory that gives you a 10% boost in income, then you should do that (usually; there are other factors to consider, such as maintenance and depreciation). More important, you're not going to pay that bond off before it's due; doing so would negate the reasons that you issued it in the first place.

Which means that the term “technical debt” probably doesn't have the same connotations to your business users as it does to you. In fact, using that term may be dangerous to the long-term prospects of your project. If you say “we can release early but we'll add a lot of technical debt to do so,” that's a no-brainer decision: of course you'll take on the debt.

I think a better term is total cost of ownership (TCO): the amount you pay to implement features now, plus the amount you will pay to add new features in the future. For example, “we can release this version early, but we'll add three months to the schedule for the next version.”

Which may still mean that you cut corners to release early, and probably won't stave off demands to release the next version early as well. But at least you'll be speaking the same language.

Saturday, April 16, 2016

The Kobayashi Maru

“We have a stream of data, but need to replace specific values with others: a charge code, for example, with its department.”

“OK, that's easy, we'll save the codes and values in a map.”

“There may be hundreds of thousands of them.”

“A database might be better, then. Can we load the input data into a table and produce the output with a join?”

“Databases are too slow; we're dealing with thousands of messages a second.”

This interchange sounds like something that might come from an interview. Perhaps it's a “Kobayashi Maru” question, designed to test how the interviewee behaves under pressure, in which there is no correct answer and the interviewer piles on the constraints until the problem is insolvable.

Or perhaps it's the mental process of a software engineer working through the constraints of an actual problem, and testing her design against those constraints. In my experience, there are two big differences between the real world and the interview.

The first is that you might not know all of your constraints at design time. You might build your system using a high-performance messaging system, and later learn that it has an easily-reproducible but difficult-to-fix bug — one that you have to deal with in production.

The second is that real-world design is all about ranking the constraints, and making tradeoffs based on those rankings. If you absolutely need the speed of a large in-memory lookup table, then that constraint has to rank higher than hardware cost. This is often distasteful to your users, but it's unavoidable; a good engineer has to recognize that not all constraints have the same priority, and push users to the same understanding.

It is occasionally possible to “reprogram the simulation” in the real world, and sometimes that is the only way to resolve your constraints. Perhaps the input is structured such that you don't need to keep the entire lookup table in memory at once. Or perhaps you don't actually need to perform the substitution in real-time. Always look for these opportunities.

A third difference is that you have to live with your designs in the real world. For that reason, I recommend asking yourself as many hard questions as you can.

Saturday, April 9, 2016

Polyglot Programmers

I recently received an email from a well-known consulting firm that said, in essence, “you're a polyglot programmer, we're looking for those, let's talk.”

I can see where they got that impression: in the last three years I've worked with Java, JavaScript, Scala, Ruby, Python, Clojure, and SQL. In my 30+ year career, I've worked with over 15 languages. And there are a half-dozen more that I haven't used professionally. But why would a recruiter look for that?

One simple answer is that, as a consulting firm, they have to staff projects for clients with differing development environments. A person who knows multiple languages will be easier to assign. But I think there's more to it than that: “polyglot programmers” and “polyglot applications” are part of an approach to programming that is espoused by this same firm. One that I don't agree with.

Neal Ford is often credited with coining the term polyglot programming, in a 2006 blog post. His thesis — examined in greater detail by Dean Wampler in this 2010 presentation — is that different languages have strengths in different areas, and programmers should use the best tool for the job.

I don't think that it was coincidence that this philosophy emerged in the mid-2000s, from people associated with the Java ecosystem. This was a time when Java-the-language had stagnated but the JVM was the base for a flourishing community of “not Java” languages. Groovy in 2003, Scala in 2004, and Clojure in 2007 are some of the notable examples, although there are many others. All of these languages offered things that Java did not, higher-order functions being one of the more obvious. And all of them offered the ability (more or less) to interoperate with existing Java code.

Given these new abilities, it seemed only reasonable to adopt them: to use Groovy for your XML processing, or Scala for-comprehensions to process nested structures, or rewrite your multi-threaded code to use Scala actors. And perhaps not stop there, but adopt Martin Fowler's strangler application pattern, replacing the underlying Java code entirely. And who wouldn't want to leave behind Java's verbosity and definitional boilerplate?

Ten years have passed since Neal Ford's post, and I question whether the premise is still valid (if it ever was). Here are a few reasons:

  • Convergence of languages

    The mid-1960s also saw a flowering of programming languages with widely varying features: APL (1964) for mathematics, BASIC (1964) as an introductory language for education, PL/I (1965) for large-scale applications, Simula (1965) for simulating real-world interactions, SNOBOL (1962) for text processing, and many others. I was a toddler at the time, so have no first-person knowledge, but I think that these languages arose from much the same situation: mainstream languages lacked features, and the reduced cost and increased sophistication of computer systems lowered the barrier to creating something new.

    But by the end of the 1970s, most of these languages were on the path to obscurity (BASIC notwithstanding), and older languages had adopted many of the ideas that they promoted (even FORTRAN became block-structured). And then there was C, a new language that borrowed ideas from the earlier languages but recast them in a form that was better suited to the rising world of microprocessors.

    If history repeats itself, I think we're in the process of a similar consolidation. Java now has higher-order functions; if that was your reason for looking elsewhere, is it still valid? JavaScript is available outside the browser, it's performant, and now has a vast collection of libraries; is there still a reason to use separate languages for your front and back end code? Or will there be something new that takes over the world?

  • Increased maintenance cost

    Many of the people in favor of writing portions of their system in alternative languages point to the efficiency with which they can write code. And while I accept the truth of that, I also realize two things. First, that writing code is a tiny part of initial implementation. And second, that initial implementation is a tiny part of the effort required during an application's lifetime.

    Given this, using multiple languages for a single application means that all of the people tasked with its maintenance have to be at least comfortable with all of those languages, if not experts. This can be particularly painful if one of the team members has a fondness for an obscure language.

    But even in the case of popular languages, it can be a problem. One of my projects was with a company whose main service was written in a mix of Node.JS and Rails. I think that their original goal was to transition from Rails to Node, but it went awry: the Rails developers left and the Node developers realized that Node didn't (at that time) have all the features they needed. So both codebases remained a core part of the business, and the company had to find people with both skillsets in order to maintain the software.

  • Programmer focus

    In my opinion the best reason to be wary of polyglot projects is that developers can only know a limited number of languages — I use 1½ as my personal limit. Oh, sure, I've met people who claim to know a half-dozen or more languages. But when pressed, they only “know” those languages at a very basic level.

    True knowledge extends past syntax and semantics, to idiom and environment. It is the ability to choose the best implementation for any given goal, without thinking — the stage of learning known as “unconscious competence.” And with programming languages, the best implementation may be very different depending on the language.

    Which is not to say that you can't transition from one language to another. That's easy; I've done it several times in my career. But you'll find that the language you're most familiar with colors the way you work with the new language: when I started working with JavaScript, I attempted to use Java-style constructor functions rather than a map of data. You'll know you've transitioned when the new language changes the way you write the old.

While I don't think that polyglot applications are a good idea, I wholeheartedly support learning multiple languages as a way to expand your abilities. A few years ago I worked my way through Programming Erlang, and consider that time well-spent: it introduced me to pattern matching, gave me a deeper understanding of actor systems, and showed me an elegant (if inefficient) way to implement Quicksort.

But it didn't make me want to mix Erlang and Java in a single project.

Saturday, April 2, 2016

Practical Git: Recovering Orphaned Commits

> git reset --hard 5b7830c203e3f581c1d23d9f945478b9f94979da
HEAD is now at 5b7830c third revision

Some time later, after the window showing the log has been closed …
“Oh s**t, that was the wrong revision!”

If you've followed my advice, you have a backup of your Git repository and recovery is a simple matter of pulling the latest changes. But maybe you haven't pushed recently. Or maybe this is a scratch project that you never bothered to copy remotely.

The commits are still in your local repository, but they're orphaned: no branch head points to them, so there's no easy way to access them. But with a little digging, you can find the commit(s) that you wanted to preserve, and create a branch that contains them. You just need to know how Git stores commits.

The short form is that Git doesn't store “commits” per se, it stores objects. A commit is simply one type of object (the others are trees, which contain directory listings, and blobs, which contain files). This seems like a nit-picking differentiation, but it means that you have to hunt for your commits within a possibly large directory of objects.

That directory is .git/objects, which resides under the root of your project. If you look inside this directory, you'll see a bunch of sub-directories:

ls -l .git/objects
total 68
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 04
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 0d
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 21

And if you look inside one of these sub-directories, you'll see a bunch of files:

ls -l .git/objects/04
total 4
-r--r--r-- 1 kgregory kgregory 166 Apr  2 08:16 fbee2d3610317180c3f15b0d122e24f39fa82c

Well, in this case not so many files, because this is an example project with just four commits. But in a real project, you may have hundreds, if not thousands, of files in each directory. So you need some way to winnow them down. On Linux, the find command will let you see files that were created within a time range; in this case I look for everything in the last hour, because those are the commits that I deleted:

find .git/objects/ -cmin -60 -type f -ls
796801    4 -r--r--r--   1 kgregory kgregory       31 Apr  2 08:16 .git/objects/ce/06f54a5a2032d1fb605284e7217fca9e7a5073
796811    4 -r--r--r--   1 kgregory kgregory      167 Apr  2 08:16 .git/objects/66/82d6271a5416ed0a325cbafc34b32bbf893976
796803    4 -r--r--r--   1 kgregory kgregory       56 Apr  2 08:16 .git/objects/fa/b61137a51c608783b342d6e1912f45ae24c775

At this point it should be clear that these files are named for SHA-1 hashes. The two-level directory structure is designed so that you can store large numbers of files efficiently: in a project with 10,000 commits there will be 256 sub-directories, each of which will contain an average of 39 commits. Unfortunately for us, the full SHA-1 hash consists of the directory name concatenated with the filename. So we need to apply some sed:

find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///'

The reason that we need the hashes — versus simply grepping the files — is that the files are compressed. However, git provides the cat-file tool to help us:

git cat-file -p 6682d6271a5416ed0a325cbafc34b32bbf893976
tree ec6da2f24b113700e2d64b773b4d2c9149451bfd
parent 5b7830c203e3f581c1d23d9f945478b9f94979da
author Keith Gregory  1459599418 -0400
committer Keith Gregory  1459599418 -0400

fourth revision

At this point, finding the commits that you care about is a matter of passing the list of hashes into cat-file, and grepping for text that identifies your commit. Looking for the actual commit message is chancy (especially if it might overlap with filenames or file contents), so I generally just look for files that contain “committer”:

for f in `find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///'`
do git cat-file -p $f | grep committer && echo $f
done
committer Keith Gregory  1459599387 -0400
committer Keith Gregory  1459599402 -0400
committer Keith Gregory  1459599418 -0400
committer Keith Gregory  1459599374 -0400
committer Keith Gregory  1459599395 -0400

The next-to-last field on each line is the timestamp of the commit; the highest number will be the last commit. At this point, you can recover your commits:

git checkout -b my_recovered_commits 6682d6271a5416ed0a325cbafc34b32bbf893976
Switched to a new branch 'my_recovered_commits'

> git log
commit 6682d6271a5416ed0a325cbafc34b32bbf893976
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:58 2016 -0400

    fourth revision

commit 5b7830c203e3f581c1d23d9f945478b9f94979da
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:42 2016 -0400

    third revision
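The search steps above can be condensed into a single function. This is a sketch under a few assumptions: the names object_path_to_hash and recent_commits are invented, the function must be run from the root of the working tree, and it only sees loose objects (anything already packed will not be found). Rather than grepping for “committer,” it asks git cat-file -t for the object type, then prints each commit's timestamp next to its hash so that the most recent commit sorts last.

```shell
# Hypothetical helper: converts loose-object paths (on stdin) into hashes by
# joining the two-character directory name with the file name.
object_path_to_hash() {
    sed -e 's|.*/objects/||' -e 's|/||'
}

# Hypothetical helper: lists commit objects changed in the last N minutes
# (default 60) as "<committer timestamp> <hash>", oldest first.
recent_commits() {
    minutes="${1:-60}"
    find .git/objects/ -cmin "-$minutes" -type f |
    object_path_to_hash |
    while read -r hash
    do
        # pack and info files are not valid object names, so cat-file
        # fails for them and they are skipped
        if [ "$(git cat-file -t "$hash" 2>/dev/null)" = "commit" ]
        then
            timestamp=$(git cat-file -p "$hash" | awk '/^committer / { print $(NF-1); exit }')
            echo "$timestamp $hash"
        fi
    done | sort -n
}
```

With this in place, recent_commits 60 | tail -1 shows the newest commit from the last hour, and its hash can be passed to git checkout -b as shown above.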