Friday, June 14, 2024

CloudTrail + GenAI = Disappointment

AWS just announced a natural language query generator for CloudTrail Lake. Using it, you can write prompts like “What are my top errors in the past month?” and it will spit out the SQL to get the answer. On the surface, that seems great: CloudTrail events can be very complex, holding the request and response data for thousands of unrelated API operations. In my experience, extracting the information you need can take several minutes with a search engine, and longer if you use a tool such as Athena.

Unfortunately, as the title of this post tells you, it doesn't quite live up to expectations. In this post I highlight four queries that I think are simple and likely to be frequently used, but which show what's lost in translation from natural language.

Note 1: I ran the queries in this post on June 12, 2024, the same week that the query generator was announced in preview. I expect it to go through continuous training, so you might not get the same output that I did. I don't plan to make this an evergreen post, and will not update it if the queries change. It calls out the state of the world at one particular point in time.

Note 2: I made minor changes to identifiers in the generated queries, in case there's information about my account embedded within. I did not change the queries in any other way, including formatting.

Query #1: who started ec2 instance i-0fe20093cf243bca1

Sometimes you find an instance running, and have no idea why it exists. To answer that question, you need to talk to the person who started the instance.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND element_at(responseElements, 'instancesSet') LIKE '%i-0fe20093cf243bca1%'
    AND eventTime <= '2024-06-12 17:00:00'
ORDER BY eventTime DESC
LIMIT 1;

This query does what it's supposed to, and is not necessarily something that's easy to write off the top of your head: you need to know how to find the instance ID in the event's responseElements sub-object. Fortunately, the query generator does. But there are a few quirks that I think should be called out. The first is that this query exhibits “do what I mean, not what I say” behavior.

There are two EC2 API calls that look very similar: StartInstances and RunInstances. The former restarts an existing, stopped instance; the latter starts a new instance. In maybe 99% of cases, you (or at least I) actually mean the latter API call when using the word “start”. So it's great that the query generator picks it … until it doesn't, perhaps because the model has been retrained. A person experienced with CloudTrail will see that it's looking for the wrong API call; a novice won't, and will assume that the instance was started before the available data.

The second quirk is the condition on eventTime. It adds nothing to the query, and in fact can make it miss results: I generated this query at 17:44 UTC, so if the instance was started in the previous 45 minutes it wouldn't appear in the results.

I also found the ORDER BY and LIMIT clauses strange: they only make sense if instance IDs can be reused (which, as far as I know, they cannot).

Query #2: who started ec2 instances today

This query is less common, but might be useful to identify whether a person who shouldn't be authorized to start instances is, in fact, able to start them.

SELECT userIdentity.arn,
    eventName,
    eventTime,
    recipientAccountId,
    awsRegion,
    responseElements
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'

This isn't quite what I asked for: it lists all of the events, not just the identities. To give me what I want, it should use distinct userIdentity.arn, and get rid of all of the other fields in the select-list. I suspect, however, that the training data was focused on retrieving individual records, so that's what the query generator “thought” people would want.
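For reference, here's a query that answers the question as asked (my rewrite, reusing the generated query's event data store ID and date range):

SELECT DISTINCT userIdentity.arn
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventTime >= timestamp '2024-06-12 00:00:00'
    AND eventTime <= timestamp '2024-06-12 23:59:59'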

A more subtle issue is the date range. If you're familiar with the details of CloudTrail events, you know that event timestamps are UTC. But if you're not familiar — in other words, the type of person who would find a query generator useful — you might incorrectly think that the timestamps are local time. I tried appending “local time” to the prompt, but it had no effect; all date ranges in generated queries are UTC.
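If you do want local time, you have to translate the boundaries yourself. For example, to cover “today” in US Eastern daylight time (UTC-4), the conditions would look like this:

    AND eventTime >= timestamp '2024-06-12 04:00:00'
    AND eventTime <= timestamp '2024-06-13 03:59:59'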

Query #3: how many EC2 instances were started today

I could imagine this query being used by someone investigating the cost of their EC2 fleet. Unfortunately, the generated query makes a rather large mistake.

SELECT COUNT(*) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
    AND eventsource = 'ec2.amazonaws.com'

This query actually counts the number of requests to start instances. In many cases, one request means one instance, but that's certainly not true in all cases. There are plenty of times when you (or your automated scripts) might spin up a fleet of instances in a single request.

To get the actual count, you must know that the responseElements object in the raw event contains an instancesSet sub-object, which in turn has an items element that's an array of information about the actual instances created. And you can use that information to write a SELECT clause like this:

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started
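Putting that clause together with the conditions from the generated query gives something like this (my rewrite, not the generator's output):

SELECT sum(json_array_length(json_extract(responseElements['instancesSet'], '$.items'))) AS instances_started
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventName = 'RunInstances'
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
    AND eventsource = 'ec2.amazonaws.com'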

This is a case where a query generator should be great, because its model could incorporate information about the contents of each event type. Unfortunately, large language models only “know” what they have seen in their training data. The first query I showed above indicates that this model has some information about the deep event structures, but not enough to go past the basics.

Query #4: who started ec2 instance with tag "foobar"

This query seems as simple as my first. But the output has entered the realm of hallucination.

SELECT useridentity.arn,
    useridentity.username
FROM c40adde3-0909-92d8-b4cf-6b821599a449
WHERE eventname = 'RunInstances'
    AND resources [0].type = 'EC2'
    AND resources [0].tags ['foobar'] IS NOT NULL
    AND eventtime >= '2024-06-12 00:00:00'
    AND eventtime <= '2024-06-12 23:59:59'
GROUP BY 1,
    2

First off, I have no idea where that date range came from; it's not in my prompt. I suspect that it was sent to the query generator as historical context, based on my previous queries. But it shouldn't be.

The conditions on resources are also hallucinations. Multiple hallucinations.

The first, and biggest, is that resources isn't a valid field for this query. There are some events that have a resources element: AssumeRole, for example. But RunInstances does not have this element; the resources that you see listed in the CloudTrail Console are all extracted from the event data programmatically. When I first saw this query I thought that the CloudTrail Lake transformation might be doing this extraction. But when I queried the data I found that the value of that column was always null for RunInstances, so clearly it isn't.

The second hallucination is that the zero element of the resources array contains information about the instance (and assumes that there's only one instance). Like I said, the resource list in the CloudTrail Console is built programmatically, so it can't really be used as a guide, but it showed the instance as the fourth element of the array in the event I was viewing.

Third, the type field, for those events that have resources, is a qualified name such as AWS::IAM::Role. The CloudTrail Console uses AWS::EC2::Instance for the instance entry, not EC2, so that's what I would expect to see (if the RunInstances event did, in fact, have resources).

And lastly, I don't believe the tags element exists. I don't see it in any of the events that do have resources, and it doesn't appear in the (programmatically generated) resource list in the CloudTrail Console. But this condition is at least plausible; when I tried this same prompt on June 11, it came back with a condition involving the resource ARN.

All of these hallucinations mean that this query will never return results, whether or not there's an actual instance tagged “foobar”.

The one good thing about this query is that it eliminates the extra fields in the select-list. And it adds useridentity.username, which is useful even if it's only populated when an actual IAM user is responsible for the operation (which, in a world moving away from IAM users and to Identity Center, will be infrequent).

Wrapping Up

I spun up my first EC2 instance in 2008. And since then, I've usually been delighted by new AWS services, even if they had rough edges. I haven't felt that way since AWS decided to focus on gen-AI.

I don't have anything against generative AI per se (well, other than copyright infringement and the possibility of leaking confidential information). I think it's great for helping an inexperienced writer organize and polish their thoughts (but hopefully they learn from the experience). And I have a friend who uses it as a “better search” when he's working with something new (but he always backs that up by reading the documentation, never blindly taking the output of the model).

But when it comes to programming, gen-AI always seems to miss the mark. I've written elsewhere about my experiences with Amazon CodeGuru Reviewer. A friend recently told me that his company abandoned using Amazon CodeWhisperer to help them write CDK scripts; you'd think that would be a place where it would excel, having a large corpus of training data. And I quickly got bored with trying to use ChatGPT to generate a non-trivial programming construct.

I think that the reason is that large language models are very good at representing syntax, but that's only a small part of programming. Programming is much more about data: its structure and how it changes over time. And also about the causal and temporal relationships between pieces of code. Syntax is essential, but it's not sufficient.

So my job is probably safe until AI actually learns how to reason, if it ever does. But what about the CloudTrail query generator?

To be honest, I don't see the market. People who use CloudTrail all the time probably use a homegrown search engine or a commercial SIEM. So that leaves novices, people who might need an occasional query as part of their workload. But given the quality of the output, I think they're likely to get frustrated very quickly and walk away (or worse, decide that the query output is correct). Perhaps the model will get better over time, but as the saying goes, you only get one chance to make a first impression.

Friday, November 11, 2016

Git Behind The Curtain: what happens when you commit, branch, and merge

This is the script for a talk that I'm planning to give tomorrow at BarCamp Philly. The talk is a “live committing” exercise, and this post contains all the information needed to follow along. With some links to relevant source material and minus my pauses, typos, and attempts at humor.


Object Management

I think the first thing to understand about Git is that it's not strictly a source control system; it's more like a versioned filesystem that happens to be good at source control. Traditionally, source control systems focused on the evolution of files. For example, RCS (and its successor CVS) maintain a separate file in the repository for each source file; these repository files hold the entire history of the file, as a sequence of diffs that allow the tool to reconstruct any version. Subversion applies the idea of diffs to the entire repository, allowing it to track files as they move between directories.

Git takes a different approach: rather than constructing the state of the repository via diffs, it maintains snapshots of the repository and constructs diffs from those (if you don't believe this, read on). This allows very efficient comparisons between any two points in history, but does consume more disk space. I think the key insight is not just that disk is cheap and programmer time expensive, but that real-world software projects don't have a lot of large files, and those files don't experience a lot of churn.

To see Git in action, we'll create a temporary directory, initialize it as a repository, and create a couple of files. I should note here that I'm using bash on Linux; if you're running Windows you're on your own re commands. Anything that starts with “>” is a command that I typed; anything else is the response from the system.

> mkdir /tmp/$$

> cd /tmp/$$

> git init
Initialized empty Git repository in /tmp/13914/.git/

> touch foo.txt

> mkdir bar

> touch bar/baz.txt

> git add *

> git commit -m "initial revision"
[master (root-commit) 37a649d] initial revision
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 bar/baz.txt
 create mode 100644 foo.txt

Running git log shows you this commit, identified by its SHA-1 hash.

> git log
commit 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
Author: Keith Gregory 
Date:   Thu Nov 3 09:20:29 2016 -0400

    initial revision

What you might not realize is that a commit is a physical object in the Git repository, and the SHA-1 hash is actually the hash of its contents. Git has several different types of objects, and each object is uniquely identified by the SHA-1 hash of its contents. Git stores these objects under the directory .git/objects, and the find command will help you explore this directory. Here I sort the results by timestamp and then filename, to simplify tracing the changes to the repository.

> find .git/objects -type f -ls | sort -k 10,11
13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391

OK, one commit created four objects. To understand what they are, it helps to have a picture:

Each commit represents a snapshot of the project: from the commit you can access all of the files and directories in the project as they appeared at the time of the commit. Each commit contains three pieces of information: metadata about the commit (who made it, when it happened, and the message), a list of parent commits (which is empty for the first commit but has at least one entry for every other commit), and a reference to the “tree” object holding the root of the project directory.

Tree objects are like directories in a filesystem: they contain a list of names and references to the content for each name. In the case of Git, a name may either reference another tree object (in this example, the “bar” sub-directory), or a “blob” object that holds the content of a regular file.

As I said above, an object's SHA-1 is built from the object's content. That's why we have two files in the project but only one blob: because they're both empty files, the content is identical and therefore the SHA-1 is identical.
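If you don't believe me, Git's hash-object plumbing command computes the hash for any content; hashing the empty file /dev/null produces exactly the blob ID that appears in the listing above:

> git hash-object /dev/null
e69de29bb2d1d6434b8b29ae775ad8c2e48c5391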

You can use the cat-file command to look at the objects in the repository. Starting with the commit, here are the four objects from this commit (the blob, being empty, doesn't have any output from this command):

> git cat-file -p 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
tree 2d2de60b0620e7ac574fa8050997a48efa469f5d
author Keith Gregory  1478179229 -0400
committer Keith Gregory  1478179229 -0400

initial revision
> git cat-file -p 2d2de60b0620e7ac574fa8050997a48efa469f5d
040000 tree 34707b133d819e3505b31c17fe67b1c6eacda817 bar
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 foo.txt
> git cat-file -p 34707b133d819e3505b31c17fe67b1c6eacda817
100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 baz.txt
> git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

Now let's create some content. I'm generating random text with what I think is a neat hack: you get a stream of random bytes from /dev/urandom, then use sed to throw away anything that you don't want. Since the random bytes include a newline every 256 bytes (on average), you get a file that can be edited just like typical source code. One thing that's non-obvious: Linux by default interprets the stream of bytes as UTF-8, meaning a lot of invalid characters from the random source; explicitly setting the LANG variable to an 8-bit encoding solves this.

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -10000 > foo.txt
> ls -l foo.txt 
-rw-rw-r-- 1 kgregory kgregory 642217 Nov  3 10:05 foo.txt
> head -4 foo.txt 
U4UIESn4HN61l6Is epQXlHSVaLpJGt
N8opIkrSt5NQsWnqYYmt9BBmEWBVaaSVjzTTFJCXHT2vay2CoDT7J rm3f7CWefdgicOdfs0tUdgx
OvqjwOykmKToWkd8nxWNtCCCkUi cxn3Bn5gN4Im38y cfS6IdXgIj9O6gBEGgBW6BcZJ 
2BluWmwQYgyNFIHP8RUL8m2aAjM1FwcY8ZX9fvmvJi30p9sBEkq6giuoRvJSWRW8PLCsrEWfSXeZXxO2HK2IS3MFNpviKRagug3HE96I

When we commit this change, our object directory gets three new entries:

13369623    4 -r--r--r--   1 kgregory kgregory       82 Nov  3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d
13369612    4 -r--r--r--   1 kgregory kgregory       52 Nov  3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817
13369637    4 -r--r--r--   1 kgregory kgregory      137 Nov  3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e
13369609    4 -r--r--r--   1 kgregory kgregory       15 Nov  3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e

In the diagram, you can see what happened: adding content to the file created a new blob. Since this had a different SHA-1 than the original file content, it meant that we got a new tree to reference it. And of course, we have a new commit that references that tree. Since baz.txt wasn't changed, it continues to point to the original blob; in turn that means that the directory bar hasn't changed, so it can be represented by the same tree object.

One interesting thing is to compare the size of the original file, 642,217 bytes, with the 496,994 bytes stored in the repository. Git compresses all of its objects (so you can't just cat the files). I generated random data for foo.txt, which is normally incompressible, but limiting it to alphanumeric characters means that each character fits in 6 bits rather than 8; compressing the file therefore saves roughly 25% of its space.

I'm going to diverge from the diagram for the next example, because I think it's important to understand that Git stores full objects rather than diffs. So, we'll add a few new lines to the file:

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

When we look at the objects, we see that there is a new blob that is slightly larger than the old one, and that the commit object (c620754c) is nowhere near large enough to hold the change. Clearly, the commit does not encapsulate a diff.

13369652    4 -r--r--r--   1 kgregory kgregory      170 Nov  3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9
13369650    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2
13369647  488 -r--r--r--   1 kgregory kgregory   496994 Nov  3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e
13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9

Before moving on, I want to call out Git's two-level directory structure. Filesystem directories are typically a linear list of files, and searching for a specific filename becomes a significant cost once you have more than a few hundred files. Even a small repository, however, may have 10,000 or more objects. A two-level filesystem is the first step to solving this problem: the top level consists of sub-directories with two-character names, representing the first byte of the object's SHA-1 hash. Each sub-directory only holds those objects whose hashes start with that byte, thereby partitioning the total search space.

Large repositories would still be expensive to search, so Git also uses “pack” files, stored in .git/objects/pack; each pack file contains some large number of commits, indexed for efficient access. You can trigger this compression using git gc, although that only affects your local repository. Pack files are also used when retrieving objects from a remote repository, so your initial clone gives you a pre-packed object directory.
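You can see this in action on the example repository (the hash embedded in the filenames will differ in your repository):

> git gc

> ls .git/objects/pack
pack-d6bf9e43a1c6ffd0bab3dcc3a2322d9f945478b9.idx
pack-d6bf9e43a1c6ffd0bab3dcc3a2322d9f945478b9.pack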

Branches

OK, you've seen how Git stores objects; what happens when you create a branch?

> git checkout -b my-branch
Switched to a new branch 'my-branch'

> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt

> git commit -m "add some content in a branch" foo.txt 
[my-branch 1791626] add some content in a branch
 1 file changed, 100 insertions(+)

If you look in .git/objects, you'll see another three objects, and at this point I assume that you know what they are.

13369648  492 -r--r--r--   1 kgregory kgregory   501902 Nov  3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db
13369656    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03
13369658    4 -r--r--r--   1 kgregory kgregory      165 Nov  3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9
13510395    4 -r--r--r--   1 kgregory kgregory      175 Nov  3 11:50 .git/objects/17/91626ece8dce3c713261e470a60049553d5411
13377993  496 -r--r--r--   1 kgregory kgregory   506883 Nov  3 11:50 .git/objects/57/c091e77d909f2f20e985150ea922c3ec303ab6
13377996    4 -r--r--r--   1 kgregory kgregory       81 Nov  3 11:50 .git/objects/f2/c824ab04c8f902ab195901e922ab802d8bc37b

So how does Git know that this commit belongs to a branch? The answer is that Git stores that information elsewhere:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 10:18 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 
c620754c351f5462bf149a93bdf0d3d51b7d91a9

> cat .git/refs/heads/my-branch
1791626ece8dce3c713261e470a60049553d5411

That's it: text files that hold the SHA-1 of a commit. There is one additional file, .git/HEAD, which says which of those text files represents the “working” branch:

> cat .git/HEAD 
ref: refs/heads/my-branch

This isn't quite the entire story. For one thing, it omits tags, stored in .git/refs/tags. Or the “detached HEAD” state, when .git/HEAD holds a commit rather than a ref. And most important, remote branches and their relationship to local branches. But this post (and the talk) are long enough as it is.

Merges

You've been making changes on a development branch, and now it's time to merge those changes into master (or an integration branch). It's useful to know what happens when you type git merge.

Fast-Forward Merges

The simplest type of merge — indeed, you could argue that it's not really a merge at all, because it doesn't create new commits — is a fast-forward merge. This is the sort of merge that we'd get if we merged our example project branch into master.

> git checkout master
Switched to branch 'master'

> git merge my-branch
Updating c620754..1791626
Fast-forward
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

As I said, it's not really a merge at all; if you look in .git/objects you'll see the same list of files that were there after the last commit. What has changed are the refs:

> ls -l .git/refs/heads/
total 8
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 12:15 master
-rw-rw-r-- 1 kgregory kgregory 41 Nov  3 11:50 my-branch
> cat .git/refs/heads/master 
1791626ece8dce3c713261e470a60049553d5411

> cat .git/refs/heads/my-branch
1791626ece8dce3c713261e470a60049553d5411

Here's a diagram of what happened, showing the repository state pre- and post-merge:

Topologically, the “branch” represents a straight line, extending the last commit on master. Therefore, “merging” the branch is as simple as re-pointing the master reference to that commit.

Squashed Merges

I'll start this section with a rant: one of the things that I hate, when looking at the history of a project, is to see a series of commits like this: “added foo”; “added unit test for foo”; “added another testcase for foo”; “fixed foo to cover new testcase”… Really. I don't care what you did to make “foo” work, I just care that you did it. And if your commits are interspersed with those of the person working on “bar” the evolution of the project becomes almost impossible to follow (I'll return to this in the next section).

Fortunately, this can be resolved easily with a squash merge:

> git merge --squash my-branch
Updating c620754..1791626
Fast-forward
Squash commit -- not updating HEAD
 foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

The “not updating HEAD” part of this message is important: rather than commit the changes, a squash merge leaves them staged and lets you commit (with a summary message) once you've verified the changes. The diagram looks like this:

The thing to understand about squashed commits is that they're not actually merges: there's no connection between the branch and master. As a result, “git branch -d my-branch” will fail, warning you that it's an unmerged branch; you need to force deletion by replacing -d with -D.
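Here's what that looks like (the exact message text varies by Git version):

> git branch -d my-branch
error: The branch 'my-branch' is not fully merged.
If you are sure you want to delete it, run 'git branch -D my-branch'.

> git branch -D my-branch
Deleted branch my-branch (was 1791626).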

To wrap up this section: I don't think you need to squash all merges, just the ones that merge a feature onto master or an integration branch. Use normal merges when pulling an integration branch onto master, or when back-porting changes from the integration branch to the development branch (I do, however, recommend squashing back-ports from master to integration).

“Normal” Merges, with or without conflicts

To understand what I dislike about “normal” merges, we need to do one. For this we'll create a completely new repository, one where we'll compile our favorite quotes from Lewis Carroll. We start by creating the file carroll.txt in master; for this section I'll just show changes to the file, not the actual commits.

Jabberwocky

Walrus and Carpenter

In order to work independently, one person will make the changes to Jabberwocky on a branch named “jabber”:

Jabberwocky

Twas brillig, and the slithy toves
Did gire and gimbal in the wabe
All mimsy were the borogroves
And the mome raths outgrabe

Walrus and Carpenter

After that's checked in, someone else modifies the file on master:

Jabberwocky

Walrus and Carpenter

The time has come, the Walrus said,
To talk of many things:
Of shoes, and ships, and sealing wax
Of cabbages, and kings
And why the sea is boiling hot
And whether pigs have wings

Back on branch jabber, someone has found the actual text and corrected mistakes:

Jabberwocky

'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
All mimsy were the borogoves
And the mome raths outgrabe

Walrus and Carpenter

At this point we have two commits on jabber, and one on master after the branch was made. If you look at the commit log for jabber you see this:

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

And on master, this is the commit log:

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

Now the question is: what happens when you merge jabber onto master? In my experience, most people think that the branch commits are added into master based on when they occurred. This is certainly reinforced by the post-merge commit log:

commit 8ba1b90996b4fb67529a2964836f8524a220f2d8
Merge: 5269b07 2e68e9b
Author: Keith Gregory 
Date:   Thu Nov 3 13:03:09 2016 -0400

    Merge branch 'jabber'

commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
Author: Keith Gregory 
Date:   Thu Nov 3 12:59:40 2016 -0400

    fix typos in jabberwocky

commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c
Author: Keith Gregory 
Date:   Thu Nov 3 12:57:19 2016 -0400

    favorite verse of Walrus

commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb
Author: Keith Gregory 
Date:   Thu Nov 3 12:41:35 2016 -0400

    jabberwocky: initial verse

commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6
Author: Keith Gregory 
Date:   Thu Nov 3 12:39:41 2016 -0400

    initial commit, titles without content

But take a closer look at those commit hashes, and compare them to the hashes from the separate branches. They're the same, which means that the commits are unchanged. In fact, git log walks the parent references of both branches, making it appear that commits are on one branch when they aren't. The diagram actually looks like this (yes, I know, there are two commits too many):

The merge created a new commit, which has two parent references. It also created a new blob to hold the combined file, along with a modified tree. If we run cat-file on the merge commit, this is what we get:

tree 589187777a672868e50aefc491ed640125d1e3ed
parent 5269b0742e49a3b201ddc0255c50c37b1318aa9c
parent 2e68e9b9905b043b06166cf9aa5566d550dbd8ad
author Keith Gregory  1478192589 -0400
committer Keith Gregory  1478192589 -0400

Merge branch 'jabber'

So, what does it mean that git log produces the illusion of a series of merged commits? Consider what happens when you check out one of the commits in the list, for example 2e68e9b9.

This was a commit that was made on the branch. If you check out that commit and look at the commit log from that point, you'll see that commit 5269b074 no longer appears. It was made on master, in a completely different chain of commits.

In a complex series of merges (say, multiple development branches onto an integration branch, and several integration branches onto a feature branch) you can completely lose track of where and why a change was made. If you try to diff your way through the commit history, you'll find that the code changes dramatically between commits, and appears to flip-flop; you're simply seeing the code state on different branches.

Wrapping up: how safe is SHA-1?

“Everybody knows” that SHA-1 is a “broken” hash, so why is it the basis for storing objects in Git?

The answer to that question has two parts. The first is that SHA-1 is “broken” in terms of an attacker being able to create a false message that has the same SHA-1 hash as a real message: it takes fewer than the expected number of attempts (although still a lot!). This is a problem if the message that you're hashing is involved in validating a server certificate. It's not a problem in the case of Git, because at worst the attacker would be able to replace a single object — you might lose one file within a commit, or need to manually rebuild a directory.

That's an active attacker, but what about accidental collisions: let's say that you have a commit with a particular hash, and it just so happens that you create a blob with the same hash. It could happen, although the chances are vanishingly small. And if it does happen, Git ignores the later file.

So don't worry, be happy, and remember to squash features from development branches.

Saturday, April 30, 2016

Taming Maven: Transitive Dependency Pitfalls

Like much of Maven, transitive dependencies are a huge benefit that brings with them the potential for pain. And while I titled this piece “Taming Maven,” the same issues apply to any build tool that uses the Maven dependency mechanism, including Gradle and Leiningen.

Let's start with definitions: direct dependencies are those listed in the <dependencies> section of your POM. Transitive dependencies are the dependencies needed to support those direct dependencies, recursively. You can display the entire dependency tree with mvn dependency:tree; here's the output for a simple Spring servlet:

[INFO] com.kdgregory.pathfinder:pathfinder-testdata-spring-dispatch-1:war:1.0-SNAPSHOT
[INFO] +- javax.servlet:servlet-api:jar:2.4:provided
[INFO] +- javax.servlet:jstl:jar:1.1.1:compile
[INFO] +- taglibs:standard:jar:1.1.1:compile
[INFO] +- org.springframework:spring-core:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-asm:jar:3.1.1.RELEASE:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.springframework:spring-beans:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-context:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-aop:jar:3.1.1.RELEASE:compile
[INFO] |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  \- org.springframework:spring-expression:jar:3.1.1.RELEASE:compile
[INFO] +- org.springframework:spring-webmvc:jar:3.1.1.RELEASE:compile
[INFO] |  +- org.springframework:spring-context-support:jar:3.1.1.RELEASE:compile
[INFO] |  \- org.springframework:spring-web:jar:3.1.1.RELEASE:compile
[INFO] \- junit:junit:jar:4.10:test
[INFO]    \- org.hamcrest:hamcrest-core:jar:1.1:test

The direct dependencies of this project include servlet-api version 2.4 and spring-core version 3.1.1.RELEASE. The latter has dependencies on spring-asm and commons-logging.

In a real-world application, the dependency tree may include hundreds of JAR files with many levels of transitive dependencies. And it's not a simple tree, but a directed acyclic graph: many JARs will share the same dependencies — although possibly with differing versions.

So, how does this cause you pain?

The first (and easiest to resolve) pain is that you might end up with dependencies that you don't want. For example, commons-logging. I don't subscribe to the fear that commons-logging causes memory leaks, but I also use SLF4J, and don't want two logging facades in my application. Fortunately, it's (relatively) easy to exclude individual dependencies, as I described in a previous “Taming Maven” post.

The second pain point, harder to resolve, is what, exactly, is the classpath?

A project's dependency tree is the project's classpath. Actually, “the” classpath is a bit misleading: there are separate classpaths for build, test, and runtime, depending on the <scope> specifications in the POM(s). Each plugin can define its own classpath, and some provide a goal that lets you see the classpath they use; mvn dependency:build-classpath will show you the classpath used to compile your code.

This tool lists dependencies in alphabetical order. But if you look at a generated WAR, they're in a different order (which seems to bear no relationship to how they're listed in the POM). If you're using a “shaded” JAR, you'll get a different order. Worse, since a shaded JAR flattens all classes into a single tree, you might end up with one JAR that overwrites classes from another (for example, SLF4J provides the jcl-over-slf4j artifact, which contains re-implemented classes from commons-logging).

Compounding the problem of classpath ordering, there is the possibility of version conflicts. This isn't an issue for the simple example above, but for real-world applications that have deep dependency trees, there are bound to be cases where dependencies-of-dependencies have different versions. For example, the Jenkins CI server has four different versions of commons-collections in its dependency tree, ranging from 2.1 to 3.2.1 — along with 20 other version conflicts.

Maven has rules for resolving such conflicts. The only one that matters is that direct dependencies take precedence over transitive. Yes, there are other rules regarding depth of transitive dependencies and ordering, but those are only valid to discover why you're getting the wrong version; they won't help you fix the problem.

The only sure fix is to lock down the version, either via a direct dependency, or a dependency-management section. This, however, carries its own risk: if one of your transitive dependencies requires a newer version than the one you've chosen, you'll have to update your POM. And, let's be honest, the whole point of transitive dependencies was to keep you from explicitly tracking every dependency that your app needs, so this solution is decidedly sub-optimal.
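As an example of locking down a version, here's a dependency-management entry that would pin commons-collections throughout the build (the version is for illustration; pick whatever your dependency tree needs):

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>commons-collections</groupId>
            <artifactId>commons-collections</artifactId>
            <version>3.2.1</version>
        </dependency>
    </dependencies>
</dependencyManagement>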

A final problem — and the one that I consider the most insidious — is directly relying on a transitive dependency.

As an example, I'm going to use the excellent XML manipulation library known as Practical XML. This library makes use of the equally excellent utility library KDGCommons. Having discovered the former, you might also start using the latter — deciding, for example, that its implementation of parallel map is far superior to others.

However, if you never updated your POM with a direct reference to KDGCommons, then when the author of PracticalXML decides that he can use functions from Jakarta commons-lang rather than KDGCommons, you've got a problem. Specifically, your build breaks, because the transitive dependency has disappeared.

You might think that this is an uncommon situation, but it was actually what prompted this post: a colleague changed one of his application's direct dependencies, and his build started failing. After comparing dependencies between the old and new versions we discovered a transitive dependency that disappeared. Adding it back as a direct dependency fixed the build.

To wrap up, here are the important take-aways:

  • Pay attention to transitive dependency versions: whenever you change your direct dependencies, you should run mvn dependency:tree to see what's changed with your transitives. Pay particular attention to transitives that are omitted due to version conflicts.
  • If your code calls it, it should be a direct dependency. Plugging another of my creations, the PomUtil dependency tool can help you discover those.

Saturday, April 2, 2016

Practical Git: Recovering Orphaned Commits

> git reset --hard 5b7830c203e3f581c1d23d9f945478b9f94979da
HEAD is now at 5b7830c third revision

Some time later, after the window showing the log has been closed …
“Oh s**t, that was the wrong revision!”

If you've followed my advice, you have a backup of your Git repository and recovery is a simple matter of pulling the latest changes. But maybe you haven't pushed recently. Or maybe this is a scratch project that you never bothered to copy remotely.

The commits are still in your local repository, but they're orphaned: no branch head points to them, so there's no easy way to access them. But with a little digging, you can find the commit(s) that you wanted to preserve, and create a branch that contains them. You just need to know how Git stores commits.

The short form is that Git doesn't store “commits” per se, it stores objects. A commit is simply one type of object (the others are trees, which contain directory listings, and blobs, which contain files). This seems like a nit-picking differentiation, but it means that you have to hunt for your commits within a possibly large directory of objects.

That directory is .git/objects, which resides under the root of your project. If you look inside this directory, you'll see a bunch of sub-directories:

ls -l .git/objects
total 68
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 04
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 0d
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 21
…

And if you look inside one of these sub-directories, you'll see a bunch of files:

ls -l .git/objects/04
total 4
-r--r--r-- 1 kgregory kgregory 166 Apr  2 08:16 fbee2d3610317180c3f15b0d122e24f39fa82c

Well, in this case not so many files, because this is an example project with just four commits. But in a real project, you may have hundreds, if not thousands, of files in each directory. So you need some way to winnow them down. On Linux, the find command will let you see files that were created within a time range; in this case I look for everything in the last hour, because those are the commits that I deleted:

find .git/objects/ -cmin -60 -type f -ls
796801    4 -r--r--r--   1 kgregory kgregory       31 Apr  2 08:16 .git/objects/ce/06f54a5a2032d1fb605284e7217fca9e7a5073
796811    4 -r--r--r--   1 kgregory kgregory      167 Apr  2 08:16 .git/objects/66/82d6271a5416ed0a325cbafc34b32bbf893976
796803    4 -r--r--r--   1 kgregory kgregory       56 Apr  2 08:16 .git/objects/fa/b61137a51c608783b342d6e1912f45ae24c775
…

At this point it should be clear that these files are named for SHA-1 hashes. The two-level directory structure is designed so that you can store large numbers of files efficiently: in a project with 10,000 commits there will be 256 sub-directories, each of which will contain an average of 39 commits. Unfortunately for us, the full SHA-1 hash consists of the directory name concatenated to the filename. So we need to apply some sed:

find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///'
ce06f54a5a2032d1fb605284e7217fca9e7a5073
6682d6271a5416ed0a325cbafc34b32bbf893976
fab61137a51c608783b342d6e1912f45ae24c775
…

The reason that we need the hashes — versus simply grepping the files — is that the files are compressed. However, git provides the cat-file tool to help us:

git cat-file -p 6682d6271a5416ed0a325cbafc34b32bbf893976
tree ec6da2f24b113700e2d64b773b4d2c9149451bfd
parent 5b7830c203e3f581c1d23d9f945478b9f94979da
author Keith Gregory  1459599418 -0400
committer Keith Gregory  1459599418 -0400

fourth revision

At this point, finding the commits that you care about is a matter of passing the list of hashes into cat-file, and grepping for text that identifies your commit. Looking for the actual commit message is chancy (especially if it might overlap with filenames or file contents), so I generally just look for files that contain “committer”:

for f in `find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///'`
do git cat-file -p $f | grep committer && echo $f
done
committer Keith Gregory  1459599387 -0400
275a303e89187e0ccffef18fd8f8d42103b33618
committer Keith Gregory  1459599402 -0400
5b7830c203e3f581c1d23d9f945478b9f94979da
committer Keith Gregory  1459599418 -0400
6682d6271a5416ed0a325cbafc34b32bbf893976
committer Keith Gregory  1459599374 -0400
0dc65221c3e1d4991fc9bf9471d5dc6e372c3885
committer Keith Gregory  1459599395 -0400
04fbee2d3610317180c3f15b0d122e24f39fa82c

The next-to-last field on each line is the timestamp of the commit; the highest number will be the last commit. At this point, you can recover your commits:

git checkout -b my_recovered_commits 6682d6271a5416ed0a325cbafc34b32bbf893976
Switched to a new branch 'my_recovered_commits'

> git log
commit 6682d6271a5416ed0a325cbafc34b32bbf893976
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:58 2016 -0400

    fourth revision

commit 5b7830c203e3f581c1d23d9f945478b9f94979da
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:42 2016 -0400

    third revision
…

Sunday, March 1, 2015

Developing in the Clouds

Deploying to the cloud — whether to a platform-as-a-service provider such as Heroku, or an infrastructure-as-a-service provider such as Amazon EC2 — is commonplace, particularly for startups that don't want to invest in infrastructure, or established companies that need resources to handle operational spikes. But not much has been written about using a cloud-based host as a development platform. I just wrapped up a six-month project where my primary development machine was an Amazon EC2 instance, and here are a few of my thoughts.

I'm going to start with the negatives:

  • Latency

    I live in Philadelphia and was working on an Amazon EC2 instance in Ashburn, Virginia, a distance of approximately 150 miles — at the speed of light, roughly a millisecond. However, there are multiple network hops between me and the server, all of which add up to a round-trip time (via ping) of roughly 40 milliseconds. If you confine yourself to text, that's nearly unnoticeable. If you run the X Window System, it's unbearable. If your workflow is GUI-intensive, cloud-based development might not be a good choice (although I consider VNC quite acceptable when using an IDE for development).

  • Capability

    My desktop runs an Ivy Bridge Core i7 with 32 GB of RAM. The t2.medium instance that I used for remote development has baseline performance of roughly 40% of an undisclosed Xeon and only 4 GB of RAM. As it turns out, that's sufficient for many development tasks, especially with a rapid-turnaround platform such as Node.JS. If you have big compiles, you can always fire up a c4.8xlarge with a Haswell Xeon, 60 GB of RAM, and disk throughput that's far better than your desktop SSD.

  • Cost

    Mind you, that c4.8xlarge will cost you: as of this date, $1.68 per hour or fraction thereof. On another project, a colleague fired up a cluster of these instances and forgot to shut them down when he left the company. A month later the IT department gave me a call to ask if we really needed them, because they were costing us $5,000 a month. By comparison, the t2.medium instance costs $0.052 per hour, or $456 per year. More than a decent developer desktop on a three-year depreciation schedule, but not that bad in the larger scheme.

  • Security

    This is the big one: if you're going to run machines in the cloud, you need to have at least a baseline knowledge of Internet security (or hire someone who does). Put simply, you will be attacked. To apply some numbers to that statement, I started an EC2 instance that exposed SSH and HTTP, and left it running for 24 hours. The first attempt to break into SSH happened within a half hour; there were 39 attempts over the course of the test. Yandex started exploring the site within 12 hours, followed by other web scrapers.*

    Basic security rules will get you a long way: don't use dictionary passwords for any exposed service (and for SSH, don't use passwords at all), and don't expose any unprotected services to the outside world. Use a firewall that checks origin IP. If you're running on AWS, this feature is built into security groups; there's an example at the end of this item. If you need to share access to your instance, or access it from locations that you don't know in advance, consider a VPN.

    This is also a case where I think security by obscurity is useful — at least as a first line of defense. Most scannerbots and web crawlers look at well-known ports: 22 for SSH; 80, 8000, and 8080 for HTTP. Exposing your prototype website on port 21498 isn't going to stop a dedicated attacker (and there are bulk port scanners out there), but it will prevent your site's content from showing up in a search index before you're ready.
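    Returning to security groups: with the AWS CLI, restricting SSH access to a single origin IP is one command (the group ID and address here are placeholders; substitute your own):

    > aws ec2 authorize-security-group-ingress --group-id sg-12345678 --protocol tcp --port 22 --cidr 203.0.113.17/32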

And now, the positives:
  • Availability

    The ability to access a cloud-based host from anywhere, at any time, gives you an enormous amount of flexibility in how you do work. There's no need to lug a laptop home every night, and if the roads are snow-covered and filled with stopped traffic, you can easily work from home. With tools like screen or VNC, you can have a session that's always set up just how you want it, and which can run programs while you're not connected. Plus, it's easy to collaborate: unlike a personal computer, a cloud instance can be shared by multiple users.

  • Consistency

    I don't know about you, but after a few years my computers all accumulate a significant amount of cruft: libraries or entire applications that I installed for a particular project and no longer need, along with various versions of standard tools, some of which are expected by the OS itself. Multiply that cruft by the number of members on your team, and add a fudge factor for different shell configuration files. None of which matches your production (or even test) environment. It's a mess.

    To me, this is the biggest benefit of developing in the cloud: you can be certain that all machines are configured alike — or at least start out that way. Tools such as Chef and Puppet will take you from a base image to a fully configured server in one step. With Amazon, after you've configured the server once, you can create a private AMI and stamp out as many instances as you want.

  • Disposability

    The flip side of starting servers quickly is disposing of them when no longer needed. There's no reason to patch or update your machine; that just accumulates cruft. This mantra has long been used by operations teams: I was out riding with a friend last fall when his beeper went off; we pulled over, he discovered that one of his production servers was having trouble, shut it down, and started a replacement. Analysis could wait for later; there was no need to try to make an emergency patch.

Finally, lessons learned:
  • Experience Helps — a Lot

    I've been using AWS since 2009, and have been responsible for administering my personal Linux machine since well before that. However, the limits of my knowledge became apparent when the company hired Eric, a “cloud architect” who lived up to the name. He quickly had us running in an isolated virtual private cloud (VPC), with separate VPCs for our test, integration, and production environments, OpenVPN to secure access, LDAP to hold credentials, and automatic deployments from our CI server. If you can find such a person, hire him or her; it will save a lot of time.

  • Disposability Changes How You Work

    I got a hint of this several years ago, when using AWS to test a distributed application: we would prepare a test, start up a dozen servers, run the test, then shut them down. It allowed a form of testing that was quite simply impossible just a few years earlier; no company that I've worked for had a spare closet full of machines, and even if they did, configuration would require hours.

    As developers, we set up our machines just the way we want them; it takes a long time, and no two developers have the same configuration. But if you have the base development image preconfigured, firing up a new machine becomes a process of copying your personal configuration files and checking out your workspace. You learn to let go of “my machine.”


* I've written elsewhere about the questionable practices of web crawlers. If you have a site that's not ready for primetime, don't expose it to the Internet on a well-known port.

Saturday, November 30, 2013

Accessing the World Via Cellphone

My mother doesn't have a connection to the Internet. For that matter, she doesn't have a computer; if I ever want to explore the reasons why I'm a closet Luddite, I need look no further. This makes the Thanksgiving holiday a bit challenging: I want to spend the entire week visiting (less traffic, less overall stress), but usually need to work at least a couple of days.

It used to be that there were plenty of neighbors with open wifi, but in the past few years they've either changed providers or learned about security; now everyone in range requires a password. We've tried going to the town library, but that's really not a great place to work (when did librarians switch from shushing patrons to talking loudly amongst themselves?). And we've spent days with family and friends, but it's hard to actually focus on work in that situation.

This year I decided to try something different: bring my Internet with me via a mobile hotspot. I'd used Clear WiMax at a client site for a couple of months, and occasionally tether my cellphone, so thought that it might be a viable solution. After some research I settled on a prepaid 5 gigabyte plan from T-Mobile. Their coverage map indicated an excellent signal at my mother's house (unlike AT&T, where I'm lucky to get one bar on the second floor), there was no long-term commitment, and the price was good ($42.40: $30 for service, $10 for a SIM, and $2.40 for tax). I was also able to use my old (unlocked) Galaxy Nexus as the hotspot: it's limited to 3G, but I didn't think the extra speed of LTE justified the $150 cost of a new device.

Overall, the experiment was a success: I consistently saw download speeds in the 2-4 Mbit/sec range, and was able to browse the web, connect to remote servers via SSH, and participate in video chats without problem. Not quite what I'm used to from my home FiOS service, but better than the ADSL service that I used for ten years.

The major problem that I experienced was that it took a long time to initiate connections. It didn't matter whether I was opening an SSH session or a web page; I'd wait several seconds until the data started flowing. Once the connection was established, data flowed without delay.

Well, almost without delay: when downloading large files, I could see the rate fluctuating, from a high of around 600 kbytes/sec down to a low under 10 kbytes/sec, with occasional stalls. I'm not sure whether that was due to rate limiting on the part of T-Mobile, or competition for the network. At one point I looked at the phone's status screen, and saw the network type repeatedly switching between UMTS, HSDPA, and HSPA+, so the latter seems most likely. But video calls didn't seem affected, other than occasional blurring.

Bottom line: for less than the cost of a tank of gas, I was able to work in my pajamas. Well worth it.

Thursday, August 2, 2012

Taming Maven: A Local Repository Server

While Maven's automatic dependency retrieval is a great feature, it does have limitations. And one of the biggest of those limitations is Maven's inability to access local projects. Or, really, any projects that aren't found in the Maven Central repository.

Returning to the archetypal IT department of my previous posts, it's really painful if, before you can start working on your own project, you first have to check out and build multiple dependent projects. Worse is if those dependencies are under development, and you have to rebuild on a regular basis.

The first part of eliminating that pain is to run a local repository server. This server is a place to deploy your local builds, including any third-party software or local patches you've made to open source software. It can also act as a proxy for Maven Central and other external repositories, protecting you from Internet outages and keeping your POMs free of <repositories> entries.

You can create a Maven repository server using a bare-bones Apache web-server: all you need to do is make sure that its document root follows the repository format. However, there are better options: Nexus and Artifactory are both purpose-built servers for managing Maven artifacts, and both come in open-source variants that you can run for free. If you don't have a local machine, or don't want the hassle of administering it, Artifactory provides cloud hosting of your repository (for a fee). Sonatype doesn't go quite that far, instead providing a pre-built EC2 image (hopefully updated since that post).

Once you've got the repository server running, you need to configure Maven to access it. The simplest approach is to add your local server as a mirror for Maven Central, as described here. Note that you cannot simply add a <repositories> entry to your parent POM, as you need to deploy that POM to the repository server.
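The mirror entry goes in each developer's settings.xml, and looks something like this (the ID and URL are placeholders; point it at your own server):

<mirrors>
    <mirror>
        <id>local-repo</id>
        <name>Local repository server</name>
        <url>http://repo.example.com/repository/public/</url>
        <mirrorOf>central</mirrorOf>
    </mirror>
</mirrors>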

Now you face the question of how to deploy your builds. Both Nexus and Artifactory give you a web interface to manually upload artifacts, but it's far easier to use the Maven deploy goal to deploy directly from your build (using an HTTP connection to the server). Of course, that raises the issue of credentials: do you give each developer his/her own credentials (which are stored in $HOME/.m2/settings.xml), or use a single set of credentials for all?

I'm in favor of the latter: use one set of credentials, stored either in each user's personal settings file or in the global settings file. While that may make some people cringe, the security risk is minimal: the shared credentials can only write, and the server controls where those writes go. As long as you don't pass out the actual admin login, or use SCP to deploy, the worst a disgruntled ex-employee can do is upload new builds.

And even that minor risk can be eliminated if your developers never have to type “deploy” — and they'd be happier too. Instead, set up a continuous integration server that examines your source repository for changes and automatically builds and deploys the changed projects. At least for snapshot builds, this ensures that all of your developers will be using the latest codebase, without any manual intervention whatsoever.
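To make the deployment configuration concrete, here's a minimal sketch: the shared credentials live in a <servers> entry in settings.xml, matched by ID to the <distributionManagement> section of the POM. All of the IDs, URLs, and credentials below are placeholders.

<!-- settings.xml -->
<servers>
    <server>
        <id>local-releases</id>
        <username>deployer</username>
        <password>not-a-real-password</password>
    </server>
    <!-- a matching entry with ID "local-snapshots" is needed for snapshot deploys -->
</servers>

<!-- project (or parent) POM -->
<distributionManagement>
    <repository>
        <id>local-releases</id>
        <url>http://repo.example.com/repository/releases</url>
    </repository>
    <snapshotRepository>
        <id>local-snapshots</id>
        <url>http://repo.example.com/repository/snapshots</url>
    </snapshotRepository>
</distributionManagement>

With this in place, mvn deploy sends the artifact to the release or snapshot repository, depending on whether the project's version ends in -SNAPSHOT.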

Wednesday, August 1, 2012

Taming Maven: Dependency Management

Once you have a parent POM, you can add a <dependencyManagement> section. In my view, it's often more trouble than it's worth (in fact, this blog series started out as a single post suggesting properties were usually a better choice). In the worst case, a dependency management section can prevent your child builds from seeing new dependencies. There is, however, one case where it's useful: keeping unwanted transitive dependencies out of your build.

For those who haven't seen <dependencyManagement> in use, here's an abbreviated example:

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${springframework.version}</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-webmvc</artifactId>
            <version>${springframework.version}</version>
        </dependency>
    </dependencies>
</dependencyManagement>

With the dependency version specified in the parent's dependency management section, it can be omitted in the child's <dependencies>:

<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-core</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-webmvc</artifactId>
    </dependency>
</dependencies>

If that were all there was to it, there would be very little reason to use a dependency management section rather than defining version properties in the parent — it might actually be a net increase in the size of your POMs. But dependency management goes a bit deeper: it also overrides the transitive dependencies associated with your direct dependencies. This bears further examination.

If you use the Spring framework, you know that it's broken into lots of pieces. You don't need to specify all of these as direct dependencies, because each component brings along its own transitive dependencies (I haven't verified this, but I believe spring-webmvc is the only direct dependency you need for a Spring web-app).

Now consider the case where the parent POM has a dependency management section that lists all of the Spring components, and gives them version 3.0.4.RELEASE, while a child uses version 3.1.1.RELEASE, and just specifies spring-webmvc as a direct dependency. When Maven builds the child and retrieves transitive dependencies, it will ignore the 3.1.1.RELEASE implied by the direct dependency, and instead load the 3.0.4.RELEASE versions specified by the parent.
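Here's a sketch of the child POM in that scenario, with the versions from the paragraph above:

<!-- child POM: explicitly asks for the 3.1.1.RELEASE version of spring-webmvc ... -->
<dependencies>
    <dependency>
        <groupId>org.springframework</groupId>
        <artifactId>spring-webmvc</artifactId>
        <version>3.1.1.RELEASE</version>
    </dependency>
</dependencies>
<!-- ... but the parent's <dependencyManagement> pins spring-core and friends at
     3.0.4.RELEASE, so the resulting classpath mixes the two versions -->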

This is rarely a Good Thing. Sometimes it won't cause an actual bug: if you are using features of the new version that haven't changed since the old version, you have nothing to worry about. But more often, you'll get a NoSuchMethodError thrown at runtime. Or worse, the method is present but does something unexpected. These sorts of bugs can be incredibly painful to track down.

Version properties, of course, go a long way toward keeping these errors from occurring. But some projects will need to specify their own version properties, often because they're trying out some new functionality.

There is, however, one case where a dependency management section is useful: excluding transitive dependencies. Again using Spring as an example: it will use either commons-logging or SLF4J for its internal logging; at runtime, it figures out which is available. However, as of this writing (version 3.1.2.RELEASE), spring-core has a non-optional transitive dependency on commons-logging. Which means that your program will also have a transitive dependency on commons-logging — and if your program is a web-app, you'll find commons-logging in its deployed WAR whether you want it or not.

Perhaps some day the Spring developers will change the scope of this dependency to provided. Until then, if you don't want commons-logging you need to manually break the transitive dependency with an exclusion:

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-core</artifactId>
    <version>${springframework.version}</version>
    <exclusions>
        <exclusion>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
        </exclusion>
    </exclusions>
</dependency>

Without a dependency management section in the parent POM, you would have to repeat that exclusion in every project POM. Miss just one, and the transitive dependency appears. Move the exclusion into the parent's dependency management section, and it applies to all children. Of course, this locks down the version number; any child projects that need a different version must specify a direct dependency on spring-core.
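In the parent, that looks like the following (a sketch using the same version property as the earlier examples):

<dependencyManagement>
    <dependencies>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>${springframework.version}</version>
            <!-- every child that uses spring-core now inherits this exclusion -->
            <exclusions>
                <exclusion>
                    <groupId>commons-logging</groupId>
                    <artifactId>commons-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>
</dependencyManagement>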

Bottom line: don't use a <dependencyManagement> section unless you absolutely have to. And even then, keep it as small as you possibly can.

Tuesday, July 31, 2012

Taming Maven: Parent POMs

When faced with a development environment that has dozens or hundreds of distinct projects, version properties are only a first step. Even if you only have to look at one place in each POM, it's a headache to update versions. As a solution, you can specify all of your common dependencies and configuration in a “parent” POM, which is then referenced by each project.

Before continuing, I want to clear up a misconception: parent POMs are not the same as multi-module projects, even though they're described together in the Maven documentation. True, the two are often seen together: a multi-module project almost always uses a parent POM to bring order to its modules. But the parent need not be tied to the modules; an example is the Sonatype parent POM, which is used by every open-source project that deploys to Maven Central via the Sonatype repository.

A parent POM looks like a normal POM, but specifies a packaging type of “pom”:

<groupId>org.sonatype.oss</groupId>
<artifactId>oss-parent</artifactId>
<version>7</version>
<packaging>pom</packaging>

The children of this POM then reference it via a <parent> element:

<parent>
    <groupId>org.sonatype.oss</groupId>
    <artifactId>oss-parent</artifactId>
    <version>7</version>
</parent>

<groupId>net.sf.kdgcommons</groupId>
<artifactId>kdgcommons</artifactId>
<version>1.0.7-SNAPSHOT</version>

So what goes into the parent POM? Version properties, of course; one of the main reasons for using a parent POM is to ensure that all projects use the same set of dependencies. Also common plugin configuration, such as the compiler, test runner, and any reporting plugins. Finally, any common environment configuration, such as repositories and deployment configuration.

What shouldn't go in the parent POM is an actual <dependencies> section, because that will cause all of your projects to have the same set of dependencies, whether they need them or not. Nor should you add plugins that only run for one or a few projects (although by all means specify the plugin versions). And finally, if your projects use an <scm> section, it needs to go in the individual project POMs — I learned the hard way that Maven won't substitute project-specific values into a section defined by the parent.
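For those plugin versions, a <pluginManagement> section does for plugins what <dependencyManagement> does for dependencies: it sets version and configuration defaults without forcing the plugin into every child's build. Here's a minimal sketch; the plugin version is illustrative:

<build>
    <pluginManagement>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.5.1</version>
                <configuration>
                    <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </pluginManagement>
</build>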

The biggest complaint that I've heard about parent POMs is “if we change a dependency, then we have to update all the projects that use that parent!” That's true: the parent is a released artifact, just like the projects themselves; a child specifies a particular version of its parent, and is not automagically updated when the parent changes (unless you use snapshot versions for the parents).

My answer to this complaint is “either it matters or it doesn't, and either way the parent helps you.” There are times when changes don't matter: for example, if you move to a new library version that's backwards compatible. In that case, projects that use the new parent get the new version, as do any projects that link with them, via transitive dependencies. Projects that don't need the new functionality don't need to be updated. Over time, you can migrate these projects to the new POM as you make changes to them.

On the other hand, sometimes the change matters: for example, you've modified your database schema and need to update all projects that use the affected business objects. In this case, the parent again makes your life easier: once you update the dependency property in the parent, it's a simple matter of grepping for that property to find the children that need to be updated and re-released.

Monday, July 30, 2012

Taming Maven: Version Properties

Getting started with Maven is easy, and once you use its dependency management feature, you'll wonder why you waited so long. For simple web-apps or single-module projects, it Just Works.

However, most software developers aren't working on simple, one-module projects. We work in organizations that manage many projects, often depending on one another. And in this situation, the basic Maven project breaks down: you find that every project has a different set of dependencies, some of them incompatible. This is the first in a series of postings about taming a multi-project development environment.

To start, replace all of your hardcoded dependency versions with properties.

Projects accumulate dependencies over time: you might start out with a few of the core Spring packages, then add a few of the Apache Commons projects, then a few more Spring projects, then some libraries that another part of your organization maintains. Pretty soon you'll have dozens of dependencies, completely unordered. Just finding a dependency in the mess becomes difficult, even if you have a tool like m2eclipse. And it becomes very easy to have two related dependencies — or even duplicate dependencies — with different versions. Maven can resolve most of these problems automagically, but when it fails, you're in for a long and painful diagnosis session.

But if you use properties for your dependency versions, and adopt a consistent naming strategy for those properties, then even if you still can't find a given dependency reference easily, at least the versions will be defined in one place. Start by adding a <properties> section to your POM; I generally place it near the top of the POM, before the <build> and <dependencies> sections (both of these tend to be long).

<properties>
    <org.springframework.version>3.1.1.RELEASE</org.springframework.version>
    <!-- and so on, for all of your dependencies -->
</properties>

Each property is its own element, and the element name is the property name. You can name your properties anything you want (as long as it's a legal XML element name), but for version properties I think that GROUPID.version makes the most sense. Or use GROUPID.ARTIFACTID.version if there are different artifacts for the same group that don't have a common version (for example, Spring Core and Spring Security).
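A sketch of the two patterns side by side; the Commons coordinates and versions are purely illustrative:

<properties>
    <!-- artifacts in this group share a version, so one property suffices -->
    <org.springframework.version>3.1.1.RELEASE</org.springframework.version>
    <!-- artifacts that version independently get a property apiece -->
    <org.apache.commons.commons-lang3.version>3.0.1</org.apache.commons.commons-lang3.version>
</properties>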

Next, update the dependency to use that property, rather than a hardcoded version number.

<dependency>
    <groupId>org.springframework</groupId>
    <artifactId>spring-webmvc</artifactId>
    <version>${org.springframework.version}</version>
</dependency>

Once all of your POMs use version properties, you can start to organize dependencies across projects. This can be as simple as running grep to find all projects that use a particular property. But Maven gives you a better solution, which will be the topic of tomorrow's post.


This series of posts was prompted by several recent projects where I worked with development organizations that had large project bases built on Maven. I was originally planning to show some of the Vim macros that I used to clean up POMs, but decided instead to start work on a tool to do that job.

Friday, February 17, 2012

Setting up a Subversion Repository for a Small Team

Most organizations that use Subversion for source control set up an Apache server in front of the repository. For a small team, this can be daunting, especially if nobody on the team has experience running and configuring Apache. Even in a larger organization, with experienced Apache admins, I question the benefit, especially if you rely on Apache's built-in authentication mechanism: it means keeping a private list of Subversion users, separate from your organization's main user list, and one that's unlikely to get updated every time someone joins or leaves the organization.

There is an alternative: Subversion comes with the svnserve server, which listens on port 3690 and speaks its own protocol with Subversion clients. It's deceptively easy to use:

# on the server
svnserve -d -r /opt/svn/repo/

# and on the client ...
svn co svn://localhost workdir

But there are several problems with running svnserve like this. First, you manually started the server, and you have to restart it if the system ever goes down. While you can configure inetd to automatically start the server, that adds a task for your sysadmins. And unless your developers are also your sysadmins, this will take time and may be a point of contention.

A bigger problem is that running svnserve in this way means that all commits are anonymous. The output of svn annotate will be meaningless; worse, anybody who can physically access your repository will be able to update it. Clearly, you wouldn't want to expose this repo through the corporate firewall. And while svnserve provides several options for authentication, they take you back to a second user database.

But there's an alternative: identify the repository using a svn+ssh URL. This tells a client to open an SSH connection to the server, and run svnserve over that connection.

svn co svn+ssh://localhost/opt/svn/repo workdir

There is a lot to recommend this approach. First, it uses the server's own user database for authentication and authorization; no more anonymous commits. Second, system administrators are a lot more willing to open the SSH port to the outside world, rather than some custom server that has unknown security holes. Third, an experienced user can set up public key authentication, making the whole login process transparent.

The biggest drawback to running Subversion in this way is that, because each user connects as him or herself, all users need to be able to physically write files in the Subversion repository. In a small team, where everybody knows everybody else, I don't see this as a big issue. However, if you're concerned about security, you can take the following steps:

  • Use a dedicated host for the Subversion repository, and configure SSH to limit “ordinary” users to only run svnserve (you'll still need full access for admins).
  • Create a “subversion” user group — or, for larger teams, a project group — and assign it to the people who are to have access.

Once you've created the repository, you need to grant write permissions on some of the files in it. Assuming that you're creating repositories in the /opt/svn directory, and have a subversion group, here are the commands to set up your repository (run by root):

cd /opt/svn
svnadmin create repo

# there are some files in the repository that get created from the first commit;
# having root create the standard trunk/tags/branches layout gets you that commit

svn co file://`pwd`/repo working
cd working
svn mkdir trunk
svn mkdir tags
svn mkdir branches
svn commit -m "create initial structure"
cd ..
rm -rf working

# now we can modify the actual repository 

chgrp -R subversion repo
cd repo/db
chmod g+w .
chmod -R g+w current min-unpacked-rev rep-cache.db revprops revs transactions txn-* write-lock

If you've changed all the files correctly, running ls -al on the /opt/svn/repo/db directory should show you this:

total 56
drwxrwsr-x 6 root subversion 4096 Feb 11 22:02 .
drwxr-xr-x 6 root subversion 4096 Feb 11 21:53 ..
-rw-rw-r-- 1 u1   subversion    2 Feb 11 22:02 current
-r--r--r-- 1 root subversion   22 Feb 11 21:53 format
-rw-r--r-- 1 root subversion 1920 Feb 11 21:53 fsfs.conf
-rw-r--r-- 1 root subversion    5 Feb 11 21:53 fs-type
-rw-rw-r-- 1 root subversion    2 Feb 11 21:53 min-unpacked-rev
-rw-rw-r-- 1 root subversion 4096 Feb 11 21:54 rep-cache.db
drwxrwsr-x 3 root subversion 4096 Feb 11 21:53 revprops
drwxrwsr-x 3 root subversion 4096 Feb 11 21:53 revs
drwxrwsr-x 2 root subversion 4096 Feb 11 22:02 transactions
-rw-rw-r-- 1 u1   subversion    2 Feb 11 22:02 txn-current
-rw-rw-r-- 1 root subversion    0 Feb 11 21:53 txn-current-lock
drwxrwsr-x 2 root subversion 4096 Feb 11 22:02 txn-protorevs
-rw-r--r-- 1 root subversion   37 Feb 11 21:53 uuid
-rw-rw-r-- 1 root subversion    0 Feb 11 21:53 write-lock

Monday, August 9, 2010

Ant, Taskdef, and running out of PermGen

Although I've switched to Maven for building Java projects (convention over configuration ftw), I still keep Ant in my toolbox. It excels at the sort of free-form non-Java projects that most people implement using shell scripts.

One reason that Ant excels at these types of projects is that you can easily implement project-specific tasks such as a database extract, and mix those tasks with the large library of built-in tasks like filter or mkdir. And the easiest way to add your tasks to a build file is with a taskdef:

    <taskdef name="example"
             classname="com.kdgregory.example.ant.ExampleTask"
             classpath="${basedir}/lib/mytasks.jar"/>

Last week I was working on a custom task that would retrieve data by US state. I invoked it with the foreach task from the ant-contrib library, so that I could build a single file from all 50 states. Since I expected the build to take several hours to run, I kicked it off before leaving work for the day.

The next morning, I saw that it had failed about 15 minutes in, having run out of permgen space. And the error happened when it was loading a class. At first I suspected the foreach task, or more likely, the antcall that it invoked. After all, it creates a new project, so what better place to create a new classloader? Plus, it was in the stack trace.

But as I looked through the source code for these tasks, I couldn't see any place where a new classloader was created (another reason that I like Ant is that its source is generally easy to follow). That left the taskdef — after all, I knew that my code wasn't creating a new classloader. To test, I created a task that printed out its classloader, and used the following build file:

<project default="default" basedir="..">

    <taskdef name="example1"
             classname="com.kdgregory.example.ant.ExampleTask"
             classpath="${basedir}/classes"/>
    <taskdef name="example2"
             classname="com.kdgregory.example.ant.ExampleTask"
             classpath="${basedir}/classes"/>

    <target name="default">
        <example1 />
        <example2 />
    </target>

</project>

Sure enough, each taskdef is loaded by its own classloader. The antcall simply exacerbates the problem, because it executes the taskdefs all over again.

It makes sense that Ant would create a new classloader for each project, and even for each taskdef within a project (they can, after all, have unique classpaths). And as long as the classloader is referenced only from the project, it — and the classes it loads — will get collected at the same time as the project. And when I looked in the Project class, I found the member variable coreLoader.

But when I fired up my debugger, I found that that variable was explicitly set to null and never updated. Then I put a breakpoint in ClasspathUtils, and saw that it was being invoked with a “reuse” flag set to false. The result: each taskdef gets its own classloader, and they're never collected.

I think there's a bug here: not only is the classloader not tied to the project object, it uses the J2EE delegation model, in which a classloader attempts to load classes from its own classpath before asking its parent for the class. However, the code makes me think that this is intentional. And I don't understand project life cycles well enough to know what would break with what I feel is the “correct” implementation.

Fortunately, there's a work-around.

As I was reading the documentation for taskdef, I saw a reference to antlibs. I remembered using antlibs several years ago, when I was building a library of a dozen or so tasks, and didn't want to copy-and-paste the taskdefs for them. And then a lightbulb lit: antlibs must be available on Ant's classpath. And that means that they don't need their own classloader.

To use an antlib, you create the file antlib.xml, and package it with the tasks themselves:

<antlib>
    <taskdef name="example1" classname="com.kdgregory.example.ant.ExampleTask"/>
    <taskdef name="example2" classname="com.kdgregory.example.ant.ExampleTask"/>
</antlib>

Then you define an “antlib” namespace in your project file, and refer to your tasks using that namespace. The namespace specifies the package where antlib.xml can be found (by convention, the top-level package of your task library).

<project default="default" 
    xmlns:ex="antlib:com.kdgregory.example.ant">

    <target name="default">
        <ex:example1 />
        <ex:example2 />
        <antcall target="example"/>
    </target>

    <target name="example">
        <ex:example1 />
        <ex:example2 />
    </target>   

</project>

It's extra effort, but the output makes the effort worthwhile:

ant-classloader-example, 528> ant -lib bin -f build2.xml
Buildfile: /home/kgregory/tmp/ant-classloader-example/build2.xml

default:
[ex:example1] project:     org.apache.tools.ant.Project@110b053
[ex:example1] classloader: java.net.URLClassLoader@a90653
[ex:example2] project:     org.apache.tools.ant.Project@110b053
[ex:example2] classloader: java.net.URLClassLoader@a90653

example:
[ex:example1] project:     org.apache.tools.ant.Project@167d940
[ex:example1] classloader: java.net.URLClassLoader@a90653
[ex:example2] project:     org.apache.tools.ant.Project@167d940
[ex:example2] classloader: java.net.URLClassLoader@a90653

BUILD SUCCESSFUL
Total time: 0 seconds

Bottom line: if you're running out of permgen while running Ant, take a look at your use of taskdef, and see if you can replace it with an antlib. (At least one other person has run into similar problems; if you're interested in the sample code, you can find it here.)

Tuesday, April 6, 2010

The Trouble with Wiki

If the comments don't match the code, they're both wrong.

I've been thinking about this maxim a lot. I recently started a new job, and have been busy learning the codebase. One of the tools that I've been using is the corporate wiki, and once again I'm faced with “wiki rot”: pages that are no longer relevant, or have broken links, or were started with the best of intentions and then abandoned. I suppose I should feel lucky: I know one company that stored status reports in their wiki; the chance of a search returning anything useful at that company was pretty low.

Don't get me wrong. I think that wikis were a great invention. Wikipedia is my first stop for answers on a wide range of topics. And I learned much of what I know about patterns and Agile methodology from Ward's wiki (although by the time I started reading, much of the original community had seemingly vanished, replaced by a lot of insecure whiners).

At its best, a wiki gives you a place to write notes that can be shared within a team. It's a natural transition from the “project notebook” that Brooks recommended in The Mythical Man Month. But as long-term project documentation? Not so good.

Projects evolve, and all too often wiki pages don't. After all, the people who spend their days working on a project know how it works. They know the environment variables required for a successful build, and the configuration needed for a successful run. And when new people join the project, it's easier (once you realize the wiki is wrong) to just ask someone. And no-one has time to update the wiki …

What's to be done? Wikipedia has one answer: a large community of people who passionately update pages. But the key word there is “large.” In a corporate environment, you simply don't have this community. And if you have a corporate practice of putting everything into the wiki, it's not going to be useful even if you do have a community of dedicated maintainers.

I think that a solution for corporate wikis starts with organization by project or product rather than organizational unit. This change is subtle, meant to tap into the developers' feelings of craftsmanship: “these pages are the public face of my work.” An empty project wiki stares team members in the face in the same way “fixme” comments do in code — admittedly, that doesn't work for all teams.

A second, perhaps more important step is to archive all pages that haven't been edited in the last year. Keep them available, but not easily available: don't show them in basic searches, and highlight any links from other pages. In the best case, this will prompt team members to keep their pages up-to-date. In the worst, it will keep casual users from wasting time with incorrect documentation — which is valuable in itself.

Friday, March 26, 2010

Distributed Version Control

I've been using Mercurial for the past couple of weeks: I'm working on the sort of small, “skunkworks” project that doesn't belong in the company's main version control system. Normally I'd create my own Subversion repository to hold it. In this case, I thought my coworkers might want to run and modify it (which they did), so distributed version control (DVCS) seemed like a good idea. I picked Mercurial primarily because Git confused me when I tried it. Mercurial has a simple interface, with many of the same commands used by Subversion, and the built-in web-server is nice for distributing changes.

Before continuing, I should say that I'm a Luddite when it comes to version control. It's not that I don't like using a VCS — indeed, I think that any team or project that doesn't use some form of version control is on a short road to unhappiness. But I'm not one to jump to the newest and shiniest VCS. I was using SCCS until long after CVS became available (working directories? we don't need no stinkin' working directories!), and it wasn't until this year that I converted the last of my personal CVS repositories to Subversion (I kept forgetting which commands went with which projects).

So I'm naturally reluctant to jump on the DVCS bandwagon. But more than that, I don't see a clear use case for a typical corporate development team.

There are definitely some situations where DVCS is appropriate: my skunkworks project is one. Linux development is another. But the reason that DVCS is appropriate in these situations is because everyone's off doing their own thing. In the case of Linux development, you have thousands (?) of individual developers making little tweaks; if those tweaks are perceived as a Good Thing, they'll eventually make their way into the (central) repository owned by Linus. On the other hand, you have dozens of distributions, all pulling from that repository and adding their own pieces. Coordination is ad hoc; development is largely (to borrow a phrase) a Team of One.

In a corporate environment, however, you don't often have people doing their own thing (unless you have a very dysfunctional team). Instead, everyone is working on (possibly overlapping) pieces of a common codebase. To me, this implies that you want one “authoritative” copy of that codebase: you don't want to guess who has the “best” copy on any given day. And once you admit the need for some level of centralization, the checkin-merge-push cycle demanded by a distributed VCS seems like a lot of effort compared to the update-checkin cycle of a centralized VCS.

“But I can work on a plane!” I've actually heard that argument. At the time, my response was “I didn't think you traveled that often.” But a better response is “why do we care about your commits?” And that's the main point of this post.

“Check-in early, check-in often” is a great motto to live by. Especially early in a project, when you're experimenting. It's nice to be able to roll back to what you were doing an hour ago, or roll forward some pieces you thought you didn't need but now realize you do. If you use continuous integration, you'll check in at least daily.

However, frequent checkins create a quandary: once your time horizon moves past a few hours, you don't want all those commit messages in your repository! Think about the last time that you looked at a repository log: why did you do it? Probably to find when and why a particular file changed; maybe several changes, widely spaced in time. I've done that a lot, and stumbling over dozens of minor checkins (even after using annotate) is a royal pain.

One way to work around this issue (when it's even considered an issue) is to use “feature branches”: all new development takes place on a branch, and then gets merged back to the trunk when completed. This is the way that I learned how to do multi-person source management some 15 years ago, using ClearCase. Unfortunately, many teams are scared by feature branches, or perhaps they're scared of the merge at the end, so they stick with the “CVS” approach of doing all development on the trunk, with branches confined to bugfix duty. And at the other extreme, standard DVCS procedure is that the repository is the branch — in other words, once you've pushed, your trunk has a bunch of undifferentiated revisions.

My alternative is something I call “ghetto DVCS”: cloning a central repository, and working in the clone. When you're done, you copy your changes into a working directory, and check them into the “real” repository. Ironically, I started doing this while I was traveling, and wanted to use source control without access to the central repository. I decided that I liked it enough to continue even when I wasn't traveling. For example, Practical XML revision 87 represents close to two weeks worth of work, during which I added, deleted, or renamed files many times. If I had been working directly in the central repository, there would be a dozen or more revisions.

The problem with my approach is that it can be a lot of work, particularly if you rename existing files. And this is where I think DVCS offers promise: since they track the ultimate parent revision for each file in your local repository, they could easily merge the tip with this parent and ignore the interim revisions (and maybe Git can do this; as I said, it confused me, but tip merges make a lot of sense for the Linux development model).

Until that happens, I'm sticking with svn.


Update: my friend Jason pointed out that git pull has a --squash option. Which makes a lot of sense, given the pull-based nature of the Linux development model. It's not the "push tip" that I want, but it's closer. Now I just have to figure out how to use the rest of git.