Friday, March 26, 2010

Distributed Version Control

I've been using Mercurial for the past couple of weeks: I'm working on the sort of small, “skunkworks” project that doesn't belong in the company's main version control system. Normally I'd create my own Subversion repository to hold it. In this case, I thought my coworkers might want to run and modify it (which they did), so distributed version control (DVCS) seemed like a good idea. I picked Mercurial primarily because Git confused me when I tried it. Mercurial has a simple interface, with many of the same commands used by Subversion, and the built-in web server is nice for distributing changes.

Before continuing, I should say that I'm a Luddite when it comes to version control. It's not that I don't like using a VCS — indeed, I think that any team or project that doesn't use some form of version control is on a short road to unhappiness. But I'm not one to jump to the newest and shiniest VCS. I was using SCCS until long after CVS became available (working directories? we don't need no stinkin' working directories!), and it wasn't until this year that I converted the last of my personal CVS repositories to Subversion (I kept forgetting which commands went with which projects).

So I'm naturally reluctant to jump on the DVCS bandwagon. But more than that, I don't see a clear use case for a typical corporate development team.

There are definitely some situations where DVCS is appropriate: my skunkworks project is one. Linux development is another. But DVCS is appropriate in these situations because everyone's off doing their own thing. In the case of Linux development, you have thousands (?) of individual developers making little tweaks; if those tweaks are perceived as a Good Thing, they'll eventually make their way into the (central) repository owned by Linus. On the other hand, you have dozens of distributions, all pulling from that repository and adding their own pieces. Coordination is ad hoc; development is largely (to borrow a phrase) a Team of One.

In a corporate environment, however, you don't often have people doing their own thing (unless you have a very dysfunctional team). Instead, everyone is working on (possibly overlapping) pieces of a common codebase. To me, this implies that you want one “authoritative” copy of that codebase: you don't want to guess who has the “best” copy on any given day. And once you admit the need for some level of centralization, the checkin-merge-push cycle demanded by a distributed VCS seems like a lot of effort compared to the update-checkin cycle of a centralized VCS.

“But I can work on a plane!” I've actually heard that argument. At the time, my response was “I didn't think you traveled that often.” But a better response is “why do we care about your commits?” And that's the main point of this post.

“Check in early, check in often” is a great motto to live by, especially early in a project, when you're experimenting. It's nice to be able to roll back to what you were doing an hour ago, or roll forward some pieces you thought you didn't need but now realize you do. If you use continuous integration, you'll check in at least daily.

However, frequent checkins create a quandary: once your time horizon moves past a few hours, you don't want all those commit messages in your repository! Think about the last time that you looked at a repository log: why did you do it? Probably to find when and why a particular file changed; maybe several changes, widely spaced in time. I've done that a lot, and stumbling over dozens of minor checkins (even after using annotate) is a royal pain.

One way to work around this issue (when it's even considered an issue) is to use “feature branches”: all new development takes place on a branch, and then gets merged back to the trunk when completed. This is the way that I learned how to do multi-person source management some 15 years ago, using ClearCase. Unfortunately, many teams are scared by feature branches, or perhaps they're scared of the merge at the end, so they stick with the “CVS” approach of doing all development on the trunk, with branches confined to bugfix duty. And at the other extreme, standard DVCS procedure is that the repository is the branch — in other words, once you've pushed, your trunk has a bunch of undifferentiated revisions.

My alternative is something I call “ghetto DVCS”: cloning a central repository, and working in the clone. When you're done, you copy your changes into a working directory, and check them into the “real” repository. Ironically, I started doing this while I was traveling, and wanted to use source control without access to the central repository. I decided that I liked it enough to continue even when I wasn't traveling. For example, Practical XML revision 87 represents close to two weeks' worth of work, during which I added, deleted, or renamed files many times. If I had been working directly in the central repository, there would be a dozen or more revisions.
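As a rough sketch of the copy step, here is one way to work out what has to be carried from the clone into the “real” working directory. This is an illustration only, not the procedure used for Practical XML: the function names (`snapshot`, `plan_sync`) and the list of metadata directories to skip are assumptions made for the example.

```python
import hashlib
import os

# Assumed names of VCS metadata directories to ignore when comparing trees.
METADATA_DIRS = {".svn", ".hg", ".git"}


def snapshot(root):
    """Map each relative file path under root to a hash of its contents."""
    files = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune metadata directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in METADATA_DIRS]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as fh:
                files[rel] = hashlib.sha256(fh.read()).hexdigest()
    return files


def plan_sync(clone_dir, working_dir):
    """Return the adds, modifications, and deletes needed in working_dir."""
    src = snapshot(clone_dir)
    dst = snapshot(working_dir)
    return {
        "add": sorted(p for p in src if p not in dst),
        "modify": sorted(p for p in src if p in dst and src[p] != dst[p]),
        "delete": sorted(p for p in dst if p not in src),
    }
```

Note that a comparison like this sees a renamed file as one delete plus one add, so any rename history has to be recreated by hand in the central repository.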

The problem with my approach is that it can be a lot of work, particularly if you rename existing files. And this is where I think DVCS offers promise: since a DVCS tracks the ultimate parent revision for each file in your local repository, it could easily merge the tip with this parent and ignore the interim revisions (and maybe Git can do this; as I said, it confused me, but tip merges make a lot of sense for the Linux development model).
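To make the “merge the tip with the parent” idea concrete, here is a toy model. It is not how Mercurial or Git store history; it simply treats a revision as a mapping from path to new content (or `None` for a delete), and the names `apply_revision` and `squash`, along with the sample file names, are invented for the illustration.

```python
def apply_revision(tree, changes):
    """Return a new tree with one revision's changes applied.

    `changes` maps a path to its new content, or to None for a delete.
    """
    result = dict(tree)
    for path, content in changes.items():
        if content is None:
            result.pop(path, None)
        else:
            result[path] = content
    return result


def squash(parent, revisions):
    """Collapse a list of revisions into one net change against parent."""
    tip = parent
    for changes in revisions:
        tip = apply_revision(tip, changes)
    net = {}
    for path in sorted(set(parent) | set(tip)):
        if parent.get(path) != tip.get(path):
            net[path] = tip.get(path)  # None means the file is gone at the tip
    return net


parent = {"Parser.java": "v1", "Util.java": "v1"}
interim = [
    {"Scratch.java": "experiment"},               # added, later abandoned
    {"Parser.java": "v2"},
    {"Scratch.java": None, "Parser.java": "v3"},
    {"Util.java": None, "Helpers.java": "v1"},    # a rename, seen as delete + add
]
net_change = squash(parent, interim)
print(net_change)
```

Four interim revisions collapse into a single change set: the abandoned scratch file never appears, and only the final state of each surviving file is recorded.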

Until that happens, I'm sticking with svn.


Update: my friend Jason pointed out that git pull has a --squash option. Which makes a lot of sense, given the pull-based nature of the Linux development model. It's not the “push tip” that I want, but it's closer. Now I just have to figure out how to use the rest of git.