This is the script for a talk that I'm planning to give tomorrow at BarCamp Philly. The talk is a “live committing” exercise, and this post contains all the information needed to follow along. With some links to relevant source material and minus my pauses, typos, and attempts at humor.
Object Management
I think the first thing to understand about Git is that it's not strictly a source control system; it's more like a versioned filesystem that happens to be good at source control. Traditionally, source control systems focused on the evolution of files. For example, RCS (and its successor CVS) maintain a separate file in the repository for each source file; these repository files hold the entire history of the file, as a sequence of diffs that allow the tool to reconstruct any version. Subversion applies the idea of diffs to the entire repository, allowing it to track files as they move between directories.
Git takes a different approach: rather than constructing the state of the repository via diffs, it maintains snapshots of the repository and constructs diffs from those (if you don't believe this, read on). This allows very efficient comparisons between any two points in history, but does consume more disk space. I think the key insight is not just that disk is cheap and programmer time expensive, but that real-world software projects don't have a lot of large files, and those files don't experience a lot of churn.
To see Git in action, we'll create a temporary directory, initialize it as a repository, and
create a couple of files. I should note here that I'm using bash
on Linux; if you're
running Windows you're on your own re commands. Anything that starts with
“>
” is a command that I typed; anything else is the response from the
system.
> mkdir /tmp/$$ > cd /tmp/$$ > git init Initialized empty Git repository in /tmp/13914/.git/ > touch foo.txt > mkdir bar > touch bar/baz.txt > git add * > git commit -m "initial revision" [master (root-commit) 37a649d] initial revision 2 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 bar/baz.txt create mode 100644 foo.txt
Running git log
shows you this commit, identified by its SHA-1 hash.
> git log commit 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e Author: Keith GregoryDate: Thu Nov 3 09:20:29 2016 -0400 initial revision
What you might not realize is that a commit is a physical object in the Git repository,
and the SHA-1 hash is actually the hash of its contents. Git has several different types
of objects, and each object is uniquely identified by the SHA-1 hash of its contents.
Git stores these objects under the directory .git/objects
, and the
find
command will help you explore this directory. Here I sort the results
by timestamp and then filename, to simplify tracing the changes to the repository.
> find .git/objects -type f -ls | sort -k 10,11 13369623 4 -r--r--r-- 1 kgregory kgregory 82 Nov 3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d 13369612 4 -r--r--r-- 1 kgregory kgregory 52 Nov 3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817 13369637 4 -r--r--r-- 1 kgregory kgregory 137 Nov 3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e 13369609 4 -r--r--r-- 1 kgregory kgregory 15 Nov 3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
OK, one commit created four objects. To understand what they are, it helps to have a picture:
Each commit represents a snapshot of the project: from the commit you can access all of the files and directories in the project as they appeared at the time of the commit. Each commit contains three pieces of information: metadata about the commit (who made it, when it happened, and the message), a list of parent commits (which is empty for the first commit but has at least one entry for every other commit), and a reference to the “tree” object holding the root of the project directory.
Tree objects are like directories in a filesystem: they contain a list of names and references to the content for each
name. In the case of Git, a name may either reference another tree object (in this example, the “bar
”
sub-directory), or a “blob” object that holds the content of a regular file.
As I said above, an object's SHA-1 is built from the object's content. That's why we have two files in the project but only one blob: because they're both empty files, the content is identical and therefore the SHA-1 is identical.
You can use the cat-file
command to look at the objects in the repository. Starting with the commit, here
are the four objects from this commit (the blob, being empty, doesn't have any output from this command):
> git cat-file -p 37a649dd6dec75cd68a2c3dfb7fa2948a0d3426e tree 2d2de60b0620e7ac574fa8050997a48efa469f5d author Keith Gregory1478179229 -0400 committer Keith Gregory 1478179229 -0400 initial revision
> git cat-file -p 2d2de60b0620e7ac574fa8050997a48efa469f5d 040000 tree 34707b133d819e3505b31c17fe67b1c6eacda817 bar 100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 foo.txt
> git cat-file -p 34707b133d819e3505b31c17fe67b1c6eacda817 100644 blob e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 baz.txt
> git cat-file -p e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
Now let's create some content. I'm generating random text with that I think is a neat hack: you get a stream of random
bytes from /dev/urandom
, then use sed
to throw away anything that you don't want. Since the
random bytes includes a newline every 256 bytes (on average), you get a file that can be edited just like typical source
code. One thing that's non-obvious: Linux by default would interpret the stream of bytes as UTF-8, meaning a lot of
invalid characters from the random source; explicitly setting the LANG
variable to an 8-bit encoding solves
this.
> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -10000 > foo.txt
> ls -l foo.txt -rw-rw-r-- 1 kgregory kgregory 642217 Nov 3 10:05 foo.txt
> head -4 foo.txt U4UIESn4HN61l6Is epQXlHSVaLpJGt N8opIkrSt5NQsWnqYYmt9BBmEWBVaaSVjzTTFJCXHT2vay2CoDT7J rm3f7CWefdgicOdfs0tUdgx OvqjwOykmKToWkd8nxWNtCCCkUi cxn3Bn5gN4Im38y cfS6IdXgIj9O6gBEGgBW6BcZJ 2BluWmwQYgyNFIHP8RUL8m2aAjM1FwcY8ZX9fvmvJi30p9sBEkq6giuoRvJSWRW8PLCsrEWfSXeZXxO2HK2IS3MFNpviKRagug3HE96I
When we commit this change, our object directory gets three new entries:
13369623 4 -r--r--r-- 1 kgregory kgregory 82 Nov 3 09:20 .git/objects/2d/2de60b0620e7ac574fa8050997a48efa469f5d 13369612 4 -r--r--r-- 1 kgregory kgregory 52 Nov 3 09:20 .git/objects/34/707b133d819e3505b31c17fe67b1c6eacda817 13369637 4 -r--r--r-- 1 kgregory kgregory 137 Nov 3 09:20 .git/objects/37/a649dd6dec75cd68a2c3dfb7fa2948a0d3426e 13369609 4 -r--r--r-- 1 kgregory kgregory 15 Nov 3 09:20 .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391 13369652 4 -r--r--r-- 1 kgregory kgregory 170 Nov 3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9 13369650 4 -r--r--r-- 1 kgregory kgregory 81 Nov 3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2 13369647 488 -r--r--r-- 1 kgregory kgregory 496994 Nov 3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e
In the diagram, you can see what happened: adding content to the file created a new blob. Since this had a different
SHA-1 than the original file content, it meant that we got a new tree to reference it. And of course, we have a new
commit that references that tree. Since baz.txt
wasn't changed, it continues to point to the original blob;
in turn that means that the directory bar
hasn't changed, so it can be represented by same tree object.
One interesting thing is to compare the size of the original file, 642,217 bytes, with the 496,994 bytes stored in
the repository. Git compresses all of its objects (so you can't just cat
the files). I generated random
data for foo.txt
, which is normally uncompressible, but limiting it to alphanumeric characters meant that
each character only takes fit in 6 bits rather than 8; compressing the file therefore saves roughly 25% of its space.
I'm going to diverge from the diagram for the next example, because I think it's important to understand that Git stores full objects rather than diffs. So, we'll add a few new lines to the file:
> cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt
When we look at the objects, we see that there is a new blob that is slightly larger than the old one, and that the commit object (c620754c) is nowhere near large enough to hold the change. Clearly, the commit does not encapsulate a diff.
13369652 4 -r--r--r-- 1 kgregory kgregory 170 Nov 3 10:07 .git/objects/16/c0d98f4476f088c46086583b9ebf76dee03bb9 13369650 4 -r--r--r-- 1 kgregory kgregory 81 Nov 3 10:07 .git/objects/bf/21b205ba0ceff1655ca5c8476bbc254748d2b2 13369647 488 -r--r--r-- 1 kgregory kgregory 496994 Nov 3 10:07 .git/objects/ef/49cb59aee81783788c17b5a024bd377f2b119e 13369648 492 -r--r--r-- 1 kgregory kgregory 501902 Nov 3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db 13369656 4 -r--r--r-- 1 kgregory kgregory 81 Nov 3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03 13369658 4 -r--r--r-- 1 kgregory kgregory 165 Nov 3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9
Before moving on, I want to call out Git's two-level directory structure. Filesystem directories are typically a linear list of files, and searching for a specific filename becomes a significant cost once you have more than a few hundred files. Even a small repository, however, may have 10,000 or more objects. A two-level filesystem is the first step to solving this problem: the top level consists of sub-directories with two-character names, representing the first byte of the object's SHA-1 hash. Each sub-directory only holds those objects whose hashes start with that byte, thereby partitioning the total search space.
Large repositories would still be expensive to search, so Git also uses “pack” files, stored in
.git/objects/pack
; each pack file contains some large number of commits, indexed for efficient access.
You can trigger this compression using git gc
, although that only affects your local repository. Pack
files are also used when retrieving objects from a remote repository, so your initial clone gives you a pre-packed
object directory.
Branches
OK, you've seen how Git stores objects, what happens when you create a branch?
> git checkout -b my-branch Switched to a new branch 'my-branch' > cat /dev/urandom | LANG=iso8859-1 sed -e 's/[^ A-Za-z0-9]//g' | head -100 >> foo.txt > git commit -m "add some content in a branch" foo.txt [my-branch 1791626] add some content in a branch 1 file changed, 100 insertions(+)
If you look in .git/objects
, you'll see another three objects, and at this point I assume that you
know what they are.
13369648 492 -r--r--r-- 1 kgregory kgregory 501902 Nov 3 10:18 .git/objects/58/2f9a9353fc84a6a3571d3983fbe2a0418007db 13369656 4 -r--r--r-- 1 kgregory kgregory 81 Nov 3 10:18 .git/objects/9b/671a1155ddca40deb84af138959d37706a8b03 13369658 4 -r--r--r-- 1 kgregory kgregory 165 Nov 3 10:18 .git/objects/c6/20754c351f5462bf149a93bdf0d3d51b7d91a9 13510395 4 -r--r--r-- 1 kgregory kgregory 175 Nov 3 11:50 .git/objects/17/91626ece8dce3c713261e470a60049553d5411 13377993 496 -r--r--r-- 1 kgregory kgregory 506883 Nov 3 11:50 .git/objects/57/c091e77d909f2f20e985150ea922c3ec303ab6 13377996 4 -r--r--r-- 1 kgregory kgregory 81 Nov 3 11:50 .git/objects/f2/c824ab04c8f902ab195901e922ab802d8bc37b
So how does Git know that this commit belongs to a branch? The answer is that Git stores that information elsewhere:
> ls -l .git/refs/heads/ total 8 -rw-rw-r-- 1 kgregory kgregory 41 Nov 3 10:18 master -rw-rw-r-- 1 kgregory kgregory 41 Nov 3 11:50 my-branch
> cat .git/refs/heads/master c620754c351f5462bf149a93bdf0d3d51b7d91a9 > cat .git/refs/heads/my-branch 1791626ece8dce3c713261e470a60049553d5411
That's it: text files that hold the SHA-1 of a commit. There is one additional file, .git/HEAD
, which says
which of those text files represent the “working” branch:
> cat .git/HEAD ref: refs/heads/my-branch
This isn't quite the entire story. For one thing, it omits tags, stored in .git/refs/tags
. Or
“unattached HEAD” state, when .git/HEAD
holds a commit rather than a ref. And most
important, remote branches and their relationship to local branches. But this post (and the talk) are long
enough as it is.
Merges
You've been making changes on a development branch, now it's time to merge those changes into master (or an
integration branch). It's useful to know what happens when you type git merge
,
Fast-Forward Merges
The simplest type of merge — indeed, you could argue that it's not really a merge at all, because it doesn't create new commits — is a fast-forward merge. This is the sort of merge that we'd get if we merged our example project branch into master.
> git checkout master Switched to branch 'master' > git merge my-branch Updating c620754..1791626 Fast-forward foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 100 insertions(+)
As I said, it's not really a merge at all; if you look in .git/objects
you'll see the same
list of files that were there after the last commit. What has changed are the refs:
> ls -l .git/refs/heads/ total 8 -rw-rw-r-- 1 kgregory kgregory 41 Nov 3 12:15 master -rw-rw-r-- 1 kgregory kgregory 41 Nov 3 11:50 my-branch
> cat .git/refs/heads/master 1791626ece8dce3c713261e470a60049553d5411 > cat .git/refs/heads/my-branch 1791626ece8dce3c713261e470a60049553d5411
Here's a diagram of what happened, showing the repository state pre- and post-merge:
Topologically, the “branch” represents a straight line, extending the last commit on
master
. Therefore, “merging” the branch is as simple as re-pointing the
master reference to that commit.
Squashed Merges
I'll start this section with a rant: one of the things that I hate, when looking at the history of a project, is to see a series of commits like this: “added foo”; “added unit test for foo”; “added another testcase for foo”; “fixed foo to cover new testcase”… Really. I don't care what you did to make “foo” work, I just care that you did it. And if your commits are interspersed with those of the person working on “bar” the evolution of the project becomes almost impossible to follow (I'll return to this in the next section).
Fortunately, this can be resolved easily with a squash merge:
> git merge --squash my-branch Updating c620754..1791626 Fast-forward Squash commit -- not updating HEAD foo.txt | 100 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 100 insertions(+)
The “not updting HEAD” part of this message is important: rather than commit the changes, a squash merge leaves them staged and lets you commit (with a summary message) once you've verified the changes. The diagram looks like this:
The thing to understand about squashed commits is that they're not actually merges: there's no
connection between the branch and master. As a result, “git branch -d my-branch
”
will fail, warning you that it's an unmerged branch; you need to force deletion by replacing
-d
with -D
.
To wrap up this section: I don't think you need to squash all merges, just the ones that merge a feature onto master or an integration branch. Use normal merges when pulling an integration branch onto master, or when back-porting changes from the integration branch to the development branch (I do, however, recommend squashing back-ports from master to integration).
“Normal” Merges, with or without conflicts
To understand what I dislike about “normal” merges, we need to do one. For this
we'll create a completely new repository, one where we'll compile our favorite quotes from
Lewis Carroll. We start by creating the file carroll.txt
in master
;
for this section I'll just show changes to the file, not the actual commits.
Jabberwocky Walrus and Carpenter
In order to work independently, one person will make the changes to Jabberwocky on a branch
named “jabber
”:
Jabberwocky Twas brillig, and the slithy toves Did gire and gimbal in the wabe All mimsy were the borogroves And the mome raths outgrabe Walrus and Carpenter
After that's checked in, someone else modifies the file on master
:
Jabberwocky Walrus and Carpenter The time has come, the Walrus said, To talk of many things: Of shoes, and ships, and sealing wax Of cabbages, and kings And why the sea is boiling hot And whether pigs have wings
Back on branch jabber
, someone has found the actual text and corrected
mistakes:
Jabberwocky 'Twas brillig, and the slithy toves Did gyre and gimble in the wabe All mimsy were the borogoves And the mome raths outgrabe Walrus and Carpenter
At this point we have two commits on jabber
, and one on master
after the branch was made. If you look at the commit log for jabber
you see
this:
commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad Author: Keith GregoryDate: Thu Nov 3 12:59:40 2016 -0400 fix typos in jabberwocky commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb Author: Keith Gregory Date: Thu Nov 3 12:41:35 2016 -0400 jabberwocky: initial verse commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6 Author: Keith Gregory Date: Thu Nov 3 12:39:41 2016 -0400 initial commit, titles without content
And on master
, this is the commit log:
commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c Author: Keith GregoryDate: Thu Nov 3 12:57:19 2016 -0400 favorite verse of Walrus commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6 Author: Keith Gregory Date: Thu Nov 3 12:39:41 2016 -0400 initial commit, titles without content
Now the question is: what happens when you merge jabber
onto master
?
In my experience, most people think that the branch commits are added into master based on
when they occurred. This is certainly reinforced by the post-merge commit log:
commit 8ba1b90996b4fb67529a2964836f8524a220f2d8 Merge: 5269b07 2e68e9b Author: Keith GregoryDate: Thu Nov 3 13:03:09 2016 -0400 Merge branch 'jabber' commit 2e68e9b9905b043b06166cf9aa5566d550dbd8ad Author: Keith Gregory Date: Thu Nov 3 12:59:40 2016 -0400 fix typos in jabberwocky commit 5269b0742e49a3b201ddc0255c50c37b1318aa9c Author: Keith Gregory Date: Thu Nov 3 12:57:19 2016 -0400 favorite verse of Walrus commit 5a7d657fb0c322d9d5aca5c7a9d8ef6eb690eaeb Author: Keith Gregory Date: Thu Nov 3 12:41:35 2016 -0400 jabberwocky: initial verse commit e2490ffad2ba7032367f94c0bf16ca44e4b28ee6 Author: Keith Gregory Date: Thu Nov 3 12:39:41 2016 -0400 initial commit, titles without content
But take a closer look at those commit hashes, and compare them to the hashes from the separate
branches. They're the same, which means that the commits are unchanged. In fact, git log
walks the parent references of both branches, making it appear that commits are on one branch
when they aren't. The diagram actually looks like this (yes, I know, there are two commits too many):
The merge created a new commit, which has two parent references. It also created a new blob to hold
the combined file, along with a modified tree. If we run cat-file
on the merge commit,
this is what we get:
tree 589187777a672868e50aefc491ed640125d1e3ed parent 5269b0742e49a3b201ddc0255c50c37b1318aa9c parent 2e68e9b9905b043b06166cf9aa5566d550dbd8ad author Keith Gregory1478192589 -0400 committer Keith Gregory 1478192589 -0400 Merge branch 'jabber'
So, what does it mean that git log
produces the illusion of a series of merged
commits? Consider what happens when you check out one of the commits in the list, for example
2e68e9b9
.
This was a commit that was made on the branch. If you check out that commit and look at the
commit log from that point, you'll see that commit 5269b074
no longer appears.
It was made on master
, in a completely different chain of commits.
In a complex series of merges (say, multiple development branches onto an integration branch, and several integration branches onto a feature branch) you can completely lose track of where and why a change was made. If you try to diff your way through the commit history, you'll find that the code changes dramatically between commits, and appears to flip-flop; you're simply seeing the code state on different branches.
Wrapping up: how safe is SHA-1?
“Everybody knows” that SHA-1 is a “broken” hash, so why is it the basis for storing objects in Git?
The answer to that question has two parts. The first is that SHA-1 is “broken” in terms of an attacker being able to create a false message that has the same SHA-1 hash as a real message: it takes fewer than the expected number of attempts (although still a lot!). This is problem if the message that you're hashing is involved in validating a server certificate. It's not a problem in the case of Git, because at worst the attacher would be able to replace a single object — you might lose one file within a commit, or need to manually rebuild a directory.
That's an active attacker, but what about accidental collisions: let's say that you have a commit with a particular hash, and it just so happens that you create a blob with the same hash. It could happen, although the chances are vanishingly small. And if it does happen, Git ignores the later file.
So don't worry, be happy, and remember to squash features from development branches.
No comments:
Post a Comment