Saturday, April 2, 2016

Practical Git: Recovering Orphaned Commits

> git reset --hard 5b7830c203e3f581c1d23d9f945478b9f94979da
HEAD is now at 5b7830c third revision

Some time later, after the window showing the log has been closed …
“Oh s**t, that was the wrong revision!”

If you've followed my advice, you have a backup of your Git repository and recovery is a simple matter of pulling the latest changes. But maybe you haven't pushed recently. Or maybe this is a scratch project that you never bothered to copy remotely.

The commits are still in your local repository, but they're orphaned: no branch head points to them, so there's no easy way to access them. But with a little digging, you can find the commit(s) that you wanted to preserve, and create a branch that contains them. You just need to know how Git stores commits.

The short form is that Git doesn't store “commits” per se; it stores objects. A commit is simply one type of object (the other main types are trees, which contain directory listings, and blobs, which contain file contents). This seems like a nit-picking differentiation, but it means that you have to hunt for your commits within a possibly large directory of objects.

That directory is .git/objects, which resides under the root of your project. If you look inside this directory, you'll see a bunch of sub-directories:

ls -l .git/objects
total 68
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 04
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 0d
drwxrwxr-x 2 kgregory kgregory 4096 Apr  2 08:16 21
…

And if you look inside one of these sub-directories, you'll see a bunch of files:

ls -l .git/objects/04
total 4
-r--r--r-- 1 kgregory kgregory 166 Apr  2 08:16 fbee2d3610317180c3f15b0d122e24f39fa82c

Well, in this case not so many files, because this is an example project with just four commits. But in a real project, you may have hundreds, if not thousands, of files in each directory. So you need some way to winnow them down. On Linux, the find command will let you select files by when they last changed; since an object file is never modified after it's written, that is effectively when it was created. In this case I look for everything in the last hour, because that's when I made the commits that I just orphaned:

find .git/objects/ -cmin -60 -type f -ls
796801    4 -r--r--r--   1 kgregory kgregory       31 Apr  2 08:16 .git/objects/ce/06f54a5a2032d1fb605284e7217fca9e7a5073
796811    4 -r--r--r--   1 kgregory kgregory      167 Apr  2 08:16 .git/objects/66/82d6271a5416ed0a325cbafc34b32bbf893976
796803    4 -r--r--r--   1 kgregory kgregory       56 Apr  2 08:16 .git/objects/fa/b61137a51c608783b342d6e1912f45ae24c775
…

At this point it should be clear that these files are named for SHA-1 hashes. The two-level directory structure is designed so that you can store large numbers of files efficiently: in a project with 10,000 objects there will be up to 256 sub-directories, each of which will contain an average of 39 objects. Unfortunately for us, that means the full SHA-1 hash is split in two: it consists of the directory name followed by the filename. So we need to apply some sed:

find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///'
ce06f54a5a2032d1fb605284e7217fca9e7a5073
6682d6271a5416ed0a325cbafc34b32bbf893976
fab61137a51c608783b342d6e1912f45ae24c775
…
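
The same mapping can be done without sed. As a minimal sketch (the function names path_to_hash and hash_to_path are my own, not part of Git), plain POSIX parameter expansion is enough to go in either direction:

```shell
#!/bin/sh
# Hypothetical helpers, not part of Git: convert between a loose-object
# path and the hash that Git knows the object by.

# .git/objects/ce/06f5... -> ce06f5...
path_to_hash() {
    rel="${1#*.git/objects/}"               # drop everything through ".git/objects/"
    printf '%s\n' "${rel%%/*}${rel#*/}"     # directory name + filename
}

# ce06f5... -> .git/objects/ce/06f5...
hash_to_path() {
    rest="${1#??}"                          # all but the first two characters
    printf '.git/objects/%s/%s\n' "${1%"$rest"}" "$rest"
}

path_to_hash .git/objects/ce/06f54a5a2032d1fb605284e7217fca9e7a5073
hash_to_path ce06f54a5a2032d1fb605284e7217fca9e7a5073
```

The first call prints the hash with the slash removed; the second prints the path again, which is handy when you have a hash from somewhere else and want to check whether it exists as a loose object.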

The reason that we need the hashes, rather than simply grepping the files, is that the files are compressed. However, Git provides the cat-file tool to help us:

git cat-file -p 6682d6271a5416ed0a325cbafc34b32bbf893976
tree ec6da2f24b113700e2d64b773b4d2c9149451bfd
parent 5b7830c203e3f581c1d23d9f945478b9f94979da
author Keith Gregory  1459599418 -0400
committer Keith Gregory  1459599418 -0400

fourth revision
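
cat-file can also report just an object's type, via the -t flag, which prints commit, tree, blob, or tag. As a sketch (the function name only_commits is my own invention), that gives a way to filter a list of hashes down to the commits without looking at their contents:

```shell
#!/bin/sh
# Hypothetical helper: reads object hashes on stdin, one per line, and
# prints only those that Git reports as commit objects.
# Must be run from inside the repository.
only_commits() {
    while read -r hash; do
        # "git cat-file -t" prints the object type; errors for names that
        # are not objects (such as pack files) are discarded
        if [ "$(git cat-file -t "$hash" 2>/dev/null)" = "commit" ]; then
            printf '%s\n' "$hash"
        fi
    done
}
```

The hash list produced by the find and sed pipeline above could be piped straight into it.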

At this point, finding the commits that you care about is a matter of passing the list of hashes into cat-file, and grepping for text that identifies your commit. Looking for the actual commit message is chancy (especially if it might overlap with filenames or file contents), so I generally just look for files that contain “committer”:

for f in $(find .git/objects/ -cmin -60 -type f | sed -e 's/.*ects.//' | sed -e 's/\///')
do git cat-file -p "$f" | grep committer && echo "$f"
done
committer Keith Gregory  1459599387 -0400
275a303e89187e0ccffef18fd8f8d42103b33618
committer Keith Gregory  1459599402 -0400
5b7830c203e3f581c1d23d9f945478b9f94979da
committer Keith Gregory  1459599418 -0400
6682d6271a5416ed0a325cbafc34b32bbf893976
committer Keith Gregory  1459599374 -0400
0dc65221c3e1d4991fc9bf9471d5dc6e372c3885
committer Keith Gregory  1459599395 -0400
04fbee2d3610317180c3f15b0d122e24f39fa82c

The next-to-last field on each committer line is the commit's timestamp, in seconds since the Unix epoch; the highest number belongs to the most recent commit. Here that is 6682d62, so you can recover your commits by creating a branch that points at it:

git checkout -b my_recovered_commits 6682d6271a5416ed0a325cbafc34b32bbf893976
Switched to a new branch 'my_recovered_commits'

> git log
commit 6682d6271a5416ed0a325cbafc34b32bbf893976
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:58 2016 -0400

    fourth revision

commit 5b7830c203e3f581c1d23d9f945478b9f94979da
Author: Keith Gregory 
Date:   Sat Apr 2 08:16:42 2016 -0400

    third revision
…
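
With only five candidates, picking the highest timestamp by eye is easy; with a longer list it may be worth letting awk do the comparison. The following is a sketch that assumes its input is exactly the alternating committer/hash output produced by the loop above (the function name newest_commit is hypothetical). Real committer lines also carry an email address, but that doesn't change the position of the timestamp counted from the right.

```shell
#!/bin/sh
# Hypothetical helper: reads alternating "committer ..." and hash lines
# on stdin and prints the hash whose committer timestamp is highest.
newest_commit() {
    awk '
        # committer line: the next-to-last field is the Unix timestamp
        /^committer / { ts = $(NF - 1) + 0; next }
        # hash line: keep it if its timestamp beats the best seen so far
        /^[0-9a-f]+$/ && ts > max { max = ts; newest = $1 }
        END { if (newest != "") print newest }
    '
}

newest_commit <<'EOF'
committer Keith Gregory  1459599402 -0400
5b7830c203e3f581c1d23d9f945478b9f94979da
committer Keith Gregory  1459599418 -0400
6682d6271a5416ed0a325cbafc34b32bbf893976
committer Keith Gregory  1459599374 -0400
0dc65221c3e1d4991fc9bf9471d5dc6e372c3885
EOF
```

For the sample input this prints 6682d6271a5416ed0a325cbafc34b32bbf893976, the same commit that I picked by hand.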
