Friday, December 12, 2008

git is slow; too many lstat operations


I'm working on a new project and using git for the first time. I really like the distributed source control model; it makes a lot of sense. I've been using p4 for quite a long time, so to me git feels a lot like CVS/Subversion, though obviously a lot more powerful than CVS. To be clear, I'm not saying that git is hard to understand or use.

The way I work with my projects is by submitting not only all the source code for my project, but also all the tools used to build it. (Obviously I draw the line somewhere; I don't submit the Linux distribution that runs on the build machine.) That way I can guarantee reproducibility, and no one on the team has to install their own tools before being productive.

A short time into my new project I already have more than 500,000 files in my git repository. Now every git command just sits there for many seconds; even opening a file in Emacs incurs a severe delay, because git is invoked to check the file's status. Using dtrace I've determined that git does an lstat on every single file in the repository to make sure it's not out of date.
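If you want to reproduce the measurement, here's a rough sketch (assuming git and a POSIX shell; the file count and names are made up). On Linux, `strace -c -e trace=lstat git status` gives the same syscall counts that dtrace gave me:

```shell
# A throwaway repository for watching the scan happen: one lstat per
# tracked file on every `git status`.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
i=0
while [ $i -lt 2000 ]; do
    echo x > "file$i"
    i=$((i + 1))
done
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "2000 files"
git status > /dev/null   # every invocation re-scans all 2000 files
```

At 2,000 files the scan is still quick; the point is that the cost is linear in the number of tracked files, so at 500,000 it is not.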

Therefore beware: git is not good for large repositories. And by large I don't mean "sorta large" like 20,000 files; I mean on the order of 500,000 to 2.5 million files (the current size of our p4 repository). This is because git behaves like CVS: it automatically determines which files you have edited, deleted, or added, and it does so by doing an lstat on every single file in your tree. p4 makes no such assumption; it requires you to tell it which files you have edited, deleted, or added, so it never lstats the files in your tree (unless you invoke special commands that ask it to look).
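To illustrate the difference: git discovers your edits by scanning the tree, where p4 would require an explicit `p4 edit` before you touch the file. A minimal sketch of the git side (repo contents and names are made up):

```shell
# git finds changes on its own, by stat'ing every tracked file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo one > a.txt
git add a.txt
git -c user.name=demo -c user.email=demo@example.com commit -qm "add a.txt"

# Edit without telling git anything; `git status` finds it by scanning.
echo two > a.txt
git status --porcelain   # prints " M a.txt"
```

The p4 equivalent is explicit: `p4 edit a.txt` declares the change up front, so the server already knows what you have open and never needs to walk your tree.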

There appear to be ongoing attempts to optimize the lstat operations, but the scan is part of git's design, and avoiding or suppressing it entirely would likely be unnatural.
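One knob that does exist (a sketch, assuming a reasonably recent git): `git update-index --assume-unchanged` marks a file so that git skips stat'ing it, and the related `core.ignorestat` config applies the same idea more broadly. This is essentially opting into the p4 model for that file, because you then have to tell git yourself when you edit it:

```shell
# Mark a file "assume unchanged" so git stops checking it.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo v1 > big-generated-file
git add big-generated-file
git -c user.name=demo -c user.email=demo@example.com commit -qm init

git update-index --assume-unchanged big-generated-file
echo v2 > big-generated-file
git status --porcelain      # prints nothing: git no longer stats this file

git update-index --no-assume-unchanged big-generated-file
git status --porcelain      # now prints " M big-generated-file"
```

The trade-off is exactly the one described above: you gain speed by giving up automatic change detection for those files.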

I found the git benchmarks, which describe how efficient git is relative to the size of its repositories, which is great. However, they do not mention the number of files in a repository. I also couldn't find the current size of the Linux kernel repository, but I did find a mention that a pull of an entire tree was 5,000 files, which included all of git's metadata (if I understood it correctly).

In order to work around git's slowness with large numbers of files, I decided to split my git repository in two: one half with the tools, the other half with my project source. I found a tool called git split that seems to do the job (see http://people.freedesktop.org/~jamey/git-split), but it didn't work on my git 1.6.0 repository: it got stuck because it couldn't read my .git/info/grafts. So I gave up on that, deleted all the tools from my tree, and added them to a separate git repository. That made things much better.
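The manual split, sketched (all paths and file names here are made-up examples; note this approach does not carry the tools' history into the new repository):

```shell
# Original repo with tools/ and src/ side by side.
set -e
work=$(mktemp -d)
cd "$work"
git init -q project
cd project
mkdir tools src
echo compiler > tools/cc
echo main > src/main.c
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm initial

# 1. Drop the tools from the project repo.
git rm -rq tools
git -c user.name=demo -c user.email=demo@example.com commit -qm "move tools out"

# 2. Re-add them to a repository of their own.
cd "$work"
git init -q tools
cd tools
echo compiler > cc
git add .
git -c user.name=demo -c user.email=demo@example.com commit -qm "import tools"
```

After this, everyday commands in the project repository only have to stat the source files, which is what made things much better.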

Update: I found this blog article that also claims that git does not scale: Why Perforce is more scalable than Git