Friday, December 12, 2008

git is slow; too many lstat operations


I'm working on a new project and using git for the first time; I really like the distributed source control model, it makes a lot of sense. I come from using p4 for quite a long time, so git feels a lot like CVS/SubVersion, but obviously a lot more powerful than CVS. Also, I want to make sure that I do not say that git is hard to understand and use.

The way I work with my projects is by submitting not only all my source code for my project, but also all tools that are used to generate my project. (Obviously I draw the line somewhere, I don't submit my Linux distribution that I use on the build machine.) That way I can guarantee reproducibility and no one on the team has to install their own tools before being productive.

A short time into my new project I already have more than 500,000 files in my git repository. Now every time I use any git command, it just sits there for many seconds... even when I open a file using Emacs, I have a severe delay because git is invoked to check its status. Using dtrace I've determined that it is doing an lstat on every single file in the repository, to make sure that it's not out of date.

Therefore beware: git is not good for large repositories. And when I say large, I don't mean "sorta large" like 20,000 files. I mean more on the order of 500,000 or 2.5 million files (the current size of our p4 repository). This is because git is just like CVS: it automatically determines for you which files you have edited/deleted/added, and which you have not, but it does this by doing an lstat on every single file in your tree. p4 does not make this assumption, and requires you to tell it what files you have edited/deleted/added, therefore it never does an lstat of the files in your tree (unless you invoke some special commands to ask it to look).

It seems there are attempts to optimize the lstat operations, but it is part of the design of git, and would likely be unnatural to avoid it or suppress it.

I found the git benchmarks, where it describes how efficient it is for the size of its repositories, which is great. However, it does not mention the number of files in a repository. And I couldn't find the current size of the linux kernel repository, but I found a mention that the pull of an entire tree was 5,000 files, which included all of git's metadata (if I understood it correctly).

In order to work around git's slowness with large numbers of files, I decided to split my git repository into two halves; the half with the tools, and the other half with my project source. I found a tool called git split that seems to do the job (seehttp://people.freedesktop.org/~jamey/git-split), but it didn't work on my git 1.6.0 repository. It got stuck because it couldn't read my .git/info/grafts. So I gave up there and just deleted all the tools out of my tree, and added them into a separate git repository. That made things much better.

Update: I found this blog article that also claims that git does not scale: Why Perforce is more scalable than Git

2 comments:

  1. You may want to try out the "ASSUME UNCHANGED" bit.

    This makes git behave like p4, so you have to tell git what you have touched.

    You find more in GIT-UPDATE-INDEX(1).

    A snippit:
    ---
    USING “ASSUME UNCHANGED” BIT
    Many operations in git depend on your filesystem to have an efficient lstat(2) implementation, so that st_mtime information for working tree files can be cheaply checked to see if the
    file contents have changed from the version recorded in the index file. Unfortunately, some filesystems have inefficient lstat(2). If your filesystem is one of them, you can set "assume
    unchanged" bit to paths you have not changed to cause git not to do this check. Note that setting this bit on a path does not mean git will check the contents of the file to see if it
    has changed — it makes git to omit any checking and assume it has not changed. When you make changes to working tree files, you have to explicitly tell git about it by dropping "assume
    unchanged" bit, either before or after you modify them.

    In order to set "assume unchanged" bit, use --assume-unchanged option. To unset, use --no-assume-unchanged. To see which files have the "assume unchanged" bit set, use git ls-files -v
    (see git-ls-files(1)).

    The command looks at core.ignorestat configuration variable. When this is true, paths updated with git update-index paths... and paths updated with other git commands that update both
    index and working tree (e.g. git apply --index, git checkout-index -u, and git read-tree -u) are automatically marked as "assume unchanged". Note that "assume unchanged" bit is not set
    if git update-index --refresh finds the working tree file matches the index (use git update-index --really-refresh if you want to mark them as "assume unchanged").
    ---

    ReplyDelete
  2. Why can't git let user specify which files he wants to work on and don't compromise on speed by lstat.
    -Pankaj

    ReplyDelete