In-memory diffs and merges slow on large repos #6036
Comments
Yes! This isn't limited to in-memory repositories, either: merge always does more work than it needs to. It just seems more acceptable in the on-disk case because you always need to produce an index. (In the in-memory case, you can imagine a technique like the one you described.) I think the problem is the final index calculation. I think we only walk into changed trees, but I could be mistaken; it's been a while since I've been in that code. Interestingly, git is pursuing a sparse index for this, which may inform this change. Maybe we always return a sparse index from merge that can be written straight out, converted to trees, or rehydrated as an old-school index.
The longest step is writing the index to disk as a tree. It's done in the straightforward fashion, by iterating over all the entries and building up trees: Line 572 in f1b89a2
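To make the cost concrete, here is a minimal Python sketch of that "iterate over all entries and build up trees" approach. It is a toy model, not the libgit2 code at the referenced line: the object database is a plain dict, the tree encoding is made up, and `write_tree_from_flat_index`, `object_id`, and the `stats` counters are hypothetical names used only for illustration.

```python
import hashlib
from collections import defaultdict


def object_id(payload: bytes) -> str:
    """Content hash standing in for a Git object ID (toy format, not Git's)."""
    return hashlib.sha1(payload).hexdigest()


def write_tree_from_flat_index(entries, odb, stats):
    """Build nested trees from a flat {path: blob_id} index.

    Every directory is rebuilt and checked against `odb`, even if unchanged.
    """
    files = {}
    subdirs = defaultdict(dict)
    for path, blob_id in entries.items():
        head, _, rest = path.partition("/")
        if rest:
            subdirs[head][rest] = blob_id  # entry belongs to a subdirectory
        else:
            files[head] = ("blob", blob_id)
    for name, sub_entries in subdirs.items():
        files[name] = ("tree", write_tree_from_flat_index(sub_entries, odb, stats))

    lines = [f"{kind} {oid} {name}" for name, (kind, oid) in sorted(files.items())]
    payload = "\n".join(lines).encode()
    tree_id = object_id(payload)
    stats["lookups"] += 1  # one existence check per directory
    if tree_id not in odb:
        odb[tree_id] = payload
        stats["writes"] += 1
    return tree_id


# 100 directories with 10 files each.
index = {
    f"dir{i}/file{j}.txt": object_id(f"{i}-{j}".encode())
    for i in range(100)
    for j in range(10)
}
odb, stats1 = {}, {"lookups": 0, "writes": 0}
root1 = write_tree_from_flat_index(index, odb, stats1)

# Change a single file and write the tree again.
index["dir7/file3.txt"] = object_id(b"changed")
stats2 = {"lookups": 0, "writes": 0}
root2 = write_tree_from_flat_index(index, odb, stats2)
print(stats2)
```

In this toy model the second write only stores two new trees (the changed directory and the root), but it still rebuilds and looks up every directory, which is the part that scales with repository size rather than with the size of the change.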
Even if most of the subtrees already exist on disk, you have to make a huge number of disk accesses just to confirm that every object is stored in the object database. Thanks for the link to the sparse index patch series. I also found this design document while searching. (The last time I searched for "sparse index", I only got results about "sparse checkouts", so I think this has been indexed since then.) One thing I'm curious about is what the cache tree extension (
It also seems that
I'm seeing the same problem: diffs are slow on large repos. We have a repo with over 6 million files (though we always operate on it as a bare repo). A diff between HEAD and its parent takes about 10 seconds with libgit2, while "git diff" takes 0.01 seconds, so libgit2 is roughly 1000 times slower. The diff itself touches only two files, but they are around 10 directories deep. I've compiled libgit2 (v1.5) and lg2 with profiling and will attach the results here; I hope they help you find out why it's slow. I'm not an expert on the libgit2 code, but from the profile it looks like most of the time is spent advancing the iterators during the diff, not performing the diff itself.
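For comparison, a tree-to-tree diff can in principle skip any subtree whose object ID is identical on both sides, so the work is proportional to the depth and size of the change rather than to the number of files. The Python sketch below illustrates that technique on a toy object store. It is an assumption about why such a diff can be fast, not a description of what libgit2's iterators or git actually do, and `store_tree` and `diff_trees` are hypothetical helpers.

```python
import hashlib


def store_tree(node, odb):
    """Store a nested dict (name -> bytes or dict) and return its toy tree ID."""
    entries = {}
    for name, child in node.items():
        if isinstance(child, dict):
            entries[name] = ("tree", store_tree(child, odb))
        else:
            entries[name] = ("blob", hashlib.sha1(child).hexdigest())
    tree_id = hashlib.sha1(repr(sorted(entries.items())).encode()).hexdigest()
    odb[tree_id] = entries
    return tree_id


def diff_trees(old_id, new_id, odb, prefix="", visited=None):
    """Return changed paths, descending only into subtrees whose IDs differ."""
    if visited is not None:
        visited.append(prefix or "/")  # record which directories were opened
    old, new = odb[old_id], odb[new_id]
    changed = []
    for name in sorted(old.keys() | new.keys()):
        a, b = old.get(name), new.get(name)
        if a == b:
            continue  # same kind and ID: the whole subtree or blob is unchanged
        path = prefix + name
        if a and b and a[0] == b[0] == "tree":
            changed += diff_trees(a[1], b[1], odb, path + "/", visited)
        else:
            # Added, deleted, or modified entry (directories reported as one path).
            changed.append(path)
    return changed


odb = {}
big = {f"f{i}": b"x" for i in range(1000)}
old = {"a": {"b": {"c.txt": b"1", "d.txt": b"2"}}, "big": big}
new = {"a": {"b": {"c.txt": b"1", "d.txt": b"changed"}}, "big": big}

visited = []
changed = diff_trees(store_tree(old, odb), store_tree(new, odb), odb, visited=visited)
print(changed, visited)
```

Here the 1000-entry `big` directory is never opened, because its tree ID is the same in both roots; only the root, `a/`, and `a/b/` are read.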
In-memory merges can be slow on a repository like https://github.com/mozilla/gecko-dev, e.g. several seconds to carry out a simple merge for a few changed files. This is around 500x slower than what's possible, comparing against a workaround (see benchmark at arxanas/git-branchless@4c57407):
I believe this is because the in-memory Index structure always stores all files, even when the vast majority of them aren't changed. It would be best if the in-memory index could alternatively be backed by a tree + changed paths.
The workaround is as follows:
Reference implementation for cherry-picking specifically: https://github.com/arxanas/git-branchless/blob/ec0d27427ab7a505d4109e4588e356d6a18da2fe/src/git/repo.rs#L726-L836
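The "tree + changed paths" idea can be sketched on a toy object store as follows. This is a simplified Python illustration of the general technique, not the code from the linked reference implementation and not a libgit2 API; `put_tree` and `apply_changes` are hypothetical names, and the tree encoding is made up.

```python
import hashlib


def put_tree(entries, odb):
    """Store one tree level ({name: (kind, oid)}) and return its toy ID."""
    tree_id = hashlib.sha1(repr(sorted(entries.items())).encode()).hexdigest()
    odb[tree_id] = dict(entries)
    return tree_id


def apply_changes(tree_id, changes, odb):
    """Return a new root ID with {path: blob_id or None} applied to `tree_id`.

    Only trees on a changed path are read and rewritten; siblings are reused by ID.
    """
    entries = dict(odb[tree_id]) if tree_id else {}
    nested = {}
    for path, blob_id in changes.items():
        head, _, rest = path.partition("/")
        if rest:
            nested.setdefault(head, {})[rest] = blob_id  # change is deeper down
        elif blob_id is None:
            entries.pop(head, None)  # deletion
        else:
            entries[head] = ("blob", blob_id)
    for name, sub_changes in nested.items():
        kind, oid = entries.get(name, ("tree", None))
        sub_id = apply_changes(oid if kind == "tree" else None, sub_changes, odb)
        if odb[sub_id]:
            entries[name] = ("tree", sub_id)
        else:
            entries.pop(name, None)  # drop directories that became empty
    return put_tree(entries, odb)


def blob(data: bytes) -> str:
    """Toy blob ID; a real implementation would write the blob to the ODB."""
    return hashlib.sha1(data).hexdigest()


odb = {}
src = put_tree({"main.c": ("blob", blob(b"old"))}, odb)
docs = put_tree({"README": ("blob", blob(b"docs"))}, odb)
root = put_tree({"src": ("tree", src), "docs": ("tree", docs)}, odb)

# Result of a hypothetical merge that only touched src/main.c.
new_root = apply_changes(root, {"src/main.c": blob(b"new")}, odb)
print(odb[new_root]["docs"] == ("tree", docs))
```

The unchanged `docs` subtree is carried over by ID without being read, so the cost depends on the changed paths and their depth rather than on the total number of files in the repository.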