Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In-memory diffs and merges slow on large repos #6036

Open
arxanas opened this issue Sep 6, 2021 · 4 comments
Open

In-memory diffs and merges slow on large repos #6036

arxanas opened this issue Sep 6, 2021 · 4 comments

Comments

@arxanas
Copy link
Contributor

arxanas commented Sep 6, 2021

In-memory merges can be slow on a repository like https://github.com/mozilla/gecko-dev, e.g. several seconds to carry out a simple merge for a few changed files. This is around 500x slower than what's possible, comparing against a workaround (see benchmark at arxanas/git-branchless@4c57407):

  • 1.9s for naive cherry-pick.
  • 2.8ms for workaround cherry-pick.

I believe this is because the in-memory Index structure always stores all files, even when the vast majority of them aren't changed. It would be best if the in-memory index could alternatively be backed by a tree + changed paths.

The workaround is as follows:

  • Find the commits to merge and calculate their merge-base, as appropriate.
  • Find all paths changed among each of those commits' trees compared to the merge-base.
  • Generate synthetic versions of each tree/commit only containing the changed paths.
  • Carry out the merge on the synthetic trees/commits.
  • Combine the result back with an original tree, overwriting any entries already in the tree. (It doesn't matter which tree, since the non-changed entries are all the same.)

Reference implementation for cherry-picking specifically: https://github.com/arxanas/git-branchless/blob/ec0d27427ab7a505d4109e4588e356d6a18da2fe/src/git/repo.rs#L726-L836

@arxanas arxanas changed the title In-memory merges slow on large repos In-memory merges 500x slower than necessary on large repos Sep 6, 2021
@ethomson
Copy link
Member

ethomson commented Sep 6, 2021

Yes! This isn't limited to in-memory repositories, either. Merge always does more work than it needs to. But it seems more acceptable in the on-disk case because you always need to produce an index.

(With the in-memory case, you can imagine a technique like you described.)

I think that the problem is the final index calculation. I think that we only walk into changed trees, but I could be mistaken, it's been a while since I've been in that code.

Interestingly, git is pursuing a sparse index for this that may inform this change. Maybe we always return a sparse index from merge that can either be written straight out, converted to trees, or rehydrated as an old school index.

@arxanas
Copy link
Contributor Author

arxanas commented Sep 6, 2021

The longest step is writing the index to disk as a tree. It's done in the straightforward fashion, by iterating over all the entries and building up trees:

static int write_tree(

Even if most of the subtrees exist on disk, you have to make a huge number of disk accesses just to confirm that every object is stored in the object database.

Thanks for the link on to the sparse indexes patch series. I also found this design document upon searching. (The last time I searched for "sparse index", I only got results about "sparse checkouts", so I think this has been indexed since then.)

One thing I'm curious about is what the cache tree extension (TREE) for the index (https://git-scm.com/docs/index-format#_cache_tree) is supposed to accomplish. It's supposed to "[speed] up tree object generation from the index for a new commit by only computing the trees that are "new" to that commit." Is that applicable for in-memory indexes?

@arxanas
Copy link
Contributor Author

arxanas commented Sep 11, 2021

It also seems that diff_tree_to_tree is slower than it needs to be, compared to the workaround approach. Benchmark at arxanas/git-branchless@16b0c69:

  • 224ms for naive diff_tree_to_tree.
  • 1.9ms for workaround diff_tree_to_tree.

@arxanas arxanas changed the title In-memory merges 500x slower than necessary on large repos In-memory diffs and merges slow on large repos Sep 11, 2021
@tomasskare
Copy link

I'm experiencing the same problem with slow diffs with large repos. We have a repo with over 6 million files (but we're always operating on it as a bare repo). Performing a diff using libgit2 between HEAD and its parent takes about 10 seconds, while "git diff" takes 0.01 seconds, so a factor of 1000 times slower for libgit2. The diff itself only involves two files, but they are around 10 directories down.

I've compiled libgit2 (v1.5) and lg2 with profiling and will attach the results here. I hope it will help you find out why it's slow. I'm not an expert on the libgit2 code but from the profiling it looks like most of the time is actually spent on advancing the iterators during diffing, and not performing the diffing itself.

gprof-output.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants