Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up protected branch detection #222

Open
2 tasks done
epage opened this issue Mar 23, 2022 · 5 comments
Open
2 tasks done

Speed up protected branch detection #222

epage opened this issue Mar 23, 2022 · 5 comments
Labels
bug Not as expected

Comments

@epage
Copy link
Collaborator

epage commented Mar 23, 2022

Please complete the following tasks

  • I have searched the discussions
  • I have searched the existing issues

Description

In #216, we found that git stack can be slow calculating protected bases when some of them are far from each other or the topic branches.

Version

v0.8.0

Steps to reproduce

git clone https://github.com/westermo/linux/
cd linux
time git stack
git checkout 8dc84dc
git checkout -b foo
gvim COPYING
git commit -a -m "foo"
time git stack
git co master
git branch -D foo
time git stack

Actual Behaviour

Takes 6s

Expected Behaviour

Should take <1s

Debug Output

No response

@epage epage added the bug Not as expected label Mar 23, 2022
@epage
Copy link
Collaborator Author

epage commented Mar 23, 2022

git-branchless originally cached merge-bases on-disk

https://github.com/arxanas/git-branchless/blob/b48ed600777f73e410732e613b199a09e6dafaf2/src/core/mergebase.rs

Then switched to Eden. @arxanas is it all eden or is it also the hooks incrementally updating eden's database?

it now uses a data structure called the "segmented changelog" to represent its commit graph. See quark-zju/gitrevset#1 and commit message at arxanas/git-branchless@f6c540f for more details. It answers O(log(n)) merge-base queries with very good real performance (~1ms per query). That particular implementation is GPL-3-only, unfortunately.

@epage
Copy link
Collaborator Author

epage commented Mar 23, 2022

@arxanas

  • Any particular reason you used sqlite? Seems like reading/writing json would be good enough
    • Risks for reads/writes not being atomic is low (just invalidate the cache)
    • I guess it can reduce complications in detecting when to write the file back out
  • Looks like you didn't handle cache invalidation? I was thinking of tracking what isn't read and removing those from the cache.

@arxanas
Copy link
Contributor

arxanas commented Mar 24, 2022

  • Why SQLite:
    • git-branchless was already using SQLite, so low overhead to add a new table. It might not be worth it if you have to take on a new dependency.
    • I didn't have to worry about atomicity/performance concerns.
    • In the future, I was thinking that I might want to index on one or both of the operands for a certain optimization. Here's the idea I had from Enhancements arxanas/git-branchless#2 (comment):
  • Improve performance of git sl after rebase. In large repos, it can take several seconds to compute.
    • This could probably be done using the post-update hook. There should be something about the lattice merge-base structure which we can exploit to quickly calculate new merge-bases for all the visible commits, something like if mb(local, main) = x and mb(main, main') = x then mb(local, main') = x (but there are some edge-cases due to the fact that nodes can have multiple parents).
  • Invalidation:

    • None of the merge-base queries are "invalidated" in terms of correctness, as you're no doubt aware.
    • In practice, the results were never slow enough and never used up enough space on disk for me to worry about cleaning up.
    • The plan was to just clear out the merge-base cache as part of git branchless gc (which is triggered by the pre-auto-gc hook), but as per above, I didn't need to do that.
    • I think it's perfectly sensible to do something like read in the entire cache, drop any entries which you didn't use, and then write it back to disk (atomically).
  • Eden DAG:

    • The only interface that the Eden DAG needs for updates is for you to pass a function that, when given a node in the DAG, looks up and returns the parents of that node: https://github.com/arxanas/git-branchless/blob/45ace45579804697f8daeb551750424bf8e8538b/src/core/dag.rs#L209-L258. Eden then recurses on only the parents which don't already appear in its internal DAG. This isn't run as part of any hook; it happens on-demand when we create the DAG object in memory.
    • Before doing operations with any commit, we update the Eden DAG on-demand with all commits we care about. Usually, the update step takes just a few milliseconds, even after big fetches, because there won't have been that many commits since the last time we asked (in absolute terms).
    • The Eden DAG structure means that we incur a penalty during the first DAG sync, as we have to walk all the commits in the repository.
      • In contrast, looking up merge-bases directly only incurs a penalty in the order of the number of commits to the common ancestor, rather than the repository size. (But you'll have to incur that whenever you do a merge-base query.
      • To walk all the commits in gecko-dev, it takes ~20s on my machine, which is a fine trade-off for me in exchange for practically free merge-base lookups after that. This reminded me to open Populate the DAG on git branchless init arxanas/git-branchless#308 to make the construction happen during git branchless init (as currently it just happens the first time you need the DAG).

@epage
Copy link
Collaborator Author

epage commented Mar 24, 2022

Yeah, Eden sounds nice; too bad its GPL. Thats weird for me to say as I used to be a fan but there are enough people / companies that restrict GPL that I don't want the work I do to be restricted and I'm unlikely to create the next major piece of software that will be abused for its license. This also means I'm participating in the viral nature of license biases.

Thats a good point that incremental updates wouldn't be a big deal, even without hooks (I also want to avoid hooks). With a graph data structure, it seems like it'd be relatively easy to identify and trim all roots that aren't known (ie cache invalidation). Might be interesting to play with this at a later time if there is an alternative viable graph that supports serialization.

@arxanas how does git-branchless do for initializing on Linux? I'm assuming thats much worse than gecko-dev.

@arxanas
Copy link
Contributor

arxanas commented Mar 25, 2022

@epage here's the timing results:

$ (cd linux && rm -rf .git/branchless/dag/ && sudo purge && /usr/bin/time git sl)
       35.89 real        29.53 user         1.89 sys
$ (cd linux && git rev-list --count HEAD)
1078940
$ (cd gecko-dev && rm -rf .git/branchless/dag/ && sudo purge && /usr/bin/time git sl)
       21.52 real        17.32 user         1.35 sys
$ (cd gecko-dev && git rev-list --count HEAD)
787973

That's a little bit longer than you would expect just based on commit numbers (21.52s * 1078940 commits / 787973 commits = 29.47s, but it actually takes 35s). I'm guessing that it's somehow pessimized by linux's use of many merge commits in contrast to gecko-dev's avoidance of merge commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Not as expected
Projects
None yet
Development

No branches or pull requests

2 participants