Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Accelerate FindMergeBase by caching commits #2981

Closed
talSofer opened this issue Feb 27, 2022 · 4 comments
Closed

Accelerate FindMergeBase by caching commits #2981

talSofer opened this issue Feb 27, 2022 · 4 comments
Labels
stale team/versioning-engine Team versioning engine

Comments

@talSofer
Copy link
Contributor

This is a copy of @arielshaqed 's comment

#2968 is very nice :-)

But it still performs many GetCommit calls, these are unnecessarily slow and in fact may be cached so are probably unnecessary.

GetCommit is slow

This is intentional: this call is batched. This batching makes sense when performing other operations (see @tzahij's blog post) because they involve concurrency at load. For instance, when a large-scale Spark job starts reading from lakeFS, it performs multiple concurrent reads, and each fetches the same HEAD commit. Batching means responses arrive later than they might.

But batching does not make sense in FindMergeBase: this function sequentially fetches commits, so if we batch the commits we slow each iteration down. One speedup might be to pass this function an unbatched CommitGetter.

But we can do even better...

GetCommit should be cached

There can be many thousands of commits. However, a quick inspection of graveler.go suggests that the relation graveler_commits is never updated other than by FillGenerations / LoadCommits. Commits are immutable.

So we can cache commits. Every call to GetCommit should go through a cache. The cache can (easily) have size 10_000, and will hold all relevant commits in memory. FillGenerations can simply invalidate the cache :-) , it is anyway only used by restore-refs. And, if we use our cache.Cache then we don't even need to batch: concurrent GetCommit operations only call out once to the database, and the result is cached for later operations.

@nopcoder nopcoder added team/versioning-engine Team versioning engine team/ecosystem Team Ecosystem labels Mar 16, 2022
@arielshaqed arielshaqed removed the team/ecosystem Team Ecosystem label Mar 17, 2022
@arielshaqed
Copy link
Contributor

Just "versioning". You lot get to have all the fun on this one.

Copy link

github-actions bot commented Nov 1, 2023

This issue is now marked as stale after 90 days of inactivity, and will be closed soon. To keep it, mark it with the "no stale" label.

@github-actions github-actions bot added the stale label Nov 1, 2023
Copy link

Closing this issue because it has been stale for 7 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 12, 2023
@arielshaqed
Copy link
Contributor

This has actually been the case for a while, now. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale team/versioning-engine Team versioning engine
Projects
None yet
Development

No branches or pull requests

3 participants