Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Diff takes more than 4 minutes to complete #2962

Closed
talSofer opened this issue Feb 23, 2022 · 3 comments · Fixed by #2968
Closed

Diff takes more than 4 minutes to complete #2962

talSofer opened this issue Feb 23, 2022 · 3 comments · Fixed by #2968
Labels
bug/critical bug Something isn't working contributor

Comments

@talSofer
Copy link
Contributor

No description provided.

@arielshaqed
Copy link
Contributor

arielshaqed commented Feb 27, 2022

#2968 is very nice :-)

But it still performs many GetCommit calls, these are unnecessarily slow and in fact may be cached so are probably unnecessary.

GetCommit is slow

This is intentional: this call is batched. This batching makes sense when performing other operations (see @tzahij's blog post) because they involve concurrency at load. For instance, when a large-scale Spark job starts reading from lakeFS, it performs multiple concurrent reads, and each fetches the same HEAD commit. Batching means responses arrive later than they might.

But batching does not make sense in FindMergeBase: this function sequentially fetches commits, so if we batch the commits we slow each iteration down. One speedup might be to pass this function an unbatched CommitGetter.

But we can do even better...

GetCommit should be cached

There can be many thousands of commits. However, a quick inspection of graveler.go suggests that the relation graveler_commits is never updated other than by FillGenerations / LoadCommits. Commits are immutable.

So we can cache commits. Every call to GetCommit should go through a cache. The cache can (easily) have size 10_000, and will hold all relevant commits in memory. FillGenerations can simply invalidate the cache :-) , it is anyway only used by restore-refs. And, if we use our cache.Cache then we don't even need to batch: concurrent GetCommit operations only call out once to the database, and the result is cached for later operations.

WDYT? Can we open a new issue "Cache commits"?

@itaiad200
Copy link
Contributor

@arielshaqed @talSofer I don't want to lose the above comment. Should we create a separate issue for commit caching?

@talSofer
Copy link
Contributor Author

sure, it's on my list. thanks for confirming

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug/critical bug Something isn't working contributor
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants