But it still performs many GetCommit calls; these are unnecessarily slow, and since their results could be cached, most of them are probably unnecessary.
GetCommit is slow
This is intentional: the call is batched. Batching makes sense for other operations (see @tzahij's blog post) because they involve heavy concurrency under load. For instance, when a large-scale Spark job starts reading from lakeFS, it performs many concurrent reads, each fetching the same HEAD commit, so one database round trip can serve them all. The trade-off is that batched responses arrive later than unbatched ones would.
But batching does not make sense in FindMergeBase: this function fetches commits sequentially, each fetch depending on the previous result, so batching only slows each iteration down. One speedup would be to pass this function an unbatched CommitGetter.
But we can do even better...
GetCommit should be cached
There can be many thousands of commits. However, a quick inspection of graveler.go suggests that the relation graveler_commits is never updated other than by FillGenerations / LoadCommits. Commits are immutable.
So we can cache commits: every call to GetCommit should go through a cache. The cache can easily be of size 10_000, which will hold all relevant commits in memory. FillGenerations can simply invalidate the cache :-) ; it is only used by restore-refs anyway. And if we use our cache.Cache, then we don't even need to batch: concurrent GetCommit operations call out to the database only once, and the result is cached for later operations.
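A minimal sketch of such a cache, resting on commit immutability. The names commitCache and Invalidate are hypothetical; a real implementation would use lakeFS's cache.Cache with proper LRU eviction rather than the crude eviction shown here.

```go
package main

import (
	"fmt"
	"sync"
)

type Commit struct{ Message string }

type commitCache struct {
	mu      sync.Mutex
	entries map[string]Commit
	max     int // e.g. 10_000 holds all relevant commits in memory
	fetch   func(id string) Commit
	dbCalls int
}

func newCommitCache(max int, fetch func(string) Commit) *commitCache {
	return &commitCache{entries: map[string]Commit{}, max: max, fetch: fetch}
}

func (c *commitCache) GetCommit(id string) Commit {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cm, ok := c.entries[id]; ok {
		return cm // commits are immutable, so a cached copy is always valid
	}
	cm := c.fetch(id)
	c.dbCalls++
	if len(c.entries) >= c.max { // crude eviction; a real cache would use LRU
		for k := range c.entries {
			delete(c.entries, k)
			break
		}
	}
	c.entries[id] = cm
	return cm
}

// Invalidate drops everything; FillGenerations would call this.
func (c *commitCache) Invalidate() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = map[string]Commit{}
}

func main() {
	cache := newCommitCache(10_000, func(id string) Commit {
		return Commit{Message: "message of " + id}
	})
	for i := 0; i < 5; i++ {
		cache.GetCommit("abc123") // only the first call hits the database
	}
	fmt.Println("db calls:", cache.dbCalls)
}
```

Holding the mutex across the fetch also gives the coalescing behavior for free: a second concurrent GetCommit for the same ID blocks until the first has populated the cache, so the database is hit once.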
(This issue is a copy of @arielshaqed's comment on #2968.)