Compaction: sort on slices directly instead of kmerge #4839
Conversation
1264 tests run: 1213 passed, 0 failed, 51 skipped (full report). Flaky tests (1): Postgres 15.
Force-pushed 374cef8 to 92784e3 (compare)
Force-pushed 92784e3 to 10d2740 (compare)
I understand this already implements what I suggested: a single walk to get sizes and value refs. Looking good.
Always enjoy a decrease in:
- the number of virtual calls
- the amount of duplicated in-memory contents
- the number of LOC
Indeed I have pushed a commit to merge the two functions, but now there is a failure in …
Mhh yeah, if I print the size of the arrays, it seems that the key search has way fewer items than the value search:
The failure is gone if I remove the …
I've pushed a commit which I hope fixes the test failures... but I don't understand why the prior optimization worked. I think that due to the reordering, it's not possible to have a similarly optimized state where a chunk of the keys is gone when we have a shared list (for the …
Force-pushed 64cbebf to ab36f4a (compare)
I've looked at what is done with the keys, and am now more confident that all the merging did was ensure that blobs for the same key (but different lsns) would end up in the same layer file. To preserve that, I've added a commit to coalesce the iterator. It won't do precisely the same coalescing/merging though: first, the old approach would not merge key pairs from different files; in the new vec we have no good way to distinguish files, so instead we just look at the size: if the combined size reaches the target file size, we don't merge. This might still create files larger than the target file size, but they will stay below double the target file size.
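A minimal sketch of the coalescing described above, with hypothetical names and types rather than the actual pageserver code: consecutive entries for the same key are merged until the accumulated size reaches the target file size, at which point a new entry is cut so the downstream loop can start a new layer file.

```rust
/// Hypothetical entry shape: key, lsn, and size of the value blob on disk.
struct Entry {
    key: u64,
    lsn: u64,
    size: u64,
}

/// Coalesce consecutive entries sharing a key, but cut a new group once the
/// accumulated size reaches `target_file_size`. A group may exceed the target,
/// but stays below twice the target as long as individual blobs are below it.
fn coalesce(entries: &[Entry], target_file_size: u64) -> Vec<(u64, u64, u64)> {
    let mut out: Vec<(u64, u64, u64)> = Vec::new(); // (key, first lsn, summed size)
    for e in entries {
        match out.last_mut() {
            Some((key, _, size)) if *key == e.key && *size < target_file_size => *size += e.size,
            _ => out.push((e.key, e.lsn, e.size)),
        }
    }
    out
}

fn main() {
    let entries = [
        Entry { key: 1, lsn: 10, size: 60 },
        Entry { key: 1, lsn: 20, size: 60 }, // same key: merged, since 60 < 100
        Entry { key: 1, lsn: 30, size: 60 }, // same key, but 120 >= 100: new group
        Entry { key: 2, lsn: 5, size: 10 },
    ];
    assert_eq!(coalesce(&entries, 100), vec![(1, 10, 120), (1, 30, 60), (2, 5, 10)]);
}
```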
Force-pushed dc9889b to 10d2740 (compare)
I've pushed an earlier state of the PR (with two separate vecs) to this branch to enable faster merging. This will unblock some downstream refactors and separates the more involved changes from the ones that just do a refactor. For the changes that originally were part of this PR, see #4888.
## Problem

`DiskBtreeReader::get` and `DiskBtreeReader::visit` both call `read_blk` internally, which we would like to make async in the future. This PR focuses on making the interface of these two functions `async`. There is further work to be done in the form of making `visit` non-recursive, similar to #4838; for that, see #4884. Builds on top of #4839, part of #4743.

## Summary of changes

Make `DiskBtreeReader::get` and `DiskBtreeReader::visit` async functions and `await` in the places that call these functions.
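A minimal sketch of the shape of this change, with simplified signatures that are not the actual `DiskBtreeReader` API: the functions become `async fn`, the internal `read_blk` calls get an `.await`, and every caller awaits in turn.

```rust
// Simplified sketch; not the real pageserver types or signatures.
struct DiskBtreeReader;

impl DiskBtreeReader {
    // Assumed to become async as well (the motivation for this PR).
    async fn read_blk(&self, _blknum: u32) -> std::io::Result<Vec<u8>> {
        Ok(Vec::new())
    }

    // Previously a plain fn; now async so the read_blk call can be awaited.
    async fn get(&self, _key: &[u8]) -> std::io::Result<Option<u64>> {
        let _root_block = self.read_blk(0).await?;
        // ... descend through the tree, awaiting each block read ...
        Ok(None)
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let reader = DiskBtreeReader;
    // Call sites now need an .await (the bulk of this PR's diff).
    let _value = reader.get(b"some-key").await?;
    Ok(())
}
```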
## Problem

The functions `DeltaLayer::load_inner` and `ImageLayer::load_inner` are calling `read_blk` internally, which we would like to turn into an async fn.

## Summary of changes

We switch from `once_cell`'s `OnceCell` implementation to the one in `tokio` in order to be able to call an async `get_or_try_init` function. Builds on top of #4839, part of #4743.
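A minimal sketch of the switch, using simplified stand-in types rather than the real `DeltaLayer`/`ImageLayer` code: `tokio::sync::OnceCell` provides an async `get_or_try_init`, so the one-time initialization can itself await async I/O.

```rust
use tokio::sync::OnceCell;

struct LayerInner; // stand-in for the lazily loaded layer contents

struct Layer {
    inner: OnceCell<LayerInner>,
}

impl Layer {
    async fn load(&self) -> Result<&LayerInner, std::io::Error> {
        // tokio's get_or_try_init takes a closure returning a Future, so the
        // initializer can await async reads (e.g. an async read_blk).
        self.inner
            .get_or_try_init(|| async {
                // ... async reads and parsing of the layer file would go here ...
                Ok(LayerInner)
            })
            .await
    }
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    let layer = Layer { inner: OnceCell::new() };
    let _inner = layer.load().await?; // first call initializes, later calls reuse
    Ok(())
}
```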
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec creations from 3 to 2, but as pointed out in #4839 (comment), we would ideally traverse the b-tree just once during compaction. After #4836, the two vecs created are one for the list of keys, lsns and sizes, and one for the list of `(key, lsn, value reference)`. However, they are not equal, as pointed out in #4839 (comment) and the following comment: the key vec creation combines multiple entries for which the lsn changes but the key stays the same into one, with the size being the sum of the sub-sizes. In SQL, this would correspond to something like `SELECT key, lsn, SUM(size) FROM b_tree GROUP BY key;` versus `SELECT key, lsn, val_ref FROM b_tree;`. Therefore, the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's not a trivial change and affects the size pattern of the resulting files, which is why this is in a separate PR from #4839. The key vec is used in compaction for determining when to start a new layer file. The loop uses various thresholds to come to this conclusion, but the grouping by key has led to the behaviour that, regardless of the threshold, a new file is only started when either a new key or a new delta file is encountered. The new code now does the combination after the merging and sorting of the various keys from the delta files. This *mostly* does the same as the old code, except for a detail: with the grouping done on a per-delta-layer basis, the sorted and merged vec would still have multiple entries for multiple delta files, but now we don't have an easy way to tell when a new input delta layer file is encountered, so we cannot create multiple entries on that basis easily. To prevent possibly unbounded growth, our new grouping code compares the combined size with the threshold, and if it is exceeded, it cuts a new entry so that the downstream code can cut a new output file. This is a tradeoff, however: if the threshold is too small, we risk spreading entries for the same key across multiple layer files, but if the threshold is too big, we can in some instances exceed the target size. Currently, we set the threshold to the target size, so in theory we would stay below or roughly at double the `target_file_size`. We also fix the way the size was calculated for the last key: the calculation was wrong and accounted for the old layer's btree, even though we already account for the overhead of the in-construction btree. Builds on top of #4839.
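A toy illustration with made-up keys and sizes (not taken from the actual code) of why the two vecs don't line up one-to-one, which is what makes the join non-trivial:

```rust
fn main() {
    // Flat value-reference vec: one entry per (key, lsn) pair.
    let val_refs = vec![
        ("key_a", 10, "val_ref_a_at_10"),
        ("key_a", 20, "val_ref_a_at_20"),
        ("key_b", 15, "val_ref_b_at_15"),
    ];

    // Grouped key vec: entries with the same key are combined and their sizes
    // summed (the `SELECT key, lsn, SUM(size) ... GROUP BY key` analogy above).
    let keys = vec![
        ("key_a", 10, 100u64 + 200), // both key_a blob sizes summed
        ("key_b", 15, 50),
    ];

    // Three entries vs. two: there is no positional correspondence between the
    // lists, so this PR merges them into a single vec instead of joining them.
    assert_ne!(val_refs.len(), keys.len());
}
```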
…#4938) Implement a new `struct Layer` abstraction which manages the download state internally, requiring no LayerMap locking or rewriting to download or evict, and providing the property "if you have a layer, you can read it". The new `struct Layer` provides the ability to keep the file resident via a RAII structure for new layers which still need to be uploaded. The previous solution solved this with `RemoteTimelineClient::wait_completion`, which led to bugs like #5639. Eviction, and the final local deletion after garbage collection, is done via the Arc'd value's `Drop`. With a single `struct Layer`, the open-ended `trait Layer`, `trait PersistentLayer`, and `struct RemoteLayer` are removed, following the observation that compaction could be simplified by simply not using any of the traits in between: #4839. The new `struct Layer` is a preliminary to removing `Timeline::layer_removal_cs`, documented in #4745. Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058, #5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split-offs: #5057, #5134.
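As a conceptual sketch only (names are hypothetical and this is not the actual implementation), the RAII idea can be pictured like this: a guard value keeps the layer file resident on local disk, and dropping the guard, and eventually the last reference, is what allows eviction or final deletion.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the unified layer type: it tracks internally
/// whether its file is currently present on local disk.
struct Layer {
    local_path: std::path::PathBuf,
}

/// RAII guard: while this value is alive, the layer file is kept resident
/// (for example because a freshly written layer still has to be uploaded).
struct ResidentGuard {
    layer: Arc<Layer>,
}

impl Drop for ResidentGuard {
    fn drop(&mut self) {
        // Dropping the guard releases the residence pin; once the last Arc to
        // the Layer goes away, eviction or post-GC deletion can proceed.
        let _ = &self.layer.local_path;
    }
}
```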
This PR does two things. First, it fixes a bug with tiered compaction's k-merge implementation: it ignored the lsn of a key during ordering, so multiple updates of the same key could be read in arbitrary order, say from different layers. For example, if there are layers `[(a, 2), (b, 3)]` and `[(a, 1), (c, 2)]` in the heap, they might return `(a, 2)` before `(a, 1)`. Ultimately, this change wasn't enough to fix the ordering issues in #7296; in other words, there are likely still bugs in the k-merge. So, as the second thing, we switch away from the k-merge to an in-memory sort, similar to #4839, but leave the k-merge code around to be improved and maybe switched to later on. Part of #7296
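The ordering issue can be illustrated with a toy example (types simplified, not the actual compaction code): when only the key participates in the comparison, `(a, 2)` and `(a, 1)` may come back in either order, whereas sorting on the full `(key, lsn)` tuple, as the in-memory approach does, pins the order down.

```rust
fn main() {
    // Entries as (key, lsn), as they might arrive interleaved from two layers.
    let mut entries = vec![("a", 2u64), ("b", 3), ("a", 1), ("c", 2)];

    // Compare by key first and then by lsn, so updates to the same key always
    // come out in LSN order; comparing by key alone leaves ("a", 2) vs ("a", 1)
    // unspecified, which is the bug described above.
    entries.sort_by(|x, y| x.0.cmp(y.0).then(x.1.cmp(&y.1)));

    assert_eq!(entries, vec![("a", 1), ("a", 2), ("b", 3), ("c", 2)]);
}
```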
## Problem

The k-merge in pageserver compaction currently relies on iterators over the keys and also over the values. This approach does not support async code because we are using iterators, and those don't support async in general. Also, the k-merge implementation we use doesn't support async either. Instead, as we already load all the keys into memory, we just do the sorting in memory.

## Summary of changes
The PR can be read commit-by-commit, but most importantly, it makes `load_keys` and `load_val_refs` async, using `Handle::block_on` in the compaction code, as we don't want to turn the compaction function, called inside `spawn_blocking`, into an async fn.

Builds on top of #4836, part of #4743
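A minimal sketch of the `Handle::block_on`-inside-`spawn_blocking` pattern described above (the `load_keys` stand-in is made up; only the tokio APIs are real):

```rust
use tokio::runtime::Handle;

// Stand-in for the now-async key loading done during compaction.
async fn load_keys() -> Vec<(u64, u64)> {
    Vec::new()
}

#[tokio::main]
async fn main() {
    // Compaction runs on the blocking thread pool, so the closure itself stays
    // synchronous rather than becoming an async fn.
    let keys = tokio::task::spawn_blocking(|| {
        // The runtime context is propagated into spawn_blocking, so
        // Handle::current() is available here and can drive the async fn.
        Handle::current().block_on(load_keys())
    })
    .await
    .expect("blocking task panicked");

    assert!(keys.is_empty());
}
```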