
Compaction: sort on slices directly instead of kmerge #4839

Merged
merged 2 commits into main from arpad/pageserver_io_async_kmerge_slice on Aug 3, 2023

Conversation

@arpad-m (Member) commented Jul 28, 2023

Problem

The k-merge in pageserver compaction currently relies on iterators over the keys and over the values. This approach does not support async code: we use iterators, and those don't support async in general, and the k-merge implementation we use doesn't support async either. Instead, as we already load all the keys into memory, we just sort in-memory.
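As a rough illustration (with hypothetical, simplified types; the real code works on keys, lsns, and sizes from delta layers), the change amounts to replacing a k-way merge over per-layer iterators with a collect-and-sort:

```rust
// Minimal sketch, not the actual pageserver code: instead of k-merging
// per-layer iterators, collect everything into one Vec and sort it.
fn merge_keys(per_layer: Vec<Vec<(u64, u64)>>) -> Vec<(u64, u64)> {
    let mut all: Vec<(u64, u64)> = per_layer.into_iter().flatten().collect();
    // Each inner Vec was already sorted; a single sort replaces the k-merge.
    all.sort_unstable();
    all
}

fn main() {
    let merged = merge_keys(vec![vec![(1, 10), (3, 10)], vec![(2, 10)]]);
    assert_eq!(merged, vec![(1, 10), (2, 10), (3, 10)]);
}
```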

Summary of changes

The PR can be read commit-by-commit, but most importantly, it:

  • Stops using kmerge in compaction, using slice sorting instead.
  • Makes load_keys and load_val_refs async, using Handle::block_on in the compaction code, as we don't want to turn the compaction function, which is called inside spawn_blocking, into an async fn (see the sketch below).
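A minimal sketch of the `Handle::block_on` pattern, assuming hypothetical function names rather than the real loaders: since compaction runs on a blocking-pool thread via `spawn_blocking`, it can synchronously block on the now-async functions.

```rust
use tokio::runtime::Handle;

// Hypothetical stand-in for the now-async load_keys/load_val_refs.
async fn load_keys() -> Vec<u64> {
    vec![3, 1, 2]
}

// Synchronous compaction function, run inside spawn_blocking.
fn compact(handle: &Handle) -> Vec<u64> {
    // block_on is fine here: we are on a blocking-pool thread,
    // not on an async worker thread.
    let mut keys = handle.block_on(load_keys());
    keys.sort_unstable();
    keys
}

#[tokio::main]
async fn main() {
    let handle = Handle::current();
    let keys = tokio::task::spawn_blocking(move || compact(&handle))
        .await
        .unwrap();
    assert_eq!(keys, vec![1, 2, 3]);
}
```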

Builds on top of #4836, part of #4743

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat the commit message so it does not include the above checklist.

@github-actions bot commented Jul 28, 2023

1264 tests run: 1213 passed, 0 failed, 51 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_crafted_wal_end[wal_record_crossing_segment_followed_by_small_one]: debug

@koivunej koivunej changed the base branch from main to arpad/pageserver_io_async_kmerge July 31, 2023 11:23
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch 2 times, most recently from 374cef8 to 92784e3, July 31, 2023 18:18
Base automatically changed from arpad/pageserver_io_async_kmerge to main August 1, 2023 11:38
@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from 92784e3 to 10d2740, August 1, 2023 11:43
@arpad-m arpad-m marked this pull request as ready for review August 1, 2023 11:44
@arpad-m arpad-m requested review from a team as code owners August 1, 2023 11:44
@arpad-m arpad-m requested review from save-buffer and koivunej and removed request for a team August 1, 2023 11:44
@koivunej (Member) left a comment
I understand this already implements what I suggested; a single walk to get sizes and valuerefs. Looking good.

Always enjoy a decrease in:

  • number of virtual calls
  • amount of duplicated in-mem contents
  • lines of code

@arpad-m (Member, Author) commented Aug 1, 2023

> I understand this already implements what I suggested; a single walk to get sizes and valuerefs. Looking good.

Indeed, I have pushed a commit to merge the two functions, but now there is a failure in `tenant::tests::test_random_updates`, and I can reproduce it locally, so it seems to be valid. I am still investigating, but the `load_keys` function returns early if `last.0 == delta_key.key()` holds, while the `load_val_refs` function doesn't do that.

@arpad-m (Member, Author) commented Aug 1, 2023

Mhh, yeah, if I print the sizes of the arrays, it seems that the key search has far fewer items than the value search:

val refs len: 1000, becoming  1000
val refs len: 1000, becoming  2000
val refs len: 1000, becoming  3000
val refs len: 1000, becoming  4000
val refs len: 1000, becoming  5000
val refs len: 1000, becoming  6000
val refs len: 1000, becoming  7000
val refs len: 1000, becoming  8000
val refs len: 1000, becoming  9000
val refs len: 1000, becoming 10000
keys     len:  641, becoming   641
keys     len:  635, becoming  1276
keys     len:  653, becoming  1929
keys     len:  619, becoming  2548
keys     len:  627, becoming  3175
keys     len:  629, becoming  3804
keys     len:  631, becoming  4435
keys     len:  627, becoming  5062
keys     len:  625, becoming  5687
keys     len:  642, becoming  6329

The failure is gone if I remove the `return true` from the `load_keys` function, but that might mean that we now iterate over more keys than needed...
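A hypothetical sketch of the behaviour under discussion (the real visitor callback in the pageserver differs): if the early return folds entries that repeat the previous key into one, the keys vec naturally ends up shorter than the val-refs vec, matching the counts printed above.

```rust
// Hypothetical sketch, not the actual pageserver code: a b-tree visit
// callback that folds entries sharing a key into the previous keys-vec
// entry. Returning true tells the (hypothetical) visitor to continue.
fn on_entry(keys: &mut Vec<(u64, u64)>, key: u64, size: u64) -> bool {
    if let Some(last) = keys.last_mut() {
        if last.0 == key {
            // Same key as the previous entry: accumulate the size and return
            // early -- this is why `keys` has fewer items than the value refs.
            last.1 += size;
            return true;
        }
    }
    keys.push((key, size));
    true
}

fn main() {
    let mut keys = Vec::new();
    for &(k, s) in &[(1, 10), (1, 20), (2, 5)] {
        on_entry(&mut keys, k, s);
    }
    assert_eq!(keys, vec![(1, 30), (2, 5)]); // 3 value refs, 2 key entries
}
```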

@arpad-m (Member, Author) commented Aug 2, 2023

Okay, more investigation reveals that the code in question comes from #1927, more specifically the commit "Optimize stoage keys iterator" (link). It feels weird to me that we exit the search early, and also that we extend the length of the last key to cover the entire rest of the file.

@arpad-m (Member, Author) commented Aug 2, 2023

I've pushed a commit which I hope fixes the test failures... but I don't understand why this prior optimization worked. I think that, due to the reordering, it's not possible to reach a similarly optimized state where a chunk of the keys is dropped once we have a shared list (for tenant::tests::test_random_updates it's roughly half the keys, as observable above)... I'm not sure a shared list is worth it, then.

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch 2 times, most recently from 64cbebf to ab36f4a, August 2, 2023 20:54
@arpad-m (Member, Author) commented Aug 2, 2023

I've looked at what is done with the keys, and I am now more confident that all the merging did was ensure that blobs for the same key (but different lsns) would end up in the same layer file. To preserve that, I've added a commit to coalesce the iterator. It won't have precisely the same merging behaviour, though: the old approach would not merge key pairs from different files, but in the new vec we have no good way to distinguish files, so instead we just look at the size: if the combined key reaches the target file size, we don't merge. This might still create files larger than the target file size, but it will stay below double the target file size.
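A hedged sketch of that coalescing rule, with a simplified entry type (the real entries carry lsns and value references too): an entry is merged into its predecessor only while they share a key and the accumulated size is still below the target file size.

```rust
// Hypothetical sketch of the coalescing described above: adjacent entries
// with the same key are merged, unless the accumulated size has reached the
// target file size, in which case a new entry is cut so the downstream code
// can start a new layer file.
fn coalesce(entries: Vec<(u64, u64)>, target_file_size: u64) -> Vec<(u64, u64)> {
    let mut out: Vec<(u64, u64)> = Vec::new();
    for (key, size) in entries {
        match out.last_mut() {
            Some((last_key, last_size))
                if *last_key == key && *last_size < target_file_size =>
            {
                *last_size += size; // same key and still under threshold: merge
            }
            _ => out.push((key, size)),
        }
    }
    out
}

fn main() {
    // The first two entries for key 1 merge; the third would push past the
    // threshold and starts a new entry. Result stays below double the target.
    let merged = coalesce(vec![(1, 60), (1, 50), (1, 40), (2, 10)], 100);
    assert_eq!(merged, vec![(1, 110), (1, 40), (2, 10)]);
}
```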

@arpad-m arpad-m force-pushed the arpad/pageserver_io_async_kmerge_slice branch from dc9889b to 10d2740, August 3, 2023 12:15
@arpad-m (Member, Author) commented Aug 3, 2023

I've pushed an earlier state of the PR (with two separate vecs) to this branch to enable faster merging. This unblocks some downstream refactors and separates the more involved changes from the ones that are just refactors. For the changes that were originally part of this PR, see #4888.

@arpad-m arpad-m merged commit 416c14b into main Aug 3, 2023
56 checks passed
@arpad-m arpad-m deleted the arpad/pageserver_io_async_kmerge_slice branch August 3, 2023 13:30
arpad-m added a commit that referenced this pull request Aug 3, 2023
## Problem

`DiskBtreeReader::get` and `DiskBtreeReader::visit` both call `read_blk`
internally, which we would like to make async in the future. This PR
focuses on making the interface of these two functions `async`. Further
work remains to make `visit` non-recursive, similar to #4838; for that,
see #4884.

Builds on top of #4839, part of
#4743

## Summary of changes

Make `DiskBtreeReader::get` and `DiskBtreeReader::visit` async functions
and `await` in the places that call these functions.
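A minimal sketch of what this kind of interface change looks like, with hypothetical, simplified signatures rather than the real `DiskBtreeReader` API:

```rust
// Hypothetical, simplified stand-in: `get` becomes async because it awaits
// `read_blk`, and callers now `.await` it.
struct DiskBtreeReader;

impl DiskBtreeReader {
    async fn read_blk(&self, _blknum: u32) -> std::io::Result<Vec<u8>> {
        Ok(vec![]) // stand-in for the real block read
    }

    async fn get(&self, _key: &[u8]) -> std::io::Result<Option<u64>> {
        let _block = self.read_blk(0).await?; // the call that forces async
        Ok(None)
    }
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let reader = DiskBtreeReader;
    assert_eq!(reader.get(b"key").await?, None);
    Ok(())
}
```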
arpad-m added a commit that referenced this pull request Aug 4, 2023
## Problem

The functions `DeltaLayer::load_inner` and `ImageLayer::load_inner` call
`read_blk` internally, which we would like to turn into an async fn.

## Summary of changes

We switch from `once_cell`'s `OnceCell` implementation to the one in
`tokio` in order to be able to call an async `get_or_try_init` function.

Builds on top of #4839, part of #4743
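A sketch of the `OnceCell` switch under simplified, hypothetical types; the point is that `tokio::sync::OnceCell::get_or_try_init` accepts an async initializer, which `once_cell`'s synchronous API cannot:

```rust
use tokio::sync::OnceCell;

// Hypothetical, simplified stand-in for DeltaLayer/ImageLayer state.
struct Layer {
    inner: OnceCell<String>,
}

impl Layer {
    async fn load_inner(&self) -> Result<&String, std::io::Error> {
        self.inner
            .get_or_try_init(|| async {
                // Async work such as read_blk can now be awaited here.
                Ok("loaded".to_string())
            })
            .await
    }
}

#[tokio::main]
async fn main() {
    let layer = Layer { inner: OnceCell::new() };
    // The initializer runs once; later calls return the cached value.
    assert_eq!(layer.load_inner().await.unwrap(), "loaded");
}
```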
arpad-m added a commit that referenced this pull request Aug 7, 2023
## Problem

PR #4839 didn't output the keys/values in lsn order; for a given key,
the lsns were kept in incoming file order.

I think the ordering by lsn is expected.

## Summary of changes

We now also sort by `(key, lsn)`, like we did before #4839.
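A one-screen sketch of the fix described here, under the assumption that entries are simple `(key, lsn)` pairs rather than the real compaction records:

```rust
fn main() {
    // Updates for key 1 arrived in file order (lsn 7 before lsn 3).
    let mut entries = vec![(2u64, 5u64), (1, 7), (1, 3)];
    // Sorting by (key, lsn) restores the expected per-key lsn order.
    entries.sort_by_key(|&(key, lsn)| (key, lsn));
    assert_eq!(entries, vec![(1, 3), (1, 7), (2, 5)]);
}
```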
arpad-m added a commit that referenced this pull request Aug 16, 2023
## Problem

PR #4839 has already reduced the number of b-tree traversals and vec
creations from 3 to 2, but as pointed out in
#4839 (comment) ,
we would ideally just traverse the b-tree once during compaction.

After #4836, the two vecs created are one for the list of keys, lsns and
sizes, and one for the list of `(key, lsn, value reference)`. However,
they are not equal, as pointed out in
#4839 (comment)
and the following comment: the key vec creation combines multiple
entries for which the lsn is changing but the key stays the same into
one, with the size being the sum of the sub-sizes. In SQL, this would
correspond to something like `SELECT key, MIN(lsn), SUM(size) FROM b_tree
GROUP BY key;` and `SELECT key, lsn, val_ref FROM b_tree;`. Therefore,
the join operation is non-trivial.

## Summary of changes

This PR merges the two lists of keys and value references into one. It's
not a trivial change and affects the size pattern of the resulting
files, which is why this is in a separate PR from #4839 .

The key vec is used in compaction to determine when to start a new layer
file. The loop uses various thresholds to come to this conclusion, but
due to the grouping by key, regardless of the threshold it only starts a
new file when either a new key or a new delta file is encountered.

The new code now does the combination after the merging and sorting of
the various keys from the delta files. This *mostly* does the same as
the old code, except for a detail: with the grouping done on a
per-delta-layer basis, the sorted and merged vec would still have
multiple entries for multiple delta files, but now we don't have an
easy way to tell when a new input delta layer file is encountered, so we
cannot easily create multiple entries on that basis.

To prevent possibly unbounded growth, the new grouping code compares the
combined size with the threshold and, if it is exceeded, cuts a new
entry so that the downstream code can cut a new output file. This is a
tradeoff, however: if the threshold is too small, we risk putting
entries for the same key into multiple layer files, but if the threshold
is too big, we can in some instances exceed the target size.

Currently, we set the threshold to the target size, so in theory we
would stay below or roughly at double the `target_file_size`.

We also fix the way the size was calculated for the last key. The calculation
was wrong and accounted for the old layer's btree, even though we
already account for the overhead of the in-construction btree.

Builds on top of #4839 .
koivunej added a commit that referenced this pull request Oct 26, 2023
…#4938)

Implement a new `struct Layer` abstraction which manages downloadness
internally, requiring no LayerMap locking or rewriting to download or
evict, providing the property "if you have a layer, you can read it". The
new `struct Layer` provides the ability to keep the file resident via a
RAII structure for new layers which still need to be uploaded. The
previous solution used `RemoteTimelineClient::wait_completion` for this,
which led to bugs like #5639. Eviction, and the final local deletion
after garbage collection, are done via the Arc'd value's `Drop`.

With a single `struct Layer`, the open-ended `trait Layer`, `trait
PersistentLayer`, and `struct RemoteLayer` are removed, following the
observation that compaction could be simplified by simply not using any
of the traits in between: #4839.

The new `struct Layer` is a preliminary to remove
`Timeline::layer_removal_cs` documented in #4745.

Preliminaries: #4936, #4937, #5013, #5014, #5022, #5033, #5044, #5058,
#5059, #5061, #5074, #5103, epic #5172, #5645, #5649. Related split off:
#5057, #5134.
arpad-m added a commit that referenced this pull request May 9, 2024
This PR does two things:

First, it fixes a bug with tiered compaction's k-merge implementation.
It ignored the lsn of a key during ordering, so multiple updates of the
same key could be read in arbitrary order, say from different layers.
For example, if the layers `[(a, 2), (b, 3)]` and `[(a, 1), (c, 2)]` are
in the heap, they might return `(a, 2)` before `(a, 1)`.

Ultimately, this change wasn't enough to fix the ordering issues in
#7296; in other words, there are likely still bugs in the k-merge. So,
as the second thing, we switch away from the k-merge to an in-memory
sort, similar to #4839, but leave the code around to be improved and
maybe switched back to later on.

Part of #7296
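A hedged sketch of the ordering fix, using a plain `BinaryHeap` rather than the actual tiered-compaction merge structures: once the lsn participates in the heap ordering, `(a, 1)` is popped before `(a, 2)`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

fn main() {
    let mut heap = BinaryHeap::new();
    // Entries from two layers, pushed in arbitrary (file) order.
    for &(key, lsn) in &[('a', 2), ('b', 3), ('a', 1), ('c', 2)] {
        // Reverse turns the max-heap into a min-heap over (key, lsn),
        // so the lsn takes part in the ordering instead of being ignored.
        heap.push(Reverse((key, lsn)));
    }
    let mut ordered = Vec::new();
    while let Some(Reverse(entry)) = heap.pop() {
        ordered.push(entry);
    }
    assert_eq!(ordered, vec![('a', 1), ('a', 2), ('b', 3), ('c', 2)]);
}
```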
a-masterov pushed a commit that referenced this pull request May 20, 2024