Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

Merged
merged 1 commit into from
Dec 12, 2022

Conversation

jbowens
Copy link
Collaborator

@jbowens jbowens commented Dec 7, 2022

Builds off of #2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of #2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in #2087, #2111, #2123
and #2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

a) the heap root is deleted by a range tombstone
b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)

Fix #2174.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jbowens jbowens requested review from itsbilal, sumeerbhola and a team and removed request for itsbilal December 7, 2022 23:27
Copy link
Contributor

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks for sprinkling the extra documentation around too.

:lgtm: - though perhaps wait for a second set of 👀.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)


-- commits line 18 at r1:
nit: final , -> ..


merging_iter.go line 559 at r1 (raw file):

	if m.prefix != nil {
		n := m.split(key)
		return !bytes.Equal(m.prefix, key[:n])

nit: I realize the equality check is sufficient here. However, in a strict sense, isn't this method behaving more like equalsPrefix? Worth updating the method, or actually doing the comparison for prefix(key) > m.prefix?

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new invariant checking in nextEntry revealed another problem. Range key masking may result in the interleaving iterator Next-ing beyond the current iteration prefix. I made a fix to avoid Next-ing beyond a masked point if the masked point is already beyond the iteration prefix. I also documented this new variant we're adding on the internal iterator comment. I expect this change to have a large benefit to MVCCGets that are retrieving within a large swath of masked point keys. Previously, we'd next until we exhausted the merging iterator stopped surfacing keys or we found a non-masked point key. I'll try to get some numbers from a benchmark today.

Reviewable status: 6 of 9 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


merging_iter.go line 559 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: I realize the equality check is sufficient here. However, in a strict sense, isn't this method behaving more like equalsPrefix? Worth updating the method, or actually doing the comparison for prefix(key) > m.prefix?

adjusted to unequalPrefix. I want to keep it in the negative (eg, as opposed to equalsPrefix) since if we're not in prefix iteration mode, it's a little nonsensical to return true from equalsPrefix.

@jbowens jbowens force-pushed the merging-prefix branch 3 times, most recently from d04149b to 41160ce Compare December 8, 2022 19:26
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that the code is simpler and we're imposing stronger invariants.
My only questions are related to performance.

Reviewed 2 of 3 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 8 of 10 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)


-- commits line 38 at r3:
is b) rare?
This was a concern raised in #2151 (review)


merging_iter.go line 556 at r3 (raw file):

// unequalPrefix returns true if the iterator is in prefix iteration mode and
// the provided key does not contain the current iteration prefix.
func (m *mergingIter) unequalPrefix(key []byte) bool {

is this performance sensitive? If yes, does this inline?


merging_iter.go line 568 at r3 (raw file):

// exhausts the iterator by resetting the heap to empty if the iteration prefix
// has already been exceeded.
func (m *mergingIter) maybeNextEntry(item *mergingIterItem) {

same inlining question.

Copy link
Contributor

@itsbilal itsbilal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, also really like that the code is simpler now in findNextEntry.

:lgtm: aside from Sumeer's points.

Reviewed 5 of 6 files at r1, 3 of 3 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @nicktrav)


-- commits line 18 at r1:

Previously, nicktrav (Nick Travers) wrote…

nit: final , -> ..

also nit: missing #2151.


internal/keyspan/interleaving_iter.go line 692 at r4 (raw file):

						// iterating too far beyond the prefix when there's a
						// broad swath of point keys masked by a range key.
						if i.cmp(i.pointKey.UserKey, i.nextPrefixBuf) >= 0 {

Nice

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


-- commits line 38 at r3:

Previously, sumeerbhola wrote…

is b) rare?
This was a concern raised in #2151 (review)

i'm not sure. i originally thought it was because snapshots are rare in Cockroach (and prefix iteration through a snapshot more rare, I believe), and the mutable memtable keys make up a small portion of the keys in a database. in thinking about it some more, i realized that successful bloom filter exclusions make it more likely that the next key in the mutable memtable will be a heap root.

an alternative: we require InternalIterator to return keys within the iteration prefix. if n levels are unable to efficiently exclude the level through bloom filter checks, etc, then we must perform n new prefix equality checks. however if k prefix equality checks then exclude those levels, we reduce the size of the merging iterator heap to n-k, which reduces the heapify key comparisons by O(k).

in the (typical?) case that there are 0 or 1 keys with the prefix (even if n is large), there's no regression. instead of our current O(n) heapify key comparisons plus one prefix check in the top-level Iterator, we'd perform O(n) faster prefix checks.

in the case where n-k ≥ 2 (multiple levels contain keys with the iteration prefix), i don't see how we'd avoid some regression, at least on seeks that do not then scan the remainder of the prefix.


merging_iter.go line 556 at r3 (raw file):

Previously, sumeerbhola wrote…

is this performance sensitive? If yes, does this inline?

it did not inline, so I manually inlined it.


merging_iter.go line 568 at r3 (raw file):

Previously, sumeerbhola wrote…

same inlining question.

this one also did not inline. I refactored it into a separate method that's only invoked within prefix iteration mode, so that only prefix-iteration mode needs to suffer the function call overhead.

@jbowens jbowens force-pushed the merging-prefix branch 2 times, most recently from 841c72e to b83901c Compare December 9, 2022 01:18
Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 9 of 10 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


internal/keyspan/interleaving_iter.go line 687 at r5 (raw file):

						// all subsequent keys with the same prefix must also be
						// masked according to the key ordering. We can stop and
						// return nil.

missed this opportunity the first time around

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 6 files at r1, 1 of 3 files at r2, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)


-- commits line 38 at r3:

i realized that successful bloom filter exclusions make it more likely that the next key in the mutable memtable will be a heap root.

Hmm, I am a bit confused. What I was concerned about earlier is

		if !item.key.Visible(m.snapshot, m.batchSnapshot) {
			if m.prefix == nil {
				m.nextEntry(item)
			} else {
				m.maybeNextEntryWithinPrefix(item)
			}
			continue
		}

where the key found repeatedly matches the prefix and is not visible. And so we are doing repeated prefix comparisons when previously we did not. I buy the fact that we shouldn't be concerned about snapshots. So the remaining concern is a batch where the key is not visible due to the value of m.batchSnapshot. I didn't quite follow what that has to do with mutable memtables. Also, I think my worry was contrived -- I suspect this is not a real concern.

You raised a good point in that to keep iterating we are doing O(log heap-size) comparisons for each iteration step. And we are adding 1 more prefix comparison.
If the top element is a prefix and not visible, it is likely to keep coming from the same level, so there are 2 heap comparisons involved (if there are >= 3 element in the heap) per step. So 2 heap comparisons + 1 prefix comparison. If that level would have its InternalIterator do the prefix comparison instead (as you outlined), it would just shift that prefix comparison into that InternalIterator. So it doesn't really save during iteration.

in the (typical?) case that there are 0 or 1 keys with the prefix (even if n is large), there's no regression. instead of our current O(n) heapify key comparisons plus one prefix check in the top-level Iterator, we'd perform O(n) faster prefix checks.

This is compelling. I am not sure we need to do this now -- perhaps you could make an issue for a future optimization?


-- commits line 47 at r6:
nice


merging_iter.go line 555 at r6 (raw file):

// maybeNextEntryWithinPrefix steps to the next entry, as long as the iteration
// prefix has not alreday been exceeded. If it has, it exhausts the iterator by

nit: already

Copy link
Collaborator Author

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tftrs!

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)


-- commits line 38 at r3:

I didn't quite follow what that has to do with mutable memtables.

I think an iterator's mutable memtable iterator can encounter keys (in the arenaskl skiplist) committed after the iterator was opened, so these newer keys would be filtered out here.

perhaps you could make an issue for a future optimization?

filed #2182.

Builds off of cockroachdb#2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of cockroachdb#2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123
and cockroachdb#2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (cockroachdb#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

  a) the heap root is deleted by a range tombstone
  b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

```
name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)
```

Fix cockroachdb#2174.
@jbowens jbowens merged commit 0604785 into cockroachdb:master Dec 12, 2022
@jbowens jbowens deleted the merging-prefix branch December 12, 2022 17:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

github.com/cockroachdb/pebble/internal/metamorphic: TestMeta failed
5 participants