db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

jbowens · 2022-12-07T23:26:48Z

Builds off of #2151. This commit refactors the merging iterator's
handling of prefix-related invariants for enabling TrySeekUsingNext
optimizations when using SeekPrefixGE.

The TrySeekUsingNext optimization notes successively larger seek keys,
and attempts to use next to find the appropriate key in the later seeks,
avoiding seeking from scratch. In a level's file metadata, we now use
the presence of the TrySeekUsingNext flag to signal that we can Next
through the file metadata. This saves key comparisons, especially in
larger LSMs with many files within a level.

Ever since the introduction of #2048, we've encountered a slew of
metamorphic test failures due to interactions between this optimization
and prefix iteration. We've made various fixes in #2087, #2111, #2123
and #2151.

Most of these failures revolve around the progression of a level
iterator beyond the iterator's current iteration prefix. This is
problematic, because the incomplete view of the LSM during prefix
iteration can result in inconsistent positioning outside of the current
iterator prefix. The internal iterators do not guarantee that keys or
range deletions outside the prefix are observed.

This commit fixes the latest instance of this (#2174), and attempts to
restructure the code to clarify the required invariants. Now, any method
that may advance the iterator during prefix iteration (eg, nextEntry or
isNextEntryDeleted) check that the iteration prefix has not already been
exceeded. This check is only performed when the merging iterator is
advancing internal iterators of its own volition (eg, due to non-visible
keys or keys deleted by range tombstones). It does NOT perform this
check when the top-level Iterator Nexts the merging iterator because
of a user Next, a point-delete, etc. This means these prefix checks are
only redundant with the top-level Iterator Nexts in two cases:

a) the heap root is deleted by a range tombstone
b) the heap root is not visible at the iterator's sequence number

Both of these cases are expected to be rare, and the redundant prefix
check is expected to be insignificant.

Additionally, to obey these new stricter invariants, the interleaving
iterator is adjusted to exhaust the iterator when the remainder of a
prefix is masked by a range key. Besides ensuring invariants are
respected, this change provides a large benefit to prefix iteration
within a large swath of masked point keys.

name                                                                     old time/op  new time/op  delta
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24   59.8ms ± 2%  60.3ms ± 2%   +0.92%  (p=0.013 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24         4.50ms ± 6%  4.52ms ± 3%     ~     (p=0.388 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24             5.64ms ± 7%  5.51ms ± 3%   -2.26%  (p=0.001 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24   64.0ms ± 5%  60.2ms ± 4%   -5.89%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24         4.11ms ± 3%  4.16ms ± 5%   +1.06%  (p=0.040 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24             4.94ms ± 3%  4.89ms ± 3%   -0.97%  (p=0.011 n=25+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24   69.0ms ± 2%  60.1ms ± 2%  -12.88%  (p=0.000 n=22+24)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24         2.55ms ± 2%  2.59ms ± 3%   +1.90%  (p=0.000 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24             3.20ms ± 2%  3.21ms ± 3%     ~     (p=0.418 n=25+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24   2.19s ± 3%   0.06s ± 2%  -97.27%  (p=0.000 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24         346µs ± 2%   346µs ± 2%     ~     (p=0.898 n=24+25)
Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24             687µs ± 3%   696µs ± 2%   +1.31%  (p=0.000 n=25+25)

Fix #2174.

cockroach-teamcity · 2022-12-07T23:26:54Z

This change is

nicktrav

Nice. Thanks for sprinkling the extra documentation around too.

- though perhaps wait for a second set of 👀.

Reviewed 6 of 6 files at r1, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @itsbilal, @jbowens, and @sumeerbhola)

-- commits line 18 at r1:
nit: final , -> ..

merging_iter.go line 559 at r1 (raw file):

	if m.prefix != nil {
		n := m.split(key)
		return !bytes.Equal(m.prefix, key[:n])

nit: I realize the equality check is sufficient here. However, in a strict sense, isn't this method behaving more like equalsPrefix? Worth updating the method, or actually doing the comparison for prefix(key) > m.prefix?

jbowens

The new invariant checking in nextEntry revealed another problem. Range key masking may result in the interleaving iterator Next-ing beyond the current iteration prefix. I made a fix to avoid Next-ing beyond a masked point if the masked point is already beyond the iteration prefix. I also documented this new variant we're adding on the internal iterator comment. I expect this change to have a large benefit to MVCCGets that are retrieving within a large swath of masked point keys. Previously, we'd next until we exhausted the merging iterator stopped surfacing keys or we found a non-masked point key. I'll try to get some numbers from a benchmark today.

Reviewable status: 6 of 9 files reviewed, 2 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

merging_iter.go line 559 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

nit: I realize the equality check is sufficient here. However, in a strict sense, isn't this method behaving more like equalsPrefix? Worth updating the method, or actually doing the comparison for prefix(key) > m.prefix?

adjusted to unequalPrefix. I want to keep it in the negative (eg, as opposed to equalsPrefix) since if we're not in prefix iteration mode, it's a little nonsensical to return true from equalsPrefix.

sumeerbhola

I really like that the code is simpler and we're imposing stronger invariants.
My only questions are related to performance.

Reviewed 2 of 3 files at r2, 1 of 2 files at r3, all commit messages.
Reviewable status: 8 of 10 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

-- commits line 38 at r3:
is b) rare?
This was a concern raised in #2151 (review)

merging_iter.go line 556 at r3 (raw file):

// unequalPrefix returns true if the iterator is in prefix iteration mode and
// the provided key does not contain the current iteration prefix.
func (m *mergingIter) unequalPrefix(key []byte) bool {

is this performance sensitive? If yes, does this inline?

merging_iter.go line 568 at r3 (raw file):

// exhausts the iterator by resetting the heap to empty if the iteration prefix
// has already been exceeded.
func (m *mergingIter) maybeNextEntry(item *mergingIterItem) {

same inlining question.

itsbilal

Same here, also really like that the code is simpler now in findNextEntry.

aside from Sumeer's points.

Reviewed 5 of 6 files at r1, 3 of 3 files at r2, 1 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @jbowens and @nicktrav)

-- commits line 18 at r1:

Previously, nicktrav (Nick Travers) wrote…

nit: final , -> ..

also nit: missing #2151.

internal/keyspan/interleaving_iter.go line 692 at r4 (raw file):

						// iterating too far beyond the prefix when there's a
						// broad swath of point keys masked by a range key.
						if i.cmp(i.pointKey.UserKey, i.nextPrefixBuf) >= 0 {

Nice

jbowens

Reviewable status: all files reviewed, 4 unresolved discussions (waiting on @nicktrav and @sumeerbhola)

-- commits line 38 at r3:

Previously, sumeerbhola wrote…

is b) rare?
This was a concern raised in #2151 (review)

i'm not sure. i originally thought it was because snapshots are rare in Cockroach (and prefix iteration through a snapshot more rare, I believe), and the mutable memtable keys make up a small portion of the keys in a database. in thinking about it some more, i realized that successful bloom filter exclusions make it more likely that the next key in the mutable memtable will be a heap root.

an alternative: we require InternalIterator to return keys within the iteration prefix. if n levels are unable to efficiently exclude the level through bloom filter checks, etc, then we must perform n new prefix equality checks. however if k prefix equality checks then exclude those levels, we reduce the size of the merging iterator heap to n-k, which reduces the heapify key comparisons by O(k).

in the (typical?) case that there are 0 or 1 keys with the prefix (even if n is large), there's no regression. instead of our current O(n) heapify key comparisons plus one prefix check in the top-level Iterator, we'd perform O(n) faster prefix checks.

in the case where n-k ≥ 2 (multiple levels contain keys with the iteration prefix), i don't see how we'd avoid some regression, at least on seeks that do not then scan the remainder of the prefix.

merging_iter.go line 556 at r3 (raw file):

Previously, sumeerbhola wrote…

is this performance sensitive? If yes, does this inline?

it did not inline, so I manually inlined it.

merging_iter.go line 568 at r3 (raw file):

Previously, sumeerbhola wrote…

same inlining question.

this one also did not inline. I refactored it into a separate method that's only invoked within prefix iteration mode, so that only prefix-iteration mode needs to suffer the function call overhead.

jbowens

Reviewable status: 9 of 10 files reviewed, 5 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

internal/keyspan/interleaving_iter.go line 687 at r5 (raw file):

						// all subsequent keys with the same prefix must also be
						// masked according to the key ordering. We can stop and
						// return nil.

missed this opportunity the first time around

sumeerbhola

Reviewed 3 of 6 files at r1, 1 of 3 files at r2, 1 of 1 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: all files reviewed, 7 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

-- commits line 38 at r3:

i realized that successful bloom filter exclusions make it more likely that the next key in the mutable memtable will be a heap root.

Hmm, I am a bit confused. What I was concerned about earlier is

		if !item.key.Visible(m.snapshot, m.batchSnapshot) {
			if m.prefix == nil {
				m.nextEntry(item)
			} else {
				m.maybeNextEntryWithinPrefix(item)
			}
			continue
		}

where the key found repeatedly matches the prefix and is not visible. And so we are doing repeated prefix comparisons when previously we did not. I buy the fact that we shouldn't be concerned about snapshots. So the remaining concern is a batch where the key is not visible due to the value of m.batchSnapshot. I didn't quite follow what that has to do with mutable memtables. Also, I think my worry was contrived -- I suspect this is not a real concern.

You raised a good point in that to keep iterating we are doing O(log heap-size) comparisons for each iteration step. And we are adding 1 more prefix comparison.
If the top element is a prefix and not visible, it is likely to keep coming from the same level, so there are 2 heap comparisons involved (if there are >= 3 element in the heap) per step. So 2 heap comparisons + 1 prefix comparison. If that level would have its InternalIterator do the prefix comparison instead (as you outlined), it would just shift that prefix comparison into that InternalIterator. So it doesn't really save during iteration.

in the (typical?) case that there are 0 or 1 keys with the prefix (even if n is large), there's no regression. instead of our current O(n) heapify key comparisons plus one prefix check in the top-level Iterator, we'd perform O(n) faster prefix checks.

This is compelling. I am not sure we need to do this now -- perhaps you could make an issue for a future optimization?

-- commits line 47 at r6:
nice

merging_iter.go line 555 at r6 (raw file):

// maybeNextEntryWithinPrefix steps to the next entry, as long as the iteration
// prefix has not alreday been exceeded. If it has, it exhausts the iterator by

nit: already

jbowens

tftrs!

Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @itsbilal, @nicktrav, and @sumeerbhola)

-- commits line 38 at r3:

I didn't quite follow what that has to do with mutable memtables.

I think an iterator's mutable memtable iterator can encounter keys (in the arenaskl skiplist) committed after the iterator was opened, so these newer keys would be filtered out here.

perhaps you could make an issue for a future optimization?

filed #2182.

Builds off of cockroachdb#2151. This commit refactors the merging iterator's handling of prefix-related invariants for enabling TrySeekUsingNext optimizations when using SeekPrefixGE. The TrySeekUsingNext optimization notes successively larger seek keys, and attempts to use next to find the appropriate key in the later seeks, avoiding seeking from scratch. In a level's file metadata, we now use the presence of the TrySeekUsingNext flag to signal that we can Next through the file metadata. This saves key comparisons, especially in larger LSMs with many files within a level. Ever since the introduction of cockroachdb#2048, we've encountered a slew of metamorphic test failures due to interactions between this optimization and prefix iteration. We've made various fixes in cockroachdb#2087, cockroachdb#2111, cockroachdb#2123 and cockroachdb#2151. Most of these failures revolve around the progression of a level iterator beyond the iterator's current iteration prefix. This is problematic, because the incomplete view of the LSM during prefix iteration can result in inconsistent positioning outside of the current iterator prefix. The internal iterators do not guarantee that keys or range deletions outside the prefix are observed. This commit fixes the latest instance of this (cockroachdb#2174), and attempts to restructure the code to clarify the required invariants. Now, any method that may advance the iterator during prefix iteration (eg, nextEntry or isNextEntryDeleted) check that the iteration prefix has not already been exceeded. This check is only performed when the merging iterator is advancing internal iterators of its own volition (eg, due to non-visible keys or keys deleted by range tombstones). It does NOT perform this check when the top-level Iterator Nexts the merging iterator because of a user Next, a point-delete, etc. This means these prefix checks are only redundant with the top-level Iterator Nexts in two cases: a) the heap root is deleted by a range tombstone b) the heap root is not visible at the iterator's sequence number Both of these cases are expected to be rare, and the redundant prefix check is expected to be insignificant. Additionally, to obey these new stricter invariants, the interleaving iterator is adjusted to exhaust the iterator when the remainder of a prefix is masked by a range key. Besides ensuring invariants are respected, this change provides a large benefit to prefix iteration within a large swath of masked point keys. ``` name old time/op new time/op delta Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/seekprefix-24 59.8ms ± 2% 60.3ms ± 2% +0.92% (p=0.013 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@10/forward/next-24 4.50ms ± 6% 4.52ms ± 3% ~ (p=0.388 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@10/backward-24 5.64ms ± 7% 5.51ms ± 3% -2.26% (p=0.001 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/seekprefix-24 64.0ms ± 5% 60.2ms ± 4% -5.89% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@50/forward/next-24 4.11ms ± 3% 4.16ms ± 5% +1.06% (p=0.040 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@50/backward-24 4.94ms ± 3% 4.89ms ± 3% -0.97% (p=0.011 n=25+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/seekprefix-24 69.0ms ± 2% 60.1ms ± 2% -12.88% (p=0.000 n=22+24) Iterator_RangeKeyMasking/range-keys-suffixes=@75/forward/next-24 2.55ms ± 2% 2.59ms ± 3% +1.90% (p=0.000 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@75/backward-24 3.20ms ± 2% 3.21ms ± 3% ~ (p=0.418 n=25+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/seekprefix-24 2.19s ± 3% 0.06s ± 2% -97.27% (p=0.000 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/forward/next-24 346µs ± 2% 346µs ± 2% ~ (p=0.898 n=24+25) Iterator_RangeKeyMasking/range-keys-suffixes=@100/backward-24 687µs ± 3% 696µs ± 2% +1.31% (p=0.000 n=25+25) ``` Fix cockroachdb#2174.

jbowens requested review from itsbilal, sumeerbhola and a team and removed request for itsbilal December 7, 2022 23:27

nicktrav approved these changes Dec 8, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch from a502d7a to 45ba088 Compare December 8, 2022 17:00

jbowens commented Dec 8, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch 3 times, most recently from d04149b to 41160ce Compare December 8, 2022 19:26

sumeerbhola requested changes Dec 8, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch from 41160ce to aaa62c8 Compare December 8, 2022 20:20

itsbilal approved these changes Dec 8, 2022

View reviewed changes

jbowens commented Dec 8, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch 2 times, most recently from 841c72e to b83901c Compare December 9, 2022 01:18

jbowens commented Dec 9, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch from b83901c to 923535f Compare December 9, 2022 01:24

sumeerbhola approved these changes Dec 9, 2022

View reviewed changes

jbowens mentioned this pull request Dec 12, 2022

db: consider requiring prefix filtering within InternalIterator #2182

Closed

jbowens commented Dec 12, 2022

View reviewed changes

jbowens force-pushed the merging-prefix branch from 923535f to 73b34ab Compare December 12, 2022 16:41

jbowens merged commit 0604785 into cockroachdb:master Dec 12, 2022

jbowens deleted the merging-prefix branch December 12, 2022 17:13

sumeerbhola mentioned this pull request Aug 15, 2024

db: fix cases where SeekPrefixGE prefix doesn't match key #3845

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

jbowens commented Dec 7, 2022 •

edited

Loading

cockroach-teamcity commented Dec 7, 2022

nicktrav left a comment

jbowens left a comment

sumeerbhola left a comment

itsbilal left a comment

jbowens left a comment

jbowens left a comment

sumeerbhola left a comment

jbowens left a comment

db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

db: refactor merging iterator's prefix handling for TrySeekUsingNext #2178

Conversation

jbowens commented Dec 7, 2022 • edited Loading

cockroach-teamcity commented Dec 7, 2022

nicktrav left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

itsbilal left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

sumeerbhola left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

jbowens commented Dec 7, 2022 •

edited

Loading