Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: mergingIter.SeekPrefixGE stops early if prefix cannot match #1085

Merged
merged 1 commit into from
Mar 8, 2021

Conversation

sumeerbhola
Copy link
Collaborator

This is motivated by the CockroachDB issues discussed
in #1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs #1070

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@sumeerbhola
Copy link
Collaborator Author

Ignore the first commit which is from #1079
It is used only to pick up a tiny benchmark change.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to take a closer look at the testing tomorrow, but this seems relatively small and safe. I like it.

Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)


merging_iter.go, line 637 at r2 (raw file):

					return nil, nil
				}
			}

This is a lot smaller than I expected.

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.

Reviewable status: 0 of 14 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @sumeerbhola)


merging_iter.go, line 637 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This is a lot smaller than I expected.

As an alternative, could you have performed this optimization down in sstable.singleLevelIterator? In particular, a call to singleLevelIterator.SeekPrefixGE could save away the prefix as mergingIter does, and return false from either SeekPrefixGE or a subsequent Next when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this in mergingIter vs singleLevelIterator.

This is motivated by the CockroachDB issues discussed
in cockroachdb#1070 where
range tombstones in L6 cause the iterator to go through
all the deleted data. The situation is even worse in that
each successive SeekPrefixGE in a batch request (which is
in sorted order) will typically have to start all over again
since the file at which the seek finally ended its work is
probably later than the file which contains the relevant key.
Note that CockroachDB does not set bounds when using
SeekPrefixGE because the assumption is that the underlying
Pebble implementation can imply bounds, and we don't want
to change this behavior in CockroachDB. Since CockroachDB
primarily uses range tombstones when deleting CockroachDB
ranges, the SeekPrefixGE calls that were slow were probably
on a preceding CockroachDB range, hence stopping early,
as done in this PR, should fix the issue. If range tombstones
were used within the CockroachDB range that is being read
using SeekPrefixGE, the seek could land on a deleted key
and will need to iterate through all its MVCC versions.
Even in that case, the work is bounded by the number of
deleted versions of a single MVCC key.

The code stops early in prefix iteration mode, both for
SeekPrefixGE and Next.

BenchmarkIteratorSeqSeekPrefixGENotFound demonstrates the
existing problem when with-tombstone=true.

name                                                            old time/op    new time/op    delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     446ns ± 0%     433ns ± 0%    -2.91%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16     10.3ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     416ns ± 0%     429ns ± 0%    +3.12%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16     10.6ms ± 0%     0.0ms ± 0%   -99.99%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     414ns ± 0%     437ns ± 0%    +5.56%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16     10.5ms ± 0%     0.0ms ± 0%   -99.99%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16               1.65µs ± 0%    1.75µs ± 0%    +5.75%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 463ns ± 0%     459ns ± 0%    -0.86%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16               1.61µs ± 0%    1.67µs ± 0%    +3.73%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 476ns ± 0%     475ns ± 0%    -0.21%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16               1.62µs ± 0%    1.77µs ± 0%    +9.26%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 513ns ± 0%     525ns ± 0%    +2.34%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16               1.71µs ± 0%    1.84µs ± 0%    +7.65%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                1.10µs ± 0%    1.16µs ± 0%    +5.27%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16              1.80µs ± 0%    1.86µs ± 0%    +2.99%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16               1.34µs ± 0%    1.20µs ± 0%   -10.23%

name                                                            old alloc/op   new alloc/op   delta
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=1/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=2/with-tombstone=true-16       300B ± 0%        0B       -100.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=false-16     0.00B          0.00B          0.00%
IteratorSeqSeekPrefixGENotFound/skip=4/with-tombstone=true-16       292B ± 0%        0B       -100.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=1/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=2/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=4/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=false-16                0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=8/use-next=true-16                 0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=false-16               0.00B          0.00B          0.00%
MergingIterSeqSeekPrefixGE/skip=16/use-next=true-16                0.00B          0.00B          0.00%

Informs cockroachdb#1070
Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!

Do we have sufficient test coverage already that a mistake in this optimization would cause a failure? I suspect yes, but just want to verify.

I've added a test case to merging_iter_test.go. And metamorphic testing will exercise this.

Also, I've removed the amortization logic since it was a premature optimization, and have added a comment there mentioning that we could do so, including the compares we are already incurring (which is why I removed the logic).

Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal, @jbowens, and @petermattis)


merging_iter.go, line 637 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

As an alternative, could you have performed this optimization down in sstable.singleLevelIterator? In particular, a call to singleLevelIterator.SeekPrefixGE could save away the prefix as mergingIter does, and return false from either SeekPrefixGE or a subsequent Next when the key no longer has the desired prefix. I haven't thought through whether it is more desirable to do this in mergingIter vs singleLevelIterator.

This is definitely debatable.

Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.

In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.

  • The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same ssblock), and know about bloom filter matching -- this is true for both the SeekPrefixGE, and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers. I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
  • The noop optimization for SeekGE,LT was feasible to do in Iterator.
  • For this change mergingIter was the highest level where it was possible.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:LGTM:!

I'm really glad we didn't need to do anything more invasive than this.

Reviewed 1 of 5 files at r2, 4 of 14 files at r3.
Reviewable status: 2 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @petermattis)

Copy link
Collaborator

@petermattis petermattis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)


merging_iter.go, line 637 at r2 (raw file):

Previously, sumeerbhola wrote…

This is definitely debatable.

Theoretically, one could construct a large memtable with a similar pattern of sets which are range deleted and incur the cost of iterating over 64MB of data for each SeekPrefixGE, though it is unlikely to occur in practice.

In general, I've been making iterator optimizations at the highest layer in the iterator stack as possible, since it reduces the places to touch.

  • The next optimizations ended up at the lowest layer because those iterators know whether they are close enough for it to be worthwhile (e.g. same ssblock), and know about bloom filter matching -- this is true for both the SeekPrefixGE, and SeekGE/LT with bounds optimizations (the former passes trySeekUsingNext up from the Iterator as a hint only, and the optimization is done at the lowest layers. I should have done the same for the latter by adding a parameter to SetBounds instead of making each InternalIterator do bounds comparisons).
  • The noop optimization for SeekGE,LT was feasible to do in Iterator.
  • For this change mergingIter was the highest level where it was possible.

Ok, the approach here SGTM. Thanks for laying out your thinking.

Copy link
Collaborator Author

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Reviewable status: 2 of 6 files reviewed, all discussions resolved (waiting on @itsbilal)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants