
storageccl: scan all versions of changed key in MVCCIncrementalIterator #45785

Merged
merged 2 commits into master on Mar 9, 2020

Conversation

pbardea
Contributor

@pbardea pbardea commented Mar 6, 2020

This PR re-introduces #45163. It needs to be merged after #45759.

This latest implementation of MVCCIncrementalIterator uses the
time-bound iterator only to skip over keys that were not modified at
all within the time-bound limits.

Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See #43799 for discussion around why
this may not be sufficient.

This commit updates the implementation to use the time-bound iterator
more conservatively. It now uses two iterators: iter (a non-time-bound
iterator) and timeBoundIter. The main iteration is done with iter. When
iter sees a new key, timeBoundIter is advanced, and if it moves past
iter, iter is seeked forward to the meta key of timeBoundIter's new
key. This means that iter should visit every version of every key that
timeBoundIter visits.

It should be noted that if many adjacent keys were modified, the above
logic avoids doing a Seek to the next key, since a Seek is roughly an
order of magnitude more expensive than a Next.

In cases where a lot of keys were modified, we expect to do roughly
double the work, since both iterators need to step through every key.
However, the expected use case for the MVCCIncrementalIterator is one
where the vast majority of the data is excluded from the scan (as
should be the case for incremental backups): there is potentially
years' worth of data that can be skipped over, versus hours' worth of
data to include.

Release note: None
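The dual-iterator scheme described above can be sketched with a simplified model. This is a hypothetical illustration only: `incrementalScan`, the `std::map` standing in for RocksDB iterators, and the integer timestamps are all invented for the sketch, not names from libroach.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified model: each key maps to the timestamp of its latest write.
using KeyTS = std::map<std::string, int>;

// `it` plays the role of the main non-time-bound iterator, `tb` the
// time-bound iterator (which only sees keys written in [start_ts, end_ts]).
std::vector<std::string> incrementalScan(const KeyTS& all, int start_ts,
                                         int end_ts) {
  KeyTS tbi;  // keys visible to the time-bound iterator
  for (const auto& kv : all) {
    if (kv.second >= start_ts && kv.second <= end_ts) tbi.insert(kv);
  }

  std::vector<std::string> visited;
  auto it = all.begin();
  auto tb = tbi.begin();
  while (it != all.end() && tb != tbi.end()) {
    if (it->first < tb->first) {
      // The TBI moved past the main iterator: Seek the main iterator
      // forward, cheaply skipping all unmodified keys in between.
      it = all.lower_bound(tb->first);
    } else if (it->first > tb->first) {
      // The TBI fell behind the main iterator (rare): a Next catches it up.
      ++tb;
    } else {
      // Lockstep fast path: adjacent modified keys cost only a Next on
      // each iterator, avoiding the far more expensive Seek.
      visited.push_back(it->first);
      ++it;
      ++tb;
    }
  }
  return visited;
}
```

In this model, a scan of `{a@1, b@5, c@2, d@6, e@3}` with bounds `[4, 10]` visits only `b` and `d`, seeking the main iterator over the unmodified keys in between.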


@pbardea
Contributor Author

pbardea commented Mar 6, 2020

Note that this PR differs slightly from the first implementation:

diff --git a/c-deps/libroach/incremental_iterator.cc b/c-deps/libroach/incremental_iterator.cc
index 837f2bffa8..b613dfbad3 100644
--- a/c-deps/libroach/incremental_iterator.cc
+++ b/c-deps/libroach/incremental_iterator.cc
@@ -79,7 +79,7 @@ void DBIncrementalIterator::maybeSkipKeys() {
     return;
   }
   rocksdb::Slice iter_key;
-  if(!extractKey(time_bound_iter->rep->key(), &iter_key)) {
+  if(!extractKey(iter->rep->key(), &iter_key)) {
     return;
   }

This should be followed up with another test that ensures that we're actually skipping some keys with this optimization.

Contributor

@miretskiy miretskiy left a comment


I'm having problems submitting my PR due to a flaky test -- so make sure you wait for that PR to hit master.

// This means that for the incremental iterator to perform a Next or NextKey
// will require only 1 extra NextKey invocation while they remain in
// lockstep.

Contributor


Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

Also, I think it's easier to read if you use e.g. "iter_key == tbi_key" in the comment as opposed to specifics (Compare)

// range tombstones. In this case we can SeekGE() the TBI to the main iterator.

// NB: Seek() is expensive, and this is only acceptable since we expect
// this case to rarely occur.
Contributor


I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.

Collaborator

@sumeerbhola sumeerbhola left a comment




c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):

Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):

  • tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space.
  • the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
  • the next call on tbi keeps it behind the iter (should be very rare).
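The three cases above could be tallied with a small classifier along these lines. This is a hypothetical sketch to illustrate the suggested metrics, not code from this PR; `CaseCounters` and `classify` are invented names.

```cpp
#include <cassert>
#include <string>

// Tally for each of the three TBI-vs-iter outcomes after a Next on the TBI.
struct CaseCounters {
  int lockstep = 0;    // case 1: the Next made tbi_key == iter_key (fast path)
  int tbi_ahead = 0;   // case 2: the TBI jumped far ahead; iter must Seek
  int tbi_behind = 0;  // case 3: the TBI is still behind iter (should be rare)
};

// Classify the TBI's position relative to the main iterator, given the
// keys each currently points at (keys compare bytewise, as in RocksDB).
void classify(const std::string& tbi_key, const std::string& iter_key,
              CaseCounters* c) {
  if (tbi_key == iter_key) {
    c->lockstep++;
  } else if (tbi_key > iter_key) {
    c->tbi_ahead++;
  } else {
    c->tbi_behind++;
  }
}
```

Logging or exporting these three counters would show which case dominates a given workload, i.e. whether the lockstep fast path is actually paying off.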

Contributor Author

@pbardea pbardea left a comment


Update on this issue: I've moved the comments around based on the review feedback, but I am still seeing the libroach assertion failing on this branch. Investigating further.



c-deps/libroach/incremental_iterator.cc, line 113 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I feel like this important note belongs right at the beginning of this if statement -- it explains why we call Next, as opposed to seeking right away.

Done.

@pbardea pbardea force-pushed the tbi branch 2 times, most recently from 9b5e934 to d117fce on March 9, 2020 13:19
Contributor

@miretskiy miretskiy left a comment




c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):

which is used as an oracle to test DBIncrementalIterator.

Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?


c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):

// boolean indicating whether t1 is less than t2.
bool DBIncrementalIterator::legacyTimestampIsLess(const cockroach::util::hlc::LegacyTimestamp& t1,
                                                  const cockroach::util::hlc::LegacyTimestamp& t2) {

mind wrapping these long lines? c++ usually tries to stay under 80 cols


c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):

// extractKey extracts the key portion of the mvcc_key and puts it in key.
// It returns a validity indicator.
WARN_UNUSED_RESULT bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) {

wrap?


c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):

    }
    const rocksdb::Slice tbi_key(time_bound_iter->rep->key());
    const rocksdb::Slice iter_key(ToSlice(key.key));

This line deserves an "NB: comment" ...

Contributor Author

@pbardea pbardea left a comment




c-deps/libroach/incremental_iterator.h, line 22 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
which is used as an oracle to test DBIncrementalIterator.

Is it going to be used only as an oracle? Or, eventually, it will be the iterator to be used w/ pebble?

It seems reasonable that it'll be used as the iterator when we move over to Pebble. I added a note which mentions that this iterator may be useful when Pebble becomes the default store.


c-deps/libroach/incremental_iterator.cc, line 49 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

mind wrapping these long lines? c++ usually tries to stay under 80 cols

Done. I ran the formatter (https://github.com/cockroachdb/cockroach/blob/master/c-deps/libroach/README.md). I saw that there were formatting changes to be made to other files that I didn't touch - it may be worth doing another PR with those changes.


c-deps/libroach/incremental_iterator.cc, line 56 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

wrap?

Same as previous comment.


c-deps/libroach/incremental_iterator.cc, line 250 at r4 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

This line deserves an "NB: comment" ...

Done.

These tests exercise the Go MVCCIncrementalIterator which is no longer
used directly in CockroachDB. However, the MVCCIncrementalIterator is
used as an oracle to test the C++ implementation. Therefore, it would
still be beneficial to have these tests present as long as we plan to
keep these 2 implementations in sync.

Release note: None
Contributor Author

@pbardea pbardea left a comment




c-deps/libroach/incremental_iterator.cc, line 105 at r2 (raw file):

Previously, sumeerbhola wrote…

Do you think this comment belongs outside this if statement?
I mean it talks about fastpath (iter_key == tbi_key).

I think there are three cases (if we have any incremental backup metrics we should keep a counter for each. Or maybe just log?):

  • tbi was one behind the iter: The next call on tbi will make them equal. This is the fast path mentioned here. I think this comment could be elevated just before the call to DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */); since it motivates calling next instead of seek. This could be common if most keys are modified or the modifications are clustered in the key space.
  • the next call on tbi moves it far ahead of iter. This will be the common case when modifications are sparse in the key space. This may turn out to be the real common case -- in which case the fast path did not help.
  • the next call on tbi keeps it behind the iter (should be very rare).

Updated the comments based on these suggestions.

Contributor

@miretskiy miretskiy left a comment



@pbardea pbardea merged commit 6932a18 into cockroachdb:master Mar 9, 2020
@pbardea pbardea deleted the tbi branch March 9, 2020 18:04
dt added a commit to dt/cockroach that referenced this pull request Mar 14, 2020
Within the current BACKUP/RESTORE feature offering, the way to reduce
RPO is to back up more often. For this to work at more frequent
intervals, the cost of backing up what changed must scale with the size
of what changed, as opposed to the total data size. This is exactly what
the time-bound iterator optimization is supposed to deliver -- by
recording the span of timestamps that appear in any given SSTable and
then installing a filter to only open SSTables that contain relevant
times, we can ignore irrelevant data very cheaply.
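The SSTable-level filtering described here can be modeled roughly as follows. This is a simplified, hypothetical sketch: `SSTableProps`, `mayContainRelevantData`, and the integer timestamps are invented names, standing in for the per-table timestamp metadata the real implementation records.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for per-SSTable timestamp metadata.
struct SSTableProps {
  int min_ts;  // earliest MVCC timestamp written to this SSTable
  int max_ts;  // latest MVCC timestamp written to this SSTable
};

// A time-bound read over [start_ts, end_ts] only needs to open SSTables
// whose recorded timestamp span overlaps the interval; all others are
// skipped without any key-by-key work.
bool mayContainRelevantData(const SSTableProps& p, int start_ts, int end_ts) {
  return p.max_ts >= start_ts && p.min_ts <= end_ts;
}

// Count how many SSTables a time-bound scan would actually open.
int tablesToOpen(const std::vector<SSTableProps>& tables, int start_ts,
                 int end_ts) {
  int n = 0;
  for (const auto& p : tables) {
    if (mayContainRelevantData(p, start_ts, end_ts)) n++;
  }
  return n;
}
```

An incremental backup covering only recent timestamps would thus open a small fraction of the store's SSTables, which is what makes its cost scale with the size of the changes rather than the total data size.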

However, over 2017 and 2018 we encountered a few tricky correctness
issues in the interaction of the time-bound iterator with our MVCC
logic and intent handling, and generally lost confidence in this code,
ultimately disabling its use in the scans used by BACKUP. However,
cockroachdb#45785 resolved those concerns by ensuring that every key
emitted by the IncrementalIterator is actually read by a normal,
non-TBI iterator even when the TBI is used. Thus it is now believed to
be safe to re-enable the TBI for incremental backups.

Release note (enterprise change): Incremental BACKUP can quickly skip over unchanged data, making frequent incremental BACKUPs 10-100x faster depending on data size and frequency.

Release justification: low-risk and high-impact.
craig bot pushed a commit that referenced this pull request Mar 16, 2020
46108: backup: re-enable fast incremental BACKUP via TBI r=dt a=dt

Closes #43799.

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
dt added a commit to dt/cockroach that referenced this pull request Mar 21, 2020