Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage,libroach: update MVCCIncrementalIterator to look at every updated key #45163

Merged
merged 2 commits into from
Feb 28, 2020

Conversation

pbardea
Copy link
Contributor

@pbardea pbardea commented Feb 18, 2020

This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.

Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See #43799 for discussion around why
this may not be sufficient.

This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are iter (a
non-time-bound iterator and timeBoundIter). The main iteration is now
done with iter. When iter sees a new key, the timeBoundIter is
incremented and if it goes further than iter, iter is Seeked
forward to the meta key of the new key. This means that iter should
visit every version of every key that timeBoundIter visits.

It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a Seek to the next key as a Seek is ~an
order of magnitude more expensive than a Next.

In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
vast majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@pbardea pbardea changed the title storage,libroach: update storage,libroach: update MVCCIncrementalIterator to look at every updated key Feb 18, 2020
@pbardea pbardea force-pushed the inc-iter branch 2 times, most recently from 523d394 to 2f43ce5 Compare February 18, 2020 18:31
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry about the delayed review

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @pbardea and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 56 at r2 (raw file):

	// fields used for a workaround for a bug in the time-bound iterator
	// (#28358)

can you update the comment to also mention #43799


pkg/storage/engine/mvcc_incremental_iterator.go, line 85 at r2 (raw file):

	var timeBoundIter Iterator
	if !opts.IterOptions.MinTimestampHint.IsEmpty() && !opts.IterOptions.MaxTimestampHint.IsEmpty() {
		// It is necessary for correctness that sanityIter be created before iter.

I don't think the ordering matters anymore.

Also, I think it would be good if there was a file level comment justifying the correctness, say some revised version of the following:

When using (t_s, t_e], say there is a version (committed or provisional) k@t where t is in that interval, that is visible to iter. All sstables containing k@t will be included in timeBoundIter. Note that there may be multiple sequence numbers for the key k@t at the storage layer, say k@t#n1, k@t#n2, ... where n1 > n2 ... some of which may be deleted, but the latest sequence number will be visible using iter (since not being visible would be a contradiction of the initial assumption). Since there is no delete across all sstables that deletes k@t#n1, there is no delete in the subset of sstables used by timeBoundIter that deletes k@t#n1, so the timeBoundIter will see k@t.


pkg/storage/engine/mvcc_incremental_iterator.go, line 117 at r2 (raw file):

func (i *MVCCIncrementalIterator) SeekGE(startKey MVCCKey) {
	if i.timeBoundIter != nil {
		i.timeBoundIter.SeekGE(startKey)

this needs to call i.timeBoundIter.Valid() before calling i.timeBoundIter.Key()


pkg/storage/engine/mvcc_incremental_iterator.go, line 144 at r2 (raw file):

// key.
func (i *MVCCIncrementalIterator) Next() {
	i.iter.Next()

the old code had iter as the one constrained by the time bounds, and the new code has timeBoundIter in that role. I think both Next() and NextKey() need to operate on timeBoundIter and then call advance().

Given that MVCCIncrementalIterator is trying to work with both time-bounds and non-time bounds maybe the original structuring of using iter as either the time bounded one or not was the easier structure. Though I am wondering why it has to work in both cases. Can we just return a raw engine.Iterator from NewMVCCIncrementalIterator() if there are no time bounds? We could document as a code comment that the engine.Iterator returned from this "new" function only supports a subset of the engine.Iterator interface or explicitly make MVCCIncremetalIterator an interface. Is it the special handling of intents that prevents this? If so, it would be good to document more completely the behavioral contract of this iterator.


pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r2 (raw file):

// Advance
// 	Pre: TBI and Iter are pointing to the same key, but perhaps different versions.

I don't see how this precondition is satisfied by the call sites.
I think the precondition that we desire is that TBI is at the next key we could possibly return, but we have to validate using Iter and keep moving forward if the validation fails.

@pbardea pbardea force-pushed the inc-iter branch 4 times, most recently from 34f120d to 159b5d9 Compare February 25, 2020 19:46
@miretskiy miretskiy self-requested a review February 25, 2020 19:50
@pbardea pbardea force-pushed the inc-iter branch 2 times, most recently from 836b509 to d3dab9a Compare February 25, 2020 20:21
Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the first look! I've fixed up some of the algorithm issues and added more documentation to explain what the iterator is doing and what guarantees it makes. A lot of this was based on your first pass, so thanks for that review.

I think that this is RFAL. I've also request @miretskiy take a look to double check my C++. :)

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 56 at r2 (raw file):

Previously, sumeerbhola wrote…

can you update the comment to also mention #43799

Done.


pkg/storage/engine/mvcc_incremental_iterator.go, line 85 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't think the ordering matters anymore.

Also, I think it would be good if there was a file level comment justifying the correctness, say some revised version of the following:

When using (t_s, t_e], say there is a version (committed or provisional) k@t where t is in that interval, that is visible to iter. All sstables containing k@t will be included in timeBoundIter. Note that there may be multiple sequence numbers for the key k@t at the storage layer, say k@t#n1, k@t#n2, ... where n1 > n2 ... some of which may be deleted, but the latest sequence number will be visible using iter (since not being visible would be a contradiction of the initial assumption). Since there is no delete across all sstables that deletes k@t#n1, there is no delete in the subset of sstables used by timeBoundIter that deletes k@t#n1, so the timeBoundIter will see k@t.

Done.


pkg/storage/engine/mvcc_incremental_iterator.go, line 117 at r2 (raw file):

Previously, sumeerbhola wrote…

this needs to call i.timeBoundIter.Valid() before calling i.timeBoundIter.Key()

Done.


pkg/storage/engine/mvcc_incremental_iterator.go, line 144 at r2 (raw file):

Previously, sumeerbhola wrote…

the old code had iter as the one constrained by the time bounds, and the new code has timeBoundIter in that role. I think both Next() and NextKey() need to operate on timeBoundIter and then call advance().

Given that MVCCIncrementalIterator is trying to work with both time-bounds and non-time bounds maybe the original structuring of using iter as either the time bounded one or not was the easier structure. Though I am wondering why it has to work in both cases. Can we just return a raw engine.Iterator from NewMVCCIncrementalIterator() if there are no time bounds? We could document as a code comment that the engine.Iterator returned from this "new" function only supports a subset of the engine.Iterator interface or explicitly make MVCCIncremetalIterator an interface. Is it the special handling of intents that prevents this? If so, it would be good to document more completely the behavioral contract of this iterator.

So, I believe that only iter is needed to be explicitly incremented before the advance function. The time bound iterator, as I attempted to explain in the updated comments, is only updated in maybeSkipKeys. In particular, I expect that iter be incremented by either Next or NextKey exactly once between invocations of maybeSkipKeys. iter will either be incremented in Next or NextKey, or at the end of advance in the case that we end up on a key outside the desired time bounds. The timeBoundIter is only ever advanced by NextKey (and so only visits one version of any key) when iter gets ahead of it.

MVCCIncrementalIterator behaves similarly to an engine.Iterator with the exception that the MVCCIncrementalIterator will return an error if it encounters an inline value or an intent within the given time bounds. I've added comments to note that these errors are returned from MVCCIncrementalIterator.


pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r2 (raw file):

Previously, sumeerbhola wrote…

I don't see how this precondition is satisfied by the call sites.
I think the precondition that we desire is that TBI is at the next key we could possibly return, but we have to validate using Iter and keep moving forward if the validation fails.

Yes, this precondition wasn't quite correct and was left over from working through the initial implementation.

@pbardea pbardea marked this pull request as ready for review February 25, 2020 20:28
Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


c-deps/libroach/incremental_iterator.h, line 32 at r3 (raw file):

 large amount of keys

large number of keys?


c-deps/libroach/incremental_iterator.cc, line 28 at r3 (raw file):

      write_intent(write_intent) {

  time_bound_iter.reset(nullptr);

Not needed. If you prefer to be explicit, feel free to initialize inside argument list..

Or, if you want to be explicit, initialize it above (before "engine(engine)") or set
time_bound_iter = nullptr right in the header file.


c-deps/libroach/incremental_iterator.cc, line 56 at r3 (raw file):

}

// getKey returns the

we would never know what getKey returns... :)


c-deps/libroach/incremental_iterator.cc, line 82 at r3 (raw file):

// possibly different timestamp.
void DBIncrementalIterator::maybeSkipKeys() {
  if (time_bound_iter != nullptr) {

I would do
if (tbi == nullptr) return

and de-indent the rest of the code.


c-deps/libroach/incremental_iterator.cc, line 86 at r3 (raw file):

    // keys.
    std::string tbi_key = getKey(time_bound_iter->rep->key());
    std::string iter_key = getKey(iter->rep->key());

It appears that getKey operates only on the rocksdb slices returned by the underlying rocks iterators. As such, I think it makes sense to have getKey helper return a slice instead of making
string keys. Besides the issues of unnecessary copies, strings are really bad at handling data
which might contain \0 characters (so I'm not even sure about the correctness of the code using strings instead of slices)


c-deps/libroach/incremental_iterator.cc, line 92 at r3 (raw file):

Quoted 4 lines of code…
      DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */);
      if (!valid) {
        return;
      }

This doesn't look right to me

We call (our) DBIterNext helper, which happens to return a value which we ignore.
Then we check for !valid. Clearly, the underlying iterator might have some issues, but those are never reflected in the "valid" field of this incremental iterator.

It's kinda confusing, particularly, since we check for time_bound_iter->rep->Valid below.

What's also a bit confusing is that on one hand we call our helper (DBIterNext), but on the other hand we peak inside internal representation of the underlying iterator to see if the operation succeeded.

I think we should do one thing or the other... but not both.

I think we should do something like:

auto res = DBIterNext(...)
if (!res.valid) {
// handle error
status = res.status
valid = false
return
}


c-deps/libroach/incremental_iterator.cc, line 112 at r3 (raw file):

Quoted 4 lines of code…
      	// same as the main iterator, we may be able to skip over a large group
      	// of keys. The main iterator is seeked to the TBI in hopes that many
      	// keys were skipped. Note that a Seek is an order of magnitude more
      	// expensive than a Next call.

tabs?


c-deps/libroach/incremental_iterator.cc, line 116 at r3 (raw file):

!iter_rep->status().ok()) {

any benefit to this vs
iter_rep->Valid() check?


c-deps/libroach/incremental_iterator.cc, line 224 at r3 (raw file):

DBIterState DBIncrementalIterator::seek(DBKey key) {
  if (time_bound_iter != nullptr) {
    DBIterSeek(time_bound_iter.get(), key);

Same comment as above:
auto res = DBIterSeek(...)
if (!res.valid) {
// handle error
}


c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):

    const rocksdb::Slice iter_key(key.key.data, key.key.len);
    if (tbi_key.compare(iter_key) > 0) {
      key = ToDBKey(rocksdb::Slice(tbi_key));

tbi_key is already a slice -- no need to convert.

@pbardea pbardea force-pushed the inc-iter branch 2 times, most recently from 87c4c60 to c618471 Compare February 26, 2020 00:41
Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


c-deps/libroach/incremental_iterator.h, line 32 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
 large amount of keys

large number of keys?

Done.


c-deps/libroach/incremental_iterator.cc, line 28 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Not needed. If you prefer to be explicit, feel free to initialize inside argument list..

Or, if you want to be explicit, initialize it above (before "engine(engine)") or set
time_bound_iter = nullptr right in the header file.

Done. Moved to header.


c-deps/libroach/incremental_iterator.cc, line 56 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

we would never know what getKey returns... :)

It's a secret. ;)
Updated.


c-deps/libroach/incremental_iterator.cc, line 82 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I would do
if (tbi == nullptr) return

and de-indent the rest of the code.

Done.


c-deps/libroach/incremental_iterator.cc, line 86 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

It appears that getKey operates only on the rocksdb slices returned by the underlying rocks iterators. As such, I think it makes sense to have getKey helper return a slice instead of making
string keys. Besides the issues of unnecessary copies, strings are really bad at handling data
which might contain \0 characters (so I'm not even sure about the correctness of the code using strings instead of slices)

Done, there should be no more keys as strings.


c-deps/libroach/incremental_iterator.cc, line 92 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
      DBIterNext(time_bound_iter.get(), true /* skip_current_key_versions */);
      if (!valid) {
        return;
      }

This doesn't look right to me

We call (our) DBIterNext helper, which happens to return a value which we ignore.
Then we check for !valid. Clearly, the underlying iterator might have some issues, but those are never reflected in the "valid" field of this incremental iterator.

It's kinda confusing, particularly, since we check for time_bound_iter->rep->Valid below.

What's also a bit confusing is that on one hand we call our helper (DBIterNext), but on the other hand we peak inside internal representation of the underlying iterator to see if the operation succeeded.

I think we should do one thing or the other... but not both.

I think we should do something like:

auto res = DBIterNext(...)
if (!res.valid) {
// handle error
status = res.status
valid = false
return
}

Agreed. I did a pass to clean up our error checking throughout this file.


c-deps/libroach/incremental_iterator.cc, line 112 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
      	// same as the main iterator, we may be able to skip over a large group
      	// of keys. The main iterator is seeked to the TBI in hopes that many
      	// keys were skipped. Note that a Seek is an order of magnitude more
      	// expensive than a Next call.

tabs?

Oops, this is why one should not mix IDEs. Fixed.


c-deps/libroach/incremental_iterator.cc, line 116 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
!iter_rep->status().ok()) {

any benefit to this vs
iter_rep->Valid() check?

Updated in error checking audit.


c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

tbi_key is already a slice -- no need to convert.

Done.

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)


c-deps/libroach/incremental_iterator.cc, line 224 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

Same comment as above:
auto res = DBIterSeek(...)
if (!res.valid) {
// handle error
}

Done.

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):

y is already a slice -- no need to convert.
Can you pass tbi_key directly to ToDBKey?

key = ToDBKey(tbi_key)?


c-deps/libroach/incremental_iterator.cc, line 75 at r5 (raw file):

rocksdb::Slice DBIncrementalIterator::getKey(rocksdb::Slice mvcc_key) {

Looking more at getKey() calls, I think we should change its signature to:

bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) WARN_UNUSED_RESULT

Pass the argument "*key' down to the SplitKey, and return valid indicator.
Basically, we're making multiple calls to getKey() which mutates internal validity state;
and we never check on that.

@pbardea pbardea force-pushed the inc-iter branch 2 times, most recently from f0ef78d to e011e4d Compare February 26, 2020 02:20
Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 106 at r5 (raw file):

	var iter Iterator
	var timeBoundIter Iterator
	if !opts.IterOptions.MinTimestampHint.IsEmpty() && !opts.IterOptions.MaxTimestampHint.IsEmpty() {

the MinTimestampHint and MaxTimestampHint need to be consistent with StartTime and EndTime, yes? This structure is odd, since it looks like the same information is being provided twice -- can you add a code comment.


pkg/storage/engine/mvcc_incremental_iterator.go, line 141 at r5 (raw file):

			// Seek to the first version of the key that TBI provided.
			startKey = MakeMVCCMetadataKey(tbiKey)
		}

I am assuming the startKey that SeekGE is called with must be a metadata key, otherwise the else case would also need to construct a metadata key (since we need to notice intents). If yes, can you add a comment?


pkg/storage/engine/mvcc_incremental_iterator.go, line 229 at r5 (raw file):

// advance advances the main iterator until it is referencing a key within
// (start_time, end_time]. It may return an error if it encounters an

nit: "return an error" is not quite accurate since this function does not have a return value.


pkg/storage/engine/mvcc_incremental_iterator.go, line 180 at r6 (raw file):

//
// Pre: The TBI should be pointing to either the same key or the key prior
// to the same key as the main iterator. This is enforced by a combination

Did you mean to say that the TBI is at a key <= main iterator? The phrasing "the key prior" suggests that they only differ by a one call to NextKey()?


pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r6 (raw file):

// possibly different timestamp.
func (i *MVCCIncrementalIterator) maybeSkipKeys() {
	if i.timeBoundIter != nil {

can you structure this as follows to reduce the block nesting

if i.timeBoundIter == nil {
  return
} 
tbiKey := ...
iterKey := ...
if iterKey.Compare(tbiKey) <= 0 {
  return
}
...

pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):

			// This means that for the incremental iterator to perform a Next or
			// NextKey in this case, it will require only 1 extra NextKey call to the
			// TBI.

I think there are two things that can break this lockstep behavior:

  • when the TBI is seeing phantom MVCCKey.Keys, i.e., keys that are completely absent from the main iterator -- this can happen rarely due to aborted transactions and keys that have been subsumed due to range tombstones.
  • when the TBI has fewer keys than the the main iterator, which will be quite common.

This code doesn't handle the first case (and I think we need test coverage for it) -- it corresponds to iterKey.Compare(tbiKey) > 0. Since this will be rare I think we should simply call i.timeBoundIter.SeekGE(MakeMVCCMetadataKey(iterKey)). The code here is still correct, but just wasteful in causing the main iterator to iterate over keys that are not in the TBI.


pkg/storage/engine/mvcc_incremental_iterator.go, line 233 at r6 (raw file):

func (i *MVCCIncrementalIterator) advance() {
	for {
		if !i.valid {

I think the invariant is that advance() is called when i.valid is true, though i.iter.Valid() may be true or false. In which case I don't think this if-block is needed. The only thing that will make i.valid false is maybeSkipKeys() and that is followed by an if-block that returns.


pkg/storage/engine/mvcc_incremental_iterator.go, line 280 at r6 (raw file):

			i.iter.Next()
			continue
		}

can you add a comment like the following since the use of i.meta is a bit confusing

// Else, i.meta contains an MVCC value and not an intent. This is handled next.

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)


c-deps/libroach/incremental_iterator.cc, line 233 at r3 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

y is already a slice -- no need to convert.
Can you pass tbi_key directly to ToDBKey?

key = ToDBKey(tbi_key)?

Oops, should be fixed now.


c-deps/libroach/incremental_iterator.cc, line 75 at r5 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…
rocksdb::Slice DBIncrementalIterator::getKey(rocksdb::Slice mvcc_key) {

Looking more at getKey() calls, I think we should change its signature to:

bool DBIncrementalIterator::extractKey(rocksdb::Slice mvcc_key, rocksdb::Slice *key) WARN_UNUSED_RESULT

Pass the argument "*key' down to the SplitKey, and return valid indicator.
Basically, we're making multiple calls to getKey() which mutates internal validity state;
and we never check on that.

Ah, nice - updated!


pkg/storage/engine/mvcc_incremental_iterator.go, line 106 at r5 (raw file):

Previously, sumeerbhola wrote…

the MinTimestampHint and MaxTimestampHint need to be consistent with StartTime and EndTime, yes? This structure is odd, since it looks like the same information is being provided twice -- can you add a code comment.

I believe that the requirement is that the timestamp hint range should not be more restrictive than the range between StartTime and EndTime, though I would expect callers in practice to have these consistent.


pkg/storage/engine/mvcc_incremental_iterator.go, line 141 at r5 (raw file):

Previously, sumeerbhola wrote…

I am assuming the startKey that SeekGE is called with must be a metadata key, otherwise the else case would also need to construct a metadata key (since we need to notice intents). If yes, can you add a comment?

Yep, I've added a comment.


pkg/storage/engine/mvcc_incremental_iterator.go, line 229 at r5 (raw file):

Previously, sumeerbhola wrote…

nit: "return an error" is not quite accurate since this function does not have a return value.

Done.


pkg/storage/engine/mvcc_incremental_iterator.go, line 180 at r6 (raw file):

Previously, sumeerbhola wrote…

Did you mean to say that the TBI is at a key <= main iterator? The phrasing "the key prior" suggests that they only differ by a one call to NextKey()?

Yep, updated the comment on the C++ side, but not on the go side. Updated here as well now.


pkg/storage/engine/mvcc_incremental_iterator.go, line 189 at r6 (raw file):

Previously, sumeerbhola wrote…

can you structure this as follows to reduce the block nesting

if i.timeBoundIter == nil {
  return
} 
tbiKey := ...
iterKey := ...
if iterKey.Compare(tbiKey) <= 0 {
  return
}
...

Done.


pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):

Previously, sumeerbhola wrote…

I think there are two things that can break this lockstep behavior:

  • when the TBI is seeing phantom MVCCKey.Keys, i.e., keys that are completely absent from the main iterator -- this can happen rarely due to aborted transactions and keys that have been subsumed due to range tombstones.
  • when the TBI has fewer keys than the the main iterator, which will be quite common.

This code doesn't handle the first case (and I think we need test coverage for it) -- it corresponds to iterKey.Compare(tbiKey) > 0. Since this will be rare I think we should simply call i.timeBoundIter.SeekGE(MakeMVCCMetadataKey(iterKey)). The code here is still correct, but just wasteful in causing the main iterator to iterate over keys that are not in the TBI.

Ah, I was wondering if the iterKey > tbiKey case was possible and didn't consider phantom keys. Thanks for pointing that out! Will follow this up with a test.


pkg/storage/engine/mvcc_incremental_iterator.go, line 233 at r6 (raw file):

Previously, sumeerbhola wrote…

I think the invariant is that advance() is called when i.valid is true, though i.iter.Valid() may be true or false. In which case I don't think this if-block is needed. The only thing that will make i.valid false is maybeSkipKeys() and that is followed by an if-block that returns.

@miretskiy's review of the C++ side led to a re-working of the error handling there. I've updated the go implementation to mirror how we handle errors there closer. This adds a bit more verbosity here, but I think it's more readable. I remove this check at the start of the for and explicitly check after every invocation which may make the iterator invalid. I think this is clearer since each error handling check is right after the code which may have caused the error.


pkg/storage/engine/mvcc_incremental_iterator.go, line 280 at r6 (raw file):

Previously, sumeerbhola wrote…

can you add a comment like the following since the use of i.meta is a bit confusing

// Else, i.meta contains an MVCC value and not an intent. This is handled next.

Done. I was also a bit confused by the naming of i.meta so in addition to adding a comment here, I also added a comment to the meta property.

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 209 at r6 (raw file):

Previously, pbardea (Paul Bardea) wrote…

Ah, I was wondering if the iterKey > tbiKey case was possible and didn't consider phantom keys. Thanks for pointing that out! Will follow this up with a test.

Since it's not a correctness concern, I'd actually prefer getting this version in as soon as possible so we can start stressing it sooner. I am interested in eventually following up with a test though.

Copy link
Contributor

@miretskiy miretskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)

a discussion (no related file):
Giving my LGTM for C++ -- i'll let @sumeerbhola handle go stuff.


Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r6, 3 of 5 files at r7, 1 of 1 files at r8.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 225 at r8 (raw file):

		// This means that for the incremental iterator to perform a Next or NextKey
		// will require only 1 extra NextKey invocation while they remain in
		// lockstep.

btw, for performance experiments you may want to keep stats like a counter of the fast path and slow path -- unless modified keys are adjacent, the slow path will be more common.
(Not suggesting that you add that to this PR)


pkg/storage/engine/mvcc_incremental_iterator.go, line 227 at r8 (raw file):

		// lockstep.

		if iterKey.Compare(tbiKey) > 0 {

can you restructure this as

cmp := iterKey.Compare(tbiKey)
if cmp > 0 {
...
}

if cmp < 0 {
...
}

to avoid repeated string comparisons.


pkg/storage/engine/mvcc_incremental_iterator_test.go, line 756 at r8 (raw file):

	// and the second is the MVCC metadata entry (i.e. an intent). The next
	// SSTable contains the provisional value for the intent. The effect is that
	// the metadata entry is separated from the entry it is metadata for.

how is this different from TestMVCCIncrementalIteratorIntentStraddlesSStables?

Copy link
Contributor Author

@pbardea pbardea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)


pkg/storage/engine/mvcc_incremental_iterator.go, line 225 at r8 (raw file):

Previously, sumeerbhola wrote…

btw, for performance experiments you may want to keep stats like a counter of the fast path and slow path -- unless modified keys are adjacent, the slow path will be more common.
(Not suggesting that you add that to this PR)

Ack.


pkg/storage/engine/mvcc_incremental_iterator.go, line 227 at r8 (raw file):

Previously, sumeerbhola wrote…

can you restructure this as

cmp := iterKey.Compare(tbiKey)
if cmp > 0 {
...
}

if cmp < 0 {
...
}

to avoid repeated string comparisons.

Done.


pkg/storage/engine/mvcc_incremental_iterator_test.go, line 756 at r8 (raw file):

Previously, sumeerbhola wrote…

how is this different from TestMVCCIncrementalIteratorIntentStraddlesSStables?

Removed. Accidentally added when I was playing around with the idea of testing the phantom key case.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 4 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @miretskiy, @pbardea, and @sumeerbhola)

@pbardea pbardea force-pushed the inc-iter branch 2 times, most recently from 14cbc56 to ee76b9e Compare February 27, 2020 21:52
This latest implementation of MVCCIncrementalIterator updates its use of
the time-bound iterator to only be used to skip over keys that were not
modified at all within the time bound limits.

Previously, the incremental iterator primarily used the time-bound
iterator and then used a non-time-bound iterator (sanity_iter) to check
that meta keys were properly read. See cockroachdb#43799 for discussion around why
this may not be sufficient.

This commit updates the implementation to use the time-bound iterator
more conservatively. The 2 iterators that it now uses are `iter` (a
non-time-bound iterator and `timeBoundIter`). The main iteration is now
done with `iter`. When `iter` sees a new key, the `timeBoundIter` is
incremented and if it goes further than `iter`, `iter` is `Seek`ed
forward to the meta key of the new key. This means that `iter` should
visit every version of every key that `timeBoundIter` visits.

It should be noted that if many adjacent keys were modified, the above
logic tries to avoid to do a `Seek` to the next key as a `Seek` is ~an
order of magnitude more expensive than a `Next`.

In cases where a lot of keys were modified, we expect to do ~double the
work since 2 iterators would need to seek to every key. However, the
expected use case for the MVCCIncrementalIterator is such that the
_vast_ majority of the data is excluded from the range (as should be the
case for incremental backups), since there is potentially years worth of
data that can be skipped over, versus hours worth of data to include.

Release note: None
These tests exercise the Go MVCCIncrementalIterator which is no longer
used directly in CockroachDB. However, the MVCCIncrementalIterator is
used as an oracle to test the C++ implementation. Therefore, it would
still be beneficial to have these tests present as long as we plan to
keep these 2 implementations in sync.

Release note: None
@pbardea
Copy link
Contributor Author

pbardea commented Feb 27, 2020

Thanks for the reviews!
bors r+

@craig
Copy link
Contributor

craig bot commented Feb 28, 2020

Build succeeded

@craig craig bot merged commit fc5c7f0 into cockroachdb:master Feb 28, 2020
craig bot pushed a commit that referenced this pull request Feb 28, 2020
45535: Revert "storage,libroach: update MVCCIncrementalIterator to look at every updated key" r=pbardea a=pbardea

Reverts #45163

To stop the errors we're seeing on #45524. Will investigate further once it's off master.

Co-authored-by: Paul Bardea <pbardea@gmail.com>
craig bot pushed a commit that referenced this pull request Feb 28, 2020
45143: ui: Release notes signup r=dhartunian a=koorosh

Resolves: #43912
Depends on: #44744, #44821, #44856 

- [x] WIP. Current branch has two branches merged which haven't been approved/merged in master branch yet.

- [x] rebase branch to remove merge commits from branches other than master.

Add Release notes signup form to Cluster Overview page
right after page title.
Release Notes signup view is created in `src/views/dashboard`
directory because it will be the main page where we display
this view. And Cluster Overview page is a temporary place
while Dashboard view doesn't exist.

These changes integrate three main parts: ReleaseNotesForm,
AlertNotification component and custom analytics saga.



45426: coldata: minor tweak of flat bytes r=yuzefovich a=yuzefovich

This commit changes `maybeBackfillOffsets` to update `maxSetIndex`
accordingly (this might be a minor performance improvement). In a sense,
when we're backfilling offsets, we *are* setting indices to point to
empty `[]byte` slices. Also, the logic for `Set` method is slightly
refactored.

Release note: None

45455: clusterversion: significantly rework cluster version handling r=tbg a=irfansharif

We split off ClusterVersion out of pkg/settings/cluster into a dedicated
pkg/clusterversion. This is to pave the way for #39182 where we
introduce long running version migration infrastructure.

We then introduce clusterversion.Handle as a read only view to the
active cluster version and this binary's version details. We similarly
fold in the actual global cluster version setting into
pkg/clusterversion, and enforce all external usage to go through
clusterversion.Handle. We can also hide away external usage of the baked
in cluster.Binary{,MinimumSupported}Version constants. Instead we have
the implementation of clusterversion.Handle allow for tests to override
binary and minimum supported versions as needed.

Along the way we clean up Settings.Initialized, as it is not really
used. We document all the various "versions" in play ("binary version",
    "minimum supported version", "active version") and change usage of what
was previously "serverVersion" to simply be "binaryVersion", because
that's what it is. We also clean up the Settings constructors into
Test/Non-Test types that differ in cluster version setting
initialization behaviour.

---

For reviewers: It's probably easiest to start with what 
pkg/settings/cluster/cluster_settings.go looks like, then following into
pkg/clusterversion/cluster_version.go and then pkg/clusterversion/setting.go.

---

I still don't like the following detail about our pkg structure:

- pkg/settings/cluster depends on it's "parent" pkg, pkg/settings
- pkg/settings/cluster depends pkg/clusterversion, which in turn depends
on pkg/settings

Naively, pkg/settings/cluster should be a top level package, but I'm not
making that change in this PR. For now I've renamed the settings.go file
to cluster_settings.go.

Release note: None


45535: Revert "storage,libroach: update MVCCIncrementalIterator to look at every updated key" r=pbardea a=pbardea

Reverts #45163

To stop the errors we're seeing on #45524. Will investigate further once it's off master.

Co-authored-by: Andrii Vorobiov <and.vorobiov@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Paul Bardea <pbardea@gmail.com>
craig bot pushed a commit that referenced this pull request Feb 29, 2020
45535: Revert "storage,libroach: update MVCCIncrementalIterator to look at every updated key" r=pbardea a=pbardea

Reverts #45163

To stop the errors we're seeing on #45524. Will investigate further once it's off master.

Co-authored-by: Paul Bardea <pbardea@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants