Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvnemesis: uniquely identify all versions #89477

Merged
merged 4 commits into from
Dec 7, 2022

Conversation

tbg
Copy link
Member

@tbg tbg commented Oct 6, 2022

This is essentially a v2 of kvnemesis. While a lot of code has changed, it's not a rewrite, rather we are actually bringing kvnemesis closer to the idea which ultimately led to it being built. That idea is that if values can uniquely identify the operation which wrote them, serializability checking becomes easier as any observation of a value totally orders the reader and the writer with respect to each other. "Easier" meant both simpler code, as well as actually being able to computationally do the verification on complex histories.

Prior to this PR, kvnemesis was writing unique values where it could, but it couldn't do it for deletions - after all, a deletion is like writing a "nothing" to MVCC, and there wasn't any way to make two "nothings" distinguishable. Having broken with the basic premise of unique values, there was a lot of bending over backwards going on to avoid, for the most part, devolving into a "normal" serializability checker. However, for contrived edge cases this would not be avoidable, and so there would be histories that kvnemesis just couldn't handle.

"v2" (this PR) gets this right. The main contribution is that we now thread kvnemesis' sequence numbers all the way down into MVCC and back up with the RangeFeed. Values are now truly unique: a deletion tombstone is identifiable via its MVCCValueHeader, which carries the kvnemesisutil.Seq of the Operation that originally wrote it. On top of this, everything "just works" as you expect.

Plumbing testing-only fields through the KV API protobufs isn't something that was taken lightly and that required a good amount of deliberation. However, we figured out a clever (maybe too clever?) way to have these fields be zero-sized in production, meaning they cannot possibly affect production logic and they also do not influence struct sizes or the wire encoding. As a drawback, kvnemesis requires the crdb_test build tag (it will t.Skip() otherwise).

The remainder of this PR is mostly improvements in code quality. echodriven testing was introduced for many of the tests that could benefit from it. The core components of kvnemesis were reworked for clarity (the author spent the first week very confused and wishes for that experience to remain unrepeated by anyone). kvnemesis also tracks the execution timestamps as reported by the KV layer, which a future change could use for additional validation.

Of note is the updated doc comment, which is reproduced below in entirety.

Fixes #90955.
Fixes #88988.
Fixes #76435.
Fixes #69642.


Package kvnemesis exercises the KV API with random concurrent traffic (as
well as splits, merges, etc) and then validates that the observed behaviors
are serializable.

It does so in polynomial time based on the techniques used by Elle (see in
particular section 4.2.3), using the after-the-fact MVCC history as a record
of truth. It ensures that all write operations embed a unique identifier that
is stored in MVCC history, and can thus identify which of its operations'
mutations are reflected in the database ("recoverability" in Elle parlance).

A run of kvnemesis proceeds as follows.

First, a Generator is instantiated. It can create, upon request, a sequence
in which each element is a random Operation. Operations that are mutations
(i.e. not pure reads) are assigned a unique kvnemesisutil.Seq which will be
stored alongside the MVCC data written by the Operation.

A pool of worker threads concurrently generates Operations and executes them
against a CockroachDB cluster. Some of these Operations may
succeed, some may fail, and for some of them an ambiguous result may be
encountered.
Alongside this random traffic, kvnemesis maintains a RangeFeed that ingests
the MVCC history. This creates a "carbon copy" of the MVCC history.

All of these workers wind down once they have in aggregate completed a
configured number of steps.

Next, kvnemesis validates that the Operations that were executed and the
results they saw match the MVCC history, i.e. checks for Serializability. In
general, this is an NP-hard problem1, but the use of unique sequence
numbers (kvnemesisutil.Seq) makes it tractable, as each mutation in the MVCC
keyspace uniquely identifies the Operation that must have written the value.

We make use of this property as follows. First, the Validate method iterates
through all Operations performed and, for each, inspects

  • the type of the Operation (i.e. Put, Delete, Get, ...),
  • the (point or range) key of the operation, and
  • its results (i.e. whether there was an error or which key-value pairs were returned).

Each atomic unit (i.e. individual non-transactional operation, or batch of
non-transactional operations, or transaction) results in a slice (in order
in which the Operations within executed2) of observations, where each
element is either an observed read or an observed write.

For example, a Batch that first writes v1 to k1, then scans [k1-k3)
(reading its own write as well as some other key k2 with value v2) and then
deletes k3 would generate the following slice:

   [
     observedWrite(k1->v1),
     observedScan(k1->v1, k2->v2),
     observedWrite(k3->v3),
   ]

Each such slice (i.e. atomic unit) will then be compared to the MVCC history.
For all units that succeeded, their writes must match up with versions in
the MVCC history, and this matching is trivial thanks to unique values
(including for deletions, since we embed the kvnemesisutil.Seq in the value),
and in particular a single write will entirely fix the MVCC timestamp at
which the atomic unit must have executed.

For each read (i.e. get or scan), we compute at which time intervals each
read would have been valid. For example, if the MVCC history for a key k1
is as follows:

              k1

   	 -----------------
   	 t5      v2
   	 t4
   	 t3
   	 t2     <del>
   	 t1      v1

then

  • observedGet(k1->v1) is valid for [t1,t2),
  • observedGet(k1->nil) is valid for [0,t1) and [t2,t5), and
  • observedGet(k1->v2) is valid for [t5,inf).

By intersecting the time intervals for each Operation in an atomic unit, we
then get the MVCC timestamps at which this Operation would have observed what
it ended up observing. If this intersection is empty, serializability must have
been violated.

In the error case, kvnemesis verifies that no part of the Operation became visible.
For ambiguous results, kvnemesis requires that either no Operation become visible,
or otherwise treats the atomic unit as committed.

The KV API also has the capability to return the actual execution timestamp directly
with responses. At the time of writing, kvnemesis does verify that it does do this
reliably, but it does not verify that the execution timestamp is contained in the
intersection of time intervals obtained from inspecting MVCC history3.

Epic: none

Footnotes

  1. https://dl.acm.org/doi/10.1145/322154.322158

  2. there is currently concurrency within the atomic unit in kvnemesis. It
    could in theory carry out multiple reads concurrently while not also writing,
    such as DistSQL does, but this is not implemented today. See:
    https://github.com/cockroachdb/cockroach/issues/64825

  3. tracked in https://github.com/cockroachdb/cockroach/issues/92898.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg mentioned this pull request Oct 6, 2022
@tbg tbg force-pushed the kvnemesis-mvccvh branch 2 times, most recently from 181726d to 2d2e706 Compare October 6, 2022 20:04
@tbg tbg force-pushed the kvnemesis-mvccvh branch 5 times, most recently from 57cf6f5 to a03f3f9 Compare October 13, 2022 19:19
@tbg tbg requested a review from erikgrinaker October 13, 2022 20:28
@tbg
Copy link
Member Author

tbg commented Oct 13, 2022

No rush on reviewing this Erik, it's not ready for merging anyway, need to come up with a version of the plumbing that could at least in theory get consensus.

But this seems to work and I'll base DeleteRangeUsingTombstone on it so that we can start getting some test coverage.

@tbg tbg force-pushed the kvnemesis-mvccvh branch 2 times, most recently from 9c10c9f to 2b0afee Compare November 8, 2022 13:43
@tbg
Copy link
Member Author

tbg commented Nov 9, 2022

I haven't done the plumbing and there are some test failures to fix up (because I changed some proto sizes under crdb_test), but this is worth a look now.

@tbg tbg marked this pull request as ready for review November 9, 2022 07:55
@tbg tbg requested review from a team as code owners November 9, 2022 07:55
@tbg tbg requested review from herkolategan and srosenberg and removed request for a team November 9, 2022 07:55
@tbg
Copy link
Member Author

tbg commented Nov 9, 2022

@erikgrinaker maybe we should kick this off with a 30 minute live walkthrough? I haven't written enough high-level description yet, though I suspect you're in the loop already.

@erikgrinaker
Copy link
Contributor

Sure, we can do that. After standup? I'm pretty familiar already, but the diff is large enough that there are probably bits that I've missed.

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass, reviewing everything outside of kvnemesis itself. I think the plumbing here came out really nice!

pkg/kv/kvclient/kvcoord/transport.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/batcheval/cmd_scan.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/testing_knobs.go Show resolved Hide resolved
pkg/kv/kvserver/replica_rangefeed.go Outdated Show resolved Hide resolved
pkg/kv/kvserver/replica_rangefeed.go Outdated Show resolved Hide resolved
pkg/storage/mvcc.go Outdated Show resolved Hide resolved
pkg/storage/point_synthesizing_iter.go Outdated Show resolved Hide resolved
pkg/roachpb/api.proto Show resolved Hide resolved
pkg/util/buildutil/crdb_test_off.go Outdated Show resolved Hide resolved
pkg/util/buildutil/testingint/testing_int64.go Outdated Show resolved Hide resolved
@tbg
Copy link
Member Author

tbg commented Dec 2, 2022

Haven't squashed yet, but it's rebased and has a nice PR message. PTALL @erikgrinaker
There are also two comments (maybe easiest to find on reviewable, I resolved everything else) that I'd like your take on.

Hope to get this in on Monday 🤞🏽🤞🏽🤞🏽🤞🏽🤞🏽

Copy link
Contributor

@erikgrinaker erikgrinaker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable seems to have given up on this PR, so I haven't reviewed everything again in detail, but looked over the comments and the code outside of the kvnemesis package.

Reviewed 234 of 248 files at r17, 20 of 20 files at r18, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvnemesis/validator.go line 1235 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Figured it out, we're using this iter just to find all points that have MVCC history, and then throwing each of them in validReadTimes which does do proper tombstone handling. Added a comment and removed some effectively unused code here.

Got it, thanks for updating the comment. There's a duplicate "MVCC range deletions" in the comment here.


pkg/kv/kvnemesis/validator.go line 467 at r17 (raw file):

Previously, tbg (Tobias Grieger) wrote…

(I'm allowing this now, just wanted to give you a ping in case you think we should discuss w KV folks)

Since the operation is non-atomic, the caller has to keep retrying until it succeeds across all ranges. The semantics here are effectively the same as for ClearRange, which was used previously.


pkg/storage/mvcc.go line 1105 at r17 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I removed this TODO btw since there isn't a problem now, but want to make sure it's future proof.

Oof, yeah, this isn't great. I mean, it works, but I wouldn't expect this to be part of the pebbleMVCCScanner API guarantees, so someone can easily end up breaking this later. On the other hand, we wouldn't want to add in any non-trivial work to track this in results either, since this is on the hottest read path we have. I think I'm ok with this for now just because we only ever make use of this in test code and I don't have any better ideas, but it deserves a TODO or NB at least.


pkg/storage/mvcc.go line 6623 at r18 (raw file):

		}

		bounds := it.RangeBounds().Clone()

nit: let's reuse the span.EndKey slice instead, when we assign at the end of the loop. It doesn't currently matter much, since this is only used in test code, but just in case someone starts using it in production paths.

@tbg
Copy link
Member Author

tbg commented Dec 2, 2022

so I haven't reviewed everything again in detail, but looked over the comments and the code outside of the kvnemesis package.

The fixup commits are all there, so if you really wanted to, you could find the one that looks familiar from last time and review from there. But there is limited utility there too, just pointing it out as an option.

TFTR, will address later.

@tbg tbg force-pushed the kvnemesis-mvccvh branch 2 times, most recently from 46e199f to c5cd8ea Compare December 6, 2022 09:39
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to get this green now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)

tbg added 4 commits December 7, 2022 08:57
This is essentially a v2 of kvnemesis. While a lot of code has changed, it's not a rewrite, rather we are actually bringing kvnemesis closer to the idea which ultimately led to it being built. That idea is that if values can uniquely identify the operation which wrote them, serializability checking becomes easier as any observation of a value totally orders the reader and the writer with respect to each other. "Easier" meant both simpler code, as well as actually being able to computationally do the verification on complex histories.

Prior to this PR, kvnemesis was writing unique values where it could, but it couldn't do it for deletions - after all, a deletion is like writing a "nothing" to MVCC, and there wasn't any way to make two "nothings" distinguishable. Having broken with the basic premise of unique values, there was a lot of bending over backwards going on to avoid, for the most part, devolving into a "normal" serializability checker. However, for contrived edge cases this would not be avoidable, and so there would be histories that kvnemesis just couldn't handle.

"v2" (this PR) gets this right. The main contribution is that we now thread kvnemesis' sequence numbers all the way down into MVCC and back up with the RangeFeed. Values are now truly unique: a deletion tombstone is identifiable via its `MVCCValueHeader`, which carries the `kvnemesisutil.Seq` of the `Operation` that originally wrote it. On top of this, everything "just works" as you expect.

Plumbing testing-only fields through the KV API protobufs isn't something that was taken lightly and that required a good amount of deliberation. However, we figured out a clever (maybe too clever?) way to have these fields be zero-sized in production, meaning they cannot possibly affect production logic and they also do not influence struct sizes or the wire encoding. As a drawback, kvnemesis requires the `crdb_test` build tag (it will `t.Skip()` otherwise).

The remainder of this PR is mostly improvements in code quality. `echodriven` testing was introduced for many of the tests that could benefit from it. The core components of kvnemesis were reworked for clarity (the author spent the first week very confused and wishes for that experience to remain unrepeated by anyone). kvnemesis also tracks the execution timestamps as reported by the KV layer, which a future change could [use](cockroachdb#92898) for additional validation.

Of note is the updated doc comment, which is reproduced below in entirety.

Fixes cockroachdb#90955.
Fixes cockroachdb#88988.
Fixes cockroachdb#76435.
Fixes cockroachdb#69642.

----

Package kvnemesis exercises the KV API with random concurrent traffic (as
well as splits, merges, etc) and then validates that the observed behaviors
are serializable.

It does so in polynomial time based on the techniques used by [Elle] (see in
particular section 4.2.3), using the after-the-fact MVCC history as a record
of truth. It ensures that all write operations embed a unique identifier that
is stored in MVCC history, and can thus identify which of its operations'
mutations are reflected in the database ("recoverability" in Elle parlance).

A run of kvnemesis proceeds as follows.

First, a Generator is instantiated. It can create, upon request, a sequence
in which each element is a random Operation. Operations that are mutations
(i.e. not pure reads) are assigned a unique kvnemesisutil.Seq which will be
stored alongside the MVCC data written by the Operation.

A pool of worker threads concurrently generates Operations and executes them
against a CockroachDB cluster. Some of these Operations may
succeed, some may fail, and for some of them an ambiguous result may be
encountered.
Alongside this random traffic, kvnemesis maintains a RangeFeed that ingests
the MVCC history. This creates a "carbon copy" of the MVCC history.

All of these workers wind down once they have in aggregate completed a
configured number of steps.

Next, kvnemesis validates that the Operations that were executed and the
results they saw match the MVCC history, i.e. checks for Serializability. In
general, this is an NP-hard problem[^1], but the use of unique sequence
numbers (kvnemesisutil.Seq) makes it tractable, as each mutation in the MVCC
keyspace uniquely identifies the Operation that must have written the value.

We make use of this property as follows. First, the Validate method iterates
through all Operations performed and, for each, inspects

- the type of the Operation (i.e. Put, Delete, Get, ...),
- the (point or range) key of the operation, and
- its results (i.e. whether there was an error or which key-value pairs were returned).

Each atomic unit (i.e. individual non-transactional operation, or batch of
non-transactional operations, or transaction) results in a slice (in order
in which the Operations within executed[^2]) of observations, where each
element is either an observed read or an observed write.

For example, a Batch that first writes `v1` to `k1`, then scans `[k1-k3)`
(reading its own write as well as some other key k2 with value v2) and then
deletes `k3` would generate the following slice:

       [
         observedWrite(k1->v1),
         observedScan(k1->v1, k2->v2),
         observedWrite(k3->v3),
       ]

Each such slice (i.e. atomic unit) will then be compared to the MVCC history.
For all units that succeeded, their writes must match up with versions in
the MVCC history, and this matching is trivial thanks to unique values
(including for deletions, since we embed the kvnemesisutil.Seq in the value),
and in particular a single write will entirely fix the MVCC timestamp at
which the atomic unit must have executed.

For each read (i.e. get or scan), we compute at which time intervals each
read would have been valid. For example, if the MVCC history for a key `k1`
is as follows:

                  k1

       	 -----------------
       	 t5      v2
       	 t4
       	 t3
       	 t2     <del>
       	 t1      v1

then

  - observedGet(k1->v1)  is valid for [t1,t2),
  - observedGet(k1->nil) is valid for [0,t1) and [t2,t5), and
  - observedGet(k1->v2)  is valid for [t5,inf).

By intersecting the time intervals for each Operation in an atomic unit, we
then get the MVCC timestamps at which this Operation would have observed what
it ended up observing. If this intersection is empty, serializability must have
been violated.

In the error case, kvnemesis verifies that no part of the Operation became visible.
For ambiguous results, kvnemesis requires that either no Operation become visible,
or otherwise treats the atomic unit as committed.

The KV API also has the capability to return the actual execution timestamp directly
with responses. At the time of writing, kvnemesis does verify that it does do this
reliably, but it does not verify that the execution timestamp is contained in the
intersection of time intervals obtained from inspecting MVCC history[^3].

[Elle]: https://arxiv.org/pdf/2003.10554.pdf
[^1]: https://dl.acm.org/doi/10.1145/322154.322158
[^2]: there is currently concurrency within the atomic unit in kvnemesis. It
could in theory carry out multiple reads concurrently while not also writing,
such as DistSQL does, but this is not implemented today. See:
cockroachdb#64825
[^3]: tracked in cockroachdb#92898.

Epic: none
Release note: None
Fixes a kvnemesis flake:

>  kvnemesis.go:166: error applying x.AdminChangeReplicas(ctx, tk(9865574548402477322), [{REMOVE_VOTER n2,s2}]) // replica cannot hold lease: replica cannot hold lease

Release note: None
This works around cockroachdb#92189 while it's being fixed (and allows me to merge
the PR this commit is in, which otherwise fairly reliably fails
maye_stress in CI).

Release note: None
@tbg tbg force-pushed the kvnemesis-mvccvh branch from 123ff43 to b3550aa Compare December 7, 2022 07:57
@tbg
Copy link
Member Author

tbg commented Dec 7, 2022

bors r=erikgrinaker

@craig
Copy link
Contributor

craig bot commented Dec 7, 2022

Build succeeded:

@craig craig bot merged commit 1b4947b into cockroachdb:master Dec 7, 2022
@tbg tbg deleted the kvnemesis-mvccvh branch December 7, 2022 10:28
tbg added a commit to tbg/cockroach that referenced this pull request Dec 7, 2022
craig bot pushed a commit that referenced this pull request Dec 7, 2022
92257: kvcoord: return error on failed scan response checksum r=erikgrinaker a=tbg

Found in #89477.

Epic: none
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants