Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvnemesis: add MVCC range tombstone tests #76435

Closed
erikgrinaker opened this issue Feb 11, 2022 · 8 comments · Fixed by #89477
Closed

kvnemesis: add MVCC range tombstone tests #76435

erikgrinaker opened this issue Feb 11, 2022 · 8 comments · Fixed by #89477
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. A-testing Testing tools and infrastructure O-qa T-storage Storage Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Feb 11, 2022

Once #70415 lands, and #70412 matures, we should add range tombstone tests to kvnemesis.

Jira issue: CRDB-13120

Epic CRDB-20465

@blathers-crl
Copy link

blathers-crl bot commented Jul 6, 2022

cc @cockroachdb/replication

@blathers-crl blathers-crl bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 6, 2022
@exalate-issue-sync exalate-issue-sync bot added T-kv-replication and removed A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Jul 6, 2022
@blathers-crl
Copy link

blathers-crl bot commented Jul 6, 2022

cc @cockroachdb/replication

@exalate-issue-sync exalate-issue-sync bot added the A-kv-replication Relating to Raft, consensus, and coordination. label Jul 6, 2022
@nicktrav nicktrav added the T-storage Storage Team label Jul 13, 2022
@blathers-crl blathers-crl bot added the A-storage Relating to our storage engine (Pebble) on-disk storage. label Jul 13, 2022
@tbg
Copy link
Member

tbg commented Aug 29, 2022

Hit a snag, probably not a show stopper, but DeleteRange is currently only supported in a txn, but we can't have a txn here. See https://reviewable.io/reviews/cockroachdb/cockroach/68003#-Mi3-qUv9iLDNMACdaB8

#69642
#71236

@tbg
Copy link
Member

tbg commented Aug 29, 2022

This is related to this code for puts:

// Mark which deletes are materialized and match them with a stored
// tombstone, since this cannot be done before the end of the txn.
// This is because materialized deletes do not write unique values,
// but must be the final write in a txn for that key.
if o.isDelete() {
key := string(o.Key)
v.committedDeletesForKey[key]++
if optTxn == nil {
// In the case that the delete is not in a transaction (or in an
// ambiguous transaction), we do not match it to a specific
// tombstone as we cannot be certain which tombstone resulted from
// this operation; hence, we leave the timestamp empty.
o.Materialized = v.committedDeletesForKey[key] <= len(v.tombstonesForKey[key])
} else if storedDelete, ok := v.getDeleteForKey(key, optTxn); ok {
o.Materialized = true
o.Timestamp = storedDelete.Timestamp
}
}
}

} else {
// NB: While Put operations can be identified as having materialized
// (or not) in the storage engine because the Generator guarantees each
// value to be unique (and thus, if a MVCC key/value pair exists in the
// storage engine with a value matching that of a write operation, it
// materialized), the same cannot be done for Delete operations, which
// all write the same tombstone value. Thus, Delete operations can only
// be identified as materialized by determining if the final write
// operation for a key in a given transaction was a Delete, and
// validating that a potential tombstone for that key was stored.
// This validation must be done at the end of the transaction;
// specifically, in the function `checkCommittedTxn(..)` where it looks
// up a corresponding tombstone with `getDeleteForKey(..)`.
write := &observedWrite{
Key: t.Key,
Value: roachpb.Value{},
}
v.observedOpsByTxn[*txnID] = append(v.observedOpsByTxn[*txnID], write)
}

A deletion doesn't return the timestamp at which it materialized, so here we have to go in manually and "do stuff" and if we're not in a txn then we don't even know what timestamp to use.

I will need to get more familiar with the validation code to sort out how this should handle the case we want to add here.

@nvanbenschoten and/or @AlexTalks would be grateful for some pointers here. I still have to make a solid effort diving into this, but maybe you two have an idea of what should or can't work right off the bat.

@tbg
Copy link
Member

tbg commented Aug 29, 2022

I'm going to try adding this as a new request type. edit: err, maybe not. Going to wait for expert advice to not waste my time.

@erikgrinaker
Copy link
Contributor Author

I'm going to try adding this as a new request type.

I think this is probably the better approach short-term, given that UseRangeTombstone mandates non-txnal semantics, but it's possible that we'll hit similar difficulties with it that we'd have to resolve. +1 on getting some advice, but let's not get pulled into anything laborious here.

@tbg
Copy link
Member

tbg commented Sep 1, 2022

Erik points out that mvcc rangedels are sidestepping some of the problems above since they physically cover the entire range. We likely still need to teach the verifier how to deal with range keys etc but maybe it's mostly elbow grease after all. Would welcome some expert eyes on this.

@tbg
Copy link
Member

tbg commented Sep 7, 2022

Here's the relevant code explains why non-transactional DeleteRange is not supported (#69642):

case *DeleteRangeOperation:
if txnID == nil {
v.checkAtomic(`deleteRange`, t.Result, nil, op)
} else {
// For the purposes of validation, DelRange operations decompose into
// a specialized scan for keys with non-nil values, followed by
// writes for each key, with a span to validate that the keys we are
// deleting are within the proper bounds. See above comment for how
// the individual deletion tombstones for each key are validated.
scan := &observedScan{
Span: roachpb.Span{
Key: t.Key,
EndKey: t.EndKey,
},
IsDeleteRange: true,
KVs: make([]roachpb.KeyValue, len(t.Result.Keys)),
}
deleteOps := make([]observedOp, len(t.Result.Keys))
for i, key := range t.Result.Keys {
scan.KVs[i] = roachpb.KeyValue{
Key: key,
Value: roachpb.Value{},
}
write := &observedWrite{
Key: key,
Value: roachpb.Value{},
IsDeleteRange: true,
}
deleteOps[i] = write
}
v.observedOpsByTxn[*txnID] = append(v.observedOpsByTxn[*txnID], scan)
v.observedOpsByTxn[*txnID] = append(v.observedOpsByTxn[*txnID], deleteOps...)
}
case *ScanOperation:

This seems to suggest that a) if we can figure out the write timestamp of the non-txnal operation and b) can synthesize the set of keys affected by the deletion (or just return it with the result to make these tests work) or find a way to somehow store additional information in the (point or range) tombstone, we can make it work.

I also don't yet understand how non-txnal point deletions work, since they also don't get assigned a timestamp:

key := string(o.Key)
v.committedDeletesForKey[key]++
if optTxn == nil {
// In the case that the delete is not in a transaction (or in an
// ambiguous transaction), we do not match it to a specific
// tombstone as we cannot be certain which tombstone resulted from
// this operation; hence, we leave the timestamp empty.
o.Materialized = v.committedDeletesForKey[key] <= len(v.tombstonesForKey[key])
} else if storedDelete, ok := v.getDeleteForKey(key, optTxn); ok {

tbg added a commit to tbg/cockroach that referenced this issue Sep 11, 2022
Very partial WIP

Fixes cockroachdb#76435.

Release justification:
Release note (<category, see below>): <what> <show> <why>
craig bot pushed a commit that referenced this issue Dec 7, 2022
89477: kvnemesis: uniquely identify all versions r=erikgrinaker a=tbg

This is essentially a v2 of kvnemesis. While a lot of code has changed, it's not a rewrite, rather we are actually bringing kvnemesis closer to the idea which ultimately led to it being built. That idea is that if values can uniquely identify the operation which wrote them, serializability checking becomes easier as any observation of a value totally orders the reader and the writer with respect to each other. "Easier" meant both simpler code, as well as actually being able to computationally do the verification on complex histories.

Prior to this PR, kvnemesis was writing unique values where it could, but it couldn't do it for deletions - after all, a deletion is like writing a "nothing" to MVCC, and there wasn't any way to make two "nothings" distinguishable. Having broken with the basic premise of unique values, there was a lot of bending over backwards going on to avoid, for the most part, devolving into a "normal" serializability checker. However, for contrived edge cases this would not be avoidable, and so there would be histories that kvnemesis just couldn't handle.

"v2" (this PR) gets this right. The main contribution is that we now thread kvnemesis' sequence numbers all the way down into MVCC and back up with the RangeFeed. Values are now truly unique: a deletion tombstone is identifiable via its `MVCCValueHeader`, which carries the `kvnemesisutil.Seq` of the `Operation` that originally wrote it. On top of this, everything "just works" as you expect.

Plumbing testing-only fields through the KV API protobufs isn't something that was taken lightly and that required a good amount of deliberation. However, we figured out a clever (maybe too clever?) way to have these fields be zero-sized in production, meaning they cannot possibly affect production logic and they also do not influence struct sizes or the wire encoding. As a drawback, kvnemesis requires the `crdb_test` build tag (it will `t.Skip()` otherwise).

The remainder of this PR is mostly improvements in code quality. `echodriven` testing was introduced for many of the tests that could benefit from it. The core components of kvnemesis were reworked for clarity (the author spent the first week very confused and wishes for that experience to remain unrepeated by anyone). kvnemesis also tracks the execution timestamps as reported by the KV layer, which a future change could [use](#92898) for additional validation.

Of note is the updated doc comment, which is reproduced below in entirety.


Fixes #90955.
Fixes #88988.
Fixes #76435.
Fixes #69642.

----

Package kvnemesis exercises the KV API with random concurrent traffic (as
well as splits, merges, etc) and then validates that the observed behaviors
are serializable.

It does so in polynomial time based on the techniques used by [Elle] (see in
particular section 4.2.3), using the after-the-fact MVCC history as a record
of truth. It ensures that all write operations embed a unique identifier that
is stored in MVCC history, and can thus identify which of its operations'
mutations are reflected in the database ("recoverability" in Elle parlance).

A run of kvnemesis proceeds as follows.

First, a Generator is instantiated. It can create, upon request, a sequence
in which each element is a random Operation. Operations that are mutations
(i.e. not pure reads) are assigned a unique kvnemesisutil.Seq which will be
stored alongside the MVCC data written by the Operation.

A pool of worker threads concurrently generates Operations and executes them
against a CockroachDB cluster. Some of these Operations may
succeed, some may fail, and for some of them an ambiguous result may be
encountered.
Alongside this random traffic, kvnemesis maintains a RangeFeed that ingests
the MVCC history. This creates a "carbon copy" of the MVCC history.

All of these workers wind down once they have in aggregate completed a
configured number of steps.

Next, kvnemesis validates that the Operations that were executed and the
results they saw match the MVCC history, i.e. checks for Serializability. In
general, this is an NP-hard problem[^1], but the use of unique sequence
numbers (kvnemesisutil.Seq) makes it tractable, as each mutation in the MVCC
keyspace uniquely identifies the Operation that must have written the value.

We make use of this property as follows. First, the Validate method iterates
through all Operations performed and, for each, inspects

- the type of the Operation (i.e. Put, Delete, Get, ...),
- the (point or range) key of the operation, and
- its results (i.e. whether there was an error or which key-value pairs were returned).

Each atomic unit (i.e. individual non-transactional operation, or batch of
non-transactional operations, or transaction) results in a slice (in order
in which the Operations within executed[^2]) of observations, where each
element is either an observed read or an observed write.

For example, a Batch that first writes `v1` to `k1`, then scans `[k1-k3)`
(reading its own write as well as some other key k2 with value v2) and then
deletes `k3` would generate the following slice:

       [
         observedWrite(k1->v1),
         observedScan(k1->v1, k2->v2),
         observedWrite(k3->v3),
       ]

Each such slice (i.e. atomic unit) will then be compared to the MVCC history.
For all units that succeeded, their writes must match up with versions in
the MVCC history, and this matching is trivial thanks to unique values
(including for deletions, since we embed the kvnemesisutil.Seq in the value),
and in particular a single write will entirely fix the MVCC timestamp at
which the atomic unit must have executed.

For each read (i.e. get or scan), we compute at which time intervals each
read would have been valid. For example, if the MVCC history for a key `k1`
is as follows:

                  k1

       	 -----------------
       	 t5      v2
       	 t4
       	 t3
       	 t2     <del>
       	 t1      v1

then

  - observedGet(k1->v1)  is valid for [t1,t2),
  - observedGet(k1->nil) is valid for [0,t1) and [t2,t5), and
  - observedGet(k1->v2)  is valid for [t5,inf).

By intersecting the time intervals for each Operation in an atomic unit, we
then get the MVCC timestamps at which this Operation would have observed what
it ended up observing. If this intersection is empty, serializability must have
been violated.

In the error case, kvnemesis verifies that no part of the Operation became visible.
For ambiguous results, kvnemesis requires that either no Operation become visible,
or otherwise treats the atomic unit as committed.

The KV API also has the capability to return the actual execution timestamp directly
with responses. At the time of writing, kvnemesis does verify that it does do this
reliably, but it does not verify that the execution timestamp is contained in the
intersection of time intervals obtained from inspecting MVCC history[^3].

[Elle]: https://arxiv.org/pdf/2003.10554.pdf
[^1]: https://dl.acm.org/doi/10.1145/322154.322158
[^2]: there is currently concurrency within the atomic unit in kvnemesis. It
could in theory carry out multiple reads concurrently while not also writing,
such as DistSQL does, but this is not implemented today. See:
#64825
[^3]: tracked in #92898.

Epic: none

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@craig craig bot closed this as completed in 97054a0 Dec 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. A-storage Relating to our storage engine (Pebble) on-disk storage. A-testing Testing tools and infrastructure O-qa T-storage Storage Team
Projects
None yet
3 participants