kv/storage: introduce local timestamps for MVCC versions in MVCCValue #80706

Merged

Conversation

@nvanbenschoten (Member) commented Apr 28, 2022:

Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we introduced the multi-register test to the Jepsen test suite, and have since observed the test fail when combined with the strobe-skews nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure [1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous with respect to the transaction commit (and occurs after the commit has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction [2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written at (its provisional timestamp, if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real-time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real-time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and can therefore consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.
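
To make the comparison concrete, here is a minimal sketch of the check in Go, using plain integers in place of hlc.Timestamp/hlc.ClockTimestamp; the function name and shape are illustrative, not the actual pkg/kv/kvserver/uncertainty API:

```go
package main

import "fmt"

// isUncertain sketches the per-value check described above. A value is a
// candidate for uncertainty only if its version timestamp falls in the
// window (readTs, globalLimit]. Within that window, the reader compares
// its observed timestamp from the key's leaseholder against the value's
// local timestamp: if the observation was captured before the value was
// written (observedTs < localTs), the write cannot have preceded the
// read in real time, so no uncertainty restart is needed.
func isUncertain(readTs, globalLimit, versionTs, localTs, observedTs int64) bool {
	if versionTs <= readTs || globalLimit < versionTs {
		return false // visible to the read, or entirely beyond its uncertainty window
	}
	if observedTs != 0 && observedTs < localTs {
		return false // observation precedes the write: concurrent, not uncertain
	}
	return true // the write may precede the read in real time: restart
}

func main() {
	// A value at version timestamp 120 was written when the leaseholder's
	// clock read 130. A reader at timestamp 100 (uncertainty limit 150)
	// that observed 125 from that leaseholder avoids the restart: 125 < 130.
	fmt.Println(isUncertain(100, 150, 120, 130, 125)) // false
	// If the value's local timestamp were 110 instead, the same reader
	// could not rule out real-time ordering: 110 <= 125.
	fmt.Println(isUncertain(100, 150, 120, 110, 125)) // true
}
```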

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the value encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.
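
A minimal sketch of the decode-side defaulting that makes this omission safe (hypothetical names and integer timestamps; the real handling, including the synthetic-timestamp case, lives in pkg/storage/mvcc_value.go):

```go
package mvccsketch

// effectiveLocalTimestamp returns the local timestamp to use for a
// version whose encoded header may have omitted it (0 means omitted).
func effectiveLocalTimestamp(versionTs, headerLocalTs int64) int64 {
	if headerLocalTs != 0 {
		return headerLocalTs // explicitly encoded: trails the version timestamp
	}
	// Omitted means the true local timestamp led (or equaled) the version
	// timestamp. Returning the version timestamp rounds the local
	// timestamp down, which can only cause extra uncertainty restarts,
	// never a stale read.
	return versionTs
}
```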

MVCCValue

type MVCCValue struct {
    MVCCValueHeader
    Value roachpb.Value
}

To store the local timestamp, the commit introduces a new MVCCValue type to parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with MVCC-level metadata which is stored in an enginepb.MVCCValueHeader protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the MVCC value's encoding is identical to that of roachpb.Value. This provides backwards compatibility and ensures that the MVCCValue optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The encoding scheme's variants are:

Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>

The two encoding scheme variants are distinguished using the 5th byte, which is either the roachpb.Value tag (which has many possible values) or a sentinel tag not used by the roachpb.Value encoding which indicates the extended encoding scheme.
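
As a rough illustration of that dispatch, here is a sketch against the layout above; the sentinel constant and the exact semantics of the length field are assumptions for illustration, not the real values from pkg/storage/mvcc_value.go:

```go
package mvccsketch

import (
	"encoding/binary"
	"errors"
)

// extendedEncodingSentinel stands in for the reserved 5th byte; the
// actual sentinel value used by CockroachDB is not reproduced here.
const extendedEncodingSentinel byte = 0xff

// splitMVCCValue separates an encoded MVCCValue into its (possibly
// absent) header bytes and its roachpb.Value bytes by inspecting the
// 5th byte.
func splitMVCCValue(buf []byte) (header, value []byte, err error) {
	if len(buf) < 5 || buf[4] != extendedEncodingSentinel {
		// Simple encoding: the buffer is a roachpb.Value verbatim
		// (possibly empty, for a tombstone).
		return nil, buf, nil
	}
	// Extended encoding: <4-byte-header-len><1-byte-sentinel><header><value>.
	hdrLen := int(binary.BigEndian.Uint32(buf[:4]))
	if hdrLen < 0 || 5+hdrLen > len(buf) {
		return nil, nil, errors.New("malformed extended MVCCValue encoding")
	}
	return buf[5 : 5+hdrLen], buf[5+hdrLen:], nil
}
```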

Care was taken to ensure that the encoding and decoding routines for the "simple" encoding are fast, avoiding heap allocations, memory copies, and function calls by exploiting mid-stack inlining. See microbenchmarks below.
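
Continuing the sketch above (same hypothetical package, reusing its sentinel constant), the shape of such a fast path might look as follows; this is a simplified illustration, not the real EncodeMVCCValue:

```go
package mvccsketch

import "encoding/binary"

// encodeMVCCValue returns the roachpb.Value bytes unchanged when the
// header is empty, so the common case performs no allocation or copy
// and is small enough for the compiler to inline into callers.
func encodeMVCCValue(headerProto, valueRawBytes []byte) []byte {
	if len(headerProto) == 0 {
		return valueRawBytes // simple encoding: zero-copy fast path
	}
	// Extended encoding (rare, slow path): prepend the prelude.
	buf := make([]byte, 0, 4+1+len(headerProto)+len(valueRawBytes))
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(headerProto)))
	buf = append(buf, extendedEncodingSentinel)
	buf = append(buf, headerProto...)
	return append(buf, valueRawBytes...)
}
```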

Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for ClockTimestamps is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two ClockTimestamps: the version's local timestamp and the reader's observed timestamp.

Correctness testing

I was not able to stress jepsen/multi-register/strobe-skews hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called TestTxnReadWithinUncertaintyIntervalAfterIntentResolution which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

MVCCValue Encoding and Decoding Microbenchmarks
name                                                                   time/op
EncodeMVCCValue/header=empty/value=tombstone-10                        3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                            3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                             3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10               38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                   42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10       40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10           42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10           7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10            0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10               8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10        22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10  32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10   22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10      32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10       22.7ns ± 0%

name                                                                   alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                    4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10            4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                   allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                          0.00
EncodeMVCCValue/header=empty/value=short-10                              0.00
EncodeMVCCValue/header=empty/value=long-10                               0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                 1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                      1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10         1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10             1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10              1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10             0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10              0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                   0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10         0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10          0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10    0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10     0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10         0.00

pkg/sql/tests End-To-End Microbenchmarks
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30             106µs ± 2%     104µs ± 2%  -1.38%  (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30             1.26ms ± 2%    1.24ms ± 2%  -1.25%  (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30                  281µs ± 2%     277µs ± 2%  -1.17%  (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30             107µs ± 2%     107µs ± 2%    ~     (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30            156µs ± 2%     155µs ± 4%    ~     (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30           570µs ± 1%     576µs ± 2%    ~     (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30         4.18ms ± 2%    4.23ms ± 2%    ~     (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30        43.9ms ± 2%    44.1ms ± 2%    ~     (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30                412µs ± 2%     410µs ± 2%    ~     (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30               511µs ± 2%     508µs ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30            8.18ms ± 1%    8.11ms ± 2%    ~     (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30           85.2ms ± 2%    84.7ms ± 2%    ~     (p=0.481 n=10+10)
KV/Update/Native/rows=1-30             163µs ± 2%     162µs ± 2%    ~     (p=0.436 n=10+10)
KV/Update/Native/rows=10-30            365µs ± 1%     365µs ± 1%    ~     (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30         143ms ± 1%     144ms ± 2%    ~     (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30                525µs ± 2%     522µs ± 3%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30               815µs ± 1%     810µs ± 1%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30             2.47ms ± 1%    2.48ms ± 1%    ~     (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30            16.6ms ± 1%    16.7ms ± 1%    ~     (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30            193ms ± 2%     195ms ± 3%    ~     (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30            173µs ± 2%     171µs ± 3%    ~     (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30           677µs ± 1%     674µs ± 0%    ~     (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30         5.20ms ± 2%    5.19ms ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30        53.7ms ± 1%    53.9ms ± 1%    ~     (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30                377µs ± 3%     375µs ± 1%    ~     (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30               503µs ± 1%     500µs ± 2%    ~     (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30             1.52ms ± 4%    1.51ms ± 2%    ~     (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30            2.53ms ± 3%    2.53ms ± 3%    ~     (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30           21.9ms ± 1%    21.8ms ± 1%    ~     (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30              29.6µs ± 2%    29.8µs ± 1%    ~     (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30            54.6µs ± 1%    55.0µs ± 2%    ~     (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30                 292µs ± 1%     290µs ± 1%    ~     (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30                364µs ± 2%     363µs ± 1%    ~     (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30             5.34ms ± 2%    5.32ms ± 1%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30    2.59ms ± 1%    2.59ms ± 2%    ~     (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30    2.57ms ± 3%    2.56ms ± 2%    ~     (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30    2.60ms ± 3%    2.59ms ± 2%    ~     (p=0.315 n=10+10)
KV/Update/Native/rows=100-30          2.02ms ± 1%    2.04ms ± 1%  +0.95%  (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30         16.2ms ± 2%    16.5ms ± 2%  +1.30%  (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30             32.6µs ± 2%    33.0µs ± 3%  +1.39%  (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30            266µs ± 1%     270µs ± 1%  +1.51%  (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30               982µs ± 1%     997µs ± 1%  +1.60%  (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30          2.18ms ± 1%    2.23ms ± 1%  +2.55%  (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30            15.6kB ± 0%    15.6kB ± 0%    ~     (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30        34.7MB ± 2%    35.0MB ± 2%    ~     (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30               41.3kB ± 1%    41.4kB ± 1%    ~     (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30              90.8kB ± 1%    90.8kB ± 0%    ~     (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30            5.69MB ± 1%    5.71MB ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30           69.6MB ± 1%    69.6MB ± 1%    ~     (p=0.853 n=10+10)
KV/Update/Native/rows=1-30            21.4kB ± 1%    21.4kB ± 0%    ~     (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30        67.8MB ± 1%    68.1MB ± 2%    ~     (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30               47.2kB ± 1%    47.3kB ± 1%    ~     (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30               116kB ± 1%     117kB ± 1%    ~     (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30            5.20MB ± 2%    5.18MB ± 1%    ~     (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30        30.5MB ± 2%    30.5MB ± 1%    ~     (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30               42.9kB ± 0%    43.0kB ± 1%    ~     (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30              66.9kB ± 3%    66.5kB ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30            1.19MB ± 0%    1.19MB ± 0%    ~     (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30           14.3MB ± 1%    14.3MB ± 0%    ~     (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30              7.17kB ± 0%    7.18kB ± 0%    ~     (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30             8.59kB ± 0%    8.59kB ± 0%    ~     (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30            21.2kB ± 0%    21.2kB ± 0%    ~     (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30            172kB ± 0%     172kB ± 0%    ~     (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30          1.51MB ± 0%    1.51MB ± 0%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30                 19.3kB ± 0%    19.3kB ± 1%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30                20.6kB ± 1%    20.7kB ± 0%    ~     (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30               32.8kB ± 1%    32.9kB ± 0%    ~     (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30               185kB ± 0%     185kB ± 0%    ~     (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30             1.10MB ± 0%    1.10MB ± 0%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30     793kB ± 3%     793kB ± 2%    ~     (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30     788kB ± 3%     783kB ± 3%    ~     (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30     787kB ± 3%     785kB ± 2%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30            15.1kB ± 0%    15.2kB ± 0%  +0.26%  (p=0.029 n=10+10)
KV/Update/Native/rows=10-30           66.8kB ± 0%    67.0kB ± 0%  +0.38%  (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30              550kB ± 1%     553kB ± 1%  +0.44%  (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30              578kB ± 1%     580kB ± 1%  +0.46%  (p=0.019 n=10+10)
KV/Update/Native/rows=100-30           503kB ± 0%     506kB ± 0%  +0.59%  (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30            153MB ± 1%     154MB ± 1%  +0.62%  (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30              306kB ± 0%     308kB ± 0%  +0.67%  (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30           36.1kB ± 1%    36.4kB ± 1%  +0.82%  (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30         4.75MB ± 1%    4.79MB ± 1%  +0.83%  (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30           39.7kB ± 1%    40.1kB ± 1%  +0.83%  (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30         2.56MB ± 1%    2.58MB ± 1%  +1.00%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30           279kB ± 1%     282kB ± 1%  +1.02%  (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30           243kB ± 1%     246kB ± 1%  +1.07%  (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30         2.23MB ± 1%    2.26MB ± 1%  +1.33%  (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30             58.6k ± 0%     58.4k ± 0%  -0.28%  (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30                   277 ± 0%       276 ± 0%  -0.25%  (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30             740k ± 0%      739k ± 0%  -0.12%  (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30               129 ± 0%       129 ± 0%    ~     (all equal)
KV/Insert/Native/rows=10-30              272 ± 0%       272 ± 0%    ~     (all equal)
KV/Insert/Native/rows=1000-30          14.3k ± 0%     14.3k ± 0%    ~     (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30          141k ± 0%      141k ± 0%    ~     (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30                  349 ± 0%       348 ± 0%    ~     (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30                 625 ± 0%       625 ± 0%    ~     (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30              3.12k ± 0%     3.12k ± 0%    ~     (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30             29.3k ± 0%     29.3k ± 0%    ~     (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30             294k ± 0%      293k ± 0%    ~     (p=0.134 n=9+7)
KV/Update/Native/rows=1-30               181 ± 0%       181 ± 0%    ~     (all equal)
KV/Update/Native/rows=10-30              458 ± 0%       458 ± 0%    ~     (all equal)
KV/Update/Native/rows=1000-30          26.9k ± 0%     27.0k ± 0%    ~     (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30          273k ± 0%      274k ± 0%    ~     (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30                  503 ± 0%       503 ± 0%    ~     (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30                 904 ± 1%       905 ± 0%    ~     (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30              6.79k ± 0%     6.79k ± 0%    ~     (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30               128 ± 0%       128 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10-30              248 ± 0%       248 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10000-30          112k ± 0%      112k ± 0%    ~     (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30                  331 ± 0%       331 ± 0%    ~     (all equal)
KV/Delete/SQL/rows=10-30                 512 ± 0%       512 ± 0%    ~     (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30              2.01k ± 0%     2.01k ± 0%    ~     (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30             7.51k ± 0%     7.51k ± 0%    ~     (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30            71.2k ± 0%     71.2k ± 0%    ~     (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30                54.0 ± 0%      54.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=10-30               58.0 ± 0%      58.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=100-30              62.3 ± 1%      62.3 ± 1%    ~     (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30             72.3 ± 1%      72.0 ± 0%    ~     (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30             115 ± 1%       115 ± 1%    ~     (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30                    242 ± 0%       242 ± 0%    ~     (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30                  648 ± 0%       648 ± 0%    ~     (all equal)
KV/Scan/SQL/rows=1000-30               5.04k ± 0%     5.04k ± 0%    ~     (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30              50.3k ± 0%     50.3k ± 0%    ~     (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30     7.61k ± 0%     7.60k ± 0%    ~     (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.888 n=10+9)
KV/Update/Native/rows=100-30           2.76k ± 0%     2.76k ± 0%  +0.03%  (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30          11.3k ± 0%     11.3k ± 0%  +0.03%  (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30           1.57k ± 0%     1.57k ± 0%  +0.06%  (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30           1.27k ± 0%     1.28k ± 0%  +0.08%  (p=0.000 n=8+8)

YCSB benchmark suite
name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%    ~     (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%    ~     (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%    ~     (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%  -1.37%  (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%    ~     (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%    ~     (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%    ~     (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%    ~     (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32    129k ± 2%    130k ± 1%    ~     (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32    105k ± 2%    106k ± 1%  +1.06%  (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%    ~     (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%  +5.35%  (p=0.029 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.32 ± 6%    6.23 ± 3%    ~     (p=0.635 n=10+10)
ycsb/B/nodes=3           5.02 ± 4%    4.97 ± 3%    ~     (p=0.542 n=10+10)
ycsb/C/nodes=3           3.49 ± 3%    3.50 ± 0%    ~     (p=0.443 n=10+9)
ycsb/D/nodes=3           2.80 ± 0%    2.87 ± 2%  +2.50%  (p=0.008 n=8+10)
ycsb/E/nodes=3           45.8 ± 1%    45.8 ± 1%    ~     (p=0.718 n=10+7)
ycsb/F/nodes=3           13.5 ± 3%    13.6 ± 4%    ~     (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32    4.82 ±10%    4.62 ± 6%    ~     (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32    2.00 ± 0%    1.96 ± 3%    ~     (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32    1.50 ± 0%    1.50 ± 0%    ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.40 ± 0%    1.36 ± 4%    ~     (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32    43.3 ± 1%    43.3 ± 0%    ~     (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32    9.30 ±11%    8.79 ± 5%  -5.48%  (p=0.022 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           52.0 ±17%    48.4 ±18%    ~     (p=0.379 n=10+10)
ycsb/B/nodes=3           32.2 ±11%    32.2 ± 9%    ~     (p=0.675 n=10+10)
ycsb/C/nodes=3           13.8 ± 6%    13.8 ± 3%    ~     (p=1.000 n=10+10)
ycsb/D/nodes=3           12.6 ± 0%    12.9 ± 2%  +2.38%  (p=0.023 n=8+10)
ycsb/E/nodes=3            200 ± 8%     197 ± 6%    ~     (p=0.375 n=10+8)
ycsb/F/nodes=3            124 ±19%     125 ±21%    ~     (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32    68.1 ±17%    63.9 ± 5%    ~     (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32    15.7 ± 0%    15.5 ± 2%    ~     (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32    10.5 ± 0%    10.2 ± 3%  -2.86%  (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32    8.70 ± 3%    8.40 ± 0%  -3.45%  (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32     151 ± 0%     151 ± 0%    ~     (all equal)
ycsb/F/nodes=3/cpu=32     148 ±15%     145 ±10%    ~     (p=0.503 n=9+10)

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

Footnotes

[1] See pkg/kv/kvserver/observedts/doc.go for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[2] See the analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846.

@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 3 times, most recently from c46d218 to 204c01e on April 28, 2022 17:50
@nvanbenschoten (Member, Author) commented:

@sumeerbhola @aayushshah15 @erikgrinaker this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key. We've discussed the difference between these two options in a few different venues. At a high-level, I think the tradeoffs between the two approaches are:

Option 1: Store local_timestamp in mvcc key suffix
+ Enables fast-path to fully resolve uncertainty checking by looking at pebble key, which may be important for key-value separation
+ MVCC can continue to treat values as opaque bytes
+ Smaller change in general, no need for complex value encoding scheme
- Confuses key identity, feels wrong
- Less flexible
- Unclear interaction with Pebble range keys
- Depends heavily on #77520

Option 2: Store local_timestamp in mvcc value
- The value is now required to fully resolve uncertainty, which may harm key-value separation
- MVCC must enforce an encoding scheme for values
- Larger change, some subtle logic around MVCCValue vs. roachpb.Value
+ Places the local_timestamp metadata where it feels like it belongs, simpler to understand
+ Works with Pebble range keys
+ Now that we have the MVCCValue type with a protobuf header, we can use it in other cases (within reason). For instance, it could be used to store kv-level TTL information if we wanted to push TTL into the MVCC layer
+ We could revert #77520 if we wanted to, though I don't intend to until at least after we've migrated away from synthetic timestamps

Both approaches touch performance-sensitive code, so both have to be very careful to not introduce regressions.

After writing this out and thinking about the trade-offs, I'm leaning towards the approach taken by this PR. The main reason is that I don't think the impact of this on key-value separation will be pronounced. Storing the local timestamp in the value means that we need to fetch the value of keys seen by scans to check uncertainty if 1) the scan has an uncertainty interval and 2) the scan is reading at a timestamp below the value's timestamp. I'm going to assume that we don't care about cases where the value actually is uncertain, because this leads to a much more expensive uncertainty restart and is also a cost paid once (i.e. it doesn't compound because we stop scanning once we see an uncertain value). So let's focus on the case where the values are above the read timestamp but not uncertain.

First off, is this a common situation in workloads that we care about? I'm not sure that it is because any historical AOST read does not have an uncertainty interval and any consistent read will rarely see keys at timestamps above its read timestamp, as they would have needed to be written roughly concurrently with the read. So this is really only a problem for long-running consistent scans operating over keyspaces that see high write traffic.

If this does cause issues and we do find that avoiding a fetch of the value of keys above a consistent read's timestamp is important for performance, we could avoid some of the cost by splitting uncertainty checking into two phases. We could first perform an imprecise check using just the version timestamp from the key, comparing it to the scan's global uncertainty limit. Only if this check fails would we need to pull the value to perform the precise uncertainty check using the local timestamp and the local uncertainty limit. That means that long-running scans would have a max_clock_offset-sized window above their read timestamp where checking for uncertainty against concurrently written values would be less efficient than it could be (and again, this suboptimality only materializes when an observed timestamp is used to avoid uncertainty). My intuition is that this is not a worthwhile case to optimize for at this point.
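
A sketch of that two-phase split (hypothetical names and integer timestamps as in the earlier sketch; fetchLocalTs stands in for the value fetch whose cost we want to avoid):

```go
package mvccsketch

// twoPhaseUncertain first runs the imprecise, key-only check against the
// global uncertainty limit, and only fetches the value (to obtain the
// local timestamp) when that check cannot rule out uncertainty.
func twoPhaseUncertain(
	versionTs, readTs, globalLimit, localLimit int64,
	fetchLocalTs func() int64, // reads the value; the cost being avoided
) bool {
	// Phase 1: key only. Values at or below the read timestamp, or above
	// the global limit, are never uncertain, so most values never require
	// a value fetch for uncertainty purposes.
	if versionTs <= readTs || globalLimit < versionTs {
		return false
	}
	// Phase 2: precise check using the local timestamp from the value
	// against the local uncertainty limit (0 means no observed timestamp).
	if localLimit != 0 && localLimit < fetchLocalTs() {
		return false
	}
	return true
}
```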

Meanwhile, I think there are immediately meaningful benefits to this approach.


If you've already reviewed #77342 then only the last two commits here are new. Thanks for working with me through the multiple iterations of this patch.

@nvanbenschoten marked this pull request as ready for review April 28, 2022 19:46
@nvanbenschoten requested review from a team as code owners April 28, 2022 19:46
@nvanbenschoten requested review from stevendanna, sumeerbhola, aayushshah15 and erikgrinaker and removed request for a team April 28, 2022 19:46
@nvanbenschoten (Member, Author) commented:

I added benchmark results demonstrating the impact of this change on performance to the PR description. The benchmarks come from three different levels: microbenchmarks around MVCC value encoding and decoding, pkg/sql/tests end-to-end microbenchmarks, and the YCSB benchmark suite run on roachprod clusters. The results demonstrate that this change has a negligible impact on workload performance.

@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 2 times, most recently from ffe93b5 to ba07e50 on May 4, 2022 23:19
@sumeerbhola (Collaborator) commented:

> So let's focus on the case where the values are above the read timestamp but not uncertain.

Also, when Pebble starts storing older versions in a value block, and there is exactly one version above this read timestamp, that version's value will not be in the value block.


// D0 ————————————————————————————————————————————————
//
// MVCCKey
@nvanbenschoten (Member, Author) commented:

TODO(@nvanbenschoten): fix this, now that the local timestamp is in the value.

@sumeerbhola (Collaborator) left a comment:

(still reading -- flushing some comments)

Reviewed 14 of 64 files at r11.


pkg/storage/mvcc.go line 400 at r11 (raw file):

		// Straightforward: old contribution goes, new contribution comes, and we're done.
		ms.SysBytes -= origMetaKeySize + origMetaValSize + orig.KeyBytes + orig.ValBytes
		ms.SysBytes += metaKeySize + metaValSize + meta.KeyBytes + meta.ValBytes

Was this a bug in the existing code since it was not using orig.{KeyBytes,ValBytes} and meta.{KeyBytes,ValBytes}?
The KeyBytes is always MVCCVersionTimestampSize, yes?


pkg/storage/mvcc_value.go line 60 at r11 (raw file):

//
// In either encoding, the suffix corresponding to the roachpb.Value can be
// omitted, indicating a deletion tombstone. For the simple encoding, this

This is a bit confusing, partly because roachpb.Value is already encoded, in that Value.RawBytes is already the encoded value. How about spelling this out a bit more, with something like:
// For a deletion tombstone, the encoding of roachpb.Value is special cased to be empty, i.e., no checksum, tag or encoded-data. In that case the extended encoding above is
// simply <4-byte-header-len><1-byte-sentinel>.


pkg/storage/mvcc_value.go line 103 at r11 (raw file):

	if v.LocalTimestamp.IsEmpty() {
		if k.Timestamp.Synthetic {
			return hlc.MinClockTimestamp

Is this to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit? Could use a code comment.


pkg/storage/mvcc_value.go line 170 at r11 (raw file):

		return nil, errors.Wrap(err, "marshaling MVCCValueHeader")
	}
	// <4-byte-checksum><1-byte-tag><encoded-data>

or empty for a tombstone?


pkg/storage/enginepb/mvcc.go line 266 at r11 (raw file):

	if meta.LocalTimestamp == nil {
		if meta.Timestamp.ToTimestamp().Synthetic {
			return hlc.MinClockTimestamp

I assume this is to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit. Is that correct? Could use a code comment.


pkg/storage/enginepb/mvcc.proto line 40 at r11 (raw file):

  // The local timestamp of the most recent versioned value if this is a
  // value that may have multiple versions. For values which may have only
  // one version, this timestamp is set to nil. See MVCCValueHeader for

"may have only version" is a bit confusing to me. Is this talking about this particular key or the category of versioned values? I assume it is the former, since this comment is about versioned values and that means it is always possible to have multiple versions.
Why do we need this at all -- it seems duplication from what we have stored in the provisional value?


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

    // Value is the value written to the key as part of the transaction at
    // the above Sequence. Value uses the roachpb.Value encoding.
    optional bytes value = 2;

So the local timestamp is not stored for the older sequence of values written by this transaction?
And if one of those was restored due to a savepoint rollback, is it behaving like a new write in terms of the local timestamp?
A code comment would be useful.

@sumeerbhola (Collaborator) left a comment:

Apart from a few tests, I've finished reading the last 2 commits. Looks good!

Reviewed 40 of 64 files at r11, 11 of 15 files at r12.


pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 408 at r7 (raw file):

			return errors.AssertionFailedf("SST contains inline value or intent for key %s", key)
		}
		if len(value) == 0 {

just curious: what is the confidence level that all such code has been found and changed?
Tests that don't write a local timestamp would still succeed with buggy code, yes?
Should we have some way to force all tests to write local timestamps?


pkg/kv/kvserver/uncertainty/interval.go line 55 at r11 (raw file):

// version timestamp and with the specified uncertainty interval.
func (in Interval) IsUncertain(valueTs hlc.Timestamp, localTs hlc.ClockTimestamp) bool {
	if !in.LocalLimit.IsEmpty() && in.LocalLimit.Less(localTs) {

Is everything required to have a non-empty localTs now, whether it be the version timestamp itself or hlc.MinTimestamp for the synthetic case?
Should we assert !localTs.IsEmpty() to make sure we have not forgotten something?


pkg/storage/mvcc.go line 3087 at r10 (raw file):

		if err != nil {
			return false, err
		}

This change looks like it is mainly code movement, since we need to figure out newValue.LocalTimestamp before updating and writing meta. Is that correct?
Do you think we have good test coverage of the various paths in this function?
Do you think we have good test coverage of the various paths in this function?


pkg/storage/mvcc.go line 858 at r11 (raw file):

	iterAlreadyPositioned bool,
	meta *enginepb.MVCCMetadata,
	localTs *hlc.ClockTimestamp,

is localTS for allocation avoidance? is it required that meta.localTS be populated if the return value of ok is true?
A code comment would help.


pkg/storage/mvcc.go line 1495 at r11 (raw file):

					}
					if haveNextVersion {
						prevVal, err := DecodeMVCCValue(prevUnsafeVal)

is this rare enough to use this less optimized function?


pkg/storage/mvcc.go line 3045 at r11 (raw file):

		// TODO(nvanbenschoten): this is an awkward interface. We shouldn't
		// be mutating meta and we shouldn't be restoring the previous value
		// here. Instead, this should all be handled down below.

This whole intent resolution function is quite confusing, though we probably need to shore up our randomized testing before making any significant changes here. There is also a TODO mentioned in mvcc_test.go (may be stale now)

	// TODO(sumeer): mvccResolveWriteIntent has a bug when the txn is being
	// ABORTED and there are IgnoredSeqNums that are causing a partial rollback.
	// It does the partial rollback and does not actually resolve the intent.
	// This does not affect correctness since the intent resolution will get
	// retried.

pkg/storage/pebble.go line 1310 at r12 (raw file):

// a cluster are at or beyond clusterversion.TODO, different nodes will see the
// version state transition at different times. Nodes that have not yet seen the
// transition may remove the local timestamp from an intent that has one during

comment needs updating

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 7, 2022
…tents

Related to cockroachdb#80706.
Related to cockroachdb#66485.

This commit makes a slight modification to `pebbleMVCCScanner` that changes how
it determines whether an intent is uncertain or not. Instead of consulting the
version timestamp in the `MVCCMetadata` struct and comparing that against the
scan's uncertainty interval, the `pebbleMVCCScanner` now scans through the
uncertainty interval and uses the intent's provisional value's timestamp to
determine uncertainty.

The `pebbleMVCCScanner` was actually already doing this scan to compute
uncertainty for other committed values in its uncertainty interval if it found
that the intent was not uncertain. However, after this change, it also relies on
the scan to compute uncertainty for the intent itself. This is safe, because the
calling code knows that the intent has a higher timestamp than the scan, so
there is no risk that the scan adds the provisional value to its result set.

This change is important for two reasons:
1. it avoids the need to store the `local_timestamp` field (introduced in cockroachdb#80706)
   in the `MVCCMetadata` struct.
2. it moves us closer in the direction of using `MVCCMetadata` values (ts=0,
   essentially locks protecting provisional values) to determine read-write
   conflicts but then using versioned provisional values to determine uncertainty.
   Doing so allows us to decompose a KV scan into a separate lock table scan to
   detect read-write conflicts and a MVCC scan to accumulate a result set while
   checking for uncertainty. This will be important for cockroachdb#66485.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch from ba07e50 to 8e87495 on May 9, 2022 03:10
@nvanbenschoten requested a review from a team as a code owner May 9, 2022 03:10
@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 2 times, most recently from 95aae23 to 841c16e on May 9, 2022 04:30
This commit adds a cluster version gate and a cluster setting for
local timestamps, to assist with their migration into an existing
cluster.

This fixes mixed-version clusters' interaction with local timestamps.
This commit switches from storing encoded `roachpb.Value`s to storing
encoded `storage.MVCCValue`s in `MVCCMetadata`'s `SequencedIntent`s. Doing
so ensures that MVCCValue headers are not lost when an intent is rolled
back. This is important to avoid losing the local timestamp of values in a
key's intent history. Failure to do so could allow for stale reads.
This commit adds an assertion to `Interval.IsUncertain` that the provided
value and local timestamps are non-zero.
…testing

This commit adds a metamorphic knob that randomly disables the simple MVCC value
encoding scheme. Doing so ensures that code which interacts with encoded MVCC
values does not mistake these values for encoded roachpb values. This could take
place in two different ways:
1. broken code could assign an encoded MVCC value directly to a roachpb.Value's
   `RawBytes` field. This typically caused the test to fail with an error.
2. broken code could assume that a non-zero-length value was not a tombstone.
   This caused tests to fail in more obscure ways.

The commit then fixes broken tests in one of three ways:
- it fixes incorrect assumptions about the MVCC value encoding being equivalent
  to the roachpb value encoding.
- it updates a few important tests (mostly related to MVCC stats) to work with
  and without the simple encoding.
- it skips a few unimportant tests when the simple encoding scheme is disabled.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch from 081c53a to 720c8cf Compare May 9, 2022 16:39
@nvanbenschoten (Member, Author) left a comment:


pkg/kv/kvserver/uncertainty/doc.go line 200 at r27 (raw file):

Previously, sumeerbhola wrote…

MVCC key-value pairs track their ...

Done.


pkg/kv/kvserver/uncertainty/doc.go line 204 at r27 (raw file):

Previously, sumeerbhola wrote…

should this be: with a greater or equal timestamp than the intent's local timestamp?
my understanding was that the local clock for a node is not advanced if a write has sampled it.

Kind of. HLC clock readings are strictly monotonically increasing, meaning that no two clock readings will be identical. So if a local timestamp is assigned from a reading of the leaseholder's clock, any future clock reading will be greater than this value.
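
A simplified sketch of what strict monotonicity means for an HLC read (illustrative only; the real clock lives in pkg/util/hlc and handles more, e.g. logical-counter overflow and offset checks):

```go
package mvccsketch

import (
	"sync"
	"time"
)

// hlcSketch is a simplified hybrid logical clock. Now is strictly
// monotonic: no two readings are identical, even when the wall clock
// stalls, which is what makes "any later clock reading exceeds a
// previously assigned local timestamp" hold.
type hlcSketch struct {
	mu      sync.Mutex
	wall    int64
	logical int32
}

func (c *hlcSketch) Now() (wall int64, logical int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pw := time.Now().UnixNano(); pw > c.wall {
		c.wall, c.logical = pw, 0
	} else {
		c.logical++ // wall clock did not advance: bump the logical part
	}
	return c.wall, c.logical
}
```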


pkg/storage/mvcc.go line 1434 at r27 (raw file):

Previously, sumeerbhola wrote…

I think of intent == MVCCMetadata. How about calling this curProvisionalValRaw?

Done.


pkg/storage/mvcc.go line 1500 at r27 (raw file):

Previously, sumeerbhola wrote…

is this inlining unsafeNextVersion, or something more subtle?

Yes, just inlining unsafeNextVersion, which was only used in one place and not worth keeping.


pkg/storage/mvcc.go line 3167 at r27 (raw file):

Previously, sumeerbhola wrote…

I am trying to make sure I don't overlook something regarding what we need for older versions, for the Pebble design that keeps older versions in value blocks. Earlier, it was keeping just the value size together with the key. Now I think we need an additional bit regarding whether the underlying roachpb.Value is empty or not. Is my understanding correct?

That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.

I feel like I may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?


pkg/storage/pebble.go line 1310 at r12 (raw file):

Previously, sumeerbhola wrote…

I still see a couple of TODOs in the comment.

Heh, I thought you were talking about s/key/value/ and missed the TODOs. Done.


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?

Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.

@nvanbenschoten (Member, Author) commented:

Thanks for the review Sumeer!

I'm going to go ahead and merge this because it's at risk of skewing with other changes on master if it waits too long.

Aayush and I spent some time together walking through the changes last week. We were in agreement on the approach that this PR takes and nothing major came out of the code walk-through that hasn't since been addressed. I still think this could use a careful eye from someone on KV and I'm happy to address feedback in a follow-up PR.

bors r=sumeerbhola

@sumeerbhola (Collaborator) left a comment:

Reviewed 1 of 39 files at r37, 1 of 7 files at r39.


pkg/storage/mvcc.go line 3167 at r27 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.

I feel like I'm may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?

I need to reread some stuff to remember, but the rough idea had been that there are various places (gc, stats etc.) that only needed the size of an older version. At that time I didn't pay full attention to what it did with the size, but now I think it used the size both for this "is-value" determination and for the actual size when it was > 0. So if we want to be able to do similar things without reading the value block, we could keep the "is-value" too. And yes, we can also just do a fast-path which relies on most of the !is-value cases being ones where there is no explicit local timestamp, and the slow-path would read the value.

@sumeerbhola (Collaborator) left a comment:


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?

Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.

How about a TODO, since we are leaving on the table some fast-forwarding opportunity of the local timestamp? Future PR is fine since this is already running through bors.

@craig (bot) commented May 9, 2022:

Build succeeded:

// Furthermore, even if our pushType is not PUSH_ABORT, we may have ended up
// with the responsibility to abort the intents (for example if we find the
// transaction aborted). To do better here, we need per-intent information
// on whether we need to poison.
resolve := roachpb.MakeLockUpdate(pusheeTxn, roachpb.Span{Key: ws.key})
if pusheeTxn.Status == roachpb.PENDING {
// The pushee was still PENDING at the time that the push observed its
@aayushshah15 (Contributor) commented:

@nvanbenschoten: To confirm that I understand the idea here, we're saying "we want to move the local timestamp of an intent as close as possible to its mvcc commit timestamp". Is that fair to say?

In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING (in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).

If the above sounds good to you, what I don't understand is what would happen if we didn't have this "optimization". I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?

Member Author


@aayushshah15 and I talked about this in person and we're on the same page now. Summarizing the discussion below.

In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING (in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).

Yes, this is correct.

I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?

This is also mostly correct. Without this, a high-priority pusher that pushes the timestamp of another transaction would still see the pushee's intent as uncertain when it returned to read, because the intent would retain its local timestamp. It would then ratchet its read timestamp up to the pushee's and end up in the same situation when it attempted to read again. This could continue indefinitely. In effect, it would allow a high-priority reader to block on a lower-priority writer, a form of priority inversion.
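To make the forwarding step concrete, here is a self-contained Go sketch of the idea discussed above. All types and names are invented stand-ins, not the real kvserver/roachpb definitions:

```go
// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// Intent pairs an intent's provisional MVCC timestamp with its local
// timestamp, the leaseholder clock bound taken when the intent was written.
type Intent struct {
	Version Timestamp
	Local   Timestamp
}

// onPushObservedPending models the push-time update: the pusher captured
// the intent leaseholder's clock (clockBeforePush) before initiating the
// push and then found the pushee still PENDING, so the intent cannot
// causally precede the pusher. Forwarding its local timestamp moves the
// intent out of the pusher's uncertainty interval on its next read attempt.
func onPushObservedPending(intent *Intent, clockBeforePush Timestamp) {
	if intent.Local < clockBeforePush {
		intent.Local = clockBeforePush
	}
}
```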

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 8, 2022
Before this change, we were updating the local clock with each BatchResponse's
WriteTimestamp. This was meant to handle cases where the batch request timestamp
was forwarded during evaluation.

This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp
that leads the local HLC clock on the leaseholder that evaluated the request.
This has been true since cockroachdb#80706, which introduced the concept of a "local
timestamp". This allowed us to remove the (broken) attempt at ensuring that the
HLC on a leaseholder always leads the MVCC timestamp of all values in the
leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go`
in that PR).

The second is that it was not even correct. The idea behind bumping the HLC on
the response path was to ensure that if a batch request was forwarded to a newer
timestamp during evaluation and then completed a write, that forwarded timestamp
would be reflected in the leaseholder's HLC. However, this ignored the fact that
any forwarded timestamp must have either come from an existing value in the
range or from the leaseholder's clock. So if those didn't lead the clock, the
derived timestamp wouldn't either. It also ignored the fact that the clock bump
here was too late (post-latch release) and if it had actually been needed (it
wasn't), it wouldn't have even ensured that the timestamp on any lease transfer
led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns with this change, because cockroachdb#80706
ensured that any future-time operation will continue to use the synthetic
bit until all nodes are running v22.2 or later.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 9, 2022
…ncasting

This commit adds an explicit `ClockTimestamp` field called `Now` to the
`BatchRequest` header, which mirrors the `Now` field on the `BatchResponse`
header.

In doing so, it removes the last instance where we downcasted a `Timestamp` to a
`ClockTimestamp` using the `TryToClockTimestamp` method. With this change, MVCC
("operation") timestamps never flow back into HLC clocks as clock signals. This
was enabled by cockroachdb#80706 and sets the groundwork to remove synthetic timestamps in
v23.1 — the role they played in dynamic typing of clock timestamps is now
entirely fulfilled by statically typed `ClockTimestamp` channels.

This is an important step in separating out the MVCC timestamp domain from the
clock timestamp domain and clarifying the roles of the two layers. In turn, this
layering opens the door for CockroachDB to start thinking about dynamic clock
synchronization error bounds.
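As a rough illustration of the static-typing point, with simplified stand-in types rather than the real `hlc` package:

```go
// Timestamp is an MVCC ("operation") timestamp; ClockTimestamp is a
// timestamp known to have come from a clock. Simplified stand-ins only.
type Timestamp struct{ WallTime int64 }
type ClockTimestamp Timestamp

type Clock struct{ wall int64 }

// Update accepts only a ClockTimestamp. A caller holding a plain MVCC
// Timestamp cannot feed it back into the clock without an explicit,
// reviewable conversion, which is the channel this change closes off.
func (c *Clock) Update(t ClockTimestamp) {
	if t.WallTime > c.wall {
		c.wall = t.WallTime
	}
}
```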
craig bot pushed a commit that referenced this pull request Aug 11, 2022
63416: sql: emit point deletes during delete fastpath r=yuzefovich a=jordanlewis

Previously, the "deleteRange" SQL operator, which is meant to be a
fast path for cases in which an entire range of keys can be deleted,
always did what it said: emitted DeleteRange KV operations. This
precluded a crucial optimization: sending point deletes when the list of
deleted keys is exactly known.

For example, a query like `DELETE FROM kv WHERE k = 10000` uses the
"fast path" delete, since it has a contiguous set of keys to delete and
doesn't need to know the values that were deleted. But in this case the
performance is actually worse if we use a DeleteRange KV operation
(see #53939), because:

- ranged KV writes (DeleteRangeRequest) cannot be pipelined because an
  enumeration of the intents that they will leave cannot be known ahead
  of time. They must therefore perform evaluation and replication
  synchronously.
- ranged KV writes (DeleteRangeRequest) result in ranged intent
  resolution, which is less efficient (although this became less
  important since we re-enabled time-bound iterators).

The reason we couldn't previously emit point deletes in this case is
that SQL needs to know whether it deleted something or not. This means
we can't do a "blind put" of a deletion: we need to actually understand
whether there was something that we were "overwriting" with our delete.

This commit modifies the DeleteResponse to always return a boolean
indicating whether a key from the DeleteRequest was actually deleted.

Additionally, the deleteRange SQL operator detects when it can emit
single-key deletes, and does so.
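A minimal sketch of that dispatch, using an invented interface rather than the real kv batch API:

```go
// Batch abstracts the two KV write shapes discussed above.
// Invented interface for illustration; not the real kv package.
type Batch interface {
	// Delete is a point delete; it reports whether a key was deleted,
	// and point writes can be pipelined.
	Delete(key []byte)
	// DeleteRange is a ranged write; it must evaluate and replicate
	// synchronously and resolves intents by range.
	DeleteRange(start, end []byte)
}

// emitDeletes prefers point deletes whenever the exact key list is known
// and falls back to a ranged delete over the span otherwise.
func emitDeletes(b Batch, exactKeys [][]byte, start, end []byte) {
	if len(exactKeys) > 0 {
		for _, k := range exactKeys {
			b.Delete(k)
		}
		return
	}
	b.DeleteRange(start, end)
}
```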

Closes #53939.

Release note (performance improvement): point deletes in SQL are now
more efficient during concurrent workloads.

76233: kv: remove clock update on BatchResponse r=nvanbenschoten a=nvanbenschoten

Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp. This was meant to handle cases where the batch request timestamp was forwarded during evaluation.

This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp that leads the local HLC clock on the leaseholder that evaluated the request. This has been true since #80706, which introduced the concept of a "local timestamp". This allowed us to remove the (broken) attempt at ensuring that the HLC on a leaseholder always leads the MVCC timestamp of all values in the leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go` in that PR).

The second is that it was not even correct. The idea behind bumping the HLC on the response path was to ensure that if a batch request was forwarded to a newer timestamp during evaluation and then completed a write, that forwarded timestamp would be reflected in the leaseholder's HLC. However, this ignored the fact that any forwarded timestamp must have either come from an existing value in the range or from the leaseholder's clock. So if those didn't lead the clock, the derived timestamp wouldn't either. It also ignored the fact that the clock bump here was too late (post-latch release) and if it had actually been needed (it wasn't), it wouldn't have even ensured that the timestamp on any lease transfer led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns with this change, because #80706 ensured that any future-time operation will continue to use the synthetic bit until all nodes are running v22.2 or later.

85350: insights: ingester r=matthewtodd a=matthewtodd

Closes #81021.
    
Here we begin observing statements and transactions asynchronously, to
avoid slowing down the hot sql execution path as much as possible.
    
Release note: None

85440: colmem: improve memory-limiting behavior of the accounting helpers r=yuzefovich a=yuzefovich

**colmem: introduce a helper method when no memory limit should be applied**

This commit is a pure mechanical change.

Release note: None

**colmem: move some logic of capacity-limiting into the accounting helper**

This commit moves the logic that was duplicated across each user of the
SetAccountingHelper into the helper itself. Clearly, this allows us to
de-duplicate some code, but it'll make it easier to refactor the code
which is done in the following commit.

Additionally, this commit makes a tiny change to make the resetting
behavior in the hash aggregator more precise.

Release note: None

**colmem: improve memory-limiting behavior of the accounting helpers**

This commit fixes an oversight in how we allocate batches of "dynamic"
capacity. We have two related ways of reallocating batches, and both
work by growing the capacity of the batch until the memory limit is
exceeded, after which the batch is reused until the end of the query
execution. This is a reasonable heuristic under the assumption that all
tuples in the data stream are roughly equal in size, but this might not
be the case.

In particular, consider an example where 10k small rows of 1KiB are
followed by 10k large rows of 1MiB. According to our heuristic, we
happily grow the batch to a capacity of 1024 and then never shrink it,
so once the large rows start appearing we put 1GiB worth of data into a
single batch, significantly exceeding our memory limit (usually 64MiB
with the default `workmem` setting).

This commit introduces a new heuristic as follows (sketched in code below):
- the first time a batch exceeds the memory limit, its capacity is memorized,
  and from now on that capacity will determine the upper bound on the
  capacities of the batches allocated through the helper;
- if at any point in time a batch exceeds the memory limit by at least a
  factor of two, then that batch is discarded, and the capacity will never
  exceed half of the capacity of the discarded batch;
- if the memory limit is not reached, then the behavior of the dynamic growth
  of the capacity provided by `Allocator.ResetMaybeReallocate` is still
  applicable (i.e. the capacities will grow exponentially until
  coldata.BatchSize()).
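Here is a minimal sketch of those three rules, with invented names rather than the real colmem.SetAccountingHelper logic; batchSize stands in for coldata.BatchSize():

```go
const batchSize = 1024 // stand-in for coldata.BatchSize()

type capacityLimiter struct {
	maxCapacity int // 0 until the memory limit is first exceeded
}

// next returns the capacity to use for the next batch and whether the
// current batch should be discarded rather than reused.
func (cl *capacityLimiter) next(curCapacity, batchMem, memLimit int) (capacity int, discard bool) {
	if batchMem >= 2*memLimit {
		// Rule 2: at least 2x over the limit. Discard the batch; the
		// ceiling becomes half of its capacity.
		cl.maxCapacity = curCapacity / 2
		return cl.maxCapacity, true
	}
	if batchMem > memLimit && cl.maxCapacity == 0 {
		// Rule 1: first time over the limit. Memorize the capacity as
		// the ceiling for all future batches.
		cl.maxCapacity = curCapacity
	}
	if cl.maxCapacity != 0 {
		if curCapacity > cl.maxCapacity {
			return cl.maxCapacity, false
		}
		return curCapacity, false
	}
	// Rule 3: under the limit, keep growing exponentially up to batchSize.
	capacity = 2 * curCapacity
	if capacity > batchSize {
		capacity = batchSize
	}
	return capacity, false
}
```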

Note that this heuristic does not have the ability to grow the maximum
capacity once it's been set, although it might make sense to do so (say,
if after shrinking the capacity, the next five times we see that the
batch is using less than half of the memory limit). This is a conscious
omission since I want this change to be backported, and never growing
seems like the safer choice. Thus, this improvement is left as a TODO.

Also, we still might create batches that are too large in memory
footprint in those places that don't use the SetAccountingHelper (e.g.
in the columnarizer) since we perform the memory limit check at the
batch granularity. However, this commit improves things there so that we
don't reuse such a batch and instead use half of its capacity on the
next iteration.

Fixes: #76464.

Release note (bug fix): CockroachDB now more precisely respects the
`distsql_workmem` setting which improves the stability of each node and
makes OOMs less likely.

**colmem: unexport Allocator.ResetMaybeReallocate**

This commit is a mechanical change to unexport
`Allocator.ResetMaybeReallocate` so that the users would be forced to use
the method with the same name from the helpers. This required splitting
off the tests into two files.

Release note: None

85492: backupccl: remap all restored tables r=dt a=dt

This PR has a few changes, broken down into separate commits:
a) stop restoring tmp tables and remove the special-case code to synthesize their special schemas; these were previously restored only to be dropped, so that restored jobs that referenced them would not be broken, but we have since stopped restoring jobs.
b) synthesize type-change jobs during cluster restore; this goes with not restoring jobs.
c) fix some assumptions in tests/other code about what IDs restored tables have.
d) finally, always assign new IDs to all restored objects, even during cluster restore, removing the need to carefully move conflicting tables or other things around.

Commit-by-commit review recommended.


85930: jobs: make expiration use intended txn priority r=ajwerner a=rafiss

In aed014f these operations were supposed to be changed to use
MinUserPriority. However, they weren't using the appropriate txn, so it
didn't have the intended effect.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Aug 12, 2022
85764: kv: pass explicit Now timestamp on BatchRequest, remove timestamp downcasting r=nvanbenschoten a=nvanbenschoten

This commit adds an explicit `ClockTimestamp` field called `Now` to the `BatchRequest` header, which mirrors the `Now` field on the `BatchResponse` header.

In doing so, it removes the last instance where we downcasted a `Timestamp` to a `ClockTimestamp` using the `TryToClockTimestamp` method. With this change, MVCC ("operation") timestamps never flow back into HLC clocks as clock signals. This was enabled by #80706 and sets the groundwork to remove synthetic timestamps in v23.1 — the role they played in dynamic typing of clock timestamps is now entirely fulfilled by statically typed ClockTimestamp channels.

This is an important step in separating out the MVCC timestamp domain from the clock timestamp domain and clarifying the roles of the two layers. In turn, this layering opens the door for CockroachDB to start thinking about dynamic clock synchronization error bounds.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 25, 2023
Informs cockroachdb#101938.

This commit removes logic in mvcc key decoding routines to decode
synthetic timestamps. We retain the ability to decode keys with the
synthetic timestamp bit set, but we simply ignore its presence.

As discussed in the previous commit, the role of these synthetic
timestamp markers was eliminated in cockroachdb#80706 by the local_timestamp field
in the mvcc value header, which was first present in v22.2. v23.2 does
not require compatibility with v22.2, so it can rely on the fact that
any txn that has a synthetic timestamp (because it writes in the future)
will also write local timestamps into each of its values.

Release note: None
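For illustration, a simplified sketch of a decoder that tolerates the bit but ignores it. The layout here (<wall:8>[<logical:4>][<synthetic:1>]) is a simplification and the function is a stand-in, not the real pkg/storage decoding:

```go
import "encoding/binary"

// decodeVersionTimestamp decodes a simplified
// <wall:8>[<logical:4>][<synthetic:1>] timestamp suffix. A trailing
// synthetic marker byte, if present, is accepted for compatibility but
// no longer surfaced to callers.
func decodeVersionTimestamp(buf []byte) (wall int64, logical int32) {
	wall = int64(binary.BigEndian.Uint64(buf[:8]))
	if len(buf) >= 12 {
		logical = int32(binary.BigEndian.Uint32(buf[8:12]))
	}
	return wall, logical
}
```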