kv/storage: introduce local timestamps for MVCC versions in MVCCValue #80706

Merged

Conversation

@nvanbenschoten (Member) commented Apr 28, 2022:

Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we introduced the multi-register test to the Jepsen test suite, and have since observed the test fail when combined with the strobe-skews nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure [1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous with respect to the transaction commit (and occurs after the commit has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction [2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written at (its provisional timestamp, if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real-time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real-time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and can therefore consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.
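
To make the comparison concrete, here is a minimal sketch of the check in Go, using plain integers in place of hlc.Timestamp/hlc.ClockTimestamp; the function name and shape are illustrative, not the actual pkg/kv/kvserver/uncertainty API:

```go
package main

import "fmt"

// isUncertain sketches the per-value check described above. A value is a
// candidate for uncertainty only if its version timestamp falls in the
// window (readTs, globalLimit]. Within that window, the reader compares
// its observed timestamp from the key's leaseholder against the value's
// local timestamp: if the observation was captured before the value was
// written (observedTs < localTs), the write cannot have preceded the
// read in real time, so no uncertainty restart is needed.
func isUncertain(readTs, globalLimit, versionTs, localTs, observedTs int64) bool {
	if versionTs <= readTs || globalLimit < versionTs {
		return false // visible to the read, or entirely beyond its uncertainty window
	}
	if observedTs != 0 && observedTs < localTs {
		return false // observation precedes the write: concurrent, not uncertain
	}
	return true // the write may precede the read in real time: restart
}

func main() {
	// A value at version timestamp 120 was written when the leaseholder's
	// clock read 130. A reader at timestamp 100 (uncertainty limit 150)
	// that observed 125 from that leaseholder avoids the restart: 125 < 130.
	fmt.Println(isUncertain(100, 150, 120, 130, 125)) // false
	// If the value's local timestamp were 110 instead, the same reader
	// could not rule out real-time ordering: 110 <= 125.
	fmt.Println(isUncertain(100, 150, 120, 110, 125)) // true
}
```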

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the clock timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the value encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.
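
A minimal sketch of the decode-side defaulting that makes this omission safe (hypothetical names and integer timestamps; the real handling, including the synthetic-timestamp case, lives in pkg/storage/mvcc_value.go):

```go
package mvccsketch

// effectiveLocalTimestamp returns the local timestamp to use for a
// version whose encoded header may have omitted it (0 means omitted).
func effectiveLocalTimestamp(versionTs, headerLocalTs int64) int64 {
	if headerLocalTs != 0 {
		return headerLocalTs // explicitly encoded: trails the version timestamp
	}
	// Omitted means the true local timestamp led (or equaled) the version
	// timestamp. Returning the version timestamp rounds the local
	// timestamp down, which can only cause extra uncertainty restarts,
	// never a stale read.
	return versionTs
}
```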

MVCCValue

type MVCCValue struct {
    MVCCValueHeader
    Value roachpb.Value
}

To store the local timestamp, the commit introduces a new MVCCValue type to parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with MVCC-level metadata which is stored in an enginepb.MVCCValueHeader protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the MVCC value's encoding is identical to that of roachpb.Value. This provides backwards compatibility and ensures that the MVCCValue optimizes away in the common case. If the value's header is not empty, it is prepended to the roachpb.Value encoding. The encoding scheme's variants are:

Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>

The two encoding scheme variants are distinguished using the 5th byte, which is either the roachpb.Value tag (which has many possible values) or a sentinel tag not used by the roachpb.Value encoding which indicates the extended encoding scheme.
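
As a rough illustration of that dispatch, here is a sketch against the layout above; the sentinel constant and the exact semantics of the length field are assumptions for illustration, not the real values from pkg/storage/mvcc_value.go:

```go
package mvccsketch

import (
	"encoding/binary"
	"errors"
)

// extendedEncodingSentinel stands in for the reserved 5th byte; the
// actual sentinel value used by CockroachDB is not reproduced here.
const extendedEncodingSentinel byte = 0xff

// splitMVCCValue separates an encoded MVCCValue into its (possibly
// absent) header bytes and its roachpb.Value bytes by inspecting the
// 5th byte.
func splitMVCCValue(buf []byte) (header, value []byte, err error) {
	if len(buf) < 5 || buf[4] != extendedEncodingSentinel {
		// Simple encoding: the buffer is a roachpb.Value verbatim
		// (possibly empty, for a tombstone).
		return nil, buf, nil
	}
	// Extended encoding: <4-byte-header-len><1-byte-sentinel><header><value>.
	hdrLen := int(binary.BigEndian.Uint32(buf[:4]))
	if hdrLen < 0 || 5+hdrLen > len(buf) {
		return nil, nil, errors.New("malformed extended MVCCValue encoding")
	}
	return buf[5 : 5+hdrLen], buf[5+hdrLen:], nil
}
```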

Care was taken to ensure that the encoding and decoding routines for the "simple" encoding are fast, avoiding heap allocations, memory copies, and function calls by exploiting mid-stack inlining. See microbenchmarks below.
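
Continuing the sketch above (same hypothetical package, reusing its sentinel constant), the shape of such a fast path might look as follows; this is a simplified illustration, not the real EncodeMVCCValue:

```go
package mvccsketch

import "encoding/binary"

// encodeMVCCValue returns the roachpb.Value bytes unchanged when the
// header is empty, so the common case performs no allocation or copy
// and is small enough for the compiler to inline into callers.
func encodeMVCCValue(headerProto, valueRawBytes []byte) []byte {
	if len(headerProto) == 0 {
		return valueRawBytes // simple encoding: zero-copy fast path
	}
	// Extended encoding (rare, slow path): prepend the prelude.
	buf := make([]byte, 0, 4+1+len(headerProto)+len(valueRawBytes))
	buf = binary.BigEndian.AppendUint32(buf, uint32(len(headerProto)))
	buf = append(buf, extendedEncodingSentinel)
	buf = append(buf, headerProto...)
	return append(buf, valueRawBytes...)
}
```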

Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for ClockTimestamps is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two ClockTimestamps: the version's local timestamp and the reader's observed timestamp.

Correctness testing

I was not able to stress jepsen/multi-register/strobe-skews hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called TestTxnReadWithinUncertaintyIntervalAfterIntentResolution which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

MVCCValue Encoding and Decoding Microbenchmarks
name                                                                   time/op
EncodeMVCCValue/header=empty/value=tombstone-10                        3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                            3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                             3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10               38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                   42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10       40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10           42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10           7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10            0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10               8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10        22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10  32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10   22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10      32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10       22.7ns ± 0%

name                                                                   alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                    4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10            4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                   allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                          0.00
EncodeMVCCValue/header=empty/value=short-10                              0.00
EncodeMVCCValue/header=empty/value=long-10                               0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                 1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                      1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10         1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10             1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10              1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10             0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10              0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                   0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10         0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10          0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10    0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10     0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10         0.00

pkg/sql/tests End-To-End Microbenchmarks
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30             106µs ± 2%     104µs ± 2%  -1.38%  (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30             1.26ms ± 2%    1.24ms ± 2%  -1.25%  (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30                  281µs ± 2%     277µs ± 2%  -1.17%  (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30             107µs ± 2%     107µs ± 2%    ~     (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30            156µs ± 2%     155µs ± 4%    ~     (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30           570µs ± 1%     576µs ± 2%    ~     (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30         4.18ms ± 2%    4.23ms ± 2%    ~     (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30        43.9ms ± 2%    44.1ms ± 2%    ~     (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30                412µs ± 2%     410µs ± 2%    ~     (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30               511µs ± 2%     508µs ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30            8.18ms ± 1%    8.11ms ± 2%    ~     (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30           85.2ms ± 2%    84.7ms ± 2%    ~     (p=0.481 n=10+10)
KV/Update/Native/rows=1-30             163µs ± 2%     162µs ± 2%    ~     (p=0.436 n=10+10)
KV/Update/Native/rows=10-30            365µs ± 1%     365µs ± 1%    ~     (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30         143ms ± 1%     144ms ± 2%    ~     (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30                525µs ± 2%     522µs ± 3%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30               815µs ± 1%     810µs ± 1%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30             2.47ms ± 1%    2.48ms ± 1%    ~     (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30            16.6ms ± 1%    16.7ms ± 1%    ~     (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30            193ms ± 2%     195ms ± 3%    ~     (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30            173µs ± 2%     171µs ± 3%    ~     (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30           677µs ± 1%     674µs ± 0%    ~     (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30         5.20ms ± 2%    5.19ms ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30        53.7ms ± 1%    53.9ms ± 1%    ~     (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30                377µs ± 3%     375µs ± 1%    ~     (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30               503µs ± 1%     500µs ± 2%    ~     (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30             1.52ms ± 4%    1.51ms ± 2%    ~     (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30            2.53ms ± 3%    2.53ms ± 3%    ~     (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30           21.9ms ± 1%    21.8ms ± 1%    ~     (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30              29.6µs ± 2%    29.8µs ± 1%    ~     (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30            54.6µs ± 1%    55.0µs ± 2%    ~     (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30                 292µs ± 1%     290µs ± 1%    ~     (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30                364µs ± 2%     363µs ± 1%    ~     (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30             5.34ms ± 2%    5.32ms ± 1%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30    2.59ms ± 1%    2.59ms ± 2%    ~     (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30    2.57ms ± 3%    2.56ms ± 2%    ~     (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30    2.60ms ± 3%    2.59ms ± 2%    ~     (p=0.315 n=10+10)
KV/Update/Native/rows=100-30          2.02ms ± 1%    2.04ms ± 1%  +0.95%  (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30         16.2ms ± 2%    16.5ms ± 2%  +1.30%  (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30             32.6µs ± 2%    33.0µs ± 3%  +1.39%  (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30            266µs ± 1%     270µs ± 1%  +1.51%  (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30               982µs ± 1%     997µs ± 1%  +1.60%  (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30          2.18ms ± 1%    2.23ms ± 1%  +2.55%  (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30            15.6kB ± 0%    15.6kB ± 0%    ~     (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30        34.7MB ± 2%    35.0MB ± 2%    ~     (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30               41.3kB ± 1%    41.4kB ± 1%    ~     (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30              90.8kB ± 1%    90.8kB ± 0%    ~     (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30            5.69MB ± 1%    5.71MB ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30           69.6MB ± 1%    69.6MB ± 1%    ~     (p=0.853 n=10+10)
KV/Update/Native/rows=1-30            21.4kB ± 1%    21.4kB ± 0%    ~     (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30        67.8MB ± 1%    68.1MB ± 2%    ~     (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30               47.2kB ± 1%    47.3kB ± 1%    ~     (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30               116kB ± 1%     117kB ± 1%    ~     (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30            5.20MB ± 2%    5.18MB ± 1%    ~     (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30        30.5MB ± 2%    30.5MB ± 1%    ~     (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30               42.9kB ± 0%    43.0kB ± 1%    ~     (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30              66.9kB ± 3%    66.5kB ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30            1.19MB ± 0%    1.19MB ± 0%    ~     (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30           14.3MB ± 1%    14.3MB ± 0%    ~     (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30              7.17kB ± 0%    7.18kB ± 0%    ~     (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30             8.59kB ± 0%    8.59kB ± 0%    ~     (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30            21.2kB ± 0%    21.2kB ± 0%    ~     (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30            172kB ± 0%     172kB ± 0%    ~     (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30          1.51MB ± 0%    1.51MB ± 0%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30                 19.3kB ± 0%    19.3kB ± 1%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30                20.6kB ± 1%    20.7kB ± 0%    ~     (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30               32.8kB ± 1%    32.9kB ± 0%    ~     (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30               185kB ± 0%     185kB ± 0%    ~     (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30             1.10MB ± 0%    1.10MB ± 0%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30     793kB ± 3%     793kB ± 2%    ~     (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30     788kB ± 3%     783kB ± 3%    ~     (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30     787kB ± 3%     785kB ± 2%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30            15.1kB ± 0%    15.2kB ± 0%  +0.26%  (p=0.029 n=10+10)
KV/Update/Native/rows=10-30           66.8kB ± 0%    67.0kB ± 0%  +0.38%  (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30              550kB ± 1%     553kB ± 1%  +0.44%  (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30              578kB ± 1%     580kB ± 1%  +0.46%  (p=0.019 n=10+10)
KV/Update/Native/rows=100-30           503kB ± 0%     506kB ± 0%  +0.59%  (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30            153MB ± 1%     154MB ± 1%  +0.62%  (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30              306kB ± 0%     308kB ± 0%  +0.67%  (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30           36.1kB ± 1%    36.4kB ± 1%  +0.82%  (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30         4.75MB ± 1%    4.79MB ± 1%  +0.83%  (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30           39.7kB ± 1%    40.1kB ± 1%  +0.83%  (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30         2.56MB ± 1%    2.58MB ± 1%  +1.00%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30           279kB ± 1%     282kB ± 1%  +1.02%  (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30           243kB ± 1%     246kB ± 1%  +1.07%  (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30         2.23MB ± 1%    2.26MB ± 1%  +1.33%  (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30             58.6k ± 0%     58.4k ± 0%  -0.28%  (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30                   277 ± 0%       276 ± 0%  -0.25%  (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30             740k ± 0%      739k ± 0%  -0.12%  (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30               129 ± 0%       129 ± 0%    ~     (all equal)
KV/Insert/Native/rows=10-30              272 ± 0%       272 ± 0%    ~     (all equal)
KV/Insert/Native/rows=1000-30          14.3k ± 0%     14.3k ± 0%    ~     (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30          141k ± 0%      141k ± 0%    ~     (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30                  349 ± 0%       348 ± 0%    ~     (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30                 625 ± 0%       625 ± 0%    ~     (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30              3.12k ± 0%     3.12k ± 0%    ~     (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30             29.3k ± 0%     29.3k ± 0%    ~     (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30             294k ± 0%      293k ± 0%    ~     (p=0.134 n=9+7)
KV/Update/Native/rows=1-30               181 ± 0%       181 ± 0%    ~     (all equal)
KV/Update/Native/rows=10-30              458 ± 0%       458 ± 0%    ~     (all equal)
KV/Update/Native/rows=1000-30          26.9k ± 0%     27.0k ± 0%    ~     (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30          273k ± 0%      274k ± 0%    ~     (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30                  503 ± 0%       503 ± 0%    ~     (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30                 904 ± 1%       905 ± 0%    ~     (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30              6.79k ± 0%     6.79k ± 0%    ~     (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30               128 ± 0%       128 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10-30              248 ± 0%       248 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10000-30          112k ± 0%      112k ± 0%    ~     (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30                  331 ± 0%       331 ± 0%    ~     (all equal)
KV/Delete/SQL/rows=10-30                 512 ± 0%       512 ± 0%    ~     (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30              2.01k ± 0%     2.01k ± 0%    ~     (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30             7.51k ± 0%     7.51k ± 0%    ~     (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30            71.2k ± 0%     71.2k ± 0%    ~     (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30                54.0 ± 0%      54.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=10-30               58.0 ± 0%      58.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=100-30              62.3 ± 1%      62.3 ± 1%    ~     (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30             72.3 ± 1%      72.0 ± 0%    ~     (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30             115 ± 1%       115 ± 1%    ~     (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30                    242 ± 0%       242 ± 0%    ~     (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30                  648 ± 0%       648 ± 0%    ~     (all equal)
KV/Scan/SQL/rows=1000-30               5.04k ± 0%     5.04k ± 0%    ~     (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30              50.3k ± 0%     50.3k ± 0%    ~     (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30     7.61k ± 0%     7.60k ± 0%    ~     (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.888 n=10+9)
KV/Update/Native/rows=100-30           2.76k ± 0%     2.76k ± 0%  +0.03%  (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30          11.3k ± 0%     11.3k ± 0%  +0.03%  (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30           1.57k ± 0%     1.57k ± 0%  +0.06%  (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30           1.27k ± 0%     1.28k ± 0%  +0.08%  (p=0.000 n=8+8)

YCSB benchmark suite
name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%    ~     (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%    ~     (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%    ~     (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%  -1.37%  (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%    ~     (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%    ~     (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%    ~     (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%    ~     (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32    129k ± 2%    130k ± 1%    ~     (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32    105k ± 2%    106k ± 1%  +1.06%  (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%    ~     (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%  +5.35%  (p=0.029 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.32 ± 6%    6.23 ± 3%    ~     (p=0.635 n=10+10)
ycsb/B/nodes=3           5.02 ± 4%    4.97 ± 3%    ~     (p=0.542 n=10+10)
ycsb/C/nodes=3           3.49 ± 3%    3.50 ± 0%    ~     (p=0.443 n=10+9)
ycsb/D/nodes=3           2.80 ± 0%    2.87 ± 2%  +2.50%  (p=0.008 n=8+10)
ycsb/E/nodes=3           45.8 ± 1%    45.8 ± 1%    ~     (p=0.718 n=10+7)
ycsb/F/nodes=3           13.5 ± 3%    13.6 ± 4%    ~     (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32    4.82 ±10%    4.62 ± 6%    ~     (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32    2.00 ± 0%    1.96 ± 3%    ~     (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32    1.50 ± 0%    1.50 ± 0%    ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.40 ± 0%    1.36 ± 4%    ~     (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32    43.3 ± 1%    43.3 ± 0%    ~     (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32    9.30 ±11%    8.79 ± 5%  -5.48%  (p=0.022 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           52.0 ±17%    48.4 ±18%    ~     (p=0.379 n=10+10)
ycsb/B/nodes=3           32.2 ±11%    32.2 ± 9%    ~     (p=0.675 n=10+10)
ycsb/C/nodes=3           13.8 ± 6%    13.8 ± 3%    ~     (p=1.000 n=10+10)
ycsb/D/nodes=3           12.6 ± 0%    12.9 ± 2%  +2.38%  (p=0.023 n=8+10)
ycsb/E/nodes=3            200 ± 8%     197 ± 6%    ~     (p=0.375 n=10+8)
ycsb/F/nodes=3            124 ±19%     125 ±21%    ~     (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32    68.1 ±17%    63.9 ± 5%    ~     (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32    15.7 ± 0%    15.5 ± 2%    ~     (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32    10.5 ± 0%    10.2 ± 3%  -2.86%  (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32    8.70 ± 3%    8.40 ± 0%  -3.45%  (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32     151 ± 0%     151 ± 0%    ~     (all equal)
ycsb/F/nodes=3/cpu=32     148 ±15%     145 ±10%    ~     (p=0.503 n=9+10)

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

Footnotes

[1] See pkg/kv/kvserver/observedts/doc.go for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[2] See the analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846.

@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 3 times, most recently from c46d218 to 204c01e on April 28, 2022 17:50
@nvanbenschoten (Member, Author) commented:

@sumeerbhola @aayushshah15 @erikgrinaker this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key. We've discussed the difference between these two options in a few different venues. At a high-level, I think the tradeoffs between the two approaches are:

Option 1: Store local_timestamp in mvcc key suffix
+ Enables fast-path to fully resolve uncertainty checking by looking at pebble key, which may be important for key-value separation
+ MVCC can continue to treat values as opaque bytes
+ Smaller change in general, no need for complex value encoding scheme
- Confuses key identity, feels wrong
- Less flexible
- Unclear interaction with Pebble range keys
- Depends heavily on #77520

Option 2: Store local_timestamp in mvcc value
- The value is now required to fully resolve uncertainty, which may harm key-value separation
- MVCC must enforce an encoding scheme for values
- Larger change, some subtle logic around MVCCValue vs. roachpb.Value
+ Places the local_timestamp metadata where it feels like it belongs, simpler to understand
+ Works with Pebble range keys
+ Now that we have the MVCCValue type with a protobuf header, we can use it in other cases (within reason). For instance, it could be used to store kv-level TTL information if we wanted to push TTL into the MVCC layer
+ We could revert #77520 if we wanted to, though I don't intend to until at least after we've migrated away from synthetic timestamps

Both approaches touch performance-sensitive code, so both have to be very careful to not introduce regressions.

After writing this out and thinking about the trade-offs, I'm leaning towards the approach taken by this PR. The main reason is that I don't think the impact of this on key-value separation will be pronounced. Storing the local timestamp in the value means that we need to fetch the value of keys seen by scans to check uncertainty if 1) the scan has an uncertainty interval and 2) the scan is reading at a timestamp below the value's timestamp. I'm going to assume that we don't care about cases where the value actually is uncertain, because this leads to a much more expensive uncertainty restart and is also a cost paid once (i.e. it doesn't compound because we stop scanning once we see an uncertain value). So let's focus on the case where the values are above the read timestamp but not uncertain.

First off, is this a common situation in workloads that we care about? I'm not sure that it is because any historical AOST read does not have an uncertainty interval and any consistent read will rarely see keys at timestamps above its read timestamp, as they would have needed to be written roughly concurrently with the read. So this is really only a problem for long-running consistent scans operating over keyspaces that see high write traffic.

If this does cause issues and we do find that avoiding a fetch of the value of keys above a consistent read's timestamp is important for performance, we could avoid some of the cost by splitting uncertainty checking into two phases. We could first perform an imprecise check using just the version timestamp from the key, comparing it to the scan's global uncertainty limit. Only if this check fails would we need to pull the value to perform the precise uncertainty check using the local timestamp and the local uncertainty limit. That means that long-running scans would have a max_clock_offset-sized window above their read timestamp where checking for uncertainty against concurrently written values would be less efficient than it could be (and again, this suboptimality only materializes when an observed timestamp is used to avoid uncertainty). My intuition is that this is not a worthwhile case to optimize for at this point.
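
A sketch of that two-phase split (hypothetical names and integer timestamps as in the earlier sketch; fetchLocalTs stands in for the value fetch whose cost we want to avoid):

```go
package mvccsketch

// twoPhaseUncertain first runs the imprecise, key-only check against the
// global uncertainty limit, and only fetches the value (to obtain the
// local timestamp) when that check cannot rule out uncertainty.
func twoPhaseUncertain(
	versionTs, readTs, globalLimit, localLimit int64,
	fetchLocalTs func() int64, // reads the value; the cost being avoided
) bool {
	// Phase 1: key only. Values at or below the read timestamp, or above
	// the global limit, are never uncertain, so most values never require
	// a value fetch for uncertainty purposes.
	if versionTs <= readTs || globalLimit < versionTs {
		return false
	}
	// Phase 2: precise check using the local timestamp from the value
	// against the local uncertainty limit (0 means no observed timestamp).
	if localLimit != 0 && localLimit < fetchLocalTs() {
		return false
	}
	return true
}
```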

Meanwhile, I think there are immediately meaningful benefits to this approach.


If you've already reviewed #77342 then only the last two commits here are new. Thanks for working with me through the multiple iterations of this patch.

@nvanbenschoten marked this pull request as ready for review April 28, 2022 19:46
@nvanbenschoten requested review from a team as code owners April 28, 2022 19:46
@nvanbenschoten requested review from stevendanna, sumeerbhola, aayushshah15 and erikgrinaker and removed request for a team April 28, 2022 19:46
@nvanbenschoten (Member, Author) commented:

I added benchmark results demonstrating the impact of this change on performance to the PR description. The benchmarks come from three different levels: microbenchmarks around MVCC value encoding and decoding, pkg/sql/tests end-to-end microbenchmarks, and the YCSB benchmark suite run on roachprod clusters. The results demonstrate that this change has a negligible impact on workload performance.

@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 2 times, most recently from ffe93b5 to ba07e50 on May 4, 2022 23:19
@sumeerbhola (Collaborator) commented:

> So let's focus on the case where the values are above the read timestamp but not uncertain.

Also, when Pebble starts storing older versions in a value block, and there is exactly one version above this read timestamp, that version's value will not be in the value block.


// D0 ————————————————————————————————————————————————
//
// MVCCKey
@nvanbenschoten (Member, Author) commented:

TODO(@nvanbenschoten): fix this, now that the local timestamp is in the value.

@sumeerbhola (Collaborator) left a comment:

(still reading -- flushing some comments)

Reviewed 14 of 64 files at r11.


pkg/storage/mvcc.go line 400 at r11 (raw file):

		// Straightforward: old contribution goes, new contribution comes, and we're done.
		ms.SysBytes -= origMetaKeySize + origMetaValSize + orig.KeyBytes + orig.ValBytes
		ms.SysBytes += metaKeySize + metaValSize + meta.KeyBytes + meta.ValBytes

Was this a bug in the existing code since it was not using orig.{KeyBytes,ValBytes} and meta.{KeyBytes,ValBytes}?
The KeyBytes is always MVCCVersionTimestampSize, yes?


pkg/storage/mvcc_value.go line 60 at r11 (raw file):

//
// In either encoding, the suffix corresponding to the roachpb.Value can be
// omitted, indicating a deletion tombstone. For the simple encoding, this

This is a bit confusing, partly because roachpb.Value is already encoded, in that Value.RawBytes is already the encoded value. How about spelling this out a bit more, with something like:
// For a deletion tombstone, the encoding of roachpb.Value is special cased to be empty, i.e., no checksum, tag or encoded-data. In that case the extended encoding above is
// simply <4-byte-header-len><1-byte-sentinel>.


pkg/storage/mvcc_value.go line 103 at r11 (raw file):

	if v.LocalTimestamp.IsEmpty() {
		if k.Timestamp.Synthetic {
			return hlc.MinClockTimestamp

Is this to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit? Could use a code comment.


pkg/storage/mvcc_value.go line 170 at r11 (raw file):

		return nil, errors.Wrap(err, "marshaling MVCCValueHeader")
	}
	// <4-byte-checksum><1-byte-tag><encoded-data>

or empty for a tombstone?


pkg/storage/enginepb/mvcc.go line 266 at r11 (raw file):

	if meta.LocalTimestamp == nil {
		if meta.Timestamp.ToTimestamp().Synthetic {
			return hlc.MinClockTimestamp

I assume this is to ensure that we can never ignore this value based on observed timestamp, and have to use the global uncertainty limit. Is that correct? Could use a code comment.


pkg/storage/enginepb/mvcc.proto line 40 at r11 (raw file):

  // The local timestamp of the most recent versioned value if this is a
  // value that may have multiple versions. For values which may have only
  // one version, this timestamp is set to nil. See MVCCValueHeader for

"may have only version" is a bit confusing to me. Is this talking about this particular key or the category of versioned values? I assume it is the former, since this comment is about versioned values and that means it is always possible to have multiple versions.
Why do we need this at all -- it seems duplication from what we have stored in the provisional value?


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

    // Value is the value written to the key as part of the transaction at
    // the above Sequence. Value uses the roachpb.Value encoding.
    optional bytes value = 2;

So the local timestamp is not stored for the older sequence of values written by this transaction?
And if one of those was restored due to a savepoint rollback, is it behaving like a new write in terms of the local timestamp?
A code comment would be useful.

@sumeerbhola (Collaborator) left a comment:

Apart from a few tests, I've finished reading the last 2 commits. Looks good!

Reviewed 40 of 64 files at r11, 11 of 15 files at r12.


pkg/kv/kvserver/batcheval/cmd_add_sstable.go line 408 at r7 (raw file):

			return errors.AssertionFailedf("SST contains inline value or intent for key %s", key)
		}
		if len(value) == 0 {

just curious: what is the confidence level that all such code has been found and changed?
Tests that don't write a local timestamp would still succeed with buggy code, yes?
Should we have some way to force all tests to write local timestamps?


pkg/kv/kvserver/uncertainty/interval.go line 55 at r11 (raw file):

// version timestamp and with the specified uncertainty interval.
func (in Interval) IsUncertain(valueTs hlc.Timestamp, localTs hlc.ClockTimestamp) bool {
	if !in.LocalLimit.IsEmpty() && in.LocalLimit.Less(localTs) {

Is everything required to have a non-empty localTs now, whether it be the version timestamp itself or hlc.MinTimestamp for the synthetic case?
Should we assert !localTs.IsEmpty() to make sure we have not forgotten something?


pkg/storage/mvcc.go line 3087 at r10 (raw file):

		if err != nil {
			return false, err
		}

This change looks like it is mainly code movement, since we need to figure out newValue.LocalTimestamp before updating and writing meta. Is that correct?
Do you think we have good test coverage of the various paths in this function?
Do you think we have good test coverage of the various paths in this function?


pkg/storage/mvcc.go line 858 at r11 (raw file):

	iterAlreadyPositioned bool,
	meta *enginepb.MVCCMetadata,
	localTs *hlc.ClockTimestamp,

is localTS for allocation avoidance? is it required that meta.localTS be populated if the return value of ok is true?
A code comment would help.


pkg/storage/mvcc.go line 1495 at r11 (raw file):

					}
					if haveNextVersion {
						prevVal, err := DecodeMVCCValue(prevUnsafeVal)

is this rare enough to use this less optimized function?


pkg/storage/mvcc.go line 3045 at r11 (raw file):

		// TODO(nvanbenschoten): this is an awkward interface. We shouldn't
		// be mutating meta and we shouldn't be restoring the previous value
		// here. Instead, this should all be handled down below.

This whole intent resolution function is quite confusing, though we probably need to shore up our randomized testing before making any significant changes here. There is also a TODO mentioned in mvcc_test.go (may be stale now)

	// TODO(sumeer): mvccResolveWriteIntent has a bug when the txn is being
	// ABORTED and there are IgnoredSeqNums that are causing a partial rollback.
	// It does the partial rollback and does not actually resolve the intent.
	// This does not affect correctness since the intent resolution will get
	// retried.

pkg/storage/pebble.go line 1310 at r12 (raw file):

// a cluster are at or beyond clusterversion.TODO, different nodes will see the
// version state transition at different times. Nodes that have not yet seen the
// transition may remove the local timestamp from an intent that has one during

comment needs updating

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 7, 2022
…tents

Related to cockroachdb#80706.
Related to cockroachdb#66485.

This commit makes a slight modification to `pebbleMVCCScanner` that changes how
it determines whether an intent is uncertain or not. Instead of consulting the
version timestamp in the `MVCCMetadata` struct and comparing that against the
scan's uncertainty interval, the `pebbleMVCCScanner` now scans through the
uncertainty interval and uses the intent's provisional value's timestamp to
determine uncertainty.

The `pebbleMVCCScanner` was actually already doing this scan to compute
uncertainty for other committed values in its uncertainty interval if it found
that the intent was not uncertain. However, after this change, it also relies on
the scan to compute uncertainty for the intent itself. This is safe, because the
calling code knows that the intent has a higher timestamp than the scan, so
there is no risk that the scan adds the provisional value to its result set.

This change is important for two reasons:
1. it avoids the need to store the `local_timestamp` field (introduced in cockroachdb#80706)
   in the `MVCCMetadata` struct.
2. it moves us closer in the direction of using `MVCCMetadata` values (ts=0,
   essentially locks protecting provisional values) to determine read-write
   conflicts but then using versioned provisional values to determine uncertainty.
   Doing so allows us to decompose a KV scan into a separate lock table scan to
   detect read-write conflicts and a MVCC scan to accumulate a result set while
   checking for uncertainty. This will be important for cockroachdb#66485.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch from ba07e50 to 8e87495 on May 9, 2022 03:10
@nvanbenschoten requested a review from a team as a code owner May 9, 2022 03:10
@nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch 2 times, most recently from 95aae23 to 841c16e on May 9, 2022 04:30
This commit adds a cluster version gate and a cluster setting for
local timestamps, to assist with their migration into an existing
cluster.

This fixes mixed-version clusters' interaction with local timestamps.
This commit switches from storing encoded `roachpb.Value`s to storing
encoded `storage.MVCCValue`s in `MVCCMetadata`'s `SequencedIntent`s. Doing
so ensures that MVCCValue headers are not lost when an intent is rolled
back. This is important to avoid losing the local timestamp of values in a
key's intent history. Failure to do so could allow for stale reads.
This commit adds an assertion to `Interval.IsUncertain` that the provided
value and local timestamps are non-zero.
…testing

This commit adds a metamorphic knob that randomly disables the simple MVCC value
encoding scheme. Doing so ensures that code which interacts with encoded MVCC
values does not mistake these values for encoded roachpb values. This could take
place in two different ways:
1. broken code could assign an encoded MVCC value directly to a roachpb.Value's
   `RawBytes` field. This typically caused the test to fail with an error.
2. broken code could assume that a non-zero-length value was not a tombstone.
   This caused tests to fail in more obscure ways.

The commit then fixes broken tests in one of three ways:
- it fixes incorrect assumptions about the MVCC value encoding being equivalent
  to the roachpb value encoding.
- it updates a few important tests (mostly related to MVCC stats) to work with
  and without the simple encoding.
- it skips a few unimportant tests when the simple encoding scheme is disabled.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/localTimestampValue branch from 081c53a to 720c8cf Compare May 9, 2022 16:39
@nvanbenschoten (Member, Author) left a comment:


pkg/kv/kvserver/uncertainty/doc.go line 200 at r27 (raw file):

Previously, sumeerbhola wrote…

MVCC key-value pairs track their ...

Done.


pkg/kv/kvserver/uncertainty/doc.go line 204 at r27 (raw file):

Previously, sumeerbhola wrote…

should this be: with a greater or equal timestamp than the intent's local timestamp?
my understanding was that the local clock for a node is not advanced if a write has sampled it.

Kind of. HLC clock readings are strictly monotonically increasing, meaning that no two clock readings will be identical. So if a local timestamp is assigned from a reading of the leaseholder's clock, any future clock reading will be greater than this value.
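
A simplified sketch of what strict monotonicity means for an HLC read (illustrative only; the real clock lives in pkg/util/hlc and handles more, e.g. logical-counter overflow and offset checks):

```go
package mvccsketch

import (
	"sync"
	"time"
)

// hlcSketch is a simplified hybrid logical clock. Now is strictly
// monotonic: no two readings are identical, even when the wall clock
// stalls, which is what makes "any later clock reading exceeds a
// previously assigned local timestamp" hold.
type hlcSketch struct {
	mu      sync.Mutex
	wall    int64
	logical int32
}

func (c *hlcSketch) Now() (wall int64, logical int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pw := time.Now().UnixNano(); pw > c.wall {
		c.wall, c.logical = pw, 0
	} else {
		c.logical++ // wall clock did not advance: bump the logical part
	}
	return c.wall, c.logical
}
```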


pkg/storage/mvcc.go line 1434 at r27 (raw file):

Previously, sumeerbhola wrote…

I think of intent == MVCCMetadata. How about calling this curProvisionalValRaw?

Done.


pkg/storage/mvcc.go line 1500 at r27 (raw file):

Previously, sumeerbhola wrote…

is this inlining unsafeNextVersion, or something more subtle?

Yes, just inlining unsafeNextVersion, which was only used in one place and not worth keeping.


pkg/storage/mvcc.go line 3167 at r27 (raw file):

Previously, sumeerbhola wrote…

I am trying to make sure I don't overlook something regarding what we need for older versions, for the Pebble design that keeps older versions in value blocks. Earlier, it was keeping just the value size together with the key. Now I think we need an additional bit regarding whether the underlying roachpb.Value is empty or not. Is my understanding correct?

That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.

I feel like I may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?


pkg/storage/pebble.go line 1310 at r12 (raw file):

Previously, sumeerbhola wrote…

I still see a couple of TODOs in the comment.

Heh, I thought you were talking about s/key/value/ and missed the TODOs. Done.


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?

Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.

@nvanbenschoten (Member, Author) commented:

Thanks for the review Sumeer!

I'm going to go ahead and merge this because it's at risk of skewing with other changes on master if it waits too long.

Aayush and I spent some time together walking through the changes last week. We were in agreement on the approach that this PR takes and nothing major came out of the code walk-through that hasn't since been addressed. I still think this could use a careful eye from someone on KV and I'm happy to address feedback in a follow-up PR.

bors r=sumeerbhola

@sumeerbhola (Collaborator) left a comment:

Reviewed 1 of 39 files at r37, 1 of 7 files at r39.


pkg/storage/mvcc.go line 3167 at r27 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

That depends on why it was keeping the value size together with the key. Was that because it was using this size to determine which keys were pointing at tombstones without looking at the value? We can still introduce a fast-path to avoid looking at the value block when the value size is 0, but we will no longer be able to say that a value is not a tombstone by just looking at its value size.

I feel like I'm may be missing something here. Could you explain why we would need an additional bit regarding whether the underlying roachpb.Value is empty or not?

I need to reread some stuff to remember, but the rough idea had been that there are various places (gc, stats etc.) that only needed the size of an older version. At that time I didn't pay full attention to what it did with the size, but now I think it used the size both for this "is-value" determination and for the actual size when it was > 0. So if we want to be able to do similar things without reading the value block, we could keep the "is-value" too. And yes, we can also just do a fast-path which relies on most of the !is-value cases being ones where there is no explicit local timestamp, and the slow-path would read the value.

@sumeerbhola (Collaborator) left a comment:


pkg/storage/enginepb/mvcc.proto line 72 at r11 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Though I suppose one could use the local timestamp from the latest provisional value, since that is the latest local timestamp we know that the txn is still active?

Yes, this would be correct, but I agree with you that it's not worth optimizing for at this point if doing so leads to more complex code.

How about a TODO, since we are leaving on the table some fast-forwarding opportunity of the local timestamp? Future PR is fine since this is already running through bors.

@craig (bot) commented May 9, 2022:

Build succeeded:

// Furthermore, even if our pushType is not PUSH_ABORT, we may have ended up
// with the responsibility to abort the intents (for example if we find the
// transaction aborted). To do better here, we need per-intent information
// on whether we need to poison.
resolve := roachpb.MakeLockUpdate(pusheeTxn, roachpb.Span{Key: ws.key})
if pusheeTxn.Status == roachpb.PENDING {
// The pushee was still PENDING at the time that the push observed its
@aayushshah15 (Contributor) commented:

@nvanbenschoten: To confirm that I understand the idea here, we're saying "we want to move the local timestamp of an intent as close as possible to its mvcc commit timestamp". Is that fair to say?

In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING (in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).

If the above sounds good to you, what I don't understand is what would happen if we didn't have this "optimization". I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?

Member Author


@aayushshah15 and I talked about this in person and we're on the same page now. Summarizing the discussion below.

In other words, with this commit we're leveraging the fact that before every Push, we have the opportunity to look at the intent leaseholder's local clock and update the intent's local timestamp if its txn is found to be PENDING (in order to move the intent out of the pusher's uncertainty window, since the intent's txn could not have causally preceded the pusher).

Yes, this is correct.

I recall from our meeting that you'd alluded to this being more than just an optimization. Without this optimization, a reader might redundantly block on a txn that commits way later and doesn't causally precede the reader, yes?

This is also mostly correct. Without this, a high-priority pusher that pushes the timestamp of another transaction would still see the pushee's intent as uncertain when it returned to read, because the intent would retain its local timestamp. It would then ratchet its read timestamp up to the pushee's and end up in the same situation when it attempted to read again. This could continue indefinitely. In effect, it would allow a high-priority reader to block on a lower-priority writer, a form of priority inversion.
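To make the forwarding step concrete, here is a self-contained Go sketch of the idea discussed above. All types and names are invented stand-ins, not the real kvserver/roachpb definitions:

```go
// Timestamp is a simplified stand-in for an HLC timestamp.
type Timestamp int64

// Intent pairs an intent's provisional MVCC timestamp with its local
// timestamp, the leaseholder clock bound taken when the intent was written.
type Intent struct {
	Version Timestamp
	Local   Timestamp
}

// onPushObservedPending models the push-time update: the pusher captured
// the intent leaseholder's clock (clockBeforePush) before initiating the
// push and then found the pushee still PENDING, so the intent cannot
// causally precede the pusher. Forwarding its local timestamp moves the
// intent out of the pusher's uncertainty interval on its next read attempt.
func onPushObservedPending(intent *Intent, clockBeforePush Timestamp) {
	if intent.Local < clockBeforePush {
		intent.Local = clockBeforePush
	}
}
```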

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 8, 2022
Before this change, we were updating the local clock with each BatchResponse's
WriteTimestamp. This was meant to handle cases where the batch request timestamp
was forwarded during evaluation.

This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp
that leads the local HLC clock on the leaseholder that evaluated the request.
This has been true since cockroachdb#80706, which introduced the concept of a "local
timestamp". This allowed us to remove the (broken) attempt at ensuring that the
HLC on a leaseholder always leads the MVCC timestamp of all values in the
leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go`
in that PR).

The second is that it was not even correct. The idea behind bumping the HLC on
the response path was to ensure that if a batch request was forwarded to a newer
timestamp during evaluation and then completed a write, that forwarded timestamp
would be reflected in the leaseholder's HLC. However, this ignored the fact that
any forwarded timestamp must have either come from an existing value in the
range or from the leaseholder's clock. So if those didn't lead the clock, the
derived timestamp wouldn't either. It also ignored the fact that the clock bump
here was too late (post-latch release) and if it had actually been needed (it
wasn't), it wouldn't have even ensured that the timestamp on any lease transfer
led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns with this change, because cockroachdb#80706
ensured that any future-time operation will continue to use the synthetic
bit until all nodes are running v22.2 or later.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Aug 9, 2022
…ncasting

This commit adds an explicit `ClockTimestamp` field called `Now` to the
`BatchRequest` header, which mirrors the `Now` field on the `BatchResponse`
header.

In doing so, it removes the last instance where we downcasted a `Timestamp` to a
`ClockTimestamp` using the `TryToClockTimestamp` method. With this change, MVCC
("operation") timestamps never flow back into HLC clocks as clock signals. This
was enabled by cockroachdb#80706 and sets the groundwork to remove synthetic timestamps in
v23.1 — the role they played in dynamic typing of clock timestamps is now
entirely fulfilled by statically typed `ClockTimestamp` channels.

This is an important step in separating out the MVCC timestamp domain from the
clock timestamp domain and clarifying the roles of the two layers. In turn, this
layering opens the door for CockroachDB to start thinking about dynamic clock
synchronization error bounds.
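As a rough illustration of the static-typing point, with simplified stand-in types rather than the real `hlc` package:

```go
// Timestamp is an MVCC ("operation") timestamp; ClockTimestamp is a
// timestamp known to have come from a clock. Simplified stand-ins only.
type Timestamp struct{ WallTime int64 }
type ClockTimestamp Timestamp

type Clock struct{ wall int64 }

// Update accepts only a ClockTimestamp. A caller holding a plain MVCC
// Timestamp cannot feed it back into the clock without an explicit,
// reviewable conversion, which is the channel this change closes off.
func (c *Clock) Update(t ClockTimestamp) {
	if t.WallTime > c.wall {
		c.wall = t.WallTime
	}
}
```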
craig bot pushed a commit that referenced this pull request Aug 11, 2022
63416: sql: emit point deletes during delete fastpath r=yuzefovich a=jordanlewis

Previously, the "deleteRange" SQL operator, which is meant to be a
fast path for cases in which an entire range of keys can be deleted,
always did what it said: emitted DeleteRange KV operations. This
precluded a crucial optimization: sending point deletes when the list of
deleted keys is exactly known.

For example, a query like `DELETE FROM kv WHERE k = 10000` uses the
"fast path" delete, since it has a contiguous set of keys to delete and
doesn't need to know the values that were deleted. But in this case the
performance is actually worse if we use a DeleteRange KV operation
(see #53939), because:

- ranged KV writes (DeleteRangeRequest) cannot be pipelined because an
  enumeration of the intents that they will leave cannot be known ahead
  of time. They must therefore perform evaluation and replication
  synchronously.
- ranged KV writes (DeleteRangeRequest) result in ranged intent
  resolution, which is less efficient (although this became less
  important since we re-enabled time-bound iterators).

The reason we couldn't previously emit point deletes in this case is
that SQL needs to know whether it deleted something or not. This means
we can't do a "blind put" of a deletion: we need to actually understand
whether there was something that we were "overwriting" with our delete.

This commit modifies the DeleteResponse to always return a boolean
indicating whether a key from the DeleteRequest was actually deleted.

Additionally, the deleteRange SQL operator detects when it can emit
single-key deletes, and does so.
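A minimal sketch of that dispatch, using an invented interface rather than the real kv batch API:

```go
// Batch abstracts the two KV write shapes discussed above.
// Invented interface for illustration; not the real kv package.
type Batch interface {
	// Delete is a point delete; it reports whether a key was deleted,
	// and point writes can be pipelined.
	Delete(key []byte)
	// DeleteRange is a ranged write; it must evaluate and replicate
	// synchronously and resolves intents by range.
	DeleteRange(start, end []byte)
}

// emitDeletes prefers point deletes whenever the exact key list is known
// and falls back to a ranged delete over the span otherwise.
func emitDeletes(b Batch, exactKeys [][]byte, start, end []byte) {
	if len(exactKeys) > 0 {
		for _, k := range exactKeys {
			b.Delete(k)
		}
		return
	}
	b.DeleteRange(start, end)
}
```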

Closes #53939.

Release note (performance improvement): point deletes in SQL are now
more efficient during concurrent workloads.

76233: kv: remove clock update on BatchResponse r=nvanbenschoten a=nvanbenschoten

Before this change, we were updating the local clock with each BatchResponse's WriteTimestamp. This was meant to handle cases where the batch request timestamp was forwarded during evaluation.

This was unnecessary for two reasons.

The first is that a BatchResponse can legitimately carry an operation timestamp that leads the local HLC clock on the leaseholder that evaluated the request. This has been true since #80706, which introduced the concept of a "local timestamp". This allowed us to remove the (broken) attempt at ensuring that the HLC on a leaseholder always leads the MVCC timestamp of all values in the leaseholder's keyspace (see the update to `pkg/kv/kvserver/uncertainty/doc.go` in that PR).

The second is that it was not even correct. The idea behind bumping the HLC on the response path was to ensure that if a batch request was forwarded to a newer timestamp during evaluation and then completed a write, that forwarded timestamp would be reflected in the leaseholder's HLC. However, this ignored the fact that any forwarded timestamp must have either come from an existing value in the range or from the leaseholder's clock. So if those didn't lead the clock, the derived timestamp wouldn't either. It also ignored the fact that the clock bump here was too late (post-latch release) and if it had actually been needed (it wasn't), it wouldn't have even ensured that the timestamp on any lease transfer led the maximum time of any response served by the outgoing leaseholder.

There are no mixed-version migration concerns with this change, because #80706 ensured that any future-time operation will continue to use the synthetic bit until all nodes are running v22.2 or later.

85350: insights: ingester r=matthewtodd a=matthewtodd

Closes #81021.
    
Here we begin observing statements and transactions asynchronously, to
avoid slowing down the hot sql execution path as much as possible.
    
Release note: None

85440: colmem: improve memory-limiting behavior of the accounting helpers r=yuzefovich a=yuzefovich

**colmem: introduce a helper method when no memory limit should be applied**

This commit is a pure mechanical change.

Release note: None

**colmem: move some logic of capacity-limiting into the accounting helper**

This commit moves the logic that was duplicated across each user of the
SetAccountingHelper into the helper itself. Clearly, this allows us to
de-duplicate some code, but it'll make it easier to refactor the code
which is done in the following commit.

Additionally, this commit makes a tiny change to make the resetting
behavior in the hash aggregator more precise.

Release note: None

**colmem: improve memory-limiting behavior of the accounting helpers**

This commit fixes an oversight in how we allocate batches of "dynamic"
capacity. We have two related ways of reallocating batches, and both
work by growing the capacity of the batch until the memory limit is
exceeded, after which the batch is reused until the end of the query
execution. This is a reasonable heuristic under the assumption that all
tuples in the data stream are roughly equal in size, but this might not
be the case.

In particular, consider an example where 10k small rows of 1KiB are
followed by 10k large rows of 1MiB. According to our heuristic, we
happily grow the batch to a capacity of 1024 and then never shrink it,
so once the large rows start appearing we put 1GiB worth of data into a
single batch, significantly exceeding our memory limit (usually 64MiB
with the default `workmem` setting).

This commit introduces a new heuristic as follows (sketched in code below):
- the first time a batch exceeds the memory limit, its capacity is memorized,
  and from now on that capacity will determine the upper bound on the
  capacities of the batches allocated through the helper;
- if at any point in time a batch exceeds the memory limit by at least a
  factor of two, then that batch is discarded, and the capacity will never
  exceed half of the capacity of the discarded batch;
- if the memory limit is not reached, then the behavior of the dynamic growth
  of the capacity provided by `Allocator.ResetMaybeReallocate` is still
  applicable (i.e. the capacities will grow exponentially until
  coldata.BatchSize()).
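Here is a minimal sketch of those three rules, with invented names rather than the real colmem.SetAccountingHelper logic; batchSize stands in for coldata.BatchSize():

```go
const batchSize = 1024 // stand-in for coldata.BatchSize()

type capacityLimiter struct {
	maxCapacity int // 0 until the memory limit is first exceeded
}

// next returns the capacity to use for the next batch and whether the
// current batch should be discarded rather than reused.
func (cl *capacityLimiter) next(curCapacity, batchMem, memLimit int) (capacity int, discard bool) {
	if batchMem >= 2*memLimit {
		// Rule 2: at least 2x over the limit. Discard the batch; the
		// ceiling becomes half of its capacity.
		cl.maxCapacity = curCapacity / 2
		return cl.maxCapacity, true
	}
	if batchMem > memLimit && cl.maxCapacity == 0 {
		// Rule 1: first time over the limit. Memorize the capacity as
		// the ceiling for all future batches.
		cl.maxCapacity = curCapacity
	}
	if cl.maxCapacity != 0 {
		if curCapacity > cl.maxCapacity {
			return cl.maxCapacity, false
		}
		return curCapacity, false
	}
	// Rule 3: under the limit, keep growing exponentially up to batchSize.
	capacity = 2 * curCapacity
	if capacity > batchSize {
		capacity = batchSize
	}
	return capacity, false
}
```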

Note that this heuristic does not have the ability to grow the maximum
capacity once it's been set, although it might make sense to do so (say,
if after shrinking the capacity, the next five times we see that the
batch is using less than half of the memory limit). This is a conscious
omission since I want this change to be backported, and never growing
seems like the safer choice. Thus, this improvement is left as a TODO.

Also, we still might create batches that are too large in memory
footprint in those places that don't use the SetAccountingHelper (e.g.
in the columnarizer) since we perform the memory limit check at the
batch granularity. However, this commit improves things there so that we
don't reuse such a batch and instead use half of its capacity on the
next iteration.

Fixes: #76464.

Release note (bug fix): CockroachDB now more precisely respects the
`distsql_workmem` setting which improves the stability of each node and
makes OOMs less likely.

**colmem: unexport Allocator.ResetMaybeReallocate**

This commit is a mechanical change to unexport
`Allocator.ResetMaybeReallocate` so that the users would be forced to use
the method with the same name from the helpers. This required splitting
off the tests into two files.

Release note: None

85492: backupccl: remap all restored tables r=dt a=dt

This PR has a few changes, broken down into separate commits:
a) stop restoring tmp tables and remove the special-case code to synthesize their special schemas; these were previously restored only to be dropped, so that restored jobs that referenced them would not be broken, but we have since stopped restoring jobs.
b) synthesize type-change jobs during cluster restore; this goes with not restoring jobs.
c) fix some assumptions in tests/other code about what IDs restored tables have.
d) finally, always assign new IDs to all restored objects, even during cluster restore, removing the need to carefully move conflicting tables or other things around.

Commit-by-commit review recommended.


85930: jobs: make expiration use intended txn priority r=ajwerner a=rafiss

In aed014f these operations were supposed to be changed to use
MinUserPriority. However, they weren't using the appropriate txn, so it
didn't have the intended effect.

Release note: None

Co-authored-by: Jordan Lewis <jordanthelewis@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Matthew Todd <todd@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Aug 12, 2022
85764: kv: pass explicit Now timestamp on BatchRequest, remove timestamp downcasting r=nvanbenschoten a=nvanbenschoten

This commit adds an explicit `ClockTimestamp` field called `Now` to the `BatchRequest` header, which mirrors the `Now` field on the `BatchResponse` header.

In doing so, it removes the last instance where we downcasted a `Timestamp` to a `ClockTimestamp` using the `TryToClockTimestamp` method. With this change, MVCC ("operation") timestamps never flow back into HLC clocks as clock signals. This was enabled by #80706 and sets the groundwork to remove synthetic timestamps in v23.1 — the role they played in dynamic typing of clock timestamps is now entirely fulfilled by statically typed ClockTimestamp channels.

This is an important step in separating out the MVCC timestamp domain from the clock timestamp domain and clarifying the roles of the two layers. In turn, this layering opens the door for CockroachDB to start thinking about dynamic clock synchronization error bounds.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 25, 2023
Informs cockroachdb#101938.

This commit removes logic in mvcc key decoding routines to decode
synthetic timestamps. We retain the ability to decode keys with the
synthetic timestamp bit set, but we simply ignore its presence.

As discussed in the previous commit, the role of these synthetic
timestamp markers was eliminated in cockroachdb#80706 by the local_timestamp field
in the mvcc value header, which was first present in v22.2. v23.2 does
not require compatibility with v22.2, so it can rely on the fact that
any txn that has a synthetic timestamp (because it writes in the future)
will also write local timestamps into each of its values.

Release note: None
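For illustration, a simplified sketch of a decoder that tolerates the bit but ignores it. The layout here (<wall:8>[<logical:4>][<synthetic:1>]) is a simplification and the function is a stand-in, not the real pkg/storage decoding:

```go
import "encoding/binary"

// decodeVersionTimestamp decodes a simplified
// <wall:8>[<logical:4>][<synthetic:1>] timestamp suffix. A trailing
// synthetic marker byte, if present, is accepted for compatibility but
// no longer surfaced to callers.
func decodeVersionTimestamp(buf []byte) (wall int64, logical int32) {
	wall = int64(binary.BigEndian.Uint64(buf[:8]))
	if len(buf) >= 12 {
		logical = int32(binary.BigEndian.Uint32(buf[8:12]))
	}
	return wall, logical
}
```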