
kv/storage: disable observed timestamps for async resolved intents that are pushed #72121

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Oct 28, 2021

Fixes #36431.
Fixes #49360.

This commit fixes the potential for a stale read as detailed in #36431 using the approach detailed in #36431 (comment). This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we introduced the multi-register test to the Jepsen test suite, and have since observed the test fail when combined with the strobe-skews nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.
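To make the scenario concrete, here is a minimal sketch (names invented for illustration; not CockroachDB's actual API) of how an observed timestamp lowers a reader's uncertainty limit and, in this case, wrongly rules out a pushed intent's committed value:

```go
package main

import "fmt"

// isUncertain reports whether a value at valueTS forces an uncertainty
// restart for a reader at readTS, given the reader's global uncertainty
// limit and an observed timestamp from the value's leaseholder.
func isUncertain(valueTS, readTS, globalLimit, observedTS int64) bool {
	localLimit := globalLimit
	if observedTS < localLimit {
		localLimit = observedTS // observed timestamps lower the limit
	}
	return valueTS > readTS && valueTS <= localLimit
}

func main() {
	// The leaseholder served the provisional intent write at ts=100, so
	// its clock (and any observed timestamp taken from it) may read as
	// low as 150. The intent later resolves to its commit timestamp, 200.
	const committedValueTS = 200

	// A reader starts AFTER the writer committed, at ts=120 with
	// max_offset=100, so its global uncertainty limit is 220.
	uncertain := isUncertain(committedValueTS, 120, 220, 150)

	// The observed timestamp (150) wrongly rules out the value at 200:
	// the reader skips it and serves a stale read.
	fmt.Println("uncertain:", uncertain) // uncertain: false
}
```

Here the reader began after the writer's commit was acknowledged, so real-time ordering requires it to observe the value at 200; the observed timestamp of 150 hides it instead.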

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back some utility by recognizing that only a small fraction[2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp that a committed value was originally written (its provisional value if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit does not take full advantage of this second observation, because we do not retain "written timestamps" in committed values (though we do in intents). Instead, we do something less optimal but cheaper and more convenient. Any intent whose timestamp is changed during asynchronous intent resolution is marked as "synthetic". Doing so is a compressed way of saying that the value could have originally been written as an intent at any time (even min_timestamp) and so observed timestamps cannot be used to limit uncertainty by pulling a read's uncertainty interval below the value's timestamp.

This application of the synthetic bit prevents observed timestamps from being used to avoid uncertainty restarts with the set of committed values whose timestamps do not reflect their original write time and therefore do not make a claim about the clock of their leaseholder at the time that they were committed. It does not change the interaction between observed timestamps and any other committed value.
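As a hedged sketch of the rule described above (invented names, not the real kvserver code): marking the pushed value's timestamp as synthetic disables the observed-timestamp fast path for that value, forcing an uncertainty restart instead of a stale read.

```go
package main

import "fmt"

// A timestamp flagged "synthetic" makes no claim about any leaseholder's
// clock, so observed timestamps may not be used to rule it out.
type Timestamp struct {
	WallTime  int64
	Synthetic bool
}

func isUncertain(val Timestamp, readTS, globalLimit, observedTS int64) bool {
	limit := globalLimit
	// Only non-synthetic values may be compared against the observed
	// timestamp; synthetic ones fall back to the global limit.
	if !val.Synthetic && observedTS < limit {
		limit = observedTS
	}
	return val.WallTime > readTS && val.WallTime <= limit
}

func main() {
	// A resolved intent whose timestamp was changed during async
	// resolution is marked synthetic by this commit.
	pushed := Timestamp{WallTime: 200, Synthetic: true}
	fmt.Println(isUncertain(pushed, 120, 220, 150)) // true: restart, no stale read
}
```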

Correctness testing

I was not able to stress jepsen/multi-register/strobe-skews hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called TestTxnReadWithinUncertaintyIntervalAfterIntentResolution which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

TODO(nvanbenschoten):

  • observe TPC-C performance
    -- top-line performance
    -- uncertainty retry rate
    -- commit-wait rate (should be zero)
  • compare YCSB performance

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

Footnotes

  1. see pkg/kv/kvserver/observedts/doc.go for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

  2. see analysis in https://github.com/cockroachdb/cockroach/issues/36431#issuecomment-714221846.

This commit adds a new test called TestTxnReadWithinUncertaintyInterval which
tests cases where a transaction observes a committed value below its global
uncertainty limit while performing a read. In one variant, the transaction does
not have an observed timestamp, so it hits a ReadWithinUncertaintyIntervalError.
In a second variant, it does have an observed timestamp, so it avoids the error.

This is very old functionality, but I could not find a functional test like this
in `pkg/kv/kvserver`, so I figured I'd add one. We have tests below this level
(in pkg/storage) and above this level (jepsen, I hope elsewhere as well), but I
couldn't find a similar test operating at this exact level.
This commit adds a small amount of client-side and server-side logic to
opportunistically strip the synthetic bit from a transaction timestamp when
doing so is permitted by the local HLC clock during transaction refreshes and
retries. This will prevent the use of synthetic timestamps in future commits
from leaking out beyond the first transaction that encounters a value with a
synthetic timestamp that is below the local HLC.
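The opportunistic stripping can be sketched as follows (the helper name is hypothetical; the real logic lives in the client-side and server-side refresh/retry paths):

```go
package main

import "fmt"

type Timestamp struct {
	WallTime  int64
	Synthetic bool
}

// tryStripSynthetic is a sketch of the opportunistic logic: once the
// local HLC clock has advanced past a synthetic timestamp, that
// timestamp is a valid clock reading again, so the synthetic flag can
// be dropped, preventing it from propagating to later transactions.
func tryStripSynthetic(ts Timestamp, hlcNow int64) Timestamp {
	if ts.Synthetic && ts.WallTime <= hlcNow {
		ts.Synthetic = false
	}
	return ts
}

func main() {
	ts := Timestamp{WallTime: 200, Synthetic: true}
	fmt.Println(tryStripSynthetic(ts, 150).Synthetic) // true: clock not caught up
	fmt.Println(tryStripSynthetic(ts, 250).Synthetic) // false: safe to strip
}
```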
Non-functional change in preparation for next commit.
This commit adds a new "written timestamp" field to the MVCCMetadata
struct. WrittenTimestamp is the timestamp that the intent was last
written at by its own transaction, in cases where the intent has been
pushed to a higher timestamp than this by a conflicting transaction. The
timestamp is originally empty before the intent has been pushed and is
reset to empty if the intent is rewritten by its own transaction with a
newer epoch or sequence number.

If set, the timestamp represents the highest time that the intent held
in response to an action taken by its own transaction before committing.
As a result, if the transaction does commit, this timestamp also serves
as a lower bound on the clock of the leaseholder that holds this intent
at the time that the transaction committed, because leaseholders forward
their clock to the timestamp of incoming writes. By extension, it also
serves as a lower bound on the clock of any leaseholder that holds the
intent at any time after the transaction committed, because the clock
across subsequent leaseholders for a range increases monotonically.

This field will be used by the next commit to help resolve cockroachdb#36431.
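A toy model of the semantics described in this commit message (field and method names are illustrative, not the actual MVCCMetadata proto):

```go
package main

import "fmt"

// Simplified stand-in for the described MVCCMetadata semantics.
type MVCCMetadata struct {
	Timestamp        int64 // current (possibly pushed) intent timestamp
	WrittenTimestamp int64 // last timestamp the txn itself wrote at; 0 = unset
}

// pushTo records a push by a conflicting transaction: the intent moves
// to a higher timestamp, but WrittenTimestamp remembers the highest
// timestamp the owning txn itself wrote the intent at.
func (m *MVCCMetadata) pushTo(ts int64) {
	if m.WrittenTimestamp == 0 {
		m.WrittenTimestamp = m.Timestamp
	}
	m.Timestamp = ts
}

// rewriteByOwner models the owning txn rewriting the intent with a
// newer epoch or sequence number, which resets WrittenTimestamp.
func (m *MVCCMetadata) rewriteByOwner(ts int64) {
	m.Timestamp = ts
	m.WrittenTimestamp = 0
}

func main() {
	m := &MVCCMetadata{Timestamp: 100}
	m.pushTo(200)
	fmt.Println(m.Timestamp, m.WrittenTimestamp) // 200 100
	m.rewriteByOwner(250)
	fmt.Println(m.Timestamp, m.WrittenTimestamp) // 250 0
}
```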

@nvanbenschoten
Member Author

nvanbenschoten commented Oct 29, 2021

Please hold off on reviewing this for the time being, at least in any detail. The process of drafting this PR has forced me to page this all back in, and it's made me question whether this use of synthetic timestamps (and their use with GLOBAL tables) is a good long-term design.

As the PR description mentions, synthetic timestamps are a useful tool to solve this problem. However, I'm worried that they are serving two orthogonal roles. Because they do so reasonably well, this is allowing us to avoid separating and being more explicit about each role, which makes them difficult to understand. It also leads to valid questions like #36431 (comment), which highlight the mismatch between whether the state we're adding here is conceptually a property of a committed value's timestamp (which should be identical across all committed values in its txn) or a property of a committed value itself (which can have other individual attributes that it need not share with other committed values in its txn).

The two roles that synthetic timestamps currently play in the system are:

  • a form of dynamic typing, which ensures that only real clock readings make their way back into other nodes' HLC clocks. They are accompanied by the ClockTimestamp type, which serves as the static typing counterpart.
  • a convenience to avoid the need to represent both a commit_timestamp (related to the visibility of the value) and a written_timestamp (related to the value's relationship with the local HLC clock) in committed MVCC versions. See the "remember when intents were written" option from #36431 (comment). Instead, the synthetic bit serves as a compressed way to represent written_timestamp = min_timestamp.

Before expanding the use of synthetic timestamps further, I'd like to take a shot at decomposing these roles and then seeing whether we can remove the concept entirely. I don't imagine this will be a small lift, but each step along the way will lead to more explicit code, which is likely the right choice in the long run. Here's how I see this playing out:

Replace role 1:

  1. eliminate casts from Timestamp -> ClockTimestamp (e.g. TryToClockTimestamp).
  2. add explicit ClockTimestamp channels in RPCs where necessary.
  3. add associated assertions that RPC operations have ts <= clock_timestamp, where necessary. We already have some of these.
  4. after step 2, we can stop mixing "data-plane" Timestamps with "control-plane" ClockTimestamps.
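The static-typing idea behind role 1 can be sketched as follows (simplified stand-ins for the real hlc.Timestamp, hlc.ClockTimestamp, and hlc.Clock types):

```go
package main

import "fmt"

// "Data-plane" timestamps may be synthetic; "control-plane" clock
// timestamps are real clock readings only. Keeping them as distinct
// types, with no cast from Timestamp back to ClockTimestamp, means
// arbitrary timestamps can never re-enter an HLC clock.
type Timestamp struct{ WallTime int64 }
type ClockTimestamp struct{ WallTime int64 }

type Clock struct{ wall int64 }

// Update forwards the clock from a received clock reading. Accepting
// only ClockTimestamp is the static-typing guarantee.
func (c *Clock) Update(ct ClockTimestamp) {
	if ct.WallTime > c.wall {
		c.wall = ct.WallTime
	}
}

func main() {
	c := &Clock{wall: 100}
	c.Update(ClockTimestamp{WallTime: 150})
	// c.Update(Timestamp{WallTime: 999}) // compile error: wrong type
	fmt.Println(c.wall) // 150
}
```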

Replace role 2:

  1. Add "extended" tag bit to roachpb.Value. Used to support additional, optional fixed-size fields in front of encoded data. This could be used in the future for additional metadata, like per-value TTL.
  2. Add WrittenTimestamp metadata field to roachpb.Value. This is similar to the 4th commit in this PR, but will allow us to store the WrittenTimestamp in both intents and committed values.
  3. Set WrittenTimestamp for intent/committed values in two places. First, do so when pushing intents during (pending or commit) intent resolution. This is similar to the 5th commit in this PR. Second, do so when writing intents/values into the future (i.e. write_timestamp > clock.Now()).
  4. Update uncertainty rules in pebbleMVCCScanner to throw a ReadWithinUncertaintyIntervalError when encountering an intent or value iff WrittenTimestamp < LocalUncertaintyLimit && Timestamp < GlobalUncertaintyLimit.

Notice that these last two steps will allow GLOBAL tables to work without synthetic timestamps.
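The proposed check in step 4 can be sketched as a predicate (illustrative names; not the actual pebbleMVCCScanner code):

```go
package main

import "fmt"

type version struct {
	Timestamp        int64 // commit (visibility) timestamp
	WrittenTimestamp int64 // last timestamp written at by its own txn
}

// isUncertain sketches the proposed rule: a version above the read
// timestamp is uncertain iff its WrittenTimestamp is below the local
// uncertainty limit and its Timestamp is below the global one.
func isUncertain(v version, readTS, localLimit, globalLimit int64) bool {
	if v.Timestamp <= readTS {
		return false // visible to the reader, not uncertain
	}
	return v.WrittenTimestamp < localLimit && v.Timestamp < globalLimit
}

func main() {
	// A pushed value: written at 100, committed at 200. For a reader at
	// 120 with local limit 150 and global limit 220, the value is
	// uncertain, because it may have been written before the reader began.
	v := version{Timestamp: 200, WrittenTimestamp: 100}
	fmt.Println(isUncertain(v, 120, 150, 220)) // true
}
```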

Bonus: I think a soft migration away from synthetic timestamps would be possible by performing a translation at the MVCC layer from a key-value with a synthetic timestamp in the key to a key-value without a synthetic timestamp in the key but with a WrittenTimestamp set to min_timestamp in the value.

@sumeerbhola
Collaborator

Add WrittenTimestamp metadata field to roachpb.Value.

Is putting this in the value inherent to the design? I am asking since occasionally the idea of separating keys and values in storage has come up (see cockroachdb/pebble#112) when storing rows that are very large. In such an implementation values would be lazily loaded, and it would be wasteful to load a value only to parse the timestamp part and realize that it is not needed. Is it possible to keep this optional written timestamp in the key?

@nvanbenschoten
Member Author

No, the WrittenTimestamp could be in the key or in the value. I don't have strong opinions about where it should live, other than that it feels slightly more natural to store additional metadata in the value so that it does not impact key ordering. Although I guess it would live in the key suffix and we could ignore it in the custom key comparator. Does one option seem simpler than the other to you?

Your point about storing the WrittenTimestamp in the key so that we can avoid fetching the value in some cases is interesting. Based on the uncertainty rules (when encountering an intent or value iff WrittenTimestamp < LocalUncertaintyLimit && Timestamp < GlobalUncertaintyLimit and when ReadTimestamp < Timestamp) and the fact that LocalUncertaintyLimit can be at most max_offset less than GlobalUncertaintyLimit, I don't expect that an avoidable value fetch in the WrittenTimestamp > LocalUncertaintyLimit && Timestamp < GlobalUncertaintyLimit case will make much of an impact in practice.

That said, when a key-value is uncertain, we don't actually need the value – we just return an error. So there is something nice about simplifying the condition for whether the value is needed or not to just ReadTimestamp >= Timestamp. But that's not to say that we would never complicate this condition later on. For instance, as discussed here and elsewhere, we may want to store per-KV TTLs at some point in the future. If we did, would we want to store those in the key as well?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 1, 2021
This commit adds a `doc.go` file to the `pkg/util/hlc` package that details the
properties and the uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by cockroachdb#72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in cockroachdb#58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a `Timestamp` back to a `ClockTimestamp` with
`TryToClockTimestamp()`, it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from section 4 of the SIGMOD 2020 paper.
@sumeerbhola
Collaborator

and the fact that LocalUncertaintyLimit can be at most max_offset less than GlobalUncertaintyLimit, I don't expect that an avoidable value fetch in the WrittenTimestamp > LocalUncertaintyLimit && Timestamp < GlobalUncertaintyLimit case will make much of an impact in practice.

My understanding is that our current max_offset is 500ms (I realize we want to reduce it), so

  • GlobalUncertaintyLimit is txn-start-time + max_offset so can potentially capture many transactions
  • common case: LocalUncertaintyLimit close to txn-start-time; most txns have WrittenTimestamp=Timestamp.
    So the common case will fall in that category above. What am I missing?

That said, when a key-value is uncertain, we don't actually need the value – we just return an error. So there is something nice about simplifying the condition for whether the value is needed or not to just ReadTimestamp >= Timestamp. But that's not to say that we would never complicate this condition later on. For instance, as discussed here and elsewhere, we may want to store per-KV TTLs at some point in the future. If we did, would we want to store those in the key as well?

The TTL question is interesting. I am assuming the fidelity required is 1min or lower, so 32 bits would be sufficient, and we could put it in the key. We would probably use TBI to optimize the scan for TTL enforcement, since once we switch TBI to block property collectors, it will likely be effective even for big L6 files that will have the expired data hopefully in a small number of blocks in the file (outlined in cockroachdb/pebble#1190 (comment)). So maybe we could leave it in the value.

btw, in addition to the complete key-value separation outlined in cockroachdb/pebble#112 there is also an intermediate approach that I've worked on in the past that keeps keys and values in different parts of the same file. Such a scheme is helpful when filters can be applied to the key -- this isn't quite as valuable with CockroachDB since (a) there are secondary indexes, and (b) some of the filters may be on columns in the value. But it may still be narrowly useful when values are large enough and many of them don't need reading after examining the timestamps in the key.

@nvanbenschoten
Member Author

So the common case will fall in that category above. What am I missing?

Well the common case is ReadTimestamp > Timestamp. The point I was trying to make (which wasn't very clear) was that the bounded interval between GlobalUncertaintyLimit and LocalUncertaintyLimit places an upper bound (assuming a fixed rate of new KV versions) on the number of values that a given transaction could avoid loading with this optimization.

Independent of uncertainty, it seems to me that the larger win with the lazy value loading in cockroachdb/pebble#112 wouldn't be avoiding loading values whose keys have a Timestamp > ReadTimestamp, but in avoiding the need to load values for KV versions that are shadowed by later KV versions during the iteration from one key to the next. In some sense, I think that means that lazily loaded values provide some of the same benefit as cockroachdb/pebble#1170.

Does one option seem simpler than the other to you?

Did you have an opinion about this?

@sumeerbhola
Collaborator

Does one option seem simpler than the other to you?
Did you have an opinion about this?

I would lean towards putting it in the key suffix as an optional part and adjusting the key comparator.

Independent of uncertainty, it seems to me that the larger win with the lazy value loading in cockroachdb/pebble#112 wouldn't be avoiding loading values whose keys have a Timestamp > ReadTimestamp, but in avoiding the need to load values for KV versions that are shadowed by later KV versions during the iteration from one key to the next. In some sense, I think that means that lazily loaded values provide some of the same benefit as cockroachdb/pebble#1170.

The simplest way of implementing such key-value separation in the same file ends up with the values in the same order as the corresponding keys (for preserving locality). With 32KB blocks, and N bytes per row, we'd need 32KB/N to be small enough that at least 1 whole block gets skipped before the seek optimization in pebbleMVCCScanner gets applied. I've wondered about the distribution of N in production workloads, in case you have any insights (and the distribution of key lengths).
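For a rough sense of the 32KB/N figure (plain arithmetic, not measured data):

```go
package main

import "fmt"

// rowsPerBlock computes roughly how many values fit in one block, i.e.
// how many consecutive versions must be skipped before skipping saves
// an entire block read.
func rowsPerBlock(blockBytes, rowBytes int) int {
	return blockBytes / rowBytes
}

func main() {
	for _, n := range []int{128, 1024, 8192} {
		fmt.Printf("N=%d: ~%d values per 32KB block\n", n, rowsPerBlock(32*1024, n))
	}
}
```

So with small rows (N=128, ~256 values per block) the optimization rarely skips a whole block, while large rows (N=8192, ~4 values per block) benefit quickly.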

In contrast, cockroachdb/pebble#1170 can be helpful even for small N, but is much harder of course. But you've raised a very interesting point. I am wondering whether all we need to do is separate keys and values in the same file, and change the ordering of the values such that we proceed from the latest version of a roachpb.Key to the latest version of the next roachpb.Key, so that locality is high for the latest version of values. Infrequently one of these latest versions will have been deleted (e.g. aborted transaction), but there is no correctness problem with that.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 10, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 11, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 12, 2021
craig bot pushed a commit that referenced this pull request Nov 12, 2021
72278: hlc: document properties and uses of Hybrid Logical Clocks r=nvanbenschoten a=nvanbenschoten

The godoc is currently hosted [here](http://34.75.249.248:8000/pkg/github.com/cockroachdb/cockroach/pkg/util/hlc/). I'll leave that up for a few days.


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 4, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.

NOTE: this commit does not yet handle mixed-version compatibility and is not
intended to merge for the v22.1 release.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamp than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establishes a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
-- top-line performance
-- uncertainty retry rate
-- commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit that referenced this pull request Apr 14, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 15, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key called the "local timestamp". The MVCC key's existing version
timestamp dictates the key's visibility to readers and is tied to the writer's
commit timestamp. The local clock timestamp records the value of the local HLC
clock on the leaseholder when the key was originally written. It is used to make
claims about the relative real time ordering of the key's writer and readers
when comparing a reader's uncertainty interval (and observed timestamps) to the
key. Ignoring edge cases, readers with an observed timestamp from the key's
leaseholder that is greater than the local clock timestamp stored in the key
cannot make claims about real time ordering and must consider it possible that
the key's write occurred before the read began. However, readers with an
observed timestamp from the key's leaseholder that is less than the clock
timestamp can claim that the reader captured that observed timestamp before the
key was written and therefore can consider the key's write to have been
concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to the
MVCC key encoding, the commit optimizes the clock timestamp away in the common
case where it leads the version timestamp. Only in the rare cases where the
local timestamp trails the version timestamp (e.g. future-time writes, async
intent resolution with a new commit timestamp) does the local timestamp need to
be explicitly represented in the key encoding. This is possible because it is
safe for the local clock timestamp to be rounded down, as this will simply lead
to additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and a
partial ordering between writer and reader transactions. As a result, the use
of observed timestamps during uncertainty checking becomes a comparison between
two `ClockTimestamps`: the version's local timestamp and the reader's observed
timestamp.
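The resulting check can be sketched as a single clock-timestamp comparison. `ClockTimestamp` and `isUncertain` below are illustrative stand-ins for the real `hlc`/uncertainty code, and edge cases (equal timestamps, missing observations) are ignored:

```go
package main

import "fmt"

// ClockTimestamp is a simplified stand-in for hlc.ClockTimestamp.
type ClockTimestamp struct{ WallTime int64 }

// Less reports whether t is ordered strictly before u.
func (t ClockTimestamp) Less(u ClockTimestamp) bool { return t.WallTime < u.WallTime }

// isUncertain sketches the observed-timestamp portion of uncertainty
// checking. If the value's local timestamp trails the reader's observed
// timestamp, the write may have preceded the read in real time, so the
// value remains uncertain. If the local timestamp leads the observed
// timestamp, the observation was captured before the value was written,
// so the write is considered concurrent and no restart is needed.
func isUncertain(valueLocalTS, readerObservedTS ClockTimestamp) bool {
	return valueLocalTS.Less(readerObservedTS)
}

func main() {
	obs := ClockTimestamp{WallTime: 100}
	fmt.Println(isUncertain(ClockTimestamp{WallTime: 150}, obs)) // local leads observed: concurrent
	fmt.Println(isUncertain(ClockTimestamp{WallTime: 50}, obs))  // local trails observed: uncertain
}
```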

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, it
prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- observe TPC-C performance
  - top-line performance
  - uncertainty retry rate
  - commit-wait rate (should be zero)
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Apr 28, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamp than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp at which a committed value was originally written (its provisional
timestamp if it was once an intent) instead of the timestamp that it had been
moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an
"extended" encoding scheme, depending on whether the value's header is empty or
not. If the value's header is empty, it is omitted in the encoding and the
MVCCValue's encoding is identical to that of roachpb.Value. This provides
backwards compatibility and ensures that the MVCCValue optimizes away in the
common case. If the value's header is not empty, it is prepended to the
roachpb.Value encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that the encoding and decoding routines for the
"simple" encoding are fast: they avoid heap allocations, memory copies, and
function calls by exploiting mid-stack inlining.
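A decoder for the two variants can be sketched as follows. The sentinel value and the convention that the 4-byte header length counts the sentinel plus the mvcc-header are assumptions of this sketch, not the exact storage-package constants:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extendedEncodingSentinel stands in for the sentinel tag byte; the real
// constant lives in the storage package and is simply a byte that the
// roachpb.Value tag namespace never uses. Illustrative only.
const extendedEncodingSentinel byte = 0xe5

// decodeMVCCValue sketches the two-variant scheme described above. It
// returns the mvcc-header bytes (nil for the simple encoding) and the
// roachpb.Value-encoded payload.
func decodeMVCCValue(buf []byte) (header, value []byte, err error) {
	if len(buf) < 5 {
		return nil, nil, fmt.Errorf("buffer too short: %d", len(buf))
	}
	// The 5th byte is either a roachpb.Value tag (simple encoding) or
	// the sentinel (extended encoding).
	if buf[4] != extendedEncodingSentinel {
		return nil, buf, nil // simple: the whole buffer is the value
	}
	// Extended: the first 4 bytes hold the header length which, in this
	// sketch, counts the sentinel byte plus the mvcc-header behind it.
	hdrLen := binary.BigEndian.Uint32(buf[:4])
	if int(4+hdrLen) > len(buf) {
		return nil, nil, fmt.Errorf("malformed header length %d", hdrLen)
	}
	return buf[5 : 4+hdrLen], buf[4+hdrLen:], nil
}

func main() {
	// Simple: looks like a bare roachpb.Value encoding
	// (<4-byte-checksum><1-byte-tag><encoded-data>).
	simple := []byte{0, 0, 0, 0, 0x03, 'h', 'i'}
	_, v, _ := decodeMVCCValue(simple)
	fmt.Println(len(v))

	// Extended: 4-byte header-len (sentinel + 2 header bytes = 3),
	// sentinel, 2 header bytes, then the same simple payload.
	ext := append([]byte{0, 0, 0, 3, extendedEncodingSentinel, 0xaa, 0xbb}, simple...)
	h, v2, _ := decodeMVCCValue(ext)
	fmt.Println(len(h), len(v2))
}
```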

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and a
partial ordering between writer and reader transactions. As a result, the use
of observed timestamps during uncertainty checking becomes a comparison between
two `ClockTimestamps`: the version's local timestamp and the reader's observed
timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, it
prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

\### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 4, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
Fixes cockroachdb#36431.
Fixes cockroachdb#49360.
Replaces cockroachdb#72121.
Replaces cockroachdb#77342.

This commit fixes the potential for a stale read as detailed in cockroachdb#36431 using the
"remember when intents were written" approach described in
cockroachdb#36431 (comment) and
later expanded on in
cockroachdb#72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split
across ranges whose leaseholders are stored on different nodes, a transaction
read refresh, and the use of observed timestamps to avoid an uncertainty
restart. With the combination of these four factors, it was possible to
construct an ordering of events that violated real-time ordering and allowed a
transaction to observe a stale read. Upon the discovery of the bug, we
[introduced](cockroachdb/jepsen#19) the `multi-register`
test to the Jepsen test suite, and have since observed the test fail when
combined with the `strobe-skews` nemesis due to this bug in cockroachdb#49360 (and a few
issues linked to that one). This commit stabilizes that test.

\### Explanation

The combination of all of the factors listed above can lead to the stale read
because it breaks one of the invariants that the observed timestamp
infrastructure[^1] relied upon for correctness. Specifically, observed
timestamps relied on the guarantee that a leaseholder's clock must always be
equal to or greater than the version timestamp of all writes that it has served.
However, this guarantee did not always hold. It does hold for non-transactional
writes. It also holds for transactions that perform all of their intent writes
at the same timestamp and then commit at this timestamp. However, it does not
hold for transactions which move their commit timestamp forward over their
lifetime before committing, writing intents at different timestamps along the
way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it
means for a leaseholder to "serve a write at a timestamp". The meaning of this
phrase is straightforward for non-transactional writes. However, for an intent
write whose original timestamp is provisional and whose eventual commit
timestamp is stored indirectly in its transaction record at its time of commit,
the meaning is less clear. This reconciliation to move the intent write's
timestamp up to its transaction's commit timestamp is asynchronous from the
transaction commit (and after it has been externally acknowledged). So even if a
leaseholder has only served writes with provisional timestamps up to timestamp
100 (placing a lower bound on its clock of 100), it can be in possession of
intents that, when resolved, will carry a timestamp of 200. To uphold the
real-time ordering property, this value must be observed by any transaction that
begins after the value's transaction committed and was acknowledged. So for
observed timestamps to be correct as currently written, we would need a
guarantee that this value's leaseholder would never return an observed timestamp
< 200 at any point after the transaction commits. But with the transaction
commit possibly occurring on another node and with communication to resolve the
intent occurring asynchronously, this seems like an impossible guarantee to
make.

This would appear to undermine observed timestamps to the point where they
cannot be used. However, we can claw back correctness without sacrificing
performance by recognizing that only a small fraction[^2] of transactions commit
at a different timestamps than the one they used while writing intents. We can
also recognize that if we were to compare observed timestamps against the
timestamp that a committed value was originally written (its provisional value
if it was once an intent) instead of the timestamp that it had been moved to on
commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to
each MVCC key-value version called the "local timestamp". The existing version
timestamp dictates the key-value's visibility to readers and is tied to the
writer's commit timestamp. The local clock timestamp records the value of the
local HLC clock on the leaseholder when the key was originally written. It is
used to make claims about the relative real time ordering of the key's writer
and readers when comparing a reader's uncertainty interval (and observed
timestamps) to the key. Ignoring edge cases, readers with an observed timestamp
from the key's leaseholder that is greater than the local clock timestamp stored
in the key cannot make claims about real time ordering and must consider it
possible that the key's write occurred before the read began. However, readers
with an observed timestamp from the key's leaseholder that is less than the
clock timestamp can claim that the reader captured that observed timestamp
before the key was written and therefore can consider the key's write to have
been concurrent with the read. In doing so, the reader can avoid an uncertainty
restart.

For more, see the updates made in this commit to pkg/kv/kvserver/observedts/doc.go.

To avoid the bulk of the performance hit from adding this new timestamp to each
key-value pair, the commit optimizes the clock timestamp away in the common case
where it leads the version timestamp. Only in the rare cases where the local
timestamp trails the version timestamp (e.g. future-time writes, async intent
resolution with a new commit timestamp) does the local timestamp need to be
explicitly represented in the key encoding. This is possible because it is safe
for the local clock timestamp to be rounded down, as this will simply lead to
additional uncertainty restarts. However, it is not safe for the local clock
timestamp to be rounded up, as this could lead to stale reads.

\### MVCCValue

To store the local timestamp, the commit introduces a new MVCCValue type to
parallel the MVCCKey type. MVCCValue wraps a roachpb.Value and extends it with
MVCC-level metadata which is stored in an enginepb.MVCCValueHeader struct. To
this point, the MVCC layer has treated versioned values as opaque blobs of bytes
and has not enforced any structure on them. Now that MVCC will use the value to
store metadata, it needs to enforce more structure on the values provided to it.
This is the cause of some testing churn, but is otherwise not a problem, as all
production code paths were already passing values in the roachpb.Value encoding.

To further avoid any performance hit, MVCCValue has a "simple" and an "extended"
encoding scheme, depending on whether the value's header is empty or not. If the
value's header is empty, it is omitted in the encoding and the mvcc value's
encoding is identical to that of roachpb.Value. This provided backwards
compatibility and ensures that the MVCCValue optimizes away in the common case.
If the value's header is not empty, it is prepended to the roachpb.Value
encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which
is either the roachpb.Value tag (which has many possible values) or a sentinel
tag not used by the roachpb.Value encoding which indicates the extended
encoding scheme.

Care was taken to ensure that encoding and decoding routines for the "simple"
encoding are fast by avoiding heap allocations, memory copies, or function calls
by exploiting mid-stack inlining.

\### Future improvements

As noted in cockroachdb#72121 (comment),
this commit paves a path towards the complete removal of synthetic timestamps,
which were originally introduced in support of non-blocking transactions and
GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps`
is no longer necessary now that we never need to "push" transaction-domain
timestamps into HLC clocks. Instead, the invariant that underpins observed
timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC
clock.

The synthetic bit's second role of disabling observed timestamps is replaced by
the generalization provided by "local timestamps". Local timestamps precisely
track when an MVCC version was written in the leaseholder's clock timestamp
domain. This establishes a total ordering across clock observations (local
timestamp assignment for writers and observed timestamps for readers) and
establish a partial ordering between writer and reader transactions. As a
result, the use of observed timestamps during uncertainty checking becomes a
comparison between two `ClockTimestamps`, the version's local timestamp and the
reader's observed timestamp.

\### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to
cause it to fail, even on master. We've only seen the test fail a handful of
times over the past few years, so this isn't much of a surprise. Still, this
prevents us from saying anything concrete about an reduced failure rate.

However, the commit does add a new test called
`TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls
manual clocks directly and was able to deterministically reproduce the stale
read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under
some workloads.

TODO(nvanbenschoten):
- microbenchmarks
- single-process benchmarks
- compare YCSB performance

----

Release note (bug fix): fixed a rare race condition that could allow for a
transaction to serve a stale read and violate real-time ordering under moderate
clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go)
for an explanation of the role of observed timestamps in the transaction model.
This commit updates that documentation to include this fix.

[^2]: see analysis in cockroachdb#36431 (comment).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request May 9, 2022
craig bot pushed a commit that referenced this pull request May 9, 2022
80706: kv/storage: introduce local timestamps for MVCC versions in MVCCValue r=sumeerbhola a=nvanbenschoten

Fixes #36431.
Fixes #49360.
Replaces #72121.
Replaces #77342.

**NOTE: this is an alternative to #77342 that stores the new LocalTimestamp in a key-value's value instead of its key.**

This commit fixes the potential for a stale read as detailed in #36431 using the "remember when intents were written" approach described in #36431 (comment) and later expanded on in #72121 (comment).

This bug requires a combination of skewed clocks, multi-key transactions split across ranges whose leaseholders are stored on different nodes, a transaction read refresh, and the use of observed timestamps to avoid an uncertainty restart. With the combination of these four factors, it was possible to construct an ordering of events that violated real-time ordering and allowed a transaction to observe a stale read. Upon the discovery of the bug, we [introduced](cockroachdb/jepsen#19) the `multi-register` test to the Jepsen test suite, and have since observed the test fail when combined with the `strobe-skews` nemesis due to this bug in #49360 (and a few issues linked to that one). This commit stabilizes that test.

### Explanation

The combination of all of the factors listed above can lead to the stale read because it breaks one of the invariants that the observed timestamp infrastructure[^1] relied upon for correctness. Specifically, observed timestamps relied on the guarantee that a leaseholder's clock must always be equal to or greater than the version timestamp of all writes that it has served. However, this guarantee did not always hold. It does hold for non-transactional writes. It also holds for transactions that perform all of their intent writes at the same timestamp and then commit at this timestamp. However, it does not hold for transactions which move their commit timestamp forward over their lifetime before committing, writing intents at different timestamps along the way and "pulling them up" to the commit timestamp after committing.

In violating the invariant, this third case reveals an ambiguity in what it means for a leaseholder to "serve a write at a timestamp". The meaning of this phrase is straightforward for non-transactional writes. However, for an intent write whose original timestamp is provisional and whose eventual commit timestamp is stored indirectly in its transaction record at its time of commit, the meaning is less clear. This reconciliation to move the intent write's timestamp up to its transaction's commit timestamp is asynchronous from the transaction commit (and after it has been externally acknowledged). So even if a leaseholder has only served writes with provisional timestamps up to timestamp 100 (placing a lower bound on its clock of 100), it can be in possession of intents that, when resolved, will carry a timestamp of 200. To uphold the real-time ordering property, this value must be observed by any transaction that begins after the value's transaction committed and was acknowledged. So for observed timestamps to be correct as currently written, we would need a guarantee that this value's leaseholder would never return an observed timestamp < 200 at any point after the transaction commits. But with the transaction commit possibly occurring on another node and with communication to resolve the intent occurring asynchronously, this seems like an impossible guarantee to make.

This would appear to undermine observed timestamps to the point where they cannot be used. However, we can claw back correctness without sacrificing performance by recognizing that only a small fraction[^2] of transactions commit at a different timestamp than the one they used while writing intents. We can also recognize that if we were to compare observed timestamps against the timestamp at which a committed value was originally written (its provisional timestamp if it was once an intent) instead of the timestamp that it had been moved to on commit, then the invariant would hold.

This commit exploits this second observation by adding a second timestamp to each MVCC key-value version called the "local timestamp". The existing version timestamp dictates the key-value's visibility to readers and is tied to the writer's commit timestamp. The local clock timestamp records the value of the local HLC clock on the leaseholder when the key was originally written. It is used to make claims about the relative real time ordering of the key's writer and readers when comparing a reader's uncertainty interval (and observed timestamps) to the key. Ignoring edge cases, readers with an observed timestamp from the key's leaseholder that is greater than the local clock timestamp stored in the key cannot make claims about real time ordering and must consider it possible that the key's write occurred before the read began. However, readers with an observed timestamp from the key's leaseholder that is less than the clock timestamp can claim that the reader captured that observed timestamp before the key was written and therefore can consider the key's write to have been concurrent with the read. In doing so, the reader can avoid an uncertainty restart.
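The comparison described above can be sketched in a few lines of Go. This is a simplified illustration, not the production code path: `ClockTimestamp` here is a minimal stand-in for CockroachDB's `hlc.ClockTimestamp`, and the function assumes its preconditions have already been checked (the version's timestamp falls inside the reader's uncertainty window, and the reader holds an observed timestamp from the key's leaseholder).

```go
package main

import "fmt"

// ClockTimestamp is a simplified stand-in for CockroachDB's
// hlc.ClockTimestamp (pkg/util/hlc); it supports only ordering.
type ClockTimestamp struct {
	WallTime int64 // nanoseconds
	Logical  int32
}

// Less reports whether t orders strictly before o.
func (t ClockTimestamp) Less(o ClockTimestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// writeIsUncertain sketches the uncertainty comparison: a version already
// known to lie in the reader's uncertainty window is uncertain unless the
// reader's observed timestamp from the key's leaseholder proves that the
// read captured the leaseholder's clock before the value was written.
func writeIsUncertain(versionLocalTs, readerObservedTs ClockTimestamp) bool {
	// observed < local: the observed timestamp was captured before the
	// value was written, so the write was concurrent with the read and
	// the uncertainty restart can be avoided.
	if readerObservedTs.Less(versionLocalTs) {
		return false
	}
	// observed >= local: no real-time-ordering claim can be made; the
	// write may have been acknowledged before the read began.
	return true
}

func main() {
	early := ClockTimestamp{WallTime: 100}
	late := ClockTimestamp{WallTime: 200}
	fmt.Println(writeIsUncertain(late, early)) // false: read provably began first
	fmt.Println(writeIsUncertain(early, late)) // true: must consider a restart
}
```

Note that ties go toward uncertainty: when the observed timestamp equals the local timestamp, the reader cannot rule out that the write preceded the read, so the safe answer is a restart.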

For more, see the updates made in this commit to `pkg/kv/kvserver/observedts/doc.go`.

To avoid the bulk of the performance hit from adding this new timestamp to each key-value pair, the commit optimizes the local timestamp away in the common case where it leads the version timestamp. Only in the rare cases where the local timestamp trails the version timestamp (e.g. future-time writes, async intent resolution with a new commit timestamp) does the local timestamp need to be explicitly represented in the value encoding. This is possible because it is safe for the local clock timestamp to be rounded down, as this will simply lead to additional uncertainty restarts. However, it is not safe for the local clock timestamp to be rounded up, as this could lead to stale reads.
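The elision rule can be sketched as a pair of helpers. This is an illustrative model under the assumptions stated in the comments, not the production encoder; `Timestamp` is a simplified stand-in for `hlc.Timestamp`.

```go
package main

import "fmt"

// Timestamp is a simplified stand-in for an MVCC version timestamp
// (CockroachDB's hlc.Timestamp); it supports only ordering.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// localTsToEncode models the elision rule: when the local timestamp leads
// (or equals) the version timestamp, it is omitted from the encoding and
// the decoder substitutes the version timestamp. This rounds the local
// timestamp down, which is safe: rounding down can only cause spurious
// uncertainty restarts, never stale reads.
func localTsToEncode(versionTs, localTs Timestamp) (stored Timestamp, present bool) {
	if localTs.Less(versionTs) {
		// Rare case (e.g. async intent resolution moved the version
		// timestamp above the clock reading taken at write time): the
		// local timestamp must be stored explicitly.
		return localTs, true
	}
	return Timestamp{}, false // common case: elide
}

// localTsFromDecode recovers the effective local timestamp on the read path.
func localTsFromDecode(versionTs, stored Timestamp, present bool) Timestamp {
	if present {
		return stored
	}
	return versionTs
}

func main() {
	// Common case: local clock led the version timestamp; nothing stored.
	v := Timestamp{WallTime: 100}
	stored, present := localTsToEncode(v, Timestamp{WallTime: 150})
	fmt.Println(present, localTsFromDecode(v, stored, present).WallTime) // false 100
	// Rare case: version timestamp moved above the write-time clock.
	v = Timestamp{WallTime: 200}
	stored, present = localTsToEncode(v, Timestamp{WallTime: 100})
	fmt.Println(present, localTsFromDecode(v, stored, present).WallTime) // true 100
}
```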

### MVCCValue

```go
type MVCCValue struct {
    MVCCValueHeader
    Value roachpb.Value
}
```

To store the local timestamp, the commit introduces a new `MVCCValue` type to parallel the `MVCCKey` type. `MVCCValue` wraps a `roachpb.Value` and extends it with MVCC-level metadata which is stored in an `enginepb.MVCCValueHeader` protobuf struct. To this point, the MVCC layer has treated versioned values as opaque blobs of bytes and has not enforced any structure on them. Now that MVCC will use the value to store metadata, it needs to enforce more structure on the values provided to it. This is the cause of some testing churn, but is otherwise not a problem, as all production code paths were already passing values in the `roachpb.Value` encoding.

To further avoid any performance hit, `MVCCValue` has a "simple" and an "extended" encoding scheme, depending on whether the value's header is empty or not. If the value's header is empty, it is omitted in the encoding and the mvcc value's encoding is identical to that of `roachpb.Value`. This provides backwards compatibility and ensures that the `MVCCValue` optimizes away in the common case. If the value's header is not empty, it is prepended to the `roachpb.Value` encoding. The encoding scheme's variants are:

```
Simple (identical to the roachpb.Value encoding):

  <4-byte-checksum><1-byte-tag><encoded-data>

Extended (header prepended to roachpb.Value encoding):

  <4-byte-header-len><1-byte-sentinel><mvcc-header><4-byte-checksum><1-byte-tag><encoded-data>
```

The two encoding scheme variants are distinguished using the 5th byte, which is either the `roachpb.Value` tag (which has many possible values) or a sentinel tag not used by the `roachpb.Value` encoding which indicates the extended encoding scheme.

Care was taken to ensure that the encoding and decoding routines for the "simple" encoding are fast: they avoid heap allocations, memory copies, and function calls by exploiting mid-stack inlining. See microbenchmarks below.

### Future improvements

As noted in #72121 (comment), this commit paves a path towards the complete removal of synthetic timestamps, which were originally introduced in support of non-blocking transactions and GLOBAL tables.

The synthetic bit's first role of providing dynamic typing for `ClockTimestamps` is no longer necessary now that we never need to "push" transaction-domain timestamps into HLC clocks. Instead, the invariant that underpins observed timestamps is enforced by "pulling" local timestamps from the leaseholder's HLC clock.

The synthetic bit's second role of disabling observed timestamps is replaced by the generalization provided by "local timestamps". Local timestamps precisely track when an MVCC version was written in the leaseholder's clock timestamp domain. This establishes a total ordering across clock observations (local timestamp assignment for writers and observed timestamps for readers) and establishes a partial ordering between writer and reader transactions. As a result, the use of observed timestamps during uncertainty checking becomes a comparison between two `ClockTimestamps`, the version's local timestamp and the reader's observed timestamp.

### Correctness testing

I was not able to stress `jepsen/multi-register/strobe-skews` hard enough to cause it to fail, even on master. We've only seen the test fail a handful of times over the past few years, so this isn't much of a surprise. Still, this prevents us from saying anything concrete about a reduced failure rate.

However, the commit does add a new test called `TestTxnReadWithinUncertaintyIntervalAfterIntentResolution` which controls manual clocks directly and was able to deterministically reproduce the stale read before this fix in a few different ways. After this fix, the test passes.

### Performance analysis

This correctness fix will lead to an increased rate of transaction retries under some workloads.

<details>
<summary><b>MVCCValue Encoding and Decoding Microbenchmarks</b></summary>
  
```
name                                                                   time/op
EncodeMVCCValue/header=empty/value=tombstone-10                        3.11ns ± 0%
EncodeMVCCValue/header=empty/value=short-10                            3.11ns ± 0%
EncodeMVCCValue/header=empty/value=long-10                             3.10ns ± 0%
EncodeMVCCValue/header=local_walltime/value=tombstone-10               38.9ns ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                   42.1ns ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                     533ns ± 3%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10       40.5ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10           42.9ns ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10             541ns ± 4%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10           7.81ns ± 1%
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10            0.93ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=false-10               8.39ns ± 0%
DecodeMVCCValue/header=empty/value=short/inline=true-10                1.55ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=false-10                8.38ns ± 0%
DecodeMVCCValue/header=empty/value=long/inline=true-10                 1.55ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10       32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10        22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10  32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10   22.7ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10      32.2ns ± 0%
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10       22.7ns ± 0%

name                                                                   alloc/op
EncodeMVCCValue/header=empty/value=tombstone-10                         0.00B
EncodeMVCCValue/header=empty/value=short-10                             0.00B
EncodeMVCCValue/header=empty/value=long-10                              0.00B
EncodeMVCCValue/header=local_walltime/value=tombstone-10                24.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                    32.0B ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                    4.86kB ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10        24.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10            32.0B ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10            4.86kB ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10            0.00B
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10             0.00B
DecodeMVCCValue/header=empty/value=short/inline=false-10                0.00B
DecodeMVCCValue/header=empty/value=short/inline=true-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=false-10                 0.00B
DecodeMVCCValue/header=empty/value=long/inline=true-10                  0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10        0.00B
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10         0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10   0.00B
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10    0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10       0.00B
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10        0.00B

name                                                                   allocs/op
EncodeMVCCValue/header=empty/value=tombstone-10                          0.00
EncodeMVCCValue/header=empty/value=short-10                              0.00
EncodeMVCCValue/header=empty/value=long-10                               0.00
EncodeMVCCValue/header=local_walltime/value=tombstone-10                 1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=short-10                     1.00 ± 0%
EncodeMVCCValue/header=local_walltime/value=long-10                      1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=tombstone-10         1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=short-10             1.00 ± 0%
EncodeMVCCValue/header=local_walltime+logical/value=long-10              1.00 ± 0%
DecodeMVCCValue/header=empty/value=tombstone/inline=false-10             0.00
DecodeMVCCValue/header=empty/value=tombstone/inline=true-10              0.00
DecodeMVCCValue/header=empty/value=short/inline=false-10                 0.00
DecodeMVCCValue/header=empty/value=short/inline=true-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=false-10                  0.00
DecodeMVCCValue/header=empty/value=long/inline=true-10                   0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=false-10         0.00
DecodeMVCCValue/header=local_walltime/value=long/inline=true-10          0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=false-10    0.00
DecodeMVCCValue/header=local_walltime/value=tombstone/inline=true-10     0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=false-10        0.00
DecodeMVCCValue/header=local_walltime/value=short/inline=true-10         0.00
```
</details>

<details>
<summary><b>pkg/sql/tests End-To-End Microbenchmarks</b></summary>
  
```
name                                old time/op    new time/op    delta
KV/Delete/Native/rows=1-30             106µs ± 2%     104µs ± 2%  -1.38%  (p=0.012 n=8+10)
KV/Insert/SQL/rows=100-30             1.26ms ± 2%    1.24ms ± 2%  -1.25%  (p=0.029 n=10+10)
KV/Scan/SQL/rows=1-30                  281µs ± 2%     277µs ± 2%  -1.17%  (p=0.009 n=10+10)
KV/Insert/Native/rows=1-30             107µs ± 2%     107µs ± 2%    ~     (p=0.353 n=10+10)
KV/Insert/Native/rows=10-30            156µs ± 2%     155µs ± 4%    ~     (p=0.481 n=10+10)
KV/Insert/Native/rows=100-30           570µs ± 1%     576µs ± 2%    ~     (p=0.075 n=10+10)
KV/Insert/Native/rows=1000-30         4.18ms ± 2%    4.23ms ± 2%    ~     (p=0.143 n=10+10)
KV/Insert/Native/rows=10000-30        43.9ms ± 2%    44.1ms ± 2%    ~     (p=0.393 n=10+10)
KV/Insert/SQL/rows=1-30                412µs ± 2%     410µs ± 2%    ~     (p=0.280 n=10+10)
KV/Insert/SQL/rows=10-30               511µs ± 2%     508µs ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=1000-30            8.18ms ± 1%    8.11ms ± 2%    ~     (p=0.165 n=10+10)
KV/Insert/SQL/rows=10000-30           85.2ms ± 2%    84.7ms ± 2%    ~     (p=0.481 n=10+10)
KV/Update/Native/rows=1-30             163µs ± 2%     162µs ± 2%    ~     (p=0.436 n=10+10)
KV/Update/Native/rows=10-30            365µs ± 1%     365µs ± 1%    ~     (p=0.739 n=10+10)
KV/Update/Native/rows=10000-30         143ms ± 1%     144ms ± 2%    ~     (p=0.075 n=10+10)
KV/Update/SQL/rows=1-30                525µs ± 2%     522µs ± 3%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=10-30               815µs ± 1%     810µs ± 1%    ~     (p=0.190 n=10+10)
KV/Update/SQL/rows=100-30             2.47ms ± 1%    2.48ms ± 1%    ~     (p=0.356 n=9+10)
KV/Update/SQL/rows=1000-30            16.6ms ± 1%    16.7ms ± 1%    ~     (p=0.065 n=9+10)
KV/Update/SQL/rows=10000-30            193ms ± 2%     195ms ± 3%    ~     (p=0.315 n=10+10)
KV/Delete/Native/rows=10-30            173µs ± 2%     171µs ± 3%    ~     (p=0.247 n=10+10)
KV/Delete/Native/rows=100-30           677µs ± 1%     674µs ± 0%    ~     (p=0.475 n=10+7)
KV/Delete/Native/rows=1000-30         5.20ms ± 2%    5.19ms ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=10000-30        53.7ms ± 1%    53.9ms ± 1%    ~     (p=0.684 n=10+10)
KV/Delete/SQL/rows=1-30                377µs ± 3%     375µs ± 1%    ~     (p=0.305 n=10+9)
KV/Delete/SQL/rows=10-30               503µs ± 1%     500µs ± 2%    ~     (p=0.123 n=10+10)
KV/Delete/SQL/rows=100-30             1.52ms ± 4%    1.51ms ± 2%    ~     (p=0.529 n=10+10)
KV/Delete/SQL/rows=1000-30            2.53ms ± 3%    2.53ms ± 3%    ~     (p=1.000 n=10+10)
KV/Delete/SQL/rows=10000-30           21.9ms ± 1%    21.8ms ± 1%    ~     (p=0.315 n=9+10)
KV/Scan/Native/rows=1-30              29.6µs ± 2%    29.8µs ± 1%    ~     (p=0.143 n=10+10)
KV/Scan/Native/rows=100-30            54.6µs ± 1%    55.0µs ± 2%    ~     (p=0.052 n=10+10)
KV/Scan/SQL/rows=10-30                 292µs ± 1%     290µs ± 1%    ~     (p=0.190 n=10+10)
KV/Scan/SQL/rows=100-30                364µs ± 2%     363µs ± 1%    ~     (p=0.684 n=10+10)
KV/Scan/SQL/rows=10000-30             5.34ms ± 2%    5.32ms ± 1%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30    2.59ms ± 1%    2.59ms ± 2%    ~     (p=0.842 n=9+10)
KVAndStorageUsingSQL/versions=4-30    2.57ms ± 3%    2.56ms ± 2%    ~     (p=0.684 n=10+10)
KVAndStorageUsingSQL/versions=8-30    2.60ms ± 3%    2.59ms ± 2%    ~     (p=0.315 n=10+10)
KV/Update/Native/rows=100-30          2.02ms ± 1%    2.04ms ± 1%  +0.95%  (p=0.015 n=10+10)
KV/Update/Native/rows=1000-30         16.2ms ± 2%    16.5ms ± 2%  +1.30%  (p=0.001 n=10+9)
KV/Scan/Native/rows=10-30             32.6µs ± 2%    33.0µs ± 3%  +1.39%  (p=0.019 n=10+10)
KV/Scan/Native/rows=1000-30            266µs ± 1%     270µs ± 1%  +1.51%  (p=0.000 n=9+10)
KV/Scan/SQL/rows=1000-30               982µs ± 1%     997µs ± 1%  +1.60%  (p=0.000 n=10+10)
KV/Scan/Native/rows=10000-30          2.18ms ± 1%    2.23ms ± 1%  +2.55%  (p=0.000 n=10+10)

name                                old alloc/op   new alloc/op   delta
KV/Insert/Native/rows=1-30            15.6kB ± 0%    15.6kB ± 0%    ~     (p=0.631 n=10+10)
KV/Insert/Native/rows=10000-30        34.7MB ± 2%    35.0MB ± 2%    ~     (p=0.089 n=10+10)
KV/Insert/SQL/rows=1-30               41.3kB ± 1%    41.4kB ± 1%    ~     (p=0.353 n=10+10)
KV/Insert/SQL/rows=10-30              90.8kB ± 1%    90.8kB ± 0%    ~     (p=0.945 n=10+8)
KV/Insert/SQL/rows=1000-30            5.69MB ± 1%    5.71MB ± 1%    ~     (p=0.436 n=10+10)
KV/Insert/SQL/rows=10000-30           69.6MB ± 1%    69.6MB ± 1%    ~     (p=0.853 n=10+10)
KV/Update/Native/rows=1-30            21.4kB ± 1%    21.4kB ± 0%    ~     (p=0.915 n=9+9)
KV/Update/Native/rows=10000-30        67.8MB ± 1%    68.1MB ± 2%    ~     (p=0.315 n=10+10)
KV/Update/SQL/rows=1-30               47.2kB ± 1%    47.3kB ± 1%    ~     (p=0.280 n=10+10)
KV/Update/SQL/rows=10-30               116kB ± 1%     117kB ± 1%    ~     (p=0.353 n=10+10)
KV/Update/SQL/rows=1000-30            5.20MB ± 2%    5.18MB ± 1%    ~     (p=0.278 n=10+9)
KV/Delete/Native/rows=10000-30        30.5MB ± 2%    30.5MB ± 1%    ~     (p=0.631 n=10+10)
KV/Delete/SQL/rows=1-30               42.9kB ± 0%    43.0kB ± 1%    ~     (p=0.393 n=10+10)
KV/Delete/SQL/rows=10-30              66.9kB ± 3%    66.5kB ± 1%    ~     (p=0.853 n=10+10)
KV/Delete/SQL/rows=1000-30            1.19MB ± 0%    1.19MB ± 0%    ~     (p=0.447 n=9+10)
KV/Delete/SQL/rows=10000-30           14.3MB ± 1%    14.3MB ± 0%    ~     (p=0.353 n=10+10)
KV/Scan/Native/rows=1-30              7.17kB ± 0%    7.18kB ± 0%    ~     (p=0.061 n=10+9)
KV/Scan/Native/rows=10-30             8.59kB ± 0%    8.59kB ± 0%    ~     (p=0.926 n=10+10)
KV/Scan/Native/rows=100-30            21.2kB ± 0%    21.2kB ± 0%    ~     (p=0.781 n=10+10)
KV/Scan/Native/rows=1000-30            172kB ± 0%     172kB ± 0%    ~     (p=0.264 n=10+8)
KV/Scan/Native/rows=10000-30          1.51MB ± 0%    1.51MB ± 0%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=1-30                 19.3kB ± 0%    19.3kB ± 1%    ~     (p=0.968 n=9+10)
KV/Scan/SQL/rows=10-30                20.6kB ± 1%    20.7kB ± 0%    ~     (p=0.897 n=10+8)
KV/Scan/SQL/rows=100-30               32.8kB ± 1%    32.9kB ± 0%    ~     (p=0.400 n=10+9)
KV/Scan/SQL/rows=1000-30               185kB ± 0%     185kB ± 0%    ~     (p=0.247 n=10+10)
KV/Scan/SQL/rows=10000-30             1.10MB ± 0%    1.10MB ± 0%    ~     (p=0.631 n=10+10)
KVAndStorageUsingSQL/versions=2-30     793kB ± 3%     793kB ± 2%    ~     (p=0.796 n=10+10)
KVAndStorageUsingSQL/versions=4-30     788kB ± 3%     783kB ± 3%    ~     (p=0.353 n=10+10)
KVAndStorageUsingSQL/versions=8-30     787kB ± 3%     785kB ± 2%    ~     (p=0.853 n=10+10)
KV/Delete/Native/rows=1-30            15.1kB ± 0%    15.2kB ± 0%  +0.26%  (p=0.029 n=10+10)
KV/Update/Native/rows=10-30           66.8kB ± 0%    67.0kB ± 0%  +0.38%  (p=0.029 n=10+10)
KV/Insert/SQL/rows=100-30              550kB ± 1%     553kB ± 1%  +0.44%  (p=0.043 n=10+10)
KV/Update/SQL/rows=100-30              578kB ± 1%     580kB ± 1%  +0.46%  (p=0.019 n=10+10)
KV/Update/Native/rows=100-30           503kB ± 0%     506kB ± 0%  +0.59%  (p=0.000 n=10+10)
KV/Update/SQL/rows=10000-30            153MB ± 1%     154MB ± 1%  +0.62%  (p=0.006 n=9+10)
KV/Delete/SQL/rows=100-30              306kB ± 0%     308kB ± 0%  +0.67%  (p=0.000 n=10+8)
KV/Delete/Native/rows=10-30           36.1kB ± 1%    36.4kB ± 1%  +0.82%  (p=0.001 n=10+9)
KV/Update/Native/rows=1000-30         4.75MB ± 1%    4.79MB ± 1%  +0.83%  (p=0.002 n=10+10)
KV/Insert/Native/rows=10-30           39.7kB ± 1%    40.1kB ± 1%  +0.83%  (p=0.023 n=10+10)
KV/Insert/Native/rows=1000-30         2.56MB ± 1%    2.58MB ± 1%  +1.00%  (p=0.000 n=9+10)
KV/Insert/Native/rows=100-30           279kB ± 1%     282kB ± 1%  +1.02%  (p=0.007 n=10+10)
KV/Delete/Native/rows=100-30           243kB ± 1%     246kB ± 1%  +1.07%  (p=0.000 n=10+10)
KV/Delete/Native/rows=1000-30         2.23MB ± 1%    2.26MB ± 1%  +1.33%  (p=0.000 n=10+9)

name                                old allocs/op  new allocs/op  delta
KV/Update/SQL/rows=1000-30             58.6k ± 0%     58.4k ± 0%  -0.28%  (p=0.000 n=9+9)
KV/Scan/SQL/rows=10-30                   277 ± 0%       276 ± 0%  -0.25%  (p=0.001 n=8+10)
KV/Update/SQL/rows=10000-30             740k ± 0%      739k ± 0%  -0.12%  (p=0.040 n=8+7)
KV/Insert/Native/rows=1-30               129 ± 0%       129 ± 0%    ~     (all equal)
KV/Insert/Native/rows=10-30              272 ± 0%       272 ± 0%    ~     (all equal)
KV/Insert/Native/rows=1000-30          14.3k ± 0%     14.3k ± 0%    ~     (p=0.221 n=10+10)
KV/Insert/Native/rows=10000-30          141k ± 0%      141k ± 0%    ~     (p=0.356 n=10+9)
KV/Insert/SQL/rows=1-30                  349 ± 0%       348 ± 0%    ~     (p=0.776 n=9+10)
KV/Insert/SQL/rows=10-30                 625 ± 0%       625 ± 0%    ~     (p=0.068 n=9+8)
KV/Insert/SQL/rows=100-30              3.12k ± 0%     3.12k ± 0%    ~     (p=0.351 n=10+9)
KV/Insert/SQL/rows=1000-30             29.3k ± 0%     29.3k ± 0%    ~     (p=0.644 n=9+10)
KV/Insert/SQL/rows=10000-30             294k ± 0%      293k ± 0%    ~     (p=0.134 n=9+7)
KV/Update/Native/rows=1-30               181 ± 0%       181 ± 0%    ~     (all equal)
KV/Update/Native/rows=10-30              458 ± 0%       458 ± 0%    ~     (all equal)
KV/Update/Native/rows=1000-30          26.9k ± 0%     27.0k ± 0%    ~     (p=0.232 n=9+10)
KV/Update/Native/rows=10000-30          273k ± 0%      274k ± 0%    ~     (p=0.315 n=9+10)
KV/Update/SQL/rows=1-30                  503 ± 0%       503 ± 0%    ~     (p=1.000 n=10+10)
KV/Update/SQL/rows=10-30                 904 ± 1%       905 ± 0%    ~     (p=0.223 n=10+10)
KV/Update/SQL/rows=100-30              6.79k ± 0%     6.79k ± 0%    ~     (p=0.669 n=10+10)
KV/Delete/Native/rows=1-30               128 ± 0%       128 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10-30              248 ± 0%       248 ± 0%    ~     (all equal)
KV/Delete/Native/rows=10000-30          112k ± 0%      112k ± 0%    ~     (p=0.825 n=10+9)
KV/Delete/SQL/rows=1-30                  331 ± 0%       331 ± 0%    ~     (all equal)
KV/Delete/SQL/rows=10-30                 512 ± 0%       512 ± 0%    ~     (p=0.406 n=8+10)
KV/Delete/SQL/rows=100-30              2.01k ± 0%     2.01k ± 0%    ~     (p=0.947 n=10+8)
KV/Delete/SQL/rows=1000-30             7.51k ± 0%     7.51k ± 0%    ~     (p=0.204 n=9+10)
KV/Delete/SQL/rows=10000-30            71.2k ± 0%     71.2k ± 0%    ~     (p=0.986 n=9+10)
KV/Scan/Native/rows=1-30                54.0 ± 0%      54.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=10-30               58.0 ± 0%      58.0 ± 0%    ~     (all equal)
KV/Scan/Native/rows=100-30              62.3 ± 1%      62.3 ± 1%    ~     (p=1.000 n=10+10)
KV/Scan/Native/rows=1000-30             72.3 ± 1%      72.0 ± 0%    ~     (p=0.137 n=10+8)
KV/Scan/Native/rows=10000-30             115 ± 1%       115 ± 1%    ~     (p=0.193 n=9+10)
KV/Scan/SQL/rows=1-30                    242 ± 0%       242 ± 0%    ~     (p=0.828 n=9+10)
KV/Scan/SQL/rows=100-30                  648 ± 0%       648 ± 0%    ~     (all equal)
KV/Scan/SQL/rows=1000-30               5.04k ± 0%     5.04k ± 0%    ~     (p=1.000 n=10+10)
KV/Scan/SQL/rows=10000-30              50.3k ± 0%     50.3k ± 0%    ~     (p=0.208 n=10+10)
KVAndStorageUsingSQL/versions=2-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.223 n=8+10)
KVAndStorageUsingSQL/versions=4-30     7.61k ± 0%     7.60k ± 0%    ~     (p=0.643 n=10+10)
KVAndStorageUsingSQL/versions=8-30     7.61k ± 0%     7.61k ± 0%    ~     (p=0.888 n=10+9)
KV/Update/Native/rows=100-30           2.76k ± 0%     2.76k ± 0%  +0.03%  (p=0.044 n=10+10)
KV/Delete/Native/rows=1000-30          11.3k ± 0%     11.3k ± 0%  +0.03%  (p=0.006 n=10+9)
KV/Insert/Native/rows=100-30           1.57k ± 0%     1.57k ± 0%  +0.06%  (p=0.000 n=8+7)
KV/Delete/Native/rows=100-30           1.27k ± 0%     1.28k ± 0%  +0.08%  (p=0.000 n=8+8)
```
</details>

<details>
<summary><b>YCSB benchmark suite</b></summary>
  
```
name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.2k ± 6%   15.4k ± 3%    ~     (p=0.579 n=10+10)
ycsb/B/nodes=3          28.7k ± 4%   28.9k ± 1%    ~     (p=0.780 n=10+9)
ycsb/C/nodes=3          41.3k ± 4%   41.4k ± 2%    ~     (p=0.529 n=10+10)
ycsb/D/nodes=3          33.9k ± 2%   33.5k ± 2%  -1.37%  (p=0.003 n=9+9)
ycsb/E/nodes=3          2.10k ± 2%   2.10k ± 1%    ~     (p=0.813 n=10+7)
ycsb/F/nodes=3          7.06k ± 5%   7.05k ± 4%    ~     (p=0.853 n=10+10)
ycsb/A/nodes=3/cpu=32   30.0k ± 9%   31.4k ± 3%    ~     (p=0.095 n=10+9)
ycsb/B/nodes=3/cpu=32   97.6k ± 2%   98.1k ± 2%    ~     (p=0.237 n=8+10)
ycsb/C/nodes=3/cpu=32    129k ± 2%    130k ± 1%    ~     (p=0.243 n=10+9)
ycsb/D/nodes=3/cpu=32    105k ± 2%    106k ± 1%  +1.06%  (p=0.034 n=10+8)
ycsb/E/nodes=3/cpu=32   3.33k ± 1%   3.33k ± 1%    ~     (p=0.951 n=10+9)
ycsb/F/nodes=3/cpu=32   15.5k ±10%   16.4k ± 5%  +5.35%  (p=0.029 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.32 ± 6%    6.23 ± 3%    ~     (p=0.635 n=10+10)
ycsb/B/nodes=3           5.02 ± 4%    4.97 ± 3%    ~     (p=0.542 n=10+10)
ycsb/C/nodes=3           3.49 ± 3%    3.50 ± 0%    ~     (p=0.443 n=10+9)
ycsb/D/nodes=3           2.80 ± 0%    2.87 ± 2%  +2.50%  (p=0.008 n=8+10)
ycsb/E/nodes=3           45.8 ± 1%    45.8 ± 1%    ~     (p=0.718 n=10+7)
ycsb/F/nodes=3           13.5 ± 3%    13.6 ± 4%    ~     (p=0.509 n=9+10)
ycsb/A/nodes=3/cpu=32    4.82 ±10%    4.62 ± 6%    ~     (p=0.092 n=10+10)
ycsb/B/nodes=3/cpu=32    2.00 ± 0%    1.96 ± 3%    ~     (p=0.065 n=9+10)
ycsb/C/nodes=3/cpu=32    1.50 ± 0%    1.50 ± 0%    ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.40 ± 0%    1.36 ± 4%    ~     (p=0.065 n=9+10)
ycsb/E/nodes=3/cpu=32    43.3 ± 1%    43.3 ± 0%    ~     (p=0.848 n=10+9)
ycsb/F/nodes=3/cpu=32    9.30 ±11%    8.79 ± 5%  -5.48%  (p=0.022 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           52.0 ±17%    48.4 ±18%    ~     (p=0.379 n=10+10)
ycsb/B/nodes=3           32.2 ±11%    32.2 ± 9%    ~     (p=0.675 n=10+10)
ycsb/C/nodes=3           13.8 ± 6%    13.8 ± 3%    ~     (p=1.000 n=10+10)
ycsb/D/nodes=3           12.6 ± 0%    12.9 ± 2%  +2.38%  (p=0.023 n=8+10)
ycsb/E/nodes=3            200 ± 8%     197 ± 6%    ~     (p=0.375 n=10+8)
ycsb/F/nodes=3            124 ±19%     125 ±21%    ~     (p=0.856 n=9+10)
ycsb/A/nodes=3/cpu=32    68.1 ±17%    63.9 ± 5%    ~     (p=0.211 n=10+8)
ycsb/B/nodes=3/cpu=32    15.7 ± 0%    15.5 ± 2%    ~     (p=0.071 n=6+10)
ycsb/C/nodes=3/cpu=32    10.5 ± 0%    10.2 ± 3%  -2.86%  (p=0.003 n=8+10)
ycsb/D/nodes=3/cpu=32    8.70 ± 3%    8.40 ± 0%  -3.45%  (p=0.003 n=10+8)
ycsb/E/nodes=3/cpu=32     151 ± 0%     151 ± 0%    ~     (all equal)
ycsb/F/nodes=3/cpu=32     148 ±15%     145 ±10%    ~     (p=0.503 n=9+10)
```
</details>

----

Release note (bug fix): fixed a rare race condition that could allow for a transaction to serve a stale read and violate real-time ordering under moderate clock skew.

[^1]: see [pkg/kv/kvserver/observedts/doc.go](https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/observedts/doc.go) for an explanation of the role of observed timestamps in the transaction model. This commit updates that documentation to include this fix.

[^2]: see analysis in #36431 (comment).

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@nvanbenschoten

Replaced by #80706.

@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/staleReadBug branch May 9, 2022 19:20
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jun 25, 2023
Informs cockroachdb#101938.

This commit removes logic in mvcc key encoding routines that handle
synthetic timestamps. As a result, we no longer write keys with
synthetic timestamps, though we retain the ability to decode them.

As described in cockroachdb#72121 (comment)
and later in cockroachdb@24c56df
(see "Future improvements"), the introduction of the mvcc value header
and the optional, per-version local timestamp paved the way for the
removal of synthetic timestamps. MVCC keys no longer need to carry the
synthetic bit in order for reads from GLOBAL tables to behave properly.
As a result, we no longer need to write it.

Release note: None
craig bot pushed a commit that referenced this pull request Jul 10, 2023
105523: kv: stop encoding or decoding synthetic timestamp bit in/from mvcc keys r=sumeerbhola a=nvanbenschoten

Informs #101938.

This first commit removes logic in mvcc key encoding routines that handle synthetic timestamps. As a result, we no longer write keys with synthetic timestamps, though we retain the ability to decode them.

The second commit removes logic in mvcc key decoding routines to decode synthetic timestamps. We retain the ability to decode keys with the synthetic timestamp bit set, but we simply ignore its presence.

As described in #72121 (comment) and later in 24c56df (see "Future improvements"), the introduction of the mvcc value header and the optional, per-version local timestamp paved the way for the removal of synthetic timestamps. MVCC keys no longer need to carry the synthetic bit in order for reads from GLOBAL tables to behave properly. As a result, we no longer need to write it.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>