
kv: give non-transactional requests uncertainty intervals #73732

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Dec 13, 2021

Fixes #58459.
Informs #36431.

This commit fixes a long-standing correctness issue where non-transactional requests did not ensure single-key linearizability even if they deferred their timestamp allocation to the leaseholder of their (single) range. They still don't entirely, because of #36431, but this change brings us one step closer to the fix we plan to land for #36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty intervals. This ensures that they also guarantee single-key linearizability even with only loose (but bounded) clock synchronization. Non-transactional requests that use a client-provided timestamp do not have uncertainty intervals and do not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their coordinator node during initialization, non-transactional requests receive uncertainty intervals from their target leaseholder, using a clock reading from the leaseholder's local HLC as the local limit and this clock reading + the cluster's maximum clock skew as the global limit.
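Concretely, the construction is simple. Below is a minimal, self-contained sketch using plain integer nanosecond timestamps; the type and function names are illustrative stand-ins, not the actual pkg/kv/kvserver/uncertainty API:

package main

import (
	"fmt"
	"time"
)

// interval mirrors the shape described above: a local and a global
// uncertainty limit.
type interval struct {
	localLimit  int64 // the leaseholder's local HLC reading, in nanos
	globalLimit int64 // local limit + the cluster's maximum clock offset
}

// computeNonTxnInterval derives a non-transactional request's uncertainty
// interval from the target leaseholder's clock. Requests that carry a
// client-provided timestamp get an empty interval, matching the caveat
// above that they make no real-time ordering guarantees.
func computeNonTxnInterval(leaseholderClock int64, clientProvidedTS bool, maxOffset time.Duration) interval {
	if clientProvidedTS {
		return interval{} // empty: uncertainty restarts are bypassed
	}
	return interval{
		localLimit:  leaseholderClock,
		globalLimit: leaseholderClock + maxOffset.Nanoseconds(),
	}
}

func main() {
	iv := computeNonTxnInterval(97, false, 500*time.Millisecond)
	fmt.Printf("local=%d global=%d\n", iv.localLimit, iv.globalLimit)
}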

It is somewhat non-intuitive that non-transactional requests need uncertainty intervals — after all, they receive their timestamp from the leaseholder of the only range that they talk to, so isn't every value with a commit timestamp above their read timestamp certainly concurrent? The answer is surprisingly "no" for the following reasons, so they cannot forgo the use of uncertainty intervals:

  1. the request timestamp is allocated before consulting the replica's lease. This means that there are times when the replica is not the leaseholder at the point of timestamp allocation, and only becomes the leaseholder later. In such cases, the timestamp assigned to the request is not guaranteed to be greater than the written_timestamp of all writes served by the range at the time of allocation. This is true despite invariants 1 & 2 presented in pkg/kv/kvserver/uncertainty/doc.go because the replica allocating the timestamp is not yet the leaseholder.

    In cases where the replica that assigned the non-transactional request's timestamp takes over as the leaseholder after the timestamp allocation, we expect minimumLocalLimitForLeaseholder to forward the local uncertainty limit above TimestampFromServerClock, to the lease start time.

    For example, consider the following series of events:

    • client A writes k = v1
    • leaseholder writes v1 at ts = 100
    • client A receives ack for write
    • client B wants to read k using a non-txn request
    • follower replica with slower clock receives non-txn request
    • follower replica assigns request ts = 95
    • lease transferred to follower replica with lease start time = 101
    • non-txn request must use 101 as limit of uncertainty interval to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read
  2. even if the replica's lease is stable and the timestamp is assigned to the non-transactional request by the leaseholder, the assigned clock reading only reflects the written_timestamp of all of the writes served by the leaseholder (and previous leaseholders) thus far. This clock reading is not guaranteed to lead the commit timestamp of all of these writes, especially if they are committed remotely and resolved after the request has received its clock reading but before the request begins evaluating (both this condition and the lease-start forwarding described in point 1 are sketched in code after the examples below).

    As a result, the non-transactional request needs an uncertainty interval with a global uncertainty limit far enough in advance of the leaseholder's local HLC clock to ensure that it considers any value that was part of a transaction which could have committed before the request was received by the leaseholder to be uncertain. Concretely, the non-transactional request needs to consider values of the following form to be uncertain:

    written_timestamp < local_limit && commit_timestamp < global_limit

    The value that the non-transactional request is observing may have been written on the local leaseholder at time 10, its transaction may have been committed remotely at time 20, acknowledged, then the non-transactional request may have begun and received a timestamp of 15 from the local leaseholder, then finally the value may have been resolved asynchronously and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20). The failure of the non-transactional request to observe this value would be a stale read.

    For example, consider the following series of events:

    • client A begins a txn and is assigned provisional commit timestamp = 95
    • client A's txn performs a Put(k, v1)
    • leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
    • client A's txn performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to 100
    • client A's txn refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of kv: stale read because intent from our past gets committed in our certain future? #36431), so that leaseholder's clock is not updated
    • client A's txn commits remotely and client A receives the acknowledgment
    • client B initiates non-txn read of k
    • leaseholder assigns read timestamp ts = 97
    • asynchronous intent resolution resolves the txn's intent at k, moving v1 to ts = 100 in the process
    • non-txn request must use an uncertainty interval that extends past 100 to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read
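Both failure modes can be captured in a few lines of code, as referenced above. The following self-contained sketch uses plain integer timestamps and illustrative names: minimumLocalLimitForLeaseholder stands in for the lease-start forwarding from point 1, and isUncertain encodes the written_timestamp/commit_timestamp condition from point 2.

package main

import "fmt"

// minimumLocalLimitForLeaseholder forwards the local uncertainty limit to
// the lease start time when the replica that assigned the request's
// timestamp later becomes the leaseholder (point 1 above).
func minimumLocalLimitForLeaseholder(localLimit, leaseStart int64) int64 {
	if leaseStart > localLimit {
		return leaseStart
	}
	return localLimit
}

// isUncertain encodes the condition from point 2:
// written_timestamp < local_limit && commit_timestamp < global_limit.
func isUncertain(writtenTS, commitTS, localLimit, globalLimit int64) bool {
	return writtenTS < localLimit && commitTS < globalLimit
}

func main() {
	// Example 1: request assigned ts = 95, lease then transferred with
	// start time = 101. The forwarded local limit now covers v1 at ts = 100.
	fmt.Println(minimumLocalLimitForLeaseholder(95, 101)) // 101

	// Example 2: read at ts = 97 observes a value with written_timestamp = 95
	// and commit_timestamp = 100. With a max offset of 500 time units, the
	// global limit is 597, so the value is uncertain and forces a retry.
	fmt.Println(isUncertain(95, 100, 97, 97+500)) // true
}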

This change is related to #36431 in two ways. First, it allows non-transactional requests to benefit from our incoming fix for that issue. Second, it unblocks some of the clock refactors proposed in #72121 (comment), and by extension #72663. Even though there are clear bugs today, I still don't feel comfortable removing the hlc.Clock.Update in Store.Send until we make this change. We know from #36431 that invariant 1 from uncertainty.D6 doesn't hold, and yet I still do think the hlc.Clock.Update in Store.Send masks the bugs in this area in many cases. Removing that clock update (I don't actually plan to remove it, but I plan to disconnect it entirely from operation timestamps) without first giving non-transactional requests uncertainty intervals seems like it may reveal these bugs in ways we haven't seen in the past. So I'd like to land this fix before making that change.


Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/uncertaintyNonTxn branch 2 times, most recently from 5c5e4b1 to 5264c34 on December 13, 2021 18:20
Member

@tbg tbg left a comment


Haven't reviewed the main commit yet, but I did review its commit message.

Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 34 of 34 files at r3, 9 of 9 files at r4, 34 of 34 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)


-- commits, line 85 at r6:
what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.


-- commits, line 86 at r6:
extend with an example. I know it's obvious but an example is always good imo.

  • client A wants to write k := v1
  • leaseholder writes v1 at ts = 100
  • client A receives ack for write
  • client B wants to read k
  • follower receives non-txn'al request, assigns timestamp ts = 95
  • follower becomes leader (say, at lease start ts = 101)
  • non-txn'al request does not return v1

-- commits, line 88 at r6:
I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):

With this PR, when B wants to read k after having written v1, the non-txn'al request will similarly start out with TimestampFromServerClock = 95, but will have its local uncertainty limit forwarded via minimumLocalLimitForLeaseholder to at least the new lease start time t=101, and is thus guaranteed to observe all writes served by the old leaseholder, including v1 (which has ts = 100, which is covered by [95,101]).


-- commits, line 93 at r6:
Starting to struggle here to see the difference between this and #36431. Here's my attempt at improving the comment but I might be slightly off.

nit: I can't find written_timestamp anywhere except in comments and I'm not even sure where to find it.


Even if the lease is stable and the timestamp is assigned by the leaseholder, that leaseholder's clock reading only reflects the [written_timestamp] of all of the writes served by the leaseholder (and previous leaseholders) thus far. The written_timestamp of a write may be lower than its commit timestamp, but it is the latter that matters. For example:

  • client A begins transaction Put(k, v1) which is assigned provisional commit timestamp ts = 95
  • leaseholder serves Put(k, v1), i.e. lays down intent at written_timestamp = 95.
  • client A's transaction performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to ts = 100.
  • client A's transaction refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of kv: stale read because intent from our past gets committed in our certain future? #36431).
  • client A's transaction commits and client A receives the acknowledgment.
  • client A initiates non-txnal read of k.
  • leaseholder assigns read timestamp ts = 97
  • asynchronous intent resolution resolves the intent at k, moving v1 to ts = 100 in the process
  • read does not observe v1.

After this PR, the read is instead assigned a full uncertainty interval [97, 97+MaxOffset). Since the final commit timestamp of the preceding txn (ts = 100) could lead the subsequently assigned read timestamp (ts = 97) by at most MaxOffset, the read is guaranteed to consider v1 possibly concurrent and will observe it (after a local "restart"), thus preventing #36431 for non-transactional reads.

Note that the above problem is harder to solve in the case of a subsequent transactional read as transaction restarts are orders of magnitude more costly and so we more aggressively want to limit the uncertainty interval using observed timestamps; see #36431.

[written_timestamp]: TODO code link


pkg/roachpb/api.proto, line 2113 at r4 (raw file):

  // reference to the initial timestamp, we ensure that a non-transactional
  // request's uncertainty interval remains fixed across retries.
  util.hlc.Timestamp timestamp_from_server_clock = 25 [

Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.


pkg/roachpb/batch.go, line 78 at r4 (raw file):

			now := clock.NowAsClockTimestamp()
			ba.Timestamp = now.ToTimestamp()
			ba.TimestampFromServerClock = &now

Had to do a double take to see that we're not sharing memory between Timestamp{,FromServerClock}. Maybe this is clearer:

now := clock.NowAsClockTimestamp()
ba.TimestampFromServerClock = &now
ba.Timestamp = (*ba.TimestampFromServerClock).ToTimestamp()

Not sure this is really better.

@tbg tbg self-requested a review December 14, 2021 10:49
Member

@tbg tbg left a comment


Reviewed 19 of 19 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @nvanbenschoten)


-- commits, line 93 at r6:

Previously, tbg (Tobias Grieger) wrote…

Starting to struggle here to see the difference between this and #36431. Here's my attempt at improving the comment but I might be slightly off.

nit: I can't find written_timestamp anywhere except in comments and I'm not even sure where to find it.


Even if the lease is stable and the timestamp is assigned by the leaseholder, that leaseholder's clock reading only reflects the [written_timestamp] of all of the writes served by the leaseholder (and previous leaseholders) thus far. The written_timestamp of a write may be lower than its commit timestamp, but it is the latter that matters. For example:

  • client A begins transaction Put(k, v1) which is assigned provisional commit timestamp ts = 95
  • leaseholder serves Put(k, v1), i.e. lays down intent at written_timestamp = 95.
  • client A's transaction performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to ts = 100.
  • client A's transaction refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of kv: stale read because intent from our past gets committed in our certain future? #36431).
  • client A's transaction commits and client A receives the acknowledgment.
  • client A initiates non-txnal read of k.
  • leaseholder assigns read timestamp ts = 97
  • asynchronous intent resolution resolves the intent at k, moving v1 to ts = 100 in the process
  • read does not observe v1.

After this PR, the read is instead assigned a full uncertainty interval [97, 97+MaxOffset). Since the final commit timestamp of the preceding txn (ts = 100) could lead the subsequently assigned read timestamp (ts = 97) by at most MaxOffset, the read is guaranteed to consider v1 possibly concurrent and will observe it (after a local "restart"), thus preventing #36431 for non-transactional reads.

Note that the above problem is harder to solve in the case of a subsequent transactional read as transaction restarts are orders of magnitude more costly and so we more aggressively want to limit the uncertainty interval using observed timestamps; see #36431.

[written_timestamp]: TODO code link

Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. If I understand correctly, at least once all your work is done and #36431 is fixed, when the intent is resolved it will be with a timestamp that instructs the read to apply the global limit (which is 97+MaxOffset as opposed to the local limit (which is 97). So it is incorrect that this PR is already addressing #36431 for the non-txn'al case, rather it establishes parity so that the txn fix will fix non-txn as well.


pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):

	ba.Txn = ba.Txn.Clone()
	ba.Txn.Refresh(ts)
	ba.Timestamp = ba.Txn.ReadTimestamp

// NB: Refresh just updated ReadTimestamp


pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):

its timestamp

of what


pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):

			newTimestamp = tErr.RetryTimestamp()

		case *roachpb.ReadWithinUncertaintyIntervalError:

We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.

Ah, I see below that there is latching-related code for this case now.


pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):

			// are strict conditions that must be met for this to be permitted. For
			// non-transactional requests, this is always allowed. If successful, an
			// updated BatchRequest will be returned.

Could mention what error is expected if not. I assume we'll propagate the RWUIE.


pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):

		// is necessary to preserve real-time ordering for transactions that write
		// into the future.
		if err := r.Clock().SleepUntil(ctx, ba.Timestamp); err != nil {

nit: should the ctx here respect stopper cancellation?


pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):

		access = spanset.SpanReadOnly

		// For non-locking reads, acquire read latches all the way up to the

Could you remind me why DefaultDeclareKeys doesn't do any of this uncertainty stuff?


pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):

	if status.State != kvserverpb.LeaseState_VALID {
		// If the lease is invalid, this must be a follower read.

... and so we must use the most pessimistic limit, right?


pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):

		// these cases, it needs to perform an uncertainty restart.
		//
		// For example, the non-transactional request may observe an intent with a

Good place to plop one of the explicit examples that might make it into the commit msg.


pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):

// the cluster's maximum clock skew as the global limit.
//
// It is somewhat non-intuitive that non-transactional requests need uncertainty

reminder to replace with the most up to date prose/examples before merge.


pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):

	// We must check uncertainty even if p.ts.Less(p.uncertainty.LocalLimit)
	// because the local uncertainty limit cannot be applied to values with
	// synthetic timestamps.

I know you're not really changing this but this comment is confusing - we set checkUncertainty = false and say "we must still check it"? Having trouble even making a better comment. I don't see any checks for the uncertainty on synthetic values (might've missed it). So maybe just replace this with a TODO referencing a Github issue?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 3, 2022
Extracted from cockroachdb#73732, with relevant comments addressed.

This commit adds support for server-side refreshes of
`ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization
for transactional requests, which now benefit from this new capability to
refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction,
before they've accumulated any refresh spans. There's some complexity around
supporting this form of server-side retry, because it must be done above
latching, instead of below. However, the recent refactoring in cockroachdb#73557 has made
this possible to support cleanly.

This is also a prerequisite to giving non-transactional requests uncertainty
intervals (cockroachdb#73732), because we don't want ReadWithinUncertaintyIntervalErrors to
reach the client for non-transactional requests. Conveniently, because
non-transactional requests are always scoped to a single-range, those that hit
uncertainty errors will always be able to retry on the server, so these errors
will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/uncertaintyNonTxn branch 2 times, most recently from fec6f22 to 181d586 on February 3, 2022 05:29
Member Author

@nvanbenschoten nvanbenschoten left a comment


Thank you for the detailed review, and apologies for letting this sit for a month. I'm hoping to start pushing on this stuff again.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)


-- commits, line 85 at r6:

Previously, tbg (Tobias Grieger) wrote…

what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.

Done.


-- commits, line 86 at r6:

Previously, tbg (Tobias Grieger) wrote…

extend with an example. I know it's obvious but an example is always good imo.

  • client A wants to write k := v1
  • leaseholder writes v1 at ts = 100
  • client A receives ack for write
  • client B wants to read k
  • follower receives non-txn'al request, assigns timestamp ts = 95
  • follower becomes leader (say, at lease start ts = 101)
  • non-txn'al request does not return v1

Done.


-- commits, line 88 at r6:

Previously, tbg (Tobias Grieger) wrote…

I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):

With this PR, when B wants to read k after having written v1, the non-txn'al request will similarly start out with TimestampFromServerClock = 95, but will have its local uncertainty limit forwarded via minimumLocalLimitForLeaseholder to at least the new lease start time t=101, and is thus guaranteed to observe all writes served by the old leaseholder, including v1 (which has ts = 100, which is covered by [95,101]).

Done in example.


-- commits, line 93 at r6:

Previously, tbg (Tobias Grieger) wrote…

Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. If I understand correctly, at least once all your work is done and #36431 is fixed, when the intent is resolved it will be with a timestamp that instructs the read to apply the global limit (which is 97+MaxOffset as opposed to the local limit (which is 97). So it is incorrect that this PR is already addressing #36431 for the non-txn'al case, rather it establishes parity so that the txn fix will fix non-txn as well.

Done. Thanks for the suggestions about the wording here.

You're correct that this does not address #36431 for the non-txn'al case, but rather ensures that when we do fix #36431, we fix it for both cases.

Regarding the written_timestamp, the concept does not exist yet, which is why some of this is a bit difficult to write. I considered omitting it, but given that we're planning on introducing it shortly in the replacement to #72121, it makes things a lot easier to just pretend it exists for now so that we have something concrete to anchor these discussions on and won't need to rewrite all of this later.

I added references to #72121 in each case where we discuss the written_timestamp concept to make this clearer.


pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

// NB: Refresh just updated ReadTimestamp

Done.


pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

its timestamp

of what

Done.


pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.

Ah, I see below that there is latching-related code for this case now.

I agree that this should have been a different PR. I was going for that with #73717, but I ended up pulling the uninteresting (and no longer needed) change into a separate PR without pulling the bulk of the complexity out of this one.

I've pulled the changes related to teaching transactional requests to perform server-side refreshes of uncertainty intervals into #75905.


pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could mention what error is expected if not. I assume we'll propagate the RWUIE.

Done.


pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: should the ctx here respect stopper cancellation?

Done.


pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Could you remind me why DefaultDeclareKeys doesn't do any of this uncertainty stuff?

DefaultDeclareKeys is meant for non-MVCC operations (inline writes) and metadata operations (refresh requests), which is why it also doesn't instruct requests to check the lock table. None of these kinds of operations use uncertainty intervals. I'm not particularly happy with the way we talk about the distinction between these two classes of operations ("isolated vs. not"), but I haven't found a way to clean up the language and concepts used in this area.


pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

... and so we must use the most pessimistic limit, right?

Done.


pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Good place to plop one of the explicit examples that might make it into the commit msg.

Done.


pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

reminder to replace with the most up to date prose/examples before merge.

Done.


pkg/roachpb/api.proto, line 2113 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.

Done.


pkg/roachpb/batch.go, line 78 at r4 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Had to do a double take to see that we're not sharing memory between Timestamp{,FromServerClock}. Maybe this is clearer:

now := clock.NowAsClockTimestamp()
ba.TimestampFromServerClock = &now
ba.Timestamp = (*ba.TimestampFromServerClock).ToTimestamp()

Not sure this is really better.

Added a comment to make this clear.
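
For reference, the clarified assignment could read along these lines (a sketch; the exact comment that landed may differ):

now := clock.NowAsClockTimestamp()
// NB: ToTimestamp returns a copy, so ba.Timestamp and
// ba.TimestampFromServerClock do not share memory. Later server-side
// retries may forward ba.Timestamp without disturbing the recorded
// clock reading.
ba.Timestamp = now.ToTimestamp()
ba.TimestampFromServerClock = &now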


pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I know you're not really changing this but this comment is confusing - we set checkUncertainty = false and say "we must still check it"? Having trouble even making a better comment. I don't see any checks for the uncertainty on synthetic values (might've missed it). So maybe just replace this with a TODO referencing a Github issue?

The comment is trying to contrast p.ts < local_limit with p.ts < global_limit, but I think it was just wrong. Fixed.

Member

@tbg tbg left a comment


Partial review due to EOD

Reviewed 1 of 62 files at r8, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)


-- commits, line 95 at r11:
Is another (contrived but possible) example that the client reads back k but that due to a, uh, 10s GC pause, the read is eligible for a follower read? Or do we just not let non-txn'al requests do follower reads (which would be good; no reason for those to exist)?


-- commits, line 103 at r11:
local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).


-- commits, line 111 at r11:
not not


-- commits, line 125 at r11:
nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.


-- commits, line 143 at r11:
Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.


-- commits, line 150 at r11:
.

@tbg tbg self-requested a review February 4, 2022 17:05
// then it can not trivially bump its timestamp without dropping and
// re-acquiring those latches. Doing so could allow the request to read at an
// unprotected timestamp. We only look at global latch spans because local
// latch spans always use unbounded (NonMVCC) timestamps.
Member


Is this enforced somehow?

// See https://github.com/cockroachdb/cockroach/issues/58459.
if txn == nil {
return Interval{}
func ComputeInterval(
Member


Might be worth pointing out that this can return an empty Interval{} meaning that the request is explicitly bypassing uncertainty restarts.

//
// Unlike transactions, which establish an uncertainty interval on their
// coordinator node during initialization, non-transactional requests receive
// uncertainty intervals from their target leaseholder, using a clock reading
Member


Since this is a choice, it makes sense to motivate it: we have every reason to believe that the leaseholder is choosing the timestamp, which means we get a relevant observed timestamp "for free".

// single-range, those that hit uncertainty errors can always retry on the
// server, so these errors never bubble up to the client that initiated the
// request.
var D7 = roachpb.Header{}.TimestampFromServerClock
Member


Didn't read this comment in detail because I assume it mirrors exactly what is in the PR message and will be synced before merge.

require.NotNil(t, ba.TimestampFromServerClock)
nonTxnOrigTsC <- ba.Timestamp
// Wait for the test to give the go-ahead.
select {
Member


Could you put a DefaultSucceedsSoon timeout here and on all other selects? These kinds of tests tend to deadlock.
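
For instance, something like this sketch (testutils.DefaultSucceedsSoonDuration and the goAheadC channel name are assumptions):

select {
case <-goAheadC: // wait for the test to give the go-ahead
case <-time.After(testutils.DefaultSucceedsSoonDuration):
	t.Fatal("timed out waiting for go-ahead")
}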

Member


(I assume you've stressed this a bit to shake out flakes)

@@ -183,7 +183,7 @@ func maybeStripInFlightWrites(ba *roachpb.BatchRequest) (*roachpb.BatchRequest,
// works for batches that exclusively contain writes; reads cannot be bumped
// like this because they've already acquired timestamp-aware latches.
func maybeBumpReadTimestampToWriteTimestamp(
ctx context.Context, ba *roachpb.BatchRequest, latchSpans *spanset.SpanSet,
ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard,
Member


can g be nil?

// can change due to server-side uncertainty retries. By remembering a stable
// reference to the initial timestamp, we ensure that a non-transactional
// request's uncertainty interval remains fixed across retries.
util.hlc.Timestamp timestamp_from_server_clock = 27 [
Member


Remind me if there was a migration to worry about here? Is anything going to break cross-version due to the use of different fields?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
Member Author

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)


-- commits, line 95 at r11:

Previously, tbg (Tobias Grieger) wrote…

Is another (contrived but possible) example that the client reads back k but that due to a, uh, 10s GC pause, the read is eligible for a follower read? Or do we just not let non-txn'al requests do follower reads (which would be good; no reason for those to exist)?

Right, we don't allow non-transactional reads that have server-assigned timestamps to be served by followers for this reason. See:

// BatchCanBeEvaluatedOnFollower determines if a batch consists exclusively of
// requests that can be evaluated on a follower replica, given a sufficiently
// advanced closed timestamp.
func BatchCanBeEvaluatedOnFollower(ba roachpb.BatchRequest) bool {
	// Explanation of conditions:
	// 1. the batch cannot have or intend to receive a timestamp set from a
	//    server-side clock. If a follower with a lagging clock sets its timestamp
	//    and this then allows the follower to evaluate the batch as a follower
	//    read, then the batch might miss past writes served at higher timestamps
	//    on the leaseholder.
	// 2. each request in the batch needs to be "transactional", because those are
	//    the only ones that have clearly defined semantics when served under the
	//    closed timestamp.
	// 3. the batch needs to be read-only, because a follower replica cannot
	//    propose writes to Raft.
	// 4. the batch needs to be non-locking, because unreplicated locks are only
	//    held on the leaseholder.
	tsFromServerClock := ba.Txn == nil && (ba.Timestamp.IsEmpty() || ba.TimestampFromServerClock)
	if tsFromServerClock {
		return false
	}
	return ba.IsAllTransactional() && ba.IsReadOnly() && !ba.IsLocking()
}
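
A hypothetical call site (illustrative only, not the actual replica code path) would then be:

// Redirect to the leaseholder when the batch fails the check above.
if !BatchCanBeEvaluatedOnFollower(ba) {
	return roachpb.NewError(&roachpb.NotLeaseHolderError{}) // illustrative error
}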


-- commits, line 103 at r11:

Previously, tbg (Tobias Grieger) wrote…

local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).

Yes, updated.


-- commits, line 111 at r11:

Previously, tbg (Tobias Grieger) wrote…

not not

Done.


-- commits, line 125 at r11:

Previously, tbg (Tobias Grieger) wrote…

nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.

Done.


-- commits, line 143 at r11:

Previously, tbg (Tobias Grieger) wrote…

Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.

Done.


-- commits, line 150 at r11:

Previously, tbg (Tobias Grieger) wrote…

.

Done.


pkg/kv/kvserver/replica_batch_updates.go, line 186 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

can g be nil?

It can be, but in practice it is never passed a nil g.


pkg/kv/kvserver/concurrency/concurrency_manager.go, line 671 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Is this enforced somehow?

Yes, see https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/spanlatch/manager.go#L423-L432 and

if keys.IsLocal(span.Key) {
	scope = SpanLocal
	timestamp = hlc.Timestamp{}
}


pkg/kv/kvserver/uncertainty/compute.go, line 61 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Might be worth pointing out that this can return an empty Interval{} meaning that the request is explicitly bypassing uncertainty restarts.

Done.


pkg/kv/kvserver/uncertainty/doc.go, line 208 at r11 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Since this is a choice, it makes sense to motivate it: we have all reason to believe that this means the leaseholder is chosing the timestamp, which means we get the a relevant observed timestamp "for free".

Good point, done.


pkg/roachpb/api.proto, line 2149 at r11 (raw file):
No concerns, discussed in commit message.

There are no backwards compatibility concerns with this change because the previous field was only consulted on the same node that set it. It was not consulted across RPC boundaries.


pkg/kv/kvserver/client_replica_test.go, line 734 at r11 (raw file):
Done.

(I assume you've stressed this a bit to shake out flakes)

Yes, stressed for some time.

Member

@tbg tbg left a comment


Reviewed 7 of 62 files at r8, 53 of 53 files at r12, 8 of 8 files at r13, 34 of 34 files at r14, 15 of 15 files at r15, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)

@tbg tbg self-requested a review February 5, 2022 15:06
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
This commit eliminates the primary mechanism that we use to pass clock
information from a leaseholder, through Raft log entries, to a Range's
followers. As we found in cockroachdb#72278, this was only needed for correctness
in a few specific cases — namely lease transfers and range merges. These
two operations continue to pass clock signals through more explicit
channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert
a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp`
method. As outlined in cockroachdb#72121 (comment),
I would like to remove the ability to downcast a "data-plane" `Timestamp` to
a "control-plane" `ClockTimestamp` entirely. This will clarify the role
of `ClockTimestamps` in the system and clean up the channels through
which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is
in `Store.Send`, at the boundary of KV RPCs. I would also like to get
rid of this, but doing so needs to wait on cockroachdb#73732.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 8, 2022
Extracted from cockroachdb#73732, with relevant comments addressed.

This commit adds support for server-side refreshes of
`ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization
for transactional requests, which now benefit from this new capability to
refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction,
before they've accumulated any refresh spans. There's some complexity around
supporting this form of server-side retry, because it must be done above
latching, instead of below. However, the recent refactoring in cockroachdb#73557 has made
this possible to support cleanly.

Specifically, we now handle `ReadWithinUncertaintyIntervalError` as a concurrency
error in the `executeWithConcurrencyRetries` retry loop. This is different from
other server-side retries, which are hit during writes and can be handled without
releasing latches. This difference stems from the difference in how read and write
latches behave. Write latches protect their MVCC timestamp and any later time.
Meanwhile, read latches protect their MVCC timestamp and any earlier time. This
means that a request holding read latches that hits an uncertainty error can't
refresh without dropping those latches and acquiring new ones.
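
The latch semantics described here can be modeled compactly (an illustrative model, not the spanlatch implementation):

// A write latch at timestamp W protects [W, infinity); a read latch at
// timestamp R protects (-infinity, R]. On the same key, the two conflict
// exactly when those ranges overlap, i.e. when W <= R. Refreshing a read
// to a higher timestamp widens (-infinity, R], which is why the read's
// latches must be dropped and re-acquired.
func readWriteLatchesConflict(readTS, writeTS int64) bool {
	return writeTS <= readTS
}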

This is also a prerequisite to giving non-transactional requests uncertainty
intervals (cockroachdb#73732), because we don't want ReadWithinUncertaintyIntervalErrors to
reach the client for non-transactional requests. Conveniently, because
non-transactional requests are always scoped to a single-range, those that hit
uncertainty errors will always be able to retry on the server, so these errors
will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.
craig bot pushed a commit that referenced this pull request Feb 8, 2022
76095: kv: don't pass clock information through Raft log r=nvanbenschoten a=nvanbenschoten

This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in #72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in #72121 (comment), I would like to remove the ability to downcast a "data-plane" `Timestamp` to a "control-plane" `ClockTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on #73732.

76163: bazel: remove old protos when generating new ones r=ajwerner a=ajwerner

This is what the Makefile did. It was painful to have the old ones because
they'd lead to spurious diffs.

Release note: None

76166: sql: support version numbers on descriptor validation r=fqazi a=fqazi

Previously, the descriptor validation code did not
take a version number, so it was not possible to version-gate
new validation logic. This was inadequate because
when new fields are introduced we don't want their validation
to kick in for certain cases like the debug doctor, such as any
new fields with non-zero defaults. To address this, this patch adds support
for version numbers inside the validation, and updates unit tests
to pass this in as well. It also adds a new option to the debug doctor
to run validation at a specific version.

Release note (cli change): Add a new optional version argument
to the doctor examine command. This can be used to enable or
disable validation when examining older zip directories.

76188: ci: make sure metamorphic nightly uses proper formatter for issues r=nicktrav a=rickystewart

The `--formatter=pebble-metamorphic` option got lost in #75585.

Release note: None

76191: gazelle: exclude `.pb.go` files r=dt a=rickystewart

Should prevent Gazelle from getting confused and adding these to `srcs`.

Release note: None

76192: ci: make sure extra env vars are set for `roachtest` jobs r=rail a=rickystewart

We [need these](https://github.com/cockroachdb/cockroach/blob/5e7690d6da09821ff431ef32fe8d1430d05aed9f/pkg/cmd/internal/issues/issues.go#L159-L171).

Release note: None

76194: tree: remove TODOs about bytea/float cast volatility r=mgartner a=rafiss

These comments aren't needed, since the current cast volatility is
correct, as far as I can tell.

We might want to report this as a bug in Postgres if it incorrectly
describes these casts as immutable.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Feb 8, 2022
75905: kv: support server-side refreshes of uncertainty errors r=nvanbenschoten a=nvanbenschoten


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
…tamp

This commit replaces the boolean `timestamp_from_server_clock` flag on
the `BatchRequest` header with a nullable `ClockTimestamp` field. Doing
so allows for the use of the field to be expanded in a following commit.

The field will soon be needed to record the time at which the request
was received by the server node. The operation timestamp cannot serve
this role because it can change due to server-side uncertainty retries.
By remembering a stable reference to the initial timestamp, we ensure
that a non-transactional request's uncertainty interval remains fixed
across retries.

There are no backwards compatibility concerns with this change because
the previous field was only consulted on the same node that set it. It
was not consulted across RPC boundaries.
This isn't needed until the next commit, but it's enough noise and code
movement that it's better off living on its own.
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Feb 8, 2022

Build succeeded:

@craig craig bot merged commit 498970c into cockroachdb:master Feb 8, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022