kv: give non-transactional requests uncertainty intervals #73732
Force-pushed from 5c5e4b1 to 5264c34.
Haven't reviewed the main commit yet, but I did review its commit message.
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 34 of 34 files at r3, 9 of 9 files at r4, 34 of 34 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)
-- commits, line 85 at r6:
what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.
-- commits, line 86 at r6:
extend with an example. I know it's obvious but an example is always good imo.
- client A wants to write `k := v1`
- leaseholder writes `v1` at `ts = 100`
- client A receives ack for write
- client B wants to read `k`
- follower receives non-txn'al request, assigns timestamp `ts = 95`
- follower becomes leader (say, at lease start `ts = 101`)
- non-txn'al request does not return `v1`
-- commits, line 88 at r6:
I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):
With this PR, when B wants to read `k` after `v1` has been written, the non-txn'al request will similarly start out with `TimestampFromServerClock = 95`, but will have its local uncertainty limit forwarded via `minimumLocalLimitForLeaseholder` to at least the new lease start time `t=101`, and is thus guaranteed to observe all writes served by the old leaseholder, including `v1` (which has `ts = 100`, which is covered by `[95,101]`).
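To pin the forwarding down with the example's numbers, here is a minimal runnable sketch using plain int64 timestamps; `minimumLocalLimitForLeaseholder` and `TimestampFromServerClock` are names from the PR, everything else is illustrative:

```go
package main

import "fmt"

func main() {
	timestampFromServerClock := int64(95) // assigned by the (then) follower
	leaseStart := int64(101)              // start time of the lease it later acquires
	writeTS := int64(100)                 // v1's timestamp

	// What minimumLocalLimitForLeaseholder is described as doing: forward
	// the local uncertainty limit to at least the lease start time.
	localLimit := timestampFromServerClock
	if localLimit < leaseStart {
		localLimit = leaseStart
	}

	fmt.Println(writeTS <= localLimit) // true: v1 falls in [95, 101], so it is uncertain
}
```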
-- commits, line 93 at r6:
Starting to struggle here to see the difference from #36431. Here's my attempt at improving the comment but I might be slightly off.
nit: I can't find `written_timestamp` anywhere except in comments and I'm not even sure where to find it.
Even if the lease is stable and the timestamp is assigned by the leaseholder, that leaseholder's clock reading only reflects the [written_timestamp] of all of the writes served by the leaseholder (and previous leaseholders) thus far. The `written_timestamp` of a write may be lower than its commit timestamp, but it is the latter that matters. For example:
- client A begins transaction `Put(k, v1)` which is assigned provisional commit timestamp `ts = 95`
- leaseholder serves `Put(k, v1)`, i.e. lays down intent at `written_timestamp = 95`
- client A's transaction performs a write elsewhere and hits a `WriteTooOldError` that bumps its provisional commit timestamp to `ts = 100`
- client A's transaction refreshes to `ts = 100`. This notably happens without involvement of the leaseholder that served the `Put` (this is at the heart of kv: stale read because intent from our past gets committed in our certain future? #36431)
- client A's transaction commits and client A receives the acknowledgment
- client A initiates non-txnal read of `k`
- leaseholder assigns read timestamp `ts = 97`
- asynchronous intent resolution resolves the intent at `k`, moving `v1` to `ts = 100` in the process
- read does not observe `v1`
After this PR, the read is instead assigned a full uncertainty interval `[97, 97+MaxOffset)`. Since the final commit timestamp of the preceding txn (`ts = 100`) could lead the subsequently assigned read timestamp (`ts = 97`) by at most `MaxOffset`, the read is guaranteed to consider `v1` possibly concurrent and will observe it (after a local "restart"), thus preventing #36431 for non-transactional reads.
Note that the above problem is harder to solve in the case of a subsequent transactional read as transaction restarts are orders of magnitude more costly and so we more aggressively want to limit the uncertainty interval using observed timestamps; see #36431.
[written_timestamp]: TODO code link
pkg/roachpb/api.proto, line 2113 at r4 (raw file):
// reference to the initial timestamp, we ensure that a non-transactional
// request's uncertainty interval remains fixed across retries.
util.hlc.Timestamp timestamp_from_server_clock = 25 [
Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.
pkg/roachpb/batch.go, line 78 at r4 (raw file):
now := clock.NowAsClockTimestamp()
ba.Timestamp = now.ToTimestamp()
ba.TimestampFromServerClock = &now
Had to do a double take to see that we're not sharing memory between `Timestamp{,FromServerClock}`. Maybe this is clearer:
now := clock.NowAsClockTimestamp()
ba.TimestampFromServerClock = &now
ba.Timestamp = (*ba.TimestampFromServerClock).ToTimestamp()
Not sure this is really better.
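For what it's worth, the non-aliasing follows from Go's value semantics either way; here is a commented version of the original ordering (comments mine, not the PR's):

```go
now := clock.NowAsClockTimestamp() // local variable holding a ClockTimestamp value
ba.Timestamp = now.ToTimestamp()   // ToTimestamp returns a fresh Timestamp value (a copy)
ba.TimestampFromServerClock = &now // points at `now`, not at ba.Timestamp
// A later mutation of ba.Timestamp (e.g. a server-side retry) leaves
// *ba.TimestampFromServerClock untouched: the two fields share no memory.
```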
Reviewed 19 of 19 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @nvanbenschoten)
Previously, tbg (Tobias Grieger) wrote…
Starting to struggle here to see the difference from #36431. […]
Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. If I understand correctly, at least once all your work is done and #36431 is fixed, when the intent is resolved it will be with a timestamp that instructs the read to apply the global limit (which is `97+MaxOffset`) as opposed to the local limit (which is `97`). So it is incorrect that this PR is already addressing #36431 for the non-txn'al case, rather it establishes parity so that the txn fix will fix non-txn as well.
pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):
ba.Txn = ba.Txn.Clone()
ba.Txn.Refresh(ts)
ba.Timestamp = ba.Txn.ReadTimestamp
// NB: Refresh just updated ReadTimestamp
pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):
"its timestamp"
of what?
pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):
newTimestamp = tErr.RetryTimestamp()
case *roachpb.ReadWithinUncertaintyIntervalError:
We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.
Ah, I see below that there is latching-related code for this case now.
pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):
// are strict conditions that must be met for this to be permitted. For
// non-transactional requests, this is always allowed. If successful, an
// updated BatchRequest will be returned.
Could mention what error is expected if not. I assume we'll propagate the RWUIE.
pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):
// is necessary to preserve real-time ordering for transactions that write
// into the future.
if err := r.Clock().SleepUntil(ctx, ba.Timestamp); err != nil {
nit: should the `ctx` here respect stopper cancellation?
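One plausible shape, assuming the code can reach the store's `*stop.Stopper` (the `r.store.stopper` path and the error return are illustrative, but `WithCancelOnQuiesce` is the stopper's existing helper for tying a ctx to shutdown):

```go
// Derive a ctx that is canceled when the stopper quiesces, so the sleep
// cannot outlive server shutdown.
sleepCtx, cancel := r.store.stopper.WithCancelOnQuiesce(ctx)
defer cancel()
if err := r.Clock().SleepUntil(sleepCtx, ba.Timestamp); err != nil {
	return nil, err // illustrative; the real signature may differ
}
```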
pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):
access = spanset.SpanReadOnly
// For non-locking reads, acquire read latches all the way up to the
Could you remind me why `DefaultDeclareKeys` doesn't do any of this uncertainty stuff?
pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):
if status.State != kvserverpb.LeaseState_VALID {
	// If the lease is invalid, this must be a follower read.
... and so we must use the most pessimistic limit, right?
pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):
// these cases, it needs to perform an uncertainty restart.
//
// For example, the non-transactional request may observe an intent with a
Good place to plop one of the explicit examples that might make it into the commit msg.
pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):
// the cluster's maximum clock skew as the global limit.
//
// It is somewhat non-intuitive that non-transactional requests need uncertainty
reminder to replace with the most up to date prose/examples before merge.
pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):
// We must check uncertainty even if p.ts.Less(p.uncertainty.LocalLimit)
// because the local uncertainty limit cannot be applied to values with
// synthetic timestamps.
I know you're not really changing this but this comment is confusing - we set `checkUncertainty = false` and say "we must still check it"? Having trouble even making a better comment. I don't see any checks for the uncertainty on synthetic values (might've missed it). So maybe just replace this with a TODO referencing a Github issue?
Extracted from cockroachdb#73732, with relevant comments addressed.

This commit adds support for server-side refreshes of `ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization for transactional requests, which now benefit from this new capability to refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction, before they've accumulated any refresh spans.

There's some complexity around supporting this form of server-side retry, because it must be done above latching, instead of below. However, the recent refactoring in cockroachdb#73557 has made this possible to support cleanly.

This is also a prerequisite to giving non-transactional requests uncertainty intervals (cockroachdb#73732), because we don't want ReadWithinUncertaintyIntervalErrors to reach the client for non-transactional requests. Conveniently, because non-transactional requests are always scoped to a single-range, those that hit uncertainty errors will always be able to retry on the server, so these errors will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.
Force-pushed from fec6f22 to 181d586.
Thank you for the detailed review, and apologies for letting this sit for a month. I'm hoping to start pushing on this stuff again.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.
Done.
Previously, tbg (Tobias Grieger) wrote…
extend with an example. I know it's obvious but an example is always good imo.
[…]
Done.
Previously, tbg (Tobias Grieger) wrote…
I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):
[…]
Done in example.
Previously, tbg (Tobias Grieger) wrote…
Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. […]
Done. Thanks for the suggestions about the wording here.
You're correct that this does not address #36431 for the non-txn'al case, but rather ensures that when we do fix #36431, we fix it for both cases.
Regarding the written_timestamp, the concept does not exist yet, which is why some of this is a bit difficult to write. I considered omitting it, but given that we're planning on introducing it shortly in the replacement to #72121, it makes things a lot easier to just pretend it exists for now so that we have something concrete to anchor these discussions on and won't need to rewrite all of this later.
I added references to #72121 in each case where we discuss the written_timestamp concept to make this clearer.
pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// NB: Refresh just updated ReadTimestamp
Done.
pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"its timestamp"
of what?
Done.
pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.
Ah, I see below that there is latching-related code for this case now.
I agree that this should have been a different PR. I was going for that with #73717, but I ended up pulling the uninteresting (and no longer needed) change into a separate PR without pulling the bulk of the complexity out of this one.
I've pulled the changes related to teaching transactional requests to perform server-side refreshes of uncertainty intervals into #75905.
pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could mention what error is expected if not. I assume we'll propagate the RWUIE.
Done.
pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: should the `ctx` here respect stopper cancellation?
Done.
pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you remind me why `DefaultDeclareKeys` doesn't do any of this uncertainty stuff?
`DefaultDeclareKeys` is meant for non-MVCC operations (inline writes) and metadata operations (refresh requests), which is why it also doesn't instruct requests to check the lock table. None of these kinds of operations use uncertainty intervals. I'm not particularly happy with the way we talk about the distinction between these two classes of operations ("isolated vs. not"), but I haven't found a way to clean up the language and concepts used in this area.
pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
... and so we must use the most pessimistic limit, right?
Done.
pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good place to plop one of the explicit examples that might make it into the commit msg.
Done.
pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
reminder to replace with the most up to date prose/examples before merge.
Done.
pkg/roachpb/api.proto, line 2113 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.
Done.
pkg/roachpb/batch.go, line 78 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Had to do a double take to see that we're not sharing memory between `Timestamp{,FromServerClock}`. […]
Added a comment to make this clear.
pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know you're not really changing this but this comment is confusing - we set `checkUncertainty = false` and say "we must still check it"? […]
The comment is trying to contrast `p.ts < local_limit` with `p.ts < global_limit`, but I think it was just wrong. Fixed.
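A simplified model of that contrast, with plain int64 timestamps (this mirrors the semantics described in the uncertainty package, not this file's exact code):

```go
// isUncertain: a value above the read timestamp is uncertain if it falls
// below the appropriate limit. Regular values are judged against the local
// limit; values with synthetic timestamps carry no relationship to the
// leaseholder's clock, so only the global limit can clear them.
func isUncertain(valueTS, readTS, localLimit, globalLimit int64, synthetic bool) bool {
	if valueTS <= readTS {
		return false // already visible, not uncertain
	}
	if synthetic {
		return valueTS <= globalLimit
	}
	return valueTS <= localLimit
}
```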
Partial review due to EOD
Reviewed 1 of 62 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)
-- commits, line 95 at r11:
Is another (contrived but possible) example that the client reads back `k` but that due to a, uh, 10s GC pause, the read is eligible for a follower read? Or do we just not let non-txn'al requests do follower reads (which would be good; no reason for those to exist)?
-- commits, line 103 at r11:
local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).
-- commits, line 111 at r11:
not not
-- commits, line 125 at r11:
nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.
-- commits, line 143 at r11:
Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.
// then it can not trivially bump its timestamp without dropping and
// re-acquiring those latches. Doing so could allow the request to read at an
// unprotected timestamp. We only look at global latch spans because local
// latch spans always use unbounded (NonMVCC) timestamps.
Is this enforced somehow?
// See https://github.com/cockroachdb/cockroach/issues/58459.
if txn == nil {
	return Interval{}

func ComputeInterval(
Might be worth pointing out that this can return an empty `Interval{}`, meaning that the request is explicitly bypassing uncertainty restarts.
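Roughly, at a call site where `in` is the result of `ComputeInterval`, the sentinel would be checked like this (a sketch; names assumed, not the PR's exact code):

```go
if in == (uncertainty.Interval{}) {
	// Empty interval: the request explicitly bypasses uncertainty
	// restarts, e.g. a non-transactional request that supplied its own
	// timestamp (and therefore makes no real-time ordering guarantee).
}
```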
//
// Unlike transactions, which establish an uncertainty interval on their
// coordinator node during initialization, non-transactional requests receive
// uncertainty intervals from their target leaseholder, using a clock reading
Since this is a choice, it makes sense to motivate it: we have all reason to believe that this means the leaseholder is choosing the timestamp, which means we get a relevant observed timestamp "for free".
// single-range, those that hit uncertainty errors can always retry on the
// server, so these errors never bubble up to the client that initiated the
// request.
var D7 = roachpb.Header{}.TimestampFromServerClock
Didn't read this comment in detail because I assume it mirrors exactly what is in the PR message and will be synced before merge.
require.NotNil(t, ba.TimestampFromServerClock)
nonTxnOrigTsC <- ba.Timestamp
// Wait for the test to give the go-ahead.
select {
Could you put a DefaultSucceedsSoon timeout here and on all other selects. These kinds of tests tend to deadlock.
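Something like this for each select, assuming the helpers in `pkg/testutils` (the channel name follows the snippet above):

```go
select {
case <-waitC: // test gives the go-ahead
case <-time.After(testutils.DefaultSucceedsSoonDuration):
	t.Fatal("timed out waiting for go-ahead; likely deadlock")
}
```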
(I assume you've stressed this a bit to shake out flakes)
@@ -183,7 +183,7 @@ func maybeStripInFlightWrites(ba *roachpb.BatchRequest) (*roachpb.BatchRequest,
 // works for batches that exclusively contain writes; reads cannot be bumped
 // like this because they've already acquired timestamp-aware latches.
 func maybeBumpReadTimestampToWriteTimestamp(
-	ctx context.Context, ba *roachpb.BatchRequest, latchSpans *spanset.SpanSet,
+	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard,
can `g` be nil?
// can change due to server-side uncertainty retries. By remembering a stable
// reference to the initial timestamp, we ensure that a non-transactional
// request's uncertainty interval remains fixed across retries.
util.hlc.Timestamp timestamp_from_server_clock = 27 [
Remind me if there was a migration to worry about here? Is anything going to break cross-version due to the use of different fields?
Force-pushed from 181d586 to 7078527.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
Is another (contrived but possible) example that the client reads back `k` but that due to a, uh, 10s GC pause, the read is eligible for a follower read? […]
Right, we don't allow non-transactional reads that have server-assigned timestamps to be served by followers for this reason. See:
cockroach/pkg/kv/kvserver/replica_follower_read.go
Lines 35 to 57 in 4a741cd
// BatchCanBeEvaluatedOnFollower determines if a batch consists exclusively of
// requests that can be evaluated on a follower replica, given a sufficiently
// advanced closed timestamp.
func BatchCanBeEvaluatedOnFollower(ba roachpb.BatchRequest) bool {
	// Explanation of conditions:
	// 1. the batch cannot have or intend to receive a timestamp set from a
	//    server-side clock. If a follower with a lagging clock sets its timestamp
	//    and this then allows the follower to evaluate the batch as a follower
	//    read, then the batch might miss past writes served at higher timestamps
	//    on the leaseholder.
	// 2. each request in the batch needs to be "transactional", because those are
	//    the only ones that have clearly defined semantics when served under the
	//    closed timestamp.
	// 3. the batch needs to be read-only, because a follower replica cannot
	//    propose writes to Raft.
	// 4. the batch needs to be non-locking, because unreplicated locks are only
	//    held on the leaseholder.
	tsFromServerClock := ba.Txn == nil && (ba.Timestamp.IsEmpty() || ba.TimestampFromServerClock)
	if tsFromServerClock {
		return false
	}
	return ba.IsAllTransactional() && ba.IsReadOnly() && !ba.IsLocking()
}
Previously, tbg (Tobias Grieger) wrote…
local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).
Yes, updated.
Previously, tbg (Tobias Grieger) wrote…
not not
Done.
Previously, tbg (Tobias Grieger) wrote…
nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.
Done.
Previously, tbg (Tobias Grieger) wrote…
Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.
Done.
Previously, tbg (Tobias Grieger) wrote…
[…]
Done.
pkg/kv/kvserver/replica_batch_updates.go, line 186 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
can `g` be nil?
It can, but it never is given a nil `g`.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 671 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this enforced somehow?
Yes, see https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/spanlatch/manager.go#L423-L432 and
cockroach/pkg/kv/kvserver/spanset/spanset.go
Lines 192 to 195 in 4a741cd
if keys.IsLocal(span.Key) {
	scope = SpanLocal
	timestamp = hlc.Timestamp{}
}
pkg/kv/kvserver/uncertainty/compute.go, line 61 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might be worth pointing out that this can return an empty `Interval{}`, meaning that the request is explicitly bypassing uncertainty restarts.
Done.
pkg/kv/kvserver/uncertainty/doc.go, line 208 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Since this is a choice, it makes sense to motivate it: we have all reason to believe that this means the leaseholder is choosing the timestamp, which means we get a relevant observed timestamp "for free".
Good point, done.
pkg/roachpb/api.proto, line 2149 at r11 (raw file):
No concerns, discussed in commit message.
There are no backwards compatibility concerns with this change because the previous field was only consulted on the same node that set it. It was not consulted across RPC boundaries.
pkg/kv/kvserver/client_replica_test.go, line 734 at r11 (raw file):
Done.
(I assume you've stressed this a bit to shake out flakes)
Yes, stressed for some time.
Reviewed 7 of 62 files at r8, 53 of 53 files at r12, 8 of 8 files at r13, 34 of 34 files at r14, 15 of 15 files at r15, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in cockroachdb#72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in cockroachdb#72121 (comment), I would like to remove the ability to downcast a "data-plane" `Timestamp` to a "control-plane" `ClockTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on cockroachdb#73732.
Force-pushed from 7078527 to 676a867.
75905: kv: support server-side refreshes of uncertainty errors r=nvanbenschoten a=nvanbenschoten

Extracted from #73732, with relevant comments addressed.

This commit adds support for server-side refreshes of `ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization for transactional requests, which now benefit from this new capability to refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction, before they've accumulated any refresh spans.

There's some complexity around supporting this form of server-side retry, because it must be done above latching, instead of below. However, the recent refactoring in #73557 has made this possible to support cleanly. Specifically, we now handle `ReadWithinUncertaintyIntervalError` as a concurrency error in the `executeWithConcurrencyRetries` retry loop. This is different from other server-side retries, which are hit during writes and can be handled without releasing latches. This difference stems from the difference in how read and write latches behave. Write latches protect their MVCC timestamp and any later time. Meanwhile, read latches protect their MVCC timestamp and any earlier time. This means that a request holding read latches that hits an uncertainty error can't refresh without dropping those latches and acquiring new ones.

This is also a prerequisite to giving non-transactional requests uncertainty intervals (#73732), because we don't want ReadWithinUncertaintyIntervalErrors to reach the client for non-transactional requests. Conveniently, because non-transactional requests are always scoped to a single-range, those that hit uncertainty errors will always be able to retry on the server, so these errors will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
…tamp

This commit replaces the boolean `timestamp_from_server_clock` flag on the `BatchRequest` header with a nullable `ClockTimestamp` field. Doing so allows for the use of the field to be expanded in a following commit.

The field will soon be needed to record the time at which the request was received by the server node. The operation timestamp cannot serve this role because it can change due to server-side uncertainty retries. By remembering a stable reference to the initial timestamp, we ensure that a non-transactional request's uncertainty interval remains fixed across retries.

There are no backwards compatibility concerns with this change because the previous field was only consulted on the same node that set it. It was not consulted across RPC boundaries.
This isn't needed until the next commit, but it's enough noise and code movement that it's better for it to live on its own.
Force-pushed from 676a867 to d676d7e.
bors r+
Build succeeded.
Fixes #58459.
Informs #36431.
This commit fixes a long-standing correctness issue where non-transactional requests did not ensure single-key linearizability even if they deferred their timestamp allocation to the leaseholder of their (single) range. They still don't entirely, because of #36431, but this change brings us one step closer to the fix we plan to land for #36431 also applying to non-transactional requests.
The change addresses this by giving non-transactional requests uncertainty intervals. This ensures that they also guarantee single-key linearizability even with only loose (but bounded) clock synchronization. Non-transactional requests that use a client-provided timestamp do not have uncertainty intervals and do not make real-time ordering guarantees.
Unlike transactions, which establish an uncertainty interval on their coordinator node during initialization, non-transactional requests receive uncertainty intervals from their target leaseholder, using a clock reading from the leaseholder's local HLC as the local limit and this clock reading + the cluster's maximum clock skew as the global limit.
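A small model of that construction, using int64 nanos in place of HLC timestamps (all names here are illustrative, not the PR's):

```go
import "time"

// interval is a simplified stand-in for the uncertainty interval.
type interval struct{ localLimit, globalLimit int64 }

// newNonTxnInterval models the construction described above: one clock
// reading from the leaseholder's HLC supplies the local limit, and that
// reading plus the cluster's maximum clock offset supplies the global limit.
func newNonTxnInterval(leaseholderNowNanos int64, maxOffset time.Duration) interval {
	return interval{
		localLimit:  leaseholderNowNanos,
		globalLimit: leaseholderNowNanos + maxOffset.Nanoseconds(),
	}
}
```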
It is somewhat non-intuitive that non-transactional requests need uncertainty intervals — after all, they receive their timestamp from the leaseholder of the only range that they talk to, so isn't every value with a commit timestamp above their read timestamp certainly concurrent? The answer is surprisingly "no" for the following reasons, so they cannot forgo the use of uncertainty intervals:
1. the request timestamp is allocated before consulting the replica's lease. This means that there are times when the replica is not the leaseholder at the point of timestamp allocation, and only becomes the leaseholder later. In such cases, the timestamp assigned to the request is not guaranteed to be greater than the written_timestamp of all writes served by the range at the time of allocation. This is true despite invariants 1 & 2 presented in `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the timestamp is not yet the leaseholder.

In cases where the replica that assigned the non-transactional request's timestamp takes over as the leaseholder after the timestamp allocation, we expect minimumLocalLimitForLeaseholder to forward the local uncertainty limit above TimestampFromServerClock, to the lease start time.
For example, consider the following series of events:

- client A writes k = v1
- leaseholder writes v1 at ts = 100
- client A receives ack for write
- client B wants to read k using a non-txn request
- follower replica with slower clock receives non-txn request
- follower replica assigns request ts = 95
- lease transferred to follower replica with lease start time = 101
- non-txn request must use 101 as limit of uncertainty interval to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1
2. even if the replica's lease is stable and the timestamp is assigned to the non-transactional request by the leaseholder, the assigned clock reading only reflects the written_timestamp of all of the writes served by the leaseholder (and previous leaseholders) thus far. This clock reading is not guaranteed to lead the commit timestamp of all of these writes, especially if they are committed remotely and resolved after the request has received its clock reading but before the request begins evaluating.
As a result, the non-transactional request needs an uncertainty interval with a global uncertainty limit far enough in advance of the leaseholder's local HLC clock to ensure that it considers any value that was part of a transaction which could have committed before the request was received by the leaseholder to be uncertain. Concretely, the non-transactional request needs to consider values of the following form to be uncertain:
`written_timestamp < local_limit && commit_timestamp < global_limit`
The value that the non-transactional request is observing may have been written on the local leaseholder at time 10, its transaction may have been committed remotely at time 20, acknowledged, then the non-transactional request may have begun and received a timestamp of 15 from the local leaseholder, then finally the value may have been resolved asynchronously and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20). The failure of the non-transactional request to observe this value would be a stale read.
For example, consider the following series of events:

- client A begins a txn and is assigned provisional commit timestamp = 95
- client A's txn performs a Put(k, v1)
- leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
- client A's txn performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to 100
- client A's txn refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of #36431), so that leaseholder's clock is not updated
- client A's txn commits remotely and client A receives the acknowledgment
- client B initiates non-txn read of k
- leaseholder assigns read timestamp ts = 97
- asynchronous intent resolution resolves the txn's intent at k, moving v1 to ts = 100 in the process
- non-txn request must use an uncertainty interval that extends past 100 to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read
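Plugging the example's numbers into the uncertainty condition above gives a runnable check (the 500-tick `maxOffset` is an assumed value standing in for the cluster's maximum clock skew):

```go
package main

import "fmt"

func main() {
	// Values from the example above (abstract ticks rather than real nanos).
	writtenTimestamp := int64(95) // intent laid down at 95
	commitTimestamp := int64(100) // txn refreshed and committed at 100
	localLimit := int64(97)       // leaseholder clock reading = read timestamp
	maxOffset := int64(500)       // assumed maximum clock skew
	globalLimit := localLimit + maxOffset

	// The condition from the commit message:
	uncertain := writtenTimestamp < localLimit && commitTimestamp < globalLimit
	fmt.Println(uncertain) // true: the read must restart and observe v1
}
```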
This change is related to #36431 in two ways. First, it allows non-transactional requests to benefit from our incoming fix for that issue. Second, it unblocks some of the clock refactors proposed in #72121 (comment), and by extension #72663. Even though there are clear bugs today, I still don't feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make this change. We know from #36431 that invariant 1 from `uncertainty.D6` doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send` masks the bugs in this area in many cases. Removing that clock update (I don't actually plan to remove it, but I plan to disconnect it entirely from operation timestamps) without first giving non-transactional requests uncertainty intervals seems like it may reveal these bugs in ways we haven't seen in the past. So I'd like to land this fix before making that change.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.