kv: give non-transactional requests uncertainty intervals #73732
Force-pushed from 5c5e4b1 to 5264c34.
Haven't reviewed the main commit yet, but I did review its commit message.
Reviewed 4 of 4 files at r1, 1 of 1 files at r2, 34 of 34 files at r3, 9 of 9 files at r4, 34 of 34 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)
-- commits, line 85 at r6:
what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.
-- commits, line 86 at r6:
extend with an example. I know it's obvious but an example is always good imo.
- client A wants to write `k := v1`
- leaseholder writes `v1` at `ts = 100`
- client A receives ack for write
- client B wants to read `k`
- follower receives non-txn'al request, assigns timestamp `ts = 95`
- follower becomes leader (say, at lease start `ts = 101`)
- non-txn'al request does not return `v1`
-- commits, line 88 at r6:
I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):
With this PR, when B wants to read `k` after `v1` has been written, the non-txn'al request will similarly start out with `TimestampFromServerClock = 95`, but will have its local uncertainty limit forwarded via `minimumLocalLimitForLeaseholder` to at least the new lease start time `t=101`, and is thus guaranteed to observe all writes served by the old leaseholder, including `v1` (which has `ts = 100`, which is covered by `[95,101]`).
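To pin the forwarding down with the example's numbers, here is a minimal runnable sketch using plain int64 timestamps; `minimumLocalLimitForLeaseholder` and `TimestampFromServerClock` are names from the PR, everything else is illustrative:

```go
package main

import "fmt"

func main() {
	timestampFromServerClock := int64(95) // assigned by the (then) follower
	leaseStart := int64(101)              // start time of the lease it later acquires
	writeTS := int64(100)                 // v1's timestamp

	// What minimumLocalLimitForLeaseholder is described as doing: forward
	// the local uncertainty limit to at least the lease start time.
	localLimit := timestampFromServerClock
	if localLimit < leaseStart {
		localLimit = leaseStart
	}

	fmt.Println(writeTS <= localLimit) // true: v1 falls in [95, 101], so it is uncertain
}
```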
-- commits, line 93 at r6:
Starting to struggle here to see the difference from #36431. Here's my attempt at improving the comment but I might be slightly off.
nit: I can't find `written_timestamp` anywhere except in comments and I'm not even sure where to find it.
Even if the lease is stable and the timestamp is assigned by the leaseholder, that leaseholder's clock reading only reflects the [written_timestamp] of all of the writes served by the leaseholder (and previous leaseholders) thus far. The `written_timestamp` of a write may be lower than its commit timestamp, but it is the latter that matters. For example:
- client A begins transaction `Put(k, v1)` which is assigned provisional commit timestamp `ts = 95`
- leaseholder serves `Put(k, v1)`, i.e. lays down intent at `written_timestamp = 95`
- client A's transaction performs a write elsewhere and hits a `WriteTooOldError` that bumps its provisional commit timestamp to `ts = 100`
- client A's transaction refreshes to `ts = 100`. This notably happens without involvement of the leaseholder that served the `Put` (this is at the heart of kv: stale read because intent from our past gets committed in our certain future? #36431)
- client A's transaction commits and client A receives the acknowledgment
- client A initiates non-txnal read of `k`
- leaseholder assigns read timestamp `ts = 97`
- asynchronous intent resolution resolves the intent at `k`, moving `v1` to `ts = 100` in the process
- read does not observe `v1`
After this PR, the read is instead assigned a full uncertainty interval `[97, 97+MaxOffset)`. Since the final commit timestamp of the preceding txn (`ts = 100`) could lead the subsequently assigned read timestamp (`ts = 97`) by at most `MaxOffset`, the read is guaranteed to consider `v1` possibly concurrent and will observe it (after a local "restart"), thus preventing #36431 for non-transactional reads.
Note that the above problem is harder to solve in the case of a subsequent transactional read as transaction restarts are orders of magnitude more costly and so we more aggressively want to limit the uncertainty interval using observed timestamps; see #36431.
[written_timestamp]: TODO code link
pkg/roachpb/api.proto, line 2113 at r4 (raw file):
// reference to the initial timestamp, we ensure that a non-transactional
// request's uncertainty interval remains fixed across retries.
util.hlc.Timestamp timestamp_from_server_clock = 25 [
Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.
pkg/roachpb/batch.go, line 78 at r4 (raw file):
now := clock.NowAsClockTimestamp()
ba.Timestamp = now.ToTimestamp()
ba.TimestampFromServerClock = &now
Had to do a double take to see that we're not sharing memory between `Timestamp{,FromServerClock}`. Maybe this is clearer:
now := clock.NowAsClockTimestamp()
ba.TimestampFromServerClock = &now
ba.Timestamp = (*ba.TimestampFromServerClock).ToTimestamp()
Not sure this is really better.
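For what it's worth, the non-aliasing follows from Go's value semantics either way; here is a commented version of the original ordering (comments mine, not the PR's):

```go
now := clock.NowAsClockTimestamp() // local variable holding a ClockTimestamp value
ba.Timestamp = now.ToTimestamp()   // ToTimestamp returns a fresh Timestamp value (a copy)
ba.TimestampFromServerClock = &now // points at `now`, not at ba.Timestamp
// A later mutation of ba.Timestamp (e.g. a server-side retry) leaves
// *ba.TimestampFromServerClock untouched: the two fields share no memory.
```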
Reviewed 19 of 19 files at r6, 1 of 1 files at r7, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @nvanbenschoten)
Previously, tbg (Tobias Grieger) wrote…
Starting to struggle here to see the difference from #36431. […]
Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. If I understand correctly, at least once all your work is done and #36431 is fixed, when the intent is resolved it will be with a timestamp that instructs the read to apply the global limit (which is `97+MaxOffset`) as opposed to the local limit (which is `97`). So it is incorrect that this PR is already addressing #36431 for the non-txn'al case, rather it establishes parity so that the txn fix will fix non-txn as well.
pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):
ba.Txn = ba.Txn.Clone()
ba.Txn.Refresh(ts)
ba.Timestamp = ba.Txn.ReadTimestamp
// NB: Refresh just updated ReadTimestamp
pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):
"its timestamp"
of what?
pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):
newTimestamp = tErr.RetryTimestamp()
case *roachpb.ReadWithinUncertaintyIntervalError:
We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.
Ah, I see below that there is latching-related code for this case now.
pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):
// are strict conditions that must be met for this to be permitted. For
// non-transactional requests, this is always allowed. If successful, an
// updated BatchRequest will be returned.
Could mention what error is expected if not. I assume we'll propagate the RWUIE.
pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):
// is necessary to preserve real-time ordering for transactions that write
// into the future.
if err := r.Clock().SleepUntil(ctx, ba.Timestamp); err != nil {
nit: should the `ctx` here respect stopper cancellation?
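One plausible shape, assuming the code can reach the store's `*stop.Stopper` (the `r.store.stopper` path and the error return are illustrative, but `WithCancelOnQuiesce` is the stopper's existing helper for tying a ctx to shutdown):

```go
// Derive a ctx that is canceled when the stopper quiesces, so the sleep
// cannot outlive server shutdown.
sleepCtx, cancel := r.store.stopper.WithCancelOnQuiesce(ctx)
defer cancel()
if err := r.Clock().SleepUntil(sleepCtx, ba.Timestamp); err != nil {
	return nil, err // illustrative; the real signature may differ
}
```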
pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):
access = spanset.SpanReadOnly
// For non-locking reads, acquire read latches all the way up to the
Could you remind me why `DefaultDeclareKeys` doesn't do any of this uncertainty stuff?
pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):
if status.State != kvserverpb.LeaseState_VALID {
	// If the lease is invalid, this must be a follower read.
... and so we must use the most pessimistic limit, right?
pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):
// these cases, it needs to perform an uncertainty restart.
//
// For example, the non-transactional request may observe an intent with a
Good place to plop one of the explicit examples that might make it into the commit msg.
pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):
// the cluster's maximum clock skew as the global limit.
//
// It is somewhat non-intuitive that non-transactional requests need uncertainty
reminder to replace with the most up to date prose/examples before merge.
pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):
// We must check uncertainty even if p.ts.Less(p.uncertainty.LocalLimit)
// because the local uncertainty limit cannot be applied to values with
// synthetic timestamps.
I know you're not really changing this but this comment is confusing - we set `checkUncertainty = false` and say "we must still check it"? Having trouble even making a better comment. I don't see any checks for the uncertainty on synthetic values (might've missed it). So maybe just replace this with a TODO referencing a Github issue?
Extracted from cockroachdb#73732, with relevant comments addressed.

This commit adds support for server-side refreshes of `ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization for transactional requests, which now benefit from this new capability to refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction, before they've accumulated any refresh spans.

There's some complexity around supporting this form of server-side retry, because it must be done above latching, instead of below. However, the recent refactoring in cockroachdb#73557 has made this possible to support cleanly.

This is also a prerequisite to giving non-transactional requests uncertainty intervals (cockroachdb#73732), because we don't want ReadWithinUncertaintyIntervalErrors to reach the client for non-transactional requests. Conveniently, because non-transactional requests are always scoped to a single-range, those that hit uncertainty errors will always be able to retry on the server, so these errors will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.
Force-pushed from fec6f22 to 181d586.
Thank you for the detailed review, and apologies for letting this sit for a month. I'm hoping to start pushing on this stuff again.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
what are "invariants 1 & 2"? Are you referring to the "After all, ..." sentence? Could be clearer.
Done.
Previously, tbg (Tobias Grieger) wrote…
extend with an example. I know it's obvious but an example is always good imo.
[…]
Done.
Previously, tbg (Tobias Grieger) wrote…
I wasn't clear on whether this describes the pre-PR or post-PR state. It's post-PR, consider clarifying (and now it helps to have the example):
[…]
Done in example.
Previously, tbg (Tobias Grieger) wrote…
Having reviewed bits of the commit now, the example should also be explicit about the global vs local limit. […]
Done. Thanks for the suggestions about the wording here.
You're correct that this does not address #36431 for the non-txn'al case, but rather ensures that when we do fix #36431, we fix it for both cases.
Regarding the written_timestamp, the concept does not exist yet, which is why some of this is a bit difficult to write. I considered omitting it, but given that we're planning on introducing it shortly in the replacement to #72121, it makes things a lot easier to just pretend it exists for now so that we have something concrete to anchor these discussions on and won't need to rewrite all of this later.
I added references to #72121 in each case where we discuss the written_timestamp concept to make this clearer.
pkg/kv/kvserver/replica_batch_updates.go, line 243 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
// NB: Refresh just updated ReadTimestamp
Done.
pkg/kv/kvserver/replica_evaluate.go, line 526 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
"its timestamp"
of what?
Done.
pkg/kv/kvserver/replica_evaluate.go, line 609 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We're teaching txns a new trick here, right? Intentional? Or do you want to be prudent and make this a separate PR. I'll defer to your level of confidence on how likely this is to be worth it. Or are we not changing behavior here because we're holding latches? It's hard to know from just looking at the diff. Perhaps some version of the comment in this file that you removed could still be useful.
Ah, I see below that there is latching-related code for this case now.
I agree that this should have been a different PR. I was going for that with #73717, but I ended up pulling the uninteresting (and no longer needed) change into a separate PR without pulling the bulk of the complexity out of this one.
I've pulled the changes related to teaching transactional requests to perform server-side refreshes of uncertainty intervals into #75905.
pkg/kv/kvserver/replica_send.go, line 481 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could mention what error is expected if not. I assume we'll propagate the RWUIE.
Done.
pkg/kv/kvserver/replica_send.go, line 731 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: should the `ctx` here respect stopper cancellation?
Done.
pkg/kv/kvserver/batcheval/declare.go, line 57 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Could you remind me why `DefaultDeclareKeys` doesn't do any of this uncertainty stuff?
`DefaultDeclareKeys` is meant for non-MVCC operations (inline writes) and metadata operations (refresh requests), which is why it also doesn't instruct requests to check the lock table. None of these kinds of operations use uncertainty intervals. I'm not particularly happy with the way we talk about the distinction between these two classes of operations ("isolated vs. not"), but I haven't found a way to clean up the language and concepts used in this area.
pkg/kv/kvserver/uncertainty/compute.go, line 79 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
... and so we must use the most pessimistic limit, right?
Done.
pkg/kv/kvserver/uncertainty/compute.go, line 126 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Good place to plop one of the explicit examples that might make it into the commit msg.
Done.
pkg/kv/kvserver/uncertainty/doc.go, line 212 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
reminder to replace with the most up to date prose/examples before merge.
Done.
pkg/roachpb/api.proto, line 2113 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Any backwards compatibility concerns here since new nodes will stop interpreting the bool? I assume not, but please explain why not in the commit msg.
Done.
pkg/roachpb/batch.go, line 78 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Had to do a double take to see that we're not sharing memory between `Timestamp{,FromServerClock}`. […]
Added a comment to make this clear.
pkg/storage/pebble_mvcc_scanner.go, line 217 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know you're not really changing this but this comment is confusing - we set `checkUncertainty = false` and say "we must still check it"? […]
The comment is trying to contrast `p.ts < local_limit` with `p.ts < global_limit`, but I think it was just wrong. Fixed.
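A simplified model of that contrast, with plain int64 timestamps (this mirrors the semantics described in the uncertainty package, not this file's exact code):

```go
// isUncertain: a value above the read timestamp is uncertain if it falls
// below the appropriate limit. Regular values are judged against the local
// limit; values with synthetic timestamps carry no relationship to the
// leaseholder's clock, so only the global limit can clear them.
func isUncertain(valueTS, readTS, localLimit, globalLimit int64, synthetic bool) bool {
	if valueTS <= readTS {
		return false // already visible, not uncertain
	}
	if synthetic {
		return valueTS <= globalLimit
	}
	return valueTS <= localLimit
}
```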
Partial review due to EOD
Reviewed 1 of 62 files at r8, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, @nvanbenschoten, and @tbg)
-- commits, line 95 at r11:
Is another (contrived but possible) example that the client reads back `k` but that due to a, uh, 10s GC pause, the read is eligible for a follower read? Or do we just not let non-txn'al requests do follower reads (which would be good; no reason for those to exist)?
-- commits, line 103 at r11:
local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).
-- commits, line 111 at r11:
not not
-- commits, line 125 at r11:
nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.
-- commits, line 143 at r11:
Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.
// then it can not trivially bump its timestamp without dropping and
// re-acquiring those latches. Doing so could allow the request to read at an
// unprotected timestamp. We only look at global latch spans because local
// latch spans always use unbounded (NonMVCC) timestamps.
Is this enforced somehow?
// See https://github.com/cockroachdb/cockroach/issues/58459.
if txn == nil {
	return Interval{}

func ComputeInterval(
Might be worth pointing out that this can return an empty `Interval{}`, meaning that the request is explicitly bypassing uncertainty restarts.
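Roughly, at a call site where `in` is the result of `ComputeInterval`, the sentinel would be checked like this (a sketch; names assumed, not the PR's exact code):

```go
if in == (uncertainty.Interval{}) {
	// Empty interval: the request explicitly bypasses uncertainty
	// restarts, e.g. a non-transactional request that supplied its own
	// timestamp (and therefore makes no real-time ordering guarantee).
}
```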
//
// Unlike transactions, which establish an uncertainty interval on their
// coordinator node during initialization, non-transactional requests receive
// uncertainty intervals from their target leaseholder, using a clock reading
Since this is a choice, it makes sense to motivate it: we have all reason to believe that this means the leaseholder is choosing the timestamp, which means we get a relevant observed timestamp "for free".
// single-range, those that hit uncertainty errors can always retry on the
// server, so these errors never bubble up to the client that initiated the
// request.
var D7 = roachpb.Header{}.TimestampFromServerClock
Didn't read this comment in detail because I assume it mirrors exactly what is in the PR message and will be synced before merge.
require.NotNil(t, ba.TimestampFromServerClock)
nonTxnOrigTsC <- ba.Timestamp
// Wait for the test to give the go-ahead.
select {
Could you put a DefaultSucceedsSoon timeout here and on all other selects. These kinds of tests tend to deadlock.
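Something like this for each select, assuming the helpers in `pkg/testutils` (the channel name follows the snippet above):

```go
select {
case <-waitC: // test gives the go-ahead
case <-time.After(testutils.DefaultSucceedsSoonDuration):
	t.Fatal("timed out waiting for go-ahead; likely deadlock")
}
```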
(I assume you've stressed this a bit to shake out flakes)
@@ -183,7 +183,7 @@ func maybeStripInFlightWrites(ba *roachpb.BatchRequest) (*roachpb.BatchRequest,
 // works for batches that exclusively contain writes; reads cannot be bumped
 // like this because they've already acquired timestamp-aware latches.
 func maybeBumpReadTimestampToWriteTimestamp(
-	ctx context.Context, ba *roachpb.BatchRequest, latchSpans *spanset.SpanSet,
+	ctx context.Context, ba *roachpb.BatchRequest, g *concurrency.Guard,
can `g` be nil?
// can change due to server-side uncertainty retries. By remembering a stable
// reference to the initial timestamp, we ensure that a non-transactional
// request's uncertainty interval remains fixed across retries.
util.hlc.Timestamp timestamp_from_server_clock = 27 [
Remind me if there was a migration to worry about here? Is anything going to break cross-version due to the use of different fields?
Force-pushed from 181d586 to 7078527.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
Previously, tbg (Tobias Grieger) wrote…
Is another (contrived but possible) example that the client reads back `k` but that due to a, uh, 10s GC pause, the read is eligible for a follower read? […]
Right, we don't allow non-transactional reads that have server-assigned timestamps to be served by followers for this reason. See:
cockroach/pkg/kv/kvserver/replica_follower_read.go
Lines 35 to 57 in 4a741cd
// BatchCanBeEvaluatedOnFollower determines if a batch consists exclusively of
// requests that can be evaluated on a follower replica, given a sufficiently
// advanced closed timestamp.
func BatchCanBeEvaluatedOnFollower(ba roachpb.BatchRequest) bool {
	// Explanation of conditions:
	// 1. the batch cannot have or intend to receive a timestamp set from a
	//    server-side clock. If a follower with a lagging clock sets its timestamp
	//    and this then allows the follower to evaluate the batch as a follower
	//    read, then the batch might miss past writes served at higher timestamps
	//    on the leaseholder.
	// 2. each request in the batch needs to be "transactional", because those are
	//    the only ones that have clearly defined semantics when served under the
	//    closed timestamp.
	// 3. the batch needs to be read-only, because a follower replica cannot
	//    propose writes to Raft.
	// 4. the batch needs to be non-locking, because unreplicated locks are only
	//    held on the leaseholder.
	tsFromServerClock := ba.Txn == nil && (ba.Timestamp.IsEmpty() || ba.TimestampFromServerClock)
	if tsFromServerClock {
		return false
	}
	return ba.IsAllTransactional() && ba.IsReadOnly() && !ba.IsLocking()
}
Previously, tbg (Tobias Grieger) wrote…
local limit, right? I.e. instead of the pessimistic ts+500ms, we get to use 101 (which presumably is better in this example).
Yes, updated.
Previously, tbg (Tobias Grieger) wrote…
not not
Done.
Previously, tbg (Tobias Grieger) wrote…
nit: use timestamps from the example, then the prose can prime readers and make the example easier to follow.
Done.
Previously, tbg (Tobias Grieger) wrote…
Nit for all of these examples: why isn't client A doing both (i.e. remove client B)? The examples should drive home that you get stale reads, right? Now we always have to say "B causally relates to A" or something like that.
Done.
Previously, tbg (Tobias Grieger) wrote…
[…]
Done.
pkg/kv/kvserver/replica_batch_updates.go, line 186 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
can `g` be nil?
It can, but it never is given a nil `g`.
pkg/kv/kvserver/concurrency/concurrency_manager.go, line 671 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this enforced somehow?
Yes, see https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/spanlatch/manager.go#L423-L432 and
cockroach/pkg/kv/kvserver/spanset/spanset.go
Lines 192 to 195 in 4a741cd
if keys.IsLocal(span.Key) {
	scope = SpanLocal
	timestamp = hlc.Timestamp{}
}
pkg/kv/kvserver/uncertainty/compute.go, line 61 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Might be worth pointing out that this can return an empty `Interval{}`, meaning that the request is explicitly bypassing uncertainty restarts.
Done.
pkg/kv/kvserver/uncertainty/doc.go, line 208 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Since this is a choice, it makes sense to motivate it: we have all reason to believe that this means the leaseholder is choosing the timestamp, which means we get a relevant observed timestamp "for free".
Good point, done.
pkg/roachpb/api.proto, line 2149 at r11 (raw file):
No concerns, discussed in commit message.
There are no backwards compatibility concerns with this change because the previous field was only consulted on the same node that set it. It was not consulted across RPC boundaries.
pkg/kv/kvserver/client_replica_test.go, line 734 at r11 (raw file):
Done.
(I assume you've stressed this a bit to shake out flakes)
Yes, stressed for some time.
Reviewed 7 of 62 files at r8, 53 of 53 files at r12, 8 of 8 files at r13, 34 of 34 files at r14, 15 of 15 files at r15, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @irfansharif, and @tbg)
This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in cockroachdb#72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in cockroachdb#72121 (comment), I would like to remove the ability to downcast a "data-plane" `Timestamp` to a "control-plane" `ClockTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on cockroachdb#73732.
Force-pushed from 7078527 to 676a867.
75905: kv: support server-side refreshes of uncertainty errors r=nvanbenschoten a=nvanbenschoten

Extracted from #73732, with relevant comments addressed.

This commit adds support for server-side refreshes of `ReadWithinUncertaintyIntervalError`s. This serves as a performance optimization for transactional requests, which now benefit from this new capability to refresh away `ReadWithinUncertaintyIntervalErrors` early in their transaction, before they've accumulated any refresh spans.

There's some complexity around supporting this form of server-side retry, because it must be done above latching, instead of below. However, the recent refactoring in #73557 has made this possible to support cleanly. Specifically, we now handle `ReadWithinUncertaintyIntervalError` as a concurrency error in the `executeWithConcurrencyRetries` retry loop. This is different from other server-side retries, which are hit during writes and can be handled without releasing latches. This difference stems from the difference in how read and write latches behave. Write latches protect their MVCC timestamp and any later time. Meanwhile, read latches protect their MVCC timestamp and any earlier time. This means that a request holding read latches that hits an uncertainty error can't refresh without dropping those latches and acquiring new ones.

This is also a prerequisite to giving non-transactional requests uncertainty intervals (#73732), because we don't want ReadWithinUncertaintyIntervalErrors to reach the client for non-transactional requests. Conveniently, because non-transactional requests are always scoped to a single-range, those that hit uncertainty errors will always be able to retry on the server, so these errors will never bubble up to the client that initiated the request.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
…tamp

This commit replaces the boolean `timestamp_from_server_clock` flag on the `BatchRequest` header with a nullable `ClockTimestamp` field. Doing so allows for the use of the field to be expanded in a following commit.

The field will soon be needed to record the time at which the request was received by the server node. The operation timestamp cannot serve this role because it can change due to server-side uncertainty retries. By remembering a stable reference to the initial timestamp, we ensure that a non-transactional request's uncertainty interval remains fixed across retries.

There are no backwards compatibility concerns with this change because the previous field was only consulted on the same node that set it. It was not consulted across RPC boundaries.
This isn't needed until the next commit, but it's enough noise and code movement that it's better for it to live on its own.
Force-pushed from 676a867 to d676d7e.
bors r+
Build succeeded.
Fixes #58459.
Informs #36431.
This commit fixes a long-standing correctness issue where non-transactional requests did not ensure single-key linearizability even if they deferred their timestamp allocation to the leaseholder of their (single) range. They still don't entirely, because of #36431, but this change brings us one step closer to the fix we plan to land for #36431 also applying to non-transactional requests.
The change addresses this by giving non-transactional requests uncertainty intervals. This ensures that they also guarantee single-key linearizability even with only loose (but bounded) clock synchronization. Non-transactional requests that use a client-provided timestamp do not have uncertainty intervals and do not make real-time ordering guarantees.
Unlike transactions, which establish an uncertainty interval on their coordinator node during initialization, non-transactional requests receive uncertainty intervals from their target leaseholder, using a clock reading from the leaseholder's local HLC as the local limit and this clock reading + the cluster's maximum clock skew as the global limit.
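A small model of that construction, using int64 nanos in place of HLC timestamps (all names here are illustrative, not the PR's):

```go
import "time"

// interval is a simplified stand-in for the uncertainty interval.
type interval struct{ localLimit, globalLimit int64 }

// newNonTxnInterval models the construction described above: one clock
// reading from the leaseholder's HLC supplies the local limit, and that
// reading plus the cluster's maximum clock offset supplies the global limit.
func newNonTxnInterval(leaseholderNowNanos int64, maxOffset time.Duration) interval {
	return interval{
		localLimit:  leaseholderNowNanos,
		globalLimit: leaseholderNowNanos + maxOffset.Nanoseconds(),
	}
}
```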
It is somewhat non-intuitive that non-transactional requests need uncertainty intervals — after all, they receive their timestamp from the leaseholder of the only range that they talk to, so isn't every value with a commit timestamp above their read timestamp certainly concurrent? The answer is surprisingly "no" for the following reasons, so they cannot forgo the use of uncertainty intervals:
1. the request timestamp is allocated before consulting the replica's lease. This means that there are times when the replica is not the leaseholder at the point of timestamp allocation, and only becomes the leaseholder later. In such cases, the timestamp assigned to the request is not guaranteed to be greater than the written_timestamp of all writes served by the range at the time of allocation. This is true despite invariants 1 & 2 presented in `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the timestamp is not yet the leaseholder.

In cases where the replica that assigned the non-transactional request's timestamp takes over as the leaseholder after the timestamp allocation, we expect minimumLocalLimitForLeaseholder to forward the local uncertainty limit above TimestampFromServerClock, to the lease start time.
For example, consider the following series of events:

- client A writes k = v1
- leaseholder writes v1 at ts = 100
- client A receives ack for write
- client B wants to read k using a non-txn request
- follower replica with slower clock receives non-txn request
- follower replica assigns request ts = 95
- lease transferred to follower replica with lease start time = 101
- non-txn request must use 101 as limit of uncertainty interval to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1
2. even if the replica's lease is stable and the timestamp is assigned to the non-transactional request by the leaseholder, the assigned clock reading only reflects the written_timestamp of all of the writes served by the leaseholder (and previous leaseholders) thus far. This clock reading is not guaranteed to lead the commit timestamp of all of these writes, especially if they are committed remotely and resolved after the request has received its clock reading but before the request begins evaluating.
As a result, the non-transactional request needs an uncertainty interval with a global uncertainty limit far enough in advance of the leaseholder's local HLC clock to ensure that it considers any value that was part of a transaction which could have committed before the request was received by the leaseholder to be uncertain. Concretely, the non-transactional request needs to consider values of the following form to be uncertain:
`written_timestamp < local_limit && commit_timestamp < global_limit`
The value that the non-transactional request is observing may have been written on the local leaseholder at time 10, its transaction may have been committed remotely at time 20, acknowledged, then the non-transactional request may have begun and received a timestamp of 15 from the local leaseholder, then finally the value may have been resolved asynchronously and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20). The failure of the non-transactional request to observe this value would be a stale read.
For example, consider the following series of events:

- client A begins a txn and is assigned provisional commit timestamp = 95
- client A's txn performs a Put(k, v1)
- leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
- client A's txn performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to 100
- client A's txn refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of #36431), so that leaseholder's clock is not updated
- client A's txn commits remotely and client A receives the acknowledgment
- client B initiates non-txn read of k
- leaseholder assigns read timestamp ts = 97
- asynchronous intent resolution resolves the txn's intent at k, moving v1 to ts = 100 in the process
- non-txn request must use an uncertainty interval that extends past 100 to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read
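Plugging the example's numbers into the uncertainty condition above gives a runnable check (the 500-tick `maxOffset` is an assumed value standing in for the cluster's maximum clock skew):

```go
package main

import "fmt"

func main() {
	// Values from the example above (abstract ticks rather than real nanos).
	writtenTimestamp := int64(95) // intent laid down at 95
	commitTimestamp := int64(100) // txn refreshed and committed at 100
	localLimit := int64(97)       // leaseholder clock reading = read timestamp
	maxOffset := int64(500)       // assumed maximum clock skew
	globalLimit := localLimit + maxOffset

	// The condition from the commit message:
	uncertain := writtenTimestamp < localLimit && commitTimestamp < globalLimit
	fmt.Println(uncertain) // true: the read must restart and observe v1
}
```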
This change is related to #36431 in two ways. First, it allows non-transactional requests to benefit from our incoming fix for that issue. Second, it unblocks some of the clock refactors proposed in #72121 (comment), and by extension #72663. Even though there are clear bugs today, I still don't feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make this change. We know from #36431 that invariant 1 from `uncertainty.D6` doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send` masks the bugs in this area in many cases. Removing that clock update (I don't actually plan to remove it, but I plan to disconnect it entirely from operation timestamps) without first giving non-transactional requests uncertainty intervals seems like it may reveal these bugs in ways we haven't seen in the past. So I'd like to land this fix before making that change.

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.