kvserver: have commands carry their wts, not read ts #60573

andreimatei · 2021-02-14T21:52:16Z

Raft commands carry their timestamp, which is used for two purposes:

update every replica's clock
check below Raft that the command operates below the range's GC threshold

Before this patch, commands were carrying ba.Timestamp. This patch makes
them carry ba.WriteTimestamp(). I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.
W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.

Release note: None

Raft commands carry their timestamp, which is used for two purposes: 1) update every replica's clock 2) check below Raft that the command operates below the range's GC threshold Before this patch, commands were carrying ba.Timestamp. This patch makes them carry ba.WriteTimestamp(). I believe that this is the right thing, and what we had before was wrong wrt clock updates: the wts can be > rts, and the timestamps that the command is "injecting" into each replica are the timestamps of the keys it's writing - which is the request's write timestamp. W.r.t the GC threshold check, I believe carrying the read ts was not "incorrect", but it was still wrong: below Raft, the question that matters is whether a command is _writing_ below the GC line (i.e. write timestamp). Whatever reads the respective request might have done (i.e. read timestamp), they're already done above Raft (and, in case of AsyncConsensus, the respective results have already been delivered). So, below Raft we should just concern ourselves with the write timestamp. Release note: None

Emphasis a field's relationship with synthetic timestamps. Release note: None

cockroach-teamcity · 2021-02-14T21:52:24Z

This change is

andreimatei · 2021-02-14T21:54:14Z

Do y'all have an opinion about whether the below-raft gc threshold check is needed at all nowadays?
The code says:

// This is
// necessary because not all commands declare read access on the GC
// threshold key, even though they implicitly depend on it. This means
// that access to this state will not be serialized by latching,
// so we must perform this check upstream and downstream of raft.
// See #14833.

Is that still true?

nvanbenschoten

I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.

Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.

W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.

This one I'm also less sure about. The hazard here is that a transaction has a read timestamp of 10 and a write timestamp of 20 and it gets re-ordered with a GC request at time 15. I'm having trouble understanding what issues this could cause and I'm left with the same question as you: why did we need #14833 at all if we were also latching each individual key while GC-ing it. Maybe Tobi knows more about this?

This is a worthy thread to pull on, and it would be great to clean this up, but I'm also a little hesitant to make this change so late in the cycle when there's very little obvious benefit to it but a moderate risk if we get it wrong. It's not a clear prereq of #59566, as we can make just as strong of an assertion on the leaseholder without changing what we ship through Raft. Can I convince you to pull this to this side and out of the critical path, at least for a week or two? That way, we can think all the way through it, make sure we're doing the right thing, and maybe also use that new understanding to address #55293.

Reviewed 8 of 8 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

andreimatei

I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.

Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.

Why do you say it only affects non-txn requests? Quite the opposite, I think it affects only transactional requests because only those can have wts != rts. No?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

nvanbenschoten · 2021-02-17T03:48:40Z

What I meant was that the clock bump below Raft is meant to ensure that non-transactional reads provide linearizability when their timestamp is assigned by a kv server. But this thing is mostly broken and unused. See #58459.

andreimatei

So you don't think there's any value in follower replicas having their clock bumped by writes? That's what I thought the below-Raft bump was about.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

andreimatei · 2021-04-16T01:02:46Z

This PR has been superseded by #63045.

andreimatei added 2 commits February 14, 2021 15:05

kvserver: improve a comment

4453f2d

Emphasis a field's relationship with synthetic timestamps. Release note: None

andreimatei requested review from tbg and nvanbenschoten February 14, 2021 21:52

nvanbenschoten requested changes Feb 16, 2021

View reviewed changes

andreimatei commented Feb 16, 2021

View reviewed changes

andreimatei commented Feb 17, 2021

View reviewed changes

tbg removed their request for review April 15, 2021 08:09

andreimatei closed this Apr 16, 2021

andreimatei deleted the small.cmd-write-timestamp branch April 16, 2021 01:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: have commands carry their wts, not read ts #60573

kvserver: have commands carry their wts, not read ts #60573

andreimatei commented Feb 14, 2021

cockroach-teamcity commented Feb 14, 2021

andreimatei commented Feb 14, 2021

nvanbenschoten left a comment

andreimatei left a comment

nvanbenschoten commented Feb 17, 2021

andreimatei left a comment

andreimatei commented Apr 16, 2021

kvserver: have commands carry their wts, not read ts #60573

kvserver: have commands carry their wts, not read ts #60573

Conversation

andreimatei commented Feb 14, 2021

cockroach-teamcity commented Feb 14, 2021

andreimatei commented Feb 14, 2021

nvanbenschoten left a comment

Choose a reason for hiding this comment

andreimatei left a comment

Choose a reason for hiding this comment

nvanbenschoten commented Feb 17, 2021

andreimatei left a comment

Choose a reason for hiding this comment

andreimatei commented Apr 16, 2021