-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: have commands carry their wts, not read ts #60573
kvserver: have commands carry their wts, not read ts #60573
Conversation
Raft commands carry their timestamp, which is used for two purposes: 1) update every replica's clock 2) check below Raft that the command operates below the range's GC threshold Before this patch, commands were carrying ba.Timestamp. This patch makes them carry ba.WriteTimestamp(). I believe that this is the right thing, and what we had before was wrong wrt clock updates: the wts can be > rts, and the timestamps that the command is "injecting" into each replica are the timestamps of the keys it's writing - which is the request's write timestamp. W.r.t the GC threshold check, I believe carrying the read ts was not "incorrect", but it was still wrong: below Raft, the question that matters is whether a command is _writing_ below the GC line (i.e. write timestamp). Whatever reads the respective request might have done (i.e. read timestamp), they're already done above Raft (and, in case of AsyncConsensus, the respective results have already been delivered). So, below Raft we should just concern ourselves with the write timestamp. Release note: None
Emphasis a field's relationship with synthetic timestamps. Release note: None
Do y'all have an opinion about whether the below-raft gc threshold check is needed at all nowadays?
Is that still true? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.
Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.
W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.
This one I'm also less sure about. The hazard here is that a transaction has a read timestamp of 10 and a write timestamp of 20 and it gets re-ordered with a GC request at time 15. I'm having trouble understanding what issues this could cause and I'm left with the same question as you: why did we need #14833 at all if we were also latching each individual key while GC-ing it. Maybe Tobi knows more about this?
This is a worthy thread to pull on, and it would be great to clean this up, but I'm also a little hesitant to make this change so late in the cycle when there's very little obvious benefit to it but a moderate risk if we get it wrong. It's not a clear prereq of #59566, as we can make just as strong of an assertion on the leaseholder without changing what we ship through Raft. Can I convince you to pull this to this side and out of the critical path, at least for a week or two? That way, we can think all the way through it, make sure we're doing the right thing, and maybe also use that new understanding to address #55293.
Reviewed 8 of 8 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.
Why do you say it only affects non-txn requests? Quite the opposite, I think it affects only transactional requests because only those can have wts != rts. No?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
What I meant was that the clock bump below Raft is meant to ensure that non-transactional reads provide linearizability when their timestamp is assigned by a kv server. But this thing is mostly broken and unused. See #58459. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you don't think there's any value in follower replicas having their clock bumped by writes? That's what I thought the below-Raft bump was about.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
This PR has been superseded by #63045. |
Raft commands carry their timestamp, which is used for two purposes:
Before this patch, commands were carrying ba.Timestamp. This patch makes
them carry ba.WriteTimestamp(). I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.
W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.
Release note: None