Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: have commands carry their wts, not read ts #60573

Closed

Conversation

andreimatei
Copy link
Contributor

Raft commands carry their timestamp, which is used for two purposes:

  1. update every replica's clock
  2. check below Raft that the command operates below the range's GC threshold

Before this patch, commands were carrying ba.Timestamp. This patch makes
them carry ba.WriteTimestamp(). I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.
W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.

Release note: None

Raft commands carry their timestamp, which is used for two purposes:
1) update every replica's clock
2) check below Raft that the command operates below the range's GC threshold

Before this patch, commands were carrying ba.Timestamp. This patch makes
them carry ba.WriteTimestamp(). I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.
W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is _writing_ below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.

Release note: None
Emphasis a field's relationship with synthetic timestamps.

Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andreimatei
Copy link
Contributor Author

Do y'all have an opinion about whether the below-raft gc threshold check is needed at all nowadays?
The code says:

// This is
// necessary because not all commands declare read access on the GC
// threshold key, even though they implicitly depend on it. This means
// that access to this state will not be serialized by latching,
// so we must perform this check upstream and downstream of raft.
// See #14833.

Is that still true?

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.

Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.

W.r.t the GC threshold check, I believe carrying the read ts was not
"incorrect", but it was still wrong: below Raft, the question that
matters is whether a command is writing below the GC line (i.e. write
timestamp). Whatever reads the respective request might have done (i.e.
read timestamp), they're already done above Raft (and, in case of
AsyncConsensus, the respective results have already been delivered). So,
below Raft we should just concern ourselves with the write timestamp.

This one I'm also less sure about. The hazard here is that a transaction has a read timestamp of 10 and a write timestamp of 20 and it gets re-ordered with a GC request at time 15. I'm having trouble understanding what issues this could cause and I'm left with the same question as you: why did we need #14833 at all if we were also latching each individual key while GC-ing it. Maybe Tobi knows more about this?

This is a worthy thread to pull on, and it would be great to clean this up, but I'm also a little hesitant to make this change so late in the cycle when there's very little obvious benefit to it but a moderate risk if we get it wrong. It's not a clear prereq of #59566, as we can make just as strong of an assertion on the leaseholder without changing what we ship through Raft. Can I convince you to pull this to this side and out of the critical path, at least for a week or two? That way, we can think all the way through it, make sure we're doing the right thing, and maybe also use that new understanding to address #55293.

Reviewed 8 of 8 files at r1, 1 of 1 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this is the right thing,
and what we had before was wrong wrt clock updates: the wts can be >
rts, and the timestamps that the command is "injecting" into each
replica are the timestamps of the keys it's writing - which is the
request's write timestamp.

Yeah, I agree with this assessment. But note that this mechanism only affects non-transactional requests, which we don't really ever use these days and which we'd like to remove: #58459.

Why do you say it only affects non-txn requests? Quite the opposite, I think it affects only transactional requests because only those can have wts != rts. No?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@nvanbenschoten
Copy link
Member

What I meant was that the clock bump below Raft is meant to ensure that non-transactional reads provide linearizability when their timestamp is assigned by a kv server. But this thing is mostly broken and unused. See #58459.

Copy link
Contributor Author

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you don't think there's any value in follower replicas having their clock bumped by writes? That's what I thought the below-Raft bump was about.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)

@tbg tbg removed their request for review April 15, 2021 08:09
@andreimatei
Copy link
Contributor Author

This PR has been superseded by #63045.

@andreimatei andreimatei deleted the small.cmd-write-timestamp branch April 16, 2021 01:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants