kvserver: introduce a Raft-based transport for closedts #59566

Merged 2 commits on Feb 20, 2021

Conversation

@andreimatei (Contributor) commented Jan 29, 2021

This patch introduces a replacement for the existing closed timestamp
mechanism / transport. The new mechanism is gated by a cluster version.

Raft commands now carry increasing closed timestamps generated by the
propBuf by using the recent request Tracker for synchronizing with
in-flight requests (i.e. not closing timestamps below them).
Raft commands get a closed ts field, and the range state gets the field
as well.

The propBuf pays attention to the range's closed timestamp policy for
deciding whether to close lagging or leading timestamps.

Release note: None
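
As a rough illustration of the mechanism described above, here is a small, self-contained sketch of how a proposal buffer can assign monotonically increasing closed timestamps while staying below in-flight requests. Every type and name below (Timestamp, propBuf, assignClosedTS, the lag parameter) is a simplified stand-in invented for this sketch; the actual code lives in pkg/kv/kvserver and uses hlc.Timestamp, the tracker package, and the range's closed timestamp policy.

package main

import (
	"fmt"
	"time"
)

// Timestamp is a toy stand-in for hlc.Timestamp (wall time only).
type Timestamp struct{ WallTime int64 }

func (t Timestamp) Less(o Timestamp) bool { return t.WallTime < o.WallTime }
func (t Timestamp) IsEmpty() bool         { return t.WallTime == 0 }
func (t Timestamp) Prev() Timestamp       { return Timestamp{WallTime: t.WallTime - 1} }

// Forward bumps t up to o if o is larger.
func (t *Timestamp) Forward(o Timestamp) {
	if t.Less(o) {
		*t = o
	}
}

// Backward lowers t down to o if o is smaller.
func (t *Timestamp) Backward(o Timestamp) {
	if o.Less(*t) {
		*t = o
	}
}

// propBuf is a toy proposal buffer that attaches increasing closed
// timestamps to outgoing Raft proposals.
type propBuf struct {
	assignedClosedTS Timestamp
}

// assignClosedTS computes the closed timestamp for the next proposal:
// lag the current time by `lag` (a LAG_BY_CLUSTER_SETTING-style policy),
// stay strictly below the lower bound of currently-evaluating requests,
// and never regress.
func (b *propBuf) assignClosedTS(now, evalLowerBound Timestamp, lag time.Duration) Timestamp {
	target := Timestamp{WallTime: now.WallTime - lag.Nanoseconds()}
	if !evalLowerBound.IsEmpty() {
		// In-flight requests may still write at evalLowerBound, so only
		// timestamps strictly below it can be closed.
		target.Backward(evalLowerBound.Prev())
	}
	b.assignedClosedTS.Forward(target)
	return b.assignedClosedTS
}

func main() {
	var b propBuf
	now := Timestamp{WallTime: time.Now().UnixNano()}
	inFlight := Timestamp{WallTime: now.WallTime - 10*time.Second.Nanoseconds()}
	fmt.Println(b.assignClosedTS(now, inFlight, 3*time.Second))
}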

@cockroach-teamcity (Member)

This change is Reviewable

@andreimatei (Contributor Author)

No tests yet.
The first commit is #59225.

@andreimatei changed the title from "kvserver: hookup the new closedts tracker to outgoing proposals" to "kvserver: introduce a Raft-based transport for closedts" on Feb 12, 2021
@andreimatei (Contributor Author)

Only the last commit is new here.
The rest are #59854, #59571 and #59693.

@nvanbenschoten (Member) left a comment

Nice! There's a lot here, but the new logic really isn't too complex. It's much easier to reason about than the old system.

Reviewed 34 of 34 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/replica_application_result.go, line 235 at r7 (raw file):

	log.VEventf(ctx, 2, "retry: proposalIllegalLeaseIndex")

	maxLeaseIndex, pErr := r.propose(ctx, p, tok.Move(ctx))

Do you think this has similar buffer aliasing issues to what we saw a few days ago?


pkg/kv/kvserver/replica_application_state_machine.go, line 431 at r7 (raw file):

	}
	if log.V(4) {
		log.Infof(ctx, "processing command %x: maxLeaseIndex=%d closedts: %s idx: %d",

If you want to keep this, move the cmd.ent.Index before maxLeaseIndex and name it raftIndex=%d. Also, s/: /=/.


pkg/kv/kvserver/replica_application_state_machine.go, line 440 at r7 (raw file):

	// reject a command, all will.
	if !b.r.shouldApplyCommand(ctx, cmd, &b.state) {
		log.VEventf(ctx, 1, "applying command %x with forced error: %s", cmd.raftCmd.ClosedTimestamp, cmd.forcedErr)

Did you mean to keep this?


pkg/kv/kvserver/replica_application_state_machine.go, line 448 at r7 (raw file):

		cmd.raftCmd.ClosedTimestamp.Reset()
	} else {
		// !!! Find a way to assert that this command is not writing below the replica's closed ts.

Is this addressed?


pkg/kv/kvserver/replica_application_state_machine.go, line 826 at r7 (raw file):

	if cts := cmd.raftCmd.ClosedTimestamp; !cts.IsEmpty() {
		if cts.Less(b.state.ClosedTimestamp) {
			return wrapWithNonDeterministicFailure(errors.Errorf(

Any reason not to just fatal here? Is the new error handling needed?


pkg/kv/kvserver/replica_application_state_machine.go, line 891 at r7 (raw file):

	r.mu.state.RaftAppliedIndex = b.state.RaftAppliedIndex
	r.mu.state.LeaseAppliedIndex = b.state.LeaseAppliedIndex
	closedTimestampUpdated := !r.mu.state.ClosedTimestamp.Equal(b.state.ClosedTimestamp)

Any reason not to just blindly write the update? If you need closedTimestampUpdated for later, consider:

closedTimestampUpdated := r.mu.state.ClosedTimestamp.Forward(b.state.ClosedTimestamp)

pkg/kv/kvserver/replica_proposal.go, line 117 at r7 (raw file):

	// booleans.
	Request *roachpb.BatchRequest
	// leaseStatus represents the lease under which the Request was evaluated and

nit: add an empty line above this comment.


pkg/kv/kvserver/replica_proposal.go, line 117 at r7 (raw file):

	// booleans.
	Request *roachpb.BatchRequest
	// leaseStatus represents the lease under which the Request was evaluated and

What does this contain for lease requests? Consider spelling that out.


pkg/kv/kvserver/replica_proposal.go, line 171 at r7 (raw file):

}

func (proposal *ProposalData) alreadyP() {

!!!


pkg/kv/kvserver/replica_proposal.go, line 414 at r7 (raw file):

	r.concMgr.OnRangeLeaseUpdated(newLease.Sequence, iAmTheLeaseHolder)

	r.mu.proposalBuf.OnLeaseChangeLocked(iAmTheLeaseHolder, r.mu.state.ClosedTimestamp)

Give this a comment.


pkg/kv/kvserver/replica_proposal_buf.go, line 114 at r7 (raw file):

}

// propBuf is a multi-producer, single-consumer buffer for Raft proposals. The

This comment could use some updates to indicate the new role that the proposal buffer plays.


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

	arr    propBufArray

	clock *hlc.Clock

Should we put accessors to the clock and settings on the proposer interface so we don't need to plumb the dependencies into here?


pkg/kv/kvserver/replica_proposal_buf.go, line 148 at r7 (raw file):

	// currently-evaluating requests.
	evalTracker tracker.Tracker
	// closedTSNanos is the largest "closed timestamp" - i.e. the largest

"closedTSNanos"

Also, should this be renamed to "assignedClosedTS" or something to indicate that this is not a real closed timestamp yet?


pkg/kv/kvserver/replica_proposal_buf.go, line 153 at r7 (raw file):

	// any more.
	//
	// We only close physical timestamps, ignoring the logical part. If we say

Stale paragraph.


pkg/kv/kvserver/replica_proposal_buf.go, line 217 at r7 (raw file):

	enqueueUpdateCheck()
	closeTimestampPolicy() roachpb.RangeClosedTimestampPolicy
	// raftAppliedClosedTimestamp returns the closed timestamp from the ReplicaState.

If we just need this to check if the timestamp is empty, I'd slim down the interface. Consider raftTransportClosedTimestampsInUse() bool or something like that. No need to let future changes do something stupid.
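
For illustration, the shape being suggested might look roughly like the sketch below; the placeholder type and method names are assumptions of this sketch, not the PR's final interface.

package kvserver

// RangeClosedTimestampPolicy is a placeholder for the real
// roachpb.RangeClosedTimestampPolicy enum.
type RangeClosedTimestampPolicy int

// proposer, narrowed: instead of exposing the Raft-applied closed timestamp
// itself, the interface only answers the one question the propBuf needs.
type proposer interface {
	closeTimestampPolicy() RangeClosedTimestampPolicy
	// raftTransportClosedTimestampsInUse returns whether this replica
	// publishes closed timestamps via Raft commands (i.e. the new cluster
	// version is active).
	raftTransportClosedTimestampsInUse() bool
}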


pkg/kv/kvserver/replica_proposal_buf.go, line 271 at r7 (raw file):

will be assigned which is assigned here

Something's off here.


pkg/kv/kvserver/replica_proposal_buf.go, line 273 at r7 (raw file):

// closed_timestamp. max_lease_index will be assigned which is assigned here
// when the command is sequenced in the buffer. closed_timestamp will be
// assigned later. It is also expected that the byte slice has sufficient

Clarify when "later" is.


pkg/kv/kvserver/replica_proposal_buf.go, line 517 at r7 (raw file):

	}

	now := b.clock.PhysicalNow()

Why PhysicalNow? That seems wrong, doesn't it? Also, let's pull this logic out into some other method so that we can wrap it with a test.


pkg/kv/kvserver/replica_proposal_buf.go, line 620 at r7 (raw file):

		// could, in principle, but we'd have to make a copy of the encoded command
		// as to not modify the copy that's already stored in the local replica's
		// command cache.

s/command cache/raft entry cache/


pkg/kv/kvserver/replica_proposal_buf.go, line 687 at r7 (raw file):

}

// OnLeaseChange is called when a new lease is applied to this range. The closed

Move this above EvaluatingRequestsCount.


pkg/kv/kvserver/replica_proposal_buf.go, line 688 at r7 (raw file):

// OnLeaseChange is called when a new lease is applied to this range. The closed
// timestamp tracked by the propBuf is updated accordingly, such that the leases

Isn't this lease stuff wrong? That's not what we're passing anymore.


pkg/kv/kvserver/replica_proposal_buf.go, line 729 at r7 (raw file):

	// aware of all the reads performed by the transferer.
	isBrandNewLeaseRequest := false
	isLeaseExtension := false

Do we need this variable?


pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r7 (raw file):

		// former is more up to date, having been modified by the evaluation.
		newLease := p.command.ReplicatedEvalResult.State.Lease
		isLeaseExtension = newLease.Sequence != leaseReq.PrevLease.Sequence

Isn't this backwards? Lease extensions have the same sequence number.

Also, should we use p.leaseStatus in place of leaseReq.PrevLease?


pkg/kv/kvserver/replica_proposal_buf.go, line 741 at r7 (raw file):

		}
		isBrandNewLeaseRequest = true
		closedTSTarget = newLease.Start.ToTimestamp()

Make a note of why we completely overwrite the old closedTSTarget. Something about how it's meaningless on a non-leaseholder.


pkg/kv/kvserver/replica_proposal_buf.go, line 746 at r7 (raw file):

		lb := b.evalTracker.LowerBound(ctx)
		if !lb.IsEmpty() {
			closedTSTarget.Backward(lb.FloorPrev())

Make a note about why Prev and also why FloorPrev (to get rid of logical ticks).


pkg/kv/kvserver/replica_proposal_buf.go, line 760 at r7 (raw file):

		closedTSTarget.Backward(p.leaseStatus.Expiration())
	}
	b.closedTS.Forward(closedTSTarget)

Give this a comment.


pkg/kv/kvserver/replica_proposal_buf.go, line 817 at r7 (raw file):

// the propBuf.
func (b *propBuf) EvaluatingRequestsCount() int {
	b.p.locker().Lock()

Can we use the rlocker?


pkg/kv/kvserver/replica_proposal_buf.go, line 856 at r7 (raw file):

func (t *TrackedRequestToken) doneLocked(ctx context.Context) {
	if t.done {
		// t.b == nil means that Reset() was called on this token.

Does Reset still exist?


pkg/kv/kvserver/replica_proposal_buf.go, line 863 at r7 (raw file):

	}
	t.done = true
	if t.b != nil {

Is this condition possible?


pkg/kv/kvserver/replica_proposal_buf.go, line 887 at r7 (raw file):

// returns the minimum timestamp at which this request must write. The tracked
// request is identified by its tentative write timestamp. After calling this,
// the caller must bump the write timestamp to at least the returned minTS.

Doesn't it need to bump to a time above the minTS? Or does this behave differently from the old tracker?


pkg/kv/kvserver/replica_proposal_buf.go, line 910 at r7 (raw file):

// TrackEvaluatingRequest. See comments over there for when this needs to be
// called.
func (b *propBuf) AbandonEvaluatingRequest(ctx context.Context, tok tracker.RemovalToken) {

Is this still in use?


pkg/kv/kvserver/replica_raft.go, line 120 at r7 (raw file):

		}
		proposal.finishApplication(ctx, pr)

stray line.


pkg/kv/kvserver/replica_write.go, line 91 at r7 (raw file):

	defer untrack(ctx, 0, 0, 0) // covers all error returns below

	writeTS := ba.WriteTimestamp()

nit: any reason to pull this all the way out here? Should we just pass ba.WriteTimestamp() directly to TrackEvaluatingRequest?


pkg/kv/kvserver/store_split.go, line 125 at r7 (raw file):

	}

	rsl.SetClosedTimestamp(ctx, readWriter, closedTS)

Do we need error handling here? And a comment?


pkg/kv/kvserver/store_split.go, line 163 at r7 (raw file):

	rightRepl.mu.Lock()
	rightRepl.mu.initialMaxClosed = initialMaxClosed
	// !!! initialize propBuf.closedTS here? Don't think so, splitPostApply() will do it through leasePostApply().

Yeah, I don't think so either.


pkg/kv/kvserver/testing_knobs.go, line 242 at r7 (raw file):

	// worrying about Raft).
	AllowLeaseRequestProposalsWhenNotLeader bool
	// DontCloseTimestamps inhibits the propBuf's closing of timestamps.

So all entries will have an empty closed timestamp field? Maybe spell that out explicitly.


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 159 at r7 (raw file):

	}

	// !!! FreezeStart should be bumped to the closed timestamp

I think it should just go away. This was mostly a mistake. We can fix things to serve reads under the frozen range's closed timestamp, but this whole "serving reads under the range's freeze timestamp" business is not great. See this new TODO: https://github.com/cockroachdb/cockroach/pull/60521/files#diff-b080aecf7d979c0b5a4c43973db709d0d866f137cb4e961ebfc0d979ad22c003R296.


pkg/kv/kvserver/kvserverpb/proposer_kv.go, line 26 at r7 (raw file):

	ClosedTimestamp: hlc.Timestamp{
		WallTime:  math.MaxInt64,
		Logical:   math.MaxInt32,

If we promise never to include logical ticks in the closed timestamp, maybe this can go away and we can save a few bytes of unnecessary space in the pre-allocation buffer. What do you think?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 244 at r7 (raw file):

  // well as that uproots whatever ordering was originally envisioned.
  //
  // This field is set through RaftCommandFooter hackery.

You call it "hackery", I call it "style".


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 270 at r7 (raw file):

  map<string, string> trace_data = 16;

  // The closed timestamp carried by this command. Once a follower is told to

nit: consider moving this right below max_lease_index, since there's a duality to these two fields.


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 272 at r7 (raw file):

  // The closed timestamp carried by this command. Once a follower is told to
  // apply this command, it knows that there will be no further writes at
  // timestamps <= closed_timestamp_nanos. Note that the command itself might

s/closed_timestamp_nanos/closed_timestamp/


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 276 at r7 (raw file):

  // used after this command is applied.
  //
  // Note that this represents only the physical part of a timestamp; the

Stale paragraph.


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 302 at r7 (raw file):

// ClosedTimestampFooter is similar to RaftCommandFooter, allowing the proposal
// buffer to fill in the max_lease_index field after most of the proto has been

"closed_timestamp file"


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 307 at r7 (raw file):

  util.hlc.Timestamp closed_timestamp = 17 [(gogoproto.nullable) = false];
}

nit: strange extra lines at the bottom of this file.


pkg/kv/kvserver/stateloader/stateloader.go, line 171 at r7 (raw file):

	}
	if state.UsingAppliedStateKey {
		rai, lai := state.RaftAppliedIndex, state.LeaseAppliedIndex

nit: follow this pattern, so:

rai, lai, ct := state.RaftAppliedIndex, state.LeaseAppliedIndex, state.ClosedTimestamp
if err := rsl.SetRangeAppliedState(ctx, readWriter, rai, lai, ms, ct); err != nil {

pkg/kv/kvserver/stateloader/stateloader.go, line 302 at r7 (raw file):

	readWriter storage.ReadWriter,
	appliedIndex, leaseAppliedIndex uint64,
	closedTimestamp hlc.Timestamp,

Move this arg down to keep the four fields in the same order everywhere.


pkg/storage/enginepb/mvcc3.proto, line 208 at r7 (raw file):

  // closed_timestamp is the largest timestamp that is known to have been
  // closed. This means that the current leaseholder (if any) and any future

"closed as of this applied index"

@andreimatei (Contributor Author) left a comment

Now the last two commits are new here. The rest are #59571, #59693 and #60573.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/replica_application_result.go, line 235 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you think this has similar buffer aliasing issues to what we saw a few days ago?

I don't think so, because r.propose allocates a new buffer and re-marshals everything. Right?


pkg/kv/kvserver/replica_application_state_machine.go, line 431 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If you want to keep this, move the cmd.ent.Index before maxLeaseIndex and name it raftIndex=%d. Also, s/: /=/.

done


pkg/kv/kvserver/replica_application_state_machine.go, line 440 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you mean to keep this?

removed


pkg/kv/kvserver/replica_application_state_machine.go, line 448 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this addressed?

it is now


pkg/kv/kvserver/replica_application_state_machine.go, line 826 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Any reason not to just fatal here? Is the new error handling needed?

I'm glad you think it's not. fataling


pkg/kv/kvserver/replica_application_state_machine.go, line 891 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Any reason not to just blindly write the update? If you need closedTimestampUpdated for later, consider:

closedTimestampUpdated := r.mu.state.ClosedTimestamp.Forward(b.state.ClosedTimestamp)

done


pkg/kv/kvserver/replica_proposal.go, line 117 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: add an empty line above this comment.

done


pkg/kv/kvserver/replica_proposal.go, line 117 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What does this contain for lease requests? Consider spelling that out.

see now


pkg/kv/kvserver/replica_proposal.go, line 171 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

!!!

gone


pkg/kv/kvserver/replica_proposal.go, line 414 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment.

see now


pkg/kv/kvserver/replica_proposal_buf.go, line 114 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This comment could use some updates to indicate the new role that the proposal buffer plays.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we put accessors to the clock and settings on the proposer interface so we don't need to plumb the dependencies into here?

I think binding them early is much better. These things are clearly node-wide, not per-range.


pkg/kv/kvserver/replica_proposal_buf.go, line 148 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"closedTSNanos"

Also, should this be renamed to "assignedClosedTS" or something to indicate that this is not a real closed timestamp yet?

renamed

Although it is a very real closed timestamp - arguably more real than the other one :)


pkg/kv/kvserver/replica_proposal_buf.go, line 153 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Stale paragraph.

removed


pkg/kv/kvserver/replica_proposal_buf.go, line 217 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we just need this to check if the timestamp is empty, I'd slim down the interface. Consider raftTransportClosedTimestampsInUse() bool or something like that. No need to let future changes do something stupid.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 271 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
will be assigned which is assigned here

Something's off here.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 273 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Clarify when "later" is.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 517 at r7 (raw file):

Why PhysicalNow? That seems wrong, doesn't it?

Why? You mean as opposed to clock.Now()? We only care about the physical part cause that's what the policies operate on, right?

Also, let's pull this logic out into some other method so that we can wrap it with a test.

Done, and added a test.


pkg/kv/kvserver/replica_proposal_buf.go, line 620 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/command cache/raft entry cache/

done


pkg/kv/kvserver/replica_proposal_buf.go, line 687 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Move this above EvaluatingRequestsCount.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 688 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Isn't this lease stuff wrong? That's not what we're passing anymore.

see now


pkg/kv/kvserver/replica_proposal_buf.go, line 729 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need this variable?

done


pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r7 (raw file):

Isn't this backwards? Lease extensions have the same sequence number.

done

Also, should we use p.leaseStatus in place of leaseReq.PrevLease?

done, good idea


pkg/kv/kvserver/replica_proposal_buf.go, line 741 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Make a note of why we completely overwrite the old closedTSTarget. Something about how it's meaningless on a non-leaseholder.

see now


pkg/kv/kvserver/replica_proposal_buf.go, line 746 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Make a note about why Prev and also why FloorPrev (to get rid of logical ticks).

done


pkg/kv/kvserver/replica_proposal_buf.go, line 760 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Give this a comment.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 817 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we use the rlocker?

yes, done


pkg/kv/kvserver/replica_proposal_buf.go, line 856 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Does Reset still exist?

no, deleted


pkg/kv/kvserver/replica_proposal_buf.go, line 863 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this condition possible?

no, deleted


pkg/kv/kvserver/replica_proposal_buf.go, line 887 at r7 (raw file):

Doesn't it need to bump to a time above the minTS?

No, it needs to bump to minTS, which already incorporates a Next() call on the closed timestamp.


pkg/kv/kvserver/replica_proposal_buf.go, line 910 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Is this still in use?

no, deleted


pkg/kv/kvserver/replica_raft.go, line 120 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

stray line.

done


pkg/kv/kvserver/replica_write.go, line 91 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: any reason to pull this all the way out here? Should we just pass ba.WriteTimestamp() directly to TrackEvaluatingRequest?

done


pkg/kv/kvserver/store_split.go, line 125 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need error handling here? And a comment?

see now. Were you looking for more?


pkg/kv/kvserver/store_split.go, line 163 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, I don't think so either.

done


pkg/kv/kvserver/testing_knobs.go, line 242 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So all entries will have an empty closed timestamp field? Maybe spell that out explicitly.

done


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 159 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it should just go away. This was mostly a mistake. We can fix things to serve reads under the frozen range's closed timestamp, but this whole "serving reads under the range's freeze timestamp" business is not great. See this new TODO: https://github.com/cockroachdb/cockroach/pull/60521/files#diff-b080aecf7d979c0b5a4c43973db709d0d866f137cb4e961ebfc0d979ad22c003R296.

yeah


pkg/kv/kvserver/kvserverpb/proposer_kv.go, line 26 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we promise never to include logical ticks in the closed timestamp, maybe this can go away and we can save a few bytes of unnecessary space in the pre-allocation buffer. What do you think?

I'm not impressed by saving space in this buffer; I don't think that's a factor. The logical ticks do maybe cost some bandwidth and proto unmarshalling, so I'm not in love with them - that's why all the closed timestamp policy computations are done without logical ticks, and also why I do FloorPrev() on what comes out of the tracker (and also cause I don't want to introduce ugly-looking logical=maxint timestamps). The only thing that can introduce closed ts with logical ticks is the lease start time. And I guess there it makes sense to keep them, to properly align the closed ts with the lease start. What do you think?


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 270 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider moving this right below max_lease_index, since there's a duality to these two fields.

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 272 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/closed_timestamp_nanos/closed_timestamp/

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 276 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Stale paragraph.

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 302 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"closed_timestamp file"

done


pkg/kv/kvserver/kvserverpb/proposer_kv.proto, line 307 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: strange extra lines at the bottom of this file.

done


pkg/kv/kvserver/stateloader/stateloader.go, line 171 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: follow this pattern, so:

rai, lai, ct := state.RaftAppliedIndex, state.LeaseAppliedIndex, state.ClosedTimestamp
if err := rsl.SetRangeAppliedState(ctx, readWriter, rai, lai, ms, ct); err != nil {

done


pkg/kv/kvserver/stateloader/stateloader.go, line 302 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Move this arg down to keep the four fields in the same order everywhere.

done


pkg/storage/enginepb/mvcc3.proto, line 208 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"closed as of this applied index"

done

@nvanbenschoten (Member) left a comment

Getting much closer!

Reviewed 7 of 10 files at r9, 5 of 8 files at r10, 15 of 17 files at r11, 2 of 8 files at r12, 35 of 35 files at r14, 11 of 11 files at r15.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/replica_application_result.go, line 235 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I don't think so, because r.propose allocates a new buffer and re-marshals everything. Right?

Oh, you're right. I was thinking that this was calling into proposalBuf.Insert.


pkg/kv/kvserver/replica_follower_read.go, line 153 at r15 (raw file):

	defer r.mu.RUnlock()
	// TODO(andrei): Make sure that this synchronizes with the closed timestamps
	// side-transport once the side-transport is written.

👍


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I think binding them early is much better. These things are clearly node-wide, not per-range.

If so, then let's pull these two fields right up before p proposer.


pkg/kv/kvserver/replica_proposal_buf.go, line 217 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done

Looks like you missed the first word in the comment.


pkg/kv/kvserver/replica_proposal_buf.go, line 517 at r7 (raw file):

Why? You mean as opposed to clock.Now()? We only care about the physical part cause that's what the policies operate on, right?

But PhysicalNow doesn't take into consideration clock updates from other nodes. PhysicalNow bypasses the entire HLC clock. If we know that a follower is leading our clock by 50ms, it would be better to account for that in this target.

EDIT: to clarify in case this is where the confusion is: hlc.PhysicalNow() != hlc.Now().WallTime.
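
For readers following along, here is a toy illustration of that distinction (a real HLC also handles logical ticks and max-offset checks; this only shows why Now() can lead the physical clock):

package main

import "fmt"

// hlc is a toy hybrid logical clock over an injected physical clock reading.
type hlc struct {
	physical int64 // physical clock reading
	wall     int64 // HLC wall time: ratcheted by timestamps received from other nodes
}

func (c *hlc) PhysicalNow() int64 { return c.physical }

// Now returns the HLC wall time, which never trails the physical clock but
// may lead it after hearing from a node whose clock is ahead.
func (c *hlc) Now() int64 {
	if c.physical > c.wall {
		c.wall = c.physical
	}
	return c.wall
}

// Update ratchets the HLC forward on a timestamp received from another node.
func (c *hlc) Update(remote int64) {
	if remote > c.wall {
		c.wall = remote
	}
}

func main() {
	c := &hlc{physical: 100}
	c.Update(150)                         // a follower's clock leads ours by 50
	fmt.Println(c.PhysicalNow(), c.Now()) // 100 150
}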


pkg/kv/kvserver/replica_proposal_buf.go, line 620 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done

Try that one more time 😃


pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Isn't this backwards? Lease extensions have the same sequence number.

done

Also, should we use p.leaseStatus in place of leaseReq.PrevLease?

done, good idea

I think you wanted oldLease := p.leaseStatus.Lease


pkg/kv/kvserver/replica_proposal_buf.go, line 887 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Doesn't it need to bump to a time above the minTS?

No, it needs to bump to minTS, which already incorporates a Next() call on the closed timestamp.

Do we like that? On the surface, I do. But then it means that we have to bump to the timestamp returned from this method, yet bump above any timestamp returned from the timestamp cache. I think the reason it was the other way before was to maintain that symmetry - so that the closed timestamp was similar to an entry in the timestamp cache.

I guess the difficulty here is that we're putting .Next() into the evalTracker, right?


pkg/kv/kvserver/replica_proposal_buf.go, line 689 at r14 (raw file):

		targetDuration := closedts.TargetDuration.Get(&b.settings.SV)
		closedTSTarget = hlc.Timestamp{WallTime: now - targetDuration.Nanoseconds()}
	case roachpb.LEAD_FOR_GLOBAL_READS:

Let's take this out for now so that we can solve the fallout of it independently from this PR. Maybe just comment it out and make the first case case roachpb.LAG_BY_CLUSTER_SETTING, roachpb.LEAD_FOR_GLOBAL_READS.


pkg/kv/kvserver/replica_proposal_buf_test.go, line 71 at r14 (raw file):

func (t testProposerRaft) Step(raftpb.Message) error {
	// TODO(andrei, nvanbenschoten): Capture the message and test against it.

What do you think about adding some tests that address this TODO by capturing MsgProps, decoding them, and asserting that their MinLeaseIndex and ClosedTimestamps are as expected? I think this is the kind of testing we're currently missing with closed timestamps in general.


pkg/kv/kvserver/replica_raftstorage.go, line 1010 at r14 (raw file):

	// wholesale replace r.mu.state.
	r.mu.state = s
	// If we're the leaseholder, we might have learned that we got the lease

Isn't this handled by the call to leasePostApplyLocked above?


pkg/kv/kvserver/store_split.go, line 125 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

see now. Were you looking for more?

No, looks good.


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 164 at r15 (raw file):

	closedTS := cArgs.EvalCtx.FrozenClosedTimestamp(ctx)
	if !closedTS.IsEmpty() {
		reply.ClosedTimestamp = closedTS

reply.ClosedTimestamp = cArgs.EvalCtx.FrozenClosedTimestamp(ctx)


pkg/kv/kvserver/kvserverpb/proposer_kv.go, line 26 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

I'm not impressed by saving space in this buffer; I don't think that's a factor. The logical ticks do maybe cost some bandwidth and proto unmarshalling, so I'm not in love with them - that's why all the closed timestamp policy computations are done without logical ticks, and also why I do FloorPrev() on what comes out of the tracker (and also cause I don't want to introduce ugly-looking logical=maxint timestamps). The only thing that can introduce closed ts with logical ticks is the lease start time. And I guess there it makes sense to keep them, to properly align the closed ts with the lease start. What do you think?

SGTM. The lease start time is a good point.


pkg/kv/kvserver/stateloader/stateloader.go, line 489 at r14 (raw file):

}

func (rsl StateLoader) SetClosedTimestamp(

I'm guessing you'll need a comment on this.

@andreimatei (Contributor Author) left a comment

Now based on #59571 and #60634. Removed the dependency on #60573.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If so, then let's pull these two fields right up before p proposer.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 217 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Looks like you missed the first word in the comment.

done


pkg/kv/kvserver/replica_proposal_buf.go, line 517 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why? You mean as opposed to clock.Now()? We only care about the physical part cause that's what the policies operate on, right?

But PhysicalNow doesn't take into consideration clock updates from other nodes. PhysicalNow bypasses the entire HLC clock. If we know that a follower is leading our clock by 50ms, it would be better to account for that in this target.

EDIT: to clarify in case this is where the confusion is: hlc.PhysicalNow() != hlc.Now().WallTime.

ah, indeed. done.


pkg/kv/kvserver/replica_proposal_buf.go, line 620 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Try that one more time 😃

was close tho


pkg/kv/kvserver/replica_proposal_buf.go, line 736 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you wanted oldLease := p.leaseStatus.Lease

done


pkg/kv/kvserver/replica_proposal_buf.go, line 887 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we like that? On the surface, I do. But then it means that we have to bump to the timestamp returned from this method, yet bump above any timestamp returned from the timestamp cache. I think the reason it was the other way before was to maintain that symmetry - so that the closed timestamp was similar to an entry in the timestamp cache.

I guess the difficulty here is that we're putting .Next() into the evalTracker, right?

I think the old Tracker was also returning the minimum ts at which the new request can evaluate, not the highest ts at which it cannot. See the Next() call here, and also the comment on Track():

minProp := t.mu.next.Next()

So it seems to me that the Next() in applyTimestampCache() is unnecessary when it applies to minReadTS:
nextRTS := rTS.Next()

I don't care too much about the symmetry you're talking about (particularly since it doesn't look like it existed). I think it's natural for the timestamp cache to contain the read timestamp, just like the tracker contains the closed timestamps, and for callers to deal with the required Next() at different levels. What does bother me, however, is the way in which we pass these closed timestamps into applyTimestampCache, which kinda fudges the reason for a push. I think we should keep the application of the tscache separate from the application of the closed ts.
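
Purely to illustrate the two conventions being debated (plain int64s standing in for hlc timestamps; not code from the PR):

package main

import "fmt"

// Convention A: the structure stores the closed timestamp itself (like a
// timestamp cache entry); callers must bump their write timestamp strictly
// above what it returns.
func minWriteTSConventionA(closedTS int64) int64 {
	return closedTS + 1 // the caller performs the Next()
}

// Convention B: the tracker hands out the minimum proposal timestamp
// directly (analogous to t.mu.next.Next() quoted above); callers bump to
// it, not above it.
func minWriteTSConventionB(closedTS int64) int64 {
	minProp := closedTS + 1 // the Next() already happened inside the tracker
	return minProp
}

func main() {
	// Both conventions force writes above the closed timestamp; the only
	// question is which layer applies the extra tick.
	fmt.Println(minWriteTSConventionA(100), minWriteTSConventionB(100)) // 101 101
}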


pkg/kv/kvserver/replica_proposal_buf.go, line 689 at r14 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's take this out for now so that we can solve the fallout of it independently from this PR. Maybe just comment it out and make the first case case roachpb.LAG_BY_CLUSTER_SETTING, roachpb.LEAD_FOR_GLOBAL_READS.

done


pkg/kv/kvserver/replica_proposal_buf_test.go, line 71 at r14 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

What do you think about adding some tests that address this TODO by capturing MsgProps, decoding them, and asserting that their MinLeaseIndex and ClosedTimestamps are as expected? I think this is the kind of testing we're currently missing with closed timestamps in general.

working on it


pkg/kv/kvserver/replica_raftstorage.go, line 1010 at r14 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Isn't this handled by the call to leasePostApplyLocked above?

rebased on #60634 to get rid of the need to do this


pkg/kv/kvserver/batcheval/cmd_subsume.go, line 164 at r15 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

reply.ClosedTimestamp = cArgs.EvalCtx.FrozenClosedTimestamp(ctx)

done


pkg/kv/kvserver/stateloader/stateloader.go, line 489 at r14 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I'm guessing you'll need a comment on this.

done

@nvanbenschoten (Member) left a comment

:lgtm: mod the additional testing in replica_proposal_buf_test.go. I'll push through #59571 tomorrow to get it out of your way. Thanks for all the work you're putting into this.

Reviewed 1 of 4 files at r17, 15 of 17 files at r18, 31 of 33 files at r19, 11 of 11 files at r20.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done

Sorry, I was unclear. The evalTracker was in a fine location. I meant the clock and the settings, which are both injected dependencies.


pkg/kv/kvserver/replica_proposal_buf.go, line 689 at r14 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

done

👍 thanks.

@andreimatei (Contributor Author) left a comment

Added some unit tests to the propBuf. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/replica_proposal_buf.go, line 143 at r7 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Sorry, I was unclear. The evalTracker was in a fine location. I meant the clock and the settings, which are both injected dependencies.

moved the settings over, but the evalTracker also belongs here with the more static things. Particularly now since I've made Init() take it in as a param (for the benefit of tests).

@nvanbenschoten (Member) left a comment

:lgtm_strong: and all dependencies are in.

Reviewed 1 of 4 files at r22, 15 of 17 files at r23, 32 of 34 files at r24, 11 of 11 files at r25.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @andreimatei)


pkg/kv/kvserver/replica_proposal_buf_test.go, line 203 at r25 (raw file):

	r, ok := ba.GetArg(roachpb.RequestLease)
	if ok {
		lease = &r.(*roachpb.RequestLeaseRequest).Lease

nit: I think the leaseStatus for a RequestLeaseRequest will be different. It should be LeaseStatus{Lease: lease}.


pkg/kv/kvserver/replica_proposal_buf_test.go, line 693 at r25 (raw file):

}

func TestProposalBufferClosedTimestamp(t *testing.T) {

Very nice test! Mind giving it a short comment?


pkg/kv/kvserver/replica_proposal_buf_test.go, line 701 at r25 (raw file):

	clock := hlc.NewClock(mc.UnixNano, time.Nanosecond)
	st := cluster.MakeTestingClusterSettings()
	var tracker mockTracker

nit: any reason not to pull this into the loop next to testProposer{ ... } and then construct it with the lowerBound directly?

@andreimatei (Contributor Author) left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @andreimatei, and @nvanbenschoten)


pkg/kv/kvserver/replica_proposal_buf_test.go, line 203 at r25 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I think the leaseStatus for a RequestLeaseRequest will be different. It should be LeaseStatus{Lease: lease}.

as discussed, pc.lease is right here


pkg/kv/kvserver/replica_proposal_buf_test.go, line 693 at r25 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Very nice test! Mind giving it a short comment?

done; moved up a comment from inside the test


pkg/kv/kvserver/replica_proposal_buf_test.go, line 701 at r25 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: any reason not to pull this into the loop next to testProposer{ ... } and then construct it with the lowerBound directly?

done

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 19, 2021
Discovered while investigating a test failure in cockroachdb#59566.

In 278a21b, we shifted from talking about read and write requests to
locking and non-locking requests when deciding whether a request could
be served on a follower. This prevented locking scans and gets from
being served on followers. However, it began letting lone HeartbeatTxn
and EndTxn requests past the old `!IsReadOnly()` check. Luckily, these
were still prevented from being served on followers because they are
only sent in read-write transactions, which were also prevented from
performing follower reads.

Yesterday, in 0ac8ab9, we lifted this second limitation, allowing
read-write transactions to perform follower reads for non-locking
batches. However, this no longer prevented HeartbeatTxn and EndTxn
requests from being routed and served on follower replicas. This
resulted in a pretty disastrous situation where in very rare cases, a
follower was proposing a write under a lease that it did not own.
Luckily, new assertions added in cockroachdb#59566 caught this.

This commit fixes this oversight by re-introducing "read-only" as a
condition for serving follower reads.

Release note: None
craig bot pushed a commit that referenced this pull request Feb 19, 2021
60758: sql: Added new tables and columns in pg_catalog r=rafiss a=RichardJCai

Taking over #59498 since @mnovelodou is out and we want to get this in before stability.

Previously, the added tables and columns existed in Postgres but not in CockroachDB.
This was inadequate because it may cause errors in tools that expect these
columns to exist.
To address this, this patch contains the result of a diff tool to add what is
missing.

Release note (sql change): Added missing tables and columns at pg_catalog
New columns in table pg_attribute:
        atthasmissing
New columns in table pg_class:
        relrowsecurity
        relforcerowsecurity
        relispartition
        relispopulated
        relreplident
        relrewrite
New columns in table pg_collation:
        collprovider
        collversion
        collisdeterministic
New columns in table pg_index:
        indnkeyatts
New columns in table pg_proc:
        prokind
        prosupport
New columns in table pg_aggregate:
        aggmfinalmodify
        aggfinalmodify
New Tables:
        pg_replication_origin_roname_index
        pg_config
        pg_largeobject_loid_pn_index
        pg_ts_config
        pg_operator_oid_index
        pg_conversion_oid_index
        pg_depend_depender_index
        pg_stat_user_functions
        pg_index_indrelid_index
        pg_group
        pg_db_role_setting_databaseid_rol_index
        pg_statio_user_tables
        pg_partitioned_table_partrelid_index
        pg_publication_oid_index
        pg_publication
        pg_publication_tables
        pg_stat_xact_user_tables
        pg_ts_config_map_index
        pg_replication_origin_roiident_index
        pg_stat_progress_vacuum
        pg_file_settings
        pg_statio_all_indexes
        pg_shseclabel_object_index
        pg_enum_oid_index
        pg_statistic_relid_att_inh_index
        pg_largeobject_metadata_oid_index
        pg_stat_archiver
        pg_statio_sys_sequences
        pg_collation_oid_index
        pg_subscription
        pg_namespace_oid_index
        pg_amop
        pg_stat_database_conflicts
        pg_tablespace_oid_index
        pg_class_oid_index
        pg_range_rngtypid_index
        pg_description_o_c_o_index
        pg_opclass_oid_index
        pg_sequence_seqrelid_index
        pg_trigger_oid_index
        pg_timezone_abbrevs
        pg_ts_parser_oid_index
        pg_transform
        pg_extension_oid_index
        pg_statio_user_sequences
        pg_shdepend_reference_index
        pg_foreign_data_wrapper_oid_index
        pg_ts_config_oid_index
        pg_ts_dict_oid_index
        pg_init_privs_o_c_o_index
        pg_user_mappings
        pg_default_acl_role_nsp_obj_index
        pg_stat_slru
        pg_constraint_contypid_index
        pg_stat_progress_create_index
        pg_transform_type_lang_index
        pg_authid_oid_index
        pg_shmem_allocations
        pg_statio_sys_tables
        pg_statio_all_sequences
        pg_policy_oid_index
        pg_shdepend_depender_index
        pg_attribute_relid_attnum_index
        pg_event_trigger_oid_index
        pg_amproc
        pg_cast_oid_index
        pg_constraint_conparentid_index
        pg_statio_sys_indexes
        pg_conversion_default_index
        pg_statistic_ext
        pg_shadow
        pg_trigger_tgconstraint_index
        pg_ts_template
        pg_cast_source_target_index
        pg_type_oid_index
        pg_amproc_fam_proc_index
        pg_aggregate_fnoid_index
        pg_stat_progress_basebackup
        pg_default_acl_oid_index
        pg_foreign_server_oid_index
        pg_statistic_ext_data_stxoid_index
        pg_transform_oid_index
        pg_language_oid_index
        pg_db_role_setting
        pg_amop_opr_fam_index
        pg_user_mapping_oid_index
        pg_hba_file_rules
        pg_am_oid_index
        pg_statio_all_tables
        pg_statio_user_indexes
        pg_stat_progress_analyze
        pg_replication_origin
        pg_depend_reference_index
        pg_stat_bgwriter
        pg_attrdef_adrelid_adnum_index
        pg_stat_sys_indexes
        pg_statistic_ext_oid_index
        pg_foreign_table_relid_index
        pg_user_mapping_user_server_index
        pg_seclabel_object_index
        pg_subscription_rel_srrelid_srsubid_index
        pg_publication_rel_prrelid_prpubid_index
        pg_inherits_parent_index
        pg_ts_dict
        pg_stat_user_tables
        pg_stat_progress_cluster
        pg_stat_xact_all_tables
        pg_stat_database
        pg_shdescription_o_c_index
        pg_publication_rel_oid_index
        pg_largeobject
        pg_publication_rel
        pg_rewrite_oid_index
        pg_stat_all_indexes
        pg_amop_fam_strat_index
        pg_proc_oid_index
        pg_database_oid_index
        pg_sequences
        pg_subscription_oid_index
        pg_enum_typid_sortorder_index
        pg_ts_parser
        pg_ts_config_map
        pg_attrdef_oid_index
        pg_timezone_names
        pg_ts_template_oid_index
        pg_statistic_ext_relid_index
        pg_index_indexrelid_index
        pg_amop_oid_index
        pg_amproc_oid_index
        pg_rules
        pg_opfamily
        pg_stat_xact_sys_tables
        pg_policies
        pg_constraint_oid_index
        pg_stat_sys_tables
        pg_stat_xact_user_functions
        pg_available_extension_versions
        pg_stat_all_tables
        pg_auth_members_role_member_index
        pg_auth_members_member_role_index
        pg_inherits_relid_seqno_index
        pg_opfamily_oid_index
        pg_class_tblspc_relfilenode_index
        pg_cursors
        pg_stat_gssapi
        pg_stat_ssl
        pg_stat_user_indexes

Fixes #58001

60765: kv: don't serve non-locking, read-write requests on followers r=nvanbenschoten a=nvanbenschoten

Discovered while investigating a test failure in #59566.

In 278a21b, we shifted from talking about read and write requests to
locking and non-locking requests when deciding whether a request could
be served on a follower. This prevented locking scans and gets from
being served on followers. However, it began letting lone HeartbeatTxn
and EndTxn requests past the old `!IsReadOnly()` check. Luckily, these
were still prevented from being served on followers because they are
only sent in read-write transactions, which were also prevented from
performing follower reads.

Yesterday, in 0ac8ab9, we lifted this second limitation, allowing
read-write transactions to perform follower reads for non-locking
batches. However, this no longer prevented HeartbeatTxn and EndTxn
requests from being routed and served on follower replicas. This
resulted in a pretty disastrous situation where in very rare cases, a
follower was proposing a write under a lease that it did not own.
Luckily, new assertions added in #59566 caught this.

This commit fixes this oversight by re-introducing "read-only" as a
condition for serving follower reads.

Release note: None

60781: cluster: use WaitConditionNextExit r=rickystewart a=tbg

There is a chance that this will address an issue that causes spurious
failures on CI, where it looks like we're janking the file system out
from under a running process. For context, see:

#58955

Release note: None


Co-authored-by: richardjcai <caioftherichard@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@nvanbenschoten (Member)

Looks like you need to make generate after the recent rebase. pkg/clusterversion/key_string.go is out of date.

@andreimatei (Contributor Author)

andreimatei commented Feb 20, 2021 via email

This patch introduces a replacement for the existing closed timestamp
mechanism / transport. The new mechanism is gated by a cluster version.

Raft commands now carry increasing closed timestamps generated by the
propBuf by using the recent request Tracker for synchronizing with
in-flight requests (i.e. not closing timestamps below them).
Raft commands get a closed ts field, and the range state gets the field
as well.

The propBuf pays attention to the range's closed timestamp policy for
deciding whether to close lagging or leading timestamps.

Fixes cockroachdb#57395, cockroachdb#57396
Touches cockroachdb#57405

Release note: None
This patch deals with what happens to the RHS's timestamp cache on a
merge. Before this patch, we were either not touching the cache at all,
when the leases of the LHS and RHS were collocated at merge time, or we
were bumping the RHS's ts cache up to the freeze point otherwise
(because, in this case, the RHS's ts cache info has been lost).  This
patch goes further: now we'll bump the RHS ts cache up to the RHS closed
timestamp on the argument that the RHS's closed timestamp is lost.

This patch is needed by the effort to move closed timestamp to be
per-range instead of per-store, and also to have future-time closed
timestamps. Today, the new ts cache bump is not necessary for a fairly
subtle reason: if the pre-merge leases are collocated, then the closed
ts of the RHS is not "lost" because it's the same as the one of the LHS.
If the leases are not collocated, the freeze time of the RHS is
certainly above its closed ts. So, in either case, the current code
doesn't lead to the possibility of accepting writes post-merge that
invalidate previous follower reads.

The RHS's closed timestamp is plumbed from the freeze to the merge
through the Subsume response.

Release note: None
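
A toy sketch of the bump this commit message describes (simplified integers instead of hlc timestamps and a real timestamp cache; taking the maximum of the freeze time and the RHS closed timestamp is an assumption of the sketch, not necessarily the exact rule in the PR):

package main

import "fmt"

// mergedRHSTSCacheFloor sketches the idea: the RHS's timestamp cache and
// closed timestamp information is lost at merge time, so the merged range
// conservatively treats the RHS keyspan as read up to at least this value,
// keeping earlier follower reads on the RHS valid.
func mergedRHSTSCacheFloor(freezeStart, rhsClosedTS int64) int64 {
	if rhsClosedTS > freezeStart {
		return rhsClosedTS
	}
	return freezeStart
}

func main() {
	// A follower read on the RHS at ts=150 stays valid after the merge.
	fmt.Println(mergedRHSTSCacheFloor(100, 150)) // 150
}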
@andreimatei (Contributor Author) left a comment

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15 and @nvanbenschoten)

@craig (Contributor)

craig bot commented Feb 20, 2021

Build succeeded:
