
hlc: strongly type ClockTimestamp as specialization of Timestamp #58349

Merged

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Dec 29, 2020

Closes #57684 - no need to panic, type system now protects us.

This PR splits off a new hlc.ClockTimestamp type from the existing hlc.Timestamp type through a type definition. While the two types share the same memory and proto representation, they have different purposes and properties. Timestamp serves the role of representing an arbitrary timestamp, one that makes no claim about real time. ClockTimestamp serves the role of representing a real timestamp pulled from one of the HLC clocks in the system. Because of this, it has the added capability to update a peer's HLC clock. As such, a clock timestamp is a promise that some node in the system has a clock with a reading equal to or above its value.
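
For readers less familiar with the Go mechanics, the split can be pictured with a minimal, illustrative sketch. This is not the actual pkg/util/hlc code; fields and method bodies are simplified stand-ins:

```go
package hlcsketch

// Timestamp is an arbitrary MVCC timestamp; it makes no claim about real time.
type Timestamp struct {
	WallTime  int64
	Logical   int32
	Synthetic bool
}

// ClockTimestamp is a defined type (not an alias) over Timestamp: same memory
// and wire representation, but zero inherited methods, so every capability has
// to be declared explicitly.
type ClockTimestamp Timestamp

// ToTimestamp upcasts for free: every clock timestamp is a valid timestamp.
func (ct ClockTimestamp) ToTimestamp() Timestamp { return Timestamp(ct) }

// Clock is a stand-in for the HLC clock. Only a ClockTimestamp may be fed to
// it, which is exactly the capability the new type is meant to gate.
type Clock struct{ wall, logical int64 }

// Update forwards the local clock by a peer's clock reading (simplified).
func (c *Clock) Update(ct ClockTimestamp) {
	if ct.WallTime > c.wall {
		c.wall, c.logical = ct.WallTime, int64(ct.Logical)
	}
}
```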

The PR also moves to a world where the ClockTimestamp specialization is maintained through a combination of static and dynamic typing. While the primary mechanisms that use ClockTimestamps will use static typing, Timestamp will also carry a bit indicating whether it can be downcast to a ClockTimestamp. This bit will replace the current flag structure. So instead of an interaction like the one introduced in #57077 checking whether a value has an arbitrary "synthetic" flag set, it will check whether the value has a "clock timestamp" (same mechanism, but slightly more consistent meaning).
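
The dynamic half can be sketched the same way, continuing the illustrative types above. TryToClockTimestamp is the method name that appears in the diff and later discussion; the body here is only a sketch of its shape:

```go
// TryToClockTimestamp attempts the downcast. A timestamp flagged as synthetic
// makes no claim about any real clock, so the downcast fails.
func (t Timestamp) TryToClockTimestamp() (ClockTimestamp, bool) {
	if t.Synthetic {
		return ClockTimestamp{}, false
	}
	return ClockTimestamp(t), true
}
```

A typical call site then looks like the replica_application_state_machine.go hunk reviewed further down: `if clockTS, ok := ts.TryToClockTimestamp(); ok { clock.Update(clockTS) }`.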

This serves as an alternative to an approach like the one in #58213, where we split the types in the other direction, keeping Timestamp to represent a clock timestamp and introducing a new enginepb.TxnTimestamp to represent an arbitrary MVCC timestamp. That change resulted in a significantly larger diff, as it misjudged the extremely small percentage of Timestamp usages that care about the capability of updating remote HLC clocks. It also forced all users (tests, etc.) of timestamps to address this question front and center, which had benefits but was also a burden on all uses of timestamps.

The original intention of this change was to follow it up by inverting the synthetic flag on Timestamp, replacing an "is_synthetic" bit with a "from_clock" bit. While inverting the flag optimized the encoded size of non-clock (currently synthetic) timestamps at the expense of the encoded size of clock timestamps by 2 bytes, it came with major benefits. By making clock timestamps opt-in instead of opt-out, we more closely match the capability model we're trying to establish here with static typing, where a clock timestamp can do everything a normal timestamp can, but can also be used to update an HLC clock. The opt-in nature mitigated the risk of bugs that forget to set this flag correctly. Instead of risking a capability escalation where a non-clock timestamp is incorrectly interpreted as a clock timestamp and used to update an HLC clock, we risked a much less harmful capability de-escalation where a clock timestamp loses its ability to update an HLC clock. We could then much more carefully audit the cases where the flag needs to be unset, such as in the Timestamp.Add and Timestamp.Forward methods.

Unfortunately, this "from_clock" approach (attempted in 4e34e20) came with serious complications as well, which became clear only after making the change. Chiefly, the approach made the mixed-version migration of this field significantly more difficult. With v20.2 nodes unaware of the flag, v21.1 nodes would need to find a way to either auto-assign it for all timestamps coming from v21.1 nodes, or to work around its absence. But this latter idea touches on a second complication – the absence of the field resulted (intentionally) in an MVCC encoding that v20.2 nodes would be unable to decode. So v21.1 nodes would need to be very careful to always set the "from_clock" flag on all timestamps when in a mixed version cluster. But this all made the migration towards not setting the flag once in a v21.1-only cluster even more difficult. In the end, opting into "is_synthetic" is a lot easier to work with than opting into "from_clock", so that's how the rest of this PR operates.
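
Whichever direction the flag points, it has to be maintained through timestamp arithmetic. A sketch of the adopted direction, continuing the illustrative types above (the commit notes further down say only that Timestamp.Add should set the synthetic flag; the body here is an assumption about its shape, not the real implementation):

```go
// Add returns a timestamp offset from the receiver. The result may lie in the
// future of any real clock, so it can no longer promise a clock reading and is
// marked synthetic. Under the abandoned "from_clock" model this would instead
// have cleared a FromClock bit: the same audit, with the opposite default.
func (t Timestamp) Add(wallTime int64, logical int32) Timestamp {
	return Timestamp{
		WallTime:  t.WallTime + wallTime,
		Logical:   t.Logical + logical,
		Synthetic: true,
	}
}
```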

@nvanbenschoten nvanbenschoten requested review from tbg and a team December 29, 2020 19:49
@nvanbenschoten nvanbenschoten requested a review from a team as a code owner December 29, 2020 19:49
@nvanbenschoten nvanbenschoten requested review from dt and removed request for a team December 29, 2020 19:49
@cockroach-teamcity
Member

This change is Reviewable

@dt dt removed their request for review December 29, 2020 20:04
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/clockTimestamp branch 3 times, most recently from 6a37c0c to 13a7a20 Compare December 30, 2020 02:35
@knz
Contributor

knz commented Dec 30, 2020

This change fills me with joy 💯

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/clockTimestamp branch 2 times, most recently from 4450671 to 086903e Compare December 31, 2020 05:54
Member

@tbg tbg left a comment


I like this approach. It's more practical than a giant diff and still highlights all of the places whose handling of timestamps raises questions.

Reviewed 50 of 50 files at r1, 6 of 6 files at r2, 39 of 39 files at r3, 26 of 26 files at r4, 27 of 27 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 700 at r5 (raw file):

	ctx := context.Background()
	origTS := makeTS(123, 0)
	plus10 := origTS.Add(10, 10).SetSynthetic(false)

SetSynthetic could be mistaken for something that mutates the receiver (but I guess so could Add). Does WithSynthetic() sound any better to you? To me it makes it clear that it's returning a new value rather than modifying the receiver. Feel free to ignore.
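
For illustration, the value-returning style suggested here might look like the following sketch, using the illustrative Timestamp type from the PR description above (the real method may differ in detail):

```go
// WithSynthetic returns a copy of the timestamp with the synthetic flag set to
// val. The value receiver guarantees the original is never mutated.
func (t Timestamp) WithSynthetic(val bool) Timestamp {
	t.Synthetic = val
	return t
}
```

The quoted test line would then read `plus10 := origTS.Add(10, 10).WithSynthetic(false)`.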


pkg/kv/kvserver/client_replica_test.go, line 83 at r5 (raw file):

		cfg.TestingKnobs.DisableReplicateQueue = true
		cfg.Clock = nil
		mtc := &multiTestContext{

Sad to see another mtc test introduced here but on the plus side, this test really plays to the strengths of the mtc and can tell us what we need to add to TestCluster to make things work. I wonder if we really "just" need to allow injecting the clocks into TestCluster. If we did that - would you expect this test to carry over?


pkg/kv/kvserver/replica_application_state_machine.go, line 466 at r1 (raw file):

	// Update the batch's max timestamp.
	if clockTS, ok := cmd.replicatedResult().Timestamp.TryToClockTimestamp(); ok {

We will ultimately see both synthetic and real timestamps here, right?


pkg/kv/kvserver/replica_range_lease.go, line 211 at r1 (raw file):

	// future.
	// Currently, this is the interaction that prevents us from making
	// Lease.Start a ClockTimestamp.

Yes, as written this will generalize poorly to non-blocking txns even in the absence of transfers. I think naively if we are trying to do a non-blocking write at some far future synthetic timestamp ts=1000, we will get here with a lease status that has Timestamp: 1000. But we don't want the lease to start at 1000; we still want it to start at now. I think we can really pick Start here as the min between now and status.Timestamp, which can then always be a clock timestamp. Of course this means that non-blocking writes will need to get routed to the "last leaseholder" beyond the expiration of the lease (on expiration-based ranges), i.e. replicas need to understand that they should not even try to get the lease for ts=1000 even though the most recent lease of the leaseholder expires at ts=10; if ts=10 is still in the local future, the leaseholder is trusted to still be around and will have the request forwarded to it. At which point - does it need to grab an expiration-based lease all the way up to ts=1000?

On lease transfers, there should not be an issue because status.Timestamp is set here:

status := r.leaseStatus(ctx, *r.mu.state.Lease, now.ToTimestamp(), r.mu.minLeaseProposedTS)

Either way, this seems all better suited for a follow-up.


pkg/kv/kvserver/replica_range_lease.go, line 875 at r1 (raw file):

	// TODO(nvanbenschoten): this should take the request timestamp and forward
	// now by that timestamp. Should we limit how far in the future this timestamp
	// can lead clock.Now()? Something to do with < now + RangeLeaseRenewalDuration()?

This seems related to my comment above about how expiration-based leases and non-blocking txns interact. I don't think there is a case for request timestamps ahead of the local clock in general (sans the clock uncertainty which we may or may not "remove" by a clock update prior to calling this method).


pkg/kv/kvserver/replica_range_lease.go, line 914 at r1 (raw file):

//  referring to? It appears to have rotted.
//
// TODO(nvanbenschoten): reword comment to account for request timestamp.

As you address this mind also whacking the above TODO.


pkg/kv/kvserver/store_send.go, line 93 at r1 (raw file):

		} else {
			// If the command appears to come from a node with a bad clock,
			// reject it now before we reach that point.

now that you're touching it, reword the opaque "that point" reference.


pkg/kv/kvserver/store_send.go, line 95 at r1 (raw file):

			// reject it now before we reach that point.
			var err error
			if err = s.cfg.Clock.UpdateAndCheckMaxOffset(ctx, baClockTS); err != nil {

nit: if err := ... and lose the local
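
The nit refers to the standard Go pattern of scoping the error to the if statement. A self-contained sketch, where checkOffset is a stand-in for the UpdateAndCheckMaxOffset call in the quoted hunk:

```go
package main

import (
	"errors"
	"fmt"
)

// checkOffset stands in for s.cfg.Clock.UpdateAndCheckMaxOffset above.
func checkOffset() error { return errors.New("clock offset too large") }

func main() {
	// Before: the error variable outlives the check.
	var err error
	if err = checkOffset(); err != nil {
		fmt.Println("reject:", err)
	}

	// After: the error is scoped to the if statement, so no local leaks out.
	if err := checkOffset(); err != nil {
		fmt.Println("reject:", err)
	}
}
```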


pkg/sql/catalog/lease/lease.go, line 203 at r2 (raw file):

			return err
		}
		expiration := txn.ReadTimestamp().Add(int64(s.jitteredLeaseDuration()), 0)

random idea: AddNanos might be useful since we very often do just that.
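
A sketch of the idea, building on the illustrative Timestamp type from earlier. AddNanos is hypothetical and does not exist in the package as of this PR (the reply below adds a TODO instead):

```go
// AddNanos returns the timestamp advanced by the given number of nanoseconds,
// leaving the logical component untouched.
func (t Timestamp) AddNanos(nanos int64) Timestamp {
	return t.Add(nanos, 0)
}
```

The quoted call site would then shrink to `expiration := txn.ReadTimestamp().AddNanos(int64(s.jitteredLeaseDuration()))`.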


pkg/storage/mvcc.go, line 1680 at r5 (raw file):

			// If we bumped the WriteTimestamp, we update both the TxnMeta and the
			// MVCCMetadata.Timestamp.
			if txnMeta.WriteTimestamp != writeTimestamp {

I guess this is fine even if for some reason writeTimestamp actually regressed, but I'm curious why you made this change.


pkg/util/hlc/hlc.go, line 294 at r1 (raw file):

// callers that intend to use the returned timestamp to update a peer's
// HLC clock should use this method.
func (c *Clock) NowAsClockTimestamp() ClockTimestamp {

I assume you considered type ClockTimestamp struct { ts Timestamp } (and similar shenanigans) and found that it didn't add enough to be worth the hassle, right?


pkg/util/hlc/legacy_timestamp.proto, line 34 at r5 (raw file):

  optional int32 logical = 2 [(gogoproto.nullable) = false];
  // Indicates that the Timestamp did not come from an HLC clock somewhere
  // in the system and, therefore, does not have has the ability to update

does not have


pkg/util/hlc/legacy_timestamp.proto, line 43 at r5 (raw file):

  // This ensures that the timestamp encoding does not change across nodes
  // that are and are not aware of this field.
  optional bool synthetic = 3;

This is fine because decoding a bool(1) into an uint32 (as a 20.2 node will do) will give exactly the flag back?


pkg/util/hlc/timestamp.go, line 368 at r1 (raw file):

}

// String implements the fmt.Formatter interface.

fmt.Stringer


pkg/util/hlc/timestamp.proto, line 40 at r5 (raw file):

  int32 logical = 2;
  // Indicates that the Timestamp did not come from an HLC clock somewhere
  // in the system and, therefore, does not have has the ability to update

have has


pkg/util/hlc/timestamp_test.go, line 295 at r3 (raw file):

	}{
		// Logical portion can be omitted.
		{"0", makeTS(0, 0), "0,0"},

what about -0.00?

Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs! I tacked on two very small commits that fell out of this review.

The remaining discussion that I'm still hopeful I can trick Tobi into continuing is about lease extensions and lease transfers. It took me a while to think through how these interactions should combine with future-time operations, and I'm sure I've missed things along the way.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 700 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

SetSynthetic could be mistaken for something that mutates the receiver (but I guess so could Add). Does WithSynthetic() sound any better to you? To me it makes it clear that it's returning a new value rather than modifying the receiver. Feel free to ignore.

Done.


pkg/kv/kvserver/client_replica_test.go, line 83 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sad to see another mtc test introduced here but on the plus side, this test really plays to the strengths of the mtc and can tell us what we need to add to TestCluster to make things work. I wonder if we really "just" need to allow injecting the clocks into TestCluster. If we did that - would you expect this test to carry over?

I believe that's the only blocker here, but I don't know enough about the limits of TestCluster to say for sure.


pkg/kv/kvserver/replica_application_state_machine.go, line 466 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

We will ultimately see both synthetic and real timestamps here, right?

Right. The corresponding clock update in replicaAppBatch.ApplyToStateMachine has always intrigued me. It says:

// Update the node clock with the maximum timestamp of all commands in the
// batch. This maintains a high water mark for all ops serviced, so that
// received ops without a timestamp specified are guaranteed one higher than
// any op already executed for overlapping keys.

How do you feel about these ops without a timestamp, and non-transactional operations in general? Are they a relic of the past? Do you see a world in which we push to get rid of them?

I also don't fully understand why this below-Raft clock update is even needed at all. Non-leaseholder nodes won't be able to serve consistent reads anyway until they become the leaseholder, at which point they could update their clock in leasePostApply (like we do in MergeRange). And if they are the leaseholder, they will have already updated their clock above Raft. So is this all just for clock stability?


pkg/kv/kvserver/replica_range_lease.go, line 211 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yes, as written this will generalize poorly to non-blocking txns even in the absence of transfers. I think naively if we are trying to do a non-blocking write at some far future synthetic timestamp ts=1000, we will get here with a lease status that has Timestamp: 1000. But we don't want the lease to start at 1000; we still want it to start at now. I think we can really pick Start here as the min between now and status.Timestamp, which can then always be a clock timestamp. Of course this means that non-blocking writes will need to get routed to the "last leaseholder" beyond the expiration of the lease (on expiration-based ranges), i.e. replicas need to understand that they should not even try to get the lease for ts=1000 even though the most recent lease of the leaseholder expires at ts=10; if ts=10 is still in the local future, the leaseholder is trusted to still be around and will have the request forwarded to it. At which point - does it need to grab an expiration-based lease all the way up to ts=1000?

On lease transfers, there should not be an issue because status.Timestamp is set here:

status := r.leaseStatus(ctx, *r.mu.state.Lease, now.ToTimestamp(), r.mu.minLeaseProposedTS)

Either way, this seems all better suited for a follow-up.

The thing that scared me about backdating status.Timestamp to now was the "It's up to us to ensure that Lease.Start is greater than the end time of the previous lease" comment. It's possible that status.Timestamp is past the expiration of the current lease (which is why we're here) but that now is not. If we propose a new lease with a start time below the expiration of the current lease, we may cause serious issues. In practice, we'll just get an error from cmd_lease_request.go. However, if we propose a new lease with a start time too far in the future, cmd_lease_request.go will take care of backdating the lease start time as far back as is legal.

Meanwhile, things get even more subtle for reads in the future, in which case the leaseholder absolutely needs to have a lease that extends all the way up past the read timestamp — unless we put future-time reads through Raft, which seems bad. But we don't want future-time operations resulting in extremely long leases that reduce the availability of the system after crashes. This is why I was thinking (down below) that we limit how far an operation can be in the future. We basically say that an operation can't be more than RangeLeaseRenewalDuration (the portion of a lease during which we happily serve operations without trying to extend it) in advance of now. This will still give us 4.5 seconds of runway without needing to support requests that are so far in the future that they need a lease extension any further than would be provided by a reacquisition at time.Now(). As you point out, such lease extensions would be tough to support for expiration-based leases and it's unclear how they would work at all for epoch-based leases.

Finally, the behavior of lease transfers is also kind of tricky and is the subject of a follow-on PR (which I wasn't going to force you to review unless you had any interest 😉). The problem is that today, lease transfers properly synchronize with ongoing requests through a delicate dance with the HLC clock. Every request bumps the clock, then checks for an ongoing lease transfer, then evaluates. Lease transfers block ongoing requests, then grab a clock reading, then start the new lease at this clock reading. But this is broken if we don't bump the clock on all reads. My current plan is as follows:

  • we stop using custom synchronization for lease transfers. Instead, we use latching to provide the synchronization. This ensures that we don't only block new requests, but also flush out in-flight requests
  • for this to work, we need to push the lease check for read/write requests below latching. This seems fine, and even avoids complications like this
  • now that latching provides sufficient mutual exclusion, we can forward the new lease during a lease transfer by the maximum value in the timestamp cache - thereby picking up any future-time operations

This not only allows lease transfers to work in this... forward-looking world, but also comes with a few other benefits:

  1. it naturally addresses this longstanding TODO of allowing read requests below lease transfers if they are more than the maximum clock offset in the past
  2. it avoids custom synchronization
  3. it more closely aligns TransferLeaseRequest and SubsumeRequest, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of range's portion of the timestamp cache
  4. it more closely aligns the lease acquisition handling with the handling of MergeInProgressError by classifying a new InvalidLeaseError as a "concurrencyRetryError" (see isConcurrencyRetryError)
  5. it pulls observed timestamp handling a layer down, which I think I'll use to address your plumbing comments on roachpb/storage: introduce localUncertaintyLimit, teach MVCC about synthetic timestamps #57077
  6. it moves one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers to avoid retry errors on the new leaseholder

While you're here, do you mind giving me a sanity check on that plan?


pkg/kv/kvserver/replica_range_lease.go, line 875 at r1 (raw file):

I don't think there is a case for request timestamps ahead of the local clock in general

I'm confused by this. If a transaction has its write timestamp bumped and then refreshes its read timestamp to a future time, won't this result in a request timestamp ahead of the local clock?


pkg/kv/kvserver/replica_range_lease.go, line 914 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

As you address this mind also whacking the above TODO.

👍


pkg/kv/kvserver/store_send.go, line 93 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

now that you're touching it, reword the opaque "that point" reference.

Done.


pkg/kv/kvserver/store_send.go, line 95 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: if err := ... and lose the local

Done.


pkg/sql/catalog/lease/lease.go, line 203 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

random idea: AddNanos might be useful since we very often do just that.

Added a TODO.


pkg/storage/mvcc.go, line 1680 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I guess this is fine even if for some reason writeTimestamp actually regressed, but I'm curious why you made this change.

I actually made the change for an earlier version of this PR. It was needed in 4e34e20 because in that case, it was possible for writeTimestamp to be "promoted" to a clock timestamp. So without this, the FromClock flag was getting out of sync between the txnMeta and the txn and hitting an assertion in putIntentMeta. It shouldn't be strictly needed now, but this still seems like the right thing to do now that Less is only encoding a partial ordering.


pkg/util/hlc/hlc.go, line 294 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I assume you considered type ClockTimestamp struct { ts Timestamp } (and similar shenanigans) and found that it didn't add enough to be worth the hassle, right?

Yeah, I tried a number of different approaches here and found this to be the best.

Something like type ClockTimestamp struct { ts Timestamp } wasn't possible because generated code refers to the various fields in the struct. It also makes the pointer cast ((*Timestamp)(clock_ts)) impossible.

Embedding worked but had two downsides. First, it allowed ClockTimestamp to inherit all of Timestamp's methods that weren't explicitly overridden. That seemed dangerous and I found that having to think through each override was preferable. Second, it allowed users to reach inside the ClockTimestamp and pull out a .Timestamp, which was confusing. This was more of an issue in #58213, where the "embedded" type had more privileges than the "embedding" type, but it's still not great here.

This type def seems to be the best approach.
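
For reference, the three shapes weighed in this thread, in illustrative form (the names here are made up for the comparison; only the last corresponds to what the PR does):

```go
// 1) Wrapper struct: rejected because gogoproto-generated code needs direct
//    field access, and the free pointer cast ((*Timestamp)(clockTS)) is lost.
type WrappedClockTimestamp struct{ ts Timestamp }

// 2) Embedding: rejected because every Timestamp method not explicitly
//    overridden is silently inherited, and callers can reach in via
//    ct.Timestamp to escape the type.
type EmbeddedClockTimestamp struct{ Timestamp }

// 3) Type definition (the approach taken): same memory layout, no inherited
//    methods, every capability re-declared deliberately.
type DefinedClockTimestamp Timestamp
```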


pkg/util/hlc/legacy_timestamp.proto, line 34 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

does not have

Done.


pkg/util/hlc/legacy_timestamp.proto, line 43 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This is fine because decoding a bool(1) into an uint32 (as a 20.2 node will do) will give exactly the flag back?

Right, although the flags field didn't even exist in 20.2, so we're doubly ok.
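
A quick sketch of why the wire bytes line up: both a bool field and a uint32 field use the varint wire type, so for field number 3 the tag byte is (3<<3)|0 = 0x18 and the value 1 encodes identically. Hand-rolled here rather than via a proto library, purely for illustration:

```go
package main

import "fmt"

func main() {
	tag := byte(3<<3 | 0) // field number 3, wire type 0 (varint) => 0x18

	syntheticTrue := []byte{tag, 0x01} // `optional bool synthetic = 3` set to true
	flagsOne := []byte{tag, 0x01}      // a uint32 field 3 holding the old flag value 1

	fmt.Printf("% x vs % x\n", syntheticTrue, flagsOne) // 18 01 vs 18 01
}
```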


pkg/util/hlc/timestamp.go, line 368 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

fmt.Stringer

Done.


pkg/util/hlc/timestamp.proto, line 40 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

have has

Done.


pkg/util/hlc/timestamp_test.go, line 295 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

what about -0.00?

Done.

Member

@tbg tbg left a comment


Reviewed 100 of 100 files at r6, 7 of 7 files at r7, 39 of 39 files at r8, 26 of 26 files at r9, 28 of 28 files at r10, 3 of 3 files at r11, 1 of 1 files at r12.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/replica_application_state_machine.go, line 466 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Right. The corresponding clock update in replicaAppBatch.ApplyToStateMachine has always intrigued me. It says:

// Update the node clock with the maximum timestamp of all commands in the
// batch. This maintains a high water mark for all ops serviced, so that
// received ops without a timestamp specified are guaranteed one higher than
// any op already executed for overlapping keys.

How do you feel about these ops without a timestamp, and non-transactional operations in general? Are they a relic of the past? Do you see a world in which we push to get rid of them?

I also don't fully understand why this below-Raft clock update is even needed at all. Non-leaseholder nodes won't be able to serve consistent reads anyway until they become the leaseholder, at which point they could update their clock in leasePostApply (like we do in MergeRange). And if they are the leaseholder, they will have already updated their clock above Raft. So is this all just for clock stability?

In general, I would like to get rid of non-transactional operations to simplify our model. "Luckily", we have always prevented non-transactional ops from spanning ranges, so we're not so far off. For anything that's point-based, we have the 1PC optimization. I think the main challenge is around our use of Increment, for which we never want to lay down an intent. I think our SQL sequences rely on the performance characteristics of the current approach, and we need to make sure we don't regress. However, a one-off txn with an increment in it should pretty reliably hit the 1PC optimization; when it does not, it should be possible to get these on the server-side refresh path:

// evaluateWriteBatchWithServersideRefreshes invokes evaluateBatch and retries
// at a higher timestamp in the event of some retriable errors if allowed by the
// batch/txn.
//
// deadline, if not nil, specifies the highest timestamp (exclusive) at which
// the request can be evaluated. If ba is a transactional request, then dealine
// cannot be specified; a transaction's deadline comes from it's EndTxn request.
func (r *Replica) evaluateWriteBatchWithServersideRefreshes(
	ctx context.Context,
	idKey kvserverbase.CmdIDKey,
	rec batcheval.EvalContext,
	ms *enginepb.MVCCStats,
	ba *roachpb.BatchRequest,
	latchSpans *spanset.SpanSet,
	deadline *hlc.Timestamp,
) (batch storage.Batch, br *roachpb.BatchResponse, res result.Result, pErr *roachpb.Error) {

if they don't hit that already. Are you aware of anything else that could get in our way? We can also consider the Require1PC flag here, though I don't like that flag (1PC is sometimes impossible forever, like when your batch spans ranges); I think we really wanted a "don't lay down intents" flag?

By the way, we could in principle freely move (within the constraints of existing clock timestamps) any 1pc txns we receive, right? That is, in this code

ok, pErr := r.canAttempt1PCEvaluation(ctx, ba, latchSpans)
if pErr != nil {
	return nil, enginepb.MVCCStats{}, nil, result.Result{}, pErr
}
if ok {
	res := r.evaluate1PC(ctx, idKey, ba, latchSpans)

we could simply bump the txn to the local clock (basically a trivial refresh)? I wonder if that's a good optimization in general, as it would lead to contending 1PC txns originating on different gateways to serialize perfectly. Either way, this could allow us to keep the "assign timestamps late where it's useful for performance" behavior while at the same time phasing out non-transactional requests.


pkg/kv/kvserver/replica_range_lease.go, line 211 at r1 (raw file):

The thing that scared me about backdating status.Timestamp to now was the "It's up to us to ensure that Lease.Start is greater than the end time of the previous lease" comment.

I agree that it's pretty convoluted. I think at the heart of it is that status.Timestamp is kind of silly. It's the timestamp for which we generated the status, great. That is useful to know, and it's not always a clock timestamp (always assuming non-blocking txns now). But now we're here and we are actually trying to get a lease, so status.Timestamp should no longer play any role except that we may want to check that the lease we're getting does indeed cover status.Timestamp (true now, not necessarily true with exp-based leases and future writes, see below for my thoughts on those). More generally, does it make any sense to specify the start timestamp in the lease request? It does not, because as you mention the lease evaluation picks the start timestamp for you:

isExtension := prevLease.Replica.StoreID == newLease.Replica.StoreID
effectiveStart := newLease.Start
// Wind the start timestamp back as far towards the previous lease as we
// can. That'll make sure that when multiple leases are requested out of
// order at the same replica (after all, they use the request timestamp,
// which isn't straight out of our local clock), they all succeed unless
// they have a "real" issue with a previous lease. Example: Assuming no
// previous lease, one request for [5, 15) followed by one for [0, 15)
// would fail without this optimization. With it, the first request
// effectively gets the lease for [0, 15), which the second one can commit
// again (even extending your own lease is possible; see below).
//
// If this is our lease (or no prior lease exists), we effectively absorb
// the old lease. This allows multiple requests from the same replica to
// merge without ticking away from the minimal common start timestamp. It
// also has the positive side-effect of fixing #3561, which was caused by
// the absence of replay protection.
if prevLease.Replica.StoreID == 0 || isExtension {
	effectiveStart.Backward(prevLease.Start)
	// If the lease holder promised to not propose any commands below
	// MinProposedTS, it must also not be allowed to extend a lease before that
	// timestamp. We make sure that when a node restarts, its earlier in-flight
	// commands (which are not tracked by the spanlatch manager post restart)
	// receive an error under the new lease by making sure the sequence number
	// of that lease is higher. This in turn is achieved by forwarding its start
	// time here, which makes it not Equivalent() to the preceding lease for the
	// same store.
	//
	// Note also that leasePostApply makes sure to update the timestamp cache in
	// this case: even though the lease holder does not change, the sequence
	// number does and this triggers a low water mark bump.
	//
	// The bug prevented with this is unlikely to occur in practice
	// since earlier commands usually apply before this lease will.
	if ts := args.MinProposedTS; isExtension && ts != nil {
		effectiveStart.Forward(*ts)
	}
} else if prevLease.Type() == roachpb.LeaseExpiration {
	effectiveStart.Backward(prevLease.Expiration.Next())
}
if isExtension {
	if effectiveStart.Less(prevLease.Start) {
		rErr.Message = "extension moved start timestamp backwards"
		return newFailedLeaseTrigger(false /* isTransfer */), rErr
	}

(so all you can "win" by specifying something here is an error if you overlap the previous lease). What muddies the waters further is that the code that picks the start timestamp above picks it from the expiration of the previous lease, and the expiration cannot be a clock timestamp - though we "know" (more like - hope, as it is based on trusting the code here to not mess up) that the clock has since advanced past the previous expiration and so we can force it to be a clock timestamp.

I think the way this should work is: we don't specify a start timestamp for the lease in the LeaseRequest, and we don't specify the expiration either; but we do specify the duration. During evaluation, we compare a clock reading (perhaps we send a clock reading with the request - that might be more idiomatic than a random call in the eval code) with the old lease's expiration (which of course will show that the old lease expiration can be interpreted as a clock timestamp now, so it's really just asserting that) and this determines the new start timestamp. The new expiration is then just that plus the specified lease duration (if expiration based); we can compute something from the Expiration timestamp if provided (mixed-version case).

Meanwhile, things get even more subtle for reads in the future, in which case the leaseholder absolutely needs to have a lease that extends all the way up past the read timestamp

Have you considered adopting a strategy where non-blocking txns start with ReadTimestamp=now, WriteTimestamp=future? This will force them to refresh (if they do any reads) but it might eliminate the constraint that the non-blocking txn has to be kept within a lease duration, which I find problematic (the lease duration should not influence what future txns are possible; this looks like a major leak). We should be able to get away with allowing an intent for ts=1000 to be laid down under a lease that expires at ts=500, right? And any problems with non-blocking txns getting starved could be mitigated by appropriate in-memory locking. By defaulting to letting future txns' reads starve, we are defaulting to not disrupting foreground traffic (assuming the future txn keeps restarting itself into the future too, not sure what the plan there is), which I think is also desirable (as a default).

While you're here, do you mind giving me a sanity check on that plan?

This makes 100% sense to me and I look forward to it happening. Lease transfers are something I need to sleuth the code for whenever I touch them and it never sticks. What you're outlining looks like the natural (and memorable) way to do it. Feel free to ping me on the PR when it's out.


pkg/kv/kvserver/replica_range_lease.go, line 875 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think there is a case for request timestamps ahead of the local clock in general

I'm confused by this. If a transaction has its write timestamp bumped and then refreshes its read timestamp to a future time, won't this result in a request timestamp ahead of the local clock?

My comment was poorly written - I meant without non-blocking txns, there isn't a case for future request timestamps in general.


pkg/kv/kvserver/replica_range_lease.go, line 914 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

👍

Were you planning to do this here?

Instead, use methods that will ensure proper interactions with the upcoming
Synthetic field.
Adapted from cockroachdb#57077.

Also switch over TestIntentInterleavingIter and TestIntentDemuxWriter.

No need to re-write the parsing logic again. We'll also want this to use
flags in the next commit.

This causes a large diff because all nanos are now considered seconds.
This doesn't actually change any behavior in the tests themselves.
This commit replaces Timestamp's (and LegacyTimestamp's) `flags` field
with a single boolean field called `synthetic`. This shaves a byte off
the encoded message when the field is set and makes synthetic timestamps
easier to work with. If we ever decide that we need to go back on this
because we need more flags (unlikely), we can without breaking anything.

The commit also starts the process of updating tests to handle the fact
that `Timestamp.Add` should set the synthetic flag. However, it stops
before completing the change because it turned out to be fairly large
and not a pre-requisite for non-blocking transactions.
This was only used in one very old test (added in 1123717), which could be
easily adapted to no longer need the very specific hook.
These methods were implementing the `fmt.Stringer` interface, not the
`fmt.Formatter` interface.
Member Author

@nvanbenschoten nvanbenschoten left a comment


I'm going to merge this since the remaining discussion is about leases and the intention isn't to resolve that discussion in this PR. Meanwhile, this is getting hammered with diffs vs. master. Thanks again for the review!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/kv/kvserver/replica_application_state_machine.go, line 466 at r1 (raw file):

Are you aware of anything else that could get in our way?

Not to my knowledge. You raise a good point that any operation that relies on being a non-txn request must be a single-key operation; otherwise, it risks being promoted to a txn request. And so 1PC should ensure that these operations behave the same as non-txn requests on the server. So in that sense, using the Require1PC flag doesn't seem too far off from the "don't lay down intents" flag you want.

By the way, we could in principle freely move (within the constraints of existing clock timestamps) any 1pc txns we receive, right

Yep, we actually already do this with "server-side refreshes" - see canDoServersideRetry and tryBumpBatchTimestamp. The only thing I think we're missing is that to do this, a 1PC operation will actually fail its 1PC attempt and then will evaluate completely as a non-1PC batch that writes intents and then immediately resolves them.

I opened #58459 to track this.


pkg/kv/kvserver/replica_range_lease.go, line 211 at r1 (raw file):
Thanks, this was all very helpful. I had been thinking that we would need to support cases where a lease had not yet expired at time.Now() but was expired at a future request time by grabbing a new lease that starts at the future request time. I no longer think that's correct. Leases should never have a start time in the future. Instead, we should redirect to the current leaseholder in such cases (if on a different node) and force the current leaseholder to extend its lease if it does not cover the future-time request (if on the correct node). This will all work as long as we don't allow requests further than a lease duration into the future (which is all we need to support). It will also allow us to make lease.Start a ClockTimestamp. I'll do this in a follow-on PR.

Have you considered adopting a strategy where non-blocking txns start with ReadTimestamp=now, WriteTimestamp=future?

This is mostly how things will work. In fact, they'll start with ReadTimestamp=now, WriteTimestamp=now. The WriteTimestamp will then be bumped by any non-blocking range's closed timestamps. Finally, the ReadTimestamp will be pulled up by a refresh.

We should be able to get away with allowing an intent for ts=1000 to be laid down under a lease that expires at ts=500, right?

Right. But it's less clear whether we should allow reads or reading writes (those that bump the tscache) to operate at ts=1000 under a lease that expires at ts=500.

And any problems with non-blocking txns getting starved could be mitigated by appropriate in-memory locking. By defaulting to letting future txns' reads starve, we are defaulting to not disrupting foreground traffic (assuming the future txn keeps restarting itself into the future too, not sure what the plan there is), which I think is also desirable (as a default).

I don't think I follow here. Are you saying that the reads would just wait until the lease's expiration was high enough to serve the read?

This makes 100% sense to me and I look forward to it happening. Lease transfers are something I need to sleuth the code for whenever I touch them and it never sticks. What you're outlining looks like the natural (and memorable) way to do it. Feel free to ping me on the PR when it's out.

👍 thanks for checking.


pkg/kv/kvserver/replica_range_lease.go, line 875 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

My comment was poorly written - I meant without non-blocking txns, there isn't a case for future request timestamps in general.

Right, without non-blocking txns, there's no reason for future request timestamps. But with non-blocking txns, we need to support them.


pkg/kv/kvserver/replica_range_lease.go, line 914 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Were you planning to do this here?

No, I was going to do both in the PR where I add support for lease transfers after future time reads.

@craig
Contributor

craig bot commented Jan 5, 2021

Build succeeded:

@craig craig bot merged commit a148c4f into cockroachdb:master Jan 5, 2021
Member

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_range_lease.go, line 211 at r1 (raw file):

I don't think I follow here. Are you saying that the reads would just wait until the lease's expiration was high enough to serve the read?

I was in some fantasy land in which the problem below doesn't exist.

Right. But it's less clear whether we should allow reads or reading writes (those that bump the tscache) to operate at ts=1000 under a lease that expires at ts=500.

I agree that timestamp cache updates that fall outside of the current lease must not happen. I was hoping that there's some way to work around this due to the fact that non-blocking txns start reading at a clock timestamp and that only changes through refreshes, but the refresh would be to whatever the write timestamp happens to be, and so we are definitely going to be constrained by the lease durations. Instead of trying to work around that via lease extensions (which are not easy as you pointed out), can we make refreshes block (at the replica) until the lease does cover the desired timestamp (or the leaseholder changes)? That way, non-blocking txns can basically not worry too much about the lease durations; the refresh will return as soon as it can.

@tbg tbg self-requested a review January 6, 2021 13:30
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/clockTimestamp branch January 7, 2021 03:13
@lunevalex lunevalex added the A-multiregion Related to multi-region label Jan 12, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 1, 2021
This commit adds a `doc.go` file to the `pkg/util/hlc` package that details the
properties and the uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by cockroachdb#72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in cockroachdb#58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a `Timestamp` back to a `ClockTimestamp` with
`TryToClockTimestamp()`, it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from [section 4 of the SIGMOD 2020 paper](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134).
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 10, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 11, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 12, 2021
craig bot pushed a commit that referenced this pull request Nov 12, 2021
72278: hlc: document properties and uses of Hybrid Logical Clocks r=nvanbenschoten a=nvanbenschoten

The godoc is currently hosted [here](http://34.75.249.248:8000/pkg/github.com/cockroachdb/cockroach/pkg/util/hlc/). I'll leave that up for a few days.


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>