
hlc: document properties and uses of Hybrid Logical Clocks #72278

Merged: 1 commit into cockroachdb:master from nvanbenschoten/hlcDocs on Nov 12, 2021

Conversation

nvanbenschoten
Member

The godoc is currently hosted here. I'll leave that up for a few days.


This commit adds a doc.go file to the pkg/util/hlc package that details the
properties and uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by #72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in #58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a Timestamp back to a ClockTimestamp with
TryToClockTimestamp(), it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from section 4 of the SIGMOD 2020 paper.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

additional protection here by persisting the wall time of the clock
periodically, although this protection is disabled by default.

Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 242 at r1 (raw file):

 - SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether
   this relies on bounded clock skew of liveness or for safety. May need to talk
   to Andrew about it. See LeaseManager.timeRemaining.

I don't believe that the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.

The LeaseManager you're referring to here is startupmigrations.LeaseManager. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.

Contributor

@bdarnell bdarnell left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

// transactions commit with higher timestamps, even if their read and writes
// sets do not conflict with the original transaction's. This obviates the
// need for uncertainty intervals and prevents the "causal reverse" anamoly

After reading the new file, it seems to me that the removed text here is still accurate (or at least it could be if we fixed up this linearizable option to work as advertised). I'm drawing a distinction between the uncertainty interval and other things that require clock synchronization within max-offset (e.g. non-cooperative lease transfers).


pkg/util/hlc/doc.go, line 30 at r1 (raw file):

clock.

There are three channels through which HLC timestamps are passed between nodes

Early on we considered sprinkling more HLC channels around the code (e.g. on periodic heartbeats). These channels would not do anything meaningful for causality tracking, but they might help avoid pushes/restarts occasionally. I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.

Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc? It's interesting to think about the possibility of using separate clocks for these purposes. For example, I don't think the clock used for lease transfers necessarily needs to be the singleton node clock. This might help avoid a high-lock-contention global object. OTOH, observed timestamps basically assume a single clock so maybe there's not much to be gained here.


pkg/util/hlc/doc.go, line 72 at r1 (raw file):

   more, see pkg/kv/kvserver/observedts/doc.go.

 - Non-transactional requests. Most operations in CockroachDB are transactional

"Most KV operations"


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

   just concerned about a transaction restarting at a timestamp above the local
   clock back then because we had yet to separate the "clock timestamp" domain
   from the "transaction timestamp" domain.

I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, ajwerner wrote…

Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

Do SQL pods need this? I don't think they do. Strict monotonicity is important for KV pods but I think SQL pods would not cause correctness problems if their timestamps moved backwards across a restart (within max-offset).

Also, if monotonicity of timestamps for SQL pods were required, remember that what matters is the interval from the last timestamp generated by a previous incarnation of the node to the first timestamp generated by the new incarnation. All the time spent on the scale-down and scale-up process itself still counts, so you shouldn't actually need much of a sleep.
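
To make that concrete, here is a minimal sketch in Go of the kind of startup wait being discussed; the function name and parameters are hypothetical, not the actual server code. Only whatever remains of the gap after the restart itself needs to be slept off:

```go
package main

import (
	"fmt"
	"time"
)

// waitForClockToPassPrevious sleeps only for whatever remains of the gap between
// the highest wall time persisted by the previous incarnation and the current
// clock reading. If the restart already took longer than the gap, it returns
// immediately.
func waitForClockToPassPrevious(prevMaxWallNanos int64, now func() int64) {
	if delta := time.Duration(prevMaxWallNanos - now()); delta > 0 {
		time.Sleep(delta)
	}
}

func main() {
	// Pretend the previous incarnation persisted a wall time 250ms in our future.
	prev := time.Now().Add(250 * time.Millisecond).UnixNano()
	waitForClockToPassPrevious(prev, func() int64 { return time.Now().UnixNano() })
	fmt.Println("local clock has passed the previous incarnation's max wall time")
}
```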


pkg/util/hlc/doc.go, line 206 at r1 (raw file):

   replicas to both consider themselves leaseholders at the same time. This can
   not lead to stale reads for transactional requests, because a transaction
   with an uncertainty interval that extends past a lease's expiration will not

Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.


pkg/util/hlc/doc.go, line 213 at r1 (raw file):

   use uncertainty intervals, but the mechanics differ slightly.

   TODO(nvanbenschoten): is that really it? It's the only thing I've ever been

That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.

This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.

Contributor

@andreimatei andreimatei left a comment

:lgtm_strong:

nice, thanks for writing

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 58 at r1 (raw file):

   lease's start time. Upon application of this Raft entry, the incoming
   leaseholder forwards its HLC to this clock reading, transitively ensuring
   that its clock is >= the new lease's start time.

consider spelling out why the lease recipient should have a clock higher than the lease start.


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

   node, and by extension, at the time that the transaction began. This allows
   the transaction to avoid uncertainty restarts in some circumstances. For
   more, see pkg/kv/kvserver/observedts/doc.go.

do you want to hint at the subtleties around pushed intents here?


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.

AFAICS, 72fa944 doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.


pkg/util/hlc/doc.go, line 139 at r1 (raw file):

 - Transaction uncertainty intervals. The single-key linearizability property is
   satisfied in CockroachDB by tracking an uncertainty interval for each
   transaction, within which the causal ordering between two transactions is

s/causal ordering/real-time ordering ?

Member Author

@nvanbenschoten nvanbenschoten left a comment

Thanks for the reviews!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.

FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:

After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?
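
To make the scenario above easy to follow, here is a toy, self-contained sketch of the anomaly and of how an uncertainty interval catches it. The integer timestamps and the max offset of 10 are assumptions for illustration; none of the names come from the codebase.

```go
package main

import "fmt"

func main() {
	const maxOffset = 10 // assumed clock-skew bound for this toy example

	// A write commits at ts=10, chosen by a node whose clock reads 15.
	committedTS := 10

	// A read that starts later in real time runs on a node with a slow clock
	// and picks read timestamp ts=5.
	readTS := 5

	// Without an uncertainty interval, the read only sees values <= 5 and
	// silently misses the committed value at ts=10: a stale read.
	staleWithoutUncertainty := committedTS > readTS

	// With an uncertainty interval [readTS, readTS+maxOffset] = [5, 15], the
	// value at ts=10 is "uncertain", so the reader must restart above it rather
	// than ignore it, preserving single-key linearizability.
	caughtByUncertainty := committedTS > readTS && committedTS <= readTS+maxOffset

	fmt.Println("stale without uncertainty:", staleWithoutUncertainty) // true
	fmt.Println("caught by uncertainty interval:", caughtByUncertainty) // true
}
```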


pkg/util/hlc/doc.go, line 30 at r1 (raw file):

I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.

These three came from an audit of the code, so it doesn't look like we did. I added some words to that effect.

Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc?

That's a good idea. I'll point each use of causality at the corresponding channel.


pkg/util/hlc/doc.go, line 58 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider spelling out why the lease recipient should have a clock higher than the lease start.

Done.


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

do you want to hint at the subtleties around pushed intents here?

Done.


pkg/util/hlc/doc.go, line 72 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Most KV operations"

Done.


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

AFAICS, 72fa944 doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.

But why was this needed? We don't pull the next epoch's timestamp from the clock, we pull it straight from the error:

cockroach/pkg/roachpb/data.go

Lines 1004 to 1015 in 72fa944

case *ReadWithinUncertaintyIntervalError:
	// If the reader encountered a newer write within the uncertainty
	// interval, we advance the txn's timestamp just past the last observed
	// timestamp from the node.
	ts, ok := txn.GetObservedTimestamp(pErr.OriginNode)
	if !ok {
		log.Fatalf(ctx,
			"missing observed timestamp for node %d found on uncertainty restart. "+
				"err: %s. txn: %s. Observed timestamps: %s",
			pErr.OriginNode, pErr, txn, txn.ObservedTimestamps)
	}
	txn.Timestamp.Forward(ts)

Maybe we were concerned at that time about a transaction being given a timestamp that was greater than its local HLC clock?


pkg/util/hlc/doc.go, line 107 at r1 (raw file):
I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because ensureClockMonotonicity is called in server.Server.PreStart. SQL Pods don't use a server.Server, they use a server.SQLServer.

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in Clock.UpdateAndCheckMaxOffset, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward by max_offset-1 until it crashed a KV node.

Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What it shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction, and not for correctness. So I propose that we gate the Clock.Update call in Store.Send on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.
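
Roughly what that gating could look like, as a sketch only; `maybeUpdateClock` and `fromSystemTenant` are illustrative names rather than the real `Store.Send` plumbing, and the `hlc.Clock.Update(ClockTimestamp)` signature is assumed from the current package:

```go
package kvtrust // illustrative package name

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// maybeUpdateClock sketches the proposal: only trusted (system-tenant) clients
// get to forward the server's HLC with the clock reading attached to their request.
func maybeUpdateClock(clock *hlc.Clock, remote hlc.ClockTimestamp, fromSystemTenant bool) {
	if fromSystemTenant {
		clock.Update(remote)
	}
	// Requests from secondary tenants are still served, but their clock readings
	// are ignored, so a tenant with a far-future clock cannot repeatedly drag the
	// KV node's HLC forward until it trips the max-offset check and crashes.
}
```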


pkg/util/hlc/doc.go, line 139 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/causal ordering/real-time ordering ?

Done.


pkg/util/hlc/doc.go, line 206 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.

Ah, you're right. I tried to make that change in #58904, but didn't get this far. Fixed.


pkg/util/hlc/doc.go, line 213 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.

This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.

Is this accounting for the need to check the end-of-uncertainty timestamp against the lease expiration if we removed the stasis gap? If we had a higher max-offset, wouldn't the utility window of leases become smaller, eventually to the point where leases could not be used at all?


pkg/util/hlc/doc.go, line 242 at r1 (raw file):

Previously, ajwerner wrote…
 - SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether
   this relies on bounded clock skew of liveness or for safety. May need to talk
   to Andrew about it. See LeaseManager.timeRemaining.

I don't believe that the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.

The LeaseManager you're referring to here is startupmigrations.LeaseManager. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.

Got it, thanks for the confirmation. I'll remove this section then.

Member

@tbg tbg left a comment

Very nice to have all of this brain juice in one digestible text.

Reviewed 3 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.

FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:

After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?

That makes sense at least to me. I would even find it useful to include this reasoning in the comment.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because ensureClockMonotonicity is called in server.Server.PreStart. SQL Pods don't use a server.Server, they use a server.SQLServer.

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in Clock.UpdateAndCheckMaxOffset, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward by max_offset-1 until it crashed a KV node.

Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What it shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction, and not for correctness. So I propose that we gate the Clock.Update call in Store.Send on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.

Sounds like something we should file an issue for.


pkg/util/hlc/doc.go, line 25 at r2 (raw file):

Causality tracking

HLCs provide causality tracking through their logical component upon each

This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.
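
For reference, the textbook HLC receive rule (Kulkarni et al.) makes that point explicit: the logical component only comes into play when wall-clock readings collide. This is a generic sketch of the algorithm, not the hlc package's implementation.

```go
package main

import "fmt"

// Timestamp is a generic hybrid logical clock reading.
type Timestamp struct {
	Wall    int64
	Logical int32
}

// update applies the textbook HLC receive rule: the wall component becomes the max
// of the local clock, the remote reading, and physical time; the logical component
// only advances (to break ties) when the chosen wall time equals one of the inputs.
func update(local, remote Timestamp, physicalNow int64) Timestamp {
	wall := local.Wall
	if remote.Wall > wall {
		wall = remote.Wall
	}
	if physicalNow > wall {
		wall = physicalNow
	}
	switch {
	case wall == local.Wall && wall == remote.Wall:
		return Timestamp{wall, maxInt32(local.Logical, remote.Logical) + 1}
	case wall == local.Wall:
		return Timestamp{wall, local.Logical + 1}
	case wall == remote.Wall:
		return Timestamp{wall, remote.Logical + 1}
	default: // physical time leads both clocks; the logical counter resets
		return Timestamp{wall, 0}
	}
}

func maxInt32(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Wall times collide, so causality is carried by the logical component.
	fmt.Println(update(Timestamp{Wall: 100, Logical: 3}, Timestamp{Wall: 100, Logical: 7}, 99)) // {100 8}
	// Physical time strictly leads both clocks, so the logical component is moot.
	fmt.Println(update(Timestamp{Wall: 100, Logical: 3}, Timestamp{Wall: 100, Logical: 7}, 105)) // {105 0}
}
```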


pkg/util/hlc/doc.go, line 35 at r2 (raw file):

 - Raft (unidirectional): proposers of Raft commands (i.e. leaseholders) attach
   clock readings to these command, which are later consumed by followers when
   commands are applied to their Raft state machine.

Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through `type RaftCommand struct`. I don't think that you mean the WriteTimestamp nor the ClosedTimestamp.


pkg/util/hlc/doc.go, line 37 at r2 (raw file):

   commands are applied to their Raft state machine.

 - BatchRequest API (bidirectional): clients and servers of the KV BatchRequest

Ditto, are you talking about this one here?

message Header {
  // timestamp specifies time at which reads or writes should be performed. If
  // the timestamp is set to zero value, its value is initialized to the wall
  // time of the server node.
  //
  // Transactional requests are not allowed to set this field; they must rely on
  // the server to set it from txn.ReadTimestamp. Also, for transactional
  // requests, writes are performed at the provisional commit timestamp
  // (txn.WriteTimestamp).
  util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false];

But this isn't really populated with hlc clock readings.


pkg/util/hlc/doc.go, line 44 at r2 (raw file):

   readings back to the root of the flow. Currently, this only takes place on
   errors, and relates to the "Transaction retry errors" interaction detailed
   below.

Ditto


pkg/util/hlc/doc.go, line 96 at r2 (raw file):

 - Non-transactional requests (Raft + BatchRequest channels). Most KV operations
   in CockroachDB are transactional and receive their read timestamps from their
   client. They use an uncertainty interval (see below) to avoid stale reads in

"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.


pkg/util/hlc/doc.go, line 98 at r2 (raw file):

   client. They use an uncertainty interval (see below) to avoid stale reads in
   the presence of clock skew. However, the KV API also exposes the option to
   send a single-range, strongly consistent, "non-transaction" request. These

maybe a clearer way to phrase this is that

the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a predetermined read timestamp; instead, one is chosen from the HLC upon arrival at the leaseholder for the range. Since the HLC clock always leads the timestamp of any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.
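
A toy model of the invariant behind that wording, with made-up names and abstract integer timestamps (not real APIs): because the leaseholder's clock is ratcheted up to every write it serves, a non-transactional read that takes its timestamp from that clock on arrival cannot land below a committed write on the range.

```go
package main

import "fmt"

// leaseholder models a range's leaseholder: its clock is forwarded to every write
// timestamp it serves, so any later clock reading is >= all writes on the range.
type leaseholder struct {
	clock int64
}

func (l *leaseholder) serveWrite(ts int64) {
	if ts > l.clock {
		l.clock = ts
	}
}

// serveNonTxnRead picks the read timestamp from the local clock on arrival, which
// is therefore at or above every write already served here.
func (l *leaseholder) serveNonTxnRead() int64 {
	l.clock++ // a fresh clock reading
	return l.clock
}

func main() {
	lh := &leaseholder{clock: 100}
	lh.serveWrite(120) // a write whose timestamp led the local clock
	fmt.Println(lh.serveNonTxnRead() >= 120) // true: the read cannot be stale
}
```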


pkg/util/hlc/doc.go, line 123 at r2 (raw file):

Strict monotonicity

HLCs provide strict monotonicity within and across restarts on a single node.

maybe add "as implemented by CockroachDB"


pkg/util/hlc/doc.go, line 132 at r2 (raw file):

Strictly monotonic timestamp allocation ensures that two causally dependent
transactions originating from the same node are given timestamps that reflect

Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.


pkg/util/hlc/doc.go, line 213 at r2 (raw file):

   more, see LeaseState_UNUSABLE.

   Note however that it is easy to overstate the salient point here if one is

understate?


pkg/util/hlc/doc.go, line 231 at r2 (raw file):

   with an uncertainty interval that extends past a lease's expiration will not
   be able to use that lease to perform a read (which is enforced by a stasis
   period immediately before its expiration). However, because some

It's interesting that during a cooperative lease transfer, we don't have to pick a start timestamp for the next read that takes into account any MaxOffset checks that were applied for reads under the current lease. I suppose that this makes sense, given that success of the lease transfer implies that there wasn't a competing leaseholder, and if there was, the transfer fails and so no harm done.


pkg/util/hlc/doc.go, line 256 at r2 (raw file):
maybe add

HAZARD: this mode of operation is completely untested.

so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.


pkg/util/hlc/doc.go, line 266 at r2 (raw file):

clocks offset from other nodes. This best-effort validation is done in the
RemoteClockMonitor. If any node exceeds the configured maximum offset by more
than 80% compared to a majority of other nodes, it self-terminates.

Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.

Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That makes sense at least to me. I would even find it useful to include this reasoning in the comment.

Done.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sounds like something we should file an issue for.

Done in #72663.


pkg/util/hlc/doc.go, line 25 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.

Done.


pkg/util/hlc/doc.go, line 35 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through `type RaftCommand struct`. I don't think that you mean the WriteTimestamp nor the ClosedTimestamp.

I do mean the WriteTimestamp, which is used to update the clock on followers here. I added a link. But also, I'm coming at that code with an axe right after this, and I'll be changing this wording slightly at that point.


pkg/util/hlc/doc.go, line 37 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto, are you talking about this one here?

message Header {
  // timestamp specifies time at which reads or writes should be performed. If
  // the timestamp is set to zero value, its value is initialized to the wall
  // time of the server node.
  //
  // Transactional requests are not allowed to set this field; they must rely on
  // the server to set it from txn.ReadTimestamp. Also, for transactional
  // requests, writes are performed at the provisional commit timestamp
  // (txn.WriteTimestamp).
  util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false];

But this isn't really populated with hlc clock readings.

Added a few links. I'm also coming to clean that code up soon 😃 Some of this is a little aspirational for the next week or so.


pkg/util/hlc/doc.go, line 44 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


pkg/util/hlc/doc.go, line 96 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.

Done.


pkg/util/hlc/doc.go, line 98 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe a clearer way to phrase this is that

the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a predetermined read timestamp; instead, one is chosen from the HLC upon arrival at the leaseholder for the range. Since the HLC clock always leads the timestamp of any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.

Done. That is clearer, thanks.


pkg/util/hlc/doc.go, line 123 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe add "as implemented by CockroachDB"

Done.


pkg/util/hlc/doc.go, line 132 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.

Done.


pkg/util/hlc/doc.go, line 213 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

understate?

What I meant is that the bounded skew is not relied upon to keep the actual MVCC intervals owned by leases disjoint.


pkg/util/hlc/doc.go, line 256 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe add

HAZARD: this mode of operation is completely untested.

so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.

I like it. Done.


pkg/util/hlc/doc.go, line 266 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.

Done.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Done.

Done? I don't see the word "intent" anywhere.


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

   Separately, when a leaseholder
   on a given node serves a write, it ensures that the node's HLC clock is >=
   the written_timestamp of the write.

Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?

This commit adds a `doc.go` file to the `pkg/util/hlc` package that details the
properties and uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by cockroachdb#72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in cockroachdb#58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a `Timestamp` back to a `ClockTimestamp` with
`TryToClockTimestamp()`, it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from section 4 of the SIGMOD 2020 paper.
Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Done? I don't see the word "intent" anywhere.

I made a reference to the written_timestamp above, but I'll be more explicit and say that this does not change even if an intent is pushed during resolution. Done.


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
   Separately, when a leaseholder
   on a given node serves a write, it ensures that the node's HLC clock is >=
   the written_timestamp of the write.

Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?

We want to get rid of hlc.Clock.Update(Timestamp). That doesn't mean we have to get rid of ensuring that a node's HLC clock is >= the WrittenTimestamp (note: this is a new timestamp, see #72121 (comment)) of a write, it just means that the write is originally written with a WrittenTimestamp in some cases. FWIW, this is already the case for future time writes — just picture s/synthetic_bit=true/written_timestamp=now()/ in cases where now() < write_timestamp.

@nvanbenschoten
Member Author

bors r+

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

it just means that the write is originally written with a WrittenTimestamp in some cases.

Right... But what are these cases going to be? Any timestamp above the current lease expiration? So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?

Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

But what are these cases going to be?

Cases where the Txn.WriteTimestamp is greater than the leaseholder's local HLC clock. The idea is that we'll no longer "push" operation timestamps into the local HLC clock. Instead, we'll (optionally) push into it with separate clock readings and then pull from it to determine a write's WrittenTimestamp. As an optimization, if this WrittenTimestamp is equal to or greater than its MVCC Timestamp (the common case), then we don't need to explicitly include the WrittenTimestamp in the key/value. Otherwise, we do. And this WrittenTimestamp is what ObservedTimestamps are compared against.

Ignoring the clock reading from secondary tenants just increases the chance that we'll fall into this second bucket and need to write both timestamps into a key/value.
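
A sketch of that derivation, illustrative only: the `WrittenTimestamp` here is the value under discussion in #72121, not a shipped field, and the simplified integer type stands in for hlc.Timestamp/hlc.ClockTimestamp.

```go
package main

import "fmt"

// writtenTimestamp sketches the proposed rule: pull a reading from the local HLC
// when the value is laid down; that reading is the write's WrittenTimestamp and
// is what observed timestamps would be compared against.
func writtenTimestamp(clockNow, mvccTS int64) (written int64, storeExplicitly bool) {
	written = clockNow
	if written >= mvccTS {
		// Common case: the clock reading leads (or equals) the MVCC timestamp,
		// so nothing extra needs to be stored in the key/value.
		return written, false
	}
	// The MVCC timestamp leads the local clock (e.g. a future-time write, which
	// today carries the synthetic bit); record the clock reading alongside the value.
	return written, true
}

func main() {
	fmt.Println(writtenTimestamp(105, 100)) // 105 false — common case
	fmt.Println(writtenTimestamp(100, 120)) // 100 true  — future-time write
}
```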

Any timestamp above the current lease expiration?

No, writes still won't be allowed with MVCC timestamps above the current lease expiration. Although maybe they could be as long as their WrittenTimestamp (hlc.ClockTimestamp) is below it and they did not read? I'd have to think through that. But that's not the current proposal.

So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?

The first part about trust is correct. We would ignore the clock reading from untrusted gateways.

The second part about linearizability is not correct. We'd still ensure linearizability as long as the untrusted tenant's HLC is within the max_offset of every other node in the system. But if it isn't, I guess that tenant would be susceptible to stale reads.

Want to talk through this at some point to make sure that I'm not missing something before I go off and pick this fight with the code?

@craig
Contributor

craig bot commented Nov 12, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 12, 2021

Build succeeded:

@craig craig bot merged commit 5cb931b into cockroachdb:master Nov 12, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/hlcDocs branch November 17, 2021 19:50
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
This commit eliminates the primary mechanism that we use to pass clock
information from a leaseholder, through Raft log entries, to a Range's
followers. As we found in cockroachdb#72278, this was only needed for correctness
in a few specific cases — namely lease transfers and range merges. These
two operations continue to pass clock signals through more explicit
channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert
a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp`
method. As outlined in cockroachdb#72121 (comment),
I would like to remove the ability to downcast a "data-plane" `Timestamp` to
a "control-plane" `ClockTimestamp` entirely. This will clarify the role
of `ClockTimestamps` in the system and clean up the channels through
which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is
in `Store.Send`, at the boundary of KV RPCs. I would also like to get
rid of this, but doing so needs to wait on cockroachdb#73732.
craig bot pushed a commit that referenced this pull request Feb 8, 2022
76095: kv: don't pass clock information through Raft log r=nvanbenschoten a=nvanbenschoten

This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in #72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in #72121 (comment), I would like to remove the ability to downcast a "data-plane" `Timestamp` to a "control-plane" `ClockTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on #73732.

76163: bazel: remove old protos when generating new ones r=ajwerner a=ajwerner

This is what the Makefile did. It was painful to have the old ones because
they'd lead to spurious diffs.

Release note: None

76166: sql: support version numbers on descriptor validation r=fqazi a=fqazi

Previously, the descriptor validation code did not
take a version number, so it was not possible to version
gate new validation logic. This was inadequate because
when new fields are introduced we don't want their validation
to kick in for certain cases like the debug doctor, such as any
new fields with non-zero defaults. To address this, this patch adds support
for version numbers inside the validation, and updates unit tests
to pass this in as well. It also adds a new option on debug doctor
to run a version of validation.

Release note (cli change): Add new optional version argument
to the doctor examine command. This can be used to enable /
disable validation when examining older zip directories.

76188: ci: make sure metamorphic nightly uses proper formatter for issues r=nicktrav a=rickystewart

The `--formatter=pebble-metamorphic` option got lost in #75585.

Release note: None

76191: gazelle: exclude `.pb.go` files r=dt a=rickystewart

Should prevent Gazelle from getting confused and adding these to `srcs`.

Release note: None

76192: ci: make sure extra env vars are set for `roachtest` jobs r=rail a=rickystewart

We [need these](https://github.com/cockroachdb/cockroach/blob/5e7690d6da09821ff431ef32fe8d1430d05aed9f/pkg/cmd/internal/issues/issues.go#L159-L171).

Release note: None

76194: tree: remove TODOs about bytea/float cast volatility r=mgartner a=rafiss

These comments aren't needed, since the current cast volatility is
correct, as far as I can tell.

We might want to report this as a bug in Postgres if it describes these
casts as immutable incorrectly.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 8, 2022
Before this change, we were updating the local clock with each BatchResponse's
WriteTimestamp. This was meant to handle cases where the batch request timestamp
was forwarded during evaluation.

This was unnecessary for two reasons. The first is that BatchRequest can lead
the local HLC clock (explored in cockroachdb#72121 and cockroachdb#72278) as long as any clock reading
information in the values reflects the state of the HLC clock (synthetic bit
today, "local timestamp" tomorrow). The second is that even if this first reason
was not the case, the BatchRequest will only ever be bumped one logical tick
above any existing value in the range, so as long as the existing values in the
range followed the rule of trailing the leaseholder's HLC (NB: they don't), all
new writes would too.
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
This commit eliminates the primary mechanism that we use to pass clock
information from a leaseholder, through Raft log entries, to a Range's
followers. As we found in cockroachdb#72278, this was only needed for correctness
in a few specific cases — namely lease transfers and range merges. These
two operations continue to pass clock signals through more explicit
channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert
a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp`
method. As outlined in cockroachdb#72121 (comment),
I would like to remove the ability to downcast a "data-plane" `Timestamp` to
a "control-plane" `ClockTimestamp` entirely. This will clarify the role
of `ClockTimestamps` in the system and clean up the channels through
which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is
in `Store.Send`, at the boundary of KV RPCs. I would also like to get
rid of this, but doing so needs to wait on cockroachdb#73732.