
hlc: document properties and uses of Hybrid Logical Clocks #72278

Merged: 1 commit into cockroachdb:master from nvanbenschoten/hlcDocs on Nov 12, 2021

Conversation

nvanbenschoten
Member

The godoc is currently hosted here. I'll leave that up for a few days.


This commit adds a doc.go file to the pkg/util/hlc package that details the
properties and uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by #72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in #58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a Timestamp back to a ClockTimestamp with
TryToClockTimestamp(), it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from section 4 of the SIGMOD 2020 paper.

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

additional protection here by persisting the wall time of the clock
periodically, although this protection is disabled by default.

Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 242 at r1 (raw file):

 - SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether
   this relies on bounded clock skew of liveness or for safety. May need to talk
   to Andrew about it. See LeaseManager.timeRemaining.

I don't believe that the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.

The LeaseManager you're referring to here is startupmigrations.LeaseManager. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.

Contributor

@bdarnell bdarnell left a comment

Reviewed 4 of 4 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @nvanbenschoten, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

// transactions commit with higher timestamps, even if their read and writes
// sets do not conflict with the original transaction's. This obviates the
// need for uncertainty intervals and prevents the "causal reverse" anamoly

After reading the new file, it seems to me that the removed text here is still accurate (or at least it could be if we fixed up this linearizable option to work as advertised). I'm drawing a distinction between the uncertainty interval and other things that require clock synchronization within max-offset (e.g. non-cooperative lease transfers).


pkg/util/hlc/doc.go, line 30 at r1 (raw file):

clock.

There are three channels through which HLC timestamps are passed between nodes

Early on we considered sprinkling more HLC channels around the code (e.g. on periodic heartbeats). These channels would not do anything meaningful for causality tracking, but they might help avoid pushes/restarts occasionally. I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.

Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc? It's interesting to think about the possibility of using separate clocks for these purposes. For example, I don't think the clock used for lease transfers necessarily needs to be the singleton node clock. This might help avoid a high-lock-contention global object. OTOH, observed timestamps basically assume a single clock so maybe there's not much to be gained here.


pkg/util/hlc/doc.go, line 72 at r1 (raw file):

   more, see pkg/kv/kvserver/observedts/doc.go.

 - Non-transactional requests. Most operations in CockroachDB are transactional

"Most KV operations"


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

   just concerned about a transaction restarting at a timestamp above the local
   clock back then because we had yet to separate the "clock timestamp" domain
   from the "transaction timestamp" domain.

I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, ajwerner wrote…

Have we considered these protections in the context of serverless sql pods? Is this going to end up being an onerous amount of waiting when scaling back up?

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

Do SQL pods need this? I don't think they do. Strict monotonicity is important for KV pods but I think SQL pods would not cause correctness problems if their timestamps moved backwards across a restart (within max-offset).

Also, if monotonicity of timestamps for SQL pods were required, remember that what matters is the interval from the last timestamp generated by a previous incarnation of the node to the first timestamp generated by the new incarnation. All the time spent on the scale-down and scale-up process itself still counts, so you shouldn't actually need much of a sleep.
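
To make that concrete, here is a minimal sketch in Go of the kind of startup wait being discussed; the function name and parameters are hypothetical, not the actual server code. Only whatever remains of the gap after the restart itself needs to be slept off:

```go
package main

import (
	"fmt"
	"time"
)

// waitForClockToPassPrevious sleeps only for whatever remains of the gap between
// the highest wall time persisted by the previous incarnation and the current
// clock reading. If the restart already took longer than the gap, it returns
// immediately.
func waitForClockToPassPrevious(prevMaxWallNanos int64, now func() int64) {
	if delta := time.Duration(prevMaxWallNanos - now()); delta > 0 {
		time.Sleep(delta)
	}
}

func main() {
	// Pretend the previous incarnation persisted a wall time 250ms in our future.
	prev := time.Now().Add(250 * time.Millisecond).UnixNano()
	waitForClockToPassPrevious(prev, func() int64 { return time.Now().UnixNano() })
	fmt.Println("local clock has passed the previous incarnation's max wall time")
}
```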


pkg/util/hlc/doc.go, line 206 at r1 (raw file):

   replicas to both consider themselves leaseholders at the same time. This can
   not lead to stale reads for transactional requests, because a transaction
   with an uncertainty interval that extends past a lease's expiration will not

Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.


pkg/util/hlc/doc.go, line 213 at r1 (raw file):

   use uncertainty intervals, but the mechanics differ slightly.

   TODO(nvanbenschoten): is that really it? It's the only thing I've ever been

That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.

This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.

Contributor

@andreimatei andreimatei left a comment

:lgtm_strong:

nice, thanks for writing

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 58 at r1 (raw file):

   lease's start time. Upon application of this Raft entry, the incoming
   leaseholder forwards its HLC to this clock reading, transitively ensuring
   that its clock is >= the new lease's start time.

consider spelling out why the lease recipient should have a clock higher than the lease start.


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

   node, and by extension, at the time that the transaction began. This allows
   the transaction to avoid uncertainty restarts in some circumstances. For
   more, see pkg/kv/kvserver/observedts/doc.go.

do you want to hint at the subtleties around pushed intents here?


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't remember exactly what was going on here but I think it's plausible that this is not necessary for correctness, but is needed to ensure that the clock is advanced far enough that it doesn't immediately hit the same retryable error again.

AFAICS, 72fa944 doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.


pkg/util/hlc/doc.go, line 139 at r1 (raw file):

 - Transaction uncertainty intervals. The single-key linearizability property is
   satisfied in CockroachDB by tracking an uncertainty interval for each
   transaction, within which the causal ordering between two transactions is

s/causal ordering/real-time ordering ?

Member Author

@nvanbenschoten nvanbenschoten left a comment

Thanks for the reviews!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):
I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.

FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:

After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?
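
To make the scenario above easy to follow, here is a toy, self-contained sketch of the anomaly and of how an uncertainty interval catches it. The integer timestamps and the max offset of 10 are assumptions for illustration; none of the names come from the codebase.

```go
package main

import "fmt"

func main() {
	const maxOffset = 10 // assumed clock-skew bound for this toy example

	// A write commits at ts=10, chosen by a node whose clock reads 15.
	committedTS := 10

	// A read that starts later in real time runs on a node with a slow clock
	// and picks read timestamp ts=5.
	readTS := 5

	// Without an uncertainty interval, the read only sees values <= 5 and
	// silently misses the committed value at ts=10: a stale read.
	staleWithoutUncertainty := committedTS > readTS

	// With an uncertainty interval [readTS, readTS+maxOffset] = [5, 15], the
	// value at ts=10 is "uncertain", so the reader must restart above it rather
	// than ignore it, preserving single-key linearizability.
	caughtByUncertainty := committedTS > readTS && committedTS <= readTS+maxOffset

	fmt.Println("stale without uncertainty:", staleWithoutUncertainty) // true
	fmt.Println("caught by uncertainty interval:", caughtByUncertainty) // true
}
```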


pkg/util/hlc/doc.go, line 30 at r1 (raw file):

I'm not sure if we'd ever introduce those at this point, but we might want to tweak the language here to say that these are the three ways that we convey HLC timestamps for causal purposes.

These three came from an audit of the code, so it doesn't look like we did. I added some words to that effect.

Could we go even further and say that e.g. the raft channel supports causality for cooperative lease transfers, and BatchRequest for observed timestamps, etc?

That's a good idea. I'll point each use of causality at the corresponding channel.


pkg/util/hlc/doc.go, line 58 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

consider spelling out why the lease recipient should have a clock higher than the lease start.

Done.


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

do you want to hint at the subtleties around pushed intents here?

Done.


pkg/util/hlc/doc.go, line 72 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"Most KV operations"

Done.


pkg/util/hlc/doc.go, line 96 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

AFAICS, 72fa944 doesn't deal with updating the gateway's clock; it only deals with ensuring that the txn restarts above the uncertain value it encountered. Which I guess is also what Ben is saying.

But why was this needed? We don't pull the next epoch's timestamp from the clock, we pull it straight from the error:

cockroach/pkg/roachpb/data.go

Lines 1004 to 1015 in 72fa944

case *ReadWithinUncertaintyIntervalError:
	// If the reader encountered a newer write within the uncertainty
	// interval, we advance the txn's timestamp just past the last observed
	// timestamp from the node.
	ts, ok := txn.GetObservedTimestamp(pErr.OriginNode)
	if !ok {
		log.Fatalf(ctx,
			"missing observed timestamp for node %d found on uncertainty restart. "+
				"err: %s. txn: %s. Observed timestamps: %s",
			pErr.OriginNode, pErr, txn, txn.ObservedTimestamps)
	}
	txn.Timestamp.Forward(ts)

Maybe we were concerned at that time about a transaction being given a timestamp that was greater than its local HLC clock?


pkg/util/hlc/doc.go, line 107 at r1 (raw file):
I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because ensureClockMonotonicity is called in server.Server.PreStart. SQL Pods don't use a server.Server, they use a server.SQLServer.

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in Clock.UpdateAndCheckMaxOffset, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward by max_offset-1 until it crashed a KV node.

Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What it shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction, and not for correctness. So I propose that we gate the Clock.Update call in Store.Send on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.
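
Roughly what that gating could look like, as a sketch only; `maybeUpdateClock` and `fromSystemTenant` are illustrative names rather than the real `Store.Send` plumbing, and the `hlc.Clock.Update(ClockTimestamp)` signature is assumed from the current package:

```go
package kvtrust // illustrative package name

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

// maybeUpdateClock sketches the proposal: only trusted (system-tenant) clients
// get to forward the server's HLC with the clock reading attached to their request.
func maybeUpdateClock(clock *hlc.Clock, remote hlc.ClockTimestamp, fromSystemTenant bool) {
	if fromSystemTenant {
		clock.Update(remote)
	}
	// Requests from secondary tenants are still served, but their clock readings
	// are ignored, so a tenant with a far-future clock cannot repeatedly drag the
	// KV node's HLC forward until it trips the max-offset check and crashes.
}
```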


pkg/util/hlc/doc.go, line 139 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

s/causal ordering/real-time ordering ?

Done.


pkg/util/hlc/doc.go, line 206 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Really? I was not aware that we checked the end-of-uncertainty timestamp against the lease expiration. But that sounds like a much more elegant solution to the main issue here than the stasis period.

Ah, you're right. I tried to make that change in #58904, but didn't get this far. Fixed.


pkg/util/hlc/doc.go, line 213 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

That may be it; I can't remember anything else right now. In the early days we cared a lot more about non-transactional operations, so they motivated a lot of stuff like this.

This is interesting because the usage of max-offset in non-cooperative lease transfers is a big part of the reason you can't just set max-offset to an extremely high value. If we added uncertainty intervals to non-transactional operations (as proposed above) and could avoid the max-offset gap between consecutive leases (and I guess the "strict monotonic" sleep at startup), then you could run with a high max-offset. You'd get read uncertainty errors all the time, but you could retry and rely on observed timestamps to make progress. This would be kind of like a distributed timestamp oracle for folks who insist that they can't get reasonable clock synchronization.

Is this accounting for the need to check the end-of-uncertainty timestamp against the lease expiration if we removed the stasis gap? If we had a higher max-offset, wouldn't the utility window of leases become smaller, eventually to the point where leases could not be used at all?


pkg/util/hlc/doc.go, line 242 at r1 (raw file):

Previously, ajwerner wrote…
 - SQL descriptor leases. TODO(nvanbenschoten): learn how this works and whether
   this relies on bounded clock skew of liveness or for safety. May need to talk
   to Andrew about it. See LeaseManager.timeRemaining.

I don't believe that the descriptor leases rely on bounded clock skew for correctness in any way. The chosen lease expirations are in the MVCC domain and act as transaction deadlines.

The LeaseManager you're referring to here is startupmigrations.LeaseManager. I do believe that thing will lose its mutual exclusion under severe skew. Fortunately, I don't think the startup migrations which touch descriptors (as they existed in the past) were susceptible to any correctness issues in the absence of mutual exclusion; they always used read-write to interact with descriptors. Those startup migrations are decreasingly used. I don't believe they have been used since 20.2 for anything new. They primarily set up new state in the cluster. I'd be happy if that code were deleted.

Got it, thanks for the confirmation. I'll remove this section then.

Member

@tbg tbg left a comment

Very nice to have all of this brain juice in one digestible text.

Reviewed 3 of 4 files at r1, 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think we still need an uncertainty interval for the same reason that readers on GLOBAL tables need to occasionally wait. Without them, it would be possible for a read on a node with a fast clock (ts@15) to observe a committed value (ts@10) and then a later read on a node with a slow clock (ts@5) to miss the committed value.

FWIW the reason Spanner doesn't need this is because it holds its locks across the commit-wait duration. So the node with a fast clock would get stuck waiting on the locks and would effectively commit-wait as well. See section 4.2.1 from the Spanner paper:

After commit wait, the coordinator sends the commit timestamp to the client and all other participant leaders. Each participant leader logs the transaction’s outcome through Paxos. All participants apply at the same timestamp and then release locks.

Does this check out to you, or am I still missing something, maybe related to your "fixed up to work as advertised" comment?

That makes sense at least to me. I would even find it useful to include this reasoning in the comment.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think you're right that SQL pods don't need this. Luckily, it doesn't look like they get it today, because ensureClockMonotonicity is called in server.Server.PreStart. SQL Pods don't use a server.Server, they use a server.SQLServer.

Furthermore, what sort of latency tracking do we do between a tenant's pods and the host? Do we have any sort of protection against a tenant causing havoc by using far-future timestamps? Feels like something we need to enter into the threat model.

This is a good point. This is an attack vector that we have not considered to date. We have some protection in that we will detect and reject unreasonable clock jumps in Clock.UpdateAndCheckMaxOffset, but this isn't comprehensive. A malicious tenant could keep pushing the clock forward by max_offset-1 until it crashed a KV node.

Part of the purpose of this documentation effort was to better understand when and why clock synchronization is needed. What it shows is that the BatchRequest channel mostly isn't needed, especially in the client->server direction, and not for correctness. So I propose that we gate the Clock.Update call in Store.Send on whether the client is trusted or not. If the request is from a secondary tenant, we can skip it.

Sounds like something we should file an issue for.


pkg/util/hlc/doc.go, line 25 at r2 (raw file):

Causality tracking

HLCs provide causality tracking through their logical component upon each

This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.
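
For reference, the textbook HLC receive rule (Kulkarni et al.) makes that point explicit: the logical component only comes into play when wall-clock readings collide. This is a generic sketch of the algorithm, not the hlc package's implementation.

```go
package main

import "fmt"

// Timestamp is a generic hybrid logical clock reading.
type Timestamp struct {
	Wall    int64
	Logical int32
}

// update applies the textbook HLC receive rule: the wall component becomes the max
// of the local clock, the remote reading, and physical time; the logical component
// only advances (to break ties) when the chosen wall time equals one of the inputs.
func update(local, remote Timestamp, physicalNow int64) Timestamp {
	wall := local.Wall
	if remote.Wall > wall {
		wall = remote.Wall
	}
	if physicalNow > wall {
		wall = physicalNow
	}
	switch {
	case wall == local.Wall && wall == remote.Wall:
		return Timestamp{wall, maxInt32(local.Logical, remote.Logical) + 1}
	case wall == local.Wall:
		return Timestamp{wall, local.Logical + 1}
	case wall == remote.Wall:
		return Timestamp{wall, remote.Logical + 1}
	default: // physical time leads both clocks; the logical counter resets
		return Timestamp{wall, 0}
	}
}

func maxInt32(a, b int32) int32 {
	if a > b {
		return a
	}
	return b
}

func main() {
	// Wall times collide, so causality is carried by the logical component.
	fmt.Println(update(Timestamp{Wall: 100, Logical: 3}, Timestamp{Wall: 100, Logical: 7}, 99)) // {100 8}
	// Physical time strictly leads both clocks, so the logical component is moot.
	fmt.Println(update(Timestamp{Wall: 100, Logical: 3}, Timestamp{Wall: 100, Logical: 7}, 105)) // {105 0}
}
```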


pkg/util/hlc/doc.go, line 35 at r2 (raw file):

 - Raft (unidirectional): proposers of Raft commands (i.e. leaseholders) attach
   clock readings to these command, which are later consumed by followers when
   commands are applied to their Raft state machine.

Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through `type RaftCommand struct`. I don't think that you mean the WriteTimestamp nor the ClosedTimestamp.


pkg/util/hlc/doc.go, line 37 at r2 (raw file):

   commands are applied to their Raft state machine.

 - BatchRequest API (bidirectional): clients and servers of the KV BatchRequest

Ditto, are you talking about this one here?

message Header {
  // timestamp specifies time at which reads or writes should be performed. If
  // the timestamp is set to zero value, its value is initialized to the wall
  // time of the server node.
  //
  // Transactional requests are not allowed to set this field; they must rely on
  // the server to set it from txn.ReadTimestamp. Also, for transactional
  // requests, writes are performed at the provisional commit timestamp
  // (txn.WriteTimestamp).
  util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false];

But this isn't really populated with hlc clock readings.


pkg/util/hlc/doc.go, line 44 at r2 (raw file):

   readings back to the root of the flow. Currently, this only takes place on
   errors, and relates to the "Transaction retry errors" interaction detailed
   below.

Ditto


pkg/util/hlc/doc.go, line 96 at r2 (raw file):

 - Non-transactional requests (Raft + BatchRequest channels). Most KV operations
   in CockroachDB are transactional and receive their read timestamps from their
   client. They use an uncertainty interval (see below) to avoid stale reads in

"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.


pkg/util/hlc/doc.go, line 98 at r2 (raw file):

   client. They use an uncertainty interval (see below) to avoid stale reads in
   the presence of clock skew. However, the KV API also exposes the option to
   send a single-range, strongly consistent, "non-transaction" request. These

maybe a clearer way to phrase this is that

the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a predetermined read timestamp; instead, one is chosen from the HLC upon arrival at the leaseholder for the range. Since the HLC clock always leads the timestamp of any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.
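
A toy model of the invariant behind that wording, with made-up names and abstract integer timestamps (not real APIs): because the leaseholder's clock is ratcheted up to every write it serves, a non-transactional read that takes its timestamp from that clock on arrival cannot land below a committed write on the range.

```go
package main

import "fmt"

// leaseholder models a range's leaseholder: its clock is forwarded to every write
// timestamp it serves, so any later clock reading is >= all writes on the range.
type leaseholder struct {
	clock int64
}

func (l *leaseholder) serveWrite(ts int64) {
	if ts > l.clock {
		l.clock = ts
	}
}

// serveNonTxnRead picks the read timestamp from the local clock on arrival, which
// is therefore at or above every write already served here.
func (l *leaseholder) serveNonTxnRead() int64 {
	l.clock++ // a fresh clock reading
	return l.clock
}

func main() {
	lh := &leaseholder{clock: 100}
	lh.serveWrite(120) // a write whose timestamp led the local clock
	fmt.Println(lh.serveNonTxnRead() >= 120) // true: the read cannot be stale
}
```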


pkg/util/hlc/doc.go, line 123 at r2 (raw file):

Strict monotonicity

HLCs provide strict monotonicity within and across restarts on a single node.

maybe add "as implemented by CockroachDB"


pkg/util/hlc/doc.go, line 132 at r2 (raw file):

Strictly monotonic timestamp allocation ensures that two causally dependent
transactions originating from the same node are given timestamps that reflect

Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.


pkg/util/hlc/doc.go, line 213 at r2 (raw file):

   more, see LeaseState_UNUSABLE.

   Note however that it is easy to overstate the salient point here if one is

understate?


pkg/util/hlc/doc.go, line 231 at r2 (raw file):

   with an uncertainty interval that extends past a lease's expiration will not
   be able to use that lease to perform a read (which is enforced by a stasis
   period immediately before its expiration). However, because some

It's interesting that during a cooperative lease transfer, we don't have to pick a start timestamp for the next read that takes into account any MaxOffset checks that were applied for reads under the current lease. I suppose that this makes sense, given that success of the lease transfer implies that there wasn't a competing leaseholder, and if there was, the transfer fails and so no harm done.


pkg/util/hlc/doc.go, line 256 at r2 (raw file):
maybe add

HAZARD: this mode of operation is completely untested.

so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.


pkg/util/hlc/doc.go, line 266 at r2 (raw file):

clocks offset from other nodes. This best-effort validation is done in the
RemoteClockMonitor. If any node exceeds the configured maximum offset by more
than 80% compared to a majority of other nodes, it self-terminates.

Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.

Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 590 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That makes sense at least to me. I would even find it useful to include this reasoning in the comment.

Done.


pkg/util/hlc/doc.go, line 107 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Sounds like something we should file an issue for.

Done in #72663.


pkg/util/hlc/doc.go, line 25 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

This isn't quite, exactly, true, right? The logical component doesn't track anything if observed timestamps always strictly increase in the physical component. It is more that they help establish causality between events that share a wall clock reading.

Done.


pkg/util/hlc/doc.go, line 35 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Can you add a permalink to the doc? I'm not sure what you are referring to, and I did look through `type RaftCommand struct`. I don't think that you mean the WriteTimestamp nor the ClosedTimestamp.

I do mean the WriteTimestamp, which is used to update the clock on followers here. I added a link. But also, I'm coming at that code with an axe right after this, and I'll be changing this wording slightly at that point.


pkg/util/hlc/doc.go, line 37 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto, are you talking about this one here?

message Header {
  // timestamp specifies time at which reads or writes should be performed. If
  // the timestamp is set to zero value, its value is initialized to the wall
  // time of the server node.
  //
  // Transactional requests are not allowed to set this field; they must rely on
  // the server to set it from txn.ReadTimestamp. Also, for transactional
  // requests, writes are performed at the provisional commit timestamp
  // (txn.WriteTimestamp).
  util.hlc.Timestamp timestamp = 1 [(gogoproto.nullable) = false];

But this isn't really populated with hlc clock readings.

Added a few links. I'm also coming to clean that code up soon 😃 Some of this is a little aspirational for the next week or so.


pkg/util/hlc/doc.go, line 44 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ditto

Done.


pkg/util/hlc/doc.go, line 96 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

"from their client" might be ambiguous. The timestamp is picked by the gateway's hlc clock when instantiating the transaction.

Done.


pkg/util/hlc/doc.go, line 98 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe a clearer way to phrase this is that

the KV API also exposes the option to elide the transaction for requests targeting a single range (which trivially applies to all point requests). These requests do not carry a predetermined read timestamp; instead, one is chosen from the HLC upon arrival at the leaseholder for the range. Since the HLC clock always leads the timestamp of any write served on the range, this will not result in stale reads, despite not using an uncertainty interval for such requests.

Done. That is clearer, thanks.


pkg/util/hlc/doc.go, line 123 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe add "as implemented by CockroachDB"

Done.


pkg/util/hlc/doc.go, line 132 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Maybe worth mentioning that this isn't one of CockroachDB's crucial guarantees since as a multi-node database, you can't assume that causally dependent ops hit the same node. It's a nice property anyway.

Done.


pkg/util/hlc/doc.go, line 213 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

understate?

What I meant is that the bounded skew is not relied upon to keep the actual MVCC intervals owned by leases disjoint.


pkg/util/hlc/doc.go, line 256 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

maybe add

HAZARD: this mode of operation is completely untested.

so whatever theoretical properties we ascribe here, it's not clear that the code even works according to spec.

I like it. Done.


pkg/util/hlc/doc.go, line 266 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Interesting, the 80% is news to me (I know we use some Marzullo-type thing, but don't remember specifics). Maybe a permalink here too for sleuthers.

Done.

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Done.

Done? I don't see the word "intent" anywhere.


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

   Separately, when a leaseholder
   on a given node serves a write, it ensures that the node's HLC clock is >=
   the written_timestamp of the write.

Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?

This commit adds a `doc.go` file to the `pkg/util/hlc` package that details the
properties and uses of Hybrid Logical Clocks throughout the system. It is
meant to capture an overview of the ways in which HLCs benefit CockroachDB and to
exhaustively enumerate the (few) places in which the HLC is a key component to
the correctness of different subsystems.

This was inspired by cockroachdb#72121, which is once again making me feel uneasy about how
implicit most interactions with the HLC clock are. Specifically, the uses of the
HLC for causality tracking are subtle and underdocumented. The typing changes
made in cockroachdb#58349 help to isolate timestamps used for causality tracking from any
other timestamps in the system, but until we remove the escape hatch of
dynamically casting a `Timestamp` back to a `ClockTimestamp` with
`TryToClockTimestamp()`, it is still too difficult to understand when and why
clock signals are being passed between HLCs on different nodes, and where doing
so is necessary for correctness. I'm looking to make that change and I'm hoping
that documenting this first (with help from the reviewers!) will set that up to
be successful.

Some of this was adapted from section 4 of the SIGMOD 2020 paper.
Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/util/hlc/doc.go, line 70 at r1 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Done? I don't see the word "intent" anywhere.

I made a reference to the written_timestamp above, but I'll be more explicit and say that this does not change even if an intent is pushed during resolution. Done.


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

Previously, andreimatei (Andrei Matei) wrote…
   Separately, when a leaseholder
   on a given node serves a write, it ensures that the node's HLC clock is >=
   the written_timestamp of the write.

Hmm, but we want to get rid of this for the purposes of multi-tenancy, as we were talking about last night (i.e. to prevent a SQL pod from injecting timestamps into the cluster). So... what happens then with observed timestamps?

We want to get rid of hlc.Clock.Update(Timestamp). That doesn't mean we have to get rid of ensuring that a node's HLC clock is >= the WrittenTimestamp (note: this is a new timestamp, see #72121 (comment)) of a write, it just means that the write is originally written with a WrittenTimestamp in some cases. FWIW, this is already the case for future time writes — just picture s/synthetic_bit=true/written_timestamp=now()/ in cases where now() < write_timestamp.

@nvanbenschoten
Member Author

bors r+

Contributor

@andreimatei andreimatei left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, @nvanbenschoten, and @tbg)


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

it just means that the write is originally written with a WrittenTimestamp in some cases.

Right... But what are these cases going to be? Any timestamp above the current lease expiration? So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?

Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner, @andreimatei, @bdarnell, and @tbg)


pkg/util/hlc/doc.go, line 95 at r3 (raw file):

But what are these cases going to be?

Cases where the Txn.WriteTimestamp is greater than the leaseholder's local HLC clock. The idea is that we'll no longer "push" operation timestamps into the local HLC clock. Instead, we'll (optionally) push into it with separate clock readings and then pull from it to determine a write's WrittenTimestamp. As an optimization, if this WrittenTimestamp is equal to or greater than its MVCC Timestamp (the common case), then we don't need to explicitly include the WrittenTimestamp in the key/value. Otherwise, we do. And this WrittenTimestamp is what ObservedTimestamps are compared against.

Ignoring the clock reading from secondary tenants just increases the chance that we'll fall into this second bucket and need to write both timestamps into a key/value.
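
A sketch of that derivation, illustrative only: the `WrittenTimestamp` here is the value under discussion in #72121, not a shipped field, and the simplified integer type stands in for hlc.Timestamp/hlc.ClockTimestamp.

```go
package main

import "fmt"

// writtenTimestamp sketches the proposed rule: pull a reading from the local HLC
// when the value is laid down; that reading is the write's WrittenTimestamp and
// is what observed timestamps would be compared against.
func writtenTimestamp(clockNow, mvccTS int64) (written int64, storeExplicitly bool) {
	written = clockNow
	if written >= mvccTS {
		// Common case: the clock reading leads (or equals) the MVCC timestamp,
		// so nothing extra needs to be stored in the key/value.
		return written, false
	}
	// The MVCC timestamp leads the local clock (e.g. a future-time write, which
	// today carries the synthetic bit); record the clock reading alongside the value.
	return written, true
}

func main() {
	fmt.Println(writtenTimestamp(105, 100)) // 105 false — common case
	fmt.Println(writtenTimestamp(100, 120)) // 100 true  — future-time write
}
```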

Any timestamp above the current lease expiration?

No, writes still won't be allowed with MVCC timestamps above the current lease expiration. Although maybe they could be as long as their WrittenTimestamp (hlc.ClockTimestamp) is below it and they did not read? I'd have to think through that. But that's not the current proposal.

So the idea will be that KV will make a determination that the gateway for a particular write is not to be trusted and so we won't bother ensuring linearizability for its writes?

The first part about trust is correct. We would ignore the clock reading from untrusted gateways.

The second part about linearizability is not correct. We'd still ensure linearizability as long as the untrusted tenant's HLC is within the max_offset of every other node in the system. But if it isn't, I guess that tenant would be susceptible to stale reads.

Want to talk through this at some point to make sure that I'm not missing something before I go off and pick this fight with the code?

@craig
Contributor

craig bot commented Nov 12, 2021

Build failed (retrying...):

@craig
Contributor

craig bot commented Nov 12, 2021

Build succeeded:

@craig craig bot merged commit 5cb931b into cockroachdb:master Nov 12, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/hlcDocs branch November 17, 2021 19:50
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2022
This commit eliminates the primary mechanism that we use to pass clock
information from a leaseholder, through Raft log entries, to a Range's
followers. As we found in cockroachdb#72278, this was only needed for correctness
in a few specific cases — namely lease transfers and range merges. These
two operations continue to pass clock signals through more explicit
channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert
a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp`
method. As outlined in cockroachdb#72121 (comment),
I would like to remove the ability to downcast a "data-plane" `Timestamp` to
a "control-plane" `ClockTimestamp` entirely. This will clarify the role
of `ClockTimestamps` in the system and clean up the channels through
which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is
in `Store.Send`, at the boundary of KV RPCs. I would also like to get
rid of this, but doing so needs to wait on cockroachdb#73732.
craig bot pushed a commit that referenced this pull request Feb 8, 2022
76095: kv: don't pass clock information through Raft log r=nvanbenschoten a=nvanbenschoten

This commit eliminates the primary mechanism that we use to pass clock information from a leaseholder, through Raft log entries, to a Range's followers. As we found in #72278, this was only needed for correctness in a few specific cases — namely lease transfers and range merges. These two operations continue to pass clock signals through more explicit channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp` method. As outlined in #72121 (comment), I would like to remove the ability to downcast a "data-plane" `Timestamp` to a "control-plane" `ClockTimestamp` entirely. This will clarify the role of `ClockTimestamps` in the system and clean up the channels through which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is in `Store.Send`, at the boundary of KV RPCs. I would also like to get rid of this, but doing so needs to wait on #73732.

76163: bazel: remove old protos when generating new ones r=ajwerner a=ajwerner

This is what the Makefile did. It was painful to have the old ones because
they'd lead to spurious diffs.

Release note: None

76166: sql: support version numbers on descriptor validation r=fqazi a=fqazi

Previously, the descriptor validation code did not
take a version number, so it was not possible to version
gate new validation logic. This was inadequate because
when new fields are introduced we don't want their validation
to kick in for certain cases like the debug doctor, such as any
new fields with non-zero defaults. To address this, this patch adds support
for version numbers inside the validation, and updates unit tests
to pass this in as well. It also adds a new option on debug doctor
to run a version of validation.

Release note (cli change): Add new optional version argument
to the doctor examine command. This can be used to enable /
disable validation when examining older zip directories.

76188: ci: make sure metamorphic nightly uses proper formatter for issues r=nicktrav a=rickystewart

The `--formatter=pebble-metamorphic` option got lost in #75585.

Release note: None

76191: gazelle: exclude `.pb.go` files r=dt a=rickystewart

Should prevent Gazelle from getting confused and adding these to `srcs`.

Release note: None

76192: ci: make sure extra env vars are set for `roachtest` jobs r=rail a=rickystewart

We [need these](https://github.com/cockroachdb/cockroach/blob/5e7690d6da09821ff431ef32fe8d1430d05aed9f/pkg/cmd/internal/issues/issues.go#L159-L171).

Release note: None

76194: tree: remove TODOs about bytea/float cast volatility r=mgartner a=rafiss

These comments aren't needed, since the current cast volatility is
correct, as far as I can tell.

We might want to report this as a bug in Postgres if it describes these
casts as immutable incorrectly.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrew Werner <awerner32@gmail.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 8, 2022
Before this change, we were updating the local clock with each BatchResponse's
WriteTimestamp. This was meant to handle cases where the batch request timestamp
was forwarded during evaluation.

This was unnecessary for two reasons. The first is that BatchRequest can lead
the local HLC clock (explored in cockroachdb#72121 and cockroachdb#72278) as long as any clock reading
information in the values reflects the state of the HLC clock (synthetic bit
today, "local timestamp" tomorrow). The second is that even if this first reason
was not the case, the BatchRequest will only ever be bumped one logical tick
above any existing value in the range, so as long as the existing values in the
range followed the rule of trailing the leaseholder's HLC (NB: they don't), all
new writes would too.
RajivTS pushed a commit to RajivTS/cockroach that referenced this pull request Mar 6, 2022
This commit eliminates the primary mechanism that we use to pass clock
information from a leaseholder, through Raft log entries, to a Range's
followers. As we found in cockroachdb#72278, this was only needed for correctness
in a few specific cases — namely lease transfers and range merges. These
two operations continue to pass clock signals through more explicit
channels, but we remove the unnecessary general case.

This allows us to remove one of the two remaining places where we convert
a `Timestamp` to a `ClockTimestamp` through the `TryToClockTimestamp`
method. As outlined in cockroachdb#72121 (comment),
I would like to remove the ability to downcast a "data-plane" `Timestamp` to
a "control-plane" `ClockTimestamp` entirely. This will clarify the role
of `ClockTimestamps` in the system and clean up the channels through
which clock information is passed between nodes.

The other place where we cast from `Timestamp` to `ClockTimestamp` is
in `Store.Send`, at the boundary of KV RPCs. I would also like to get
rid of this, but doing so needs to wait on cockroachdb#73732.