
kv: fix non-transactional requests' interaction with uncertainty intervals #58459

Closed
nvanbenschoten opened this issue Jan 5, 2021 · 0 comments · Fixed by #73732
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

nvanbenschoten commented Jan 5, 2021

Non-transactional requests, and their decision of which timestamp to use, complicate our transaction model. These requests do not carry uncertainty intervals but still aim to provide single-key linearizability. As a result, they lead to complex interactions with the HLC clock in an attempt to guarantee that non-transactional requests observe any change that was performed causally before them. Instead of assigning a timestamp at the client and using an uncertainty interval to ensure read-your-writes, they defer timestamp assignment to the server.

As discussed in #58349, this complication does not seem worthwhile. From a performance perspective, a non-transactional request should not be any cheaper than a 1PC transaction. And since non-transactional requests cannot span ranges, we know that any non-transactional request can also be expressed as a 1PC transaction. We should explore eliminating non-transactional requests and replacing them with 1PC transactions.
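As a rough sketch of the difference (hypothetical field names, not the real proto definitions), the two request shapes look something like this:

```go
// Illustrative request shapes; hypothetical fields, not the real protos.
package main

import "fmt"

const maxClockOffsetNanos = 500 * 1000 * 1000 // e.g. a 500ms cluster max offset

// NonTxnRequest models today's non-transactional request: no timestamp is
// carried, so the server assigns one from its local HLC on arrival, and no
// uncertainty interval exists to protect against clock skew between nodes.
type NonTxnRequest struct {
	Key string
}

// TxnRequest models a 1PC transaction's read: the client assigns the read
// timestamp and derives a global uncertainty limit from the cluster's
// maximum clock offset, preserving single-key linearizability.
type TxnRequest struct {
	Key                    string
	ReadTimestamp          int64
	GlobalUncertaintyLimit int64
}

func newTxnRequest(key string, clientClock int64) TxnRequest {
	return TxnRequest{
		Key:                    key,
		ReadTimestamp:          clientClock,
		GlobalUncertaintyLimit: clientClock + maxClockOffsetNanos,
	}
}

func main() {
	fmt.Printf("%+v\n", newTxnRequest("k", 1000))
}
```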

Epic CRDB-1514

@nvanbenschoten nvanbenschoten added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Jan 5, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 29, 2021
First commit from cockroachdb#59505.

This commit updates the kv client routing logic to account for the new
`LEAD_FOR_GLOBAL_READS` `RangeClosedTimestampPolicy` added in cockroachdb#59505. In
enterprise builds, non-stale read-only requests to ranges with this
closed timestamp policy can now be routed to follower replicas, just
like stale read-only requests to normal ranges currently are.

In addition to this main change, there are a few smaller changes in this
PR that were hard to split out, so they are included in this commit.

First, non-transactional requests are no longer served by followers even
if the follower replica determines that it could serve them. Previously,
non-transactional requests were never intentionally routed to a
follower replica, but `Replica.canServeFollowerRead` would allow them
through if their timestamp was low enough. This change was made to keep
the client and server logic in sync, and because serving from followers
does not seem safe for non-transactional requests that get their
timestamp assigned on the server. We're planning to remove
non-transactional requests soon anyway (cockroachdb#58459), so this isn't
a big deal. It mostly just caused some testing fallout.

Second, transactions that have previously written intents are now
allowed to have their read-only requests served by followers, as
long as those followers have a closed timestamp above the transaction's
read *and* write timestamp. Previously, we had avoided this because it
seemed to cause issues with read-your-writes. However, it appears safe
as long as the write timestamp is below the closed timestamp, because we
know all of the transaction's intents are at or below its write
timestamp. This is very important for multi-region work, where we want a
transaction to be able to write to a REGIONAL table and then later
perform local reads (maybe foreign key checks) on GLOBAL tables.
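
A minimal sketch of this loosened check, with simplified integer timestamps (the real logic lives in `Replica.canServeFollowerRead`):

```go
// A follower may serve a read for a transaction that has written intents
// only if its closed timestamp covers both the txn's read and write
// timestamps: all intents sit at or below the write timestamp, so a
// closed timestamp above both preserves read-your-writes.
func canServeFollowerRead(followerClosedTS, txnReadTS, txnWriteTS int64) bool {
	required := txnReadTS
	if txnWriteTS > required {
		required = txnWriteTS
	}
	return followerClosedTS >= required
}
```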

Third, a max clock offset shift in `canUseFollowerRead` was removed. It
wasn't exactly clear what this shift was doing. It was introduced in the
original 70be833 and seemed to allow a follower read in cases where it
otherwise shouldn't be expected to succeed. I thought at first that it
was accounting for the fact that the kv client's clock might be leading
the kv server's clock, and so it was being pessimistic about the expected
closed timestamp, but it actually seems to be shifting the other way, so
I don't know. I might be missing something.

Finally, the concept of a `replicaoracle.OracleFactoryFunc` was removed
and the existing `replicaoracle.OracleFactory` takes its place. We no
longer need the double-factory approach because the transaction object
is now passed directly to `Oracle.ChoosePreferredReplica`. This was a
necessary change, because the process of determining whether a follower
read can be served now requires transaction information and range
information (i.e. the closed timestamp policy), so we need to make this
determination in the Oracle itself instead of in the OracleFactory. This
all seems like a simplification anyway.

This is still waiting on changes to the closed timestamp system to be
able to write end-to-end tests using this new functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 6, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 9, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 9, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 12, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 15, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 16, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 17, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 18, 2021
craig bot pushed a commit that referenced this issue Feb 18, 2021
58362: *: craft rangefeed abstraction, support cluster settings in tenants r=nvanbenschoten a=ajwerner

This PR is a reworking of @dt's excellent effort in #57926
to pick up rangefeeds for tenant cluster settings after sprucing up the libraries around
them. @tbg and @nvanbenschoten I'm tagging you accordingly given that you reviewed
that PR.

### First two commits (`kvclient/rangefeed` package and adoption in `catalog/lease`)

The first commit in this PR abstracts out some rangefeed client logic around
retries to make it more generally usable in other contexts. The second
commit adopts this package in `sql/catalog/lease`. These two commits can
(and probably should) be merged separately and exist in
#58361.

### Next two commits (`server/settingswatcher` package and adoption in `server`)

These commits introduce code to update the settings using the above rangefeed
package and then hook it up to the sql server.

### Next three commits

Remove the restrictions on tenants updating settings, mark which settings can be
updated, and generate docs.

These are lifted straight from @dt in #57926.

Maybe closes #56623.


59571: kv: route present-time reads to global_read follower replicas r=nvanbenschoten a=nvanbenschoten

60021: roachpb: use LocalUncertaintyLimit from err in readWithinUncertaintyIntervalRetryTimestamp r=nvanbenschoten a=nvanbenschoten

Fixes #57685.

This commit updates `readWithinUncertaintyIntervalRetryTimestamp` to use the `LocalUncertaintyLimit` field from `ReadWithinUncertaintyIntervalError` in place of `ObservedTimestamps`. This addresses a bug where `ReadWithinUncertaintyIntervalErrors` thrown by follower replicas during follower reads would use meaningless observed timestamps.
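
A hedged sketch of the new retry-timestamp computation, with simplified integer timestamps (field names follow the error type; the real code uses `hlc` timestamps):

```go
// The retry timestamp must move past both the uncertain value and the
// local uncertainty limit reported in the error, so the retried read
// cannot hit the same ReadWithinUncertaintyIntervalError again.
func readWithinUncertaintyIntervalRetryTimestamp(
	existingValueTS, localUncertaintyLimit int64,
) int64 {
	ts := existingValueTS + 1 // just past the uncertain value
	if localUncertaintyLimit > ts {
		ts = localUncertaintyLimit
	}
	return ts
}
```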

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 28, 2021
This commit combines the server-side "local" and "global" uncertainty
limits into an `uncertainty.Interval` struct. It also centralizes the
computation of this interval to an `uncertainty.ComputeInterval` function.

As a result, MVCC no longer reaches inside of a `Transaction` object to
construct its implied uncertainty interval. Instead, an uncertainty
interval is supplied to MVCC directly.

This refactor is made in preparation for a follow-on commit that will
address cockroachdb#58459 (giving non-transactional requests uncertainty intervals),
which in turn will prepare to simplify HLC handling throughout the system.

The refactor also cleanly addresses a goal I've had to make the local
uncertainty limit a `ClockTimestamp`, which helps clarify its role in
the system.
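
A simplified sketch of the resulting shape (the real definitions live in `pkg/kv/kvserver/uncertainty` and use `hlc` timestamps; the written-timestamp dimension is omitted):

```go
// Interval bounds a reader's uncertainty. Values above GlobalLimit are
// certainly concurrent; LocalLimit, when set, tightens the bound using a
// clock observation from the relevant node.
type Interval struct {
	GlobalLimit int64
	LocalLimit  int64 // 0 means unset
}

// IsUncertain reports whether a value whose timestamp is above the
// reader's timestamp must trigger an uncertainty restart. MVCC now takes
// an interval like this directly instead of deriving one from the
// Transaction object.
func (in Interval) IsUncertain(valueTS int64) bool {
	if in.LocalLimit != 0 && valueTS > in.LocalLimit {
		return false // proven concurrent by the local clock observation
	}
	return valueTS <= in.GlobalLimit
}
```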
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 30, 2021
@nvanbenschoten nvanbenschoten changed the title from "kv: remove non-transactional requests" to "kv: fix non-transactional requests' interaction with uncertainty intervals" Nov 30, 2021
craig bot pushed a commit that referenced this issue Nov 30, 2021
72876: server,ccl: replace a few usages of keys.TODOSQLCodec r=e-mbrown a=e-mbrown

references #48123

Replaced a few instances of keys.TODOSQLCodec with proper codecs.

Release note: None

72898: tracing: untangle local/remote parent vs recording collection r=andreimatei a=andreimatei

Before this patch, two concepts were bundled together: whether or not
the parent of a span is local or remote to the child's node, and whether
or not the parent should include the child in its recording. You could
create a child span with either:
- the WithParentAndAutoCollection option, which would give you a local
  parent which includes the child in its recording, or
- the WithParentAndManualCollection option, which would give you a
  remote parent (which naturally would not / couldn't include the child
  in its recording).

The WithParentAndManualCollection option is sometimes used even when the
parent is local, which is a) quite confusing, because at a lower level it
produced what was called a "remote parent", and b) sub-optimal, because
the child looked like a "local root", which meant it had to be put into
the active spans registry. Were the two concerns not bundled together,
such a child of a local parent, one whose recording the parent is not in
charge of collecting, would not need to be put directly into the span
registry; such a span (like other child spans) should be part of the
registry indirectly, through its parent - which is cheaper because it
avoids contention on the global mutex lock.

The funky case of a local parent whose recording does not include the
child's is used by DistSQL processors - each processor gets a span whose
recording is collected by the DistSQL infrastructure, independent of the
parent.

This patch untangles the parent-child relationship from the
recording collection concern, and cleans up the API in the process. Now
you have the following options for creating a span:

WithParent(parent) - creates a parent/child relationship. The parent has
a reference to the child. The child is part of the active spans registry
through the parent. Unless WithDetachedRecording() is also specified,
the parent's recording includes the child's.

WithRemoteParent(parentMeta) - creates a parent/child relationship
across process boundaries.

WithDetachedRecording() - asks the parent to not include the new child
in its recording. This is a no-op in case WithRemoteParent() is used.
This option is used by DistSQL.
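
A toy, self-contained sketch of the resulting option API (the option names come from this PR; the `Span`/`StartSpan` machinery is a stand-in, not the real tracing package, and `WithRemoteParent` is omitted):

```go
package main

import "fmt"

// Span is a toy stand-in for a tracing span.
type Span struct {
	name     string
	children []*Span // children whose recordings this span collects
}

type options struct {
	parent   *Span
	detached bool
}

// Option configures span creation, mirroring the API described above.
type Option func(*options)

func WithParent(p *Span) Option     { return func(o *options) { o.parent = p } }
func WithDetachedRecording() Option { return func(o *options) { o.detached = true } }

func StartSpan(name string, opts ...Option) *Span {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	s := &Span{name: name}
	// A parented, non-detached child is reachable through its parent, so it
	// need not be added to a global registry, and its recording is collected
	// by the parent. A detached child is registered separately (not modeled).
	if o.parent != nil && !o.detached {
		o.parent.children = append(o.parent.children, s)
	}
	return s
}

func main() {
	flow := StartSpan("flow")
	StartSpan("reader", WithParent(flow))                             // parent records child
	StartSpan("processor", WithParent(flow), WithDetachedRecording()) // DistSQL-style
	fmt.Println(len(flow.children)) // 1
}
```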

So, we get a cleaner API, and a more efficient active spans registry
implementation because the DistSQL spans are no longer directly part of
it. Also, we open the door for a new possibility in the future: you can
imagine that, in certain scenarios, it'd be interesting for a span to
collect the recording of all its children, including the ones whose
recording, under normal circumstances, is not the parent's
responsibility to collect. For example, when asking for the recording of
a span in the registry in an out-of-band way (i.e. when debugging), I
want all the info I can get. So, I should be able to point to a DistSQL
flow span and ask it for the recording of all processors on its node.
This is not implemented yet.

Release note: None

73244: kv: combine local and global uncertainty limit into Interval struct r=nvanbenschoten a=nvanbenschoten

73279: kv: detect context cancellation in limitBulkIOWrite, avoid log spam r=nvanbenschoten a=nvanbenschoten

This commit adds logic to propagate context cancellation in `limitBulkIOWrite`.
This function is used in two places: 1) when ingesting ssts, and 2) when
receiving a snapshot. The first case uses the Raft scheduler goroutine's
context, so it never gets cancelled. The second case uses the context of the
sender of a Raft snapshot, so it can get cancelled.

In customer clusters, we were seeing Raft snapshots hit their deadline and begin
spamming `error rate limiting bulk io write: context deadline exceeded` error
messages. This was bad for two reasons. First, it was very noisy. Second, it
meant that a Raft snapshot that was no longer going to succeed was still writing
out full SSTs while holding on to the `snapshotApplySem`. This contributed to
the snapshot starvation we saw in the issue.

With this commit, `limitBulkIOWrite` will eagerly detect context cancellation
and will propagate the cancellation up to the caller, allowing the caller to
quickly release resources.
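
A hedged sketch of the shape of the fix, using `golang.org/x/time/rate` as a stand-in for the real quota-pool-based limiter:

```go
package sketch

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitBulkIOWrite sketches the fixed behavior (the real implementation
// differs): waiting on the limiter must observe ctx cancellation and
// propagate it, so a dead snapshot stops writing SSTs and releases
// snapshotApplySem promptly instead of spamming errors.
func limitBulkIOWrite(ctx context.Context, limiter *rate.Limiter, cost int) error {
	// WaitN returns early with an error when ctx is cancelled or its
	// deadline expires; propagate that to the caller instead of logging.
	if err := limiter.WaitN(ctx, cost); err != nil {
		return fmt.Errorf("rate limiting bulk io write: %w", err)
	}
	return nil
}
```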

Release notes (bug fix): Raft snapshots now detect timeouts earlier and avoid
spamming the logs with `context deadline exceeded` errors.

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 13, 2021
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.
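
A minimal sketch of that construction, with simplified nanosecond timestamps:

```go
// The leaseholder's clock reading serves as both the request timestamp
// and the local uncertainty limit; the global limit pads the reading by
// the cluster's maximum clock offset to cover transactions that may have
// committed remotely just before the request arrived.
func nonTxnUncertainty(leaseholderClock, maxClockOffset int64) (reqTS, localLimit, globalLimit int64) {
	return leaseholderClock, leaseholderClock, leaseholderClock + maxClockOffset
}
```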

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go`, because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

2. ignoring reason #1, the request timestamp assigned by the leaseholder is
   equal to its local uncertainty limit and is guaranteed to lead the
   written_timestamp of all writes served by the range at the time the
   request arrived at the leaseholder node. However, this timestamp is still
   not guaranteed to lead the commit timestamp of all of these writes. As a
   result, the non-transactional request needs an uncertainty interval with a
   global uncertainty limit far enough in advance of the local HLC clock to
   ensure that it considers any value that was part of a transaction which
   could have committed before the request was received by the leaseholder to
   be uncertain. Concretely, the non-transactional request needs to consider
   values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.
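
That condition, written out as a sketch, with the stale-read example from the text as a sanity check:

```go
// A value is uncertain to the non-transactional request exactly when it
// was written locally before the request's clock reading and could have
// committed within the padded global window.
func isUncertain(writtenTS, commitTS, localLimit, globalLimit int64) bool {
	return writtenTS < localLimit && commitTS < globalLimit
}

// Example from the text: written at 10, committed at 20, request assigned
// 15 by the leaseholder (the local limit), global limit = 15 + max offset
// (> 20). isUncertain(10, 20, 15, 15+offset) == true, so the request
// retries at a higher timestamp instead of serving a stale read.
```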

----

Conveniently, because non-transactional requests are always scoped to a
single range, those that hit uncertainty errors can always retry on the server,
so these errors never bubble up to the client that initiated the request.

However, before this commit, this wasn't actually possible because we did not
support server-side retries of `ReadWithinUncertaintyIntervalError`. The second
part of this commit adds support for this. This new support allows us to make
the guarantee we want about non-transactional requests never returning these
errors to the client, even though they use uncertainty intervals. It also serves
as a performance optimization for transactional requests, which now benefit from
this new capability. There's some complexity around supporting this form of
server-side retry, because it must be done above latching instead of below, but
recent refactors in cockroachdb#73557 and cockroachdb#73717 have made this possible to support
cleanly.
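
A hedged sketch of such a server-side retry loop (toy types; in the real code the retry sits above latching in the replica's send path):

```go
type request struct {
	key string
	ts  int64
}

type uncertaintyError struct{ valueTS, localUncertaintyLimit int64 }

func (e *uncertaintyError) Error() string { return "read within uncertainty interval" }

// serveNonTxnRead retries uncertainty errors locally: the request is
// single-range, so no other node needs to hear about the restart and the
// error never reaches the client.
func serveNonTxnRead(req *request, evaluate func(*request) (string, error)) (string, error) {
	for {
		res, err := evaluate(req)
		uerr, ok := err.(*uncertaintyError)
		if !ok {
			return res, err // success, or a non-retryable error
		}
		// Bump the timestamp past the uncertain value and the local
		// uncertainty limit, then re-evaluate at the new timestamp.
		req.ts = uerr.valueTS + 1
		if uerr.localUncertaintyLimit > req.ts {
			req.ts = uerr.localUncertaintyLimit
		}
	}
}
```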

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 3, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 8, 2022
craig bot pushed a commit that referenced this issue Feb 8, 2022
73732: kv: give non-transactional requests uncertainty intervals r=nvanbenschoten a=nvanbenschoten

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in d676d7e Feb 8, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022