Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: don't Update HLC clock in response to untrusted tenants #72663

Open
nvanbenschoten opened this issue Nov 11, 2021 · 0 comments
Open

kv: don't Update HLC clock in response to untrusted tenants #72663

nvanbenschoten opened this issue Nov 11, 2021 · 0 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-multitenancy Related to multi-tenancy A-security C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Nov 11, 2021

Discovered in #72278.

We currently update the HLC clock in response to BatchRequests from SQL tenant pods here:

// Update our clock with the incoming request timestamp. This advances the
// local node's clock to a high water mark from all nodes with which it has
// interacted.
if baClockTS, ok := ba.Timestamp.TryToClockTimestamp(); ok {
if s.cfg.TestingKnobs.DisableMaxOffsetCheck {
s.cfg.Clock.Update(baClockTS)
} else {
// If the command appears to come from a node with a bad clock,
// reject it instead of updating the local clock and proceeding.
if err := s.cfg.Clock.UpdateAndCheckMaxOffset(ctx, baClockTS); err != nil {
return nil, roachpb.NewError(err)
}
}
}

This seems like a bad idea, given our multi-tenant threat model where any SQL pod can be entirely owned and we still want to protect the KV host cluster. Currently, a malicious tenant could keep pushing the clock forward by max_offset-1 until it crashed a KV node (through (rpc.RemoteClockMonitor).VerifyClockOffset).

Now that we've convinced ourselves that this HLC channel is not necessary for correctness, we can consider removing the call to (hlc.Clock).Update here for requests originating from secondary tenants.

Jira issue: CRDB-11248

@nvanbenschoten nvanbenschoten added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-transactions Relating to MVCC and the transactional model. A-multitenancy Related to multi-tenancy labels Nov 11, 2021
@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 11, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 13, 2021
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 from above,
   because the replica allocating the timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

2. ignoring reason #1, the request timestamp assigned by the leaseholder is
   equal to its local uncertainty limit and is guaranteed to lead the
   written_timestamp of all writes served by the range at the time the
   request arrived at the leaseholder node. However, this timestamp is still
   not guaranteed to lead the commit timestamp of all of these writes. As a
   result, the non-transactional request needs an uncertainty interval with a
   global uncertainty limit far enough in advance of the local HLC clock to
   ensure that it considers any value that was part of a transaction which
   could have committed before the request was received by the leaseholder to
   be uncertain. Concretely, the non-transactional request needs to consider
   values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

----

Conveniently, because non-transactional requests are always scoped to a
single-range, those that hit uncertainty errors can always retry on the server,
so these errors never bubble up to the client that initiated the request.

However, before this commit, this wasn't actually possible because we did not
support server-side retries of `ReadWithinUncertaintyIntervalError`. The second
part of this commit adds support for this. This new support allows us to make
the guarantee we want about non-transactional requests never returning these
errors to the client, even though they use uncertainty intervals. It also serves
as a performance optimization for transactional requests, which now benefit from
this new capability. There's some complexity around supporting this form of
server-side retry, because it must be done above latching instead of below, but
recent refactors in cockroachdb#73557 and cockroachdb#73717 have made this possible to support
cleanly.

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 3, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 8, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 8, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
craig bot pushed a commit that referenced this issue Feb 8, 2022
73732: kv: give non-transactional requests uncertainty intervals r=nvanbenschoten a=nvanbenschoten

Fixes #58459.
Informs #36431.

This commit fixes a long-standing correctness issue where non-transactional requests did not ensure single-key linearizability even if they deferred their timestamp allocation to the leaseholder of their (single) range. They still don't entirely, because of #36431, but this change brings us one step closer to the fix we plan to land for #36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty intervals. This ensures that they also guarantee single-key linearizability even with only loose (but bounded) clock synchronization. Non-transactional requests that use a client-provided timestamp do not have uncertainty intervals and do not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their coordinator node during initialization, non-transactional requests receive uncertainty intervals from their target leaseholder, using a clock reading from the leaseholder's local HLC as the local limit and this clock reading + the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty intervals — after all, they receive their timestamp to the leaseholder of the only range that they talk to, so isn't every value with a commit timestamp above their read timestamp certainly concurrent? The answer is surprisingly "no" for the following reasons, so they cannot forgo the use of uncertainty interval:

1. the request timestamp is allocated before consulting the replica's lease. This means that there are times when the replica is not the leaseholder at the point of timestamp allocation, and only becomes the leaseholder later. In such cases, the timestamp assigned to the request is not guaranteed to be greater than the written_timestamp of all writes served by the range at the time of allocation. This is true despite invariants 1 & 2 presented in `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's timestamp takes over as the leaseholder after the timestamp allocation, we expect minimumLocalLimitForLeaseholder to forward the local uncertainty limit above TimestampFromServerClock, to the lease start time.


   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read

2. even if the replica's lease is stable and the timestamp is assigned to the non-transactional request by the leaseholder, the assigned clock reading only reflects the written_timestamp of all of the writes served by the leaseholder (and previous leaseholders) thus far. This clock reading is not not guaranteed to lead the commit timestamp of all of these writes, especially if they are committed remotely and resolved after the request has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval with a global uncertainty limit far enough in advance of the leaseholder's local HLC clock to ensure that it considers any value that was part of a transaction which could have committed before the request was received by the leaseholder to be uncertain. Concretely, the non-transactional request needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been written on the local leaseholder at time 10, its transaction may have been committed remotely at time 20, acknowledged, then the non-transactional request may have begun and received a timestamp of 15 from the local leaseholder, then finally the value may have been resolved asynchronously and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20). The failure of the non-transactional request to observe this value would be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without involvement of the leaseholder that served the Put (this is at the heart of #36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1 to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100 to ensure that it observes k = v1 in uncertainty interval, performs a server-side retry, bumps its read timestamp, and returns k = v1. Failure to do so would be a stale read

----

This change is related to #36431 in two ways. First, it allows non-transactional requests to benefit from our incoming fix for that issue. Second, it unblocks some of the clock refactors proposed in #72121 (comment), and by extension #72663. Even though there are clear bugs today, I still don't feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make this change. We know from #36431 that **invariant 1** from [`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131) doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send` masks the bugs in this area in many cases. Removing that clock update (I don't actually plan to remove it, but I plan to disconnect it entirely from operation timestamps) without first giving non-transactional requests uncertainty intervals seems like it may reveal these bugs in ways we haven't seen in the past. So I'd like to land this fix before making that change.

----

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp to the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
interval:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
@knz knz added the A-security label Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-transactions Relating to MVCC and the transactional model. A-multitenancy Related to multi-tenancy A-security C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team
Projects
None yet
Development

No branches or pull requests

2 participants