
kv: fix non-transactional requests' interaction with uncertainty intervals #58459

Closed
nvanbenschoten opened this issue Jan 5, 2021 · 0 comments · Fixed by #73732
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. T-kv KV Team

Comments

nvanbenschoten commented Jan 5, 2021

Non-transactional requests, and their decision of which timestamp to use, complicate our transaction model. These requests do not carry uncertainty intervals but still aim to provide single-key linearizability. As a result, they lead to complex interactions with the HLC clock in an attempt to guarantee that non-transactional requests observe any change that was performed causally before them. Instead of assigning a timestamp at the client and using an uncertainty interval to ensure read-your-writes, they defer timestamp assignment to the server.

As discussed in #58349, this complication does not seem worthwhile. From a performance perspective, a non-transactional request should not be any cheaper than a 1PC transaction. And since non-transactional requests cannot span ranges, we know that any non-transactional request can also be expressed as a 1PC transaction. We should explore eliminating non-transactional requests and replacing them with 1PC transactions.
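As a rough sketch of the difference (hypothetical field names, not the real proto definitions), the two request shapes look something like this:

```go
// Illustrative request shapes; hypothetical fields, not the real protos.
package main

import "fmt"

const maxClockOffsetNanos = 500 * 1000 * 1000 // e.g. a 500ms cluster max offset

// NonTxnRequest models today's non-transactional request: no timestamp is
// carried, so the server assigns one from its local HLC on arrival, and no
// uncertainty interval exists to protect against clock skew between nodes.
type NonTxnRequest struct {
	Key string
}

// TxnRequest models a 1PC transaction's read: the client assigns the read
// timestamp and derives a global uncertainty limit from the cluster's
// maximum clock offset, preserving single-key linearizability.
type TxnRequest struct {
	Key                    string
	ReadTimestamp          int64
	GlobalUncertaintyLimit int64
}

func newTxnRequest(key string, clientClock int64) TxnRequest {
	return TxnRequest{
		Key:                    key,
		ReadTimestamp:          clientClock,
		GlobalUncertaintyLimit: clientClock + maxClockOffsetNanos,
	}
}

func main() {
	fmt.Printf("%+v\n", newTxnRequest("k", 1000))
}
```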

Epic CRDB-1514

@nvanbenschoten nvanbenschoten added C-cleanup Tech debt, refactors, loose ends, etc. Solution not expected to significantly change behavior. A-kv-transactions Relating to MVCC and the transactional model. labels Jan 5, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 29, 2021
First commit from cockroachdb#59505.

This commit updates the kv client routing logic to account for the new
`LEAD_FOR_GLOBAL_READS` `RangeClosedTimestampPolicy` added in cockroachdb#59505. In
enterprise builds, non-stale read-only requests to ranges with this
closed timestamp policy can now be routed to follower replicas, just
like stale read-only requests to normal ranges currently are.

In addition to this main change, there are a few smaller changes in this
PR that were hard to split out, so they are included in this commit.

First, non-transactional requests are no longer served by followers even
if the follower replica determines that it could serve them. Previously,
non-transactional requests were never intentionally routed to a
follower replica, but `Replica.canServeFollowerRead` would allow them
through if their timestamp was low enough. This change was made to keep
the client and server logic in sync, and because serving from followers
does not seem safe for non-transactional requests that get their
timestamp assigned on the server. We're planning to remove
non-transactional requests soon anyway (cockroachdb#58459), so this isn't
a big deal. It mostly just caused some testing fallout.

Second, transactions that have previously written intents are now
allowed to have their read-only requests served by followers, as
long as those followers have a closed timestamp above the transaction's
read *and* write timestamp. Previously, we had avoided this because it
seemed to cause issues with read-your-writes. However, it appears safe
as long as the write timestamp is below the closed timestamp, because we
know all of the transaction's intents are at or below its write
timestamp. This is very important for multi-region work, where we want a
transaction to be able to write to a REGIONAL table and then later
perform local reads (maybe foreign key checks) on GLOBAL tables.
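
A minimal sketch of this loosened check, with simplified integer timestamps (the real logic lives in `Replica.canServeFollowerRead`):

```go
// A follower may serve a read for a transaction that has written intents
// only if its closed timestamp covers both the txn's read and write
// timestamps: all intents sit at or below the write timestamp, so a
// closed timestamp above both preserves read-your-writes.
func canServeFollowerRead(followerClosedTS, txnReadTS, txnWriteTS int64) bool {
	required := txnReadTS
	if txnWriteTS > required {
		required = txnWriteTS
	}
	return followerClosedTS >= required
}
```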

Third, a max clock offset shift in `canUseFollowerRead` was removed. It
wasn't exactly clear what this shift was doing. It was introduced in the
original 70be833 and seemed to allow a follower read in cases where it
otherwise shouldn't be expected to succeed. I thought at first that it
was accounting for the fact that the kv client's clock might be leading
the kv server's clock, and so it was being pessimistic about the expected
closed timestamp, but it actually seems to be shifting the other way, so
I don't know. I might be missing something.

Finally, the concept of a `replicaoracle.OracleFactoryFunc` was removed
and the existing `replicaoracle.OracleFactory` takes its place. We no
longer need the double-factory approach because the transaction object
is now passed directly to `Oracle.ChoosePreferredReplica`. This was a
necessary change, because the process of determining whether a follower
read can be served now requires transaction information and range
information (i.e. the closed timestamp policy), so we need to make this
determination in the Oracle itself instead of in the OracleFactory. This
all seems like a simplification anyway.

This is still waiting on changes to the closed timestamp system to be
able to write end-to-end tests using this new functionality.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 6, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 9, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 9, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 12, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 15, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 16, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 17, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 18, 2021
craig bot pushed a commit that referenced this issue Feb 18, 2021
58362: *: craft rangefeed abstraction, support cluster settings in tenants r=nvanbenschoten a=ajwerner

This PR is a reworking of @dt's excellent effort in #57926
to pick up rangefeeds for tenant cluster settings after sprucing up the libraries around
them. @tbg and @nvanbenschoten I'm tagging you accordingly given that you reviewed
that PR.

### First two commits (`kvclient/rangefeed` package and adoption in `catalog/lease`)

The first commit in this PR abstracts out some rangefeed client logic around
retries to make it more generally usable in other contexts. The second
commit adopts this package in `sql/catalog/lease`. These two commits can
(and probably should) be merged separately and exist in
#58361.

### Next two commits (`server/settingswatcher` package and adoption in `server`)

These commits introduce code to update the settings using the above rangefeed
package and then hook it up to the sql server.

### Next three commits

Remove the restrictions on tenants updating settings, mark which settings can be
updated, and generate docs.

These are lifted straight from @dt in #57926.

Maybe closes #56623.


59571: kv: route present-time reads to global_read follower replicas r=nvanbenschoten a=nvanbenschoten

60021: roachpb: use LocalUncertaintyLimit from err in readWithinUncertaintyIntervalRetryTimestamp r=nvanbenschoten a=nvanbenschoten

Fixes #57685.

This commit updates `readWithinUncertaintyIntervalRetryTimestamp` to use the `LocalUncertaintyLimit` field from `ReadWithinUncertaintyIntervalError` in place of `ObservedTimestamps`. This addresses a bug where `ReadWithinUncertaintyIntervalErrors` thrown by follower replicas during follower reads would use meaningless observed timestamps.
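
A hedged sketch of the new retry-timestamp computation, with simplified integer timestamps (field names follow the error type; the real code uses `hlc` timestamps):

```go
// The retry timestamp must move past both the uncertain value and the
// local uncertainty limit reported in the error, so the retried read
// cannot hit the same ReadWithinUncertaintyIntervalError again.
func readWithinUncertaintyIntervalRetryTimestamp(
	existingValueTS, localUncertaintyLimit int64,
) int64 {
	ts := existingValueTS + 1 // just past the uncertain value
	if localUncertaintyLimit > ts {
		ts = localUncertaintyLimit
	}
	return ts
}
```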

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 28, 2021
This commit combines the server-side "local" and "global" uncertainty
limits into an `uncertainty.Interval` struct. It also centralizes the
computation of this interval to an `uncertainty.ComputeInterval` function.

As a result, MVCC no longer reaches inside of a `Transaction` object to
construct its implied uncertainty interval. Instead, an uncertainty
interval is supplied to MVCC directly.

This refactor is made in preparation for a follow-on commit that will
address cockroachdb#58459 (giving non-transactional requests uncertainty intervals),
which in turn will prepare to simplify HLC handling throughout the system.

The refactor also cleanly addresses a goal I've had to make the local
uncertainty limit a `ClockTimestamp`, which helps clarify its role in
the system.
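
A simplified sketch of the resulting shape (the real definitions live in `pkg/kv/kvserver/uncertainty` and use `hlc` timestamps; the written-timestamp dimension is omitted):

```go
// Interval bounds a reader's uncertainty. Values above GlobalLimit are
// certainly concurrent; LocalLimit, when set, tightens the bound using a
// clock observation from the relevant node.
type Interval struct {
	GlobalLimit int64
	LocalLimit  int64 // 0 means unset
}

// IsUncertain reports whether a value whose timestamp is above the
// reader's timestamp must trigger an uncertainty restart. MVCC now takes
// an interval like this directly instead of deriving one from the
// Transaction object.
func (in Interval) IsUncertain(valueTS int64) bool {
	if in.LocalLimit != 0 && valueTS > in.LocalLimit {
		return false // proven concurrent by the local clock observation
	}
	return valueTS <= in.GlobalLimit
}
```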
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 30, 2021
@nvanbenschoten nvanbenschoten changed the title from "kv: remove non-transactional requests" to "kv: fix non-transactional requests' interaction with uncertainty intervals" Nov 30, 2021
craig bot pushed a commit that referenced this issue Nov 30, 2021
72876: server,ccl: replace a few usages of keys.TODOSQLCodec r=e-mbrown a=e-mbrown

references #48123

Replaced a few instances of keys.TODOSQLCodec with proper codecs.

Release note: None

72898: tracing: untangle local/remote parent vs recording collection r=andreimatei a=andreimatei

Before this patch, two concepts were bundled together: whether or not
the parent of a span is local or remote to the child's node, and whether
or not the parent should include the child in its recording. You could
create a child span with either:
- the WithParentAndAutoCollection option, which would give you a local
  parent which includes the child in its recording, or
- the WithParentAndManualCollection option, which would give you a
  remote parent (which naturally would not / couldn't include the child
  in its recording).

The WithParentAndManualCollection option is sometimes used even when the
parent is local, which is a) quite confusing, because at a lower level it
produced what was called a "remote parent", and b) sub-optimal, because
the child looked like a "local root", which meant it had to be put into
the active spans registry. Were the two concerns not bundled together,
such a child of a local parent, one whose recording the parent is not in
charge of collecting, would not need to be put directly into the span
registry; such a span (like other child spans) should be part of the
registry indirectly, through its parent - which is cheaper because it
avoids contention on the global mutex lock.

The funky case of a local parent whose recording does not include the
child's is used by DistSQL processors - each processor gets a span whose
recording is collected by the DistSQL infrastructure, independent of the
parent.

This patch untangles the parent-child relationship from the
recording collection concern, and cleans up the API in the process. Now
you have the following options for creating a span:

WithParent(parent) - creates a parent/child relationship. The parent has
a reference to the child. The child is part of the active spans registry
through the parent. Unless WithDetachedRecording() is also specified,
the parent's recording includes the child's.

WithRemoteParent(parentMeta) - creates a parent/child relationship
across process boundaries.

WithDetachedRecording() - asks the parent to not include the new child
in its recording. This is a no-op in case WithRemoteParent() is used.
This option is used by DistSQL.
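
A toy, self-contained sketch of the resulting option API (the option names come from this PR; the `Span`/`StartSpan` machinery is a stand-in, not the real tracing package, and `WithRemoteParent` is omitted):

```go
package main

import "fmt"

// Span is a toy stand-in for a tracing span.
type Span struct {
	name     string
	children []*Span // children whose recordings this span collects
}

type options struct {
	parent   *Span
	detached bool
}

// Option configures span creation, mirroring the API described above.
type Option func(*options)

func WithParent(p *Span) Option     { return func(o *options) { o.parent = p } }
func WithDetachedRecording() Option { return func(o *options) { o.detached = true } }

func StartSpan(name string, opts ...Option) *Span {
	var o options
	for _, opt := range opts {
		opt(&o)
	}
	s := &Span{name: name}
	// A parented, non-detached child is reachable through its parent, so it
	// need not be added to a global registry, and its recording is collected
	// by the parent. A detached child is registered separately (not modeled).
	if o.parent != nil && !o.detached {
		o.parent.children = append(o.parent.children, s)
	}
	return s
}

func main() {
	flow := StartSpan("flow")
	StartSpan("reader", WithParent(flow))                             // parent records child
	StartSpan("processor", WithParent(flow), WithDetachedRecording()) // DistSQL-style
	fmt.Println(len(flow.children)) // 1
}
```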

So, we get a cleaner API, and a more efficient active spans registry
implementation because the DistSQL spans are no longer directly part of
it. Also, we open the door for a new possibility in the future: you can
imagine that, in certain scenarios, it'd be interesting for a span to
collect the recording of all its children, including the ones whose
recording, under normal circumstances, is not the parent's
responsibility to collect. For example, when asking for the recording of
a span in the registry in an out-of-band way (i.e. when debugging), I
want all the info I can get. So, I should be able to point to a DistSQL
flow span and ask it for the recording of all processors on its node.
This is not implemented yet.

Release note: None

73244: kv: combine local and global uncertainty limit into Interval struct r=nvanbenschoten a=nvanbenschoten

73279: kv: detect context cancellation in limitBulkIOWrite, avoid log spam r=nvanbenschoten a=nvanbenschoten

This commit adds logic to propagate context cancellation in `limitBulkIOWrite`.
This function is used in two places: 1) when ingesting ssts, and 2) when
receiving a snapshot. The first case uses the Raft scheduler goroutine's
context, so it never gets cancelled. The second case uses the context of the
sender of a Raft snapshot, so it can get cancelled.

In customer clusters, we were seeing Raft snapshots hit their deadline and begin
spamming `error rate limiting bulk io write: context deadline exceeded` error
messages. This was bad for two reasons. First, it was very noisy. Second, it
meant that a Raft snapshot that was no longer going to succeed was still writing
out full SSTs while holding on to the `snapshotApplySem`. This contributed to
the snapshot starvation we saw in the issue.

With this commit, `limitBulkIOWrite` will eagerly detect context cancellation
and will propagate the cancellation up to the caller, allowing the caller to
quickly release resources.
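
A hedged sketch of the shape of the fix, using `golang.org/x/time/rate` as a stand-in for the real quota-pool-based limiter:

```go
package sketch

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// limitBulkIOWrite sketches the fixed behavior (the real implementation
// differs): waiting on the limiter must observe ctx cancellation and
// propagate it, so a dead snapshot stops writing SSTs and releases
// snapshotApplySem promptly instead of spamming errors.
func limitBulkIOWrite(ctx context.Context, limiter *rate.Limiter, cost int) error {
	// WaitN returns early with an error when ctx is cancelled or its
	// deadline expires; propagate that to the caller instead of logging.
	if err := limiter.WaitN(ctx, cost); err != nil {
		return fmt.Errorf("rate limiting bulk io write: %w", err)
	}
	return nil
}
```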

Release notes (bug fix): Raft snapshots now detect timeouts earlier and avoid
spamming the logs with `context deadline exceeded` errors.

Co-authored-by: e-mbrown <ebsonari@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 13, 2021
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.
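
A minimal sketch of that construction, with simplified nanosecond timestamps:

```go
// The leaseholder's clock reading serves as both the request timestamp
// and the local uncertainty limit; the global limit pads the reading by
// the cluster's maximum clock offset to cover transactions that may have
// committed remotely just before the request arrived.
func nonTxnUncertainty(leaseholderClock, maxClockOffset int64) (reqTS, localLimit, globalLimit int64) {
	return leaseholderClock, leaseholderClock, leaseholderClock + maxClockOffset
}
```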

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go`, because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

2. ignoring reason #1, the request timestamp assigned by the leaseholder is
   equal to its local uncertainty limit and is guaranteed to lead the
   written_timestamp of all writes served by the range at the time the
   request arrived at the leaseholder node. However, this timestamp is still
   not guaranteed to lead the commit timestamp of all of these writes. As a
   result, the non-transactional request needs an uncertainty interval with a
   global uncertainty limit far enough in advance of the local HLC clock to
   ensure that it considers any value that was part of a transaction which
   could have committed before the request was received by the leaseholder to
   be uncertain. Concretely, the non-transactional request needs to consider
   values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.
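
That condition, written out as a sketch, with the stale-read example from the text as a sanity check:

```go
// A value is uncertain to the non-transactional request exactly when it
// was written locally before the request's clock reading and could have
// committed within the padded global window.
func isUncertain(writtenTS, commitTS, localLimit, globalLimit int64) bool {
	return writtenTS < localLimit && commitTS < globalLimit
}

// Example from the text: written at 10, committed at 20, request assigned
// 15 by the leaseholder (the local limit), global limit = 15 + max offset
// (> 20). isUncertain(10, 20, 15, 15+offset) == true, so the request
// retries at a higher timestamp instead of serving a stale read.
```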

----

Conveniently, because non-transactional requests are always scoped to a
single range, those that hit uncertainty errors can always retry on the server,
so these errors never bubble up to the client that initiated the request.

However, before this commit, this wasn't actually possible because we did not
support server-side retries of `ReadWithinUncertaintyIntervalError`. The second
part of this commit adds support for this. This new support allows us to make
the guarantee we want about non-transactional requests never returning these
errors to the client, even though they use uncertainty intervals. It also serves
as a performance optimization for transactional requests, which now benefit from
this new capability. There's some complexity around supporting this form of
server-side retry, because it must be done above latching instead of below, but
recent refactors in cockroachdb#73557 and cockroachdb#73717 have made this possible to support
cleanly.
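
A hedged sketch of such a server-side retry loop (toy types; in the real code the retry sits above latching in the replica's send path):

```go
type request struct {
	key string
	ts  int64
}

type uncertaintyError struct{ valueTS, localUncertaintyLimit int64 }

func (e *uncertaintyError) Error() string { return "read within uncertainty interval" }

// serveNonTxnRead retries uncertainty errors locally: the request is
// single-range, so no other node needs to hear about the restart and the
// error never reaches the client.
func serveNonTxnRead(req *request, evaluate func(*request) (string, error)) (string, error) {
	for {
		res, err := evaluate(req)
		uerr, ok := err.(*uncertaintyError)
		if !ok {
			return res, err // success, or a non-retryable error
		}
		// Bump the timestamp past the uncertain value and the local
		// uncertainty limit, then re-evaluate at the new timestamp.
		req.ts = uerr.valueTS + 1
		if uerr.localUncertaintyLimit > req.ts {
			req.ts = uerr.localUncertaintyLimit
		}
	}
}
```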

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note (performance improvement): Certain forms of automatically retried
"read uncertainty" errors are now retried more efficiently, avoiding a network
round trip.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 3, 2022
Fixes cockroachdb#58459.
Informs cockroachdb#36431.

This commit fixes a long-standing correctness issue where non-transactional
requests did not ensure single-key linearizability even if they deferred their
timestamp allocation to the leaseholder of their (single) range. They still
don't entirely, because of cockroachdb#36431, but this change brings us one step closer to
the fix we plan to land for cockroachdb#36431 also applying to non-transactional requests.

The change addresses this by giving non-transactional requests uncertainty
intervals. This ensures that they also guarantee single-key linearizability even
with only loose (but bounded) clock synchronization. Non-transactional requests
that use a client-provided timestamp do not have uncertainty intervals and do
not make real-time ordering guarantees.

Unlike transactions, which establish an uncertainty interval on their
coordinator node during initialization, non-transactional requests receive
uncertainty intervals from their target leaseholder, using a clock reading
from the leaseholder's local HLC as the local limit and this clock reading +
the cluster's maximum clock skew as the global limit.

It is somewhat non-intuitive that non-transactional requests need uncertainty
intervals — after all, they receive their timestamp from the leaseholder of the
only range that they talk to, so isn't every value with a commit timestamp
above their read timestamp certainly concurrent? The answer is surprisingly
"no" for the following reasons, so they cannot forgo the use of uncertainty
intervals:

1. the request timestamp is allocated before consulting the replica's lease.
   This means that there are times when the replica is not the leaseholder at
   the point of timestamp allocation, and only becomes the leaseholder later.
   In such cases, the timestamp assigned to the request is not guaranteed to
   be greater than the written_timestamp of all writes served by the range at
   the time of allocation. This is true despite invariants 1 & 2 presented in
   `pkg/kv/kvserver/uncertainty/doc.go` because the replica allocating the
   timestamp is not yet the leaseholder.

   In cases where the replica that assigned the non-transactional request's
   timestamp takes over as the leaseholder after the timestamp allocation, we
   expect minimumLocalLimitForLeaseholder to forward the local uncertainty
   limit above TimestampFromServerClock, to the lease start time.

   For example, consider the following series of events:
   - client A writes k = v1
   - leaseholder writes v1 at ts = 100
   - client A receives ack for write
   - client B wants to read k using a non-txn request
   - follower replica with slower clock receives non-txn request
   - follower replica assigns request ts = 95
   - lease transferred to follower replica with lease start time = 101
   - non-txn request must use 101 as limit of uncertainty interval to ensure
     that it observes k = v1 in uncertainty interval, performs a server-side
     retry, bumps its read timestamp, and returns k = v1.

2. even if the replica's lease is stable and the timestamp is assigned to the
   non-transactional request by the leaseholder, the assigned clock reading
   only reflects the written_timestamp of all of the writes served by the
   leaseholder (and previous leaseholders) thus far. This clock reading is
   not guaranteed to lead the commit timestamp of all of these writes,
   especially if they are committed remotely and resolved after the request
   has received its clock reading but before the request begins evaluating.

   As a result, the non-transactional request needs an uncertainty interval
   with a global uncertainty limit far enough in advance of the leaseholder's
   local HLC clock to ensure that it considers any value that was part of a
   transaction which could have committed before the request was received by
   the leaseholder to be uncertain. Concretely, the non-transactional request
   needs to consider values of the following form to be uncertain:

     written_timestamp < local_limit && commit_timestamp < global_limit

   The value that the non-transactional request is observing may have been
   written on the local leaseholder at time 10, its transaction may have been
   committed remotely at time 20, acknowledged, then the non-transactional
   request may have begun and received a timestamp of 15 from the local
   leaseholder, then finally the value may have been resolved asynchronously
   and moved to timestamp 20 (written_timestamp: 10, commit_timestamp: 20).
   The failure of the non-transactional request to observe this value would
   be a stale read.

   For example, consider the following series of events:
   - client A begins a txn and is assigned provisional commit timestamp = 95
   - client A's txn performs a Put(k, v1)
   - leaseholder serves Put(k, v1), lays down intent at written_timestamp = 95
   - client A's txn performs a write elsewhere and hits a WriteTooOldError
     that bumps its provisional commit timestamp to 100
   - client A's txn refreshes to ts = 100. This notably happens without
     involvement of the leaseholder that served the Put (this is at the heart
     of cockroachdb#36431), so that leaseholder's clock is not updated
   - client A's txn commits remotely and client A receives the acknowledgment
   - client B initiates non-txn read of k
   - leaseholder assigns read timestamp ts = 97
   - asynchronous intent resolution resolves the txn's intent at k, moving v1
     to ts = 100 in the process
   - non-txn request must use an uncertainty interval that extends past 100
     to ensure that it observes k = v1 in uncertainty interval, performs a
     server-side retry, bumps its read timestamp, and returns k = v1. Failure
     to do so would be a stale read

----

This change is related to cockroachdb#36431 in two ways. First, it allows non-transactional
requests to benefit from our incoming fix for that issue. Second, it unblocks
some of the clock refactors proposed in cockroachdb#72121 (comment),
and by extension cockroachdb#72663. Even though there are clear bugs today, I still don't
feel comfortable removing the `hlc.Clock.Update` in `Store.Send` until we make
this change. We know from cockroachdb#36431 that invariant 1 from
[`uncertainty.D6`](https://github.com/cockroachdb/cockroach/blob/22df0a6658a927e7b65dafbdfc9d790500791993/pkg/kv/kvserver/uncertainty/doc.go#L131)
doesn't hold, and yet I still do think the `hlc.Clock.Update` in `Store.Send`
masks the bugs in this area in many cases. Removing that clock update (I don't
actually plan to remove it, but I plan to disconnect it entirely from operation
timestamps) without first giving non-transactional requests uncertainty intervals
seems like it may reveal these bugs in ways we haven't seen in the past. So I'd
like to land this fix before making that change.

----

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 8, 2022
craig bot pushed a commit that referenced this issue Feb 8, 2022
73732: kv: give non-transactional requests uncertainty intervals r=nvanbenschoten a=nvanbenschoten

Release note (performance improvement): Certain forms of automatically retried "read uncertainty" errors are now retried more efficiently, avoiding a network round trip.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in d676d7e Feb 8, 2022
RajivTS pushed a commit to RajivTS/cockroach that referenced this issue Mar 6, 2022