kv: fix interaction between lease transfers and future time operations #57688

Closed
nvanbenschoten opened this issue Dec 8, 2020 · 0 comments · Fixed by #60521
Labels: A-kv-transactions (Relating to MVCC and the transactional model), A-multiregion (Related to multi-region), C-enhancement (Solution expected to add code/behavior + preserve backward-compat; pg compat issues are exception)

Comments

@nvanbenschoten
Member

Needed for non-blocking transactions: #52745.

Once we begin allowing operations at future times, lease transfers need to incorporate the maximum time of reads performed on the current leaseholder. The transfer should either prevent the next lease from having a start time before this maximum read timestamp, or at least inform the next leaseholder of this maximum read timestamp.

nvanbenschoten added the C-enhancement, A-kv-transactions, A-multiregion, and T-multiregion labels Dec 8, 2020
nvanbenschoten self-assigned this Dec 8, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 14, 2021
This commit adds lease transfers to KV nemesis. This will be useful in
validating that I don't mess anything up when addressing cockroachdb#57688 or when
performing the precursor refactor to pull lease checks below latching.

I stressed this for 50,000 iterations on a 20 node roachprod cluster
without failure, so it looks like lease transfers aren't already broken!
craig bot pushed a commit that referenced this issue Jan 15, 2021
59008: kvnemesis: add lease transfers r=nvanbenschoten a=nvanbenschoten

This commit adds lease transfers to KV nemesis. This will be useful in validating that I don't mess anything up when addressing #57688 or when performing the precursor refactor to pull lease checks below latching.

I stressed this for 50,000 iterations on a 20 node roachprod cluster without failure, so it looks like lease transfers aren't already broken!

59056: bazel,colexec: generate crossjoiner.eg.go within bazel  r=rickystewart a=alan-mas

We broke the bazel build in 52c5f51 when we introduced a new .eg.go
file, without updating the bazel target. We do that here.

Fixes #59052

59070: cmd/roachtest: add workload F to pebble roachtest r=petermattis a=jbowens

Run workload F (100% inserts) in the nightly Pebble benchmarks. A
performance regression in the 100%-insert workload went unnoticed.
Adding it to the nightly benchmarks will help detect future regressions.

Release note: none

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Alanmas <acostas.alan@gmail.com>
Co-authored-by: Jackson Owens <jackson@cockroachlabs.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 17, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drop the replica mutex,
then propose a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.
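
As a rough illustration of the ordering argument above, here is a minimal Go sketch. It is not the kvserver code: `Timestamp`, `Clock`, `Replica`, `Send`, and `TransferLease` are hypothetical toy types and methods, and the lease is reduced to a single `transferInProgress` flag. The only thing it tries to show is that bumping the clock before the lease check makes the clock reading taken by the transfer an upper bound on every request that passed the check.

```go
package main

import (
	"fmt"
	"sync"
)

// Timestamp is a simplified stand-in for an HLC timestamp (hypothetical).
type Timestamp int64

// Clock is a toy monotonic clock standing in for the node's HLC clock.
type Clock struct {
	mu  sync.Mutex
	now Timestamp
}

// Update forwards the clock to ts if ts is ahead of the current reading.
func (c *Clock) Update(ts Timestamp) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ts > c.now {
		c.now = ts
	}
}

// Now returns the current clock reading.
func (c *Clock) Now() Timestamp {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.now
}

// Replica models only the state relevant to the lease check.
type Replica struct {
	mu                 sync.Mutex // stand-in for the replica mutex
	clock              *Clock
	transferInProgress bool
}

// Send models the request path: bump the clock first, then check for an
// ongoing lease transfer under the replica mutex. It reports whether the
// request may be served under the current lease.
func (r *Replica) Send(reqTS Timestamp) bool {
	r.clock.Update(reqTS) // the bump happens before the lease check
	r.mu.Lock()
	defer r.mu.Unlock()
	return !r.transferInProgress
}

// TransferLease models the transfer path: under the replica mutex, read the
// clock and stop using the current lease. The returned reading becomes the
// start time of the next lease.
func (r *Replica) TransferLease() Timestamp {
	r.mu.Lock()
	defer r.mu.Unlock()
	start := r.clock.Now()
	r.transferInProgress = true
	return start
}

func main() {
	r := &Replica{clock: &Clock{now: 10}}
	served := r.Send(15)       // passes the lease check, clock is now 15
	start := r.TransferLease() // reads 15, which bounds the served request
	fmt.Println(served, start) // true 15
	fmt.Println(r.Send(20))    // false: the transfer has begun
}
```

In this toy, any request that passed the check took the replica mutex before the transfer did, and had already forwarded the clock, so its timestamp cannot exceed the transfer's clock reading. A request that skips the `Update` call, as a future-time operation would, falls outside that bound.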

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely align the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.
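
The "grab latches, check range state, drop latches and wait if necessary, retry" structure from item 3 can be sketched roughly as follows. This is an illustrative toy, not the kvserver implementation: `errInvalidLease` stands in for `InvalidLeaseError`, the body of `isConcurrencyRetryError` is an assumption, and `Replica`, `evaluate`, `acquireLease`, `Send`, and `maxAttempts` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errInvalidLease is a toy stand-in for InvalidLeaseError.
var errInvalidLease = errors.New("invalid lease")

// isConcurrencyRetryError reports whether the request should drop its
// latches, wait for some condition, and retry. The body is an assumption.
func isConcurrencyRetryError(err error) bool {
	return errors.Is(err, errInvalidLease)
}

// Replica is a toy model with one mutex as the latch and one for state.
type Replica struct {
	latch        sync.Mutex // coarse stand-in for latches
	mu           sync.Mutex // protects the fields below
	leaseValid   bool
	acquisitions int
}

// evaluate checks the lease below latching and, if it is valid, serves the
// request. The latch is released when evaluate returns.
func (r *Replica) evaluate() (string, error) {
	r.latch.Lock()
	defer r.latch.Unlock()
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.leaseValid {
		return "", errInvalidLease
	}
	return "ok", nil
}

// acquireLease is the heavy-weight step, run without holding the latch.
func (r *Replica) acquireLease() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.leaseValid = true
	r.acquisitions++
}

const maxAttempts = 3 // arbitrary bound for the sketch

// Send runs the retry loop: evaluate under latches, and on a retryable
// error resolve the condition above latching before trying again.
func (r *Replica) Send() (string, error) {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		res, err := r.evaluate()
		if err == nil {
			return res, nil
		}
		if !isConcurrencyRetryError(err) {
			return "", err
		}
		r.acquireLease() // latches were dropped when evaluate returned
	}
	return "", errors.New("too many retries")
}

func main() {
	r := &Replica{}
	res, err := r.Send()
	fmt.Println(res, err, r.acquisitions) // ok <nil> 1
}
```

The first attempt fails the lease check, the lease is acquired with no latch held, and the second attempt succeeds, which mirrors how the existing merge handling waits and retries.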

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worst-case
latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transaction's uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 21, 2021
Needed for cockroachdb#57688.

craig bot pushed a commit that referenced this issue Jan 29, 2021
58904: kv: generalize lease stasis, accept future-time requests under valid lease r=nvanbenschoten a=nvanbenschoten

Relates to #57688.

This PR performs a series of cleanup steps and eventually generalizes the
replica lease check to consult a request's specific timestamp instead of
assuming that all requests operate at or below `time.Now()`. It then adds to this
generalization some safeguards to ensure that future-time requests don't get
themselves into infinite lease extension/acquisition loops or cause other kinds
of problems due to the requests being too far into the future.

It doesn't go as far as properly handling lease transfers or merges in the case
that a leaseholder has served future-time operations, but that will follow
shortly in another PR.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 1, 2021
Needed for cockroachdb#57688.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drop the replica mutex,
then propose a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.
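
To make the flushing argument concrete, here is a minimal Go sketch. It is not the kvserver code: a `sync.RWMutex` stands in for the latch manager (real latches are keyed by span), a single integer stands in for the timestamp cache, and the names `replica`, `read`, `transferLease`, and `runDemo` are hypothetical. It only illustrates that once the lease check and the transfer both run under latches, the transfer cannot observe the cache before an already-admitted read has updated it.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotLeaseholder = errors.New("not leaseholder")

// replica is a toy stand-in for a range replica. The RWMutex plays the role
// of latching: ordinary requests hold it shared, a lease transfer holds it
// exclusively.
type replica struct {
	latches sync.RWMutex

	mu          sync.Mutex // protects the fields below
	leaseValid  bool
	maxReadTime int64 // toy timestamp cache: highest read timestamp served
}

// read serves a read at timestamp ts. The lease is checked only after the
// latch has been acquired.
func (r *replica) read(ts int64) error {
	r.latches.RLock()
	defer r.latches.RUnlock()

	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.leaseValid {
		return errNotLeaseholder
	}
	// The read is evaluated here, then recorded in the toy timestamp cache.
	if ts > r.maxReadTime {
		r.maxReadTime = ts
	}
	return nil
}

// transferLease gives up the lease and returns the maximum read time. Taking
// the latch exclusively waits for every in-flight read to finish first.
func (r *replica) transferLease() int64 {
	r.latches.Lock()
	defer r.latches.Unlock()

	r.mu.Lock()
	defer r.mu.Unlock()
	r.leaseValid = false
	return r.maxReadTime
}

// runDemo races 100 reads against one transfer and reports the maximum read
// time seen by the transfer and the highest timestamp that was served.
func runDemo() (summary, maxServed int64) {
	r := &replica{leaseValid: true}
	var wg sync.WaitGroup
	var servedMu sync.Mutex
	for ts := int64(1); ts <= 100; ts++ {
		wg.Add(1)
		go func(ts int64) {
			defer wg.Done()
			if err := r.read(ts); err == nil {
				servedMu.Lock()
				if ts > maxServed {
					maxServed = ts
				}
				servedMu.Unlock()
			}
		}(ts)
	}
	summary = r.transferLease()
	wg.Wait()
	return summary, maxServed
}

func main() {
	summary, maxServed := runDemo()
	fmt.Println("no served read exceeds the transferred maximum:", maxServed <= summary)
}
```

Which reads win the race varies from run to run, but every read that was served is covered by the value the transfer returns.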

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely align the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.
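
The structure named in item 3 (grab latches, check range state, drop latches and wait if necessary, retry) can be sketched roughly as follows. This is a toy model rather than the real request path: `toyReplica`, `evalUnderLatches`, `acquireLease`, `execute`, and `isRetryable` are hypothetical names, with `isRetryable` being a simplified analogue of `isConcurrencyRetryError` that only knows about an invalid lease.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidLease is a toy stand-in for the InvalidLeaseError described above.
var errInvalidLease = errors.New("invalid lease")

// isRetryable is a simplified, hypothetical analogue of
// isConcurrencyRetryError.
func isRetryable(err error) bool {
	return errors.Is(err, errInvalidLease)
}

// toyReplica tracks only whether it currently holds a valid lease.
type toyReplica struct {
	hasLease bool
}

// evalUnderLatches models the part of the request path that runs with
// latches held: the lease is checked here, below latching.
func (r *toyReplica) evalUnderLatches() error {
	if !r.hasLease {
		return errInvalidLease
	}
	return nil
}

// acquireLease models the heavy-weight lease acquisition, which runs only
// after the latches have been dropped.
func (r *toyReplica) acquireLease() {
	r.hasLease = true
}

// execute retries the request until it succeeds or hits a non-retryable
// error, and reports how many attempts were needed.
func execute(r *toyReplica) (int, error) {
	for attempts := 1; ; attempts++ {
		err := r.evalUnderLatches() // grab latches, check range state
		if err == nil {
			return attempts, nil
		}
		if !isRetryable(err) {
			return attempts, err
		}
		// Latches are dropped at this point; wait for a lease, then retry.
		r.acquireLease()
	}
}

func main() {
	attempts, err := execute(&toyReplica{})
	fmt.Println(attempts, err) // prints "2 <nil>"
}
```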

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worst-case
latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transaction's uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
craig bot pushed a commit that referenced this issue Feb 6, 2021
59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten

Needed for #57688.

This PR reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general.

### Background

Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder.

This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drop the replica mutex, then propose a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend it.

### Motivation

The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to look for this information is the timestamp cache, which records the upper bound read time of each key span in a range, even if this upper bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!).

### Solution

The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we don't only block new requests, but we also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will.

### Other benefits

Besides this primary change, a number of other benefits fall out of this restructuring.

1. we avoid relying on custom synchronization around leases, instead relying on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere.
3. we more closely align the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on #57077.

### Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve attention.

The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like #22837 and #32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are no longer applied to a request to reduce its MaxOffset until after latching and locking, instead of before. This sounds concerning, but it's actually not for two reasons. First, as of #57136, a transaction's uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed timestamp-limited MaxOffset was being used for latching is no longer correct in a world with synthetic timestamps (see #57077), so we would have had to make this change anyway. So put together, this behavioral change isn't meaningful.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 9, 2021
Relates to cockroachdb#57688.

This commit updates the following four request types to properly handle
future-time operations:
- `PushTxnRequest`
- `QueryTxnRequest`
- `QueryIntentRequest`
- `RecoverTxnRequest`

It also updates the request evaluation code to properly check that all
timestamp cache updates are safe, based on the batch header timestamps
of the requests. The next commit will be adding a more strict assertion
about proper timestamp cache use, so it's better that we catch these
violations early.

In doing so, the commit also updates the replica lease check to test
against the batch's read or write timestamp, whichever is later. This
addresses the concerns raised in cockroachdb#58904 (review).
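
As a rough illustration of checking the lease against the later of the two batch timestamps, consider the Go sketch below. The `timestamp` type, `leaseCheckTimestamp`, and `leaseCovers` are hypothetical simplifications, not the kvserver API, and the sketch assumes a lease is usable only for timestamps strictly below its expiration.

```go
package main

import "fmt"

// timestamp is a simplified hybrid-logical timestamp.
type timestamp struct {
	wallTime int64
	logical  int32
}

// less reports whether t orders before o.
func (t timestamp) less(o timestamp) bool {
	return t.wallTime < o.wallTime ||
		(t.wallTime == o.wallTime && t.logical < o.logical)
}

// leaseCheckTimestamp returns the later of a batch's read and write
// timestamps, which is the time the lease check should be performed at.
func leaseCheckTimestamp(readTS, writeTS timestamp) timestamp {
	if readTS.less(writeTS) {
		return writeTS
	}
	return readTS
}

// leaseCovers reports whether a lease with the given expiration can serve a
// request at ts (assumption: strictly below the expiration).
func leaseCovers(expiration, ts timestamp) bool {
	return ts.less(expiration)
}

func main() {
	expiration := timestamp{wallTime: 100}
	readTS := timestamp{wallTime: 90}
	writeTS := timestamp{wallTime: 110} // a future-time write

	// Checking only the read timestamp would wrongly accept the batch.
	fmt.Println(leaseCovers(expiration, readTS)) // prints "true"
	// Checking the later of the two timestamps rejects it.
	fmt.Println(leaseCovers(expiration, leaseCheckTimestamp(readTS, writeTS))) // prints "false"
}
```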
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 11, 2021
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 12, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 12, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 12, 2021
Fixes cockroachdb#57688.
Fixes cockroachdb#59679.
Fixes cockroachdb#60520.

This commit introduces new logic to ship summaries of a leaseholder's
timestamp cache through lease transfers and range merges. For lease
transfers, the read summary is sent from the outgoing leaseholder to the
incoming leaseholder. For range merges, the read summary is sent from
the right-hand side leaseholder (through the SubsumeResponse), to the
left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role that the lease start time and merge
freeze time used to play for lease transfers and range merges,
respectively - the summaries instruct the post-operation leaseholder on
how to update its timestamp cache to ensure that no future writes are
allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the
    range through a lease transfer, to make the lease transfers less
    disruptive to writes because the timestamp cache won't be bumped as
    high. This avoids transaction aborts and retries after lease
    transfers and merges.
2. they can transfer information about reads with synthetic timestamps,
    which are not otherwise captured by the new lease's start time.
    Because of this, they are needed for correctness on `global_read`
    ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very
low-resolution read summaries. However, it sets up the infrastructure
that will allow us to realize the benefit in the future by capturing and
shipping higher-resolution read summaries. The commit does realize the
second benefit, as it fixes correctness issues around future time reads.
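
The second benefit can be illustrated with a small Go sketch. Everything in it is a simplified assumption: `tsCache`, `readSummary`, `summarize`, `applySummary`, and `writeTimestamp` are hypothetical names, timestamps are plain integers, and the summary is a single low-water mark, mirroring the very low-resolution summaries this commit ships. The point is only that a read served at a future timestamp (150) sits above the new lease's start time (100), so the start time alone does not protect it.

```go
package main

import "fmt"

// readSummary is a toy low-resolution summary: one timestamp at or above
// every read served by the previous leaseholder.
type readSummary struct {
	lowWater int64
}

// tsCache is a toy timestamp cache: a low-water mark plus per-key maximums.
type tsCache struct {
	lowWater int64
	reads    map[string]int64
}

func newTSCache(lowWater int64) *tsCache {
	return &tsCache{lowWater: lowWater, reads: map[string]int64{}}
}

// recordRead remembers that key was read at ts.
func (c *tsCache) recordRead(key string, ts int64) {
	if ts > c.reads[key] {
		c.reads[key] = ts
	}
}

// maxRead returns the highest timestamp at which key may have been read.
func (c *tsCache) maxRead(key string) int64 {
	ts := c.lowWater
	if r, ok := c.reads[key]; ok && r > ts {
		ts = r
	}
	return ts
}

// summarize collapses the whole cache into a single low-water mark.
func (c *tsCache) summarize() readSummary {
	s := readSummary{lowWater: c.lowWater}
	for _, ts := range c.reads {
		if ts > s.lowWater {
			s.lowWater = ts
		}
	}
	return s
}

// applySummary forwards the cache's low-water mark to the summary.
func (c *tsCache) applySummary(s readSummary) {
	if s.lowWater > c.lowWater {
		c.lowWater = s.lowWater
	}
}

// writeTimestamp returns the timestamp a write to key must use so that it
// does not invalidate a prior read.
func (c *tsCache) writeTimestamp(key string, proposed int64) int64 {
	if floor := c.maxRead(key) + 1; proposed < floor {
		return floor
	}
	return proposed
}

func main() {
	// The old leaseholder served a future-time read of "a" at 150.
	old := newTSCache(0)
	old.recordRead("a", 150)
	leaseStart := int64(100)

	// Old approach: only the lease start time seeds the new cache.
	naive := newTSCache(leaseStart)
	// New approach: the read summary is applied as well.
	fixed := newTSCache(leaseStart)
	fixed.applySummary(old.summarize())

	fmt.Println(naive.writeTimestamp("a", 120), fixed.writeTimestamp("a", 120)) // prints "120 151"
}
```

In the first case the write lands at 120, below the read at 150. In the second it is pushed to 151.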

----

The commit also fixes a related bug that was revealed during the
development of this patch. As explained in cockroachdb#60520, it was possible for a
range merge to be applied to the leaseholder of the LHS of the merge
through a Raft snapshot. In such cases, we were not properly updating
the leaseholder's timestamp cache to reflect the reads served on the RHS
range. This could allow the post-merged range to invalidate reads served
by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure.
Merge triggers now write to the left-hand side's prior read summary with
a read summary gathered from the right-hand side during subsumption.
Later, upon ingesting a Raft snapshot, we check if we subsumed any
replicas and if we are the leaseholder. If both of those conditions are
true, we forward the replica's timestamp cache to the read summary on
the range. Since this read summary must have been updated by the merge
trigger, it will include all reads served on the pre-merge RHS range.

----

Release note (bug fix): Fixes a very rare, possibly impossible in
practice, bug where a range merge that applied through a Raft snapshot
on the left-hand side range's leaseholder could allow that leaseholder
to serve writes that invalidated reads from before the merge on the
right-hand side.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 12, 2021
craig bot pushed a commit that referenced this issue Feb 13, 2021
59693: kv: handle future-time operations for conflict resolution requests r=nvanbenschoten a=nvanbenschoten

Relates to #57688.

This PR updates the following four request types to properly handle future-time operations:
- `PushTxnRequest`
- `QueryTxnRequest`
- `QueryIntentRequest`
- `RecoverTxnRequest`

It also updates the request evaluation code to properly check that all timestamp cache updates are safe, based on the batch header timestamps of the requests. This is coupled with a stricter, more comprehensive assertion when updating the timestamp cache that all updates are performed below the expiration time of the active lease. This ensures that timestamp cache updates are not lost during non-cooperative lease change.

In doing so, the PR also updates the replica lease check to test against the batch's read or write timestamp, whichever is later. This addresses the concerns raised in #58904 (review).

60503: kvserver: add some logscopes r=andreimatei a=andreimatei

These are the last in this package I think.

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Andrei Matei <andrei@cockroachlabs.com>
andreimatei pushed a commit to andreimatei/cockroach that referenced this issue Feb 13, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 16, 2021
craig bot pushed a commit that referenced this issue Feb 24, 2021
60521: kv: ship timestamp cache summary during lease transfers and range merges r=nvanbenschoten a=nvanbenschoten

Fixes #57688.
Fixes #59679.
Fixes #60520.

This commit introduces new logic to ship summaries of a leaseholder's timestamp cache through lease transfers and range merges. For lease transfers, the read summary is sent from the outgoing leaseholder to the incoming leaseholder. For range merges, the read summary is sent from the right-hand side leaseholder (through the SubsumeResponse), to the left-hand side leaseholder (through the MergeTrigger).

The read summaries perform the role that the lease start time and merge freeze time used to play for lease transfers and range merges, respectively - the summaries instruct the post-operation leaseholder on how to update its timestamp cache to ensure that no future writes are allowed to invalidate prior reads.

Read summaries have two distinct advantages over the old approach:
1. they can transfer a higher-resolution snapshot of the reads on the range through a lease transfer, to make the lease transfers less disruptive to writes because the timestamp cache won't be bumped as high. This avoids transaction aborts and retries after lease transfers and merges.
2. they can transfer information about reads with synthetic timestamps, which are not otherwise captured by the new lease's start time. Because of this, they are needed for correctness on `global_read` ranges, which can serve reads in the future.

This commit does not realize the first benefit, because it uses very low-resolution read summaries. However, it sets up the infrastructure that will allow us to realize the benefit in the future by capturing and shipping higher-resolution read summaries. The commit does realize the second benefit, as it fixes correctness issues around future time reads.

----

The commit also fixes a related bug that was revealed during the development of this patch. As explained in #60520, it was possible for a range merge to be applied to the leaseholder of the LHS of the merge through a Raft snapshot. In such cases, we were not properly updating the leaseholder's timestamp cache to reflect the reads served on the RHS range. This could allow the post-merged range to invalidate reads served by the pre-merge RHS range.

This commit fixes this bug using the new read summary infrastructure. Merge triggers now write to the left-hand side's prior read summary with a read summary gathered from the right-hand side during subsumption. Later, upon ingesting a Raft snapshot, we check if we subsumed any replicas and if we are the leaseholder. If both of those conditions are true, we forward the replica's timestamp cache to the read summary on the range. Since this read summary must have been updated by the merge trigger, it will include all reads served on the pre-merge RHS range.

The existence of this bug was verified by a new variant of `TestStoreRangeMergeTimestampCache`, which only passes with the rest of this commit.

----

Release note (bug fix): Fixes a very rare, possibly impossible in practice, bug where a range merge that applied through a Raft snapshot on the left-hand side range's leaseholder could allow that leaseholder to serve writes that invalidated reads from before the merge on the right-hand side.

Release justification: bug fix

61013: sql/pgwire: fix encoding of int4 and bpchar in tuples  r=otan a=rafiss

fixes #58069

This includes 3 commits:

### sql/pgwire: make PGTest pass against PostgreSQL

There are a few minor differences: dataTypeSize is different for tuples,
and PG does not show seconds offset for times if it is zero.

### sql/pgwire: fix binary encoding of ints in tuples

### sql/pgwire: fix text encoding of bpchar in tuples

### sql/pgwire: fix encoding of collated strings

Release note (bug fix): Integers inside of tuples were not being encoded
properly when using the binary format for retrieving data. This is now
fixed, and the proper integer width is reported.

Release note (bug fix): Blank-padded chars (e.g. CHAR(3)) were not being
encoded correctly when returning results to the client. Now they
correctly include blank-padding when appropriate.

Release note (bug fix): Collated strings were not encoded with the
proper type OID when sending results to the client if the OID was
for the `char` type. This is now fixed.


61027: authors: add Rachit Srivastava to authors r=rachitgsrivastava a=rachitgsrivastava

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
Co-authored-by: Rachit Srivastava <rachit@cockroachlabs.com>
@craig craig bot closed this as completed in a7472e3 Feb 24, 2021