kv: consider intent in uncertainty interval to be uncertain, not a write-read conflict #57136

Merged
merged 3 commits into cockroachdb:master from nvanbenschoten/rwuieIntent on Jan 16, 2021

Conversation

@nvanbenschoten (Member) commented Nov 25, 2020

This PR is a partial revert of #40600 and #46830. It solves the same problem that those PRs were solving, but in a different way.

Those two PRs were handling the case where a reading transaction observes an intent in its uncertainty interval. Before those fixes, we were not considering intents in a scan's uncertainty interval to be uncertain. This had the potential to cause stale reads, because an unresolved intent does not mean its transaction is uncommitted: the intent's transaction may already have committed within the uncertainty window and may be a causal ancestor of the scan.

The approach the first PR took was to throw a WriteIntentError on intents in a scan's uncertainty interval. Effectively, it made scans consider intents in their uncertainty interval to be write-read conflicts. This had the benefit that the scan would push the intent and eventually resolve the conflict, either by aborting the intent, pushing it out of the read's uncertainty interval, or waiting for it to commit. In this last case (which is by far the most common), the scan would then throw an uncertainty error, because it now had a committed value in its uncertainty interval.

The second PR introduced some awkward code to deal with the fallout from this decision. Once we started trying to keep the lock table in sync with the MVCC state, we needed to use latches to ensure proper synchronization between any operations that could conflict because such conflicts would influence the state of the lock table. This meant that we needed to start holding latches across a read's uncertainty interval, because intent writes in this interval were considered write-read conflicts. This led to some moderately gross code and always felt a little wrong.

Now that we are complicating the logic around uncertainty intervals even further, this becomes even more of a burden. This motivates a reworking of these interactions. This commit replaces the logic that considers intents in a transaction's uncertainty interval to be write-read conflicts with logic that considers such intents to be... uncertain. Doing so means that a transaction will immediately refresh/restart above the uncertain timestamp and will only then begin conflicting with the intent.
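To make the new behavior concrete, here is a minimal Go sketch, from the transaction's point of view, of "refresh/restart above the uncertain timestamp and only then conflict with the intent". Every name in it (`readKey`, `uncertaintyError`, `errWriteIntent`) is invented for illustration; this is not the KV client's actual retry machinery, just the decision sequence described above.

```go
package main

import (
	"errors"
	"fmt"
)

type hlc = int64

// uncertaintyError is a stand-in for ReadWithinUncertaintyIntervalError: it
// carries the timestamp of the uncertain value (here, the intent) so that the
// transaction knows how far to advance its read timestamp.
type uncertaintyError struct{ valueTS hlc }

func (e *uncertaintyError) Error() string {
	return fmt.Sprintf("uncertain value at ts=%d", e.valueTS)
}

var errWriteIntent = errors.New("write-read conflict: push the intent's txn")

// readKey simulates a scan of a key that carries an intent at intentTS.
func readKey(readTS, intentTS, uncertaintyLimit hlc) error {
	switch {
	case intentTS <= readTS:
		return errWriteIntent // ordinary write-read conflict
	case intentTS <= uncertaintyLimit:
		return &uncertaintyError{valueTS: intentTS} // new behavior: uncertain, not a conflict
	default:
		return nil // intent is entirely in the future; read below it
	}
}

func main() {
	readTS, intentTS, limit := hlc(10), hlc(12), hlc(15)
	err := readKey(readTS, intentTS, limit)

	// The transaction refreshes/restarts above the uncertain timestamp...
	var uerr *uncertaintyError
	if errors.As(err, &uerr) {
		readTS = uerr.valueTS + 1 // advance past the uncertain value
		err = readKey(readTS, intentTS, limit)
	}
	// ...and only then begins conflicting with the intent as a plain
	// write-read conflict.
	fmt.Println(readTS, err) // 13 write-read conflict: push the intent's txn
}
```

The point of the example is the ordering: the first attempt surfaces uncertainty immediately, and only the retried read, now above the intent's timestamp, treats the intent as a write-read conflict.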

This has a number of nice benefits:

  1. it keeps all considerations of read uncertainty intervals down in MVCC iteration logic. The lock table no longer needs to be aware of it. This is a big win now and an even bigger win once we introduce synthetic timestamps.

  2. readers that are almost certainly bound to hit an uncertainty error and need to restart will now do so sooner. In rare cases, this may result in wasted work. In the vast majority of cases, this will allow the reader to be more responsive to the commit of its conflict.

  3. uncertainty errors will only be thrown for locks in the uncertainty interval of a read that are protecting a provisional write (intents). Before, any form of a lock in a read's uncertainty interval would be considered a write-read conflict, which was pessimistic and not needed for correctness.

    In a future with a fully segregated lock table, the change in semantic meaning here becomes even clearer. Instead of detecting the lock associated with an intent in a read's uncertainty interval and declaring a write-read conflict, the read will pass through the lock table untouched, detect the provisional value associated with the intent, and declare uncertainty. This seems closer to what we're actually trying to say about these interactions.

Before making this change, I intend to validate the hypothesis that it will not affect performance (or may even slightly improve performance) by running it on the YCSB benchmark suite.


@sumeerbhola (Collaborator) left a comment

Nice!
And when I read this explanation, it sounds so obvious that I am scratching my head about why we did it the other, more complicated way to begin with.

Reviewed 8 of 8 files at r6, 3 of 3 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @sumeerbhola)


pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):

	ownIntent := p.txn != nil && p.meta.Txn.ID.Equal(p.txn.ID)
	otherIntentVisible := metaTS.LessEq(p.ts) ||
		(p.checkUncertainty && p.isUncertainValue(metaTS)) ||

why this change in meaning of otherIntentVisible and the associated change below?

@sumeerbhola (Collaborator)

This code is delaying discovery of the uncertainty until the MVCC scan: wouldn't it be beneficial in the future, with the separated lock table, to discover these early? We are already going to be iterating over the lock table key space to find locks with timestamp < read timestamp -- we may as well also inform the reader about the uncertainty so it can push itself right away.

@nvanbenschoten added the A-multiregion (Related to multi-region) label on Dec 2, 2020
@nvanbenschoten (Member, Author) left a comment

Thanks for the review! I've split this off from #57077 so we can push this ahead.

I did need to make one small change here. I had hoped that this change would allow us to no longer declare read latches all the way up to a read's uncertainty interval, but this doesn't actually appear to be true. Reads still need to continue declaring latches up through their uncertainty interval so that they are properly synchronized with earlier writes that may have a happened-before relationship with the read. These writes could not have completed and returned to the client until they were durable in the Range's Raft log. However, they may not have been applied to the replica's state machine by the time the write was acknowledged, because Raft entry application occurs asynchronously with respect to the writer (see AckCommittedEntriesBeforeApplication). Latching is the only mechanism that ensures that any observers of the write wait for it to apply before reading.
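As a rough illustration of the latching point above, here is a hedged Go sketch of a read declaring its latch timestamp; `latchSpan` and `declareReadLatch` are invented for this example and only mirror the reasoning in this comment, not the real declare.go code.

```go
package main

import "fmt"

type hlc = int64

// latchSpan is a hypothetical stand-in for a declared latch: a key span plus
// the timestamp at which the read latch is held.
type latchSpan struct {
	start, end string
	ts         hlc
}

// declareReadLatch mirrors the reasoning above: even though intents in the
// uncertainty interval are no longer write-read conflicts, a read still
// declares its latch at the uncertainty limit, so it waits for earlier
// acknowledged-but-unapplied writes (which may land at timestamps up to that
// limit) to apply to the state machine before it evaluates.
func declareReadLatch(start, end string, readTS, uncertaintyLimit hlc) latchSpan {
	ts := readTS
	if uncertaintyLimit > ts {
		ts = uncertaintyLimit // cover the full uncertainty interval
	}
	return latchSpan{start: start, end: end, ts: ts}
}

func main() {
	fmt.Println(declareReadLatch("a", "c", 10, 15)) // {a c 15}: latched through the uncertainty interval
}
```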

This code is delaying discovery of the uncertainty until the MVCC scan: wouldn't it be beneficial in the future, with the separated lock table, to discover these early? We are already going to be iterating over the lock table key space to find locks with timestamp < read timestamp -- we may as well also inform the reader about the uncertainty so it can push itself right away.

Yeah, I think that's a good point. I had hoped that we wouldn't need to introduce any of the complications of #57077 into the lockTable, but this might be reason enough to do so.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @sumeerbhola)


pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):

Previously, sumeerbhola wrote…

why this change in meaning of otherIntentVisible and the associated change below?

We're no longer considering an intent in a read's uncertainty interval to be "visible", for some meaning of "visible". Instead, we consider ourselves reading below the intent even if it is in our uncertainty interval, but then later check that it's not. I renamed this variable to intentVisible because "other" wasn't quite correct. I'm thinking of renaming it again to conflictingIntent to avoid the ambiguity of the term "visible". What do you think?
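A small sketch of the ordering being discussed here, with the renamed variable: conflict check first, uncertainty check only after deciding to read below the intent. The function and field names are simplified stand-ins inspired by this thread, not the actual pebble_mvcc_scanner.go code (which folds several more conditions into `conflictingIntent`).

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errWriteIntent = errors.New("WriteIntentError")
	errUncertainty = errors.New("ReadWithinUncertaintyIntervalError")
)

// handleForeignIntent sketches the check for an intent written by another
// transaction: only intents at or below the read timestamp conflict; an
// intent above it is skipped, but if it falls in the uncertainty interval
// the read reports uncertainty instead of a conflict.
func handleForeignIntent(intentTS, readTS, uncertaintyLimit int64, checkUncertainty bool) error {
	conflictingIntent := intentTS <= readTS // the renamed variable from this thread
	if conflictingIntent {
		return errWriteIntent
	}
	// We are reading below the intent; only now check uncertainty.
	if checkUncertainty && intentTS <= uncertaintyLimit {
		return errUncertainty
	}
	return nil // intent is entirely in the future and can be ignored
}

func main() {
	fmt.Println(handleForeignIntent(12, 10, 15, true)) // ReadWithinUncertaintyIntervalError
	fmt.Println(handleForeignIntent(9, 10, 15, true))  // WriteIntentError
}
```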

@nvanbenschoten (Member, Author)

I still owe this change some YCSB numbers. I'll try to collect those tonight.

@sumeerbhola (Collaborator) left a comment

:lgtm:

Reviewed 25 of 31 files at r1, 26 of 26 files at r8, 3 of 3 files at r9.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/batcheval/declare.go, line 63 at r8 (raw file):

			// However, they may not have been applied to the replica's state
			// machine by the time the write was acknowledged, because Raft
			// entry application occur be asynchronously with respect to the

... occurs asynchronously ...


pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We're no longer considering an intent in a read's uncertainty interval to be "visible", for some meaning of "visible". Instead, we consider ourselves reading below the intent even if it is in our uncertainty interval, but then later check that it's not. I renamed this variable to intentVisible because "other" wasn't quite correct. I'm thinking of renaming it again to conflictingIntent to avoid the ambiguity of the term "visible". What do you think?

conflictingIntent sounds fine.


pkg/storage/pebble_mvcc_scanner.go, line 390 at r8 (raw file):

				return p.uncertaintyError(metaTS)
			}

nit: can you remove the empty line

This could use a comment, such as
// The intent is not within the uncertainty, but there could be an uncertain committed value, so seek and check uncertainty using MaxTimestamp.

…plit

Now that the readConflictTimestamp does not include a read's uncertainty
interval, this distinction is immaterial. We were already not using
writeConflictTimestamp in the lockTable anyway.
This commit updates the check in tryBumpBatchTimestamp to prevent a
batch from bumping its timestamp if any read latches are held,
regardless of whether they are held at a high-enough timestamp to
"protect" the read.

Even if a request holds read latches with high enough timestamps to
fully cover ("protect") the batch at the new timestamp, we still don't
want to allow the bump. This is because a batch with read spans and a
higher timestamp may now conflict with locks that it previously did not.
However, server-side retries don't re-scan the lock table. This can lead
to requests missing unreplicated locks in the lock table that they
should have seen or discovering replicated intents in MVCC that they
should not have seen (from the perspective of the lock table's
AddDiscoveredLock method).

This was discovered by kvnemesis, which saw a "discovered non-conflicting
lock" assertion failure.
@nvanbenschoten (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @sumeerbhola)


pkg/kv/kvserver/batcheval/declare.go, line 63 at r8 (raw file):

Previously, sumeerbhola wrote…

... occurs asynchronously ...

Done.


pkg/storage/pebble_mvcc_scanner.go, line 388 at r4 (raw file):

Previously, sumeerbhola wrote…

conflictingIntent sounds fine.

Done.


pkg/storage/pebble_mvcc_scanner.go, line 390 at r8 (raw file):

Previously, sumeerbhola wrote…

nit: can you remove the empty line

This could use a comment, such as
// The intent is not within the uncertainty, but there could be an uncertain committed value, so seek and check uncertainty using MaxTimestamp.

Done.

@nvanbenschoten (Member, Author)

bors r=sumeerbhola

craig bot commented Jan 16, 2021

Build succeeded:

@craig craig bot merged commit cd54fb7 into cockroachdb:master Jan 16, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/rwuieIntent branch January 16, 2021 17:40
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 17, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drop the replica mutex,
then propose a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on the more general latching mechanism.
2. we more closely align `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely align the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worst-case
latency during lease transfers.

The other behavioral change made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transaction's uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
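To help picture the restructuring described in this commit message, here is a schematic Go sketch of the reordered request path; all the helper names (`acquireLatches`, `leaseValid`, `acquireOrRedirectLease`, `errInvalidLease`) are invented and the real concurrency manager is far more involved, so read it as pseudocode for "latch first, check the lease under latches, and treat an invalid lease like any other concurrency retry error".

```go
package main

import (
	"errors"
	"fmt"
)

var errInvalidLease = errors.New("invalid lease") // stands in for a concurrency-retry error

type request struct{ name string }
type latchGuard struct{}

var haveLease bool // toggled by acquireOrRedirectLease in this toy example

func acquireLatches(r request) latchGuard   { return latchGuard{} }
func releaseLatches(g latchGuard)           {}
func leaseValid(r request) bool             { return haveLease }
func acquireOrRedirectLease(r request) bool { haveLease = true; return true }
func evaluate(r request) string             { return "result for " + r.name }

// executeRequest sketches the new ordering: acquire latches, then check the
// lease while they are held; on an invalid lease, drop latches, do the
// heavy-weight lease work above latching, and retry, just like other
// concurrency retry errors.
func executeRequest(r request) (string, error) {
	for attempt := 0; attempt < 2; attempt++ {
		g := acquireLatches(r)
		if !leaseValid(r) {
			releaseLatches(g)
			if !acquireOrRedirectLease(r) {
				return "", errInvalidLease // e.g. redirected elsewhere
			}
			continue // retry: latches are reacquired under the new lease
		}
		res := evaluate(r)
		releaseLatches(g)
		return res, nil
	}
	return "", errInvalidLease
}

func main() {
	out, err := executeRequest(request{name: "scan"})
	fmt.Println(out, err) // result for scan <nil>
}
```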
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jan 21, 2021
@sumeerbhola (Collaborator)

I still owe this change some YCSB numbers. I'll try to collect those tonight.

Did you happen to collect the YCSB numbers?

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 1, 2021
@nvanbenschoten (Member, Author)

I did, but I see that I forgot to post them. There wasn't much movement. The largest statistically significant change was a 4% regression in throughput on ycsb/B/nodes=3, but then there was also a statistically significant 1.4% speedup in ycsb/B/nodes=3/cpu=32, the same workload but on larger machines. That lines up with my prior experience with YCSB - that there is so much contention on certain keys that small changes in the interactions between txns can cause performance to vary, often unpredictably. Since there was no clear positive or negative trend, this didn't seem to warrant further investigation.

name                   old ops/s    new ops/s    delta
ycsb/A/nodes=3          15.4k ± 2%   14.9k ±10%     ~     (p=0.165 n=10+10)
ycsb/B/nodes=3          32.3k ± 1%   31.0k ±10%   -4.02%  (p=0.017 n=9+10)
ycsb/C/nodes=3          45.9k ± 1%   44.6k ± 4%     ~     (p=0.075 n=10+10)
ycsb/D/nodes=3          34.8k ± 1%   33.8k ± 7%     ~     (p=0.165 n=10+10)
ycsb/E/nodes=3            824 ± 4%     805 ± 4%   -2.36%  (p=0.035 n=10+10)
ycsb/F/nodes=3          7.07k ± 3%   7.27k ± 4%   +2.88%  (p=0.010 n=9+10)
ycsb/A/nodes=3/cpu=32   27.1k ± 3%   26.6k ± 5%     ~     (p=0.123 n=10+10)
ycsb/B/nodes=3/cpu=32   90.1k ± 3%   91.4k ± 5%   +1.39%  (p=0.040 n=9+9)
ycsb/C/nodes=3/cpu=32    122k ± 2%    122k ± 2%     ~     (p=0.280 n=10+10)
ycsb/D/nodes=3/cpu=32   91.0k ± 4%   91.2k ± 5%     ~     (p=0.684 n=10+10)
ycsb/E/nodes=3/cpu=32   1.13k ± 6%   1.13k ± 5%     ~     (p=0.971 n=10+10)
ycsb/F/nodes=3/cpu=32   14.3k ± 7%   14.9k ± 6%     ~     (p=0.190 n=10+10)

name                   old avg(ms)  new avg(ms)  delta
ycsb/A/nodes=3           6.26 ± 3%    6.38 ± 4%     ~     (p=0.307 n=10+9)
ycsb/B/nodes=3           4.48 ± 3%    4.81 ± 9%   +7.37%  (p=0.026 n=10+10)
ycsb/C/nodes=3           3.13 ± 2%    3.23 ± 4%   +3.19%  (p=0.021 n=10+10)
ycsb/D/nodes=3           2.77 ± 3%    2.83 ± 6%     ~     (p=0.196 n=10+10)
ycsb/E/nodes=3            116 ± 4%     119 ± 4%   +2.42%  (p=0.035 n=10+10)
ycsb/F/nodes=3           13.6 ± 4%    13.2 ± 4%   -2.95%  (p=0.010 n=9+10)
ycsb/A/nodes=3/cpu=32    5.31 ± 4%    5.43 ± 4%     ~     (p=0.089 n=10+10)
ycsb/B/nodes=3/cpu=32    2.10 ± 0%    2.10 ± 0%     ~     (all equal)
ycsb/C/nodes=3/cpu=32    1.60 ± 0%    1.60 ± 0%     ~     (all equal)
ycsb/D/nodes=3/cpu=32    1.56 ± 4%    1.57 ± 8%     ~     (p=1.000 n=10+10)
ycsb/E/nodes=3/cpu=32     127 ± 6%     127 ± 5%     ~     (p=0.955 n=10+10)
ycsb/F/nodes=3/cpu=32    10.1 ± 7%     9.7 ± 6%     ~     (p=0.155 n=10+10)

name                   old p99(ms)  new p99(ms)  delta
ycsb/A/nodes=3           43.5 ±23%    41.1 ±13%     ~     (p=0.112 n=9+8)
ycsb/B/nodes=3           29.0 ± 9%    31.8 ±17%   +9.66%  (p=0.015 n=10+10)
ycsb/C/nodes=3           12.9 ± 3%    13.4 ± 6%     ~     (p=0.061 n=10+10)
ycsb/D/nodes=3           14.4 ± 2%    14.6 ± 7%     ~     (p=0.344 n=10+10)
ycsb/E/nodes=3            580 ± 8%     587 ± 3%     ~     (p=0.714 n=10+10)
ycsb/F/nodes=3            110 ±29%     129 ±37%     ~     (p=0.202 n=9+10)
ycsb/A/nodes=3/cpu=32    71.9 ±11%    72.5 ±10%     ~     (p=0.792 n=10+9)
ycsb/B/nodes=3/cpu=32    15.9 ±10%    15.6 ± 6%     ~     (p=0.497 n=9+9)
ycsb/C/nodes=3/cpu=32    7.57 ± 4%    7.51 ± 3%     ~     (p=0.720 n=10+10)
ycsb/D/nodes=3/cpu=32    8.80 ± 7%    8.62 ± 9%     ~     (p=0.439 n=10+10)
ycsb/E/nodes=3/cpu=32     671 ± 0%     671 ± 0%     ~     (all equal)
ycsb/F/nodes=3/cpu=32     154 ±20%     148 ±19%     ~     (p=0.535 n=10+10)

This also seems to indicate that reducing maximum clock uncertainty would positively impact YCSB throughput. That would be an interesting experiment to run.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Feb 5, 2021
craig bot pushed a commit that referenced this pull request Feb 6, 2021
59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten
