Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: NotLeaseholderError redirection can get in tight loop #22837

Closed
nvanbenschoten opened this issue Feb 20, 2018 · 6 comments
Closed

storage: NotLeaseholderError redirection can get in tight loop #22837

nvanbenschoten opened this issue Feb 20, 2018 · 6 comments
Assignees
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

Comments

@nvanbenschoten
Copy link
Member

I've observed that in at least v2.0-alpha.20180122 and v2.0-alpha.20180212 it's possible for NotLeaseholderError redirection to result in a tight loop where DistSender continually ping-pongs requests back between two replicas. I first noticed this in the RPC Errors graph of the Admin UI, where the number of Not Leaseholder Errors occasionally jumped up in the 10k range even though I only had about 50 SQL clients. This was backed up by the RPCs graph, which also showed a similar spike.

Later, I got lucky and caught this on the debug/requests page. Here I saw the following:

2018/02/20 08:10:24.413824 	13.750285 	kv.DistSender: sending partial batch
08:10:24.413826 	 .     2 	... txnID:bb4d74d6-fc80-46eb-b3e2-9121a450a1c2
08:10:24.414053 	 .   227 	... [client=127.0.0.1:35854,user=rk_user,n1] r52: sending batch 2 CPut, 1 BeginTxn to (n2,s2):3
08:10:24.414071 	 .    18 	... [client=127.0.0.1:35854,user=rk_user,n1] sending request to nathan-high-mem-0002:26257
08:10:24.419545 	 .  5473 	... [client=127.0.0.1:35854,user=rk_user,n1] application error: [NotLeaseHolderError] r52: replica (n2,s2):3 not lease holder; current lease is repl=(n1,s1):1 seq=0 start=1519114215.919013222,0 epo=1 pro=1519114215.919015761,0
08:10:24.419597 	 .    52 	... [client=127.0.0.1:35854,user=rk_user,n1] error: {(err: [NotLeaseHolderError] r52: replica (n2,s2):3 not lease holder; current lease is repl=(n1,s1):1 seq=0 start=1519114215.919013222,0 epo=1 pro=1519114215.919015761,0) <nil>}; trying next peer (n1,s1):1
08:10:24.419614 	 .    17 	... [client=127.0.0.1:35854,user=rk_user,n1] sending request to local server
08:10:24.419852 	 .   238 	... [client=127.0.0.1:35854,user=rk_user,n1] application error: [NotLeaseHolderError] r52: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=3 start=1519114118.566713425,0 epo=1 pro=1519114118.566716884,0
08:10:24.419861 	 .     9 	... [client=127.0.0.1:35854,user=rk_user,n1] error: {(err: [NotLeaseHolderError] r52: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=3 start=1519114118.566713425,0 epo=1 pro=1519114118.566716884,0) <nil>}; trying next peer (n2,s2):3
08:10:24.419878 	 .    17 	... [client=127.0.0.1:35854,user=rk_user,n1] sending request to nathan-high-mem-0002:26257
08:10:24.426201 	 .  6324 	... [client=127.0.0.1:35854,user=rk_user,n1] application error: [NotLeaseHolderError] r52: replica (n2,s2):3 not lease holder; current lease is repl=(n1,s1):1 seq=0 start=1519114215.919013222,0 epo=1 pro=1519114215.919015761,0
08:10:24.426220 	 .    18 	... [client=127.0.0.1:35854,user=rk_user,n1] error: {(err: [NotLeaseHolderError] r52: replica (n2,s2):3 not lease holder; current lease is repl=(n1,s1):1 seq=0 start=1519114215.919013222,0 epo=1 pro=1519114215.919015761,0) <nil>}; trying next peer (n1,s1):1
08:10:24.426233 	 .    13 	... [client=127.0.0.1:35854,user=rk_user,n1] sending request to local server
08:10:24.426578 	 .   345 	... [client=127.0.0.1:35854,user=rk_user,n1] application error: [NotLeaseHolderError] r52: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=3 start=1519114118.566713425,0 epo=1 pro=1519114118.566716884,0
08:10:24.426587 	 .     9 	... [client=127.0.0.1:35854,user=rk_user,n1] error: {(err: [NotLeaseHolderError] r52: replica (n1,s1):1 not lease holder; current lease is repl=(n2,s2):3 seq=3 start=1519114118.566713425,0 epo=1 pro=1519114118.566716884,0) <nil>}; trying next peer (n2,s2):3
08:10:24.426612 	 .    24 	... [client=127.0.0.1:35854,user=rk_user,n1] sending request to nathan-high-mem-0002:26257
...
for multiple seconds
...

We can see n1 and n2 continuously redirecting to each other. I believe a situation like this is possible if a node requests a range lease and then quickly falls behind in the Raft log before seeing the application of its new lease. In that case, I'm not sure if there's much we can do to inform the new leaseholder about its new lease, because it's not easily safe to communicate the lease information in a side-channel outside of Raft. Still, this results in 0 QPS across the entire range, so I wonder if there's something else we can do to prevent the situation entirely.

At a minimum, we should have some kind of backoff at the DistSender level to prevent such a tight loop from occurring and blowing up the RPC count.

@nvanbenschoten nvanbenschoten added this to the 2.1 milestone Feb 20, 2018
@nvanbenschoten
Copy link
Member Author

Based on this logic

if transferLease, ok := r.mu.pendingLeaseRequest.TransferInProgress(
repDesc.ReplicaID); ok {
return nil, roachpb.NewError(
newNotLeaseHolderError(&transferLease, r.store.StoreID(), r.mu.state.Desc))
}

it looks like this could also happen if the leaseholder is in the process of transferring its lease away. In that case, all reads and writes risk getting into this tight loop where they bounce back and forth between the current and future leaseholder.

It seems to me like we should at least allow reads to execute on a leaseholder that's in the process of transferring away its lease. Writes make less sense to allow when a lease is being transferred, but they still should not enter into a tight loop.

@tbg tbg added the A-kv-distribution Relating to rebalancing and leasing. label May 23, 2018
@tbg
Copy link
Member

tbg commented May 24, 2018

It seems to me like we should at least allow reads to execute on a leaseholder that's in the process of transferring away its lease.

But this isn't safe? The main promise during a lease transfer is to not do that (as it would bypass the recipient's timestamp cache).

Maybe we can make slightly more informed decisions if we include the lease sequence number in the not leaseholder error? In that case, you'd only invalidate the cache if the sequence number would increment in the process. This would avoid flip-flopping between old and new, but still runs hot, so we additionally want some backoff for nodes that we've visited before for a given range.

@nvanbenschoten
Copy link
Member Author

But this isn't safe? The main promise during a lease transfer is to not do that (as it would bypass the recipient's timestamp cache).

Yeah, I think you're right here.

Maybe we can make slightly more informed decisions if we include the lease sequence number in the not leaseholder error? In that case, you'd only invalidate the cache if the sequence number would increment in the process. This would avoid flip-flopping between old and new, but still runs hot, so we additionally want some backoff for nodes that we've visited before for a given range.

This is in line with what I was thinking. The one additional thing I'll add is that we should only backoff if the situation where the sequence number is not incremented. In the case where we're chasing a lease around, we don't need to have any backoff policy (in fact, we don't want one).

@tbg tbg added the C-performance Perf of queries or internals. Solution not expected to change functional behavior. label Jul 22, 2018
@tbg tbg added A-coreperf and removed A-disaster-recovery A-kv-transactions Relating to MVCC and the transactional model. A-kv-distribution Relating to rebalancing and leasing. A-kv-client Relating to the KV client and the KV interface. A-storage Relating to our storage engine (Pebble) on-disk storage. A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 31, 2018
@nvanbenschoten nvanbenschoten modified the milestones: 2.1, 2.2 Oct 1, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@a-robinson
Copy link
Contributor

See #8816 for some old discussion on solutions to this.

@nvanbenschoten
Copy link
Member Author

This actually draws a lot of parallels to #31330. Both are caused by leaders applying a Raft command sufficiently before a follower who is then required to perform some later action. The result is that the action ends up hot looping while waiting for the followers Raft application.

It's possible that both issues will become much less pressing once we determine the cause of the commit index divergence observed in #31330.

@nvanbenschoten nvanbenschoten added A-kv-client Relating to the KV client and the KV interface. and removed A-coreperf labels Oct 16, 2018
@ajwerner
Copy link
Contributor

ajwerner commented Nov 14, 2018

Not sure it's the same issue but I've been seeing some stalls when running the below test against a 3 node cluster with 64 core machines:

roachprod wipe ${CLUSTER}
roachprod start ${CLUSTER}:1-3 --binary cockroach-base2
./cockroach sql --insecure --url $(roachprod pgurl ${CLUSTER}:1) -e 'set cluster setting trace.debug.enable = true'
roachprod run ${CLUSTER}:4 -- ./workload run kv '{pgurl:1-3}' --init --read-percent=95 --concurrency=512 

ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 5, 2018
This PR address a problem which could lead to very long stalls in range
throughput when a lease transfer occurs when under load. As soon as the
current lease holder begins a lease transfer, it rejects all future requests
to the range with a NotLeaseHolderError which contains the new lease
information. As soon as this happens, the new lease holder immediately begins
receiving requests but is not able to service those requests until it processes
the raft command that makes it the lease hold. Until it applies that command, it
returns NotLeaseHolderError with the previous lease information. Prior to this
change, the DistSender would immediately retry the request at the node indicated
in the most recent NotLeaseHolderError it has received. This leads to a tight
loop of requests bouncing between the current lease holder and the new lease
holder which is unaware of the pending transfer (as observed in cockroachdb#22837) . The
amount of load generated by this traffic can grind raft progress to a complete
halt, with the author observing multi-minute durations for the new node to
process a raft Ready and hundreds of milliseconds to process a single command.
Fortunately, the DistSender can detect when this situation is occurring and can
back off accordingly.

This change detects that a replica is in the midst of a lease transfer by
noticing that it continues to receive NotLeaseHolderErrors without observing
new lease sequence number. In this case, the DistSender backs off exponentially
until it succeeds, fails, or observes a new lease sequence.

Fixes cockroachdb#22837, Fixes cockroachdb#32367

Release note: None
ajwerner added a commit to ajwerner/cockroach that referenced this issue Dec 6, 2018
This PR address a problem which could lead to very long stalls in range
throughput when a lease transfer occurs when under load. As soon as the
current lease holder begins a lease transfer, it rejects all future requests
to the range with a NotLeaseHolderError which contains the new lease
information. As soon as this happens, the new lease holder immediately begins
receiving requests but is not able to service those requests until it processes
the raft command that makes it the lease hold. Until it applies that command, it
returns NotLeaseHolderError with the previous lease information. Prior to this
change, the DistSender would immediately retry the request at the node indicated
in the most recent NotLeaseHolderError it has received. This leads to a tight
loop of requests bouncing between the current lease holder and the new lease
holder which is unaware of the pending transfer (as observed in cockroachdb#22837) . The
amount of load generated by this traffic can grind raft progress to a complete
halt, with the author observing multi-minute durations for the new node to
process a raft Ready and hundreds of milliseconds to process a single command.
Fortunately, the DistSender can detect when this situation is occurring and can
back off accordingly.

This change detects that a replica is in the midst of a lease transfer by
noticing that it continues to receive NotLeaseHolderErrors without observing
new lease sequence number. In this case, the DistSender backs off exponentially
until it succeeds, fails, or observes a new lease sequence.

Fixes cockroachdb#22837, Fixes cockroachdb#32367

Release note: None
craig bot pushed a commit that referenced this issue Dec 7, 2018
32877: kv: detect lease transfer and back off in DistSender r=ajwerner a=ajwerner

This PR address a problem which could lead to very long stalls in range
throughput when a lease transfer occurs when under load. As soon as the
current lease holder begins a lease transfer, it rejects all future requests
to the range with a NotLeaseHolderError which contains the new lease
information. As soon as this happens, the new lease holder immediately begins
receiving requests but is not able to service those requests until it processes
the raft command that makes it the lease hold. Until it applies that command, it
returns NotLeaseHolderError with the previous lease information. Prior to this
change, the DistSender would immediately retry the request at the node indicated
in the most recent NotLeaseHolderError it has received. This leads to a tight
loop of requests bouncing between the current lease holder and the new lease
holder which is unaware of the pending transfer (as observed in #22837) . The
amount of load generated by this traffic can grind raft progress to a complete
halt, with the author observing multi-minute durations for the new node to
process a raft Ready and hundreds of milliseconds to process a single command.
Fortunately, the DistSender can detect when this situation is occurring and can
back off accordingly.

This change detects that a replica is in the midst of a lease transfer by
noticing that it continues to receive NotLeaseHolderErrors without observing
new lease sequence number. In this case, the DistSender backs off exponentially
until it succeeds, fails, or observes a new lease sequence.

Fixes #22837, Fixes #32367

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig craig bot closed this as completed in #32877 Dec 7, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 17, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drops the replica mutex,
then proposes a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on more the more general latching mechanism.
2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely aligns the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worse case
latency during lease transfers.

The other behavioral changes made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transactions uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jan 21, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drops the replica mutex,
then proposes a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on more the more general latching mechanism.
2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely aligns the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worse case
latency during lease transfers.

The other behavioral changes made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transactions uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 1, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drops the replica mutex,
then proposes a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on more the more general latching mechanism.
2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely aligns the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worse case
latency during lease transfers.

The other behavioral changes made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transactions uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 5, 2021
Needed for cockroachdb#57688.

This commit reworks interactions between range leases and requests, pulling the
consultation of a replica's lease down below the level of latching while keeping
heavy-weight operations like lease acquisitions above the level of latching.
Doing so comes with several benefits, some related specifically to non-blocking
transactions and some more general.

Background

Before discussing the change here, let's discuss how lease checks, lease
acquisitions, lease redirection, and lease transfers currently work. Today,
requests consult a replica's range lease before acquiring latches. If the lease
is good to go, the request proceeds to acquire latches. If the lease is not
currently held by any replica, the lease is acquired (again, above latches)
through a coalesced `RequestLeaseRequest`. If the lease is currently held by a
different replica, the request is redirected to that replica using a
`NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in
progress, the request is optimistically redirected to the prospective new
leaseholder.

This all works, but only because it's been around for so long. Due to the lease
check above latching, we're forced to go to great lengths to get the
synchronization with in-flight requests right, which leads to very subtle logic.
This is most apparent with lease transfers, which properly synchronize with
ongoing requests through a delicate dance with the HLC clock and some serious
"spooky action at a distance". Every request bumps the local HLC clock in
`Store.Send`, then grabs the replica mutex, checks for an ongoing lease
transfer, drops the replica mutex, then evaluates. Lease transfers grab the
replica mutex, grab a clock reading from the local HLC clock, bump the
minLeaseProposedTS to stop using the current lease, drops the replica mutex,
then proposes a new lease using this clock reading as its start time. This works
only because each request bumps the HLC clock _before_ checking the lease, so
the HLC clock can serve as an upper bound on every request that has made it
through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend
it.

Motivation

The primary motivation for pulling lease checks and transfers below latching is
that the interaction between requests and lease transfers is incompatible with
future-time operations, a key part of the non-blocking transaction project. This
is because the structure relies on the HLC clock providing an upper bound on the
time of any request served by an outgoing leaseholder, which is attached to
lease transfers to ensure that the new leaseholder does not violate any request
served on the old leaseholder. But this is quickly violated once we start
serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to
look for this information is the timestamp cache, which records the upper bound
read time of each key span in a range, even if this upper bound time is
synthetic. If we could scan the timestamp cache and attach the maximum read time
to a lease transfer (through a new field, not as the lease start time), we'd be
good. But this runs into a problem, because if we just read the timestamp cache
under the lease transfer's lock, we can't be sure we didn't miss any in-progress
operations that had passed the lease check previously but had not yet bumped the
timestamp cache. Maybe they are still reading? So the custom locking quickly
runs into problems (I said it was inflexible!).

Solution

The solution here is to stop relying on custom locking for lease transfers by
pulling the lease check below latching and by pulling the determination of the
transfer's start time below latching. This ensures that during a lease transfer,
we don't only block new requests, but we also flush out in-flight requests. This
means that by the time we look at the timestamp cache during the evaluation of a
lease transfer, we know it has already been updated by any request that will be
served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting
the timestamp cache during TransferLease request evaluation, but a future commit
will.

Other benefits

Besides this primary change, a number of other benefits fall out of this
restructuring.

1. we avoid relying on custom synchronization around leases, instead relying
   on more the more general latching mechanism.
2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now
   both grab clock readings during evaluation and will both need to forward
   their clock reading by the upper-bound of a range's portion of the timestamp
   cache. It makes sense that these two requests would be very similar, as both
   are responsible for renouncing the current leaseholder's powers and passing
   them elsewhere.
3. we more closely aligns the lease acquisition handling with the handling of
   `MergeInProgressError` by classifying a new `InvalidLeaseError` as a
   "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing
   structure of: grab latches, check range state, drop latches and wait if
   necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of
   the checks in `checkExecutionCanProceed`. So we grab the replica read lock
   one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the
   timestamp cache" during lease transfers (and range merges) to avoid retry
   errors / transaction aborts on the new leaseholder. This commit will be
   followed up by one that ships a very basic summary of a leaseholder's
   timestamp cache during lease transfers. However, this would now be trivial to
   extend with higher resolution information, given some size limit. Perhaps we
   prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential
   for an arbitrary delay due to latching and waiting on locks between when the
   lease is checked and when a request evaluates, so we no longer need checks
   like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to
   address plumbing comments on cockroachdb#57077.

Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve
attention.

The first is that during a lease transfer, operations now block on the outgoing
leaseholder instead of immediately redirecting to the expected next leaseholder.
This has trade-offs. On one hand, this delays redirection, which may make lease
transfers more disruptive to ongoing traffic. On the other, we've seen in the
past that the optimistic redirection is not an absolute win. In many cases, it
can lead to thrashing and lots of wasted work, as the outgoing leaseholder and
the incoming leaseholder both point at each other and requests ping-pong between
them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we
addressed by adding exponential backoff in the client in 89d349a. So while this
change may make average-case latency during lease transfers slightly worse, it
will keep things much more orderly, avoid wasted work, and reduce worse case
latency during lease transfers.

The other behavioral changes made by this commit is that observed timestamps are
no longer applied to a request to reduce its MaxOffset until after latching and
locking, instead of before. This sounds concerning, but it's actually not for
two reasons. First, as of cockroachdb#57136, a transactions uncertainty interval is no
longer considered by the lock table because locks in a transaction's uncertainty
interval are no longer considered write-read conflicts. Instead, those locks'
provisional values are considered at evaluation time to be uncertain. Second,
the fact that the observed timestamp-limited MaxOffset was being used for
latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077),
so we would have had to make this change anyway. So put together, this
behavioral change isn't meaningful.
craig bot pushed a commit that referenced this issue Feb 6, 2021
59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten

Needed for #57688.

This PR reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general.

### Background

Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder.

This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drops the replica mutex, then proposes a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins.

This structure is inflexible, subtle, and falls over as soon as we try to extend it.

### Motivation

The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock.

So we quickly need to look elsewhere for this information. The obvious place to look for this information is the timestamp cache, which records the upper bound read time of each key span in a range, even if this upper bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!).

### Solution

The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we don't only block new requests, but we also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease.

This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will.

### Other benefits

Besides this primary change, a number of other benefits fall out of this restructuring.

1. we avoid relying on custom synchronization around leases, instead relying on more the more general latching mechanism.
2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere.
3. we more closely aligns the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry.
4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path.
5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts?
6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119).
7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on #57077.

### Other behavioral changes

There are two auxiliary behavioral changes made by this commit that deserve attention.

The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like #22837 and #32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers.

The other behavioral changes made by this commit is that observed timestamps are no longer applied to a request to reduce its MaxOffset until after latching and locking, instead of before. This sounds concerning, but it's actually not for two reasons. First, as of #57136, a transactions uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed timestamp-limited MaxOffset was being used for latching is no longer correct in a world with synthetic timestamps (see #57077), so we would have had to make this change anyway. So put together, this behavioral change isn't meaningful.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-client Relating to the KV client and the KV interface. C-performance Perf of queries or internals. Solution not expected to change functional behavior.
Projects
None yet
Development

No branches or pull requests

5 participants