kv: operations on a range can stall during lease changes when under heavy load #32367

ajwerner · 2018-11-14T23:22:39Z

Description
When a lease range is under very heavy load and a lease change occurs, throughput on the range can stall for minutes at a time. It seems like requests which pile up are too high to

To Reproduce

The stall was observed while running the kv workload at high concurrency on large machines. The stall can be reliably reproduced with the following steps:

Set up CockroachDB with 3 database nodes on large nodes (I was using 64 core, doesn't seem to really reproduce on 4 core machines)
Put load on the cluster using the following workload:
roachprod run ${CLUSTER}:4 -- ./workload run kv '{pgurl:1-3}' --init --read-percent=95 --concurrency=1024
While the cluster in under load, execute the following SQL command to trigger lease change:
ALTER TABLE kv.kv SCATTER;
Watch throughput tank and the command hang

Expected behavior
The scatter should complete in a timely manner and the throughput should not be adversely affected for a long period of time.

Additional data / screenshots

The hypothesis is that when load is to high, by the time the newly elected leader catches up to recognize that it is the leader, there are too many outstanding requests for it to catch up. Adding backoff in DistSender when NotLeaseHolderError is encountered seems to resolve the problem. The below small diff has been verified as resolving the problem for the above reproduction steps.

diff --git a/pkg/kv/dist_sender.go b/pkg/kv/dist_sender.go
index 1ccf87b11a..8822551bdd 100644
--- a/pkg/kv/dist_sender.go
+++ b/pkg/kv/dist_sender.go
@@ -1324,6 +1324,7 @@ func (ds *DistSender) sendToReplicas(
        }
        br, err := transport.SendNext(ctx, ba)

+       r := retry.StartWithCtx(ctx, ds.rpcRetryOptions)
        // This loop will retry operations that fail with errors that reflect
        // per-replica state and may succeed on other replicas.
        for {
@@ -1388,6 +1389,7 @@ func (ds *DistSender) sendToReplicas(
                                // via the NotLeaseHolderError or nil error paths, both of which update the                                                                                                       
                                // leaseholder cache.
                        case *roachpb.NotLeaseHolderError:
+                               r.Next()
                                ds.metrics.NotLeaseHolderErrCount.Inc(1)
                                if lh := tErr.LeaseHolder; lh != nil {
                                        // Update the leaseholder cache. Naively this would also happen when the

@tbg, @nvanbenschoten mentioned you were making changes here recently and might want to take a look.

Environment:

CockroachDB version v2.2.0-alpha.00000000-2316-ga25f97f
Server OS: Ubuntu 16.04
Client app: workload
Server: GCE n1-standard-64 machines

Additional context

The problem was initially thought to relate to #22837 but this seems to be more somewhat more specific.

The text was updated successfully, but these errors were encountered:

a-robinson · 2018-11-14T23:40:50Z

Beyond improving the retry behavior, we also probably want to make SCATTER avoid transferring leases to behind replicas (as we did in #30938 load-based rebalancing), even if it makes for worse randomization of leases.

tbg · 2018-11-15T11:07:26Z

While we're touching SCATTER, I also suspect that it can end up removing replicas that are necessary for quorum, which looks like it would want to run the same kind of check.

nvanbenschoten · 2018-11-15T21:06:35Z

@ajwerner we should dig more into why this is happening and why the backoff helps. Is the theory that the foreground traffic is in some way starving out the lease transfer from succeeding?

Are some queries still able to make progress while others aren't? If not, then I wouldn't expect the starvation to be coming from beneath Raft. Are we hammering the Replica mutex in redirectOnOrAcquireLease to the point where the lease is never able to apply on the new leaseholder? It would be shocking to me if starvation due to this interaction resulted in the multi-second stall we see here.

ajwerner · 2018-11-16T14:52:46Z

The (relatively unsubstantiated) theory I discussed with @benesch (please correct me if I misunderstood) is that the lease is being transferred to a node which is behind and so when the lease transfer commits, the node which has become the lease holder has to process a number of messages to catch up. While the newly elected replica is behind, it is receiving all of the requests which have failed on other nodes with NotLeaseHolderError. During this time, the newly elected lease holder does not yet know that it's the lease holder so for each request it has to grab the replica.mu just to give back another NotLeaseHolderError which will get immediately retried in DistSender. All of this load then leads to the replica being extremely slow to catch up and begin actually processing requests which may lead to slowness replying to heart beats and the process may repeat itself.

I wonder if there's a solution here in the server rather than the client to queue up access to the replica behind doing raft catch up work. I'm going to spend some time working on confirming the theory and exploring alternative mitigations at the storage layer.

nvanbenschoten · 2018-11-16T17:42:11Z

The new leaseholder needing to catch up is certainly an issue, but I don't see how it's possible for this to block indefinitely. The old leaseholder will stop proposing writes the second it decides to transfer the lease (or at least, it's supposed to) and will immediately begin redirecting to the new leaseholder. This means that the only entries that the new leaseholder will need to apply before applying the lease transfer and noticing that it has the lease are the entries that it was behind on at the moment that the lease transfer started.

So then the theory is that the wave of requests bouncing off the new leaseholder is slowing it down to the point where it can't make progress on its unapplied Raft log.

I wonder if there's a solution here in the server rather than the client to queue up access to the replica behind doing raft catch up work. I'm going to spend some time working on confirming the theory and exploring alternative mitigations at the storage layer.

There really shouldn't be any interplay between the two components. Incoming Requests that bounce off a leaseholder check shouldn't be able to affect Raft entry application. That said, it's fully possible that there's some side-channel interaction between the two, like contention on the Replica mutex.

which may lead to slowness replying to heart beats and the process may repeat itself.

Remember that we're talking about Range leases here, not Raft leadership. Heartbeats shouldn't be coming into play and the lease shouldn't expire as long as the new leaseholder is still able to update its liveness record. We aren't seeing leases bounce around, are we? Just requests searching for the lease?

nvanbenschoten · 2018-11-27T17:55:20Z

@ajwerner and I were discussing this in person and arrived at a theory about what we thought might be going wrong. It turns out that it's not an issue because lease transfers don't go through a DistSender, but I figured I'd post it anyway as a discussion of what could go wrong if we did.

When a Range begins to transfer a lease, it immediately starts rejecting all requests from the old (still current) leaseholder:

cockroach/pkg/storage/replica.go

Lines 1539 to 1543 in 88c78e7

if transferLease, ok := r.mu.pendingLeaseRequest.TransferInProgress(

repDesc.ReplicaID); ok {

return nil, roachpb.NewError(

newNotLeaseHolderError(&transferLease, r.store.StoreID(), r.mu.state.Desc))

}

Of course, we don't reject the LeaseTransfer request itself:

cockroach/pkg/storage/replica.go

Lines 3366 to 3370 in 88c78e7

var status storagepb.LeaseStatus

// For lease commands, use the provided previous lease for verification.

if ba.IsSingleSkipLeaseCheckRequest() {

lease = ba.GetPrevLeaseForLeaseRequest()

} else {

When a DistSender receives one of these NotLeaseholderErrors, it updates its leaseholder cache with the new leaseholder.

If lease transfer requests when through a DistSender (which they don't!) then they could be redirected the the wrong place, which might cause issues.

I do think it's worth adding some instrumentation into this loop. The next steps here will be to track down where this lease transfer is getting stuck.

ajwerner · 2018-12-05T19:52:47Z

So then the theory is that the wave of requests bouncing off the new leaseholder is slowing it down to the point where it can't make progress on its unapplied Raft log.

This seems to be exactly what is happening, there is so much contention on the various mutexes that the replica cannot make progress, leading to raft command processing taking 10s-100s of ms each which. The stall lasts until the current lease holder is able to process the lease change and depending on how many entries the current lease holder needs to process, this can take minutes.

I181205 19:23:08.340869 305 storage/replica.go:4195  [n3,s2,r22/3:/{Table/53-Max}] handleRaftReady took 1m5.577022134s

We can detect this case in DistSender pretty trivially by noticing that we keep getting NotLeaseHolderError without the LeaseSequence increasing. This behavior indicates that a lease transfer is occurring and that the client should back off until it does. This approach leads to a change which seems to solve the problem. PR on its way.

This PR address a problem which could lead to very long stalls in range throughput when a lease transfer occurs when under load. As soon as the current lease holder begins a lease transfer, it rejects all future requests to the range with a NotLeaseHolderError which contains the new lease information. As soon as this happens, the new lease holder immediately begins receiving requests but is not able to service those requests until it processes the raft command that makes it the lease hold. Until it applies that command, it returns NotLeaseHolderError with the previous lease information. Prior to this change, the DistSender would immediately retry the request at the node indicated in the most recent NotLeaseHolderError it has received. This leads to a tight loop of requests bouncing between the current lease holder and the new lease holder which is unaware of the pending transfer (as observed in cockroachdb#22837) . The amount of load generated by this traffic can grind raft progress to a complete halt, with the author observing multi-minute durations for the new node to process a raft Ready and hundreds of milliseconds to process a single command. Fortunately, the DistSender can detect when this situation is occurring and can back off accordingly. This change detects that a replica is in the midst of a lease transfer by noticing that it continues to receive NotLeaseHolderErrors without observing new lease sequence number. In this case, the DistSender backs off exponentially until it succeeds, fails, or observes a new lease sequence. Fixes cockroachdb#22837, Fixes cockroachdb#32367 Release note: None

32877: kv: detect lease transfer and back off in DistSender r=ajwerner a=ajwerner This PR address a problem which could lead to very long stalls in range throughput when a lease transfer occurs when under load. As soon as the current lease holder begins a lease transfer, it rejects all future requests to the range with a NotLeaseHolderError which contains the new lease information. As soon as this happens, the new lease holder immediately begins receiving requests but is not able to service those requests until it processes the raft command that makes it the lease hold. Until it applies that command, it returns NotLeaseHolderError with the previous lease information. Prior to this change, the DistSender would immediately retry the request at the node indicated in the most recent NotLeaseHolderError it has received. This leads to a tight loop of requests bouncing between the current lease holder and the new lease holder which is unaware of the pending transfer (as observed in #22837) . The amount of load generated by this traffic can grind raft progress to a complete halt, with the author observing multi-minute durations for the new node to process a raft Ready and hundreds of milliseconds to process a single command. Fortunately, the DistSender can detect when this situation is occurring and can back off accordingly. This change detects that a replica is in the midst of a lease transfer by noticing that it continues to receive NotLeaseHolderErrors without observing new lease sequence number. In this case, the DistSender backs off exponentially until it succeeds, fails, or observes a new lease sequence. Fixes #22837, Fixes #32367 Release note: None Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

Needed for cockroachdb#57688. This commit reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general. Background Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder. This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drops the replica mutex, then proposes a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins. This structure is inflexible, subtle, and falls over as soon as we try to extend it. Motivation The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock. So we quickly need to look elsewhere for this information. The obvious place to look for this information is the timestamp cache, which records the upper bound read time of each key span in a range, even if this upper bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!). Solution The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we don't only block new requests, but we also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease. This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will. Other benefits Besides this primary change, a number of other benefits fall out of this restructuring. 1. we avoid relying on custom synchronization around leases, instead relying on more the more general latching mechanism. 2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere. 3. we more closely aligns the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry. 4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path. 5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts? 6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119). 7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on cockroachdb#57077. Other behavioral changes There are two auxiliary behavioral changes made by this commit that deserve attention. The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like cockroachdb#22837 and cockroachdb#32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worse case latency during lease transfers. The other behavioral changes made by this commit is that observed timestamps are no longer applied to a request to reduce its MaxOffset until after latching and locking, instead of before. This sounds concerning, but it's actually not for two reasons. First, as of cockroachdb#57136, a transactions uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed timestamp-limited MaxOffset was being used for latching is no longer correct in a world with synthetic timestamps (see cockroachdb#57077), so we would have had to make this change anyway. So put together, this behavioral change isn't meaningful.

59086: kv: move range lease checks and transfers below latching r=nvanbenschoten a=nvanbenschoten Needed for #57688. This PR reworks interactions between range leases and requests, pulling the consultation of a replica's lease down below the level of latching while keeping heavy-weight operations like lease acquisitions above the level of latching. Doing so comes with several benefits, some related specifically to non-blocking transactions and some more general. ### Background Before discussing the change here, let's discuss how lease checks, lease acquisitions, lease redirection, and lease transfers currently work. Today, requests consult a replica's range lease before acquiring latches. If the lease is good to go, the request proceeds to acquire latches. If the lease is not currently held by any replica, the lease is acquired (again, above latches) through a coalesced `RequestLeaseRequest`. If the lease is currently held by a different replica, the request is redirected to that replica using a `NotLeaseHolderError`. Finally, if the lease check notices a lease transfer in progress, the request is optimistically redirected to the prospective new leaseholder. This all works, but only because it's been around for so long. Due to the lease check above latching, we're forced to go to great lengths to get the synchronization with in-flight requests right, which leads to very subtle logic. This is most apparent with lease transfers, which properly synchronize with ongoing requests through a delicate dance with the HLC clock and some serious "spooky action at a distance". Every request bumps the local HLC clock in `Store.Send`, then grabs the replica mutex, checks for an ongoing lease transfer, drops the replica mutex, then evaluates. Lease transfers grab the replica mutex, grab a clock reading from the local HLC clock, bump the minLeaseProposedTS to stop using the current lease, drops the replica mutex, then proposes a new lease using this clock reading as its start time. This works only because each request bumps the HLC clock _before_ checking the lease, so the HLC clock can serve as an upper bound on every request that has made it through the lease check by the time the lease transfer begins. This structure is inflexible, subtle, and falls over as soon as we try to extend it. ### Motivation The primary motivation for pulling lease checks and transfers below latching is that the interaction between requests and lease transfers is incompatible with future-time operations, a key part of the non-blocking transaction project. This is because the structure relies on the HLC clock providing an upper bound on the time of any request served by an outgoing leaseholder, which is attached to lease transfers to ensure that the new leaseholder does not violate any request served on the old leaseholder. But this is quickly violated once we start serving future-time operations, which don't bump the HLC clock. So we quickly need to look elsewhere for this information. The obvious place to look for this information is the timestamp cache, which records the upper bound read time of each key span in a range, even if this upper bound time is synthetic. If we could scan the timestamp cache and attach the maximum read time to a lease transfer (through a new field, not as the lease start time), we'd be good. But this runs into a problem, because if we just read the timestamp cache under the lease transfer's lock, we can't be sure we didn't miss any in-progress operations that had passed the lease check previously but had not yet bumped the timestamp cache. Maybe they are still reading? So the custom locking quickly runs into problems (I said it was inflexible!). ### Solution The solution here is to stop relying on custom locking for lease transfers by pulling the lease check below latching and by pulling the determination of the transfer's start time below latching. This ensures that during a lease transfer, we don't only block new requests, but we also flush out in-flight requests. This means that by the time we look at the timestamp cache during the evaluation of a lease transfer, we know it has already been updated by any request that will be served under the current lease. This commit doesn't make the switch from consulting the HLC clock to consulting the timestamp cache during TransferLease request evaluation, but a future commit will. ### Other benefits Besides this primary change, a number of other benefits fall out of this restructuring. 1. we avoid relying on custom synchronization around leases, instead relying on more the more general latching mechanism. 2. we more closely aligns `TransferLeaseRequest` and `SubsumeRequest`, which now both grab clock readings during evaluation and will both need to forward their clock reading by the upper-bound of a range's portion of the timestamp cache. It makes sense that these two requests would be very similar, as both are responsible for renouncing the current leaseholder's powers and passing them elsewhere. 3. we more closely aligns the lease acquisition handling with the handling of `MergeInProgressError` by classifying a new `InvalidLeaseError` as a "concurrencyRetryError" (see isConcurrencyRetryError). This fits the existing structure of: grab latches, check range state, drop latches and wait if necessary, retry. 4. in doing so, we fuse the critical section of lease checks and the rest of the checks in `checkExecutionCanProceed`. So we grab the replica read lock one fewer time in the request path. 5. we move one step closer to a world where we can "ship a portion of the timestamp cache" during lease transfers (and range merges) to avoid retry errors / transaction aborts on the new leaseholder. This commit will be followed up by one that ships a very basic summary of a leaseholder's timestamp cache during lease transfers. However, this would now be trivial to extend with higher resolution information, given some size limit. Perhaps we prioritize the local portion of the timestamp cache to avoid txn aborts? 6. now that leases are checked below latching, we no longer have the potential for an arbitrary delay due to latching and waiting on locks between when the lease is checked and when a request evaluates, so we no longer need checks like [this](https://github.com/cockroachdb/cockroach/blob/7bcb2cef794da56f6993f1b27d5b6a036016242b/pkg/kv/kvserver/replica_write.go#L119). 7. we pull observed timestamp handling a layer down, which will be useful to address plumbing comments on #57077. ### Other behavioral changes There are two auxiliary behavioral changes made by this commit that deserve attention. The first is that during a lease transfer, operations now block on the outgoing leaseholder instead of immediately redirecting to the expected next leaseholder. This has trade-offs. On one hand, this delays redirection, which may make lease transfers more disruptive to ongoing traffic. On the other, we've seen in the past that the optimistic redirection is not an absolute win. In many cases, it can lead to thrashing and lots of wasted work, as the outgoing leaseholder and the incoming leaseholder both point at each other and requests ping-pong between them. We've seen this cause serious issues like #22837 and #32367, which we addressed by adding exponential backoff in the client in 89d349a. So while this change may make average-case latency during lease transfers slightly worse, it will keep things much more orderly, avoid wasted work, and reduce worst-case latency during lease transfers. The other behavioral changes made by this commit is that observed timestamps are no longer applied to a request to reduce its MaxOffset until after latching and locking, instead of before. This sounds concerning, but it's actually not for two reasons. First, as of #57136, a transactions uncertainty interval is no longer considered by the lock table because locks in a transaction's uncertainty interval are no longer considered write-read conflicts. Instead, those locks' provisional values are considered at evaluation time to be uncertain. Second, the fact that the observed timestamp-limited MaxOffset was being used for latching is no longer correct in a world with synthetic timestamps (see #57077), so we would have had to make this change anyway. So put together, this behavioral change isn't meaningful. Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

nvanbenschoten added C-investigation Further steps needed to qualify. C-label will change. A-kv-client Relating to the KV client and the KV interface. labels Nov 15, 2018

nvanbenschoten assigned ajwerner Nov 15, 2018

ajwerner mentioned this issue Dec 5, 2018

kv: detect lease transfer and back off in DistSender #32877

Merged

craig bot closed this as completed in #32877 Dec 7, 2018

nvanbenschoten mentioned this issue Jan 16, 2021

kv: move range lease checks and transfers below latching #59086

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv: operations on a range can stall during lease changes when under heavy load #32367

kv: operations on a range can stall during lease changes when under heavy load #32367

ajwerner commented Nov 14, 2018

a-robinson commented Nov 14, 2018

tbg commented Nov 15, 2018

nvanbenschoten commented Nov 15, 2018

ajwerner commented Nov 16, 2018

nvanbenschoten commented Nov 16, 2018

nvanbenschoten commented Nov 27, 2018

ajwerner commented Dec 5, 2018

kv: operations on a range can stall during lease changes when under heavy load #32367

kv: operations on a range can stall during lease changes when under heavy load #32367

Comments

ajwerner commented Nov 14, 2018

a-robinson commented Nov 14, 2018

tbg commented Nov 15, 2018

nvanbenschoten commented Nov 15, 2018

ajwerner commented Nov 16, 2018

nvanbenschoten commented Nov 16, 2018

nvanbenschoten commented Nov 27, 2018

ajwerner commented Dec 5, 2018