
kvserver: default raft scheduler concurrency can cause cascading failures on beefy machines #56851

Closed
tbg opened this issue Nov 18, 2020 · 5 comments · Fixed by #56860

Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

tbg (Member) commented Nov 18, 2020

The default number of worker goroutines in the Raft scheduler is `8*runtime.NumCPU()`. We have observed that, at least on v20.1, this can cause pathological behavior, which is most likely to occur when both the CPU count and the range count are "high" (32 CPUs and 55k ranges did it in one recent example).
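To put that default in perspective, here is a small sketch, assuming only what the paragraph above states (eight workers per CPU, uncapped):

```go
package main

import "fmt"

func main() {
	// The default described above: 8 worker goroutines per CPU, with no upper
	// bound. On the machines discussed in this issue, that means hundreds of
	// workers all sharing a single scheduler mutex.
	for _, cpus := range []int{8, 32, 48} {
		fmt.Printf("%2d vCPUs -> %3d raft scheduler workers\n", cpus, 8*cpus)
	}
}
```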

The pathological behavior entails a full breakdown of the system. The UI and all ranges stop working. It becomes nearly impossible to extract debugging information from the system.

From a goroutine dump (via kill -ABRT), we see many of the worker goroutines with the following stack:

sync.(*Mutex).Lock(...)
    /usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueueN(0xc0011ea900, 0x8, 0xc041aea000, 0x844d, 0x9800, 0xc038b821c0)

(enqueue1 similarly shows up). These are contending on a mutex, which is thought to be the root cause of the pathological behavior. This all looks like golang/go#33747, which was fixed in go1.14. CRDB v20.1 and v20.2 are both built with go1.13, which makes them susceptible to this bug. v21.1 will be built with go1.15, which has the fix.
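To illustrate why the worker count matters for this failure mode, here is a self-contained sketch (not CockroachDB code) that times N goroutines all funneling through one shared `sync.Mutex`; on go1.13, heavy contention of this kind is also what could trigger the starvation behavior tracked in golang/go#33747:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// contend has n goroutines each acquire a single shared mutex ops times,
// roughly modeling scheduler workers funneling through the same lock on
// every enqueue/dequeue. Wall-clock time grows sharply as n increases.
func contend(n, ops int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				mu.Lock()
				// Critical section kept tiny on purpose; the cost is the handoff.
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, workers := range []int{8, 64, 256} {
		fmt.Printf("%3d workers: %v\n", workers, contend(workers, 20000))
	}
}
```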

Downstream of the contention in the scheduler, we see outgoing Raft message streams that are backed up because the recipient's Raft scheduler is unable to keep up. These have been seen stuck for dozens of minutes ([select, 17 minutes], etc.):

goroutine 2707 [select]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc014642600, 0xc000000052, 0x4d, 0x5)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/flowcontrol.go:59 +0xaa
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc0120461c0, 0xc0042dab00, 0xc02dbaa480, 0x5, 0x60, 0xc01dd44a80, 0x4d, 0xbb, 0xc03e70ecf5, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:840 +0x1ae
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc013e8cd80, 0x416e3e0, 0xc014642700, 0xc03e70ecf0, 0x5, 0x5, 0xc01dd44a80, 0x4d, 0xbb, 0xc00bf3ccc0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:828 +0x128
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc013e8cd80, 0x4d, 0xbb)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:693 +0xb3
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000139e60, 0xc018462c30, 0xc019b036f0, 0xc000263840, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:573 +0x360
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc000139e60, 0x416e3e0, 0xc014642700, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:699 +0x399

To resolve the gridlock, we added the environment variable COCKROACH_SCHEDULER_CONCURRENCY=64 to all nodes in the cluster and restarted.
We verified that the problem was solved by letting the cluster come together, watching the metrics until Raft leaders had been elected on all ranges, then gradually adding load back to the cluster while continuing to monitor.
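A sketch of how an override like this might be resolved at process startup; CockroachDB reads it via its own environment helpers, so the standalone function below is an illustrative stand-in rather than the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// schedulerConcurrency returns COCKROACH_SCHEDULER_CONCURRENCY if it is set to
// a positive integer, and otherwise falls back to the 8-workers-per-CPU
// default discussed above.
func schedulerConcurrency() int {
	if v := os.Getenv("COCKROACH_SCHEDULER_CONCURRENCY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 8 * runtime.NumCPU()
}

func main() {
	fmt.Println("raft scheduler concurrency:", schedulerConcurrency())
}
```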

We need to set better defaults for the Raft scheduler worker pool. Additionally, we should understand whether the extent of the degradation was expected given the misconfiguration or whether there are more improvements to resilience we need to make. This will likely entail reproducing the problem locally.

gz#8824

tbg added the A-kv-replication and C-bug labels Nov 18, 2020
nvanbenschoten (Member) commented:

> we see many of the worker goroutines with the following stack:

In addition to the worker goroutines stuck there, we also saw all 8 (one per traffic class per node) of the RaftMessageBatch handlers stuck in the Raft scheduler, with stack traces that looked like:

goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199

This is the receiving side of the Raft message stream. It explains why we saw the Raft transport senders all stuck.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In investigations like cockroachdb#56851, we've seen the mutex in the Raft
scheduler collapse due to too much concurrency. To address this, we
needed to drop the scheduler's goroutine pool size to bound the amount
of contention on the mutex to ensure that the scheduler was able to
schedule any goroutines.

This commit caps this concurrency to 96, instead of letting it grow
unbounded as a function of the number of cores on the system.
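A minimal sketch of the capped default this commit describes; the cap of 96 and the 8-per-CPU multiplier come from the commit message, while the function and constant names are illustrative:

```go
package main

import (
	"fmt"
	"runtime"
)

// maxRaftSchedulerConcurrency is the cap described in the commit message.
const maxRaftSchedulerConcurrency = 96

// defaultRaftSchedulerConcurrency keeps the 8-workers-per-CPU default but
// bounds it, so the number of goroutines contending on the scheduler mutex
// no longer grows without limit on large machines.
func defaultRaftSchedulerConcurrency() int {
	n := 8 * runtime.NumCPU()
	if n > maxRaftSchedulerConcurrency {
		n = maxRaftSchedulerConcurrency
	}
	return n
}

func main() {
	fmt.Println("raft scheduler workers:", defaultRaftSchedulerConcurrency())
}
```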
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In cockroachdb#56851, we saw that all of the Raft transport's receiving goroutines
were stuck in the Raft scheduler, attempting to enqueue Ranges in
response to coalesced heartbeats. We saw this in stacktraces like:
```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```

In that issue, we also saw that too much concurrency on the Raft
scheduler's Mutex had caused the mutex to collapse (get stuck in the
slow path, in the OS kernel) and hundreds of goroutines to pile up on
it.

We suspect that part of the problem here was that each of the coalesced
heartbeats was locking the Raft scheduler once per Range. So a coalesced
heartbeat that contained 10k ranges would lock the scheduler 10k times
on the receiver.

The commit attempts to alleviate this issue by batch enqueuing Ranges in
the Raft scheduler in response to coalesced heartbeats. This has a
slight fixed overhead (i.e. the need for a slice) but in response,
reduces the load that coalesced heartbeats place on the Raft scheduler's
mutex by a factor of 128 (`enqueueChunkSize`). This should reduce the
impact that a large number of Ranges have on contention in the Raft
scheduler.
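A sketch of the batching idea described above: take the scheduler lock once per chunk of ranges instead of once per range. The chunk size of 128 comes from the commit message (`enqueueChunkSize`); the types and method names are illustrative, not the actual scheduler code:

```go
package main

import (
	"fmt"
	"sync"
)

const enqueueChunkSize = 128 // per the commit message

type rangeID int

type scheduler struct {
	mu      sync.Mutex
	pending []rangeID
}

// enqueueAll locks the scheduler once per chunk of 128 ranges rather than
// once per range, cutting lock acquisitions for a 10k-range coalesced
// heartbeat from 10,000 to roughly 79.
func (s *scheduler) enqueueAll(ids []rangeID) {
	for len(ids) > 0 {
		n := len(ids)
		if n > enqueueChunkSize {
			n = enqueueChunkSize
		}
		s.mu.Lock()
		s.pending = append(s.pending, ids[:n]...)
		s.mu.Unlock()
		ids = ids[n:]
	}
}

func main() {
	var s scheduler
	ids := make([]rangeID, 10000)
	for i := range ids {
		ids[i] = rangeID(i + 1)
	}
	s.enqueueAll(ids)
	fmt.Println("enqueued:", len(s.pending))
}
```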
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In cockroachdb#56851 and in many other investigations, we've seen cases where the
NodeLiveness Range has a hard time performing writes when a system is
under heavy load. We already split RPC traffic into two classes,
ensuring that NodeLiveness traffic does not get stuck behind traffic on
user ranges. However, to this point, it was still possible for the
NodeLiveness range to get stuck behind other Ranges in the Raft
scheduler, leading to high scheduling latency for Raft operations.

This commit addresses this by prioritizing the NodeLiveness range above
all others in the Raft scheduler. This prioritization mechanism is
naive, but should be effective. It should also not run into any issues
with fairness or starvation of other ranges, as such starvation is not
possible as long as the scheduler concurrency (8*num_cpus) is above the
number of high priority ranges (1).
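A toy sketch of the naive prioritization described above: whenever the designated high-priority range (node liveness) is pending, it is handed out before any other range. Types and names here are illustrative, not the real scheduler:

```go
package main

import "fmt"

type rangeID int

// priorityQueue is a toy model of the scheduler's pending set with a single
// prioritized range ID, as described in the commit message.
type priorityQueue struct {
	priorityID rangeID
	pending    []rangeID
}

// pop returns the next range to process, preferring the priority range
// whenever it is pending.
func (q *priorityQueue) pop() (rangeID, bool) {
	for i, id := range q.pending {
		if id == q.priorityID {
			q.pending = append(q.pending[:i], q.pending[i+1:]...)
			return id, true
		}
	}
	if len(q.pending) == 0 {
		return 0, false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	return id, true
}

func main() {
	q := priorityQueue{priorityID: 2, pending: []rangeID{7, 9, 2, 5}}
	for {
		id, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println("processing range", id)
	}
	// Output order: 2, 7, 9, 5 — the prioritized range (ID 2 in this toy
	// example) jumps the queue; the rest are served in arrival order.
}
```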
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Nov 19 and Nov 20, 2020

The commit messages repeat the three changes above (capping the scheduler concurrency at 96, batch-enqueueing Ranges for coalesced heartbeats, and prioritizing the NodeLiveness range), now carrying release notes:

Release note (performance improvement): The Raft processing goroutine
pool's size is now capped at 96. This was observed to prevent instability
on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): Interactions between Raft heartbeats
and the Raft goroutine pool scheduler are now more efficient and avoid excessive
mutex contention. This was observed to prevent instability on large machines
(32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): The Raft scheduler now prioritizes
the node liveness Range. This was observed to prevent instability on large
machines (32+ vCPU) in clusters with many ranges (50k+ per node).
craig bot closed this as completed in be3ff28 Nov 22, 2020
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Nov 23, 2020

(Same coalesced-heartbeat batch-enqueue commit message and release note as above.)
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue May 3, 2021

(Same scheduler-concurrency-cap and NodeLiveness-prioritization commit messages and release notes as above.)
daniel-crlabs (Contributor) commented:

I see this issue has been closed; has a fix for it been released yet?
Are the k8s images updated with the release containing this fix?
We received a ticket on Friday (ZD 8824) from a customer experiencing this exact problem, and once the customer updated their StatefulSet to set the environment variable below, the issue went away.

COCKROACH_SCHEDULER_CONCURRENCY=64

The customer is running CockroachDB Version 21.1.2

Server specs:

48 CPUs
PowerEdge R640: 2 x Intel Xeon Silver 4214 CPUs, 256GB RAM (16 x16GB DIMMs), Intel X550 quad port 10GBaseT NIC, Dell PERCH H740 RAID controller w/ 8GB NVCACHE,  2 x 480GB SATA Boot drives in RAID1

daniel-crlabs reopened this Jun 28, 2021
tbg (Member, Author) commented Jun 28, 2021

@daniel-crlabs 21.1.2 definitely caps the scheduler concurrency at 96. It seems suspicious that there would be a cliff like that between 64 and 96, but maybe it's possible? A Xeon Silver 4214 has 12 cores, so probably 24 vCPUs, which means 48 vCPUs per machine with two sockets. So 96 should already have been the active value before the manual env var.
I'm not sure if the customer was really seeing the issue here, or whether the restart just "happened" to fix everything.

I would look at the debug.zip and in particular the raft.enqueued.pending metric via `grep raft.enqueued.pending nodes/*/status.json`. This should be close to 50k if this issue is at play.

daniel-crlabs (Contributor) commented:

Thanks for the insight. The customer's pods kept restarting previously and the nodes would not stay alive for very long. This only stopped happening once the customer deployed a new StatefulSet with that env var set.
Unfortunately, that metric doesn't show up in status.json. Maybe it lives in a different file?

wasabi $ egrep -i "node_id|raft.enqueued.pending" debugNode{1..3}/nodes/*/status.json
debugNode1/nodes/1/status.json:    "node_id": 1,
debugNode2/nodes/2/status.json:    "node_id": 2,
debugNode3/nodes/3/status.json:    "node_id": 3,

However, I do see that metric in a different file:

wasabi $ egrep -i "raft.enqueued.pending" debugNode{1..3}/nodes/*/crdb_internal.node_metrics.txt
debugNode1/nodes/1/crdb_internal.node_metrics.txt:1	raft.enqueued.pending	0
debugNode2/nodes/2/crdb_internal.node_metrics.txt:2	raft.enqueued.pending	0
debugNode3/nodes/3/crdb_internal.node_metrics.txt:3	raft.enqueued.pending	0

But since the pods kept restarting every few minutes, I'm not sure the reported value of that metric is very meaningful.

As the customer has shown in the ticket, the pods all restarted after running for about two minutes, and this pattern of behavior continued until the env var was set:

*** Cluster initially came up
ba@kubemaster1:~/fcoco/bin$ ckgp
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cockroachdb-0 1/1 Running 0 3m39s 192.168.241.164 kubenode4.alpha13 <none> <none>
cockroachdb-1 1/1 Running 0 3m39s 192.168.153.119 kubenode2.alpha13 <none> <none>
cockroachdb-2 1/1 Running 0 3m39s 192.168.28.118 kubenode3.alpha13 <none> <none>
cockroachdb-3 1/1 Running 0 29s 192.168.115.172 kubenode6.alpha13 <none> <none>
cockroachdb-4 1/1 Running 0 29s 192.168.213.158 kubenode5.alpha13 <none> <none>

*** Cluster went down ~2 minutes after starting traffic
ba@kubemaster1:~/fcoco/bin$ ckgp
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cockroachdb-0 0/1 Running 0 8m35s 192.168.241.164 kubenode4.alpha13 <none> <none>
cockroachdb-1 0/1 Running 0 8m35s 192.168.153.119 kubenode2.alpha13 <none> <none>
cockroachdb-2 0/1 Running 0 8m35s 192.168.28.118 kubenode3.alpha13 <none> <none>
cockroachdb-3 0/1 Running 0 5m25s 192.168.115.172 kubenode6.alpha13 <none> <none>
cockroachdb-4 0/1 Running 0 5m25s 192.168.213.158 kubenode5.alpha13 <none> <none>

Hopefully this helps shed some light.

tbg (Member, Author) commented Jun 29, 2021

If those greps were run against a debug.zip taken while the cluster was in the defunct state, then what we're seeing tells me that we weren't looking at the Raft scheduler thrashing here. The thrashing problem also didn't lead to restarts in the cases where we saw it; replication would simply not make progress and the cluster would be very unhappy.

It's likely that something else was going on. Let's file a separate issue for it to investigate via the usual support channels.
