release-20.1: kv: improve Raft scheduler behavior under CPU starvation #64568
Backport 2/3 commits from #56860. The other commit was backported in #57009.
/cc @cockroachdb/release
Fixes #56851.
This PR contains 3 commits that should improve the health of a cluster with many Ranges under CPU starvation. I ran a series of experiments (see #56860 (comment)) demonstrating that the combination of these commits dramatically improves the health of a cluster with many ranges: liveness never falls over, and liveness heartbeat latency stays constant even as all other ranges become overloaded.
kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96
In investigations like #56851, we've seen the mutex in the Raft scheduler collapse under too much concurrency. To address this, we need to reduce the size of the scheduler's goroutine pool, bounding contention on the mutex so that the scheduler remains able to schedule goroutines at all.
This commit caps the concurrency at 96, instead of letting it grow unbounded as a function of the number of cores on the system.
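As a rough sketch of the capping logic (hypothetical helper names, not the actual CockroachDB code; the real default is also overridable via the `COCKROACH_SCHEDULER_CONCURRENCY` environment variable):

```go
package main

import "fmt"

// maxSchedulerConcurrency is the cap described in this commit. The
// pre-existing default of 8 workers per CPU is kept on smaller machines.
const maxSchedulerConcurrency = 96

// workersFor derives the Raft scheduler's goroutine pool size from the
// CPU count, capping it so that mutex contention stays bounded on
// many-core machines.
func workersFor(numCPU int) int {
	n := 8 * numCPU
	if n > maxSchedulerConcurrency {
		n = maxSchedulerConcurrency
	}
	return n
}

func main() {
	fmt.Println(workersFor(4))  // small box: 8*4 = 32 workers
	fmt.Println(workersFor(32)) // large box: 8*32 = 256, capped at 96
}
```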
kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats
In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stack traces like:
In that issue, we also saw that too much concurrency on the Raft scheduler's mutex had caused the mutex to collapse (i.e. get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.
We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver.
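The fix described next batches these enqueues. A simplified sketch of the idea, assuming a chunk size of 128 and a toy scheduler type (not the actual kvserver implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// enqueueChunkSize mirrors the chunking constant described in the PR:
// the scheduler mutex is acquired once per chunk of Ranges rather than
// once per Range.
const enqueueChunkSize = 128

// scheduler is a simplified stand-in for the Raft scheduler's state.
type scheduler struct {
	mu    sync.Mutex
	queue []int64 // pending range IDs
	locks int     // counts lock acquisitions, for illustration only
}

// enqueueAll adds a batch of range IDs (e.g. from one coalesced
// heartbeat), locking the mutex once per chunk of 128 instead of once
// per range.
func (s *scheduler) enqueueAll(rangeIDs []int64) {
	for len(rangeIDs) > 0 {
		n := len(rangeIDs)
		if n > enqueueChunkSize {
			n = enqueueChunkSize
		}
		chunk := rangeIDs[:n]
		rangeIDs = rangeIDs[n:]

		s.mu.Lock()
		s.locks++
		s.queue = append(s.queue, chunk...)
		s.mu.Unlock()
	}
}

func main() {
	// A coalesced heartbeat covering 10k ranges.
	ids := make([]int64, 10000)
	for i := range ids {
		ids[i] = int64(i + 1)
	}
	s := &scheduler{}
	s.enqueueAll(ids)
	// ceil(10000/128) = 79 lock acquisitions instead of 10000.
	fmt.Println(s.locks, len(s.queue))
}
```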
The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a small fixed overhead (the need to allocate a slice) but, in exchange, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of up to 128 (`enqueueChunkSize`). This should reduce the impact that a large number of Ranges has on contention in the Raft scheduler.
kv: prioritize NodeLiveness Range in Raft scheduler
In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We already split RPC traffic into two classes, ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.
This commit addresses this by prioritizing the NodeLiveness Range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into issues with fairness or starvation of other ranges, since starvation is impossible as long as the scheduler concurrency (8*num_cpus, now capped at 96) is above the number of high-priority ranges (1).
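A minimal sketch of such naive prioritization, assuming a toy queue type and a hypothetical range ID for the NodeLiveness Range (not the actual scheduler's shard state):

```go
package main

import "fmt"

// prioritizedQueue sketches the idea: a single high-priority range ID
// (the one holding NodeLiveness) is always dequeued before any other
// pending range, so its Raft work never waits behind user ranges.
type prioritizedQueue struct {
	priorityID     int64
	priorityQueued bool
	queue          []int64 // FIFO of all other pending range IDs
}

func (q *prioritizedQueue) enqueue(id int64) {
	if id == q.priorityID {
		q.priorityQueued = true
		return
	}
	q.queue = append(q.queue, id)
}

// pop returns the next range a worker should process, preferring the
// high-priority range whenever it is pending.
func (q *prioritizedQueue) pop() (int64, bool) {
	if q.priorityQueued {
		q.priorityQueued = false
		return q.priorityID, true
	}
	if len(q.queue) == 0 {
		return 0, false
	}
	id := q.queue[0]
	q.queue = q.queue[1:]
	return id, true
}

func main() {
	// Hypothetical: range 2 holds the NodeLiveness span.
	q := &prioritizedQueue{priorityID: 2}
	q.enqueue(7)
	q.enqueue(2)
	q.enqueue(9)
	first, _ := q.pop()
	fmt.Println(first) // the liveness range jumps the queue
}
```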
@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.