
kvserver: default raft scheduler concurrency can cause cascading failures on beefy machines #56851

Closed
tbg opened this issue Nov 18, 2020 · 5 comments · Fixed by #56860

Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

tbg (Member) commented Nov 18, 2020

The default number of worker goroutines in the Raft scheduler is `8*runtime.NumCPU()`. We have observed that, at least on v20.1, this can cause pathological behavior, which is most likely to occur when both the CPU count and the range count are "high" (32 CPUs and 55k ranges did it in one recent example).
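To put that default in perspective, here is a small sketch, assuming only what the paragraph above states (eight workers per CPU, uncapped):

```go
package main

import "fmt"

func main() {
	// The default described above: 8 worker goroutines per CPU, with no upper
	// bound. On the machines discussed in this issue, that means hundreds of
	// workers all sharing a single scheduler mutex.
	for _, cpus := range []int{8, 32, 48} {
		fmt.Printf("%2d vCPUs -> %3d raft scheduler workers\n", cpus, 8*cpus)
	}
}
```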

The pathological behavior entails a full breakdown of the system. The UI and all ranges stop working. It becomes nearly impossible to extract debugging information from the system.

From a goroutine dump (via kill -ABRT), we see many of the worker goroutines with the following stack:

sync.(*Mutex).Lock(...)
    /usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueueN(0xc0011ea900, 0x8, 0xc041aea000, 0x844d, 0x9800, 0xc038b821c0)

(enqueue1 similarly shows up). These are contending on a mutex, which is thought to be the root cause of the pathological behavior. This all looks like golang/go#33747, which was fixed in go1.14. CRDB v20.1 and v20.2 are both built with go1.13, which makes them susceptible to this bug. v21.1 will be built with go1.15, which has the fix.
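To illustrate why the worker count matters for this failure mode, here is a self-contained sketch (not CockroachDB code) that times N goroutines all funneling through one shared `sync.Mutex`; on go1.13, heavy contention of this kind is also what could trigger the starvation behavior tracked in golang/go#33747:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// contend has n goroutines each acquire a single shared mutex ops times,
// roughly modeling scheduler workers funneling through the same lock on
// every enqueue/dequeue. Wall-clock time grows sharply as n increases.
func contend(n, ops int) time.Duration {
	var mu sync.Mutex
	var wg sync.WaitGroup
	start := time.Now()
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < ops; j++ {
				mu.Lock()
				// Critical section kept tiny on purpose; the cost is the handoff.
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return time.Since(start)
}

func main() {
	for _, workers := range []int{8, 64, 256} {
		fmt.Printf("%3d workers: %v\n", workers, contend(workers, 20000))
	}
}
```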

Downstream of the contention in the scheduler, we see outgoing Raft message streams that are backed up because the recipient's Raft scheduler is unable to keep up. These have been seen stuck for dozens of minutes ([select, 17 minutes], etc.):

goroutine 2707 [select]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*writeQuota).get(0xc014642600, 0xc000000052, 0x4d, 0x5)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/flowcontrol.go:59 +0xaa
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport.(*http2Client).Write(0xc0120461c0, 0xc0042dab00, 0xc02dbaa480, 0x5, 0x60, 0xc01dd44a80, 0x4d, 0xbb, 0xc03e70ecf5, 0x0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/internal/transport/http2_client.go:840 +0x1ae
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*csAttempt).sendMsg(0xc013e8cd80, 0x416e3e0, 0xc014642700, 0xc03e70ecf0, 0x5, 0x5, 0xc01dd44a80, 0x4d, 0xbb, 0xc00bf3ccc0, ...)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:828 +0x128
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg.func2(0xc013e8cd80, 0x4d, 0xbb)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:693 +0xb3
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).withRetry(0xc000139e60, 0xc018462c30, 0xc019b036f0, 0xc000263840, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:573 +0x360
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*clientStream).SendMsg(0xc000139e60, 0x416e3e0, 0xc014642700, 0x0, 0x0)
        /go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:699 +0x399

To resolve the gridlock, we added the environment variable COCKROACH_SCHEDULER_CONCURRENCY=64 to all nodes in the cluster and restarted.
We verified that the problem was solved by letting the cluster come together, watching the metrics until Raft leaders had been elected on all ranges, then gradually adding load back to the cluster while continuing to monitor.
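A sketch of how an override like this might be resolved at process startup; CockroachDB reads it via its own environment helpers, so the standalone function below is an illustrative stand-in rather than the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// schedulerConcurrency returns COCKROACH_SCHEDULER_CONCURRENCY if it is set to
// a positive integer, and otherwise falls back to the 8-workers-per-CPU
// default discussed above.
func schedulerConcurrency() int {
	if v := os.Getenv("COCKROACH_SCHEDULER_CONCURRENCY"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 8 * runtime.NumCPU()
}

func main() {
	fmt.Println("raft scheduler concurrency:", schedulerConcurrency())
}
```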

We need to set better defaults for the Raft scheduler worker pool. Additionally, we should understand whether the extent of the degradation was expected given the misconfiguration or whether there are more improvements to resilience we need to make. This will likely entail reproducing the problem locally.

gz#8824

tbg added the A-kv-replication and C-bug labels Nov 18, 2020
nvanbenschoten (Member) commented:

> we see many of the worker goroutines with the following stack:

In addition to the worker goroutines stuck there, we also saw all 8 (one per traffic class per node) of the RaftMessageBatch handlers stuck in the Raft scheduler, with stack traces that looked like:

goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199

This is the receiving side of the Raft message stream. It explains why we saw the Raft transport senders all stuck.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In investigations like cockroachdb#56851, we've seen the mutex in the Raft
scheduler collapse due to too much concurrency. To address this, we
needed to drop the scheduler's goroutine pool size to bound the amount
of contention on the mutex to ensure that the scheduler was able to
schedule any goroutines.

This commit caps this concurrency to 96, instead of letting it grow
unbounded as a function of the number of cores on the system.
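A minimal sketch of the capped default this commit describes; the cap of 96 and the 8-per-CPU multiplier come from the commit message, while the function and constant names are illustrative:

```go
package main

import (
	"fmt"
	"runtime"
)

// maxRaftSchedulerConcurrency is the cap described in the commit message.
const maxRaftSchedulerConcurrency = 96

// defaultRaftSchedulerConcurrency keeps the 8-workers-per-CPU default but
// bounds it, so the number of goroutines contending on the scheduler mutex
// no longer grows without limit on large machines.
func defaultRaftSchedulerConcurrency() int {
	n := 8 * runtime.NumCPU()
	if n > maxRaftSchedulerConcurrency {
		n = maxRaftSchedulerConcurrency
	}
	return n
}

func main() {
	fmt.Println("raft scheduler workers:", defaultRaftSchedulerConcurrency())
}
```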
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In cockroachdb#56851, we saw that all of the Raft transport's receiving goroutines
were stuck in the Raft scheduler, attempting to enqueue Ranges in
response to coalesced heartbeats. We saw this in stacktraces like:
```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```

In that issue, we also saw that too much concurrency on the Raft
scheduler's Mutex had caused the mutex to collapse (get stuck in the
slow path, in the OS kernel) and hundreds of goroutines to pile up on
it.

We suspect that part of the problem here was that each of the coalesced
heartbeats was locking the Raft scheduler once per Range. So a coalesced
heartbeat that contained 10k ranges would lock the scheduler 10k times
on the receiver.

The commit attempts to alleviate this issue by batch enqueuing Ranges in
the Raft scheduler in response to coalesced heartbeats. This has a
slight fixed overhead (i.e. the need for a slice) but in response,
reduces the load that coalesced heartbeats place on the Raft scheduler's
mutex by a factor of 128 (`enqueueChunkSize`). This should reduce the
impact that a large number of Ranges have on contention in the Raft
scheduler.
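A sketch of the batching idea described above: take the scheduler lock once per chunk of ranges instead of once per range. The chunk size of 128 comes from the commit message (`enqueueChunkSize`); the types and method names are illustrative, not the actual scheduler code:

```go
package main

import (
	"fmt"
	"sync"
)

const enqueueChunkSize = 128 // per the commit message

type rangeID int

type scheduler struct {
	mu      sync.Mutex
	pending []rangeID
}

// enqueueAll locks the scheduler once per chunk of 128 ranges rather than
// once per range, cutting lock acquisitions for a 10k-range coalesced
// heartbeat from 10,000 to roughly 79.
func (s *scheduler) enqueueAll(ids []rangeID) {
	for len(ids) > 0 {
		n := len(ids)
		if n > enqueueChunkSize {
			n = enqueueChunkSize
		}
		s.mu.Lock()
		s.pending = append(s.pending, ids[:n]...)
		s.mu.Unlock()
		ids = ids[n:]
	}
}

func main() {
	var s scheduler
	ids := make([]rangeID, 10000)
	for i := range ids {
		ids[i] = rangeID(i + 1)
	}
	s.enqueueAll(ids)
	fmt.Println("enqueued:", len(s.pending))
}
```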
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Nov 18, 2020
Relates to cockroachdb#56851.

In cockroachdb#56851 and in many other investigations, we've seen cases where the
NodeLiveness Range has a hard time performing writes when a system is
under heavy load. We already split RPC traffic into two classes,
ensuring that NodeLiveness traffic does not get stuck behind traffic on
user ranges. However, to this point, it was still possible for the
NodeLiveness range to get stuck behind other Ranges in the Raft
scheduler, leading to high scheduling latency for Raft operations.

This commit addresses this by prioritizing the NodeLiveness range above
all others in the Raft scheduler. This prioritization mechanism is
naive, but should be effective. It should also not run into any issues
with fairness or starvation of other ranges, as such starvation is not
possible as long as the scheduler concurrency (8*num_cpus) is above the
number of high priority ranges (1).
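A toy sketch of the naive prioritization described above: whenever the designated high-priority range (node liveness) is pending, it is handed out before any other range. Types and names here are illustrative, not the real scheduler:

```go
package main

import "fmt"

type rangeID int

// priorityQueue is a toy model of the scheduler's pending set with a single
// prioritized range ID, as described in the commit message.
type priorityQueue struct {
	priorityID rangeID
	pending    []rangeID
}

// pop returns the next range to process, preferring the priority range
// whenever it is pending.
func (q *priorityQueue) pop() (rangeID, bool) {
	for i, id := range q.pending {
		if id == q.priorityID {
			q.pending = append(q.pending[:i], q.pending[i+1:]...)
			return id, true
		}
	}
	if len(q.pending) == 0 {
		return 0, false
	}
	id := q.pending[0]
	q.pending = q.pending[1:]
	return id, true
}

func main() {
	q := priorityQueue{priorityID: 2, pending: []rangeID{7, 9, 2, 5}}
	for {
		id, ok := q.pop()
		if !ok {
			break
		}
		fmt.Println("processing range", id)
	}
	// Output order: 2, 7, 9, 5 — the prioritized range (ID 2 in this toy
	// example) jumps the queue; the rest are served in arrival order.
}
```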
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Nov 19 and Nov 20, 2020

The commit messages repeat the three changes above (capping the scheduler concurrency at 96, batch-enqueueing Ranges for coalesced heartbeats, and prioritizing the NodeLiveness range), now carrying release notes:

Release note (performance improvement): The Raft processing goroutine
pool's size is now capped at 96. This was observed to prevent instability
on large machines (32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): Interactions between Raft heartbeats
and the Raft goroutine pool scheduler are now more efficient and avoid excessive
mutex contention. This was observed to prevent instability on large machines
(32+ vCPU) in clusters with many ranges (50k+ per node).

Release note (performance improvement): The Raft scheduler now prioritizes
the node liveness Range. This was observed to prevent instability on large
machines (32+ vCPU) in clusters with many ranges (50k+ per node).
craig bot closed this as completed in be3ff28 Nov 22, 2020
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue Nov 23, 2020

(Same coalesced-heartbeat batch-enqueue commit message and release note as above.)
nvanbenschoten added commits to nvanbenschoten/cockroach that referenced this issue May 3, 2021

(Same scheduler-concurrency-cap and NodeLiveness-prioritization commit messages and release notes as above.)
daniel-crlabs (Contributor) commented:

I see this issue has been closed; has a fix for it been released yet?
Are the k8s images updated with the release containing this fix?
We received a ticket on Friday (ZD 8824) from a customer experiencing this exact problem, and once the customer updated their StatefulSet to set the environment variable below, the issue went away.

COCKROACH_SCHEDULER_CONCURRENCY=64

The customer is running CockroachDB Version 21.1.2

Server specs:

48 CPUs
PowerEdge R640: 2 x Intel Xeon Silver 4214 CPUs, 256GB RAM (16 x16GB DIMMs), Intel X550 quad port 10GBaseT NIC, Dell PERCH H740 RAID controller w/ 8GB NVCACHE,  2 x 480GB SATA Boot drives in RAID1

daniel-crlabs reopened this Jun 28, 2021
tbg (Member, Author) commented Jun 28, 2021

@daniel-crlabs 21.1.2 definitely caps the scheduler concurrency at 96. It seems suspicious that there would be a cliff like that between 64 and 96, but maybe it's possible? A Xeon Silver 4214 has 12 cores, so probably 24 vCPUs, which means 48 vCPUs per machine with two sockets. So 96 should already have been the active value before the manual env var.
I'm not sure if the customer was really seeing the issue here, or whether the restart just "happened" to fix everything.

I would look at the debug.zip and in particular the raft.enqueued.pending metric via `grep raft.enqueued.pending nodes/*/status.json`. This should be close to 50k if this issue is at play.

daniel-crlabs (Contributor) commented:

Thanks for the insight. The customer's pods kept restarting previously and the nodes would not stay alive for very long. This only stopped happening once the customer deployed a new StatefulSet with that env var set.
Unfortunately, that metric doesn't show up in status.json. Maybe it lives in a different file?

wasabi $ egrep -i "node_id|raft.enqueued.pending" debugNode{1..3}/nodes/*/status.json
debugNode1/nodes/1/status.json:    "node_id": 1,
debugNode2/nodes/2/status.json:    "node_id": 2,
debugNode3/nodes/3/status.json:    "node_id": 3,

However, I do see that metric in a different file:

wasabi $ egrep -i "raft.enqueued.pending" debugNode{1..3}/nodes/*/crdb_internal.node_metrics.txt
debugNode1/nodes/1/crdb_internal.node_metrics.txt:1	raft.enqueued.pending	0
debugNode2/nodes/2/crdb_internal.node_metrics.txt:2	raft.enqueued.pending	0
debugNode3/nodes/3/crdb_internal.node_metrics.txt:3	raft.enqueued.pending	0

But since the pods kept restarting every few minutes, I'm not sure the reported value of that metric is very meaningful.

As the customer has shown in the ticket, the pods all restarted after running for about two minutes, and this pattern of behavior continued until the env var was set:

*** Cluster initially came up
ba@kubemaster1:~/fcoco/bin$ ckgp
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cockroachdb-0 1/1 Running 0 3m39s 192.168.241.164 kubenode4.alpha13 <none> <none>
cockroachdb-1 1/1 Running 0 3m39s 192.168.153.119 kubenode2.alpha13 <none> <none>
cockroachdb-2 1/1 Running 0 3m39s 192.168.28.118 kubenode3.alpha13 <none> <none>
cockroachdb-3 1/1 Running 0 29s 192.168.115.172 kubenode6.alpha13 <none> <none>
cockroachdb-4 1/1 Running 0 29s 192.168.213.158 kubenode5.alpha13 <none> <none>

*** Cluster went down ~2 minutes after starting traffic
ba@kubemaster1:~/fcoco/bin$ ckgp
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cockroachdb-0 0/1 Running 0 8m35s 192.168.241.164 kubenode4.alpha13 <none> <none>
cockroachdb-1 0/1 Running 0 8m35s 192.168.153.119 kubenode2.alpha13 <none> <none>
cockroachdb-2 0/1 Running 0 8m35s 192.168.28.118 kubenode3.alpha13 <none> <none>
cockroachdb-3 0/1 Running 0 5m25s 192.168.115.172 kubenode6.alpha13 <none> <none>
cockroachdb-4 0/1 Running 0 5m25s 192.168.213.158 kubenode5.alpha13 <none> <none>

Hopefully this helps shed some light.

tbg (Member, Author) commented Jun 29, 2021

If those greps were run against a debug.zip taken while the cluster was in the defunct state, then what we're seeing tells me that we weren't looking at the Raft scheduler thrashing here. The thrashing problem also didn't lead to restarts in the cases where we saw it; replication would simply not make progress and the cluster would be very unhappy.

It's likely that something else was going on. Let's file a separate issue for it to investigate via the usual support channels.
