storage,nodedialer,kv,rpc: introduce connection class to separate system traffic #39172
Conversation
Can you add some commentary to the PR explaining the rationale, expected benefits, etc?
Force-pushed 0532599 to 12471a7.
Yes indeed. I opened it as a draft to get CI to run and to look at where I am. I'll clean up the PR and commits before I open it for review.
Force-pushed 7aae579 to e74cccf.
@knz could you give this a review? I tend to reach for @nvanbenschoten but I think you're more familiar with this code.
Code LGTM, but I'd like to persist a bit more motivation and rationale in the explanatory comments. See below.
Reviewed 4 of 4 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/rpc/connection_class.go, line 16 at r1 (raw file):
// ConnectionClass allows multiple connections to the same host to not share
// an underlying connection.
The explanatory comment is confusing. I think you probably want something like this instead: "ConnectionClass is the identifier of a group of RPC sessions that are allowed to share an underlying TCP connection; RPC sessions with different connection classes are guaranteed to use separate TCP connections."
You can also maybe extend the comment to explain what motivates this and that we may want to remove this concept in the future: "RPC sessions that share a connection class are arbitrated using the gRPC flow control logic, see . The lack of support for prioritization in the current gRPC implementation is the reason why we are separating different priority flows across separate TCP connections. Future gRPC improvements may enable further simplification here."
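For readers skimming the thread, here is a minimal, self-contained sketch (not the actual pkg/rpc code) of the kind of type and doc comment being discussed; the `NumConnectionClasses` name and the exact constant set are assumptions made only for this illustration.

```go
package main

import "fmt"

// ConnectionClass is the identifier of a group of RPC sessions that are
// allowed to share an underlying TCP connection; sessions with different
// classes are guaranteed to use separate TCP connections (the wording
// suggested above). Sessions that share a class are arbitrated only by
// gRPC's HTTP/2 flow control, which has no prioritization, hence the
// separate connection for system-critical traffic.
type ConnectionClass int8

const (
	// DefaultClass is used for ordinary client and raft traffic.
	DefaultClass ConnectionClass = iota
	// SystemClass is reserved for system-critical ranges such as node liveness.
	SystemClass
	// NumConnectionClasses counts the classes above (name assumed).
	NumConnectionClasses
)

func (c ConnectionClass) String() string {
	switch c {
	case DefaultClass:
		return "default"
	case SystemClass:
		return "system"
	default:
		return fmt.Sprintf("unknown(%d)", int8(c))
	}
}

func main() {
	fmt.Println(DefaultClass, SystemClass) // default system
}
```

Making the class a small integer also lets the rpc context key its connection pool (or index a fixed-size array) by class, which is what keeps the two traffic streams on distinct TCP connections.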
pkg/rpc/connection_class.go, line 35 at r1 (raw file):
// The first 5 ranges are always system ranges.
// TODO(ajwerner): consider expanding this or using another mechanism.
if rangeID <= 10 {
Avoid using a magic number here and instead declare (or reuse) a constant in an appropriate location.
pkg/rpc/context.go, line 712 at r1 (raw file):
// connection. This connection will not be reconnected automatically;
// the returned channel is closed when a reconnection is attempted.
// This method implied a DefaultClass ConnectionClass for the returned
*implies
pkg/rpc/context.go, line 761 at r1 (raw file):
// GRPCDialNodeClass calls grpc.Dial with options appropriate for the
// context and class.
Refer the reader to the explanatory comment next to the declaration of `ConnectionClass`.
(I have not reviewed the raft changes - I hope Nathan or perhaps @tbg can have a look here)
Reviewed 3 of 3 files at r2, 11 of 11 files at r3, 2 of 2 files at r4, 4 of 4 files at r5.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/kv/dist_sender_rangefeed.go, line 232 at r3 (raw file):
replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), latencyFn)
transport, err := ds.transportFactory(SendOptions{}, ds.nodeDialer, rpc.DefaultClass, replicas)
why not the class for the given `desc.RangeID` here?
Force-pushed 6c2d842 to 591929e.
TFTR! @nvanbenschoten if I could bother you to review just the raft_transport.go change it would be much appreciated.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/kv/dist_sender_rangefeed.go, line 232 at r3 (raw file):
Previously, knz (kena) wrote…
why not the class for the given `desc.RangeID` here?
The RangeFeed method isn't used for anything system-critical. I added a comment.
pkg/rpc/connection_class.go, line 16 at r1 (raw file):
Previously, knz (kena) wrote…
The explanatory comment is confusing. I think you probably want something like this instead: "ConnectionClass is the identifier of a group of RPC sessions that are allowed to share an underlying TCP connection; RPC sessions with different connection classes are guaranteed to use separate TCP connections."
You can also maybe extend the comment to explain what motivates this and that we may want to remove this concept in the future: "RPC sessions that share a connection class are arbitrated using the gRPC flow control logic, see . The lack of support for prioritization in the current gRPC implementation is the reason why we are separating different priority flows across separate TCP connections. Future gRPC improvements may enable further simplification here."
Much better, thanks.
pkg/rpc/connection_class.go, line 35 at r1 (raw file):
Previously, knz (kena) wrote…
Avoid using a magic number here and instead declare (or reuse) a constant in an appropriate location.
`<= 10` was too wide of a net. In order to have the desired impact we really only need these few ranges. I've added commentary around the specific ranges, though there are no good constants to reuse.
pkg/rpc/context.go, line 712 at r1 (raw file):
Previously, knz (kena) wrote…
*implies
Done.
pkg/rpc/context.go, line 761 at r1 (raw file):
Previously, knz (kena) wrote…
Refer the reader to the explanatory comment next to the declaration of `ConnectionClass`.
Done.
Reviewed 1 of 18 files at r6, 2 of 3 files at r7, 9 of 11 files at r8, 1 of 2 files at r9, 4 of 4 files at r10.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/kv/dist_sender_rangefeed.go, line 232 at r10 (raw file):
replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), latencyFn)
// The RangeFeed is not used for system critical traffic so use a DefaultClass
// connection regardless of the range.
Arguably I think changefeeds would be best served by a 3rd connection class. You don't want CDC updates to start lagging behind because the main replication traffic is hogging the pipe.
Maybe you can talk with @danhhz about it.
pkg/rpc/connection_class.go, line 54 at r6 (raw file):
//
// TODO(ajwerner): consider expanding this or using another mechanism.
case 1, 2, 6, 7:
That's really not better. You can do `case keys.SomeRangeID, keys.OtherRangeID, ...:` instead, with suitable constant definitions.
You'll also need an assertion somewhere that the ranges of interest actually get these IDs. The mechanism would be too brittle otherwise.
Force-pushed 6b9e72e to 9086af2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/rpc/connection_class.go, line 54 at r6 (raw file):
Previously, knz (kena) wrote…
That's really not better. You can do `case keys.SomeRangeID, keys.OtherRangeID, ...:` instead, with suitable constant definitions. You'll also need an assertion somewhere that the ranges of interest actually get these IDs. The mechanism would be too brittle otherwise.
Alright, I pared it back to the minimum required for this change to work and exported constants from the `config` package. For client traffic in `kv` we might be better served by adding logic to inspect spans rather than range IDs.
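As a rough illustration of the pared-back, range-ID-addressed approach described here, the sketch below selects a class from a tiny set of well-known range IDs. `NodeLivenessRangeID = 2` appears later in this review; `MetaRangeID` and its value are assumptions made only for this example, not the actual exported constants.

```go
package main

import "fmt"

type RangeID int64

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// Stand-ins for constants exported from the config package; only
// NodeLivenessRangeID = 2 is confirmed later in this review.
const (
	MetaRangeID         RangeID = 1 // assumed for illustration
	NodeLivenessRangeID RangeID = 2
)

// connectionClassForRangeID keeps the net narrow: only the handful of ranges
// whose availability is critical are routed over the SystemClass connection.
func connectionClassForRangeID(id RangeID) ConnectionClass {
	switch id {
	case MetaRangeID, NodeLivenessRangeID:
		return SystemClass
	default:
		return DefaultClass
	}
}

func main() {
	fmt.Println(connectionClassForRangeID(NodeLivenessRangeID)) // 1 (SystemClass)
	fmt.Println(connectionClassForRangeID(42))                  // 0 (DefaultClass)
}
```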
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/kv/dist_sender_rangefeed.go, line 232 at r10 (raw file):
Previously, knz (kena) wrote…
Arguably I think changefeeds would be best served by a 3rd connection class. You don't want CDC updates to start lagging behind because the main replication traffic is hogging the pipe.
Maybe you can talk with @danhhz about it.
Import/schema change traffic might also want its own class. However, I think it's best to start with 2 and only add more if it becomes absolutely necessary.
Reviewed 1 of 19 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)
pkg/config/system.go, line 373 at r11 (raw file):
// NodeLivenessRangeID is the numeric range ID for node liveness information.
// See staticSplits above.
NodeLivenessRangeID roachpb.RangeID = 2
Just out of curiosity, where are these range IDs assigned to the ranges?
I discussed offline with @nvanbenschoten about moving the class decision to be span oriented rather than range ID oriented. It will require a bit of plumbing but will generally make for a much better solution.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
pkg/config/system.go, line 373 at r11 (raw file):
Previously, knz (kena) wrote…
Just out of curiosity, where are these range IDs assigned to the ranges?
cockroach/pkg/storage/store_bootstrap.go, line 140 in 11e4b35:
rangeID := roachpb.RangeID(i + 2) // RangeIDs are 1-based.
called here, lines 226 to 234 in 11e4b35:
if i == 0 {
	schema := GetBootstrapSchema(defaultZoneConfig, defaultSystemZoneConfig)
	initialValues, tableSplits := schema.GetInitialValues()
	splits := append(config.StaticSplits(), tableSplits...)
	sort.Slice(splits, func(i, j int) bool {
		return splits[i].Less(splits[j])
	})
	if err := storage.WriteInitialClusterData(
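As a small aside on the numbering visible in these snippets, here is a toy sketch of the 1-based assignment rule (the split names are placeholders, not the real staticSplits values): range 1 covers everything before the first split, so the range that begins at the i-th split boundary (0-indexed) gets ID i+2, which is consistent with node liveness being range 2 per the constant above.

```go
package main

import "fmt"

// Sketch of the RangeID assignment shown above: the bootstrap code sorts the
// split points and numbers the ranges that start at each split from 2 upward,
// because RangeIDs are 1-based and range 1 precedes the first split.
func main() {
	splits := []string{"splitA", "splitB", "splitC"} // placeholder split names
	fmt.Println("range 1 starts at /Min")
	for i, s := range splits {
		fmt.Printf("range %d starts at %s\n", i+2, s)
	}
}
```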
Force-pushed 9086af2 to e48b61e.
Cleaned up the PR to be key-addressed rather than range-ID-addressed. Also moved the class into `SendOptions` in the `TransportFactory` rather than being a distinct argument. Also added a `kv` unit test. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz)
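A minimal sketch of the key-addressed shape described in the comment above (the field and function names here are assumptions, and the "system prefix" check stands in for the real keys/config logic): the class rides along in SendOptions instead of being a separate argument to the transport factory.

```go
package main

import (
	"bytes"
	"fmt"
)

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// SendOptions carries the class, as described above; the field name here is
// an assumption for this sketch.
type SendOptions struct {
	Class ConnectionClass
}

// systemKeyPrefix stands in for the portion of the key space that holds the
// critical system ranges; the real code consults the keys/config packages.
var systemKeyPrefix = []byte{0x00}

// classForKey is key-addressed rather than range-ID-addressed: requests whose
// keys fall under the system prefix use the SystemClass connection.
func classForKey(key []byte) ConnectionClass {
	if bytes.HasPrefix(key, systemKeyPrefix) {
		return SystemClass
	}
	return DefaultClass
}

func main() {
	opts := SendOptions{Class: classForKey([]byte{0x00, 0x01})}
	fmt.Println(opts.Class == SystemClass)                          // true
	fmt.Println(classForKey([]byte("/Table/53/1")) == DefaultClass) // true
}
```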
Force-pushed e48b61e to 0243f55.
39167: roachtest: add tpccoverload to exercise a node-liveness failure scenario r=nvanbenschoten a=ajwerner

This roachtest combines the tpcc dataset with a contrived query to highlight a scenario in which node liveness is currently starved. This test is skipped as it currently fails on master. It is important to note that after separating the TCP connection used for node liveness from the rest of traffic (#39172) this test passes reliably.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
This commit extends the poorly abstracted TransportFactory interface to additionally capture connection class. It then utilizes this new property in the DistSender to separate traffic intended for critical system ranges from all other traffic. Release note: None
This commit adopts the ConnectionClass in the storage raftTransport to separate raft messages for system ranges from those of other ranges. Release note: None
This adds a unit test that exercises the change to the raftTransport to use a connection for system traffic that is separate from regular traffic. Release note: None
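The storage test itself isn't shown in this thread; the sketch below captures its intent with stand-in types (none of these names are the real rpc or storage identifiers): dialing the same node with different classes must yield different connections, while re-dialing the same class reuses the existing one.

```go
package rpcsketch

import "testing"

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// conn stands in for an rpc connection; only its identity matters here.
type conn struct{ class ConnectionClass }

type connKey struct {
	target string
	class  ConnectionClass
}

// pool stands in for the rpc context's connection pool, keyed by target and class.
type pool struct {
	conns map[connKey]*conn
}

func (p *pool) dial(target string, class ConnectionClass) *conn {
	k := connKey{target, class}
	if c, ok := p.conns[k]; ok {
		return c
	}
	c := &conn{class: class}
	p.conns[k] = c
	return c
}

// TestSystemClassSeparatesConnections mirrors the intent described in the
// commit message: raft traffic for system ranges must not share a connection
// with traffic for regular ranges, and repeated dials of a class must reuse it.
func TestSystemClassSeparatesConnections(t *testing.T) {
	p := &pool{conns: map[connKey]*conn{}}
	def := p.dial("n2", DefaultClass)
	sys := p.dial("n2", SystemClass)
	if def == sys {
		t.Fatal("system and default traffic unexpectedly share a connection")
	}
	if p.dial("n2", SystemClass) != sys {
		t.Fatal("dialing the same class again should reuse the existing connection")
	}
}
```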
Force-pushed ec1c48a to ecee66c.
bors r+

PR to adopt ConnectionClass coming right up.
39172: storage,nodedialer,kv,rpc: introduce connection class to separate system traffic r=ajwerner a=ajwerner

**Motivation**

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

**Impact**

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node 16-CPU (n1-standard-16) single-region cluster with 100 TPCC warehouses loaded, using the query from the above roachtest, both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

![image](https://user-images.githubusercontent.com/1839234/62215027-2764b380-b374-11e9-9640-9bf5d17b21b9.png)

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that there was a single liveness failure that occurred during the run with this change.

**PR Layout**

The PR itself comes in 5 commits:

1) Introduce `ConnectionClass` to the `rpc` package and `nodedialer`
2) Add testing knobs to inject interceptors in the `rpc` package
3) Adopt `ConnectionClass` in the `kv` package
4) Adopt `ConnectionClass` in the `storage` package
5) Add testing for the storage package change

The change probably would benefit from a unit test in the `kv` package, which I'll type today, but I wanted to start the review process.

Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial `SystemClass` connections to nodes which at one point were the leaseholder for, or held a replica of, a system range but no longer do. In order to mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded
[cockroachdb#39172](https://github.com/cockroachdb/cockroach/pull/39172) added the concept of connection class to create multiple connections to a given target. In order to reduce the adoption burden, that PR introduced new methods for dialing connections, checking connection health, and retrieving circuit breakers which took a ConnectionClass and carried a `Class` suffix. It left the previous method signatures untouched, opting instead to convert them to a shorthand which passed DefaultClass to the new method. This PR moves all clients of these methods to use ConnectionClass explicitly. Release note: None
Now that cockroachdb#39172 has landed, this test should pass reliably. Release note: None
39567: roachtest: unskip overload/tpcc_olap r=ajwerner a=ajwerner

Now that #39172 has landed, this test should pass reliably. I ran it tens of times without a failure while working on that PR and 20 more times this morning.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
39398: rpc,nodedialer: excise methods which do not use ConnectionClass r=ajwerner a=ajwerner

#39172 added the concept of connection class to create multiple connections to a given target. In order to reduce the adoption burden, that PR introduced new methods for dialing connections, checking connection health, and retrieving circuit breakers which took a ConnectionClass and carried a `Class` suffix. It left the previous method signatures untouched, opting instead to convert them to a shorthand which passed DefaultClass to the new method. This PR moves all clients of these methods to use ConnectionClass explicitly.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
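A hedged sketch of the migration this PR describes (the method names are illustrative, not the exact nodedialer signatures): the `Class`-suffixed variant from #39172 takes the class explicitly, the old name had become a shorthand for DefaultClass, and after this PR callers spell the class out themselves.

```go
package main

import "fmt"

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
)

// dialer sketches the API shape described above; names are illustrative only.
type dialer struct{}

// Dial is the pre-#39398 shorthand that implied DefaultClass.
func (d *dialer) Dial(nodeID int) string { return d.DialClass(nodeID, DefaultClass) }

// DialClass is the class-aware variant added in #39172.
func (d *dialer) DialClass(nodeID int, class ConnectionClass) string {
	return fmt.Sprintf("conn(n%d, class=%d)", nodeID, class)
}

func main() {
	d := &dialer{}
	fmt.Println(d.Dial(2))                   // before: class implied by the shorthand
	fmt.Println(d.DialClass(2, SystemClass)) // after: class stated explicitly
}
```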
56860: kv: improve Raft scheduler behavior under CPU starvation r=jordanlewis a=nvanbenschoten

Fixes #56851.

This PR contains 3 commits that should improve the health of a cluster under CPU starvation and with many Ranges. I ran a series of experiments (see #56860 (comment)) which demonstrate that the combination of these commits improves the health of a cluster with many ranges dramatically, ensuring that liveness never falls over and that liveness heartbeat latency stays constant even as all other ranges become overloaded.

#### kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96

In investigations like #56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex to ensure that the scheduler was able to schedule any goroutines. This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system.

#### kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats

In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:

```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```

In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.
We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver. The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (`enqueueChunkSize`). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler.

#### kv: prioritize NodeLiveness Range in Raft scheduler

In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We [already split RPC traffic into two classes](#39172), ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.

This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1).

@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
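The batch-enqueue commit is easiest to see with a toy model. The sketch below (not the real raftScheduler code) shows how chunking the Ranges of a coalesced heartbeat reduces lock acquisitions by roughly `enqueueChunkSize`, the factor of 128 quoted above; only the constant's value is taken from the PR description, everything else is assumed for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

const enqueueChunkSize = 128 // value quoted in the PR description above

type rangeID int64

// scheduler is a toy stand-in for the Raft scheduler: one mutex guarding a queue.
type scheduler struct {
	mu     sync.Mutex
	queued []rangeID
}

// enqueueBatch locks the scheduler once per chunk of enqueueChunkSize Ranges
// instead of once per Range, which is the essence of the batching described above.
func (s *scheduler) enqueueBatch(ids []rangeID) (lockAcquisitions int) {
	for len(ids) > 0 {
		n := enqueueChunkSize
		if len(ids) < n {
			n = len(ids)
		}
		s.mu.Lock()
		s.queued = append(s.queued, ids[:n]...)
		s.mu.Unlock()
		ids = ids[n:]
		lockAcquisitions++
	}
	return lockAcquisitions
}

func main() {
	var s scheduler
	ids := make([]rangeID, 10000)    // a coalesced heartbeat covering 10k ranges
	fmt.Println(s.enqueueBatch(ids)) // 79 lock acquisitions instead of 10,000
}
```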
**Motivation**

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

**Impact**

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node 16-CPU (n1-standard-16) single-region cluster with 100 TPCC warehouses loaded, using the query from the above roachtest, both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that there was a single liveness failure that occurred during the run with this change.

**PR Layout**

The PR itself comes in 5 commits:

1) Introduce `ConnectionClass` to the `rpc` package and `nodedialer`
2) Add testing knobs to inject interceptors in the `rpc` package
3) Adopt `ConnectionClass` in the `kv` package
4) Adopt `ConnectionClass` in the `storage` package
5) Add testing for the storage package change

The change probably would benefit from a unit test in the `kv` package, which I'll type today, but I wanted to start the review process.

Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial `SystemClass` connections to nodes which at one point were the leaseholder for, or held a replica of, a system range but no longer do. In order to mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.