
storage,nodedialer,kv,rpc: introduce connection class to separate system traffic #39172

Merged

Conversation

ajwerner
Contributor

@ajwerner ajwerner commented Jul 30, 2019

Motivation

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

Impact

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node, 16-CPU (n1-standard-16), single-region cluster with 100 TPCC warehouses loaded, running the query from the above roachtest both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

![image](https://user-images.githubusercontent.com/1839234/62215027-2764b380-b374-11e9-9640-9bf5d17b21b9.png)

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change, node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that a single liveness failure did occur during the run with this change.

PR Layout

The PR itself comes in 5 commits:

  1. Introduce ConnectionClass to the rpc package and nodedialer
  2. Add testing knobs to inject interceptors in the rpc package
  3. Adopt ConnectionClass in the kv package
  4. Adopt ConnectionClass in the storage package
  5. Add testing for the storage package change

The change would probably benefit from a unit test in the kv package, which I'll write today, but I wanted to start the review process.

Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial SystemClass connections to nodes that at one point were the leaseholder for, or held a replica of, a system range but no longer do. To mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.
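
To make the mechanism concrete, here is a minimal, self-contained sketch of the idea: a small ConnectionClass enum plus a connection pool keyed on (target, class), so SystemClass traffic never shares a TCP connection with DefaultClass traffic. The connKey/connPool names and the dial options are illustrative simplifications, not the actual rpc.Context implementation.

```go
package rpc

import (
	"sync"

	"google.golang.org/grpc"
)

// ConnectionClass identifies a group of RPC sessions that may share an
// underlying TCP connection; sessions with different classes are guaranteed
// to use separate connections.
type ConnectionClass int8

const (
	// DefaultClass is used for ordinary traffic.
	DefaultClass ConnectionClass = iota
	// SystemClass is reserved for system ranges such as node liveness.
	SystemClass
)

// connKey keys the pool on both target and class so the two classes never
// share a *grpc.ClientConn (and hence never share a TCP stream).
type connKey struct {
	target string
	class  ConnectionClass
}

// connPool is a hypothetical, simplified stand-in for rpc.Context's
// connection map.
type connPool struct {
	mu    sync.Mutex
	conns map[connKey]*grpc.ClientConn
}

func (p *connPool) dial(target string, class ConnectionClass) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	key := connKey{target: target, class: class}
	if conn, ok := p.conns[key]; ok {
		return conn, nil
	}
	conn, err := grpc.Dial(target, grpc.WithInsecure())
	if err != nil {
		return nil, err
	}
	if p.conns == nil {
		p.conns = map[connKey]*grpc.ClientConn{}
	}
	p.conns[key] = conn
	return conn, nil
}
```

Keying the pool on the class is also what doubles the worst-case connection count mentioned above.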

@cockroach-teamcity
Member

This change is Reviewable

@andy-kimball
Contributor

Can you add some commentary to the PR explaining the rationale, expected benefits, etc?

@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch from 0532599 to 12471a7 on July 30, 2019 16:19
@ajwerner
Contributor Author

Can you add some commentary to the PR explaining the rationale, expected benefits, etc?

Yes indeed. I opened it as a draft to get CI to run and to look at where I am. I'll clean up the PR and commits before I open it for review.

@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch 3 times, most recently from 7aae579 to e74cccf on July 31, 2019 04:13
@ajwerner ajwerner marked this pull request as ready for review July 31, 2019 13:30
@ajwerner ajwerner requested review from a team and knz July 31, 2019 13:30
@ajwerner
Contributor Author

@knz could you give this a review? I tend to reach for @nvanbenschoten but I think you're more familiar with this code.

Contributor

@knz knz left a comment


Code LGTM, but I'd like to persist a bit more motivation and rationale in the explanatory comments. See below.

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/rpc/connection_class.go, line 16 at r1 (raw file):

// ConnectionClass allows multiple connections to the same host to not share
// an underlying connection.

The explanatory comment is confusing. I think you probably want something like this instead: "ConnectionClass is the identifier of a group of RPC sessions that are allowed to share an underlying TCP connection; RPC sessions with different connection classes are guaranteed to use separate TCP connections.".

You can also maybe extend the comment to explain what motivates this and that we may want to remove this concept in the future: "RPC sessions that share a connection class are arbitrated using the gRPC flow control logic, see . The lack of support of prioritization in the current gRPC implementation is the reason why we are separating different priority flows across separate TCP connections. Future gRPC improvements may enable further simplification here."


pkg/rpc/connection_class.go, line 35 at r1 (raw file):

	// The first 5 ranges are always system ranges.
	// TODO(ajwerner): consider expanding this or using another mechanism.
	if rangeID <= 10 {

Avoid using a magic number here and instead declare (or reuse) a constant in an appropriate location.


pkg/rpc/context.go, line 712 at r1 (raw file):

// connection. This connection will not be reconnected automatically;
// the returned channel is closed when a reconnection is attempted.
// This method implied a DefaultClass ConnectionClass for the returned

*implies


pkg/rpc/context.go, line 761 at r1 (raw file):

// GRPCDialNodeClass calls grpc.Dial with options appropriate for the
// context and class.

Refer the reader to the explanatory comment next to the declaration for ConnectionClass.

Contributor

@knz knz left a comment


(I have not reviewed the raft changes - I hope Nathan or perhaps @tbg can have a look here)

Reviewed 3 of 3 files at r2, 11 of 11 files at r3, 2 of 2 files at r4, 4 of 4 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/kv/dist_sender_rangefeed.go, line 232 at r3 (raw file):

	replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), latencyFn)

	transport, err := ds.transportFactory(SendOptions{}, ds.nodeDialer, rpc.DefaultClass, replicas)

why not the class for the given desc.RangeID here?

@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch 2 times, most recently from 6c2d842 to 591929e on August 1, 2019 13:12
Contributor Author

@ajwerner ajwerner left a comment


TFTR! @nvanbenschoten if I could bother you to review just the raft_transport.go change it would be much appreciated.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/kv/dist_sender_rangefeed.go, line 232 at r3 (raw file):

Previously, knz (kena) wrote…

why not the class for the given desc.RangeID here?

The RangeFeed method isn't used for anything system-critical. I added a comment.


pkg/rpc/connection_class.go, line 16 at r1 (raw file):

Previously, knz (kena) wrote…

The explanatory comment is confusing. I think you probably want something like this instead: "ConnectionClass is the identifier of a group of RPC sessions that are allowed to share an underlying TCP connection; RPC sessions with different connection classes are guaranteed to use separate TCP connections.".

You can also maybe extend the comment to explain what motivates this and that we may want to remove this concept in the future: "RPC sessions that share a connection class are arbitrated using the gRPC flow control logic, see . The lack of support of prioritization in the current gRPC implementation is the reason why we are separating different priority flows across separate TCP connections. Future gRPC improvements may enable further simplification here."

Much better, thanks.


pkg/rpc/connection_class.go, line 35 at r1 (raw file):

Previously, knz (kena) wrote…

Avoid using a magic number here and instead declare (or reuse) a constant in an appropriate location.

`<= 10` was too wide a net. In order to have the desired impact we really only need these few ranges. I've added commentary around the specific ranges, though there are no good constants to reuse.


pkg/rpc/context.go, line 712 at r1 (raw file):

Previously, knz (kena) wrote…

*implies

Done.


pkg/rpc/context.go, line 761 at r1 (raw file):

Previously, knz (kena) wrote…

Refer the reader to the explanatory comment next to the declaration for ConnectionClass.

Done.

Contributor

@knz knz left a comment


Reviewed 1 of 18 files at r6, 2 of 3 files at r7, 9 of 11 files at r8, 1 of 2 files at r9, 4 of 4 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/kv/dist_sender_rangefeed.go, line 232 at r10 (raw file):

	replicas.OptimizeReplicaOrder(ds.getNodeDescriptor(), latencyFn)
	// The RangeFeed is not used for system critical traffic so use a DefaultClass
	// connection regardless of the range.

Arguably I think changefeeds would be best served by a 3rd connection class. You don't want CDC updates to start lagging behind because the main replication traffic is hogging the pipe.
Maybe you can talk with @danhhz about it.


pkg/rpc/connection_class.go, line 54 at r6 (raw file):

	//
	// TODO(ajwerner): consider expanding this or using another mechanism.
	case 1, 2, 6, 7:

That's really not better. You can do `case keys.SomeRangeID, keys.OtherRangeID, ...:` instead, with suitable constant definitions.

You'll also need an assertion somewhere that the ranges of interest actually get these IDs. The mechanism would be too brittle otherwise.
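
For illustration, a rough sketch of the shape being suggested here, reusing the ConnectionClass values from the earlier sketch: a switch over named range-ID constants rather than literals. config.NodeLivenessRangeID is the constant the PR ends up exporting (see pkg/config/system.go later in this review); the helper's name and location are assumptions, and as noted above, bootstrap or test code would still need to assert that these ranges really receive these IDs.

```go
package rpc

import (
	"github.com/cockroachdb/cockroach/pkg/config"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// ConnectionClassForRangeID returns SystemClass for the handful of ranges
// the cluster's health depends on, and DefaultClass for everything else.
// Any additional cases would need their own named constants, plus an
// assertion that bootstrap actually assigns them these IDs.
func ConnectionClassForRangeID(rangeID roachpb.RangeID) ConnectionClass {
	switch rangeID {
	case config.NodeLivenessRangeID:
		return SystemClass
	default:
		return DefaultClass
	}
}
```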

@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch 2 times, most recently from 6b9e72e to 9086af2 on August 1, 2019 14:07
Contributor Author

@ajwerner ajwerner left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/rpc/connection_class.go, line 54 at r6 (raw file):

Previously, knz (kena) wrote…

That's really not better. You can do `case keys.SomeRangeID, keys.OtherRangeID, ...:` instead, with suitable constant definitions.

You'll also need an assertion somewhere that the ranges of interest actually get these IDs. The mechanism would be too brittle otherwise.

Alright, I pared it back to the minimum required for this change to work and exported constants from the config package. For client traffic in kv we might be better served by adding logic to inspect spans rather than range IDs.
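
As a sketch of what inspecting keys rather than range IDs could look like (roughly where the PR ends up; see the "key addressed" comment further down), again reusing the ConnectionClass values from the first sketch. The function name and the choice of keys.NodeLivenessKeyMax as the boundary are assumptions for illustration, not necessarily the exact code that landed.

```go
package rpc

import (
	"bytes"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// ConnectionClassForKey sketches a key-addressed classifier: requests whose
// start key sorts below a system boundary key use the SystemClass
// connection; everything else uses DefaultClass.
func ConnectionClassForKey(key roachpb.RKey) ConnectionClass {
	if bytes.Compare(key, keys.NodeLivenessKeyMax) < 0 {
		return SystemClass
	}
	return DefaultClass
}
```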

Contributor

@danhhz danhhz left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/kv/dist_sender_rangefeed.go, line 232 at r10 (raw file):

Previously, knz (kena) wrote…

Arguably I think changefeeds would be best served by a 3rd connection class. You don't want CDC updates to start lagging behind because the main replication traffic is hogging the pipe.
Maybe you can talk with @danhhz about it.

Import/schema change traffic might also want its own class. However, I think it's best to start with 2 and only add more if it becomes absolutely necessary.

Contributor

@knz knz left a comment


Reviewed 1 of 19 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @knz)


pkg/config/system.go, line 373 at r11 (raw file):

	// NodeLivenessRangeID is the numeric range ID for node liveness information.
	// See staticSplits above.
	NodeLivenessRangeID roachpb.RangeID = 2

Just out of curiosity, where are these range IDs assigned to the ranges?

Contributor Author

@ajwerner ajwerner left a comment


I discussed offline with @nvanbenschoten about moving the class decision to be span oriented rather than range ID oriented. It will require a bit of plumbing but will generally make for a much better solution.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)


pkg/config/system.go, line 373 at r11 (raw file):

Previously, knz (kena) wrote…

Just out of curiosity, where are these range IDs assigned to the ranges?

```go
rangeID := roachpb.RangeID(i + 2) // RangeIDs are 1-based.
```

called from:

```go
if i == 0 {
	schema := GetBootstrapSchema(defaultZoneConfig, defaultSystemZoneConfig)
	initialValues, tableSplits := schema.GetInitialValues()
	splits := append(config.StaticSplits(), tableSplits...)
	sort.Slice(splits, func(i, j int) bool {
		return splits[i].Less(splits[j])
	})
	if err := storage.WriteInitialClusterData(
```
@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch from 9086af2 to e48b61e on August 1, 2019 19:15
Contributor Author

@ajwerner ajwerner left a comment


Cleaned up the PR to be key-addressed rather than range-ID-addressed. Also moved the class into SendOptions in the TransportFactory rather than passing it as a distinct argument. Also added a kv unit test. PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz)
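
For context, before this cleanup the factory was invoked as `ds.transportFactory(SendOptions{}, ds.nodeDialer, rpc.DefaultClass, replicas)` (see the r3 snippet above); afterwards the class rides inside SendOptions. A simplified, self-contained sketch of those shapes, with placeholder Transport/ReplicaSlice types and a guessed field name rather than the exact kv definitions:

```go
package kv

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/rpc"
	"github.com/cockroachdb/cockroach/pkg/rpc/nodedialer"
)

// Transport and ReplicaSlice are placeholders standing in for the real kv
// types so this sketch is self-contained.
type Transport interface{ IsExhausted() bool }
type ReplicaSlice []roachpb.ReplicaDescriptor

// SendOptions carries per-send parameters; with this change it also carries
// the connection class instead of the factory taking the class as a
// separate argument.
type SendOptions struct {
	class rpc.ConnectionClass
}

// TransportFactory builds a Transport for a set of replicas; note the class
// argument is gone from the signature.
type TransportFactory func(SendOptions, *nodedialer.Dialer, ReplicaSlice) (Transport, error)
```

A caller such as the DistSender would then pick the class while building the options, e.g. DefaultClass for rangefeeds and the system class for requests addressed to system keys.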

@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch from e48b61e to 0243f55 on August 1, 2019 20:13
craig bot pushed a commit that referenced this pull request Aug 2, 2019
39167: roachtest: add tpccoverload to exercise a node-liveness failure scenario r=nvanbenschoten a=ajwerner

This roachtest combines the tpcc dataset with a contrived query to highlight
a scenario in which node liveness is currently starved.

This test is skipped as it currently fails on master. It is important to note that after separating the TCP connection used for node liveness from the rest of the traffic (#39172), this test passes reliably.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
This commit extends the poorly abstracted TransportFactory interface to
additionally capture connection class. It then utilizes this new property
in the DistSender to separate traffic intended for critical system ranges
from all other traffic.

Release note: None
This commit adopts the ConnectionClass in the storage raftTransport to separate raft messages for system ranges from those of other ranges.

Release note: None
This adds a unit test that exercises the raftTransport change to keep raft traffic for system ranges on a separate connection from regular traffic.

Release note: None
@ajwerner ajwerner force-pushed the ajwerner/rpc-separate-traffic-class branch from ec1c48a to ecee66c on August 6, 2019 22:06
@ajwerner
Contributor Author

ajwerner commented Aug 7, 2019

bors r+

PR to adopt ConnectionClass coming right up.

craig bot pushed a commit that referenced this pull request Aug 7, 2019
39172: storage,nodedialer,kv,rpc: introduce connection class to separate system traffic r=ajwerner a=ajwerner

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
@craig
Contributor

craig bot commented Aug 7, 2019

Build succeeded

@craig craig bot merged commit ecee66c into cockroachdb:master Aug 7, 2019
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 8, 2019
[cockroachdb#39172](https://github.com/cockroachdb/cockroach/pull/39172) added the concept
of connection class to create multiple connections to a given target. In order
to reduce the adoption burden that PR introduced new methods for dialing
connections, checking connection health, and retrieving circuit breakers which
took a ConnectionClass and carried a `Class` suffix. It left the previous
method signatures untouched, opting instead to convert them to a shorthand
which passed DefaultClass to the new method.

This PR moves all clients of these methods to use ConnectionClass explicitly.

Release note: None
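
The shorthand pattern being excised looks roughly like this, building on the ConnectionClass values sketched earlier; Context, Connection, and the signatures are simplified placeholders (the real GRPCDialNode also takes a node ID):

```go
package rpc

// Connection and Context are simplified placeholders for the rpc package's
// real types.
type Connection struct{}
type Context struct{}

// GRPCDialNodeClass is the class-aware entry point introduced by #39172
// (sketch; the real method dials or reuses a connection keyed on the
// target/class pair).
func (c *Context) GRPCDialNodeClass(target string, class ConnectionClass) *Connection {
	return &Connection{}
}

// GRPCDialNode is the pre-existing signature kept as a shorthand that
// implies DefaultClass; #39398 removes shorthands like this in favor of
// explicit classes at every call site.
func (c *Context) GRPCDialNode(target string) *Connection {
	return c.GRPCDialNodeClass(target, DefaultClass)
}
```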
ajwerner added a commit to ajwerner/cockroach that referenced this pull request Aug 12, 2019
Now that cockroachdb#39172 has landed this test should pass reliably.

Release note: None
craig bot pushed a commit that referenced this pull request Aug 12, 2019
39567: roachtest: unskip overload/tpcc_olap r=ajwerner a=ajwerner

Now that #39172 has landed this test should pass reliably. I ran it 10s of times without a failure while working on that PR and 20 more times this morning. 

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Aug 21, 2019
39398: rpc,nodedialer: excise methods which do not use ConnectionClass r=ajwerner a=ajwerner


Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
craig bot pushed a commit that referenced this pull request Nov 22, 2020
56860: kv: improve Raft scheduler behavior under CPU starvation r=jordanlewis a=nvanbenschoten

Fixes #56851.

This PR contains 3 commits that should improve the health of a cluster under CPU starvation and with many Ranges. I ran a series of experiments (see #56860 (comment)) which demonstrate that the combination of these commits improves the health of a cluster with many ranges dramatically, ensuring that liveness never falls over and that liveness heartbeat latency stays constant even as all other ranges become overloaded.

#### kv: cap COCKROACH_SCHEDULER_CONCURRENCY at 96

In investigations like #56851, we've seen the mutex in the Raft scheduler collapse due to too much concurrency. To address this, we needed to drop the scheduler's goroutine pool size to bound the amount of contention on the mutex and ensure that the scheduler was able to schedule any goroutines.

This commit caps this concurrency to 96, instead of letting it grow unbounded as a function of the number of cores on the system.

#### kv: batch enqueue Ranges in Raft scheduler for coalesced heartbeats

In #56851, we saw that all of the Raft transport's receiving goroutines were stuck in the Raft scheduler, attempting to enqueue Ranges in response to coalesced heartbeats. We saw this in stacktraces like:
```
goroutine 321096 [semacquire]:
sync.runtime_SemacquireMutex(0xc00007099c, 0xc005822a00, 0x1)
	/usr/local/go/src/runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc000070998)
	/usr/local/go/src/sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	/usr/local/go/src/sync/mutex.go:81
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).enqueue1(0xc000070980, 0x4, 0x19d8cb, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:261 +0xb0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).EnqueueRaftRequest(0xc000070980, 0x19d8cb)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:299 +0x3e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftUncoalescedRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc01288e5c0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:175 +0x201
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).uncoalesceBeats(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc035790a80, 0x37, 0x43, 0x100000001, 0x29b00000000, 0x0, 0x400000004, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:110 +0x33b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).HandleRaftRequest(0xc001136700, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:130 +0x1be
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).handleRaftRequest(0xc000188780, 0x4becc00, 0xc019f31b60, 0xc02be585f0, 0x4ba44c0, 0xc014ff2b40, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:299 +0xab
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).RaftMessageBatch.func1.1.1(0x4c3fac0, 0xc00d3ccdf0, 0xc000188780, 0x4becc00, 0xc019f31b60, 0x95fe98, 0x40c5720)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:370 +0x199
```

In that issue, we also saw that too much concurrency on the Raft scheduler's Mutex had caused the mutex to collapse (get stuck in the slow path, in the OS kernel) and hundreds of goroutines to pile up on it.

We suspect that part of the problem here was that each of the coalesced heartbeats was locking the Raft scheduler once per Range. So a coalesced heartbeat that contained 10k ranges would lock the scheduler 10k times on the receiver.

The commit attempts to alleviate this issue by batch enqueuing Ranges in the Raft scheduler in response to coalesced heartbeats. This has a slight fixed overhead (i.e. the need for a slice) but in response, reduces the load that coalesced heartbeats place on the Raft scheduler's mutex by a factor of 128 (`enqueueChunkSize`). This should reduce the impact that a large number of Ranges have on contention in the Raft scheduler.
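
A heavily simplified sketch of the batching idea: lock the scheduler once per chunk of range IDs rather than once per range. Only the queue and its mutex are modeled here; enqueueChunkSize matches the factor of 128 mentioned above.

```go
package kvserver

import "sync"

type rangeID int64

// enqueueChunkSize matches the factor-of-128 batching described above.
const enqueueChunkSize = 128

// raftScheduler is a heavily simplified stand-in for the real scheduler;
// only the queue and its mutex are modeled.
type raftScheduler struct {
	mu    sync.Mutex
	queue []rangeID
}

// enqueueBatch adds the ranges from one coalesced heartbeat, locking the
// mutex once per chunk of IDs rather than once per range, so a heartbeat
// covering 10k ranges locks the scheduler ~80 times instead of 10k times.
func (s *raftScheduler) enqueueBatch(ids []rangeID) {
	for len(ids) > 0 {
		chunk := ids
		if len(chunk) > enqueueChunkSize {
			chunk = chunk[:enqueueChunkSize]
		}
		ids = ids[len(chunk):]

		s.mu.Lock()
		s.queue = append(s.queue, chunk...)
		s.mu.Unlock()
		// Worker wake-up signaling is omitted in this sketch.
	}
}
```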

#### kv: prioritize NodeLiveness Range in Raft scheduler

In #56851 and in many other investigations, we've seen cases where the NodeLiveness Range has a hard time performing writes when a system is under heavy load. We [already split RPC traffic into two classes](#39172), ensuring that NodeLiveness traffic does not get stuck behind traffic on user ranges. However, to this point, it was still possible for the NodeLiveness range to get stuck behind other Ranges in the Raft scheduler, leading to high scheduling latency for Raft operations.

This commit addresses this by prioritizing the NodeLiveness range above all others in the Raft scheduler. This prioritization mechanism is naive, but should be effective. It should also not run into any issues with fairness or starvation of other ranges, as such starvation is not possible as long as the scheduler concurrency (8*num_cpus) is above the number of high priority ranges (1).
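
A naive sketch of that prioritization, reusing the simplified raftScheduler from the previous sketch: a worker always prefers the single designated high-priority range if it is queued. The real scheduler tracks per-range state rather than a plain slice.

```go
// pickNext pops the next range for a worker to process, always preferring
// the designated high-priority range (node liveness) if it is queued.
func (s *raftScheduler) pickNext(priorityID rangeID) (rangeID, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, id := range s.queue {
		if id == priorityID {
			s.queue = append(s.queue[:i], s.queue[i+1:]...)
			return id, true
		}
	}
	if len(s.queue) == 0 {
		return 0, false
	}
	id := s.queue[0]
	s.queue = s.queue[1:]
	return id, true
}
```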

@ajwerner I'm adding you here specifically because we've talked about the need for something like the last commit a few times.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>