
roachtest: add tpccoverload to exercise a node-liveness failure scenario #39167

Merged

merged 1 commit into cockroachdb:master from ajwerner/overload-roachtest on Aug 2, 2019

Conversation


@ajwerner ajwerner commented Jul 30, 2019

This roachtest combines the tpcc dataset with a contrived query to highlight
a scenario in which node liveness is currently starved.

This test is skipped, as it currently fails on master. It is important to note
that after separating the TCP connection used for node liveness from the rest
of the traffic (#39172), this test passes reliably.

Release note: None
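For context, here is a minimal, self-contained sketch of the pattern such a test exercises (this is not the actual roachtest code): saturate the cluster with a large OLAP-style scan over the TPC-C dataset from many concurrent workers so that node-liveness heartbeats starve. The query text, connection string, and concurrency below are illustrative, not the test's actual values.

```go
// Hypothetical sketch of the overload pattern described above.
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Contrived query: a full scan plus aggregation over the largest
	// TPC-C table, chosen to burn CPU. The real test's query may differ.
	const olapQuery = `SELECT sum(ol_amount) FROM order_line WHERE ol_quantity > 0`

	const concurrency = 64 // illustrative
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, err := db.ExecContext(context.Background(), olapQuery); err != nil {
					log.Print(err)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```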


@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):

	url := "http://" + adminURLs[0] + "/ts/query"
	now := timeutil.Now()
	request := tspb.TimeSeriesQueryRequest{

Can we simplify this with getMetrics?
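For readers unfamiliar with the endpoint the excerpt hits, here is a rough self-contained sketch of what this code does: POST a time-series query as JSON to a node's /ts/query admin endpoint. The JSON field names and the metric name are assumptions, not copied from the test; the getMetrics helper the reviewer mentions presumably wraps exactly this request-building, which is the simplification being asked for.

```go
// Hedged sketch of querying CockroachDB's /ts/query endpoint directly.
// Field names assume the proto's snake_case JSON encoding; adjust if the
// server expects camelCase.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	now := time.Now()
	request := map[string]interface{}{
		"start_nanos": now.Add(-time.Minute).UnixNano(),
		"end_nanos":   now.UnixNano(),
		"queries": []map[string]string{
			// Illustrative metric; the test queries liveness-related series.
			{"name": "cr.node.liveness.heartbeatfailures"},
		},
	}
	body, err := json.Marshal(request)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:8080/ts/query",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```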


pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):

func registerTPCCOverloadSpec(r *testRegistry, s tpccOverloadSpec) {
	name := fmt.Sprintf("tpccoverload/nodes=%d/cpu=%d/w=%d/c=%d",

nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...

@ajwerner ajwerner force-pushed the ajwerner/overload-roachtest branch from 74bcf65 to cf37e12 on August 2, 2019 18:08
@ajwerner ajwerner left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we simplify this with getMetrics?

Done.


pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...

I want it to be clear that this isn't actually running tpcc, but I agree with your prefix idea. I made it overload/tpcc_olap/.... How do you feel about that?
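A tiny sketch of the naming scheme agreed on here; the spec type and field names are assumptions for illustration, not the actual ones in tpcc_overload.go.

```go
package main

import "fmt"

// tpccOverloadSpec stands in for the test's spec struct; the field names
// here are illustrative.
type tpccOverloadSpec struct {
	Nodes, CPUs, Warehouses, Concurrency int
}

func main() {
	s := tpccOverloadSpec{Nodes: 3, CPUs: 16, Warehouses: 100, Concurrency: 64}
	// Agreed in the thread: an "overload/" suite prefix with "tpcc_olap" to
	// make clear the test is not running the TPC-C workload itself.
	name := fmt.Sprintf("overload/tpcc_olap/nodes=%d/cpu=%d/w=%d/c=%d",
		s.Nodes, s.CPUs, s.Warehouses, s.Concurrency)
	fmt.Println(name) // overload/tpcc_olap/nodes=3/cpu=16/w=100/c=64
}
```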


ajwerner commented Aug 2, 2019

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Aug 2, 2019
39167: roachtest: add tpccoverload to exercise a node-liveness failure scenario r=nvanbenschoten a=ajwerner


Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

craig bot commented Aug 2, 2019

Build succeeded

@craig craig bot merged commit cf37e12 into cockroachdb:master Aug 2, 2019
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39172: storage,nodedialer,kv,rpc: introduce connection class to separate system traffic r=ajwerner a=ajwerner

**Motivation**

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

**Impact**

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node, 16-CPU (n1-standard-16), single-region cluster with 100 TPCC warehouses loaded, running the query from the above roachtest, both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

![image](https://user-images.githubusercontent.com/1839234/62215027-2764b380-b374-11e9-9640-9bf5d17b21b9.png)

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that a single liveness failure did occur during the run with this change.

**PR Layout**

The PR itself comes in 5 commits:

1) Introduce `ConnectionClass` to the `rpc` package and `nodedialer` (see the sketch after this list)
2) Add testing knobs to inject interceptors in the `rpc` package
3) Adopt `ConnectionClass` in the `kv` package
4) Adopt `ConnectionClass` in the `storage` package
5) Add testing for the storage package change
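A minimal sketch of the connection-class idea from commit 1, following the PR's description; the exact definitions and doc comments here are assumptions, not copied from pkg/rpc.

```go
// Sketch: a small enum that keys the dialer's connection pool, so that
// latency-sensitive system traffic (e.g. node liveness) gets its own TCP
// connection instead of being head-of-line blocked behind bulk traffic
// such as large OLAP scans.
package rpc

// ConnectionClass identifies a category of traffic that should be isolated
// onto its own underlying connection. A dialer can then cache connections
// keyed by (node, class) rather than by node alone.
type ConnectionClass int8

const (
	// DefaultClass carries ordinary RPC traffic.
	DefaultClass ConnectionClass = iota
	// SystemClass carries traffic for system ranges such as node liveness.
	SystemClass
)
```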

The change would probably benefit from a unit test in the `kv` package, which I'll write today, but I wanted to start the review process in the meantime.

Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial `SystemClass` connections to nodes which at one point were the leaseholder for, or held a replica of, a system range but no longer do. To mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.
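One way the planned idle-connection detection might look; this is purely hypothetical, as the PR explicitly defers it to a follow-up.

```go
// Hypothetical idle-connection reaper: record last use per connection and
// periodically close connections idle longer than a threshold. Names and
// structure are illustrative only.
package rpc

import (
	"sync"
	"time"
)

type trackedConn struct {
	lastUse time.Time
	close   func() // closes the underlying TCP connection
}

type idleReaper struct {
	mu      sync.Mutex
	conns   map[string]*trackedConn // keyed by remote address
	timeout time.Duration
}

// touch records that the connection to addr was just used.
func (r *idleReaper) touch(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.conns[addr]; ok {
		c.lastUse = time.Now()
	}
}

// reap closes and forgets any connection idle longer than the timeout;
// it would be called periodically from a background goroutine.
func (r *idleReaper) reap() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for addr, c := range r.conns {
		if time.Since(c.lastUse) > r.timeout {
			c.close()
			delete(r.conns, addr)
		}
	}
}
```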

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>