
roachtest: add tpccoverload to exercise a node-liveness failure scenario #39167

Merged

merged 1 commit into cockroachdb:master from ajwerner/overload-roachtest on Aug 2, 2019

Conversation


@ajwerner ajwerner commented Jul 30, 2019

This roachtest combines the tpcc dataset with a contrived query to highlight
a scenario in which node liveness is currently starved.

This test is skipped, as it currently fails on master. It is important to note
that after separating the TCP connection used for node liveness from the rest
of the traffic (#39172), this test passes reliably.

Release note: None
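For context, here is a minimal, self-contained sketch of the pattern such a test exercises (this is not the actual roachtest code): saturate the cluster with a large OLAP-style scan over the TPC-C dataset from many concurrent workers so that node-liveness heartbeats starve. The query text, connection string, and concurrency below are illustrative, not the test's actual values.

```go
// Hypothetical sketch of the overload pattern described above.
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // assumed Postgres-wire driver
)

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Contrived query: a full scan plus aggregation over the largest
	// TPC-C table, chosen to burn CPU. The real test's query may differ.
	const olapQuery = `SELECT sum(ol_amount) FROM order_line WHERE ol_quantity > 0`

	const concurrency = 64 // illustrative
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				if _, err := db.ExecContext(context.Background(), olapQuery); err != nil {
					log.Print(err)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```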


@nvanbenschoten nvanbenschoten left a comment

Reviewed 2 of 3 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)


pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):

	url := "http://" + adminURLs[0] + "/ts/query"
	now := timeutil.Now()
	request := tspb.TimeSeriesQueryRequest{

Can we simplify this with getMetrics?
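For readers unfamiliar with the endpoint the excerpt hits, here is a rough self-contained sketch of what this code does: POST a time-series query as JSON to a node's /ts/query admin endpoint. The JSON field names and the metric name are assumptions, not copied from the test; the getMetrics helper the reviewer mentions presumably wraps exactly this request-building, which is the simplification being asked for.

```go
// Hedged sketch of querying CockroachDB's /ts/query endpoint directly.
// Field names assume the proto's snake_case JSON encoding; adjust if the
// server expects camelCase.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	now := time.Now()
	request := map[string]interface{}{
		"start_nanos": now.Add(-time.Minute).UnixNano(),
		"end_nanos":   now.UnixNano(),
		"queries": []map[string]string{
			// Illustrative metric; the test queries liveness-related series.
			{"name": "cr.node.liveness.heartbeatfailures"},
		},
	}
	body, err := json.Marshal(request)
	if err != nil {
		log.Fatal(err)
	}
	resp, err := http.Post("http://localhost:8080/ts/query",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```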


pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):

func registerTPCCOverloadSpec(r *testRegistry, s tpccOverloadSpec) {
	name := fmt.Sprintf("tpccoverload/nodes=%d/cpu=%d/w=%d/c=%d",

nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...

@ajwerner ajwerner force-pushed the ajwerner/overload-roachtest branch from 74bcf65 to cf37e12 on August 2, 2019 18:08
@ajwerner ajwerner left a comment

TFTR!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Can we simplify this with getMetrics?

Done.


pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...

I want it to be clear that this isn't actually running tpcc, but I agree with your prefix idea. I made it overload/tpcc_olap/.... How do you feel about that?
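A tiny sketch of the naming scheme agreed on here; the spec type and field names are assumptions for illustration, not the actual ones in tpcc_overload.go.

```go
package main

import "fmt"

// tpccOverloadSpec stands in for the test's spec struct; the field names
// here are illustrative.
type tpccOverloadSpec struct {
	Nodes, CPUs, Warehouses, Concurrency int
}

func main() {
	s := tpccOverloadSpec{Nodes: 3, CPUs: 16, Warehouses: 100, Concurrency: 64}
	// Agreed in the thread: an "overload/" suite prefix with "tpcc_olap" to
	// make clear the test is not running the TPC-C workload itself.
	name := fmt.Sprintf("overload/tpcc_olap/nodes=%d/cpu=%d/w=%d/c=%d",
		s.Nodes, s.CPUs, s.Warehouses, s.Concurrency)
	fmt.Println(name) // overload/tpcc_olap/nodes=3/cpu=16/w=100/c=64
}
```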


ajwerner commented Aug 2, 2019

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Aug 2, 2019
39167: roachtest: add tpccoverload to exercise a node-liveness failure scenario r=nvanbenschoten a=ajwerner


Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>

craig bot commented Aug 2, 2019

Build succeeded

@craig craig bot merged commit cf37e12 into cockroachdb:master Aug 2, 2019
craig bot pushed a commit that referenced this pull request Aug 7, 2019
39172: storage,nodedialer,kv,rpc: introduce connection class to separate system traffic r=ajwerner a=ajwerner

**Motivation**

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

**Impact**

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node, 16-CPU (n1-standard-16), single-region cluster with 100 TPCC warehouses loaded, running the query from the above roachtest, both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

![image](https://user-images.githubusercontent.com/1839234/62215027-2764b380-b374-11e9-9640-9bf5d17b21b9.png)

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that a single liveness failure did occur during the run with this change.

**PR Layout**

The PR itself comes in 5 commits:

1) Introduce `ConnectionClass` to the `rpc` package and `nodedialer` (see the sketch after this list)
2) Add testing knobs to inject interceptors in the `rpc` package
3) Adopt `ConnectionClass` in the `kv` package
4) Adopt `ConnectionClass` in the `storage` package
5) Add testing for the storage package change
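A minimal sketch of the connection-class idea from commit 1, following the PR's description; the exact definitions and doc comments here are assumptions, not copied from pkg/rpc.

```go
// Sketch: a small enum that keys the dialer's connection pool, so that
// latency-sensitive system traffic (e.g. node liveness) gets its own TCP
// connection instead of being head-of-line blocked behind bulk traffic
// such as large OLAP scans.
package rpc

// ConnectionClass identifies a category of traffic that should be isolated
// onto its own underlying connection. A dialer can then cache connections
// keyed by (node, class) rather than by node alone.
type ConnectionClass int8

const (
	// DefaultClass carries ordinary RPC traffic.
	DefaultClass ConnectionClass = iota
	// SystemClass carries traffic for system ranges such as node liveness.
	SystemClass
)
```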

The change would probably benefit from a unit test in the `kv` package, which I'll write today, but I wanted to start the review process in the meantime.

Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial `SystemClass` connections to nodes which at one point were the leaseholder for, or held a replica of, a system range but no longer do. To mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.
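One way the planned idle-connection detection might look; this is purely hypothetical, as the PR explicitly defers it to a follow-up.

```go
// Hypothetical idle-connection reaper: record last use per connection and
// periodically close connections idle longer than a threshold. Names and
// structure are illustrative only.
package rpc

import (
	"sync"
	"time"
)

type trackedConn struct {
	lastUse time.Time
	close   func() // closes the underlying TCP connection
}

type idleReaper struct {
	mu      sync.Mutex
	conns   map[string]*trackedConn // keyed by remote address
	timeout time.Duration
}

// touch records that the connection to addr was just used.
func (r *idleReaper) touch(addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.conns[addr]; ok {
		c.lastUse = time.Now()
	}
}

// reap closes and forgets any connection idle longer than the timeout;
// it would be called periodically from a background goroutine.
func (r *idleReaper) reap() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for addr, c := range r.conns {
		if time.Since(c.lastUse) > r.timeout {
			c.close()
			delete(r.conns, addr)
		}
	}
}
```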

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>