roachtest: add tpccoverload to exercise a node-liveness failure scenario #39167
Conversation
Reviewed 2 of 3 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner and @nvanbenschoten)
pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):
url := "http://" + adminURLs[0] + "/ts/query"
now := timeutil.Now()
request := tspb.TimeSeriesQueryRequest{
Can we simplify this with getMetrics?
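For context, here is a minimal, self-contained sketch of the hand-rolled time-series request pattern being discussed. The `tspb` and `httputil` helpers are real CockroachDB packages, but the metric name, time window, and address are assumptions for illustration, not the test's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/cockroachdb/cockroach/pkg/ts/tspb"
	"github.com/cockroachdb/cockroach/pkg/util/httputil"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// queryLiveness issues a raw /ts/query request against a node's admin URL.
// This is the hand-rolled pattern the review suggests replacing with the
// existing getMetrics helper; the metric name and time window here are
// illustrative, not the test's actual values.
func queryLiveness(adminURL string) (*tspb.TimeSeriesQueryResponse, error) {
	url := "http://" + adminURL + "/ts/query"
	now := timeutil.Now()
	request := tspb.TimeSeriesQueryRequest{
		StartNanos: now.Add(-10 * time.Minute).UnixNano(),
		EndNanos:   now.UnixNano(),
		Queries: []tspb.Query{
			{Name: "cr.node.liveness.heartbeatfailures"},
		},
	}
	var response tspb.TimeSeriesQueryResponse
	if err := httputil.PostJSON(http.Client{}, url, &request, &response); err != nil {
		return nil, err
	}
	return &response, nil
}

func main() {
	// "localhost:8080" assumes CockroachDB's default HTTP (admin UI) port.
	resp, err := queryLiveness("localhost:8080")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	fmt.Println("series returned:", len(resp.Results))
}
```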
pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):
func registerTPCCOverloadSpec(r *testRegistry, s tpccOverloadSpec) {
	name := fmt.Sprintf("tpccoverload/nodes=%d/cpu=%d/w=%d/c=%d",
nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...
This roachtest combines the tpcc dataset with a contrived query to highlight a scenario in which node liveness is currently starved. This test is skipped as it currently fails on master. It is important to note that after separating the TCP connection used for node liveness from the rest of traffic this test passes.

Release note: None
74bcf65 to cf37e12
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/cmd/roachtest/tpcc_overload.go, line 81 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Can we simplify this with getMetrics?
Done.
pkg/cmd/roachtest/tpcc_overload.go, line 114 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: I expect that we'll have a suite of these "overload" tests, so consider naming this overload/tpcc/...
I want it to be clear that this isn't actually running tpcc, but I agree with your prefix idea. I made it overload/tpcc_olap/... instead. How do you feel about that?
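For illustration, a small sketch of how the agreed-on overload/tpcc_olap/... prefix might look in the registration code; the spec fields and the sample values are assumptions, not the file's actual contents:

```go
package main

import "fmt"

// tpccOverloadSpec mirrors the shape suggested by the snippet quoted at
// line 114; its fields are assumptions for this sketch.
type tpccOverloadSpec struct {
	Nodes, CPUs, Warehouses, Concurrency int
}

func main() {
	s := tpccOverloadSpec{Nodes: 3, CPUs: 16, Warehouses: 100, Concurrency: 64}
	// The renamed prefix discussed in the thread: overload/tpcc_olap/...
	name := fmt.Sprintf("overload/tpcc_olap/nodes=%d/cpu=%d/w=%d/c=%d",
		s.Nodes, s.CPUs, s.Warehouses, s.Concurrency)
	fmt.Println(name) // overload/tpcc_olap/nodes=3/cpu=16/w=100/c=64
}
```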
bors r=nvanbenschoten
39167: roachtest: add tpccoverload to exercise a node-liveness failure scenario r=nvanbenschoten a=ajwerner

This roachtest combines the tpcc dataset with a contrived query to highlight a scenario in which node liveness is currently starved. This test is skipped as it currently fails on master. It is important to note that after separating the TCP connection used for node liveness from the rest of traffic (#39172) this test passes reliably.

Release note: None

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
Build succeeded
39172: storage,nodedialer,kv,rpc: introduce connection class to separate system traffic r=ajwerner a=ajwerner

**Motivation**

This PR was primarily motivated by #35535. In that specific case, and in other reproducible workloads, we observe that large volumes of traffic, primarily due to large OLAP-style table scans with filtering and sorting, can lead to massive increases in node-liveness latency. The roachtest added in #39167 demonstrates this scenario. That roachtest fails every time on master and has succeeded for 10 consecutive runs with this change.

**Impact**

Below we have charts showing the node liveness failure rate, latency, and CPU utilization on a 3-node, 16-CPU (n1-standard-16), single-region cluster with 100 TPCC warehouses loaded, using the query from the above roachtest both before and after this change. We can see that the former suffers from node liveness failures and high node liveness latency while the latter does not.

![image](https://user-images.githubusercontent.com/1839234/62215027-2764b380-b374-11e9-9640-9bf5d17b21b9.png)

The dip in CPU in the middle of the chart is where the binary is switched from master to a build with this PR. It is clear that before this change node liveness failed at a high rate, leading to much poorer CPU utilization. It is worth noting that a single liveness failure occurred during the run with this change.

**PR Layout**

The PR itself comes in 5 commits:

1) Introduce `ConnectionClass` to the `rpc` package and `nodedialer`
2) Add testing knobs to inject interceptors in the `rpc` package
3) Adopt `ConnectionClass` in the `kv` package
4) Adopt `ConnectionClass` in the `storage` package
5) Add testing for the `storage` package change

The change would probably benefit from a unit test in the `kv` package, which I'll type today, but I wanted to start the review process. Another point of conversation is that this change will double the number of TCP connections between nodes for any reasonably sized cluster. Today we never close opened connections, and on large clusters we'll potentially dial `SystemClass` connections to nodes which at one point were the leaseholder for, or held a replica of, a system range but no longer do. In order to mitigate these superfluous connections, a follow-up PR will add logic to detect and close idle connections.

Co-authored-by: Andrew Werner <ajwerner@cockroachlabs.com>
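To make the `ConnectionClass` idea above concrete, here is a minimal sketch of the concept described in the PR: system-critical traffic such as node liveness dials a distinct TCP/gRPC connection so it cannot be starved by bulk KV traffic. The names echo the PR description, but the exact API shown is an assumption, not the PR's code:

```go
package main

import "fmt"

// ConnectionClass identifies which of a node's connections to a remote peer a
// request should use. Separating classes gives system-critical traffic (node
// liveness in particular) its own TCP connection, isolating it from bulk
// OLAP-style scan traffic.
type ConnectionClass int8

const (
	// DefaultClass carries ordinary KV/SQL traffic.
	DefaultClass ConnectionClass = iota
	// SystemClass carries traffic addressed to system ranges, most
	// importantly node-liveness heartbeats.
	SystemClass
)

func (c ConnectionClass) String() string {
	if c == SystemClass {
		return "system"
	}
	return "default"
}

// classForKey is a hypothetical helper: requests addressed to the system
// keyspace use SystemClass; everything else uses DefaultClass.
func classForKey(isSystemKey bool) ConnectionClass {
	if isSystemKey {
		return SystemClass
	}
	return DefaultClass
}

func main() {
	// A dialer keyed on (nodeID, class) would maintain up to two connections
	// per peer, which is the doubling of TCP connections the PR discusses.
	fmt.Println(classForKey(true), classForKey(false)) // system default
}
```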
This roachtest combines the tpcc dataset with a contrived query to highlight
a scenario in which node liveness is currently starved.
This test is skipped as it currently fails on master. It is important to note
that after separating the TCP connection used for node liveness from the rest
of traffic (#39172) this test passes reliably.
Release note: None
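As an illustration of what "a contrived query" over the tpcc dataset might look like in practice, here is a hedged sketch of a worker loop that saturates the cluster with an OLAP-style scan at high concurrency. The SQL text, connection string, and worker count are assumptions for the sketch, not the test's actual query:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"sync"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

// olapQuery is an illustrative CPU-heavy scan with filtering, aggregation,
// and sorting over the tpcc dataset; the real test's query may differ.
const olapQuery = `
SELECT ol_number, sum(ol_amount) AS total
FROM tpcc.order_line
WHERE ol_amount > 1
GROUP BY ol_number
ORDER BY total DESC`

func main() {
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	const concurrency = 64 // assumed worker count
	var wg sync.WaitGroup
	for i := 0; i < concurrency; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Each iteration forces a large scan, sort, and aggregation,
				// generating the sustained load that starves node liveness
				// when system traffic shares the same TCP connections.
				if _, err := db.ExecContext(context.Background(), olapQuery); err != nil {
					log.Print(err)
					return
				}
			}
		}()
	}
	wg.Wait()
}
```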