roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [overload] #61974
Comments
Our line searcher gets too aggressive towards the end, pushing the cluster deep into overloaded territory, where we're then subject to #62010. We should probably amend this test with a higher estimated warehouse count. It's currently at 2500, which is evidently too low.

Hmm, roachperf shows a ton of variability for this test, so maybe that won't fix the failure. I wonder if, as a stop-gap for #62010, we should stop our search at whatever the max warehouse count is (and of course make sure we re-configure the max warehouses accordingly). I think it makes sense for these tests to be PASS/FAIL for a given warehouse count and, on failure, to capture the extent of the regression, rather than to capture how high we can go (where by design we risk going into overload territory with no protections against it, like the ones #62010 proposes).
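To make the PASS/FAIL proposal concrete, here is a minimal sketch in Go. It is not the roachtest code: `runFn`, `checkBaseline`, the fixed step size, and the simulated cluster are all hypothetical names and values chosen for illustration. The test runs once at the configured max warehouse count and probes lower counts only if that run fails.

```go
package main

import "fmt"

// runFn is a hypothetical stand-in for running TPC-C at a given warehouse
// count and reporting whether the run met the passing criteria.
type runFn func(warehouses int) (pass bool)

// checkBaseline runs once at the configured max warehouse count. Only on
// failure does it probe lower counts (in fixed steps) to find the highest
// count that still passes, which captures the extent of the regression.
func checkBaseline(run runFn, maxWarehouses, step int) (passed bool, sustained int) {
	if run(maxWarehouses) {
		return true, maxWarehouses
	}
	for w := maxWarehouses - step; w > 0; w -= step {
		if run(w) {
			return false, w
		}
	}
	return false, 0
}

func main() {
	// Simulated cluster that sustains up to 2100 warehouses (made-up number).
	run := func(w int) bool { return w <= 2100 }
	passed, sustained := checkBaseline(run, 2500, 100)
	fmt.Printf("passed=%t sustained=%d\n", passed, sustained)
}
```

With this shape the search never goes above the configured baseline, so a healthy build is never pushed into overload by the test itself.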
Here's a random "successful" run. The left y-axis is tpmC, the x-axis is TPC-C warehouse count, and the right y-axis is efficiency. The cliff is there because we're overloading the cluster to the point where one of the nodes just OOMs or disappears. That feels like a bad way to run a benchmark? At least we can agree it's a bad idea to do it this way without #62010, since the node failure could end up tripping up the testing infrastructure as well.
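For readers unfamiliar with the efficiency axis, the sketch below shows one common way to compute it: measured tpmC divided by the theoretical ceiling for the warehouse count. The constant of 12.86 tpmC per warehouse is an assumption taken from the usual reading of the TPC-C specification, not from this thread, and the constant the test itself uses may differ slightly. The function name and the sample numbers are illustrative only.

```go
package main

import "fmt"

// maxTpmCPerWarehouse is an assumed ceiling on tpmC per warehouse (the
// commonly cited TPC-C limit); the value used by the test may differ.
const maxTpmCPerWarehouse = 12.86

// efficiency returns measured tpmC as a percentage of the theoretical
// maximum for the given warehouse count.
func efficiency(tpmC float64, warehouses int) float64 {
	if warehouses <= 0 {
		return 0
	}
	return 100 * tpmC / (maxTpmCPerWarehouse * float64(warehouses))
}

func main() {
	// Made-up measurement: 25720 tpmC at 2500 warehouses.
	fmt.Printf("%.1f%%\n", efficiency(25720, 2500)) // prints 80.0%
}
```

On a plot like the one described, an overloaded cluster shows up as tpmC falling while the warehouse count rises, so efficiency drops sharply rather than gradually.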
Things got worse around Feb 21 and it's been pretty spotty since (though improving after #61777?).
Fixes cockroachdb#61973.

With tracing, our top-of-line TPC-C performance took a hit. Given that the TPC-C line searcher starts off at the estimated max, we're now starting off in "overloaded" territory; this makes for a very unhappy roachtest. Ideally we'd have something like cockroachdb#62010, or even admission control, to make this test less noisy. Until then we can start off at a lower max warehouse count.

This "fix" is still not a panacea: the entire tpccbench suite as written tries to nudge the warehouse count until the efficiency is sub-85%. Unfortunately, with our current infrastructure that's a stand-in for "the point where nodes are overloaded and VMs are no longer reachable". See cockroachdb#61974.

---

A longer-term approach to these tests could instead be as follows. We could start our search at whatever the max warehouse count is (making sure we've re-configured the max warehouses accordingly). These tests could then PASS/FAIL for that given warehouse count and, only on FAIL, capture the extent of the regression by probing lower warehouse counts. This is in contrast to what we're doing today, where we capture how high we can go (and by design risk going into overload territory, with no protections for it).

Doing so lets us use this test suite to capture regressions from a given baseline, rather than hoping our roachperf dashboards capture unexpected perf improvements (if they're expected, we should update max warehouses accordingly). In the steady state, we should want the roachperf dashboards to be mostly flatlined, with step-increases when we're re-upping the max warehouse count to incorporate various system-wide performance increases.

Release note: None
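The behaviour described in this commit message can be illustrated with a simplified sketch. This is not the line searcher the roachtest uses: `measureFn`, `searchUp`, the step size, the ceiling, and the efficiency curve are hypothetical, and only the 85% threshold comes from the text above. It shows why the starting estimate matters: if the start is already past the cliff, the very first run lands in overload.

```go
package main

import "fmt"

// passingEfficiency reflects the "sub-85%" stopping condition described above.
const passingEfficiency = 85.0

// measureFn is a hypothetical stand-in for running TPC-C at a warehouse
// count and returning the observed efficiency in percent.
type measureFn func(warehouses int) float64

// searchUp starts at the estimated max and raises the warehouse count by
// step until efficiency drops below passingEfficiency or the ceiling is
// reached. It returns the highest passing count, or 0 if the start failed.
func searchUp(measure measureFn, start, step, ceiling int) int {
	best := 0
	for w := start; w <= ceiling; w += step {
		if measure(w) < passingEfficiency {
			break
		}
		best = w
	}
	return best
}

func main() {
	// Made-up efficiency curve with a cliff past 2300 warehouses.
	measure := func(w int) float64 {
		if w > 2300 {
			return 40
		}
		return 97
	}
	fmt.Println(searchUp(measure, 2000, 100, 3000)) // prints 2300
	fmt.Println(searchUp(measure, 2500, 100, 3000)) // prints 0: start is already overloaded
}
```

Note that even with a good starting point, the last run in this sketch is always one that fails the threshold, which matches the concern that the search ends in overload by design.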
62015: cli: Add some more warning comments to unsafe-remove-dead-replicas r=knz a=bdarnell

The comments always said this tool was meant to be used with the supervision of a CRL engineer, but didn't otherwise make the risks and downsides clear. Add some more explicit warnings which can also serve as guidance for the supervising engineer.

Release note: None

62039: roachtest: stabilize tpccbench/nodes=3/cpu=16 r=irfansharif a=irfansharif

Fixes #61973.

Release note: None

Co-authored-by: Ben Darnell <ben@cockroachlabs.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@ee9f47b9ec9476a693464e2dcd09a01bf9d39ad2:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@3d19b2cf6b290a152b23722fc32e995eed3b437b:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@53bf501e233c337b9863755914d9c00010517329:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@9fa4b125bfb07552b43ba4fd52c9301afd7a937b:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@3cfe2a38044b9e0d47b09815658e8634e4f4bfda:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d891594d3c998f153b88f631e3c89ac7d12c2a6e:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@ed698aecdf0715c4edb91a9617bcc5df45f7ccde:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d145e9fc02064a8b6b4179b5af7da5238b192f74:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
See this test on roachdash |
n3 OOMed; unfortunately, we timed out fetching the heap profiles. #62361 would help.
(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@4d44ddf24153d8ef8e0a996fdbe75ac5607f9574:
Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:
roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #61718 (labels: C-test-failure O-roachtest O-robot branch-release-21.1 release-blocker)
roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #59044 (labels: C-test-failure O-roachtest O-robot branch-release-20.1)
See this test on roachdash
powered by pkg/cmd/internal/issues