roachtest: stabilize tpccbench/nodes=3/cpu=16 #62039
Conversation
Fixes cockroachdb#61973. With tracing, our top-of-line TPC-C performance took a hit. Given that the TPC-C line searcher starts off at the estimated max, we're now starting off in "overloaded" territory; this makes for a very unhappy roachtest.

Ideally we'd have something like cockroachdb#62010, or even admission control, to make this test less noisy. Until then we can start off at a lower max warehouse count.

This "fix" is still not a panacea: the entire tpccbench suite as written tries to nudge the warehouse count up until the efficiency drops below 85%. Unfortunately, with our current infrastructure that's a stand-in for "the point where nodes are overloaded and VMs are no longer reachable". See cockroachdb#61974.

---

A longer-term approach to these tests could instead be as follows. We could start our search at whatever the max warehouse count is (making sure we've re-configured the max warehouses accordingly). These tests could then PASS/FAIL for that given warehouse count, and only on FAIL capture the extent of the regression by probing lower warehouse counts. This is in contrast to what we're doing today, where we capture how high we can go (and by design risk going into overload territory, with no protections against it).

Doing so lets us use this test suite to capture regressions from a given baseline, rather than hoping our roachperf dashboards capture unexpected perf improvements (if they're expected, we should update the max warehouses accordingly). In the steady state, we'd want the roachperf dashboards to be mostly flatlined, with step increases whenever we re-up the max warehouse count to incorporate various system-wide performance improvements.

Release note: None
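For illustration, here is a minimal Go sketch contrasting the two modes discussed above: the current behavior of searching down from an estimated max, and the proposed PASS/FAIL mode that only probes lower counts after a failure. The helper `runAtWarehouses`, the 2100 tipping point, and the step sizes are hypothetical stand-ins, not the actual roachtest line searcher.

```go
package main

import "fmt"

// runAtWarehouses stands in for one tpccbench run at a given warehouse
// count; it reports whether the run met the efficiency bar. The 2100
// tipping point is a made-up number for illustration only.
func runAtWarehouses(warehouses int) bool {
	return warehouses <= 2100
}

// searchForMax mimics today's behavior: start at the estimated max, walk
// down until a run passes, then narrow in on the highest passing count.
func searchForMax(estimatedMax, step int) int {
	pass := estimatedMax
	for pass > 0 && !runAtWarehouses(pass) {
		pass -= step
	}
	fail := pass + step
	for fail-pass > 1 {
		mid := (pass + fail) / 2
		if runAtWarehouses(mid) {
			pass = mid
		} else {
			fail = mid
		}
	}
	return pass
}

// probeRegression sketches the proposed mode: run once at the configured
// max and report PASS/FAIL; only on failure, walk down to quantify what
// load the cluster can still sustain below the baseline.
func probeRegression(maxWarehouses, step int) (pass bool, sustained int) {
	if runAtWarehouses(maxWarehouses) {
		return true, maxWarehouses
	}
	for w := maxWarehouses - step; w > 0; w -= step {
		if runAtWarehouses(w) {
			return false, w
		}
	}
	return false, 0
}

func main() {
	fmt.Println("search-for-max finds:", searchForMax(2500, 200))
	ok, sustained := probeRegression(2200, 100)
	fmt.Printf("pass/fail mode: pass=%t, sustained=%d warehouses\n", ok, sustained)
}
```

The appeal of the second mode is that a passing run is cheap and never strays into overload, while a failing run still yields a number quantifying the regression.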
This is a mildly concerning change to be making as we ramp up for a release, though. Is the intention that we drop the expected warehouse count for now and then increase it once we mitigate the performance regression? Or is the intention that we're going to eat the 5% throughput hit in this release?
Re. the longer-term approach proposal, I'm torn. I'd hate to see our performance benchmarks lose their ability to capture performance improvements. It does surprise me that we're reliably seeing nodes so overloaded by an incremental 5% load that they collapse to the point where a VM is no longer reachable. That doesn't sound right to me from a system stability standpoint and, anecdotally, I don't recall seeing this for most of tpccbench's life. Are we tickling some new failure mode here, perhaps due to excessive memory pressure?
I'm also a bit confused here. Over in #61973, it sounds like you're saying that the remaining tracing is unlikely to be the culprit here, but it doesn't seem that we actually ruled it out (by doing a few runs at 2200 with the sample rate set to 0.00). Now in this PR, it sounds like you've established that tracing is responsible for the regression. Can you provide clarity about which one it is? I think we need to test 2200 warehouses on GCE with a zero sample rate and if that is relatively stable, we blame tracing; otherwise, we exclude tracing as the primary culprit. But either way, I think we want to spend some time trying to nurse this back to healthy at 2200.
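As an aside, the experiment suggested above boils down to an A/B comparison at a fixed warehouse count. Below is a hedged Go sketch of that decision procedure; `runTPCCAt2200`, the trial count, and the simulated outcomes are hypothetical stand-ins, and in practice the sample-rate knob would be set on the cluster itself rather than in code like this.

```go
package main

import "fmt"

// runTPCCAt2200 stands in for a full 2200-warehouse tpccbench run with trace
// sampling either enabled or zeroed out; it reports whether the run stayed
// healthy. Hypothetical placeholder for illustration.
func runTPCCAt2200(samplingEnabled bool) bool {
	// Pretend runs without sampling are stable and runs with it are not.
	return !samplingEnabled
}

// blameTracing runs a handful of trials in each configuration and blames
// tracing only if the zero-sample-rate runs are markedly more stable.
func blameTracing(trials int) bool {
	stableOff, stableOn := 0, 0
	for i := 0; i < trials; i++ {
		if runTPCCAt2200(false) {
			stableOff++
		}
		if runTPCCAt2200(true) {
			stableOn++
		}
	}
	return stableOff > stableOn
}

func main() {
	if blameTracing(3) {
		fmt.Println("zero sample rate is stable at 2200: tracing is the likely culprit")
	} else {
		fmt.Println("still unstable with sampling off: look elsewhere for the regression")
	}
}
```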
I think this failure mode has been around for a while; we just didn't always recognize it as such. I agree, though, that it's odd how tpccbench can trigger this at will. I wonder if there's some cascading behavior that we're not considering, either in tracing, or txn heartbeats, or something like that.
Here are a few 2200-warehouse runs with
I'm not sure how to model this kind of regression, but I don't think I'd call it a 5% throughput hit. What's happening is that there's an increase in memory usage due to tracing, and because we're running at the limit, that increase is enough to cause the nodes to either OOM themselves or render the gcloud VMs inoperable. This could be due to the memory pressure alone, or something cascading from it. The lower efficiency and max warehouse numbers are in some sense a misnomer: what's really happening is that the workload is being run in degraded mode, where instead of being able to use 3 nodes in the cluster, it's using 2, and it reports a lower efficiency number because of it.

I agree that we should try to nurse the max warehouse count back to 2200; what I'm saying is that for now we should keep it at 2100 so the test at least doesn't misbehave the way it does today, where it immediately plunges into overload territory without being able to recover (because the VMs are inaccessible, it can't re-run with a lower warehouse count - the VMs appear legitimately dead to the roachtest).

As for the cascading failure mode when we're running at the limit, that's worthy of investigation in its own right. Is it worth doing now? I'm sure there will be a ton of interesting threads to pull on when we run CRDB at the limit, but given that the GCE VMs really are inoperable (see below: my attempt to bounce the cluster has been spinning for 30m now), I don't think it's worth doing right now.
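To make the "misnomer" point concrete, here's a small Go sketch of how TPC-C efficiency is derived (observed tpmC relative to the theoretical ~12.86 tpmC per warehouse). The throughput numbers are purely illustrative, not measurements from these runs; the point is that losing one of three nodes drags the reported efficiency below the 85% bar even if the surviving nodes are fine.

```go
package main

import "fmt"

// maxTpmCPerWarehouse is the theoretical TPC-C maximum of ~12.86 new-order
// transactions per minute per warehouse.
const maxTpmCPerWarehouse = 12.86

// efficiency returns the TPC-C efficiency percentage for an observed tpmC
// at a given warehouse count.
func efficiency(tpmC float64, warehouses int) float64 {
	return 100 * tpmC / (float64(warehouses) * maxTpmCPerWarehouse)
}

func main() {
	const warehouses = 2200
	healthy := efficiency(27000, warehouses) // all three nodes serving load (illustrative tpmC)
	degraded := healthy * 2.0 / 3.0          // roughly what's left when one of three nodes is out
	fmt.Printf("healthy: %.1f%%, degraded: %.1f%%\n", healthy, degraded)
}
```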
Discussed in the team meeting. This patch is still a stopgap, but it's worth merging to get a better sense of TPC-C stability with 21.1 (will backport). I'll file a separate issue to investigate why the GCE VMs disappear at the limit. Good thing (?) we can reliably do that with TPC-C 2500 + v21.1.

bors r+
Build succeeded.
See cockroachdb#62039. `tpccbench`, by design, pushes CRDB into overload territory. The test harness handles nodes crashing or tpmC tanking well. However, it was not prepared to handle the cloud VMs going unresponsive for ~minutes, which is one common failure mode.

This commit tweaks the line search to be resilient to failures to communicate with the cloud VM in the one place where it matters (stopping the cluster at the beginning of a new search attempt). The hope is that this will allow the search to run to completion, even in the face of overload-imposed temporary VM outages. It is not expected to do this reliably, but at least anecdotally most VMs seem to come back a few minutes in.

Release note: None
See cockroachdb#62039. `tpccbench`, by design, pushes CRDB into overload territory. However, tpccbench was not handling the resulting conditions at all: any node death or stalling VM would fail the whole test run, rather than retrying with a lower warehouse count.

This commit makes most errors "recoverable" in the sense that they simply treat the run as failing but continue the line search. Exceptions are communicated via `t.Fatal`, which aborts the whole run instead. We also make the `c.Stop()` step at the beginning of each search attempt resilient to VMs browning out under memory pressure (from the previous run), by patiently retrying for a few minutes.

The hope is that this will allow the search to run to completion, even in the face of overload-imposed temporary VM outages. It is not expected to do this perfectly - after all, VMs don't generally return from brown-out within any fixed time period - but at least anecdotally most VMs seem to come back a few minutes in.

Release note: None
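A minimal, self-contained Go sketch of the retry-until-deadline idea described above follows; `stopCluster`, the backoff interval, and the simulated failures are hypothetical stand-ins for the real roachtest/roachprod plumbing, not the actual implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// stopCluster stands in for the cluster-stop step at the start of a search
// attempt; it fails randomly here to simulate a browned-out VM.
func stopCluster(ctx context.Context) error {
	if rand.Intn(3) != 0 {
		return errors.New("ssh: connect to host: i/o timeout")
	}
	return nil
}

// stopWithRetry keeps retrying the stop for up to the given patience window,
// in the spirit of tolerating temporary VM outages between search attempts.
func stopWithRetry(ctx context.Context, patience time.Duration) error {
	deadline := time.Now().Add(patience)
	var lastErr error
	for time.Now().Before(deadline) {
		if lastErr = stopCluster(ctx); lastErr == nil {
			return nil
		}
		time.Sleep(10 * time.Second) // back off before poking the VM again
	}
	return fmt.Errorf("cluster did not stop within %s: %w", patience, lastErr)
}

func main() {
	if err := stopWithRetry(context.Background(), 5*time.Minute); err != nil {
		fmt.Println("giving up:", err) // would translate to a failed (but recoverable) attempt
		return
	}
	fmt.Println("cluster stopped; proceeding with the next warehouse count")
}
```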
61600: kvserver: make the StoreRebalancer aware of non-voters r=aayushshah15 a=aayushshah15

This commit teaches the `StoreRebalancer` to rebalance non-voting replicas.

Release justification: needed for non-voting replicas
Release note: None

62361: roachtest: attempt to handle VM overload under tpccbench r=irfansharif a=tbg

See #62039. `tpccbench`, by design, pushes CRDB into overload territory. The test harness handles nodes crashing or tpmC tanking well. However, it was not prepared to handle the cloud VMs going unresponsive for ~minutes, which is one common failure mode.

This commit tweaks the line search to be resilient to failures to communicate with the cloud VM in the one place where it matters (stopping the cluster at the beginning of a new search attempt). The hope is that this will allow the search to run to completion, even in the face of overload-imposed temporary VM outages. It is not expected to do this reliably, but at least anecdotally most VMs seem to come back a few minutes in.

Release note: None

62826: sql: add tests for concurrent add/drop region operations r=ajstorm a=arulajmani

This patch generalizes the setup in what was previously `TestConcurrentDropRegion` and extends it to all combinations of add/drop region on a multi-region database. The only change is that I've added a regional by row table into the test setup mixer, so as to exercise the repartitioning semantics.

Previously, there was a limitation with concurrent add/drop regions where both operations were bound to fail in the repartitioning phase. This limitation was fixed in #60620, but we never had a regression test for it. Adding a regional by row table during the test setup serves as one.

Closes #62813
Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>