roachtest: stabilize tpccbench/nodes=3/cpu=16 #62039

Merged
1 commit merged into cockroachdb:master on Mar 16, 2021

Conversation

irfansharif
Contributor

@irfansharif irfansharif commented Mar 15, 2021

Fixes #61973. With tracing, our top-of-the-line TPC-C performance took a
hit. Given that the TPC-C line searcher starts off at the estimated max,
we're now starting off in "overloaded" territory; this makes for a very
unhappy roachtest.

Ideally we'd have something like #62010, or even admission control, to
make this test less noisy. Until then we can start off at a lower max
warehouse count.

This "fix" is still not a panacea, the entire tpccbench suite as written
tries to nudge the warehouse count until the efficiency is sub-85%.
Unfortunately, with our current infrastructure that's a stand-in for
"the point where nodes are overloaded and VMs no longer reachable".
See #61974.
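
For reference, the efficiency criterion works roughly as follows (a minimal
sketch with illustrative names, not the actual roachtest code): TPC-C caps
throughput at 12.86 tpmC per warehouse, and a run only counts as passing if
its measured tpmC clears 85% of that ceiling.

```go
package tpccbench

// maxTpmCPerWarehouse is the theoretical TPC-C throughput ceiling per warehouse.
const maxTpmCPerWarehouse = 12.86

// efficiency is measured throughput relative to the theoretical ceiling at a
// given warehouse count.
func efficiency(tpmC float64, warehouses int) float64 {
	return tpmC / (float64(warehouses) * maxTpmCPerWarehouse)
}

// passes reports whether a run met the 85% bar that the line search uses to
// decide whether to keep pushing the warehouse count higher.
func passes(tpmC float64, warehouses int) bool {
	return efficiency(tpmC, warehouses) >= 0.85
}
```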


A longer-term approach to these tests could instead be as follows.
We could start our search at whatever the max warehouse count is (making
sure we've re-configured the max warehouses accordingly). These tests
would then simply PASS or FAIL at that warehouse count, and only on a FAIL
would they capture the extent of the regression by probing lower warehouse
counts. This is in contrast to what we do today, where we capture how high
we can go (by design risking a drift into overload territory, with no
protections against it).
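
A rough sketch of that flow, with hypothetical helpers (runTPCC stands in for
a single tpccbench run) rather than the current implementation:

```go
package tpccbench

import "fmt"

// runTPCC stands in for driving a single tpccbench run at the given warehouse
// count and reporting whether it met the efficiency bar.
func runTPCC(warehouses int) bool { /* ... */ return false }

// benchAgainstBaseline runs once at the configured max and only probes
// downwards, to quantify the regression, if that run fails.
func benchAgainstBaseline(maxWarehouses int) error {
	if runTPCC(maxWarehouses) {
		return nil // PASS at the baseline; nothing more to do.
	}
	// FAIL: probe lower warehouse counts to report how large the regression is.
	for w := maxWarehouses - 100; w > 0; w -= 100 {
		if runTPCC(w) {
			fmt.Printf("regression: passed at %d, expected %d\n", w, maxWarehouses)
			break
		}
	}
	return fmt.Errorf("failed at configured max of %d warehouses", maxWarehouses)
}
```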

Doing so lets us use this test suite to capture regressions from a given
baseline, rather than hoping our roachperf dashboards capture unexpected
perf improvements (if they're expected, we should update the max warehouses
accordingly). In the steady state, we'd want the roachperf dashboards to be
mostly flatlined, with step increases when we re-up the max warehouse count
to incorporate various system-wide performance improvements.

Release note: None

@irfansharif irfansharif requested review from tbg, nvanbenschoten and a team March 15, 2021 21:09
@cockroach-teamcity
Member

This change is Reviewable

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm: though this is a mildly concerning change to be making as we ramp up for a release. Is the intention that we drop the expected warehouse count for now and then increase it once we mitigate the performance regression? Or is the intention that we're going to eat the 5% throughput hit in this release?

Re. the longer-term approach proposal, I'm torn. I'd hate to see our performance benchmarks lose their ability to capture performance improvements. It does surprise me that we're reliably seeing nodes so overloaded by an incremental 5% load that they collapse to the point where a VM is no longer reachable. That doesn't sound right to me from a system stability standpoint and, anecdotally, I don't recall seeing this for most of tpccbench's life. Are we tickling some new failure mode here, perhaps due to excessive memory pressure?

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

@tbg
Member

tbg commented Mar 16, 2021

I'm also a bit confused here. Over in #61973, it sounds like you're saying that the remaining tracing is unlikely to be the culprit here, but it doesn't seem that we actually ruled it out (by doing a few runs at 2200 with the sample rate set to 0.00). Now in this PR, it sounds like you've established that tracing is responsible for the regression. Can you provide clarity about which one it is? I think we need to test 2200 warehouses on GCE with a zero sample rate and if that is relatively stable, we blame tracing; otherwise, we exclude tracing as the primary culprit. But either way, I think we want to spend some time trying to nurse this back to healthy at 2200.
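
For reference, zeroing out the sampling rate for such a run could look roughly
like this (a hedged sketch; the connection string is illustrative, not taken
from the test harness):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq"
)

func main() {
	// Connection details are illustrative; point this at the test cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	// Disable txn stats sampling so tracing overhead is taken out of the picture.
	if _, err := db.Exec("SET CLUSTER SETTING sql.txn_stats.sample_rate = 0.0"); err != nil {
		log.Fatal(err)
	}
}
```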

It does surprise me that we're reliably seeing nodes so overloaded by an incremental 5% load that they collapse to the point where a VM is no longer reachable.

I think this failure mode has been around for a while; we just didn't always recognize it as such. I agree though that it's odd how tpccbench can trigger this at will. I wonder if there's some cascading behavior that we're not considering, either in tracing, or txn heartbeats, or something like that.

@irfansharif
Contributor Author

Here are a few 2200 warehouse runs with sql.txn_stats.sample_rate = 0.0. I've tried to get more, but given that the cluster plunges into instability and the VMs remain unusable once it does, it's a cumbersome exercise.

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s     6871.7  25.1%  26787.5  16643.0  68719.5  98784.2 103079.2 103079.2

_elapsed_______tpmC____efc__avg(ms)__p50(ms)__p90(ms)__p95(ms)__p99(ms)_pMax(ms)
  600.0s     5709.4  20.9%  36440.3  28991.0  81604.4 103079.2 103079.2 103079.2

is the intention that we're going to eat the 5% throughput hit in this release?

I'm not sure how to model this kind of regression, but I don't think I'd call it a 5% throughput hit. What's happening is that tracing increases memory usage, and because we're already running at the limit, that increase is enough to cause the nodes to either OOM themselves or render the gcloud VMs inoperable. This could be due to the memory pressure alone, or to something cascading from it. The lower efficiency and max-warehouse numbers are in some sense a misnomer: what's really happening is that the workload runs in degraded mode, where instead of being able to use 3 nodes in the cluster it's using 2, and it reports a lower efficiency number because of it.

I agree that we should try to nurse the max warehouse count back to 2200; what I'm saying is that for now we should keep it at 2100, so the test at least doesn't misbehave the way it does today, where it immediately plunges into overload territory without being able to recover (because the VMs are inaccessible, it can't re-run with a lower warehouse count; the VMs appear legitimately dead to the roachtest).

As for the cascading failure mode we hit when running at the limit, that's worthy of investigation in its own right. Is it something worth doing now? I'm sure there will be a ton of interesting threads to pull on when we run crdb at the limit, but given that the GCE VMs really do become inoperable (see below: my attempt to bounce the cluster has been spinning for 30m now), I don't think it's worth doing right now.

$ roachprod stop irfansharif-tpccbench:1-3
irfansharif-tpccbench: stopping and waiting 2/3 /

@irfansharif
Contributor Author

Discussed in team meeting. This patch is still a stopgap, but it's worth merging to get a better sense of TPC-C stability with 21.1 (will backport). I'll file a separate issue to investigate why the GCE VMs disappear at the limit. Good thing (?) we can reliably do that with TPC-C 2500 + v21.1.

bors r+

@craig
Contributor

craig bot commented Mar 16, 2021

Build succeeded:

@craig craig bot merged commit 86d6a45 into cockroachdb:master Mar 16, 2021
@irfansharif irfansharif deleted the 210315.tpccbench-max branch March 16, 2021 16:31
tbg added a commit to tbg/cockroach that referenced this pull request Mar 22, 2021
See cockroachdb#62039.

`tpccbench`, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmc tanking well. However, it was
not prepared to handle the cloud VMs going unresponsive for ~minutes,
which is one common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VM in the one place where it matters
(stopping the cluster at the beginning of a new search attempt).

The hope is that this will allow the search to run to completion,
even in the face of overload-imposed temporary VM outages. It is
not expected to do this reliably, but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Mar 25, 2021
See cockroachdb#62039.

`tpccbench`, by design, pushes CRDB into overload territory. However,
tpccbench was not handling the resulting conditions at all. Any node
death or stalling VM would fail the whole test run, rather than retry
again with a lower warehouse count.

This commit makes most errors "recoverable" in the sense that they
will simply treat the run as failing, but continue the line search.
Exceptions are communicated via `t.Fatal`, which will abort the
whole run instead.

We also make the `c.Stop()` step at the beginning of each search
step resilient to VMs browning out under memory pressure (from
the previous run), by patiently retrying for a few minutes.

The hope is that this will allow the search to run to completion, even
in the face of overload-imposed temporary VM outages. It is not expected
to do this perfectly - after all, VMs don't generally return from
brown-out within any fixed time period - but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None
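
A minimal sketch of that "patiently retry for a few minutes" idea, with
illustrative names and durations rather than the actual roachtest code:

```go
package tpccbench

import (
	"context"
	"fmt"
	"time"
)

// stopper abstracts the one cluster operation this retry wraps.
type stopper interface {
	Stop(ctx context.Context) error
}

// stopWithRetry keeps retrying the stop for up to maxWait, tolerating VMs that
// are browning out under memory pressure from the previous run.
func stopWithRetry(ctx context.Context, c stopper, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	var err error
	for time.Now().Before(deadline) {
		if err = c.Stop(ctx); err == nil {
			return nil
		}
		time.Sleep(30 * time.Second) // give the VM a chance to come back
	}
	return fmt.Errorf("cluster did not stop within %s: %v", maxWait, err)
}
```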
irfansharif pushed a commit to irfansharif/cockroach that referenced this pull request Mar 31, 2021
irfansharif pushed a commit to tbg/cockroach that referenced this pull request Apr 1, 2021
craig bot pushed a commit that referenced this pull request Apr 1, 2021
61600: kvserver: make the StoreRebalancer aware of non-voters r=aayushshah15 a=aayushshah15

This commit teaches the `StoreRebalancer` to rebalance non-voting
replicas.

Release justification: needed for non-voting replicas
Release note: None

62361: roachtest: attempt to handle VM overload under tpccbench r=irfansharif a=tbg

See #62039.

`tpccbench`, by design, pushes CRDB into overload territory. The test
harness handles nodes crashing or tpmc tanking well. However, it was
not prepared to handle the cloud VMs going unresponsive for ~minutes,
which is one common failure mode.

This commit tweaks the line search to be resilient to failures to
communicate with the cloud VM in the one place where it matters
(stopping the cluster at the beginning of a new search attempt).

The hope is that this will allow the search to run to completion,
even in the face of overload-imposed temporary VM outages. It is
not expected to do this reliably, but at least anecdotally most
VMs seem to come back a few minutes in.

Release note: None


62826: sql: add tests for concurrent add/drop region operations r=ajstorm a=arulajmani

This patch generalizes the setup in what was previously
`TestConcurrentDropRegion` and extends it to all combinations of
add/drop region on a multi-region database. The only change is that
I've added a regional by row table into the test setup mixer, so as to
exercise the repartitioning semantics.

Previously, there was a limitation with concurrent add/drop regions
where both operations were bound to fail in the repartitioning phase.
This limitation was fixed in #60620, but we never had a regression test
for it. Adding a regional by row table during the test setup serves as
one.

Closes #62813

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: arulajmani <arulajmani@gmail.com>
tbg added a commit to tbg/cockroach that referenced this pull request Jun 24, 2021