
roachtest: tpccbench/nodes=9/cpu=4/multi-region failed #53443

Closed
cockroach-teamcity opened this issue Aug 25, 2020 · 6 comments · Fixed by #58014
@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=9/cpu=4/multi-region failed on provisional_202008241848_v20.1.5@f990441079c686f9eec32d80044c719175a2bee5:

The test failed on branch=provisional_202008241848_v20.1.5, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=9/cpu=4/multi-region/run_1
	cluster.go:2597,tpcc.go:782,tpcc.go:617,test_runner.go:754: monitor failure: monitor task failed: failed with output "I200825 22:09:10.901685 1 workload/cli/run.go:356  creating load generator...\nInitializing 5000 connections...\nInitializing 5000 workers and preparing statements...\nE200825 22:27:07.259214 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 10m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\nError: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 4. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=1h40m0s --scatter --tolerate-errors  {pgurl:1-3,5-7,9-11}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n": /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2221186-1598354118-70-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=1h40m0s --scatter --tolerate-errors  {pgurl:1-3,5-7,9-11}: exit status 20
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2585
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2593
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:782
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:617
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:754
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2641
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:710
		  | [...repeated from below...]
		Wraps: (6) failed with output "I200825 22:09:10.901685 1 workload/cli/run.go:356  creating load generator...\nInitializing 5000 connections...\nInitializing 5000 workers and preparing statements...\nE200825 22:27:07.259214 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 10m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\nError: failed to initialize the load generator: preparing \n\t\tUPDATE district\n\t\tSET d_next_o_id = d_next_o_id + 1\n\t\tWHERE d_w_id = $1 AND d_id = $2\n\t\tRETURNING d_tax, d_next_o_id: context deadline exceeded\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 4. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=1h40m0s --scatter --tolerate-errors  {pgurl:1-3,5-7,9-11}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n"
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.execCmdWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:552
		  | main.(*cluster).RunWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2266
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:709
		  | main.runTPCCBench.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:780
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2575
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2221186-1598354118-70-n12cpu4-geo:4 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=1h40m0s --scatter --tolerate-errors  {pgurl:1-3,5-7,9-11}
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *exec.ExitError

Artifacts: /tpccbench/nodes=9/cpu=4/multi-region
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-provisional_202008241848_v20.1.5 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Aug 25, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Aug 25, 2020
@tbg tbg removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 27, 2020
@tbg tbg assigned aayushshah15 and unassigned nvanbenschoten Aug 27, 2020
@knz
Contributor

knz commented Aug 31, 2020

A SQL query is failing with `context deadline exceeded`.

This looks like a plain timeout.
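
To illustrate where that error string comes from, here is a minimal, self-contained Go sketch (assumed connection string; this is not the `workload` code itself) of a statement prepare racing a client-side deadline, using the statement from the log above:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire
)

func main() {
	// Connection string is an assumption for illustration only.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// If the round trips needed to prepare the statement exceed the deadline
	// (e.g. over a slow WAN link), PrepareContext fails with an error that
	// reports "context deadline exceeded".
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	_, err = db.PrepareContext(ctx, `
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id`)
	if err != nil {
		log.Printf("prepare failed: %v", err)
	}
}
```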

@tbg tbg assigned nvanbenschoten and unassigned aayushshah15 Sep 8, 2020
@tbg
Member

tbg commented Sep 8, 2020

Reassigning to Nathan since he runs all of the tpccbench tests right now anyway

@nvanbenschoten
Member

nvanbenschoten commented Sep 8, 2020

I have a new theory about this after hitting it a few times while debugging other issues. I think this is because we perform the "warmup" phase by running unpartitioned load from a single load gen against all CRDB nodes. This means that we're preparing statements over a WAN, which is never the case during normal multi-region TPC-C operation. We should stop doing this, either by running partitioned load or by only pointing the load gen at local CRDB nodes.
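
For example, instead of the warmup command in the log above targeting every CRDB node via `{pgurl:1-3,5-7,9-11}`, it could target only the nodes in the load generator's own region, e.g. `{pgurl:1-3}` (the exact node grouping here is illustrative, not taken from the test configuration).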

@tbg
Member

tbg commented Oct 6, 2020

@nvanbenschoten is this scheduled for any time soon? If not, we should schedule it.

@nvanbenschoten
Member

I've bumped this timeout to 60m in #55300. I think the entire timeout was misguided and probably shouldn't have been introduced in the first place, but 60m at least accounts for the fact that statements can take a while to prepare for large datasets. I'm checking to see whether that improves the situation.
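
For illustration, a sketch of the retry-until-an-overall-deadline pattern being discussed (the names, backoff, and toy usage below are assumptions, not the actual `workload/cli` implementation):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// createLoadGenWithRetry keeps retrying create() until it succeeds or the
// overall deadline expires, then gives up and reports the last failure.
func createLoadGenWithRetry(ctx context.Context, overall time.Duration,
	create func(context.Context) error) error {
	start := time.Now()
	var lastErr error
	for time.Since(start) < overall {
		if lastErr = create(ctx); lastErr == nil {
			return nil
		}
		time.Sleep(time.Second) // arbitrary backoff between attempts
	}
	return fmt.Errorf("it's been more than %s since we started trying; last failure: %w",
		overall, lastErr)
}

func main() {
	// Toy usage with a short deadline; the real knob discussed above is 10m vs 60m.
	err := createLoadGenWithRetry(context.Background(), 3*time.Second,
		func(context.Context) error { return errors.New("context deadline exceeded") })
	fmt.Println(err)
}
```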

@tbg
Member

tbg commented Oct 27, 2020

Still failing with a 60m timeout in #55939

craig bot pushed a commit that referenced this issue Dec 17, 2020
58014: roachtest/tpcc: don't scatter on each tpccbench search iteration r=ajwerner a=nvanbenschoten

Fixes #48255.
Fixes #53443.
Fixes #54258.
Fixes #54570.
Fixes #55599.
Fixes #55688.
Fixes #55817.
Fixes #55939.
Fixes #56996.
Fixes #57062.
Fixes #57864.

This needs to be backported to `release-20.1` and `release-20.2`

In #55688 (comment),
we saw that the failures to create load generators in tpccbench were due to
long-running SCATTER operations. These operations weren't stuck, but were very
slow due to the amount of data being moved and the 2MiB/s limit on snapshots. In
hindsight, this should have been expected, as scatter has the potential to
rebalance data and was being run on datasets on the order of 100s of GBs or even
TBs in size.
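(As a rough illustration of the scale involved: rebalancing even 100 GiB at 2 MiB/s takes 100 × 1024 / 2 = 51,200 seconds, i.e. about 14 hours, which dwarfs the load generator's 10-minute creation window.)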

But this alone did not explain why we used to see this issue infrequently and
only recently began seeing it regularly. We determined that the most likely
reason why this has recently gotten worse is because of #56942. That PR fixed a
race condition in tpcc's `scatterRanges` function which often resulted in 9
scatters of the `warehouse` table instead of 1 scatter of each table in the
database. So before this PR, we were often (but not always, due to the racy
nature of the bug) avoiding the scatter on all but the dataset's smallest table.
After this PR, we were always scattering all 9 tables in the dataset, leading to
much larger rebalancing.
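
To make the shape of that race concrete, here is a minimal, hypothetical Go sketch of how goroutines capturing a shared loop variable can produce N scatters of one table instead of one scatter per table. This is illustration only; the mechanics of the actual bug fixed in #56942 may differ:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	tables := []string{
		"warehouse", "district", "customer", "history", "order",
		"new_order", "item", "stock", "order_line",
	}

	var wg sync.WaitGroup
	for _, table := range tables {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Buggy: `table` is the loop variable shared by every goroutine, so
			// with pre-Go 1.22 semantics several goroutines can observe the same
			// value and "scatter" the same table repeatedly.
			fmt.Printf("ALTER TABLE tpcc.%s SCATTER\n", table)
		}()
		// Fix: rebind per iteration (table := table) or pass table as an argument.
	}
	wg.Wait()
}
```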

To address these issues, this commit removes the per-iteration scattering in
tpccbench. Scattering on each search iteration was a misguided decision. It
wasn't needed because we already scatter once during dataset import (with a
higher `kv.snapshot_rebalance.max_rate`). It was also disruptive as it had the
potential to slow down the test significantly and cause issues like the one we
are fixing here.
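
For reference, a hedged sketch (assumed connection string and an arbitrary 32 MiB value; not the actual roachtest code) of the "scatter once, with a temporarily raised `kv.snapshot_rebalance.max_rate`" approach described above:

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // pgwire driver; CockroachDB speaks the Postgres wire protocol
)

// scatterOnce is an illustrative sketch of scattering every TPC-C table a
// single time, with the snapshot rebalance rate temporarily raised so the
// resulting data movement completes quickly.
func scatterOnce(ctx context.Context, db *sql.DB) error {
	tables := []string{
		"warehouse", "district", "customer", "history", "\"order\"",
		"new_order", "item", "stock", "order_line",
	}
	// 32 MiB is an arbitrary example; the default at the time was 2 MiB/s.
	if _, err := db.ExecContext(ctx,
		`SET CLUSTER SETTING kv.snapshot_rebalance.max_rate = '32 MiB'`); err != nil {
		return err
	}
	for _, t := range tables {
		if _, err := db.ExecContext(ctx, fmt.Sprintf("ALTER TABLE tpcc.%s SCATTER", t)); err != nil {
			return err
		}
	}
	_, err := db.ExecContext(ctx, `RESET CLUSTER SETTING kv.snapshot_rebalance.max_rate`)
	return err
}

func main() {
	// Connection string is an assumption for illustration only.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/tpcc?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := scatterOnce(context.Background(), db); err != nil {
		log.Fatal(err)
	}
}
```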

With this change, I've seen `tpccbench/nodes=6/cpu=16/multi-az` go from failing
6 out of 10 times to succeeding 10 out of 10 times. This change appears to have
no impact on performance.

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in 9dc433d Dec 17, 2020
@craig craig bot closed this as completed in #58014 Dec 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Dec 17, 2020