roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58298

Closed
cockroach-teamcity opened this issue Dec 28, 2020 · 43 comments · Fixed by #60992
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@08c89a597a06520c30faf01965f9c74fe9b9854f:

		3: dead
		1: dead
		6: 6206
		5: 5974
		2: 5958
		4: 6175
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 1: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 3: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Dec 28, 2020
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@365c5504b75c9a9260365a628a5110c48312178b:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 3: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (6) secondary error attachment
		  | 1: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) 5: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *secondary.withSecondaryError (7) *withstack.withStack (8) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@6d49a323b52966becfe8a2c38a1a8ccdf8ee58a1:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2180,tpcc.go:747,search.go:43,search.go:173,tpcc.go:743,tpcc.go:566,test_runner.go:760: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2550095-1609311871-63-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2550095-1609311871-63-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2550095-1609311871-63-n7cpu16-geo: stopping and waiting.............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 1: exit status 255: 
		  | I201230 14:10:38.371774 1 (gostd) cluster_synced.go:1732  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@b93fd531b93cb010729cb73fe679cdff9388cf27:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 5: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 5: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1637,context.go:140,cluster.go:1626,test_runner.go:841: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2552340-1609398198-74-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		5: dead
		6: 6305
		2: 5606
		1: 6986
		4: 6203
		3: 6208
		Error: UNCLASSIFIED_PROBLEM: 5: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 5: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d7bbe0060531063b9bee29f69bc4d23d41b84e3d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 4: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 4: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1637,context.go:140,cluster.go:1626,test_runner.go:841: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2553797-1609484597-65-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		4: dead
		3: 5636
		5: 6017
		6: 6253
		1: 6849
		2: 6228
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented Jan 5, 2021

Another overload-to-death. cc @nvanbenschoten

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@cee475331ca3629b503cd2e7c7919b72c98a5ca5:

		6: dead
		4: 6175
		2: dead
		5: 6223
		3: 6121
		1: 6679
		Error: UNCLASSIFIED_PROBLEM: 6: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) secondary error attachment
		  | 2: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (4) 6: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *withstack.withStack (4) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented Jan 6, 2021

fatal error: runtime: out of memory

The stacks show >400 goroutines with stacks containing handleMergeInProgressError. The corresponding maybeWatchForMerge goroutine is there; it seems to be waiting for its push to come back from raft:

goroutine 10529958 [select]:
runtime.gopark(0x4f63550, 0x0, 0x1809, 0x1)
	/usr/local/go/src/runtime/proc.go:306 +0xe5 fp=0xc17d5b9288 sp=0xc17d5b9268 pc=0x4e4b25
runtime.selectgo(0xc17d5b9920, 0xc17d5b94a0, 0x4, 0x183a, 0x94495)
	/usr/local/go/src/runtime/select.go:338 +0xcef fp=0xc17d5b93b0 sp=0xc17d5b9288 pc=0x4f4c8f
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeWriteBatch(0xc0419eb800, 0x55a3500, 0xc03480bad0, 0xc1b705bef0, 0x1657a50ef8d73c6f, 0x1, 0x0, 0x600000006, 0x1, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_write.go:194 +0xdd2 fp=0xc17d5b99d0 sp=0xc17d5b93b0 pc=0x1dc5b52
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeBatchWithConcurrencyRetries(0xc0419eb800, 0x55a3500, 0xc03480bad0, 0xc1b705bef0, 0x4f5b4f0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:352 +0x491 fp=0xc17d5b9d18 sp=0xc17d5b99d0 pc=0x1dba4f1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc0419eb800, 0x55a3500, 0xc03480baa0, 0x183a, 0xc1b705bef0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:97 +0x53d fp=0xc17d5b9f30 sp=0xc17d5b9d18 pc=0x1db91fd
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
[...]

This may not be the root cause. I'm just pointing them out because I chanced upon them. The logs look just about as unhappy as they do for the other open tpccbench problems.
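
For anyone repeating this kind of check on a goroutine dump, the count can be approximated with a small script. The following is only a sketch: it assumes the dump uses the standard Go runtime format (stacks separated by blank lines, each starting with a `goroutine N [state]:` header), and the sample dump in `main` is hypothetical and heavily shortened.

```go
package main

import (
	"fmt"
	"strings"
)

// countGoroutinesWith returns how many goroutine stacks in a Go runtime
// dump mention the given substring. Stacks are assumed to be separated by
// blank lines and to begin with a "goroutine N [state]:" header.
func countGoroutinesWith(dump, needle string) int {
	count := 0
	for _, stack := range strings.Split(dump, "\n\n") {
		stack = strings.TrimSpace(stack)
		if strings.HasPrefix(stack, "goroutine ") && strings.Contains(stack, needle) {
			count++
		}
	}
	return count
}

func main() {
	// Hypothetical, shortened dump used for illustration only.
	dump := `goroutine 1 [select]:
kvserver.(*Replica).handleMergeInProgressError(...)
kvserver.(*Replica).executeBatchWithConcurrencyRetries(...)

goroutine 2 [select]:
kvserver.(*Replica).executeWriteBatch(...)

goroutine 3 [chan receive]:
kvserver.(*Replica).handleMergeInProgressError(...)
`
	fmt.Println(countGoroutinesWith(dump, "handleMergeInProgressError"))
}
```

A count like this says how many goroutines are parked in a code path, not why they are there, so it is a starting point rather than a diagnosis.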

@nvanbenschoten
Member

I'm able to reproduce this without too much difficulty, which lines up with how often we've been seeing this fail recently. I still don't have a firm grasp on what's going wrong, but I see from each failure that the cluster gets very unhappy shortly after completing its import, scattering, and then running what should be a low degree of load.

One thing that did jump out from some of the goroutine dumps is that we see a rapid growth of goroutines in (*IntentResolver).gcTxnRecord shortly before a node OOMs. This checks out, because we run gcTxnRecord in an async task without a concurrency limit from cleanupFinishedTxnIntents. I'm a bit confused about how we got to this state though, because all callers of cleanupFinishedTxnIntents are themselves in concurrency-limited async tasks. So while the number of goroutines that can call this function is limited, each call is allowed to launch an unbounded number of goroutines to clean up txn records. If you add in an unavailable or close to unavailable destination for these GCRequests, it's not hard to see how things could get ugly.

I'm going to explore this more tomorrow, but I'm wondering whether there's an easy fix here - can we just run gcTxnRecord synchronously from the existing concurrency-limited async task? That passes all tests and actually simplifies the code. So why the nested tasks? I think this began in #34242, which even added the comment "Run transaction record GC outside of ir.sem". In fact, I think it came out of the PR review process, as I see some discussion of this and the behavior changes in revision 6 in reviewable. But I don't quite understand the motivation. Is it to avoid clogging up the semaphore while performing the batch wait? defaultGCBatchWait is all the way at 1s, which seems very long for anything that might be semi-synchronous. Maybe we should be adding to the batcher without waiting on the result at all, using SendWithChan instead of Send. That would allow us to avoid an additional async task without any risk of paying the batch delay latency. If we do that though, we may want to introduce some form of concurrency limit within the batcher. @ajwerner do you have any memory of this?
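
To make the bounded-versus-unbounded distinction concrete, here is a minimal, generic sketch of a concurrency limit built from a buffered channel used as a counting semaphore. It is not CockroachDB's stopper or intent resolver code; `runBounded`, the task list, and the limit of 4 are all hypothetical names and values chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runBounded runs each task in its own goroutine but lets at most limit of
// them execute at once, using a buffered channel as a counting semaphore.
// It returns the peak number of tasks observed running concurrently.
func runBounded(tasks []func(), limit int) int64 {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	var running, peak int64
	for _, task := range tasks {
		sem <- struct{}{} // blocks the caller once limit tasks are in flight
		wg.Add(1)
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when the task ends
			n := atomic.AddInt64(&running, 1)
			// Record a new peak if this task pushed the count higher.
			for {
				p := atomic.LoadInt64(&peak)
				if n <= p || atomic.CompareAndSwapInt64(&peak, p, n) {
					break
				}
			}
			task()
			atomic.AddInt64(&running, -1)
		}(task)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	// 100 hypothetical cleanup tasks; each just sleeps briefly.
	tasks := make([]func(), 100)
	for i := range tasks {
		tasks[i] = func() { time.Sleep(time.Millisecond) }
	}
	peak := runBounded(tasks, 4)
	fmt.Println("peak within limit:", peak <= 4)
}
```

The relevant property is that the producer blocks on `sem <- struct{}{}` when the limit is reached, so a slow destination turns into backpressure instead of a growing pile of goroutines. Launching a second, unlimited `go` statement from inside `task` would bypass that limit, which is the pattern described above.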

@tbg
Member

tbg commented Jan 7, 2021

Interesting. Can you tell whether the failures are new (or was this always a failure mode of tpccbench that we just hadn't caught on to because of the SCATTER failure mode)?

@ajwerner
Contributor

ajwerner commented Jan 7, 2021

> defaultGCBatchWait is all the way at 1s, which seems very long for anything that might be semi-synchronous

that rings a bell

@nvanbenschoten
Member

> Can you tell whether the failures are new (or was this always a failure mode of tpccbench that we just hadn't caught on to because of the SCATTER failure mode)?

It's not yet clear. This is easy enough to reproduce that I'm just going to try to bisect it and see.

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@339275585b7d30b9ee2d49b0c696b9ddb8d51ad4:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2180,tpcc.go:747,search.go:43,search.go:173,tpcc.go:743,tpcc.go:566,test_runner.go:760: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2566787-1610089287-61-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2566787-1610089287-61-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2566787-1610089287-61-n7cpu16-geo: stopping and waiting.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 4: exit status 255: Connection to 34.68.167.105 closed by remote host.
		  |
		  | I210108 13:02:19.814071 1 (gostd) cluster_synced.go:1732  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@dbc7245c5d8c9f009072353fec261419e573032c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2648,tpcc.go:725,tpcc.go:566,test_runner.go:760: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2636
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2644
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:725
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:566
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@03797b17417ae34451537f8f76d66ac69dba2d07:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2180,tpcc.go:747,search.go:43,search.go:173,tpcc.go:743,tpcc.go:566,test_runner.go:760: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2578434-1610521418-81-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2578434-1610521418-81-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2578434-1610521418-81-n7cpu16-geo: stopping and waiting.......................................................................................................................................................................................................................................................................................................................................................
		  | 3: exit status 255: 
		  | I210113 12:51:40.258861 1 (gostd) cluster_synced.go:1732  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@a786c51627fe66e47b4a4445c67b2a9077ae2a93:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2661,tpcc.go:727,tpcc.go:568,test_runner.go:760: monitor failure: unexpected node event: 4: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2649
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2657
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:727
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:568
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:760
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 4: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1642,context.go:140,cluster.go:1631,test_runner.go:841: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2586231-1610694347-68-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		4: dead
		2: 5670
		1: 6357
		3: 6170
		5: 5728
		6: 6112
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1850
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@7b0ccdda99b81613e70f421c9374483c3feddff3:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (6) secondary error attachment
		  | 5: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 5: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) 6: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *secondary.withSecondaryError (7) *withstack.withStack (8) *errutil.leafError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@nvanbenschoten
Member

One thing that's now clear to me is that we see these OOMs much more regularly in the cpu=16 tests than in the cpu=4 tests because the VMs in the former (n1-highcpu-16) actually have less RAM (14 GB) vs. those (n1-standard-4) in the latter (15 GB). We still shouldn't be OOMing here, but we do hit these instances with 4 times the load and expect linear scaling.
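To put numbers on that comparison, here is a rough Go sketch of the RAM available per vCPU on the two machine types. It uses only the RAM and CPU figures quoted in the comment above; the helper name `memPerVCPU` is illustrative and not part of roachtest.

```go
package main

import "fmt"

// memPerVCPU returns the GB of RAM available per vCPU.
func memPerVCPU(ramGB, vcpus float64) float64 {
	return ramGB / vcpus
}

func main() {
	// Figures as quoted in the comment above.
	highcpu16 := memPerVCPU(14, 16) // n1-highcpu-16
	standard4 := memPerVCPU(15, 4)  // n1-standard-4

	fmt.Printf("n1-highcpu-16: %.3f GB/vCPU\n", highcpu16)
	fmt.Printf("n1-standard-4: %.3f GB/vCPU\n", standard4)
	fmt.Printf("ratio: %.2fx\n", standard4/highcpu16)
}
```

With these figures the cpu=16 machines have roughly 4.3 times less memory per vCPU, which fits the observation that load scaled linearly with CPU count runs out of RAM sooner there.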

@nvanbenschoten
Member

nvanbenschoten commented Jan 19, 2021

I also now think I have a better understanding of why TPC-C in particular is good at triggering these OOMs. In these TPC-C tests and especially in the ramp-up period, we seem to use a high concurrency and then a low-ish --max-rate. The effect of this is that once things start to slow down, load will continue to pile on to the cluster, driving it to overload.

@nvanbenschoten
Member

I had 10 more pass last night on the same config as #58298 (comment). So there's definitely progress being made here. Hopefully master is just as stable and we can chalk this all up to #58722, with maybe some help from #59206 (but only in some of the cases).

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@d86781c07065421f4a4d8bf5d988900ab07fdce5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	tpcc.go:897,tpcc.go:600,test_runner.go:767: monitor failure: unexpected node event: 2: dead
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:874
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:783
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:600
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@tbg
Member

tbg commented Jan 26, 2021

My computer suspended before the runs finished, but the clusters were still up and looked fine. Unfortunately, that failure just above has 64d6d87.

irfansharif pushed a commit that referenced this issue Jan 26, 2021
We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az
which need to be investigated. Turn off background tracing while we do.

Touches #58298. We're also reverting an earlier commit as part of this
one (d252400). This revert is needed given
we've not yet addressed an underlying bug (#59203).

Release note: None
craig bot pushed a commit that referenced this issue Jan 26, 2021
59431: tracing: revert trace.mode default to legacy r=irfansharif a=tbg

We are still seeing memory issues on tpccbench/nodes=6/cpu=16/multi-az
which need to be investigated.

Turn off background tracing while we do.

Touches #58298.

Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
asubiotto pushed a commit that referenced this issue Jan 27, 2021
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@f7c5898f3d552f7ab0751cdd9ffa95cdfd6b8a76:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (6) secondary error attachment
		  | 6: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 6: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) 4: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *secondary.withSecondaryError (7) *withstack.withStack (8) *errutil.leafError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@1a459a81dba35b6a091f0a2954aa33d50f1e5d24:

		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2675
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:874
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:783
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:600
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:767
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 4: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1666,context.go:140,cluster.go:1655,test_runner.go:848: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2622646-1611816990-63-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		4: dead
		1: 20499
		5: 19270
		6: 19501
		2: 19284
		3: 19245
		Error: UNCLASSIFIED_PROBLEM: 4: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 4: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@tbg
Member

tbg commented Jan 28, 2021

This had always-on tracing off. Unfortunately, roachtest also did not fully collect the artifacts and so we don't have heap profiles for any of the nodes. I think (@nvanbenschoten had the same suspicion in another instance of this) that the parallelization of roachprod get is not resilient to individual failures. I'll look into this.

teardown: 14:03:59 test.go:196: test status: fetching logs
teardown: 14:03:59 cluster.go:386: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod get teamcity-2622646-1611816990-63-n7cpu16-geo logs /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1/logs/unredacted
teamcity-2622646-1611816990-63-n7cpu16-geo: getting (scp) logs /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1/logs/unredacted
.......
   1: done
   2: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine ubuntu@34.121.119.176:logs /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1/logs/2.unredacted
Warning: Permanently added '34.121.119.176' (ECDSA) to the list of known hosts.
scp: logs/heap_profiler/memstats.2021-01-28T14_03_08.933.12066476032.txt: No such file or directory: exit status 1
   3: done
   4: done
   5: done
   6: done
   7: ~ scp -r -C -o StrictHostKeyChecking=no -i /root/.ssh/id_rsa -i /root/.ssh/google_compute_engine ubuntu@104.198.214.18:logs /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1/logs/7.unredacted
scp: logs: No such file or directory: exit status 1
I210128 14:04:06.229198 1 (gostd) cluster_synced.go:1523  [-] 1  get logs failed

@tbg
Member

tbg commented Jan 28, 2021

The failure before that has roachprod stop failing, which confuses the dead node checker, so it's more of a flake (though we have seen that getting into OOM territory can make the whole VM unresponsive, which may have happened here). The heap inuse_space profile doesn't look too bad and was taken a minute before the log files ended, so I'm unsure what was going on there.

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@0e6727832d58faf0f900601cd6fa6807e0a2ba75:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2219,tpcc.go:804,search.go:43,search.go:173,tpcc.go:800,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2635610-1612249051-63-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2635610-1612249051-63-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | <... some data truncated by circular buffer; go to artifacts for details ...>
		  | ........................................
		  | 0: exit status 255: 
		  | I210202 16:05:30.118329 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (6) secondary error attachment
		  | 3: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 3: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) 2: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *secondary.withSecondaryError (7) *withstack.withStack (8) *errutil.leafError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on master@bf9744bad5a416a4b06907f0f3dd42896f7342f3:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-62-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2702231-1613977007-62-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2702231-1613977007-62-n7cpu16-geo: stopping and waiting........................................
		  | 1: exit status 255: 
		  | I210222 14:36:43.493747 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError

Artifacts: /tpccbench/nodes=6/cpu=16/multi-az

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Feb 23, 2021
Fixes cockroachdb#60852.
Fixes cockroachdb#60833.
Fixes cockroachdb#58298.
Fixes cockroachdb#59428.
Fixes cockroachdb#60756.
Fixes cockroachdb#60848.
Fixes cockroachdb#60849.

In cockroachdb#60852 and related issues, we saw that the introduction of a non-nullable
`RaftCommand.ClosedTimestamp`, coupled with the `ClosedTimestampFooter` encoding
strategy we use, led to encoded `RaftCommand` protos with their ClosedTimestamp
field set twice. This is ok from a correctness perspective, at least as protobuf
is concerned, but it led to a subtle interaction where the process of passing
through sideloading (`maybeInlineSideloadedRaftCommand(maybeSideloadEntriesImpl(e))`)
would reduce the size of an encoded RaftCommand by 3 bytes (the encoded size of
an empty `hlc.Timestamp`). This was resulting in an `uncommittedSize` leak in
Raft, which was eventually stalling on its `MaxUncommittedEntriesSize` limit.

This commit fixes this issue by making `RaftCommand.ClosedTimestamp` nullable.
With the field marked as nullable, it will no longer be encoded as an empty
timestamp when unset, ensuring that when the encoded `ClosedTimestampFooter` is
appended, it contains the only instance of the `ClosedTimestamp` field.
craig bot pushed a commit that referenced this issue Feb 23, 2021
60836: opt: support UPDATE with partial UNIQUE WITHOUT INDEX constraints r=mgartner a=mgartner

This commit adds uniqueness checks for partial `UNIQUE WITHOUT INDEX`
constraints during `UPDATE` statements.

As part of this change, I discovered that #60535 introduced a
regression where columns not required by uniqueness checks are not
pruned. I've left TODOs in the column pruning tests and plan on fixing
this in a follow-up PR.

There is no release note because these constraints are gated behind the
experimental_enable_unique_without_index_constraints session variable.

Release note: None

60992: kv: make RaftCommand.ClosedTimestamp nullable r=nvanbenschoten a=nvanbenschoten

Fixes #60852.
Fixes #60833.
Fixes #58298.
Fixes #59428.
Fixes #60756.
Fixes #60848.
Fixes #60849.

In #60852 and related issues, we saw that the introduction of a non-nullable `RaftCommand.ClosedTimestamp`, coupled with the `ClosedTimestampFooter` encoding strategy we use, led to encoded `RaftCommand` protos with their ClosedTimestamp field set twice. This is ok from a correctness perspective, at least as protobuf is concerned, but it led to a subtle interaction where the process of passing through sideloading (`maybeInlineSideloadedRaftCommand(maybeSideloadEntriesImpl(e))`) would reduce the size of an encoded RaftCommand by 3 bytes (the encoded size of an empty `hlc.Timestamp`). This was resulting in an `uncommittedSize` leak in Raft, which was eventually stalling on its `MaxUncommittedEntriesSize` limit.

This commit fixes this issue by making `RaftCommand.ClosedTimestamp` nullable. With the field marked as nullable, it will no longer be encoded as an empty timestamp when unset, ensuring that when the encoded `ClosedTimestampFooter` is appended, it contains the only instance of the `ClosedTimestamp` field.
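The size discrepancy described above can be sketched with hand-rolled protobuf wire encoding. This is a simplified illustration, not the actual CockroachDB code: the field number 17 is hypothetical (chosen because any field number of 16 or higher needs a two-byte tag, so an empty embedded message costs 3 bytes), the body and timestamp bytes are opaque stand-ins, and `reencode` assumes that a decode followed by an encode leaves a single instance of the field.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// closedTSField is a hypothetical field number. Field numbers >= 16 use a
// two-byte tag, so an empty embedded message takes 3 bytes (2 tag + 1 length).
const closedTSField = 17

// appendUvarint appends v in protobuf varint encoding.
func appendUvarint(b []byte, v uint64) []byte {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], v)
	return append(b, buf[:n]...)
}

// appendMessageField appends a length-delimited (wire type 2) field.
func appendMessageField(b []byte, field uint64, payload []byte) []byte {
	b = appendUvarint(b, field<<3|2)
	b = appendUvarint(b, uint64(len(payload)))
	return append(b, payload...)
}

// encodeWithFooter mimics encoding a command and then appending a footer that
// carries the timestamp. A non-nullable field is also written as an empty
// message before the footer, so the field appears twice.
func encodeWithFooter(body, ts []byte, nullable bool) []byte {
	out := append([]byte(nil), body...)
	if !nullable {
		out = appendMessageField(out, closedTSField, nil)
	}
	return appendMessageField(out, closedTSField, ts)
}

// reencode mimics a decode/encode round trip, after which the field is
// present exactly once.
func reencode(body, ts []byte) []byte {
	out := append([]byte(nil), body...)
	return appendMessageField(out, closedTSField, ts)
}

func main() {
	body := []byte("payload")  // stand-in for the rest of the command
	ts := []byte{0x08, 0x01}   // stand-in for an encoded timestamp
	for _, nullable := range []bool{false, true} {
		before := len(encodeWithFooter(body, ts, nullable))
		after := len(reencode(body, ts))
		fmt.Printf("nullable=%t shrink=%d bytes\n", nullable, before-after)
	}
}
```

In this sketch the non-nullable variant shrinks by 3 bytes across the round trip while the nullable variant does not shrink at all, which is the kind of mismatch that would throw off size accounting done before and after such a round trip.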

cc. @cockroachdb/bulk-io 

Co-authored-by: Marcus Gartner <marcus@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@craig craig bot closed this as completed in a2b9c19 Feb 23, 2021
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 8, 2021
Relates to cockroachdb#58298.

One thing I've noticed when looking into cockroachdb#58298 is that we were often
badly overloading the cluster during the rebalance wait period. During
this time, we just want to apply a small amount of load to help instruct
load-based splitting and rebalancing. But in some cases, we were
completely overloading the cluster. We also weren't ramping up the load,
as we had intended to.

This commit fixes both of these issues. It adds a ramp period for the
first quarter of the rebalance time and it scales the txn rate based on
the expected max warehouse count instead of the loaded warehouse count.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue May 6, 2021