roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [overload] #62339

Closed
cockroach-teamcity opened this issue Mar 22, 2021 · 10 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.

Comments

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-21.1@2bdb62260a178e5bb63cf15f704944c5384f4347:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2220,tpcc.go:807,search.go:43,search.go:173,tpcc.go:803,tpcc.go:617,test_runner.go:767: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2798738-1616393020-64-n7cpu16-geo:1-6 returned: exit status 1
		(1) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod stop teamcity-2798738-1616393020-64-n7cpu16-geo:1-6 returned
		  | stderr:
		  |
		  | stdout:
		  | teamcity-2798738-1616393020-64-n7cpu16-geo: stopping and waiting........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
		  | 0: exit status 255: 
		  | I210322 13:25:56.981836 1 (gostd) cluster_synced.go:1732  [-] 1  command failed
		Wraps: (2) exit status 1
		Error types: (1) *main.withCommandDetails (2) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2021
@tbg tbg added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2021
@tbg
Member

tbg commented Mar 24, 2021

OOM kill.

@tbg tbg self-assigned this Mar 24, 2021
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-21.1@f602e37e31a256980ae897917f45cba9c135b412:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 4: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (6) secondary error attachment
		  | 1: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:204
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1852
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) 2: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *secondary.withSecondaryError (6) *secondary.withSecondaryError (7) *withstack.withStack (8) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@tbg
Member

tbg commented Mar 29, 2021

Likely an OOM kill: the log stops mid-file, but there is no direct evidence.

@tbg tbg changed the title roachtest: tpccbench/nodes=6/cpu=16/multi-az failed roachtest: tpccbench/nodes=6/cpu=16/multi-az failed [overload] Mar 29, 2021
@tbg tbg removed the GA-blocker label Mar 29, 2021
@ajwerner
Contributor

I've seen this a number of times lately, and I'm mostly alarmed by the lack of log messages. I was able to reproduce a very similar-looking failure in #62320, but found no evidence of an OOM kill in the journalctl logs and no evidence from the runtime either. I'm presently looking at changing the logging to avoid hijacking stderr, to see whether I can confirm that the runtime is failing to allocate memory. The heap profile and the memory usage reported by the logging certainly point in that direction. The fact that we have roachtests failing this way is alarming.

@tbg
Member

tbg commented Mar 29, 2021

I was wrong in this case. We're failing during roachprod stop, so all of the nodes are likely getting shut down without roachtest realizing it. We then report the nodes as dead even though we likely stopped them ourselves.

But I don't think I've made that mistake 100% of the time and I also feel like logs go missing here. I looked at #62320 and agree that that is just a case of straight up missing files.
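
For triaging reports like the one above, it can help to pull out which nodes were flagged as dead before deciding whether they died or were stopped by the test itself. The following is a minimal sketch, assuming the report lines end in `<node>: dead` as they do in the stack traces in this issue; `DeadNodes` is a hypothetical helper and not part of roachprod or roachtest.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// deadLine matches report lines that end in "<node>: dead", for example
// "Wraps: (2) 4: dead" or "| 1: dead". The format is an assumption taken
// from the output quoted in this issue.
var deadLine = regexp.MustCompile(`(\d+): dead\s*$`)

// DeadNodes returns the sorted, de-duplicated node IDs reported as dead.
func DeadNodes(output string) []int {
	seen := map[int]bool{}
	for _, line := range strings.Split(output, "\n") {
		m := deadLine.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		id, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		seen[id] = true
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

func main() {
	sample := "Wraps: (2) 4: dead\n  | 1: dead\nWraps: (8) 2: dead\n  | Wraps: (2) 1: dead"
	fmt.Println(DeadNodes(sample))
}
```

A list of dead nodes alone does not say why they are down; it only narrows which node logs to inspect.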

@ajwerner
Contributor

> I looked at #62320 and agree that that is just a case of straight up missing files.

Is it missing files? I don't have any evidence that there should be more files. It's more like the files we have are missing data. In that issue, the other nodes stop being able to connect to n4 before the logs stop. I'm worried that things seem so much worse in this release than in the last one.

@tbg
Member

tbg commented Mar 29, 2021

Sorry, I meant that it is missing evidence of the node stopping, as you outlined. I agree that something has gotten worse. My conjecture is that we're now losing the "runtime: out of memory" messages when we didn't before. At least, I haven't seen an "out of memory" message in a long time despite looking at lots of OOMs.

@tbg
Member

tbg commented Apr 12, 2021

Likely an OOM, but again there is no trace: #62763

@tbg
Member

tbg commented May 5, 2021

This has gone cold, and the date of the last failure lines up well with the 21.1 perf regression fixes (see https://roachperf.crdb.dev/?filter=&view=tpccbench%2Fnodes%3D3%2Fcpu%3D16&tab=gce), so I'm considering this closed.

@tbg tbg closed this as completed May 5, 2021