roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #58641

Closed
cockroach-teamcity opened this issue Jan 8, 2021 · 8 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Member

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@36a7014e1fe848547195e18fd63a8a5c298cb5ba:

		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2642
		  | main.runTPCCBench.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:844
		  | github.com/cockroachdb/cockroach/pkg/util/search.searchWithSearcher
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:43
		  | github.com/cockroachdb/cockroach/pkg/util/search.(*lineSearcher).Search
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/util/search/search.go:173
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:753
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:576
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 1: dead
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1657,context.go:135,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2566808-1610089878-65-n7cpu16-geo --oneshot --ignore-empty-nodes: exit status 1 7: skipped
		5: 25072
		6: 24587
		1: dead
		3: 24560
		4: 23506
		2: 24393
		Error: UNCLASSIFIED_PROBLEM: 1: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:204
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (3) 1: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-20.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jan 8, 2021
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jan 8, 2021
@tbg
Member

tbg commented Jan 19, 2021

OOM killer on n1. The log ends with lots of slow handle raft ready messages. A heap profile was taken ~1m before the crash.

I think the crash was at ca. 15:16:25.652775. The last profile was taken at 15:15:04.460. Unfortunately, the profile shows next to nothing and captures only ~816 MB:

image

For comparison, here are the runtime stats logged near the profile and near the crash:

I210108 15:15:04.460047 221 server/status/runtime.go:522 ⋮ [n1] runtime stats: 6.6 GiB RSS, 3340 goroutines, 1.3 GiB/315 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total, 57459.7 CGO/sec, 1255.3/116.2 %(u/s)time, 0.0 %gc (7x), 161 MiB/144 MiB (r/w)net
I210108 15:16:14.469083 221 server/status/runtime.go:522 ⋮ [n1] runtime stats: 6.5 GiB RSS, 3408 goroutines, 1.2 GiB/262 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total, 56882.8 CGO/sec, 1251.9/115.0 %(u/s)time, 0.0 %gc (7x), 167 MiB/149 MiB (r/w)net
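
To make the two samples easier to compare, here is a minimal sketch that pulls the memory figures out of the runtime stats lines. The regular expression and the `parse_runtime_stats` helper are hypothetical (they are not part of CockroachDB) and only handle the GiB-denominated fields that appear in these two lines. In both samples the Go total plus the CGO total happens to add up to the reported RSS; that is an observation about these lines only, not a claim about how RSS is derived.

```python
import re

# Relevant portion of the two runtime-stats lines quoted above.
LINES = [
    "[n1] runtime stats: 6.6 GiB RSS, 3340 goroutines, "
    "1.3 GiB/315 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.8 GiB CGO alloc/total",
    "[n1] runtime stats: 6.5 GiB RSS, 3408 goroutines, "
    "1.2 GiB/262 MiB/1.8 GiB GO alloc/idle/total, 3.8 GiB/4.7 GiB CGO alloc/total",
]

# Hypothetical pattern: assumes RSS and both totals are printed in GiB.
STATS_RE = re.compile(
    r"runtime stats: (?P<rss>[\d.]+) GiB RSS, (?P<goroutines>\d+) goroutines, "
    r"[\d.]+ \w+/[\d.]+ \w+/(?P<go_total>[\d.]+) GiB GO alloc/idle/total, "
    r"[\d.]+ \w+/(?P<cgo_total>[\d.]+) GiB CGO alloc/total"
)


def parse_runtime_stats(line):
    """Extract RSS, goroutine count, and Go/CGO totals from one log line."""
    m = STATS_RE.search(line)
    if m is None:
        raise ValueError("not a runtime stats line in the expected format")
    return {
        "rss_gib": float(m["rss"]),
        "goroutines": int(m["goroutines"]),
        "go_total_gib": float(m["go_total"]),
        "cgo_total_gib": float(m["cgo_total"]),
    }


for line in LINES:
    stats = parse_runtime_stats(line)
    accounted = round(stats["go_total_gib"] + stats["cgo_total_gib"], 1)
    print(f"RSS {stats['rss_gib']} GiB, Go+CGO total {accounted} GiB, "
          f"{stats['goroutines']} goroutines")
```

Neither sample shows memory growing in the minute before the crash, which is why the heap profile alone does not explain the kill.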

@tbg tbg closed this as completed Jan 19, 2021
@tbg tbg reopened this Jan 19, 2021
@tbg
Member

tbg commented Jan 19, 2021

The oomkiller, in the meantime, says

[12966.207971] Killed process 25785 (cockroach) total-vm:24692104kB, anon-rss:14046484kB, file-rss:0kB, shmem-rss:0kB

That's roughly 14 GB (about 13.4 GiB) of anon-rss, i.e. clearly above the 6.5 GiB RSS we see here. Using this message from dmesg

[    1.197557] rtc_cmos 00:00: setting system clock to 2021-01-08 11:40:19 UTC (1610106019)

1610106019s+12965s corresponds to GMT: Friday, January 8, 2021 3:16:24 PM. So there is a full 10s for which we don't have runtime stats, during which memory could have blown up. Or the RSS reported earlier was just off by a factor of >2.
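
The arithmetic above can be reproduced with a short sketch. All constants come from the dmesg and log lines quoted in this thread; the variable names are made up for illustration. Two assumptions are worth calling out: the 12965s offset appears consistent with subtracting the dmesg timestamp of the rtc_cmos line (1.197557) from that of the OOM kill (12966.207971), and the kernel's "kB" is taken to mean 1024 bytes.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the dmesg lines and runtime stats quoted in this thread.
RTC_EPOCH_S = 1610106019        # wall clock set by rtc_cmos (2021-01-08 11:40:19 UTC)
RTC_UPTIME_S = 1.197557         # dmesg timestamp of the rtc_cmos line
OOM_UPTIME_S = 12966.207971     # dmesg timestamp of the "Killed process" line
ANON_RSS_KB = 14046484          # anon-rss reported by the OOM killer
LAST_STATS = datetime(2021, 1, 8, 15, 16, 14, 469083, tzinfo=timezone.utc)
LAST_RSS_GIB = 6.5              # RSS in the last runtime stats line

# Wall-clock time of the kill: RTC time plus the uptime elapsed since then.
elapsed = timedelta(seconds=OOM_UPTIME_S - RTC_UPTIME_S)
kill_time = datetime.fromtimestamp(RTC_EPOCH_S, tz=timezone.utc) + elapsed

# Window between the last runtime stats sample and the kill.
gap_s = (kill_time - LAST_STATS).total_seconds()

# Assumption: kernel "kB" means KiB (1024 bytes).
anon_rss_gib = ANON_RSS_KB * 1024 / 2**30

print("kill time (UTC):", kill_time.strftime("%Y-%m-%d %H:%M:%S"))
print(f"gap since last runtime stats: {gap_s:.1f}s")
print(f"anon-rss: {anon_rss_gib:.1f} GiB, "
      f"{anon_rss_gib / LAST_RSS_GIB:.2f}x the last reported RSS")
```

Under these assumptions the kill lands a little under 10s after the last sample, and the anon-rss is slightly more than twice the last reported RSS, matching both possibilities raised in the comment.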

@nvanbenschoten
Member

Given the fact that we're seeing this across release branches on these n1-highcpu-16 machines and exploring this further in #58298, I don't think this should be considered a release blocker.

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Jan 20, 2021
@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@8c79e2bc4b35d36c8527f4c40c974f03d9034f46:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2654,tpcc.go:735,tpcc.go:576,test_runner.go:755: monitor failure: monitor task failed: failed with output "./workload: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./workload)\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 7. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n": /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2657161-1612856692-139-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}: exit status 20
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2642
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2650
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:735
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:576
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:663
		  | [...repeated from below...]
		Wraps: (6) failed with output "./workload: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./workload)\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 7. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n"
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.execCmdWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:564
		  | main.(*cluster).RunWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2311
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:662
		  | main.runTPCCBench.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:733
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2657161-1612856692-139-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@nvanbenschoten
Member

version `GLIBC_2.29' not found (required by ./workload)

I believe we've seen this elsewhere. Is that correct @irfansharif?
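
For readers unfamiliar with this loader error: it means the `./workload` binary was linked against a newer glibc symbol version than the one installed on the node. A minimal sketch of how one might extract the required version from such a message and compare it with a host's glibc follows. The helper names are hypothetical, and the host version used in the example (2.27) is an assumed value for illustration only; the thread does not say which glibc the node had.

```python
import re

# Error text copied from the failure output above.
ERROR = ("./workload: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' "
         "not found (required by ./workload)")


def required_glibc(message):
    """Return the (major, minor) glibc version named in a loader error, or None."""
    m = re.search(r"version `GLIBC_(\d+)\.(\d+)' not found", message)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def host_satisfies(host_version, required):
    """True if the host's glibc (major, minor) is at least the required version."""
    return host_version >= required


needed = required_glibc(ERROR)
assumed_host = (2, 27)  # hypothetical host glibc, not taken from the thread
print("required:", needed)
print("assumed host satisfies requirement:", host_satisfies(assumed_host, needed))
```

A mismatch like this typically points at the binary being built on a newer distribution than the one it runs on, which is independent of the OOM failure discussed earlier in the thread.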

@tbg
Member

tbg commented Feb 10, 2021 via email

@cockroach-teamcity
Member Author

(roachtest).tpccbench/nodes=6/cpu=16/multi-az failed on release-20.2@b0012907c1bc9627ae2de83e6099c4930a32699e:

The test failed on branch=release-20.2, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/tpccbench/nodes=6/cpu=16/multi-az/run_1
	cluster.go:2654,tpcc.go:735,tpcc.go:576,test_runner.go:755: monitor failure: monitor task failed: failed with output "./workload: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./workload)\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 7. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n": /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2661584-1612941367-142-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}: exit status 20
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2642
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2650
		  | main.runTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:735
		  | main.registerTPCCBenchSpec.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:576
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitor).wait.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2698
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:663
		  | [...repeated from below...]
		Wraps: (6) failed with output "./workload: /lib/x86_64-linux-gnu/libm.so.6: version `GLIBC_2.29' not found (required by ./workload)\nError: COMMAND_PROBLEM: exit status 1\n(1) COMMAND_PROBLEM\nWraps: (2) Node 7. Command with error:\n  | ```\n  | ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}\n  | ```\nWraps: (3) exit status 1\nError types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError\n"
		Wraps: (7) attached stack trace
		  -- stack trace:
		  | main.execCmdWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:564
		  | main.(*cluster).RunWithBuffer
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2311
		  | main.loadTPCCBench
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:662
		  | main.runTPCCBench.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tpcc.go:733
		  | main.(*monitor).Go.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2632
		  | golang.org/x/sync/errgroup.(*Group).Go.func1
		  | 	/home/agent/work/.go/pkg/mod/golang.org/x/sync@v0.0.0-20190911185100-cd5d95a43a6e/errgroup/errgroup.go:57
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1374
		Wraps: (8) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2661584-1612941367-142-n7cpu16-geo:7 -- ./workload run tpcc --warehouses=5000 --workers=5000 --max-rate=2500 --wait=false --duration=20m0s --scatter --tolerate-errors {pgurl:1-6}
		Wraps: (9) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *exec.ExitError


Artifacts: /tpccbench/nodes=6/cpu=16/multi-az
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@irfansharif
Contributor

The glibc errors were fixed.

4 participants