roachtest: gossip/chaos/nodes=9 failed #51721

Closed
cockroach-teamcity opened this issue Jul 22, 2020 · 7 comments · Fixed by #51893
Labels: branch-master, C-test-failure, O-roachtest, O-robot, release-blocker


@cockroach-teamcity
Member

(roachtest).gossip/chaos/nodes=9 failed on master@e9a4f83e3eee59510f97db2c6e0df9b57cf6b944:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:64,gossip.go:102,gossip.go:114,gossip.go:124,test_runner.go:757: gossip did not stabilize in 20.0s

	cluster.go:1571,context.go:135,cluster.go:1560,test_runner.go:826: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2107908-1595398673-31-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1
		8: 4834
		9: 4617
		6: 4629
		4: 4619
		5: 4649
		3: 5248
		2: dead
		1: 6901
		7: 5288
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


Artifacts: /gossip/chaos/nodes=9
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 22, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Jul 22, 2020
@knz
Contributor

knz commented Jul 22, 2020

The error seems legitimate:

	gossip.go:64,gossip.go:102,gossip.go:114,gossip.go:124,test_runner.go:757: gossip did not stabilize in 20.0s

@knz
Contributor

knz commented Jul 22, 2020

cc @nvanbenschoten @tbg for triage

@cockroach-teamcity
Member Author

(roachtest).gossip/chaos/nodes=9 failed on master@b8a50cc4d062293915969cdc83e3ec4d057cede5:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:64,gossip.go:102,gossip.go:114,gossip.go:124,test_runner.go:757: gossip did not stabilize in 20.1s

	cluster.go:1571,context.go:135,cluster.go:1560,test_runner.go:826: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2111252-1595484018-31-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1
		3: 4707
		9: 4698
		4: 4737
		2: 4736
		8: 4691
		6: 5588
		1: 5422
		5: 4782
		7: dead
		Error: UNCLASSIFIED_PROBLEM: 7: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 7: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


Artifacts: /gossip/chaos/nodes=9
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@knz knz mentioned this issue Jul 23, 2020
25 tasks
@nvanbenschoten
Member

Yes, this looks like an issue. I have a hunch I know what's going on. In the first failure, we see errors like:

07:04:27 gossip.go:79: 1: gossip not ok (dead node 2 present): 1:4,3:1,5:3,7:2,7:8,8:1,9:3,9:5 (19s)

In the second, we see errors like:

06:49:46 gossip.go:79: 1: gossip not ok (dead node 7 present): 2:1,4:1,5:4,6:4,7:2,8:4,9:2 (19s)

Notice that in both cases, the "dead" node is part of the gossip network but there is another node missing.
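To make the "another node missing" observation easy to check, here is a minimal sketch that lists which node IDs never show up in a connectivity string from the log lines above. It assumes the string is a comma-separated list of `a:b` pairs of node IDs and that IDs run from 1 to the cluster size; `missingNodes` is a hypothetical helper written for this thread, not part of the roachtest code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// missingNodes parses a connectivity string such as "1:4,3:1" (assumed to be
// comma-separated "a:b" pairs of node IDs) and returns the IDs in 1..n that
// appear in none of the pairs.
func missingNodes(connectivity string, n int) ([]int, error) {
	seen := make(map[int]bool)
	for _, edge := range strings.Split(connectivity, ",") {
		parts := strings.Split(edge, ":")
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed edge %q", edge)
		}
		for _, p := range parts {
			id, err := strconv.Atoi(strings.TrimSpace(p))
			if err != nil {
				return nil, fmt.Errorf("malformed node ID in edge %q: %v", edge, err)
			}
			seen[id] = true
		}
	}
	var missing []int
	for id := 1; id <= n; id++ {
		if !seen[id] {
			missing = append(missing, id)
		}
	}
	return missing, nil
}

func main() {
	// Connectivity strings copied from the two log lines quoted above.
	logs := []string{
		"1:4,3:1,5:3,7:2,7:8,8:1,9:3,9:5", // first failure, "dead" node 2
		"2:1,4:1,5:4,6:4,7:2,8:4,9:2",     // second failure, "dead" node 7
	}
	for _, l := range logs {
		m, err := missingNodes(l, 9)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(m)
	}
}
```

Run against the two strings above, this prints `[6]` and then `[3]`: the supposedly dead nodes 2 and 7 are still in the gossip network, while IDs 6 and 3 are the ones absent.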

Incidentally, I ran a large TPC-E cluster last Friday and noticed that the node IDs were all out of order, even with roachprod start --sequential. I suspect that we're somehow bringing nodes up in a way that does not allow their IDs to line up with their corresponding roachprod node IDs.

@nvanbenschoten
Member

To back this up, in the second failure, we can see from node 7's logs that it was assigned node ID 3. That's the node missing from gossip.
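A small sketch of why that mismatch trips the check. This is not the roachtest implementation, only an illustration of the kind of "dead node present" test the log message suggests; `inGossip` and the `assigned` map are hypothetical names, and the only mapping taken from this thread is roachprod node 7 being assigned node ID 3.

```go
package main

import "fmt"

// inGossip reports whether id is among the node IDs seen in gossip.
func inGossip(gossipIDs []int, id int) bool {
	for _, g := range gossipIDs {
		if g == id {
			return true
		}
	}
	return false
}

func main() {
	// Node IDs that appear in the second failure's connectivity string
	// "2:1,4:1,5:4,6:4,7:2,8:4,9:2".
	gossipIDs := []int{1, 2, 4, 5, 6, 7, 8, 9}

	// From node 7's logs: roachprod node 7 was assigned node ID 3.
	assigned := map[int]int{7: 3}
	dead := 7 // roachprod index of the node that was stopped

	// Looking up the roachprod index directly reports the dead node as present.
	fmt.Println(inGossip(gossipIDs, dead))
	// Looking up the node ID it was actually assigned reports it as gone.
	fmt.Println(inGossip(gossipIDs, assigned[dead]))
}
```

This prints `true` and then `false`: a check keyed on the roachprod index keeps seeing ID 7 in gossip even though the stopped process (node ID 3) has dropped out.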

@irfansharif
Contributor

See #51497 (comment); fixed by #51790.

@irfansharif irfansharif self-assigned this Jul 23, 2020
@cockroach-teamcity
Member Author

(roachtest).gossip/chaos/nodes=9 failed on master@bfa6307c292ef4dfed4a53cb99f506e6dab26533:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/gossip/chaos/nodes=9/run_1
	gossip.go:64,gossip.go:102,gossip.go:114,gossip.go:124,test_runner.go:757: gossip did not stabilize in 20.1s

	cluster.go:1571,context.go:135,cluster.go:1560,test_runner.go:826: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2114210-1595571129-31-n9cpu4 --oneshot --ignore-empty-nodes: exit status 1
		8: 4771
		6: 4841
		5: 5171
		2: 5151
		3: 4865
		4: 4792
		7: 5174
		9: dead
		1: 5462
		Error: UNCLASSIFIED_PROBLEM: 9: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  | main.glob..func13
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1115
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:266
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1808
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (3) 3 safe details enclosed
		Wraps: (4) 9: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *safedetails.withSafeDetails (4) *errors.errorString


Artifacts: /gossip/chaos/nodes=9
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

irfansharif added a commit to irfansharif/cockroach that referenced this issue Jul 24, 2020
..and the setting of cluster settings for single node clusters.
`roachprod start --sequential` was broken in cockroachdb#51329, and the broken-ness
outlined in TODOs in cockroachdb#51790. This PR just addresses those TODOs.

Fixes cockroachdb#51497
Fixes cockroachdb#51721
Fixes cockroachdb#51738
Fixes cockroachdb#51768
Fixes cockroachdb#51769
Fixes cockroachdb#51776

Release note: None
craig bot pushed a commit that referenced this issue Jul 25, 2020
51893: roachprod: fixup `roachprod --sequential` r=irfansharif a=irfansharif

..and the setting of cluster settings for single node clusters.
`roachprod start --sequential` was broken in #51329, and the broken-ness
outlined in TODOs in #51790. This PR just addresses those TODOs.

Fixes #51497
Fixes #51721
Fixes #51738
Fixes #51768
Fixes #51769
Fixes #51776

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
@craig craig bot closed this as completed in 6d6706b Jul 25, 2020
5 participants