roachtest: decommission/randomized failed #55581

Closed
cockroach-teamcity opened this issue Oct 15, 2020 · 12 comments · Fixed by #55809
Labels
branch-master: Failures and bugs on the master branch.
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.

Comments

@cockroach-teamcity
Member

(roachtest).decommission/randomized failed on master@80e7127197f76ef35c1f6ec3984c4d49d4afde7f:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 1: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) secondary error attachment
		  | 4: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 4: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (6) 3: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *secondary.withSecondaryError (5) *withstack.withStack (6) *errutil.leafError

Artifacts: /decommission/randomized

See this test on roachdash
powered by pkg/cmd/internal/issues

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Oct 15, 2020
cockroach-teamcity added this to the 20.2 milestone Oct 15, 2020
@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@47044feed11ec0c0390989bf8f44e777ec3eb00d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@b1abf9c8dfb5880fce69dfc7240e593f077bf77c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@d752fa2bd9afad255e8c655de9c7edc6dad14486:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

	decommission.go:577,decommission.go:61,test_runner.go:755: decommission failed: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2372996-1603001519-36-n6cpu4:6 -- ./cockroach node decommission --wait=all --format=csv 5 6 --insecure --port={pgport:6}: signal: killed

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@ab503e2fd708541e5e9ebb9a6f2651eda506f2ef:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	decommission.go:682,retry.go:172,decommission.go:678,decommission.go:61,test_runner.go:755: node-ls failed: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2374262-1603088202-36-n6cpu4:2 -- ./cockroach node ls --format=csv --insecure --port={pgport:2}: exit status 20

	cluster.go:1657,context.go:135,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2374262-1603088202-36-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 5: 4766
		2: 4932
		4: 4464
		1: 4534
		3: dead
		6: 4600
		Error: UNCLASSIFIED_PROBLEM: 3: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 3: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

@tbg
Member

tbg commented Oct 20, 2020

07:42:04 cluster.go:559: > /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2374262-1603088202-36-n6cpu4:2 -- ./cockroach node ls --format=csv --insecure --port={pgport:2}
07:42:19 decommission.go:1052: ERROR: cannot dial server.
Is the server running?
If the server is running, check --host client-side and --advertise server-side.

read tcp 127.0.0.1:60562 -> 127.0.0.1:26257: i/o timeout

This also fails as a result of #55286. The test runs CLI commands against decommissioned nodes, but those nodes are now locked out of the cluster entirely, so the commands time out.

Will need to rework the test harness to always pick "up" nodes for the commands.

tbg self-assigned this Oct 20, 2020
@tbg
Member

tbg commented Oct 20, 2020

Some issues in this test.

  1. Start decommissioning nX from nY. This uses --wait=none, so typically it doesn't mark nX as decommissioned. However, this is a six-node cluster, so in theory it's possible, which presents one possible issue.
  2. Recommission nX. Note that at this point nX may not have any replicas, though it's not marked as decommissioned yet (except in the problematic case above, but ignore that for now).
  3. Attempt to decommission all nodes with --wait=none. Note that nX may not have any replicas, but the way the code is written it counts the replicas across all of the target nodes and only decommissions individual nodes when the total drops to zero. So this is safe.
  4. Pick nA, nB != nA, and runNode. Decommission nA and nB through runNode. The problem is that runNode may well be nA or nB, in which case it'll be a node that will lose access to the cluster.

We're seeing issue 4) here, which is straightforward to fix. For the issue in 1), I'll slightly derandomize the test to always target n1 in that step, which is guaranteed to have replicas.

Will send PR.

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@1d46df77dbd8721cccf508fb5ed498f3de78022c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

	decommission.go:577,decommission.go:61,test_runner.go:755: decommission failed: cluster.RunWithBuffer: context canceled

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@40b7942025de0d8e347d25451611ad2c20267d48:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@4702643dd0755a48365a115c970415fbb5023ad2:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	decommission.go:682,retry.go:172,decommission.go:678,decommission.go:61,test_runner.go:755: node-ls failed: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2387928-1603433032-37-n6cpu4:4 -- ./cockroach node ls --format=csv --insecure --port={pgport:4}: exit status 20

	cluster.go:1657,context.go:140,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2387928-1603433032-37-n6cpu4 --oneshot --ignore-empty-nodes: exit status 1 2: 4277
		5: 4598
		4: 4177
		3: 4496
		1: 3894
		6: dead
		Error: UNCLASSIFIED_PROBLEM: 6: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 6: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@8aceac3c99c3addece3a9ef9af04cc74715cdb37:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

	cluster.go:2335,cluster.go:2366,cluster.go:2470,decommission.go:624,decommission.go:630,decommission.go:61,test_runner.go:755: failed to get pgurl for nodes: teamcity-2392040-1603605966-38-n6cpu4:1: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod pgurl --external teamcity-2392040-1603605966-38-n6cpu4:1 returned: context canceled
		(1) attached stack trace
		  -- stack trace:
		  | main.(*cluster).pgURL
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2335
		  | main.(*cluster).ExternalPGUrl
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2366
		  | main.(*cluster).Conn
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2470
		  | main.runDecommissionRandomized.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:624
		  | main.runDecommissionRandomized
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:630
		  | main.registerDecommission.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/decommission.go:61
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:755
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (2) failed to get pgurl for nodes: teamcity-2392040-1603605966-38-n6cpu4:1
		Wraps: (3) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod pgurl --external teamcity-2392040-1603605966-38-n6cpu4:1 returned
		  | stderr:
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | context canceled
		  | (1) context canceled
		  | Error types: (1) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *main.withCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@6184870a438ae34afbcf29dda5452345dc7587d3:

		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 2: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (3) secondary error attachment
		  | 4: dead
		  | (1) attached stack trace
		  |   -- stack trace:
		  |   | main.glob..func14
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  |   | main.wrap.func1
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  |   | github.com/spf13/cobra.(*Command).execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  |   | github.com/spf13/cobra.(*Command).ExecuteC
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  |   | github.com/spf13/cobra.(*Command).Execute
		  |   | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  |   | main.main
		  |   | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  |   | runtime.main
		  |   | 	/usr/local/go/src/runtime/proc.go:203
		  |   | runtime.goexit
		  |   | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		  | Wraps: (2) 4: dead
		  | Error types: (1) *withstack.withStack (2) *errutil.leafError
		Wraps: (4) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (5) 6: dead
		Error types: (1) errors.Unclassified (2) *secondary.withSecondaryError (3) *secondary.withSecondaryError (4) *withstack.withStack (5) *errutil.leafError

@cockroach-teamcity
Member Author

(roachtest).decommission/randomized failed on master@5e3c201595fc33b0d120057c61413195716f811d:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
	test_runner.go:814: test timed out (10m0s)

craig bot pushed a commit that referenced this issue Oct 27, 2020
55560: backupccl: avoid div-by-zero crash on failed node count r=dt a=dt

We've seen a report of a node that crashed due to a divide-by-zero
hit during metrics collection, specifically when computing the
throughput-per-node by dividing the backup size by node count.

Since this is only now used for that metric, make a failure to count
nodes a warning only for release builds (and fallback to 1), and make
any error while counting, or not counting to more than 0, a returned
error in non-release builds.

Release note (bug fix): avoid crashing when BACKUP is unable to count the total nodes in the cluster.

55809: roachtest: fix decommission/randomize r=irfansharif a=tbg

The test could end up using fully decommissioned nodes for cli commands,
which does not work as of #55286.

Fixes #55581.

Release note: None


56019: lexbase: pin `reserved_keywords.go` within Bazel r=irfansharif a=irfansharif

It's an auto-generated file that bazel doesn't yet know how to construct
within the sandbox. Before this PR `make bazel-generate` would show
spurious diffs on a clean checkout without this file present. Now it
no longer will.

Unfortunately it also means that successful bazel builds require
`reserved_keywords.go` being pre-generated ahead of time (it's not
checked-in into the repo). Once Bazel is taught to generate this file
however, this will no longer be the case. It was just something that I
missed in #55687.

Release note: None

Co-authored-by: David Taylor <tinystatemachine@gmail.com>
Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
craig bot closed this as completed in a00ffe5 Oct 27, 2020
tbg added a commit to tbg/cockroach that referenced this issue Nov 4, 2020
The test could end up using fully decommissioned nodes for cli commands,
which does not work as of cockroachdb#55286.
Additionally, decommissioned nodes now become non-live after a short
while, so various cli output checks had to be adjusted.

Fixes cockroachdb#55581.

Release note: None