roachtest: jepsen/g2/split failed #51052

Closed
cockroach-teamcity opened this issue Jul 7, 2020 · 11 comments
Labels: branch-master (failures and bugs on the master branch), C-test-failure (broken test, automatically or manually discovered), O-roachtest, O-robot (originated from a bot)

Comments

@cockroach-teamcity
Member

(roachtest).jepsen/g2/split failed on master@9304ecd70e9f3ba4cb16b5443a10b4e17d7baee0:

		  | main.runJepsen.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:159
		  | main.runJepsen.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:180
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_094027.424_n6_bash
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2068260-1594103442-43-n6cpu4:6 -- bash -e -c "\
		  | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |  ~/lein run test \
		  |    --tarball file://${PWD}/cockroach.tgz \
		  |    --username ${USER} \
		  |    --ssh-private-key ~/.ssh/id_rsa \
		  |    --os ubuntu \
		  |    --time-limit 300 \
		  |    --concurrency 30 \
		  |    --recovery-time 25 \
		  |    --test-count 1 \
		  |    -n 10.128.0.37 -n 10.128.0.25 -n 10.128.0.71 -n 10.128.0.9 -n 10.128.0.45 \
		  |    --test g2 --nemesis split \
		  | > invoke.log 2>&1 \
		  | " returned
		  | stderr:
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 6. Command with error:
		  |   | ```
		  |   | bash -e -c "\
		  |   | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |   |  ~/lein run test \
		  |   |    --tarball file://${PWD}/cockroach.tgz \
		  |   |    --username ${USER} \
		  |   |    --ssh-private-key ~/.ssh/id_rsa \
		  |   |    --os ubuntu \
		  |   |    --time-limit 300 \
		  |   |    --concurrency 30 \
		  |   |    --recovery-time 25 \
		  |   |    --test-count 1 \
		  |   |    -n 10.128.0.37 -n 10.128.0.25 -n 10.128.0.71 -n 10.128.0.9 -n 10.128.0.45 \
		  |   |    --test g2 --nemesis split \
		  |   | > invoke.log 2>&1 \
		  |   | "
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 20
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

Artifacts: /jepsen/g2/split

See this test on roachdash
powered by pkg/cmd/internal/issues

cockroach-teamcity added the branch-master, C-test-failure, O-roachtest, O-robot, and release-blocker labels Jul 7, 2020
cockroach-teamcity added this to the 20.2 milestone Jul 7, 2020
@darinpp
Contributor

darinpp commented Jul 7, 2020

The issue seems to be a failing import:

Unsuccessful job 570272557671153665 of type IMPORT, description IMPORT TABLE tpcc.public."order" (o_id INT8 NOT NULL, o_d_id INT8 NOT NULL, o_w_id INT8 NOT NULL, o_c_id INT8, o_entry_d TIMESTAMP, o_carrier_id INT8, o_ol_cnt INT8, o_all_local INT8, PRIMARY KEY (o_w_id, o_d_id, o_id DESC), CONSTRAINT order_idx UNIQUE (o_w_id, o_d_id, o_c_id, o_id DESC) STORING (o_entry_d, o_carrier_id)) CSV DATA ('workload:///csv/tpcc/order?fks=true&interleaved=false&row-end=1500000&row-start=0&seed=1&version=2.1.0&warehouses=200', 'workload:///csv/tpcc/order?fks=true&interleaved=false&row-end=3000000&row-start=1500000&seed=1&version=2.1.0&warehouses=200', 'workload:///csv/tpcc/order?fks=true&interleaved=false&row-end=4500000&row-start=3000000&seed=1&version=2.1.0&warehouses=200', 'workload:///csv/tpcc/order?fks=true&interleaved=false&row-end=6000000&row-start=4500000&seed=1&version=2.1.0&warehouses=200') WITH "nullif" = 'NULL', status failed, error version mismatch in flow request: 30; this node accepts 27 through 28, coordinator 3

@dt
Member

dt commented Jul 13, 2020

The error message indicates an IMPORT was attempted in a mixed version cluster where some of the nodes were too old to participate. Whatever is running the IMPORT should ensure it is done before any nodes are upgraded.
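The check behind that error message can be pictured as a simple range test. This is only a minimal sketch, assuming each node advertises a minimum and maximum flow version it accepts; the function and parameter names are hypothetical and are not CockroachDB's actual API. The numbers come from the quoted error.

```python
def check_flow_version(request_version: int, min_accepted: int, max_accepted: int) -> None:
    """Raise if the request's version falls outside the range this node accepts."""
    if not (min_accepted <= request_version <= max_accepted):
        raise ValueError(
            f"version mismatch in flow request: {request_version}; "
            f"this node accepts {min_accepted} through {max_accepted}"
        )


# Numbers taken from the error above: the coordinator sent version 30,
# while the receiving node only accepts 27 through 28.
try:
    check_flow_version(30, 27, 28)
except ValueError as err:
    print(err)
```

Under this reading, an older node rejects any flow from a newer coordinator, which is why the job can only succeed while all participating nodes share a compatible version.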

@dt
Member

dt commented Jul 13, 2020

Wait, actually, no. I think the line you quoted, @darinpp, is from some other test (probably jobs/mixed-version). That's a tpcc import in the error you pasted, but this is a jepsen/g2/split issue.

dt assigned darinpp and unassigned dt Jul 13, 2020
@knz
Contributor

knz commented Jul 21, 2020

@darinpp in general Jepsen failures are to be triaged by fetching the failure-logs.tbg archive in the artifact dir, then looking at the file invoke.log inside the archive.
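That triage step could be scripted roughly as follows. This is a sketch under assumptions: it treats the archive as a (possibly compressed) tar file and looks for any member whose base name is invoke.log. The helper name `read_invoke_log` and the local path are hypothetical.

```python
import tarfile


def read_invoke_log(archive_path: str) -> str:
    """Return the text of the first member named invoke.log in the archive."""
    # "r:*" lets tarfile detect the compression (gzip, bzip2, xz) on its own.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.rsplit("/", 1)[-1] == "invoke.log":
                extracted = archive.extractfile(member)
                return extracted.read().decode("utf-8", errors="replace")
    raise FileNotFoundError(f"no invoke.log found in {archive_path}")
```

Calling `read_invoke_log("failure-logs.tbg")` on a downloaded copy of the archive would return the log text, which can then be searched for the checker's verdict.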

In this case, the G2 test found an actual Jepsen error:

 {:valid? false,
  :key-count 34184,
  :legal-count 2909,
  :illegal-count 5,
  :illegal {29053 2, 30653 2, 31104 2, 32719 2, 32975 2}},
 :valid? false}
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
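To pull the offending keys out of an excerpt like the one above, a small regex-based parser is enough. This is a rough sketch, not a full EDN reader, and the function name is hypothetical. Reading each value in the `:illegal` map as the number of inserts that succeeded for that key is an assumption about the checker's output.

```python
import re


def parse_g2_result(text: str) -> dict:
    """Extract validity and the illegal-key map from a G2 checker excerpt."""
    valid = re.search(r":valid\?\s+(true|false)", text)
    # ":illegal-count" is skipped because the pattern requires whitespace
    # directly after ":illegal".
    illegal = re.search(r":illegal\s+\{([^}]*)\}", text)
    pairs = re.findall(r"(\d+)\s+(\d+)", illegal.group(1)) if illegal else []
    return {
        "valid": valid is not None and valid.group(1) == "true",
        "illegal": {int(key): int(count) for key, count in pairs},
    }


excerpt = """
 {:valid? false,
  :key-count 34184,
  :legal-count 2909,
  :illegal-count 5,
  :illegal {29053 2, 30653 2, 31104 2, 32719 2, 32975 2}},
 :valid? false}
"""
result = parse_g2_result(excerpt)
print(result["valid"], sorted(result["illegal"]))
```

The sorted key list gives a starting point for searching the rest of invoke.log for the operations that touched those keys.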

tbg assigned tbg and unassigned darinpp Jul 21, 2020
@cockroach-teamcity
Member Author

(roachtest).jepsen/g2/split failed on master@e9a4f83e3eee59510f97db2c6e0df9b57cf6b944:

		  | main.runJepsen.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:159
		  | main.runJepsen.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:180
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_083635.792_n6_bash
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2107908-1595398673-61-n6cpu4:6 -- bash -e -c "\
		  | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |  ~/lein run test \
		  |    --tarball file://${PWD}/cockroach.tgz \
		  |    --username ${USER} \
		  |    --ssh-private-key ~/.ssh/id_rsa \
		  |    --os ubuntu \
		  |    --time-limit 300 \
		  |    --concurrency 30 \
		  |    --recovery-time 25 \
		  |    --test-count 1 \
		  |    -n 10.128.0.244 -n 10.128.0.246 -n 10.128.0.230 -n 10.128.0.241 -n 10.128.0.231 \
		  |    --test g2 --nemesis split \
		  | > invoke.log 2>&1 \
		  | " returned
		  | stderr:
		  | Error: SSH_PROBLEM: exit status 255
		  | (1) SSH_PROBLEM
		  | Wraps: (2) Node 6. Command with error:
		  |   | ```
		  |   | bash -e -c "\
		  |   | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |   |  ~/lein run test \
		  |   |    --tarball file://${PWD}/cockroach.tgz \
		  |   |    --username ${USER} \
		  |   |    --ssh-private-key ~/.ssh/id_rsa \
		  |   |    --os ubuntu \
		  |   |    --time-limit 300 \
		  |   |    --concurrency 30 \
		  |   |    --recovery-time 25 \
		  |   |    --test-count 1 \
		  |   |    -n 10.128.0.244 -n 10.128.0.246 -n 10.128.0.230 -n 10.128.0.241 -n 10.128.0.231 \
		  |   |    --test g2 --nemesis split \
		  |   | > invoke.log 2>&1 \
		  |   | "
		  |   | ```
		  | Wraps: (3) exit status 255
		  | Error types: (1) errors.SSH (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 10
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

Artifacts: /jepsen/g2/split

See this test on roachdash
powered by pkg/cmd/internal/issues

@knz
Contributor

knz commented Jul 22, 2020

Last failure is different and caused by #51739.

You'll need to fix that issue (together with @irfansharif) before you can stress the test.
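One quick way to tell the reports in this thread apart is the `Error:` line in the roachprod stderr: the first report shows COMMAND_PROBLEM with exit status 1, the later ones SSH_PROBLEM with exit status 255 (the status ssh itself uses when it hits an error). A minimal sketch for extracting that pair, with a hypothetical helper name:

```python
import re

ERROR_LINE = re.compile(r"Error: (\w+): exit status (\d+)")


def classify_failure(stderr: str):
    """Return (error_class, exit_status) from the first 'Error:' line, or None."""
    match = ERROR_LINE.search(stderr)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


first = classify_failure("Error: COMMAND_PROBLEM: exit status 1")
latest = classify_failure("Error: SSH_PROBLEM: exit status 255")
print(first, latest)
```

This only separates infrastructure-level failures from failures of the command itself; the actual Jepsen verdict still has to be read from invoke.log.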

irfansharif added a commit to irfansharif/jepsen that referenced this issue Jul 23, 2020
@cockroach-teamcity
Member Author

(roachtest).jepsen/g2/split failed on master@b8a50cc4d062293915969cdc83e3ec4d057cede5:

		  | main.runJepsen.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:159
		  | main.runJepsen.func3
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/jepsen.go:180
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1373
		Wraps: (2) 2 safe details enclosed
		Wraps: (3) output in run_082658.645_n6_bash
		Wraps: (4) /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-2111252-1595484018-59-n6cpu4:6 -- bash -e -c "\
		  | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |  ~/lein run test \
		  |    --tarball file://${PWD}/cockroach.tgz \
		  |    --username ${USER} \
		  |    --ssh-private-key ~/.ssh/id_rsa \
		  |    --os ubuntu \
		  |    --time-limit 300 \
		  |    --concurrency 30 \
		  |    --recovery-time 25 \
		  |    --test-count 1 \
		  |    -n 10.128.0.54 -n 10.128.0.85 -n 10.128.0.134 -n 10.128.0.127 -n 10.128.0.77 \
		  |    --test g2 --nemesis split \
		  | > invoke.log 2>&1 \
		  | " returned
		  | stderr:
		  | Error: SSH_PROBLEM: exit status 255
		  | (1) SSH_PROBLEM
		  | Wraps: (2) Node 6. Command with error:
		  |   | ```
		  |   | bash -e -c "\
		  |   | cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
		  |   |  ~/lein run test \
		  |   |    --tarball file://${PWD}/cockroach.tgz \
		  |   |    --username ${USER} \
		  |   |    --ssh-private-key ~/.ssh/id_rsa \
		  |   |    --os ubuntu \
		  |   |    --time-limit 300 \
		  |   |    --concurrency 30 \
		  |   |    --recovery-time 25 \
		  |   |    --test-count 1 \
		  |   |    -n 10.128.0.54 -n 10.128.0.85 -n 10.128.0.134 -n 10.128.0.127 -n 10.128.0.77 \
		  |   |    --test g2 --nemesis split \
		  |   | > invoke.log 2>&1 \
		  |   | "
		  |   | ```
		  | Wraps: (3) exit status 255
		  | Error types: (1) errors.SSH (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (5) exit status 10
		Error types: (1) *withstack.withStack (2) *safedetails.withSafeDetails (3) *errutil.withMessage (4) *main.withCommandDetails (5) *exec.ExitError

Artifacts: /jepsen/g2/split

See this test on roachdash
powered by pkg/cmd/internal/issues

irfansharif added a commit to irfansharif/jepsen that referenced this issue Jul 23, 2020
@tbg
Member

tbg commented Jul 28, 2020

Running 500 iterations of this test, starting now.

@tbg
Member

tbg commented Jul 29, 2020

I ran 500 iterations via

#!/bin/bash
set -euo pipefail
rm -rf cockroach-*
make bin/{roachtest,roachprod}

rm -rf bin.docker_amd64/
./build/builder.sh make bin/workload
./build/builder.sh mkrelease

./bin/roachtest run --user tobias --workload bin.docker_amd64/workload --cockroach cockroach-linux-2.6.32-gnu-amd64 --cpu-quota 1024 --count 500 jepsen/g2/split 2>&1 | tee roachstress.log

and they all passed.
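For a sense of what 500 clean runs can and cannot rule out, here is a back-of-the-envelope bound. It is a sketch that assumes the runs are independent and share one fixed per-run failure probability, which may not hold for a bug that depends on a particular build or environment.

```python
def failure_rate_upper_bound(passes: int, confidence: float = 0.95) -> float:
    """Largest per-run failure probability still consistent with `passes`
    consecutive successes at the given confidence level.

    Solves (1 - p) ** passes = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / passes)


bound = failure_rate_upper_bound(500)
print(f"per-run failure rate is below {bound:.4%} at 95% confidence")
```

For large run counts this is close to the "rule of three" approximation of 3 divided by the number of passes, so a rare failure can easily survive a stress run of this size.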

@tbg
Member

tbg commented Aug 27, 2020

Ugh, the artifacts are all gone. @nvanbenschoten what do we do with this? Try harder to repro or do another push after the next natural repro?

knz removed the release-blocker label Aug 31, 2020
@andreimatei
Contributor

I think it's time to close

aliher1911 pushed a commit to aliher1911/jepsen that referenced this issue Dec 24, 2021