
roachtest: tpccbench/nodes=3/cpu=4 failed #55542

Closed
cockroach-teamcity opened this issue Oct 14, 2020 · 1 comment · Fixed by #55759
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity

(roachtest).tpccbench/nodes=3/cpu=4 failed on release-19.2@3e9adba8b62663b9e3521faad03a67f91fb3fc7a:

		  | W201014 08:03:47.241164 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:49.220047 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:51.311319 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:53.513543 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:55.387360 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:57.157097 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:59.347102 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:04:01.466075 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:04:03.451005 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | E201014 08:04:03.730917 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | Error: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=1000 --active-warehouses=615 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-3} --histograms=perf/warehouses=615/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError
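The excerpt above shows the workload binary's retry behavior: it retries roughly every two seconds and gives up once an hour has passed, reporting the last failure. A minimal Go sketch of that pattern (a hypothetical simplification, not the actual workload/cli/run.go code):

```go
package workloadutil

import (
	"context"
	"fmt"
	"log"
	"time"
)

// createLoadGenerator retries init until it succeeds or a one-hour
// deadline passes, mirroring the W-lines and the final E-line above.
// Hypothetical simplification; not the actual workload/cli/run.go code.
func createLoadGenerator(ctx context.Context, init func() error) error {
	const maxWait = time.Hour
	deadline := time.Now().Add(maxWait)
	for {
		lastErr := init()
		if lastErr == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("it's been more than %s since we started trying, "+
				"giving up; last failure: %w", maxWait, lastErr)
		}
		log.Printf("retrying after error while creating load: %v", lastErr)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // roughly the cadence in the log
		}
	}
}
```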

	cluster.go:1657,context.go:135,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2363106-1602654685-06-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1
		4: skipped
		3: 25462
		1: 28491
		2: dead
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=3/cpu=4

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-19.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 14, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Oct 14, 2020
tbg commented Oct 20, 2020

F201014 06:50:04.946609 183 server/server.go:253  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (2 of 4 are within the offset)

How can this be so bad? There's definitely something wrong across our roachprod infrastructure. Our weekly tests are virtually guaranteed to be hit by this. What can we do to avoid this?

I recently set up a machine and poked around. We're using ntp and pointing it at Google's internal metadata service. What could go wrong?
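For reference, the fatal message above fires on a quorum check over measured peer clock offsets: if at most half of the known nodes are within the configured max offset (500ms here), the node shuts down. A hedged sketch of that check (not CockroachDB's actual implementation, which lives in the RPC-layer remote clock monitor):

```go
package clockcheck

import (
	"fmt"
	"time"
)

// verifyClockOffset errs when at most half of the measured offsets fall
// within maxOffset; with offsets where only 2 of 4 qualify, it reproduces
// the "(2 of 4 are within the offset)" failure above.
// Hypothetical sketch, not CockroachDB's actual implementation.
func verifyClockOffset(offsets []time.Duration, maxOffset time.Duration) error {
	healthy := 0
	for _, off := range offsets {
		if off < 0 {
			off = -off
		}
		if off <= maxOffset {
			healthy++
		}
	}
	if healthy*2 <= len(offsets) {
		return fmt.Errorf(
			"clock synchronization error: this node is more than %s away from at least "+
				"half of the known nodes (%d of %d are within the offset)",
			maxOffset, healthy, len(offsets))
	}
	return nil
}
```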

craig bot pushed a commit that referenced this issue Oct 20, 2020
55459: kv: increase defaultRaftLogTruncationThreshold to 16MB r=nvanbenschoten a=nvanbenschoten

In v20.1, we increased the default max range size from 64MB to 512MB. However, we only doubled (258b965) the default raft log truncation threshold. This has the potential to exacerbate issues like #37906, because each range will now handle on average 8 times more write load, and will therefore be forced to truncate its log on average 4 times as frequently as it had previously.

This commit bumps the default raft log truncation to 16MB. It doesn't go as far as scaling the log truncation threshold by the max range size (either automatically or with a fixed 32MB default) because the cost of an individual log truncation is proportional to the log size and we don't want to make each log truncation significantly more disruptive. 16MB seems like a good middle ground and we know that we have customers already successfully running in production with this value.
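(For concreteness, under the assumption that the pre-v20.1 default was 4MB, which 258b965 doubled to 8MB: the max range size grew 512MB / 64MB = 8x while the truncation threshold grew only 2x, yielding the 8 / 2 = 4x truncation frequency above. The 16MB in this commit halves that again, while a threshold scaled fully with the range size would have been 32MB.)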

55660: sql: add constraint name to constraint error r=yuzefovich a=alex-berger

Add the constraint name to errors for unique constraint, check constraint, and foreign key constraint violations. Those constraint names are then propagated over the PostgreSQL wire protocol and will show up, for example, in JDBC exceptions (org.postgresql.util.PSQLException#getServerErrorMessage().getConstraint()). This commit improves PostgreSQL compatibility.

Release note: Improve PostgreSQL wire protocol compatibility by adding the constraint name to SQL errors.
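On the client side the name arrives in the wire-protocol error's constraint field. A hedged Go sketch using github.com/lib/pq (the connection string and the users table with its unique email constraint are hypothetical):

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"

	"github.com/lib/pq" // registers the "postgres" driver and exposes *pq.Error
)

func main() {
	// Hypothetical local cluster and schema, for illustration only.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	_, err = db.Exec(`INSERT INTO users (email) VALUES ('dup@example.com')`)
	var pqErr *pq.Error
	if errors.As(err, &pqErr) {
		// With this change the server populates the constraint name,
		// e.g. "users_email_key" for a unique-constraint violation.
		fmt.Println("violated constraint:", pqErr.Constraint)
	}
}
```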

55759: roachtest: don't expect node deaths in tpccbench r=nvanbenschoten a=nvanbenschoten

Closes #55542.

This was causing node deaths to be initially ignored in test
failures like #55542. The monitor was not watching when the
cluster was restarted, so there was no need to inform it of
the restart. Because of this, the monitor had an accumulated
"death quota" that allowed it to ignore the first few deaths
during the actual run of the test.

I tested this 5 times without issue.
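A minimal Go sketch of the accumulated "death quota" described above (hypothetical types and method names, not the actual roachtest monitor):

```go
package main

import (
	"fmt"
	"sync"
)

// monitor banks one tolerated death per ExpectDeath call. If a test banks
// deaths for a restart the monitor never observed, the leftover quota
// silently absorbs real deaths later in the run, which is the bug fixed
// here. Hypothetical sketch, not the actual roachtest monitor.
type monitor struct {
	mu             sync.Mutex
	expectedDeaths int
}

func (m *monitor) ExpectDeath() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.expectedDeaths++
}

func (m *monitor) onNodeDeath(node int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.expectedDeaths > 0 {
		m.expectedDeaths-- // quota consumed: the death is ignored
		return nil
	}
	return fmt.Errorf("unexpected node death: n%d", node)
}

func main() {
	m := &monitor{}
	m.ExpectDeath()               // banked for a restart the monitor never watched
	fmt.Println(m.onNodeDeath(2)) // <nil>: a genuine death is silently absorbed
}
```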

55779: sql: disallow moving tables with user-defined types into a different DB r=lucy-zhang a=lucy-zhang

We disallow columns using user-defined types in a different database
from the table at creation time. However, we have a loophole where it's
possible to move the table to a different database using `RENAME` and
thus create a cross-database reference. This breaks backup/restore and
is generally unintended. This PR adds a check to disallow this.

Partially addresses #55709.
Related to #55772.

Release note (bug fix): Tables can no longer be moved to a different
database using `RENAME` if they have columns using user-defined types
(enums).

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: alex.berger@nexiot.ch <alex.berger@nexiot.ch>
Co-authored-by: Lucy Zhang <lucy@cockroachlabs.com>
craig bot closed this as completed in ba0f7c1 Oct 20, 2020