
roachtest: tpccbench/nodes=3/cpu=4 failed #55542

Closed
cockroach-teamcity opened this issue Oct 14, 2020 · 1 comment · Fixed by #55759
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
Milestone

Comments

@cockroach-teamcity

(roachtest).tpccbench/nodes=3/cpu=4 failed on release-19.2@3e9adba8b62663b9e3521faad03a67f91fb3fc7a:

		  | W201014 08:03:47.241164 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:49.220047 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:51.311319 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:53.513543 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:55.387360 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:57.157097 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:03:59.347102 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:04:01.466075 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | W201014 08:04:03.451005 1 workload/cli/run.go:365  retrying after error while creating load: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | E201014 08:04:03.730917 1 workload/cli/run.go:379  Attempt to create load generator failed. It's been more than 1h0m0s since we started trying to create the load generator so we're giving up. Last failure: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | Error: failed to initialize the load generator: could not scatter ranges: Couldn't exec "ALTER TABLE warehouse SCATTER": dial tcp 10.128.0.41:26257: connect: connection refused
		  | Error: COMMAND_PROBLEM: exit status 1
		  | (1) COMMAND_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ```
		  |   | ./workload run tpcc --warehouses=1000 --active-warehouses=615 --tolerate-errors --scatter --ramp=5m0s --duration=10m0s {pgurl:1-3} --histograms=perf/warehouses=615/stats.json
		  |   | ```
		  | Wraps: (3) exit status 1
		  | Error types: (1) errors.Cmd (2) *hintdetail.withDetail (3) *exec.ExitError
		  |
		  | stdout:
		Wraps: (10) exit status 20
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) *withstack.withStack (8) *errutil.withPrefix (9) *main.withCommandDetails (10) *exec.ExitError
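The excerpt above shows the workload binary's retry behavior: it retries roughly every two seconds and gives up once an hour has passed, reporting the last failure. A minimal Go sketch of that pattern (a hypothetical simplification, not the actual workload/cli/run.go code):

```go
package workloadutil

import (
	"context"
	"fmt"
	"log"
	"time"
)

// createLoadGenerator retries init until it succeeds or a one-hour
// deadline passes, mirroring the W-lines and the final E-line above.
// Hypothetical simplification; not the actual workload/cli/run.go code.
func createLoadGenerator(ctx context.Context, init func() error) error {
	const maxWait = time.Hour
	deadline := time.Now().Add(maxWait)
	for {
		lastErr := init()
		if lastErr == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("it's been more than %s since we started trying, "+
				"giving up; last failure: %w", maxWait, lastErr)
		}
		log.Printf("retrying after error while creating load: %v", lastErr)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // roughly the cadence in the log
		}
	}
}
```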

	cluster.go:1657,context.go:135,cluster.go:1646,test_runner.go:836: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-2363106-1602654685-06-n4cpu4 --oneshot --ignore-empty-nodes: exit status 1
		4: skipped
		3: 25462
		1: 28491
		2: dead
		Error: UNCLASSIFIED_PROBLEM: 2: dead
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1143
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:267
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1839
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:203
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1357
		Wraps: (3) 2: dead
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError


Artifacts: /tpccbench/nodes=3/cpu=4

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-release-19.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Oct 14, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.2 milestone Oct 14, 2020
tbg commented Oct 20, 2020

F201014 06:50:04.946609 183 server/server.go:253  [n2] clock synchronization error: this node is more than 500ms away from at least half of the known nodes (2 of 4 are within the offset)

How can this be so bad? There's definitely something wrong across our roachprod infrastructure. Our weekly tests are virtually guaranteed to be hit by this. What can we do to avoid this?

I recently set up a machine and poked around. We're using ntp and pointing it at Google's internal metadata service. What could go wrong?
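For reference, the fatal message above fires on a quorum check over measured peer clock offsets: if at most half of the known nodes are within the configured max offset (500ms here), the node shuts down. A hedged sketch of that check (not CockroachDB's actual implementation, which lives in the RPC-layer remote clock monitor):

```go
package clockcheck

import (
	"fmt"
	"time"
)

// verifyClockOffset errs when at most half of the measured offsets fall
// within maxOffset; with offsets where only 2 of 4 qualify, it reproduces
// the "(2 of 4 are within the offset)" failure above.
// Hypothetical sketch, not CockroachDB's actual implementation.
func verifyClockOffset(offsets []time.Duration, maxOffset time.Duration) error {
	healthy := 0
	for _, off := range offsets {
		if off < 0 {
			off = -off
		}
		if off <= maxOffset {
			healthy++
		}
	}
	if healthy*2 <= len(offsets) {
		return fmt.Errorf(
			"clock synchronization error: this node is more than %s away from at least "+
				"half of the known nodes (%d of %d are within the offset)",
			maxOffset, healthy, len(offsets))
	}
	return nil
}
```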

craig bot pushed a commit that referenced this issue Oct 20, 2020
55459: kv: increase defaultRaftLogTruncationThreshold to 16MB r=nvanbenschoten a=nvanbenschoten

In v20.1, we increased the default max range size from 64MB to 512MB. However, we only doubled (258b965) the default raft log truncation threshold. This has the potential to exacerbate issues like #37906, because each range will now handle on average 8 times more write load, and will therefore be forced to truncate its log on average 4 times as frequently as it had previously.

This commit bumps the default raft log truncation to 16MB. It doesn't go as far as scaling the log truncation threshold by the max range size (either automatically or with a fixed 32MB default) because the cost of an individual log truncation is proportional to the log size and we don't want to make each log truncation significantly more disruptive. 16MB seems like a good middle ground and we know that we have customers already successfully running in production with this value.
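(For concreteness, under the assumption that the pre-v20.1 default was 4MB, which 258b965 doubled to 8MB: the max range size grew 512MB / 64MB = 8x while the truncation threshold grew only 2x, yielding the 8 / 2 = 4x truncation frequency above. The 16MB in this commit halves that again, while a threshold scaled fully with the range size would have been 32MB.)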

55660: sql: add constraint name to constraint error r=yuzefovich a=alex-berger

Add the constraint name to errors for unique constraint, check constraint, and foreign key constraint violations. Those constraint names are then propagated over the PostgreSQL wire protocol and will show up, for example, in JDBC exceptions (org.postgresql.util.PSQLException#getServerErrorMessage().getConstraint()). This commit improves PostgreSQL compatibility.

Release note: Improve PostgreSQL wire protocol compatibility by adding the constraint name to SQL errors.
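On the client side the name arrives in the wire-protocol error's constraint field. A hedged Go sketch using github.com/lib/pq (the connection string and the users table with its unique email constraint are hypothetical):

```go
package main

import (
	"database/sql"
	"errors"
	"fmt"

	"github.com/lib/pq" // registers the "postgres" driver and exposes *pq.Error
)

func main() {
	// Hypothetical local cluster and schema, for illustration only.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	_, err = db.Exec(`INSERT INTO users (email) VALUES ('dup@example.com')`)
	var pqErr *pq.Error
	if errors.As(err, &pqErr) {
		// With this change the server populates the constraint name,
		// e.g. "users_email_key" for a unique-constraint violation.
		fmt.Println("violated constraint:", pqErr.Constraint)
	}
}
```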

55759: roachtest: don't expect node deaths in tpccbench r=nvanbenschoten a=nvanbenschoten

Closes #55542.

This was causing node deaths to be initially ignored in test
failures like #55542. The monitor was not watching when the
cluster was restarted, so there was no need to inform it of
the restart. Because of this, the monitor had an accumulated
"death quota" that allowed it to ignore the first few deaths
during the actual run of the test.

I tested this 5 times without issue.
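A minimal Go sketch of the accumulated "death quota" described above (hypothetical types and method names, not the actual roachtest monitor):

```go
package main

import (
	"fmt"
	"sync"
)

// monitor banks one tolerated death per ExpectDeath call. If a test banks
// deaths for a restart the monitor never observed, the leftover quota
// silently absorbs real deaths later in the run, which is the bug fixed
// here. Hypothetical sketch, not the actual roachtest monitor.
type monitor struct {
	mu             sync.Mutex
	expectedDeaths int
}

func (m *monitor) ExpectDeath() {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.expectedDeaths++
}

func (m *monitor) onNodeDeath(node int) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.expectedDeaths > 0 {
		m.expectedDeaths-- // quota consumed: the death is ignored
		return nil
	}
	return fmt.Errorf("unexpected node death: n%d", node)
}

func main() {
	m := &monitor{}
	m.ExpectDeath()               // banked for a restart the monitor never watched
	fmt.Println(m.onNodeDeath(2)) // <nil>: a genuine death is silently absorbed
}
```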

55779: sql: disallow moving tables with user-defined types into a different DB r=lucy-zhang a=lucy-zhang

We disallow columns using user-defined types in a different database
from the table at creation time. However, we have a loophole where it's
possible to move the table to a different database using `RENAME` and
thus create a cross-database reference. This breaks backup/restore and
is generally unintended. This PR adds a check to disallow this.

Partially addresses #55709.
Related to #55772.

Release note (bug fix): Tables can no longer be moved to a different
database using `RENAME` if they have columns using user-defined types
(enums).

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: alex.berger@nexiot.ch <alex.berger@nexiot.ch>
Co-authored-by: Lucy Zhang <lucy@cockroachlabs.com>
craig bot closed this as completed in ba0f7c1 Oct 20, 2020