roachtest: schemachange/bulkingest failed [connection refused but no CRDB error] #114501

Closed
cockroach-teamcity opened this issue Nov 15, 2023 · 6 comments
Labels
branch-release-23.2: Used to mark GA and release blockers, technical advisories, and bugs for 23.2
C-test-failure: Broken test (automatically or manually discovered).
O-roachtest
O-robot: Originated from a bot.
release-blocker: Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked.
T-testeng: TestEng Team
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Nov 15, 2023

roachtest.schemachange/bulkingest failed with artifacts on release-23.2 @ d48b4f943e00d8f10b3b8aaf868da4c560bdae57:

(test_runner.go:1113).runTest: test timed out (2h0m0s)
(schemachange.go:421).2: dial tcp 34.73.1.236:26257: connect: connection refused
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/schemachange/bulkingest/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/sql-foundations

This test on roachdash | Improve this report!

Jira issue: CRDB-33533

@cockroach-teamcity cockroach-teamcity added branch-release-23.2 Used to mark GA and release blockers, technical advisories, and bugs for 23.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Nov 15, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Nov 15, 2023
@rafiss
Collaborator

rafiss commented Nov 16, 2023

34.73.1.236 appears to be n1. (Side note: I thought the node ID would appear in this error message, but we only get the IP.)

The code that timed out during connection is:

				t.L().Printf("Creating index")
				before := timeutil.Now()
				if _, err := db.Exec(`CREATE INDEX payload_a ON bulkingest.bulkingest (payload, a)`); err != nil {
					t.Fatal(err)
				}

But it seems like the cluster was healthy and according to run_141746.338675167_n5_workload-run-bulking.log, it was handling the workload run bulkingest command fine.

I made PR #114551 to improve the logging, but I don't have any leads on why the test would fail to connect to CRDB.

@rafiss rafiss changed the title from "roachtest: schemachange/bulkingest failed" to "roachtest: schemachange/bulkingest failed [connection refused but no CRDB error]" Nov 16, 2023
@rafiss rafiss added T-testeng TestEng Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Nov 16, 2023

blathers-crl bot commented Nov 16, 2023

cc @cockroachdb/test-eng

@rafiss
Collaborator

rafiss commented Nov 16, 2023

I will assign to TestEng in case they know of a way to diagnose a timeout like this. If not, feel free to close this.

@renatolabs
Contributor

The error message could definitely use some improvement. I'm pretty sure the connection refused part is a red herring and the actual reason the test failed is that it timed out:

(test_runner.go:1113).runTest: test timed out (2h0m0s)

I think the sequence of events here is:

  • test times out, test context is canceled;
  • cockroach processes are killed when context is canceled;
  • test observes connection refused error because the process is now dead.

So we have two errors happening before the test's Run function returns. I believe this would have been a little less confusing if we were using ExecContext() in the test, as we would likely see a context canceled error instead, which would point to some other error being the root cause, bringing more attention to the "test timed out" message.
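To illustrate the ExecContext() point, here is a minimal sketch; it is not the roachtest code. The `execer` interface, `fakeDB`, `createIndex`, and `demo` are hypothetical names introduced so the example runs without a cluster; `*sql.DB` satisfies the same method signature. The fake statement blocks until the context is canceled, which is roughly what a long-running CREATE INDEX would look like to the caller when the test context is torn down.

```go
package main

import (
	"context"
	"database/sql"
	"errors"
	"fmt"
	"time"
)

// execer is a hypothetical interface matching (*sql.DB).ExecContext.
type execer interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
}

// Compile-time check that a real *sql.DB could be passed instead of the fake.
var _ execer = (*sql.DB)(nil)

// fakeDB stands in for a connection whose statement never finishes on its
// own; it returns only once the context is done.
type fakeDB struct{}

func (fakeDB) ExecContext(ctx context.Context, _ string, _ ...any) (sql.Result, error) {
	<-ctx.Done()
	return nil, ctx.Err()
}

// createIndex runs the statement under ctx. If the context is already done
// when the statement fails, the error is annotated so that it reads as a
// consequence of the cancellation rather than as a root cause.
func createIndex(ctx context.Context, db execer) error {
	_, err := db.ExecContext(ctx,
		`CREATE INDEX payload_a ON bulkingest.bulkingest (payload, a)`)
	if err != nil && ctx.Err() != nil {
		return fmt.Errorf("statement interrupted by test context: %w", err)
	}
	return err
}

// demo cancels the context shortly after the statement starts, simulating
// the test runner canceling the test context on timeout.
func demo() (string, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(10*time.Millisecond, cancel)

	err := createIndex(ctx, fakeDB{})
	return err.Error(), errors.Is(err, context.Canceled)
}

func main() {
	msg, canceled := demo()
	fmt.Println(msg, canceled)
}
```

With a context-aware call, the caller sees a context canceled error instead of an unrelated-looking network error, which points back to whatever canceled the context.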

For Test Eng more concretely, I think a takeaway is that we should not show any other errors if the test timed out. That is always going to be the root cause for the failure.
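The reporting rule suggested above could be sketched roughly as follows. This is an illustration only, not how the roachtest test runner is implemented: `errTimedOut` and `reportable` are hypothetical names, and the error strings are copied from the failure report in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// errTimedOut is a hypothetical sentinel marking a test timeout.
var errTimedOut = errors.New("test timed out")

// reportable returns the errors worth showing. If any error wraps
// errTimedOut, only the timeout errors are kept, on the assumption that
// everything else is fallout from the cancellation. Otherwise all errors
// are returned unchanged.
func reportable(errs []error) []error {
	var timeouts []error
	for _, err := range errs {
		if errors.Is(err, errTimedOut) {
			timeouts = append(timeouts, err)
		}
	}
	if len(timeouts) > 0 {
		return timeouts
	}
	return errs
}

// demo feeds the three errors from this issue through reportable.
func demo() []string {
	errs := []error{
		fmt.Errorf("%w (2h0m0s)", errTimedOut),
		errors.New("dial tcp 34.73.1.236:26257: connect: connection refused"),
		errors.New("monitor failure: monitor user task failed: t.Fatal() was called"),
	}
	var out []string
	for _, err := range reportable(errs) {
		out = append(out, err.Error())
	}
	return out
}

func main() {
	for _, line := range demo() {
		fmt.Println(line)
	}
}
```

Applied to this failure, only the timeout line would be reported, and the connection refused and monitor errors would be dropped.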

In any case, the ROACHTEST_metamorphicBuild=true tag here is suggestive. Seems like another case of a test that might need a longer timeout with metamorphic builds.

craig bot pushed a commit that referenced this issue Nov 16, 2023
114499: backupccl: add some basic checks for online restore r=msbutler a=stevendanna

This adds some basic guardrails to stop people from trying things out
that currently don't work.

Epic: none

Release note: None

114550: parser: allow a_expr in SET ON UPDATE clause r=rafiss a=rafiss

This was preventing some statements from round-tripping the parse-format-parse cycle. Note that this grammar change matches the expressions we allow for DEFAULT expressions.

fixes #114480
Release note: None

114551: roachtest: lower connection timeout and include node ID r=rafiss a=rafiss

This adds two improvements to help with debugging:
- Include the node ID in the message if we fail to connect to a node.
- Fix the connection timeout at 1 minute rather than leaving it infinite.

informs #114501
Release note: None

Co-authored-by: Steven Danna <danna@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
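
As a rough illustration of the two debugging improvements described in 114551 (a bounded connection timeout and a node ID in the failure message), here is a generic sketch using only the Go standard library. It is not the code from that PR: `dialNode` and `demo` are hypothetical names, and the demo dials a local port that was just released, so the refusal is produced locally.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// dialNode is a hypothetical helper: it bounds the dial with a timeout and
// prefixes any failure with the node ID so the log shows more than an IP.
func dialNode(nodeID int, addr string, timeout time.Duration) (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return nil, fmt.Errorf("connecting to n%d: %w", nodeID, err)
	}
	return conn, nil
}

// demo reserves a free local port, releases it, and then dials it, so that
// nothing is listening and the connection attempt fails.
func demo() string {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "listen failed: " + err.Error()
	}
	addr := ln.Addr().String()
	ln.Close()

	conn, err := dialNode(1, addr, time.Minute)
	if err == nil {
		conn.Close()
		return ""
	}
	return err.Error()
}

func main() {
	fmt.Println(demo())
}
```

The resulting message starts with the node ID, followed by the underlying dial error.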
@cockroach-teamcity
Member Author

roachtest.schemachange/bulkingest failed with artifacts on release-23.2 @ 0456c0d4e4dca39bde1b4bc146bf7158db3c0f30:

(test_runner.go:1114).runTest: test timed out (2h0m0s)
(schemachange.go:421).2: dial tcp 34.74.177.221:26257: connect: connection refused
(monitor.go:153).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/schemachange/bulkingest/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_metamorphicBuild=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@renatolabs
Contributor

Test times out when using metamorphic cockroach builds. These runs were disabled (#114618) until we have a stable release. We'll re-assess test failures when this feature is enabled again.

3 participants