roachtest: unconditionally save clusters that show raft fatal errors #146990

tbg · 2025-05-20T08:27:17Z

When a cluster's logs contain a raft panic, it will be extended (by a week),
volume snapshots will be taken, and the cluster will not be destroyed. This
gives us the artifacts for a thorough investigation.

Verified manually via:

```
run --local acceptance/invariant-check-detection/failed=true
```

Here is the (editorialized) output:

```
test-teardown: 2025/05/20 08:15:15 cluster.go:2559: running cmd `([ -d logs ] && grep -RE '^...` on nodes [:1-4]; details in run_081515.744363000_n1-4_d-logs-grep-RE-Fraft.log
test-teardown: 2025/05/20 08:15:16 cluster.go:2995: extending cluster by 168h0m0s
test-teardown: 2025/05/20 08:15:16 cluster.go:1104: saving cluster local [tag:] (4 nodes) for debugging (--debug specified)
test-teardown: 2025/05/20 08:15:16 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (test_runner.go:1705).maybeSaveClusterDueToInvariantProblems: invariant problem - snap name invariant-problem-local-8897676895823393049:
logs/foo.log:F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1,r155/1:?/Table/113/1/{43/578…-51/201…}?] 80 match(30115) is out of range [lastIndex(30114)]. Was the raft log corrupted, truncated, or lost?
```
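The fatal line in the output above has a recognizable shape: severity `F`, a date and time, a goroutine ID, then a source location under `raft/`. Below is a minimal Go sketch of matching such lines. The regular expression and the `matchRaftFatal` helper are assumptions made for illustration; the grep pattern the PR actually runs is truncated in the output and is not reproduced here. The sample line in `main` is abbreviated.

```go
package main

import (
	"fmt"
	"regexp"
)

// raftFatalRE is a hypothetical pattern, not the one used by the PR. It
// matches a fatal-severity log line ("F", a yymmdd date, a timestamp, and a
// goroutine ID) whose source location is a Go file under raft/.
var raftFatalRE = regexp.MustCompile(
	`^F\d{6} \d{2}:\d{2}:\d{2}\.\d{6} \d+ (raft/[^\s:]+\.go):(\d+)`)

// matchRaftFatal returns the source file and line number of a raft fatal
// error, or ok=false if the log line does not look like one.
func matchRaftFatal(line string) (file, lineNo string, ok bool) {
	m := raftFatalRE.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := "F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1] 80 match(30115) is out of range [lastIndex(30114)]."
	if file, lineNo, ok := matchRaftFatal(line); ok {
		fmt.Printf("raft fatal at %s:%s\n", file, lineNo)
	}
}
```

Note that `grep -R` prefixes each hit with the file name (`logs/foo.log:` above), so a caller feeding grep output into this helper would need to strip that prefix first.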

Closes #145953.
Informs #146617.
Informs #138028.

Fixes #146355.

Epic: none

tbg · 2025-05-20T08:28:33Z

@pav-kv anything else we may want here? For select tests, we could set up the pebble.Cleaner as we discussed; I filed this as #146991.

herkolategan · 2025-05-20T09:55:59Z

Thanks, this looks good. QoL Nit: We could maybe add "parameters" [1] to the github issue (failure) to provide the cluster name and that it's in a saved state, or even maybe a github issue label for "saved failures".

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test_runner.go#L2228

pav-kv · 2025-05-20T10:57:49Z

pkg/cmd/roachtest/test_runner.go

+func (r *testRunner) maybeSaveClusterDueToInvariantProblems(
+	ctx context.Context, t *testImpl, c *clusterImpl,
+) {
+	dets, err := c.RunWithDetails(ctx, t.L(), option.WithNodes(c.All()),


There is currently a tiny friction point with failed roachtests that fataled. We need to manually go to the test log, see which node was reported down, go to its log and find the fatal message. I wonder if this could be automated (and the fatal message included in generated issues) with something similar/related to this PR. There might be no easy way to make it general enough though, but OTOH it doesn't have to be super reliable.

Locating fatal messages could also reduce the volume the grep above needs to scan in all tests.

@herkolategan

The thought has crossed my mind that we should be bubbling up obvious failures/info from logs to the issues. We could search the tail end of logs (to avoid searching the whole log) for fatals and provide those on the github issue. Will run the idea by test-eng in our next weekly.
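The idea of scanning only the tail end of each log could look roughly like the Go sketch below. Everything here is an assumption for illustration: the helper names `readTail` and `fatalLines`, the 64 KiB window, and the heuristic that a fatal entry starts with `F` followed by a digit (inferred from the sample line in the PR description). It is not what roachtest does today.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"strings"
)

// readTail returns up to maxBytes from the end of the file at path, so that
// large logs do not have to be scanned in full.
func readTail(path string, maxBytes int64) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	info, err := f.Stat()
	if err != nil {
		return "", err
	}
	off := info.Size() - maxBytes
	if off < 0 {
		off = 0
	}
	buf, err := io.ReadAll(io.NewSectionReader(f, off, info.Size()-off))
	if err != nil {
		return "", err
	}
	tail := string(buf)
	if off > 0 {
		// The window may start mid-line; drop everything up to the first newline.
		if i := strings.IndexByte(tail, '\n'); i >= 0 {
			tail = tail[i+1:]
		} else {
			tail = ""
		}
	}
	return tail, nil
}

// fatalLines returns the lines that look like fatal-severity entries,
// using the assumed "F<digit>..." prefix heuristic.
func fatalLines(tail string) []string {
	var out []string
	for _, line := range strings.Split(tail, "\n") {
		if len(line) >= 2 && line[0] == 'F' && line[1] >= '0' && line[1] <= '9' {
			out = append(out, line)
		}
	}
	return out
}

func main() {
	// With a path argument, scan that file's tail; otherwise use a sample.
	tail := "I250502 11:37:19.000000 12 server/node.go:1 ok\nF250502 11:37:20.387424 1036 raft/raft.go:2411 boom\n"
	if len(os.Args) > 1 {
		var err error
		if tail, err = readTail(os.Args[1], 64<<10); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	for _, l := range fatalLines(tail) {
		fmt.Println(l)
	}
}
```

A fixed byte window keeps the cost per log file bounded, at the price of missing a fatal message that is followed by more than a window's worth of output; as noted above, this does not have to be fully reliable to be useful.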

We think it's a good idea to consolidate all fatal-level log messages [1].

[1] #147360

tbg · 2025-06-12T09:15:19Z

> Thanks, this looks good. QoL Nit: We could maybe add "parameters" [1] to the github issue (failure) to provide the cluster name and that it's in a saved state, or even maybe a github issue label for "saved failures".
>
> [1] master/pkg/cmd/roachtest/test_runner.go#L2228

That's a good idea, but also more than I signed up for. Maybe TE wants to give this another pass when adding more triggers for this functionality? Given how we ~never expect this new functionality to trigger, I don't think it's too important to do this. OTOH, it would be so sad if we had these artifacts and people didn't realize. But it is in test.log.

tbg · 2025-06-12T09:16:52Z

I think I addressed all feedback, PTAL.
If anyone fancies a manual spin,

```
./dev build roachtest && roachtest run --local invariant-check-detection/failed=false
```

works and won't leak a snapshot (since local snapshots don't work).

herkolategan

Looks good to me. I still think it would be good to have the created GitHub issue note the cluster (and cluster name for easy reference) is still up via parameters [1]. I don't have any comment on disabling the leaktest since I haven't used it to debug anything (yet).

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test_runner.go#L2244

tbg · 2025-06-17T10:33:58Z

TFTR!

> I still think it would be good to have the created GitHub issue note the cluster (and cluster name for easy reference) is still up via parameters [1].

Okay, that wasn't too hard - added a commit.
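For readers unfamiliar with the suggestion, the sketch below shows one way issue parameters could carry the saved state. It is purely illustrative: the helper `withSavedParams` and the keys `saved` and `savedCluster` are hypothetical, and the thread does not show the names used by the actual commit.

```go
package main

import (
	"fmt"
	"sort"
)

// withSavedParams returns a copy of params and, if the cluster was saved,
// adds two hypothetical keys noting that fact and the cluster's name.
// The input map is never mutated and may be nil.
func withSavedParams(params map[string]string, clusterName string, saved bool) map[string]string {
	out := make(map[string]string, len(params)+2)
	for k, v := range params { // ranging over a nil map is a no-op
		out[k] = v
	}
	if saved {
		out["saved"] = "true"
		out["savedCluster"] = clusterName
	}
	return out
}

func main() {
	params := withSavedParams(map[string]string{"cloud": "local"}, "local", true)
	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys) // print in a stable order
	for _, k := range keys {
		fmt.Printf("%s=%s\n", k, params[k])
	}
}
```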

bors r+

craig · 2025-06-17T11:02:18Z

Build failed:

unit_tests

tbg · 2025-06-17T11:49:24Z

Too optimistic. Of course this broke some test. Will take a look.

tbg · 2025-06-18T08:06:29Z

bors r+
just needed a nil check.

craig · 2025-06-18T08:58:04Z

This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried.

tbg · 2025-06-18T09:15:39Z

bors r+

craig · 2025-06-18T09:15:42Z

Already running a review

craig · 2025-06-18T10:26:45Z

Build succeeded:

tbg requested a review from a team as a code owner May 20, 2025 08:27

tbg requested review from DarrylWong and golgeek and removed request for a team May 20, 2025 08:27

tbg mentioned this pull request May 20, 2025

roachtest,storage: use lenient pebble.Cleaner in some roachtests #146991 (Open)

pav-kv reviewed May 20, 2025

DarrylWong reviewed May 20, 2025

DarrylWong mentioned this pull request May 27, 2025

roachtest: add detection for live host migration (Azure) #143397 (Open)

roachtest: allow saving clusters more generally

dbf5d47

Even if neither --debug nor --debug-always are specified.

tbg force-pushed the roachtest-preserve-nightly branch from 841013f to 592383c Compare June 12, 2025 08:49

tbg mentioned this pull request Jun 12, 2025

roachtest: leaked goroutines #148196 (Open)

tbg requested a review from herkolategan June 12, 2025 09:16

tbg added 6 commits June 12, 2025 11:45

roachtest: trigger only on specific raft assertion panics

eb71e0f

roachtest: improve snap name and error handling

631a61b

roachtest: remove unnecessary Put

e9fac83

roachtest: move roachtest

2d73067

roachtest: disable leak checker

9678bad

See cockroachdb#148196 for rationale.

tbg force-pushed the roachtest-preserve-nightly branch from 4e31d74 to 9678bad Compare June 12, 2025 09:45

herkolategan approved these changes Jun 17, 2025

roachtest: add saved param if saved

96baece

tbg force-pushed the roachtest-preserve-nightly branch from feec288 to 96baece Compare June 17, 2025 12:48

craig bot merged commit 4fa3974 into cockroachdb:master Jun 18, 2025
22 checks passed

celeste-cockroachdb bot added the target-release-25.3.0 label Jun 18, 2025

celeste-cockroachdb bot added v25.3.0-prerelease and removed target-release-25.3.0 labels Jul 2, 2025

miraradeva mentioned this pull request Aug 25, 2025

roachtest: tpcc/w=100/nodes=3/chaos=true failed #150724 (Closed)

tbg mentioned this pull request Sep 16, 2025

roachtest: tpcc/w=100/nodes=3/chaos=true failed #152899 (Closed)
