@tbg tbg commented May 20, 2025

When a cluster's logs contain a raft panic, it will be extended (by a week),
volume snapshots will be taken, and the cluster will not be destroyed. This
gives us the artifacts for a thorough investigation.

Verified manually via:

run --local acceptance/invariant-check-detection/failed=true

Here is the (editorialized) output:

test-teardown: 2025/05/20 08:15:15 cluster.go:2559: running cmd `([ -d logs ] && grep -RE '^...` on nodes [:1-4]; details in run_081515.744363000_n1-4_d-logs-grep-RE-Fraft.log
test-teardown: 2025/05/20 08:15:16 cluster.go:2995: extending cluster by 168h0m0s
test-teardown: 2025/05/20 08:15:16 cluster.go:1104: saving cluster local [tag:] (4 nodes) for debugging (--debug specified)
test-teardown: 2025/05/20 08:15:16 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (test_runner.go:1705).maybeSaveClusterDueToInvariantProblems: invariant problem - snap name invariant-problem-local-8897676895823393049:
logs/foo.log:F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1,r155/1:?/Table/113/1/{43/578…-51/201…}?] 80 match(30115) is out of range [lastIndex(30114)]. Was the raft log corrupted, truncated, or lost?
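
For readers who want a feel for the detection step, here is a minimal Go sketch of scanning log text for fatal lines that originate in raft code. It is not the code from this PR: the grep expression is elided in the output above, so the pattern below, `raftFatalRE`, and `findRaftFatals` are all hypothetical names and assumptions modeled on the single log line shown.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// raftFatalRE is an assumed pattern, not the one the PR greps for. It matches
// lines with severity F (fatal) whose source file sits under a raft/ directory,
// in the shape "F250502 11:37:20.387424 1036 raft/raft.go:2411 ...".
var raftFatalRE = regexp.MustCompile(
	`^F\d{6} \d{2}:\d{2}:\d{2}\.\d{6} \d+ (\S*/)?raft/\S+\.go:\d+ `)

// findRaftFatals returns every line of logText that matches raftFatalRE.
// Note: bufio.Scanner rejects lines longer than 64 KiB by default, which is
// acceptable for a sketch but would need a larger buffer for real logs.
func findRaftFatals(logText string) []string {
	var hits []string
	sc := bufio.NewScanner(strings.NewReader(logText))
	for sc.Scan() {
		if line := sc.Text(); raftFatalRE.MatchString(line) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	sample := "I250502 11:37:19.000000 1036 raft/raft.go:100 ⋮ [n1] became follower\n" +
		"F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [n1] match(30115) is out of range [lastIndex(30114)]\n"
	for _, line := range findRaftFatals(sample) {
		fmt.Println("invariant problem:", line)
	}
}
```

A non-empty result would be the trigger for extending the cluster and taking snapshots; how that is wired into teardown is specific to roachtest and not shown here.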

Closes #145953.
Informs #146617.
Informs #138028.

Fixes #146355.

Epic: none

@tbg tbg requested a review from a team as a code owner May 20, 2025 08:27
@tbg tbg requested review from DarrylWong and golgeek and removed request for a team May 20, 2025 08:27
tbg commented May 20, 2025

@pav-kv anything else we may want here? For select tests, we could set up the pebble.Cleaner as we discussed, I filed this in #146991.

herkolategan commented May 20, 2025

Thanks, this looks good. QoL Nit: We could maybe add "parameters" [1] to the github issue (failure) to provide the cluster name and that it's in a saved state, or even maybe a github issue label for "saved failures".

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test_runner.go#L2228

func (r *testRunner) maybeSaveClusterDueToInvariantProblems(
ctx context.Context, t *testImpl, c *clusterImpl,
) {
dets, err := c.RunWithDetails(ctx, t.L(), option.WithNodes(c.All()),

There is currently a tiny friction point with failed roachtests that fataled. We need to manually go to the test log, see which node was reported down, go to its log and find the fatal message. I wonder if this could be automated (and the fatal message included in generated issues) with something similar/related to this PR. There might be no easy way to make it general enough though, but OTOH it doesn't have to be super reliable.

Locating fatal messages could also reduce the volume the grep above needs to scan in all tests.

@herkolategan

The thought has crossed my mind that we should be bubbling up obvious failures/info from logs to the issues. We could search the tail end of logs (to avoid searching the whole log) for fatals and provide those on the github issue. Will run the idea by test-eng in our next weekly.
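
A rough Go sketch of the "search only the tail end" idea, under stated assumptions: `tailFatals`, `tailFatalsString`, and `isFatal` are hypothetical helpers, and "fatal" is assumed to mean a line starting with `F` followed by a six-digit date, as in the log line in the PR description. This is an illustration, not something roachtest provides.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// isFatal reports whether line starts with "F", six digits, and a space,
// e.g. "F250502 ". This prefix check is an assumption about the log format.
func isFatal(line string) bool {
	if len(line) < 8 || line[0] != 'F' || line[7] != ' ' {
		return false
	}
	for i := 1; i <= 6; i++ {
		if line[i] < '0' || line[i] > '9' {
			return false
		}
	}
	return true
}

// tailFatals reads at most maxBytes from the end of a log of the given size
// and returns the fatal lines found in that window.
func tailFatals(r io.ReaderAt, size, maxBytes int64) ([]string, error) {
	start := size - maxBytes
	if start < 0 {
		start = 0
	}
	buf := make([]byte, size-start)
	if _, err := r.ReadAt(buf, start); err != nil && err != io.EOF {
		return nil, err
	}
	// When the window starts mid-file, drop its first line: it may be partial.
	if start > 0 {
		if i := bytes.IndexByte(buf, '\n'); i >= 0 {
			buf = buf[i+1:]
		} else {
			buf = nil
		}
	}
	var fatals []string
	for _, line := range strings.Split(string(buf), "\n") {
		if isFatal(line) {
			fatals = append(fatals, line)
		}
	}
	return fatals, nil
}

// tailFatalsString is a convenience wrapper for in-memory log text.
func tailFatalsString(s string, maxBytes int64) ([]string, error) {
	return tailFatals(strings.NewReader(s), int64(len(s)), maxBytes)
}

func main() {
	logText := "I250101 00:00:01.000000 1 a.go:2 ⋮ info\n" +
		"F250101 00:00:02.000000 1 a.go:3 ⋮ new fatal\n"
	fatals, err := tailFatalsString(logText, 1<<20)
	fmt.Println(fatals, err)
}
```

For a file on disk, an `*os.File` satisfies `io.ReaderAt` and its size is available from `Stat()`. The trade-off is that a fatal logged before the window is missed, which matches the "doesn't have to be super reliable" framing above.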

We think it's a good idea to consolidate all fatal-level log messages [1].

[1] #147360

Even if neither --debug nor --debug-always are specified.
@tbg tbg force-pushed the roachtest-preserve-nightly branch from 841013f to 592383c Compare June 12, 2025 08:49
tbg commented Jun 12, 2025

> Thanks, this looks good. QoL Nit: We could maybe add "parameters" [1] to the github issue (failure) to provide the cluster name and that it's in a saved state, or even maybe a github issue label for "saved failures".
>
> [1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test_runner.go#L2228

That's a good idea, but also more than I signed up for. Maybe TE wants to give this another pass when adding more triggers for this functionality? Given how we ~never expect this new functionality to trigger, I don't think it's too important to do this. OTOH, it would be so sad if we had these artifacts and people didn't realize. But it is in test.log.

@tbg tbg requested a review from herkolategan June 12, 2025 09:16
tbg commented Jun 12, 2025

I think I addressed all feedback, PTAL.
If anyone fancies a manual spin,

./dev build roachtest && roachtest run --local invariant-check-detection/failed=false

works and won't leak a snapshot (since local snapshots don't work).

tbg added 6 commits June 12, 2025 11:45
@tbg tbg force-pushed the roachtest-preserve-nightly branch from 4e31d74 to 9678bad Compare June 12, 2025 09:45
@herkolategan herkolategan left a comment

Looks good to me. I still think it would be good to have the created GitHub issue note the cluster (and cluster name for easy reference) is still up via parameters [1]. I don't have any comment on disabling the leaktest since I haven't used it to debug anything (yet).

[1] https://github.com/cockroachdb/cockroach/blob/master/pkg/cmd/roachtest/test_runner.go#L2244

tbg commented Jun 17, 2025

TFTR!

> I still think it would be good to have the created GitHub issue note the cluster (and cluster name for easy reference) is still up via parameters [1].

Okay, that wasn't too hard - added a commit.

bors r+

craig bot commented Jun 17, 2025

Build failed:

tbg commented Jun 17, 2025

Too optimistic. Of course this broke some test. Will take a look.

@tbg tbg force-pushed the roachtest-preserve-nightly branch from feec288 to 96baece Compare June 17, 2025 12:48
tbg commented Jun 18, 2025

bors r+
just needed a nil check.

craig bot commented Jun 18, 2025

This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried.

tbg commented Jun 18, 2025

bors r+

craig bot commented Jun 18, 2025

Already running a review

craig bot commented Jun 18, 2025

@craig craig bot merged commit 4fa3974 into cockroachdb:master Jun 18, 2025
22 checks passed
Development

Successfully merging this pull request may close these issues.

roachtest: preserve disks on certain types of failures
roachtest: tpcc/w=100/nodes=3/chaos=true failed

6 participants