roachtest: unconditionally save clusters that show raft fatal errors #146990
Conversation
Thanks, this looks good. QoL nit: we could maybe add "parameters" [1] to the GitHub issue (failure) to provide the cluster name and note that it's in a saved state, or maybe even a GitHub issue label for "saved failures".
```go
func (r *testRunner) maybeSaveClusterDueToInvariantProblems(
	ctx context.Context, t *testImpl, c *clusterImpl,
) {
	dets, err := c.RunWithDetails(ctx, t.L(), option.WithNodes(c.All()),
		// …
```
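For context on where this excerpt is headed: per the commit message quoted below, the function greps the cluster logs for raft fatal errors and, on a hit, extends and saves the cluster. A minimal sketch of that shape, assuming (rather than quoting from the diff) the grep pattern, the Extend/Save signatures, and the error handling:

```go
// Sketch only, not the PR's actual body. The grep pattern, the Extend/Save
// signatures, and the error handling are assumptions for illustration.
// Assumed to live in pkg/cmd/roachtest (imports: strings, time).
func (r *testRunner) maybeSaveClusterSketch(
	ctx context.Context, t *testImpl, c *clusterImpl,
) {
	// Best-effort grep across all nodes for raft fatal errors in the logs.
	dets, err := c.RunWithDetails(ctx, t.L(), option.WithNodes(c.All()),
		`[ -d logs ] && grep -RE 'raft' logs || true`) // placeholder pattern
	if err != nil {
		return // never fail teardown over this check
	}
	for _, det := range dets {
		if strings.TrimSpace(det.Stdout) == "" {
			continue // no raft fatal errors reported by this node
		}
		// Keep the cluster for a week and snapshot its volumes so the
		// corrupted state can be inspected later.
		c.Extend(ctx, 7*24*time.Hour, t.L())    // assumed signature
		c.Save(ctx, "invariant problem", t.L()) // assumed signature
		t.L().Printf("invariant problem:\n%s", det.Stdout)
		return
	}
}
```

Returning early when the grep command itself errors keeps the check best-effort, so a flaky teardown command cannot fail an otherwise-passing test.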
There is currently a tiny friction point with failed roachtests that fataled. We need to manually go to the test log, see which node was reported down, go to its log, and find the fatal message. I wonder if this could be automated (and the fatal message included in generated issues) with something similar/related to this PR. There might be no easy way to make it general enough, but OTOH it doesn't have to be super reliable.
Locating fatal messages could also reduce the volume the grep above needs to scan across all tests.
The thought has crossed my mind that we should be bubbling up obvious failures/info from logs to the issues. We could search the tail end of logs (to avoid searching the whole log) for fatals and provide those on the GitHub issue; a sketch of that idea follows. Will run the idea by test-eng in our next weekly.
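For illustration only, here is a minimal sketch of that tail-scan idea. Nothing below is roachtest API: the package and function names, the tail size, and the `F`-prefix heuristic (CockroachDB fatal log lines start with `F`, as in the output quoted later in this thread) are all assumptions.

```go
package logscan

import (
	"bufio"
	"io"
	"os"
	"strings"
)

// tailFatals scans only the last tailBytes of a log file and returns lines
// that look like fatal-level CockroachDB log entries (prefix "F", e.g.
// "F250502 11:37:20.387424 ..."). Heuristic, not a full log parser.
func tailFatals(path string, tailBytes int64) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	st, err := f.Stat()
	if err != nil {
		return nil, err
	}
	// Seek near the end so large logs aren't scanned in full.
	if off := st.Size() - tailBytes; off > 0 {
		if _, err := f.Seek(off, io.SeekStart); err != nil {
			return nil, err
		}
	}
	var fatals []string
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // tolerate long log lines
	for sc.Scan() {
		if strings.HasPrefix(sc.Text(), "F") {
			fatals = append(fatals, sc.Text())
		}
	}
	return fatals, sc.Err()
}
```

After the seek, the first line scanned may be a partial record; that is acceptable for a best-effort heuristic that only needs to surface candidate fatal messages.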
We think it's a good idea to consolidate all fatal-level log messages [1].
[1] #147360
Even if neither --debug nor --debug-always is specified.
Force-pushed from 841013f to 592383c.
That's a good idea, but also more than I signed up for. Maybe TE wants to give this another pass when adding more triggers for this functionality? Given how we ~never expect this new functionality to trigger, I don't think it's too important to do this. OTOH, it would be so sad if we had these artifacts and people didn't realize. But it is in test.log.
I think I addressed all feedback, PTAL.
… works and won't leak a snapshot (since local snapshots don't work).
When a cluster's logs contain a raft panic, it will be extended (by a week), volume snapshots will be taken, and the cluster will not be destroyed. This gives us the artifacts for a thorough investigation.

Verified manually via:

```
run --local acceptance/invariant-check-detection/failed=true
```

Here is the (editorialized) output:

```
test-teardown: 2025/05/20 08:15:15 cluster.go:2559: running cmd `([ -d logs ] && grep -RE '^...` on nodes [:1-4]; details in run_081515.744363000_n1-4_d-logs-grep-RE-Fraft.log
test-teardown: 2025/05/20 08:15:16 cluster.go:2995: extending cluster by 168h0m0s
test-teardown: 2025/05/20 08:15:16 cluster.go:1104: saving cluster local [tag:] (4 nodes) for debugging (--debug specified)
test-teardown: 2025/05/20 08:15:16 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (test_runner.go:1705).maybeSaveClusterDueToInvariantProblems: invariant problem - snap name invariant-problem-local-8897676895823393049: logs/foo.log:F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1,r155/1:?/Table/113/1/{43/578…-51/201…}?] 80 match(30115) is out of range [lastIndex(30114)]. Was the raft log corrupted, truncated, or lost?
```

Closes cockroachdb#145953.
Informs cockroachdb#146617.
Informs cockroachdb#138028.

Epic: none
See cockroachdb#148196 for rationale.
Force-pushed from 4e31d74 to 9678bad.
Looks good to me. I still think it would be good to have the created GitHub issue note (via parameters [1]) that the cluster is still up, including the cluster name for easy reference. I don't have any comment on disabling the leaktest since I haven't used it to debug anything (yet).
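For concreteness, a hypothetical sketch of such parameters; the key names and the `Name` accessor are invented here, not roachtest's actual issue-parameter plumbing (import: strconv):

```go
// issueParams returns extra key/value parameters to render on the
// auto-filed GitHub issue. Everything here is illustrative: the real
// roachtest issue poster may expose parameters differently.
func issueParams(c *clusterImpl, saved bool) map[string]string {
	return map[string]string{
		"cluster":      c.Name(),                  // cluster name, for easy reference
		"clusterSaved": strconv.FormatBool(saved), // true: extended + snapshotted, not destroyed
	}
}
```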
TFTR!
Okay, that wasn't too hard - added a commit.

bors r+
Build failed:
Too optimistic. Of course this broke some test. Will take a look.
Force-pushed from feec288 to 96baece.
bors r+ |
This PR was included in a batch that successfully built, but then failed to merge into master (it was a non-fast-forward update). It will be automatically retried.
bors r+ |
Already running a review
Build succeeded:
When a cluster's logs contain a raft panic, it will be extended (by a week),
volume snapshots will be taken, and the cluster will not be destroyed. This
gives us the artifacts for a thorough investigation.
Verified manually via:

```
run --local acceptance/invariant-check-detection/failed=true
```

Here is the (editorialized) output:

```
test-teardown: 2025/05/20 08:15:15 cluster.go:2559: running cmd `([ -d logs ] && grep -RE '^...` on nodes [:1-4]; details in run_081515.744363000_n1-4_d-logs-grep-RE-Fraft.log
test-teardown: 2025/05/20 08:15:16 cluster.go:2995: extending cluster by 168h0m0s
test-teardown: 2025/05/20 08:15:16 cluster.go:1104: saving cluster local [tag:] (4 nodes) for debugging (--debug specified)
test-teardown: 2025/05/20 08:15:16 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (test_runner.go:1705).maybeSaveClusterDueToInvariantProblems: invariant problem - snap name invariant-problem-local-8897676895823393049: logs/foo.log:F250502 11:37:20.387424 1036 raft/raft.go:2411 ⋮ [T1,Vsystem,n1,s1,r155/1:?/Table/113/1/{43/578…-51/201…}?] 80 match(30115) is out of range [lastIndex(30114)]. Was the raft log corrupted, truncated, or lost?
```
Closes #145953.
Informs #146617.
Informs #138028.
Fixes #146355.
Epic: none