Description
Currently, the monitor API does a reasonably good job of detecting unexpected node death(s). In a recent roachtest failure [1], we can see that n2 died, which resulted in t.Fatal():
```
2025/05/27 10:36:22 monitor.go:203: Monitor event: n1: cockroach process for system interface is running (PID: 4694)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n2: cockroach process for system interface is running (PID: 4317)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n3: cockroach process for system interface is running (PID: 4264)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n4: cockroach process for system interface is running (PID: 4312)
2025/05/27 10:53:09 monitor.go:203: Monitor event: n2: cockroach process for system interface died (exit code 7)
2025/05/27 10:53:09 test_impl.go:478: test failure #1: full stack retained in failure_1.log: (cluster.go:2462).Run: context canceled
2025/05/27 10:53:09 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
```
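For illustration only, here is a minimal sketch of how the test runner might recognize a death event in the monitor output and pull out the node ID and exit code. The event format is taken from the log above; the helper name `parseDeathEvent` and the regexp are hypothetical, not part of the existing monitor API.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// deathEventRE matches monitor events of the form seen above, e.g.
// "n2: cockroach process for system interface died (exit code 7)".
var deathEventRE = regexp.MustCompile(`^n(\d+): cockroach process .* died \(exit code (\d+)\)$`)

// parseDeathEvent is a hypothetical helper that extracts the node ID and
// exit code from a monitor event string; ok is false for any other event.
func parseDeathEvent(event string) (node, exitCode int, ok bool) {
	m := deathEventRE.FindStringSubmatch(event)
	if m == nil {
		return 0, 0, false
	}
	node, _ = strconv.Atoi(m[1])
	exitCode, _ = strconv.Atoi(m[2])
	return node, exitCode, true
}

func main() {
	event := "n2: cockroach process for system interface died (exit code 7)"
	if node, code, ok := parseDeathEvent(event); ok {
		fmt.Printf("n%d died with exit code %d\n", node, code)
	}
}
```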
However, the actual diagnostic context for the node death isn't directly available to the test runner (i.e., roachtest); it has to be obtained by examining node-level logs. Sometimes the diagnostic context is logged along with a stacktrace, as is the case in [1]. After downloading artifacts.zip, a simple grep reveals that the node died due to a clock synchronization error [2]:
```
grep -E "^F" logs/?.unredacted/cockroach.log | head
logs/2.unredacted/cockroach.log:F250527 10:53:08.112090 1531 1@rpc/peer.go:527 ⋮ [T1,Vsystem,n2,rnode=1,raddr=‹10.1.0.108:26257›,class=default,rpc] 361 clock synchronization error: this node is more than 400ms away from at least half of the known nodes (2 of 4 are within the offset)
```
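As a sketch of that extraction step, assuming the artifacts follow the logs/<node>.unredacted/cockroach.log layout shown above, a helper (the name `collectFatalEvents` and the layout assumption are mine) could scan each node's log for F-prefixed lines and return them keyed by log path:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// collectFatalEvents is a hypothetical helper that walks an artifacts
// directory (layout assumed: logs/<node>.unredacted/cockroach.log) and
// returns all Fatal-level ("F"-prefixed) log lines, keyed by log path.
func collectFatalEvents(artifactsDir string) (map[string][]string, error) {
	paths, err := filepath.Glob(filepath.Join(artifactsDir, "logs", "*.unredacted", "cockroach.log"))
	if err != nil {
		return nil, err
	}
	fatal := make(map[string][]string)
	for _, path := range paths {
		f, err := os.Open(path)
		if err != nil {
			return nil, err
		}
		scanner := bufio.NewScanner(f)
		scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // fatal lines can be long
		for scanner.Scan() {
			line := scanner.Text()
			// CockroachDB log lines start with the severity letter, e.g. "F250527 ...".
			if strings.HasPrefix(line, "F") {
				fatal[path] = append(fatal[path], line)
			}
		}
		err = scanner.Err()
		f.Close()
		if err != nil {
			return nil, err
		}
	}
	return fatal, nil
}

func main() {
	events, err := collectFatalEvents("artifacts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for path, lines := range events {
		fmt.Printf("%s: %d fatal event(s)\n", path, len(lines))
	}
}
```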
Thus, by extracting and consolidating all Fatal-level log messages, the test runner could be augmented with new triage heuristics; e.g., a clock synchronization error could be reported as an infrastructure flake rather than a test failure. We could also consider augmenting the GitHub issue reporter to include an abbreviated form of this diagnostic context, e.g., the frame where log.Fatal was invoked. A sketch of one such heuristic follows.
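A minimal sketch of how such a triage heuristic might look, assuming a consolidated list of fatal messages is already available; the pattern list and the infra-flake classification below are illustrative and not the actual roachtest failure-classification API:

```go
package main

import (
	"fmt"
	"strings"
)

// failureClass is a hypothetical classification used for triage.
type failureClass string

const (
	testFailure failureClass = "test failure"
	infraFlake  failureClass = "infrastructure flake"
)

// infraFlakePatterns maps known fatal-message substrings to an
// infra-flake classification; the entries are illustrative only.
var infraFlakePatterns = []string{
	"clock synchronization error",
	"disk stall detected",
}

// classifyFatalEvents is a hypothetical heuristic: if any consolidated
// Fatal-level message matches a known infra pattern, report an infra flake;
// otherwise fall back to reporting a test failure.
func classifyFatalEvents(fatalLines []string) failureClass {
	for _, line := range fatalLines {
		for _, pattern := range infraFlakePatterns {
			if strings.Contains(line, pattern) {
				return infraFlake
			}
		}
	}
	return testFailure
}

func main() {
	lines := []string{
		"F250527 10:53:08.112090 1531 1@rpc/peer.go:527 ... clock synchronization error: this node is more than 400ms away from at least half of the known nodes",
	}
	fmt.Println(classifyFatalEvents(lines)) // prints: infrastructure flake
}
```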
[1] #147323
[2] https://www.cockroachlabs.com/docs/stable/operational-faqs#what-happens-when-node-clocks-are-not-properly-synchronized
Jira issue: CRDB-51016