Description
Currently, the monitor API does a reasonably good job of detecting unexpected node death(s). In a recent roachtest failure [1], we can see that n2 died, which resulted in t.Fatal():
```
2025/05/27 10:36:22 monitor.go:203: Monitor event: n1: cockroach process for system interface is running (PID: 4694)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n2: cockroach process for system interface is running (PID: 4317)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n3: cockroach process for system interface is running (PID: 4264)
2025/05/27 10:36:22 monitor.go:203: Monitor event: n4: cockroach process for system interface is running (PID: 4312)
2025/05/27 10:53:09 monitor.go:203: Monitor event: n2: cockroach process for system interface died (exit code 7)
2025/05/27 10:53:09 test_impl.go:478: test failure #1: full stack retained in failure_1.log: (cluster.go:2462).Run: context canceled
2025/05/27 10:53:09 test_impl.go:478: test failure #2: full stack retained in failure_2.log: (monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
```
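For illustration only, here is a minimal sketch of how the test runner might recognize a death event in the monitor output and pull out the node ID and exit code. The event format is taken from the log above; the helper name `parseDeathEvent` and the regexp are hypothetical, not part of the existing monitor API.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// deathEventRE matches monitor events of the form seen above, e.g.
// "n2: cockroach process for system interface died (exit code 7)".
var deathEventRE = regexp.MustCompile(`^n(\d+): cockroach process .* died \(exit code (\d+)\)$`)

// parseDeathEvent is a hypothetical helper that extracts the node ID and
// exit code from a monitor event string; ok is false for any other event.
func parseDeathEvent(event string) (node, exitCode int, ok bool) {
	m := deathEventRE.FindStringSubmatch(event)
	if m == nil {
		return 0, 0, false
	}
	node, _ = strconv.Atoi(m[1])
	exitCode, _ = strconv.Atoi(m[2])
	return node, exitCode, true
}

func main() {
	event := "n2: cockroach process for system interface died (exit code 7)"
	if node, code, ok := parseDeathEvent(event); ok {
		fmt.Printf("n%d died with exit code %d\n", node, code)
	}
}
```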
However, the actual diagnostic context for the node death isn't directly available to the test runner (i.e., roachtest); it has to be obtained by examining node-level logs. Sometimes the diagnostic context is logged along with a stacktrace, as is the case in [1]. After downloading artifacts.zip, a simple grep reveals that the node died due to a clock synchronization error [2]:
```
grep -E "^F" logs/?.unredacted/cockroach.log | head
logs/2.unredacted/cockroach.log:F250527 10:53:08.112090 1531 1@rpc/peer.go:527 ⋮ [T1,Vsystem,n2,rnode=1,raddr=‹10.1.0.108:26257›,class=default,rpc] 361 clock synchronization error: this node is more than 400ms away from at least half of the known nodes (2 of 4 are within the offset)
```
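As a sketch of that extraction step, assuming the artifacts follow the logs/<node>.unredacted/cockroach.log layout shown above, a helper (the name `collectFatalEvents` and the layout assumption are mine) could scan each node's log for F-prefixed lines and return them keyed by log path:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// collectFatalEvents is a hypothetical helper that walks an artifacts
// directory (layout assumed: logs/<node>.unredacted/cockroach.log) and
// returns all Fatal-level ("F"-prefixed) log lines, keyed by log path.
func collectFatalEvents(artifactsDir string) (map[string][]string, error) {
	paths, err := filepath.Glob(filepath.Join(artifactsDir, "logs", "*.unredacted", "cockroach.log"))
	if err != nil {
		return nil, err
	}
	fatal := make(map[string][]string)
	for _, path := range paths {
		f, err := os.Open(path)
		if err != nil {
			return nil, err
		}
		scanner := bufio.NewScanner(f)
		scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // fatal lines can be long
		for scanner.Scan() {
			line := scanner.Text()
			// CockroachDB log lines start with the severity letter, e.g. "F250527 ...".
			if strings.HasPrefix(line, "F") {
				fatal[path] = append(fatal[path], line)
			}
		}
		err = scanner.Err()
		f.Close()
		if err != nil {
			return nil, err
		}
	}
	return fatal, nil
}

func main() {
	events, err := collectFatalEvents("artifacts")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	for path, lines := range events {
		fmt.Printf("%s: %d fatal event(s)\n", path, len(lines))
	}
}
```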
Thus, by extracting and consolidating all Fatal-level log messages, the test runner could be augmented with new triage heuristics; e.g., a clock synchronization error could be reported as an infrastructure flake rather than a test failure. We could also consider augmenting the GitHub issue reporter to include an abbreviated form of this diagnostic context, e.g., the frame where log.Fatal was invoked. A sketch of one such heuristic follows.
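A minimal sketch of how such a triage heuristic might look, assuming a consolidated list of fatal messages is already available; the pattern list and the infra-flake classification below are illustrative and not the actual roachtest failure-classification API:

```go
package main

import (
	"fmt"
	"strings"
)

// failureClass is a hypothetical classification used for triage.
type failureClass string

const (
	testFailure failureClass = "test failure"
	infraFlake  failureClass = "infrastructure flake"
)

// infraFlakePatterns maps known fatal-message substrings to an
// infra-flake classification; the entries are illustrative only.
var infraFlakePatterns = []string{
	"clock synchronization error",
	"disk stall detected",
}

// classifyFatalEvents is a hypothetical heuristic: if any consolidated
// Fatal-level message matches a known infra pattern, report an infra flake;
// otherwise fall back to reporting a test failure.
func classifyFatalEvents(fatalLines []string) failureClass {
	for _, line := range fatalLines {
		for _, pattern := range infraFlakePatterns {
			if strings.Contains(line, pattern) {
				return infraFlake
			}
		}
	}
	return testFailure
}

func main() {
	lines := []string{
		"F250527 10:53:08.112090 1531 1@rpc/peer.go:527 ... clock synchronization error: this node is more than 400ms away from at least half of the known nodes",
	}
	fmt.Println(classifyFatalEvents(lines)) // prints: infrastructure flake
}
```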
[1] #147323
[2] https://www.cockroachlabs.com/docs/stable/operational-faqs#what-happens-when-node-clocks-are-not-properly-synchronized
Jira issue: CRDB-51016