Description
When consistency issues (say consistency check failures, raft assertion failures, missing rows, etc.) occur during nightly roachtests, it is important to hold on to as much information as possible. In particular, the durable state (disk volumes) is often the best source of information when trying to piece together what happened. In the absence of such information, we have to embark upon a time-consuming reproduction attempt, which is often unsuccessful, since such issues are often rare, require a very specific sequence of events, and may even be caused by faulty infrastructure.
For example, we saw a raft assertion (essentially a loss of durability) in late December here. We responded by increasing some logging verbosities, but, in hindsight, this had little chance of being helpful.
It recently recurred in #145953 (comment). I've since run 150+ iterations of this test and am continuing to do so, but I don't expect to be able to reproduce the issue.
To put further investigations of this type onto a more solid foundation, I'd like to propose extending roachtest with heuristics to detect such classes of failures. These heuristics should cause disk snapshots of the cluster nodes to be taken and announced in the failure logs (if not linked into the issue outright). This should be part of the test harness.
For example, the KV team would likely start out with these heuristics (see #145953 (comment)):

- fatal raft assertion failures, e.g. `grep -RE '^F.*raft' . > /dev/null`
- failed consistency checks (either in test or in post-test checks).
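As a sketch of what such a harness-side heuristic could look like (the function name, pattern set, and structure here are illustrative, not existing roachtest code):

```go
package main

import (
	"fmt"
	"regexp"
)

// suspectPatterns holds heuristics that, when matched against collected node
// logs, suggest a consistency problem worth snapshotting. The first pattern
// mirrors the grep above; the second is a hypothetical example of what a
// consistency-check heuristic might look for.
var suspectPatterns = []*regexp.Regexp{
	regexp.MustCompile(`^F.*raft`),                 // fatal raft assertion failures
	regexp.MustCompile(`consistency check failed`), // replica consistency mismatches
}

// isSuspectLine reports whether a single log line trips any heuristic. The
// harness would run this over node logs after a test and, on a match, trigger
// a disk snapshot and announce it in the failure logs.
func isSuspectLine(line string) bool {
	for _, re := range suspectPatterns {
		if re.MatchString(line) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSuspectLine("F250101 12:00:00.0 1 raft.go:123 invariant violated"))
}
```

The heuristics are deliberately simple line-matchers so they can be extended per team without touching the snapshotting machinery itself.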
We already have infrastructure for disk snapshots:
cockroach/pkg/cmd/roachtest/cluster/cluster_interface.go
Lines 189 to 198 in 45e5f2e
```go
// CreateSnapshot creates volume snapshots of the cluster using the given
// prefix. These snapshots can later be retrieved, deleted or applied to
// already instantiated clusters.
CreateSnapshot(ctx context.Context, snapshotPrefix string) ([]vm.VolumeSnapshot, error)

// ListSnapshots lists the individual volume snapshots that satisfy the
// search criteria.
ListSnapshots(ctx context.Context, vslo vm.VolumeSnapshotListOpts) ([]vm.VolumeSnapshot, error)

// DeleteSnapshots permanently deletes the given snapshots.
DeleteSnapshots(ctx context.Context, snapshots ...vm.VolumeSnapshot) error
```
The lifecycle of these snapshots should be considered so that they don't pile up over the years. An expiration of a few months (extended when needed) should be more than sufficient; we could be more aggressive if needed, all the way down to two weeks.
Jira issue: CRDB-50490