roachtest: preserve disks on certain types of failures #146355

@tbg

Description

When consistency issues (e.g., consistency check failures, raft assertion failures, or missing rows) occur during nightly roachtests, it is important to hold on to as much information as possible. In particular, the durable state (the disk volumes) is often the best source of information when trying to piece together what happened. In the absence of such information, we have to embark on a time-consuming reproduction attempt, which is often unsuccessful: such issues tend to be rare, require a very specific sequence of events, and may even be caused by faulty infrastructure.

For example, we saw a raft assertion failure (essentially a loss of durability) in late December here. We responded by increasing some logging verbosities, but in hindsight this had little chance of being helpful. The issue recently recurred in #145953 (comment). I've since run 150+ iterations of this test and am continuing to do so, but don't expect to be able to reproduce it.


To put further investigations of this type on a more solid foundation, I'd like to propose extending roachtest with heuristics that detect such classes of failures. When a heuristic fires, disk snapshots of the cluster nodes should be taken and announced in the failure logs (if not linked into the issue outright). This should be part of the test harness.

For example, the KV team would likely start out with the following heuristics (see #145953 (comment)); a sketch of how the harness might evaluate them follows the list:

  • grep -RE '^F.*raft' . > /dev/null
  • failed consistency checks (either in test or in post-test checks).
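
As a rough illustration, here is a minimal Go sketch of how the harness could evaluate these heuristics over the collected log artifacts after a failed test. Everything here (shouldPreserveDisks, the artifacts-directory layout, the consistency-check log marker) is hypothetical rather than existing roachtest code; only the ^F.*raft pattern is taken from the heuristic above.

package main

import (
    "bufio"
    "io/fs"
    "os"
    "path/filepath"
    "regexp"
    "strings"
)

// preservePatterns marks a failure as a consistency/durability issue.
// The first pattern mirrors the `grep -RE '^F.*raft'` heuristic above;
// the consistency-check marker is an assumed log string, not a verified one.
var preservePatterns = []*regexp.Regexp{
    regexp.MustCompile(`^F.*raft`),
    regexp.MustCompile(`consistency check failed`),
}

// shouldPreserveDisks scans the test's log artifacts and reports whether
// any heuristic matched; the artifacts-directory layout is an assumption.
func shouldPreserveDisks(artifactsDir string) (bool, error) {
    found := false
    err := filepath.WalkDir(artifactsDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() || !strings.HasSuffix(path, ".log") {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            for _, re := range preservePatterns {
                if re.MatchString(sc.Text()) {
                    found = true
                    return filepath.SkipAll // one match is enough; stop walking
                }
            }
        }
        return sc.Err()
    })
    return found, err
}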

We already have infrastructure for disk snapshots:

// CreateSnapshot creates volume snapshots of the cluster using the given
// prefix. These snapshots can later be retrieved, deleted or applied to
// already instantiated clusters.
//
CreateSnapshot(ctx context.Context, snapshotPrefix string) ([]vm.VolumeSnapshot, error)
// ListSnapshots lists the individual volume snapshots that satisfy the
// search criteria.
ListSnapshots(ctx context.Context, vslo vm.VolumeSnapshotListOpts) ([]vm.VolumeSnapshot, error)
// DeleteSnapshots permanently deletes the given snapshots.
DeleteSnapshots(ctx context.Context, snapshots ...vm.VolumeSnapshot) error
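
As a hedged sketch of how a firing heuristic could be wired up to this API: VolumeSnapshot and Snapshotter below are minimal stand-ins for the roachprod types quoted above, and maybePreserveDisks and its prefix scheme are hypothetical, not existing harness code.

package main

import (
    "context"
    "fmt"
    "time"
)

// VolumeSnapshot and Snapshotter are minimal stand-ins for the roachprod
// types behind the interface quoted above; the field set is an assumption.
type VolumeSnapshot struct {
    ID   string
    Name string
}

type Snapshotter interface {
    CreateSnapshot(ctx context.Context, snapshotPrefix string) ([]VolumeSnapshot, error)
}

// maybePreserveDisks takes disk snapshots when a failure heuristic fired
// and returns them so the harness can announce the names in the failure
// logs (or link them into the issue). The prefix scheme is hypothetical.
func maybePreserveDisks(
    ctx context.Context, c Snapshotter, testName string, heuristicFired bool,
) ([]VolumeSnapshot, error) {
    if !heuristicFired {
        return nil, nil
    }
    // Encode the test name and date so the snapshots are findable later.
    prefix := fmt.Sprintf("roachtest-%s-%s", testName, time.Now().Format("20060102"))
    snaps, err := c.CreateSnapshot(ctx, prefix)
    if err != nil {
        return nil, err
    }
    for _, s := range snaps {
        fmt.Printf("preserved disk snapshot: %s (%s)\n", s.Name, s.ID)
    }
    return snaps, nil
}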

The lifecycle of these snapshots should be managed so that they don't pile up over the years. An expiration of a few months (extended when needed) should be more than sufficient; we could be more aggressive if necessary, all the way down to two weeks.
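
A scheduled GC pass over ListSnapshots and DeleteSnapshots could enforce that expiration. In the sketch below, the list-option fields (NamePrefix, CreatedBefore) and the retention plumbing are assumptions, not the actual vm.VolumeSnapshotListOpts definition.

package main

import (
    "context"
    "time"
)

// Minimal stand-ins for the types above; the list-option fields
// (NamePrefix, CreatedBefore) are assumptions for this sketch.
type VolumeSnapshot struct {
    ID   string
    Name string
}

type VolumeSnapshotListOpts struct {
    NamePrefix    string
    CreatedBefore time.Time
}

type SnapshotStore interface {
    ListSnapshots(ctx context.Context, vslo VolumeSnapshotListOpts) ([]VolumeSnapshot, error)
    DeleteSnapshots(ctx context.Context, snapshots ...VolumeSnapshot) error
}

// pruneExpiredSnapshots deletes roachtest-created snapshots older than the
// retention window (e.g. a few months, per the proposal above).
func pruneExpiredSnapshots(ctx context.Context, s SnapshotStore, retention time.Duration) error {
    old, err := s.ListSnapshots(ctx, VolumeSnapshotListOpts{
        NamePrefix:    "roachtest-",
        CreatedBefore: time.Now().Add(-retention),
    })
    if err != nil {
        return err
    }
    if len(old) == 0 {
        return nil
    }
    return s.DeleteSnapshots(ctx, old...)
}

Run from a nightly job, a pass like this would keep the steady-state snapshot count bounded no matter how often the heuristics fire.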

Jira issue: CRDB-50490

Metadata

Labels

  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • P-1: Issues/test failures with a fix SLA of 1 month
  • T-testeng: TestEng Team
  • v25.3.0-prerelease
