roachtest: preserve disks on certain types of failures #146355

@tbg

Description

When consistency issues (e.g., consistency check failures, raft assertion failures, or missing rows) occur during nightly roachtests, it is important to hold on to as much information as possible. In particular, the durable state (the disk volumes) is often the best source of information when trying to piece together what happened. In the absence of such information, we have to embark on a time-consuming reproduction attempt, which is often unsuccessful: such issues tend to be rare, require a very specific sequence of events, and may even be caused by faulty infrastructure.

For example, we saw a raft assertion failure (essentially a loss of durability) in late December here. We responded by increasing some logging verbosities, but in hindsight this had little chance of being helpful. The issue recently recurred in #145953 (comment). I've since run 150+ iterations of this test and am continuing to do so, but don't expect to be able to reproduce it.


To put further investigations of this type on a more solid foundation, I'd like to propose extending roachtest with heuristics that detect such classes of failures. When a heuristic fires, disk snapshots of the cluster nodes should be taken and announced in the failure logs (if not linked into the issue outright). This should be part of the test harness.

For example, the KV team would likely start out with the following heuristics (see #145953 (comment)); a sketch of how the harness might evaluate them follows the list:

  • grep -RE '^F.*raft' . > /dev/null
  • failed consistency checks (either in test or in post-test checks).
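
As a rough illustration, here is a minimal Go sketch of how the harness could evaluate these heuristics over the collected log artifacts after a failed test. Everything here (shouldPreserveDisks, the artifacts-directory layout, the consistency-check log marker) is hypothetical rather than existing roachtest code; only the ^F.*raft pattern is taken from the heuristic above.

package main

import (
    "bufio"
    "io/fs"
    "os"
    "path/filepath"
    "regexp"
    "strings"
)

// preservePatterns marks a failure as a consistency/durability issue.
// The first pattern mirrors the `grep -RE '^F.*raft'` heuristic above;
// the consistency-check marker is an assumed log string, not a verified one.
var preservePatterns = []*regexp.Regexp{
    regexp.MustCompile(`^F.*raft`),
    regexp.MustCompile(`consistency check failed`),
}

// shouldPreserveDisks scans the test's log artifacts and reports whether
// any heuristic matched; the artifacts-directory layout is an assumption.
func shouldPreserveDisks(artifactsDir string) (bool, error) {
    found := false
    err := filepath.WalkDir(artifactsDir, func(path string, d fs.DirEntry, err error) error {
        if err != nil || d.IsDir() || !strings.HasSuffix(path, ".log") {
            return err
        }
        f, err := os.Open(path)
        if err != nil {
            return err
        }
        defer f.Close()
        sc := bufio.NewScanner(f)
        for sc.Scan() {
            for _, re := range preservePatterns {
                if re.MatchString(sc.Text()) {
                    found = true
                    return filepath.SkipAll // one match is enough; stop walking
                }
            }
        }
        return sc.Err()
    })
    return found, err
}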

We already have infrastructure for disk snapshots:

// CreateSnapshot creates volume snapshots of the cluster using the given
// prefix. These snapshots can later be retrieved, deleted or applied to
// already instantiated clusters.
//
CreateSnapshot(ctx context.Context, snapshotPrefix string) ([]vm.VolumeSnapshot, error)
// ListSnapshots lists the individual volume snapshots that satisfy the
// search criteria.
ListSnapshots(ctx context.Context, vslo vm.VolumeSnapshotListOpts) ([]vm.VolumeSnapshot, error)
// DeleteSnapshots permanently deletes the given snapshots.
DeleteSnapshots(ctx context.Context, snapshots ...vm.VolumeSnapshot) error
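
As a hedged sketch of how a firing heuristic could be wired up to this API: VolumeSnapshot and Snapshotter below are minimal stand-ins for the roachprod types quoted above, and maybePreserveDisks and its prefix scheme are hypothetical, not existing harness code.

package main

import (
    "context"
    "fmt"
    "time"
)

// VolumeSnapshot and Snapshotter are minimal stand-ins for the roachprod
// types behind the interface quoted above; the field set is an assumption.
type VolumeSnapshot struct {
    ID   string
    Name string
}

type Snapshotter interface {
    CreateSnapshot(ctx context.Context, snapshotPrefix string) ([]VolumeSnapshot, error)
}

// maybePreserveDisks takes disk snapshots when a failure heuristic fired
// and returns them so the harness can announce the names in the failure
// logs (or link them into the issue). The prefix scheme is hypothetical.
func maybePreserveDisks(
    ctx context.Context, c Snapshotter, testName string, heuristicFired bool,
) ([]VolumeSnapshot, error) {
    if !heuristicFired {
        return nil, nil
    }
    // Encode the test name and date so the snapshots are findable later.
    prefix := fmt.Sprintf("roachtest-%s-%s", testName, time.Now().Format("20060102"))
    snaps, err := c.CreateSnapshot(ctx, prefix)
    if err != nil {
        return nil, err
    }
    for _, s := range snaps {
        fmt.Printf("preserved disk snapshot: %s (%s)\n", s.Name, s.ID)
    }
    return snaps, nil
}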

The lifecycle of these snapshots should be managed so that they don't pile up over the years. An expiration of a few months (extended when needed) should be more than sufficient; we could be more aggressive if necessary, all the way down to two weeks.
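
A scheduled GC pass over ListSnapshots and DeleteSnapshots could enforce that expiration. In the sketch below, the list-option fields (NamePrefix, CreatedBefore) and the retention plumbing are assumptions, not the actual vm.VolumeSnapshotListOpts definition.

package main

import (
    "context"
    "time"
)

// Minimal stand-ins for the types above; the list-option fields
// (NamePrefix, CreatedBefore) are assumptions for this sketch.
type VolumeSnapshot struct {
    ID   string
    Name string
}

type VolumeSnapshotListOpts struct {
    NamePrefix    string
    CreatedBefore time.Time
}

type SnapshotStore interface {
    ListSnapshots(ctx context.Context, vslo VolumeSnapshotListOpts) ([]VolumeSnapshot, error)
    DeleteSnapshots(ctx context.Context, snapshots ...VolumeSnapshot) error
}

// pruneExpiredSnapshots deletes roachtest-created snapshots older than the
// retention window (e.g. a few months, per the proposal above).
func pruneExpiredSnapshots(ctx context.Context, s SnapshotStore, retention time.Duration) error {
    old, err := s.ListSnapshots(ctx, VolumeSnapshotListOpts{
        NamePrefix:    "roachtest-",
        CreatedBefore: time.Now().Add(-retention),
    })
    if err != nil {
        return err
    }
    if len(old) == 0 {
        return nil
    }
    return s.DeleteSnapshots(ctx, old...)
}

Run from a nightly job, a pass like this would keep the steady-state snapshot count bounded no matter how often the heuristics fire.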

Jira issue: CRDB-50490

Metadata

Labels

  • C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
  • P-1: Issues/test failures with a fix SLA of 1 month
  • T-testeng: TestEng Team
  • v25.3.0-prerelease
