
Snapshot Guide

steviez edited this page Dec 11, 2024 · 2 revisions

This guide is for operators who have had trouble generating a snapshot in the past, or who would like to better understand how snapshots and the ledger work together.

Context

In order to process a transaction, the validator needs information about the pre-existing state of the blockchain. That state could be determined by starting at genesis and replaying every block prior to the transaction of interest. Replaying that many transactions is impractical, so instead we use snapshots.

The agave-validator process stores several kinds of state on disk, including the ledger (blockstore) and snapshots:

  • blockstore - a collection of transactions, packed into blocks. Due to space limitations, most nodes retain only the last 1-2 days' worth of transactions in their local ledger.
    • This data corresponds to the rocksdb directory
  • snapshots
    • full - a complete set of information about a specific block, containing all the state necessary to replay transactions for the next block. These are named snapshot-<slot>-<hash>.tar.zst.
    • incremental - A set of differences that can be applied to a full snapshot to fast-forward to a subsequent block without replaying all the transactions in between. These are named incremental-snapshot-<base slot>-<slot>-<hash>.tar.zst.
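The two naming schemes above can be read mechanically. The following is a minimal sketch in Python, not part of Agave: `parse_snapshot_name` and `SnapshotName` are hypothetical helpers, and the hash is matched loosely as any run of characters without dashes or dots, which is an assumption of this sketch.

```python
import re
from typing import NamedTuple, Optional

# Patterns follow the two naming schemes described above. The hash is matched
# loosely (no dashes or dots), which is an assumption of this sketch.
FULL_RE = re.compile(r"^snapshot-(\d+)-([^-.]+)\.tar\.zst$")
INCR_RE = re.compile(r"^incremental-snapshot-(\d+)-(\d+)-([^-.]+)\.tar\.zst$")


class SnapshotName(NamedTuple):
    kind: str                 # "full" or "incremental"
    slot: int                 # slot the snapshot brings the validator to
    base_slot: Optional[int]  # full snapshot slot an incremental builds on
    hash: str


def parse_snapshot_name(filename: str) -> Optional[SnapshotName]:
    """Return the parsed fields, or None if the name matches neither scheme."""
    m = FULL_RE.match(filename)
    if m:
        return SnapshotName("full", int(m.group(1)), None, m.group(2))
    m = INCR_RE.match(filename)
    if m:
        return SnapshotName("incremental", int(m.group(2)), int(m.group(1)), m.group(3))
    return None
```

Note that for an incremental snapshot the slot of interest is the second number in the filename; the first number only identifies the full snapshot it depends on.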

By default, Agave generates a full snapshot every 25,000 blocks and an incremental snapshot every 100 blocks.

Transactions can only be replayed going forward, not in reverse. If you have a snapshot for slot S and a ledger containing the blocks that follow it, you can generate a snapshot for slot S+1, S+2, etc., but not for S-1 or any earlier slot.

In order to generate a snapshot for slot X you need:

  • A snapshot for slot S, where S < X
    • This can be a full snapshot at slot S OR
    • A full snapshot at slot R along with an incremental snapshot at slot S that is based on the full snapshot at slot R
  • A blockstore containing all the blocks from slots (S, X]
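The requirements above can be sketched as a small check. This is an illustrative model in Python, not how agave-ledger-tool works internally: `find_start_slot` and `can_generate` are hypothetical names, and the blockstore is simplified to a `(lowest, highest)` slot pair that is assumed to hold every block produced in that range.

```python
from typing import Iterable, Optional, Tuple


def find_start_slot(
    target: int,
    full_slots: Iterable[int],
    incrementals: Iterable[Tuple[int, int]],  # (base_slot, slot) pairs
) -> Optional[int]:
    """Return the highest usable snapshot slot S with S < target, or None."""
    fulls = set(full_slots)
    candidates = [s for s in fulls if s < target]
    # An incremental only counts if the full snapshot it is based on is present.
    candidates += [s for base, s in incrementals if base in fulls and s < target]
    return max(candidates, default=None)


def can_generate(
    target: int,
    full_slots: Iterable[int],
    incrementals: Iterable[Tuple[int, int]],
    blockstore_range: Tuple[int, int],  # assumed gap-free (lowest, highest) slots
) -> bool:
    """True if a snapshot at `target` could be produced under this model."""
    start = find_start_slot(target, full_slots, incrementals)
    if start is None:
        return False  # every snapshot is at or after the target slot
    lowest, highest = blockstore_range
    # Blocks for the half-open range (start, target] must all be present.
    return lowest <= start + 1 and highest >= target
```

For example, `can_generate(1000, [500], [(500, 900)], (800, 1000))` is true because the incremental at slot 900 bridges the gap, while dropping the incremental makes it false because blocks 501 through 799 are missing.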

Common pitfalls

There are three common reasons that might prevent an operator from creating a snapshot at slot X.

1. All available snapshots are at some slot T where T > X

Cause: This could happen if your validator continues running after slot X. The validator continually makes new snapshots and the newest snapshots are retained (as defined by snapshot retention flags). When new snapshots are created, older snapshots are deleted in FIFO order.
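To see how retention leads to this situation, here is a toy model in Python. It is only a sketch: `retained_after` is a hypothetical helper, the retention count of 4 is made up for illustration, and real validators track full and incremental snapshots separately.

```python
from collections import deque
from typing import Iterable, List


def retained_after(snapshot_slots: Iterable[int], keep: int) -> List[int]:
    """Toy model: keep only the `keep` newest snapshots, dropping the oldest first."""
    retained = deque(maxlen=keep)  # appending past maxlen evicts the oldest entry
    for slot in snapshot_slots:
        retained.append(slot)
    return list(retained)


# A validator that keeps running to slot 1500, snapshotting every 100 slots.
target_slot = 1050
on_disk = retained_after(range(100, 1600, 100), keep=4)
usable = [s for s in on_disk if s < target_slot]
```

In this run `on_disk` ends up holding only slots 1200 through 1500, so `usable` is empty: the snapshot at slot 1000 that could have been used for slot 1050 was evicted long before.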

Solution: As previously mentioned, it is not possible to replay blocks backwards. Thus, these newer snapshots are incapable of producing a snapshot at the earlier slot X. The solution is to be proactive and ensure your node halts at the appointed time when testnet has planned restarts.

2. A suitable snapshot at some slot S < X is available, but the blockstore doesn't contain all the blocks in the range (S, X]

Cause: This could happen if the validator goes offline (manual stop, crash, etc) before the cluster reaches slot X.

Solution: In planned restarts, the nodes that make up the first 33% of stake to halt usually end up in this situation. For future testnet restarts, Anza and the Solana Foundation will halt their nodes just before the appointed time to help keep operators out of it. For an unplanned outage, this comes down partly to luck, but avoid manually stopping your node until instructed to do so.

3. A suitable snapshot at some slot S < X is available and the necessary blocks (S, X] are available in the blockstore, but an additional snapshot exists at slot T where T > X

Cause: agave-ledger-tool always tries to use the latest snapshot available.

Solution: Examine the slot numbers in the snapshot filenames and move any snapshots at slot T where T > X to a separate backup directory. This will allow agave-ledger-tool to find the correct snapshot at slot S where S < X.
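The move can be scripted. The following Python sketch assumes the filename schemes described earlier in this guide; `set_aside_newer_snapshots` is a hypothetical helper, and the directory paths are placeholders to replace with your own. Files are moved rather than deleted so they can be restored afterwards.

```python
import re
import shutil
from pathlib import Path
from typing import List

# In both naming schemes the snapshot's slot is the last number before the hash.
SLOT_RE = re.compile(r"^(?:incremental-snapshot-\d+|snapshot)-(\d+)-[^-.]+\.tar\.zst$")


def set_aside_newer_snapshots(snapshot_dir: Path, backup_dir: Path, target_slot: int) -> List[str]:
    """Move every snapshot archive whose slot is > target_slot into backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the listing before any file is moved.
    for path in sorted(snapshot_dir.iterdir()):
        m = SLOT_RE.match(path.name)
        if m and int(m.group(1)) > target_slot:
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved
```

A call might look like `set_aside_newer_snapshots(Path("/path/to/snapshots"), Path("/path/to/snapshots-backup"), target_slot=X)`, run while the validator is stopped. Files that do not match either naming scheme are left untouched.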
