
feat(repair): link shred collector with snapshot - root slot, leader schedule, verify shreds #169

Merged 41 commits into main on Jul 1, 2024

Conversation

@dnut (Contributor) commented Jun 10, 2024

Background

Previously, the shred collector would start collecting shreds from an arbitrary point defined on the CLI, and it was not verifying signatures. The snapshot was loaded separately and had no impact on this process.

There are two big problems with the existing approach:

  1. The shred collector is supposed to pick up where the snapshot leaves off and catch up from that point forward, so it can account for all state changes since the snapshot. But nothing associated the shred collector with the state of the snapshot: it would start from an arbitrary slot with no assurance that it was handling the next slot after the snapshot.
  2. The shred collector was not verifying shred signatures, which meant anyone on the internet could trick sig by sending forged data.

New

This change links the snapshot with the shred collector. Now the snapshot is processed first, and then it is used to inform the shred collector about where to start and how to verify shreds (the leader schedule). This includes the following, sketched in code after the list:

    • get the root slot from the snapshot
    • start repairing shreds from the root slot, instead of relying on cli --test-repair-for-slot
    • extract staked node information from the snapshot
    • derive the leader schedule from staked nodes
    • use leader schedule in the shred verifier to verify shred signatures
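
Roughly, the new flow looks like this. This is a simplified sketch under assumed signatures, not the actual code: AppBase, loadSnapshot, and leaderScheduleFromBank are names from the Code changes list below, while startShredCollector and schedule.provider() are hypothetical stand-ins for the real wiring.

const std = @import("std");

fn runValidator(allocator: std.mem.Allocator, app: *AppBase) !void {
    // 1. process the snapshot first
    const snapshot = try loadSnapshot(allocator, app);

    // 2. the snapshot's root slot is where repair starts, replacing
    //    the CLI-provided --test-repair-for-slot
    const root_slot = snapshot.bank.slot;

    // 3. staked nodes from the snapshot determine the leader schedule...
    const schedule = try leaderScheduleFromBank(allocator, &snapshot.bank);

    // 4. ...which the shred verifier uses to check shred signatures
    try startShredCollector(app, root_slot, schedule.provider());
}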

This required implementing a random number generator consistent with the leader schedule calculation done by the agave client. Zig's std includes a chacha rng, but it is not compatible with rust's rand_chacha, since the rust crate is not IETF compliant and uses a novel approach to reuse previously generated random data.
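
For illustration, here is roughly how those pieces fit together to pick leaders the way agave does (one stake-weighted draw per group of 4 consecutive slots). Everything here is a hedged sketch: ChaChaRng.fromSeed, rng.random(), WeightedRandomSampler(u64).init, and sampler.sample() paraphrase this PR's summary rather than quote its code, and Pubkey stands for sig's core pubkey type.

fn sampleEpochLeaders(
    allocator: std.mem.Allocator,
    epoch: u64,
    staked_nodes: []const Pubkey, // assumed pre-sorted the way agave sorts them
    stakes: []const u64, // parallel to staked_nodes
    slots_in_epoch: usize,
) ![]Pubkey {
    // agave seeds the rng with the epoch in the first 8 bytes of a 32-byte seed
    var seed = [_]u8{0} ** 32;
    std.mem.writeInt(u64, seed[0..8], epoch, .little);

    // must reproduce rust's rand_chacha stream bit-for-bit
    var rng = ChaChaRng.fromSeed(seed);

    // must select the same indices as the rand crate's WeightedIndex,
    // otherwise the derived schedule diverges from agave's
    var sampler = try WeightedRandomSampler(u64).init(allocator, rng.random(), stakes);

    const leaders = try allocator.alloc(Pubkey, slots_in_epoch);
    var current: Pubkey = undefined;
    for (leaders, 0..) |*slot_leader, i| {
        // agave assigns leaders in groups of 4 consecutive slots,
        // drawing one stake-weighted sample per group
        if (i % 4 == 0) {
            const idx: usize = @intCast(sampler.sample());
            current = staked_nodes[idx];
        }
        slot_leader.* = current;
    }
    return leaders;
}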

Leader schedule flexibility

This introduces three ways to use the leader schedule:

  1. By default, the validator command will calculate the leader schedule from the snapshot and use it to verify shreds.
  2. Using the --leader-schedule option you can pass in the leader schedule from the CLI, instead of calculating it.
  3. Using the leader-schedule command you can calculate the leader schedule and print it to stdout.

Issues

There is currently a major limitation in this approach. The node stake weights in the snapshot are slightly different from the stakes that agave uses when it calculates the leader schedule, which leads to an incorrect leader schedule being generated. If you run sig deriving the leader schedule from the snapshot, shreds will fail signature verification. This is why there is an option to input a known leader schedule from the CLI.

The issue may be that the stakes in the snapshot represent values from a different point in time than the one the leader schedule should be derived from. I haven't dug into this issue yet and it needs to be addressed later. For now, we can get the leader schedule from the solana cli and provide it to sig with sig validator --leader-schedule.

Code changes

  • .gitignore: files in test_data now need to be whitelisted, instead of explicitly listing everything to exclude
  • accountsdb: expose data needed for the leader schedule
    • Implement VoteAccounts.stakedNodes for lazy deserialization of staked node hashmap (same approach as agave)
      • Use VoteAccount struct instead of generic Account for vote accounts
    • add EpochSchedule.getSlotsInEpoch
  • cmd:
    • calculate the leader schedule in the validator command and pass it to the shred collector.
    • add leader-schedule command to print the leader schedule in the same format as solana cli.
    • add --leader-schedule option to validator command to allow passing a known leader schedule, instead of calculating it.
    • refactor to reduce code duplication (otherwise this PR would significantly increase code duplication)
      • AppBase: application-wide state that needs to be initialized for every command.
      • loadSnapshot function that is reused in the validator and leader-schedule commands to load the snapshot
      • LoadedSnapshot: all the state that is produced by loadSnapshot
  • core:
    • leader_schedule.zig: new file
      • leaderSchedule: calculate the leader schedule from the minimum required inputs.
      • leaderScheduleFromBank: conveniently determine the leader schedule from a bank.
      • SlotLeaderProvider: Abstraction to represent any approach of providing a slot leader.
      • SingleEpochLeaderSchedule: Basic slot leader provider that can only handle the epoch it was initialized with.
    • pubkey: replace explicit panic with error.
  • gossip: Change GossipService to use ServiceManager instead of its own thread handling.
  • rand: new sig component
    • ChaChaRng: generate the same stream as rust's rand_chacha crate.
      • wraps ChaCha (barebones chacha stream generator) with BlockRng (rng state manager)
    • WeightedRandomSampler: randomly select the same items as the rand crate's WeightedIndex.
  • shred-collector: use the fleshed-out leader schedule type to require correct shred signatures.
    • new dependency on SlotLeaderProvider, passed down to shred verifier
    • shred verifier: require slot leader and fail verification if missing (instead of optional)
    • shred: implement getSignedData to verify signatures
      • required some new functions to be implemented to get the merkle proof
      • refactored some code out of GenericShred since it needs to be able to be used on serialized shreds
  • trace: added Logger.logf so you can pass the log level as a runtime parameter.
  • utils
    • PointerClosure: wraps a function pointer together with some state. The function takes two inputs: a mutable pointer stored in the struct, which is implicitly passed at call time, and a second parameter passed explicitly by the caller of the closure. It may be called multiple times. The immediate use for this is to facilitate interface abstractions such as SlotLeaderProvider (a minimal sketch follows this list).
    • service manager:
      • RunConfig.ReturnHandler: more flexible configuration
      • ServiceManager: allow specifying a default RunConfig on init, so you don't need to specify it for each spawn.
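
As referenced in the utils item above, here is a minimal sketch of the PointerClosure idea, reconstructed from the description rather than copied from the PR, so details will differ from the real implementation:

const std = @import("std");

// Sketch: stores a type-erased state pointer plus a function pointer.
// call() supplies the stored state implicitly; the caller passes the
// second argument explicitly, and call() may run any number of times.
pub fn PointerClosure(comptime Arg: type, comptime Return: type) type {
    return struct {
        state: *anyopaque,
        function: *const fn (*anyopaque, Arg) Return,

        const Self = @This();

        pub fn init(state: anytype, comptime function: fn (@TypeOf(state), Arg) Return) Self {
            return .{
                .state = @ptrCast(state),
                .function = struct {
                    fn typeErased(erased: *anyopaque, arg: Arg) Return {
                        return function(@ptrCast(@alignCast(erased)), arg);
                    }
                }.typeErased,
            };
        }

        pub fn call(self: Self, arg: Arg) Return {
            return self.function(self.state, arg);
        }
    };
}

// Toy usage: bind a counter to a two-argument function.
fn addToCounter(counter: *u64, amount: u64) u64 {
    counter.* += amount;
    return counter.*;
}

test "PointerClosure sketch" {
    var counter: u64 = 0;
    const closure = PointerClosure(u64, u64).init(&counter, addToCounter);
    try std.testing.expectEqual(@as(u64, 5), closure.call(5));
    try std.testing.expectEqual(@as(u64, 7), closure.call(2));
}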

Run

calculating leader schedule from snapshot

Currently, by default, sig uses "test_data", which contains snapshots that are not valid for any actual cluster, rather than downloading a snapshot. To run this code against a real cluster, you'll likely need to point sig at a different snapshot directory. This tells sig to download and load a fresh snapshot from the cluster, which it then uses to calculate the leader schedule.

You can create a new snapshot directory like this:

mkdir "$SNAPSHOT_DIR"
cp test_data/genesis.bin "$SNAPSHOT_DIR"
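
Throughout these examples, SNAPSHOT_DIR and ENTRYPOINTS are assumed to be set, e.g. (hypothetical values; the --entrypoint flag and the testnet host are assumptions, not something this PR defines):

SNAPSHOT_DIR=my-snapshot-dir
ENTRYPOINTS="--entrypoint entrypoint.testnet.solana.com:8001"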

To download the snapshot and print the leader schedule derived from the snapshot (outputs the same format as solana leader-schedule):

sig leader-schedule --snapshot-dir "$SNAPSHOT_DIR" $ENTRYPOINTS

To download the snapshot and run the validator using the leader schedule derived from the snapshot:

sig validator --snapshot-dir "$SNAPSHOT_DIR" $ENTRYPOINTS

use known leader schedule

If you want to provide a leader schedule to sig, rather than calculate it, you can pass it in using the --leader-schedule CLI option. If you don't want to download a snapshot, you should also pass the start slot on the CLI.

You can create a leader schedule file using either of these commands:

solana leader-schedule > leader-schedule.txt
sig leader-schedule --snapshot-dir "$SNAPSHOT_DIR" $ENTRYPOINTS > leader-schedule.txt
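
Either way, the file is plain text in the format solana leader-schedule prints: one line per slot, pairing the slot number with that slot's leader identity pubkey, roughly like this (placeholder values):

  270905208       <leader identity pubkey>
  270905209       <leader identity pubkey>
  270905210       <leader identity pubkey>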

Pass the file to sig like this:

sig validator --leader-schedule leader-schedule.txt --test-repair-for-slot $START_SLOT $ENTRYPOINTS

Or pipe it directly, specifying -- to indicate stdin:

solana leader-schedule | sig validator --leader-schedule -- --test-repair-for-slot $START_SLOT $ENTRYPOINTS

@dnut dnut changed the title feat: derive leader schedule from snapshot, use it to verify shreds feat(core, rand, shred-collector, utils): derive leader schedule from snapshot, use it to verify shreds Jun 10, 2024
@dnut dnut changed the title feat(core, rand, shred-collector, utils): derive leader schedule from snapshot, use it to verify shreds feat(accountsdb, core, rand, shred-collector, utils): derive leader schedule from snapshot, use it to verify shreds Jun 10, 2024
@dnut dnut changed the title feat(accountsdb, core, rand, shred-collector, utils): derive leader schedule from snapshot, use it to verify shreds feat: derive leader schedule from snapshot, use it to verify shreds Jun 10, 2024
dnut added 4 commits June 11, 2024 22:47
bugs:
- infinite loop in iterator due to not incrementing
- adding in dataIndex instead of subtracting
- invalid proof should be index != 0
@dnut dnut changed the title feat: derive leader schedule from snapshot, use it to verify shreds feat: link shred collector with snapshot. get root slot, calculate leader schedule, and verify shred signatures Jun 13, 2024
@dnut dnut changed the title feat: link shred collector with snapshot. get root slot, calculate leader schedule, and verify shred signatures feat: link shred collector with snapshot. root slot, leader schedule, and verify shreds Jun 13, 2024
@dnut dnut changed the title feat: link shred collector with snapshot. root slot, leader schedule, and verify shreds feat: shred collector <-> snapshot: root slot, leader schedule, verify shreds Jun 13, 2024
@dnut dnut changed the title feat: shred collector <-> snapshot: root slot, leader schedule, verify shreds feat: link shred collector and snapshot - root slot, leader schedule, verify shreds Jun 13, 2024
@dnut dnut changed the title feat: link shred collector and snapshot - root slot, leader schedule, verify shreds feat: link shred collector with snapshot - root slot, leader schedule, verify shreds Jun 13, 2024
@dnut dnut marked this pull request as ready for review June 13, 2024 13:38
@0xNineteen 0xNineteen changed the title feat: link shred collector with snapshot - root slot, leader schedule, verify shreds feat(repair): link shred collector with snapshot - root slot, leader schedule, verify shreds Jun 24, 2024
@0xNineteen (Contributor) left a comment:
overall lgtm - just a few things

Review comments (resolved) on: src/accountsdb/genesis_config.zig, src/accountsdb/snapshots.zig, src/cmd/cmd.zig, src/cmd/config.zig
@dnut dnut requested a review from 0xNineteen June 28, 2024 00:57
0xNineteen previously approved these changes Jun 28, 2024

@0xNineteen (Contributor) left a comment:
lgtm - awesome pr 🔥

@dnut dnut merged commit 6ce1445 into main Jul 1, 2024
5 checks passed
@InKryption InKryption deleted the dnut/repair3 branch July 1, 2024 20:34
@dnut dnut added this to the Networking milestone Oct 11, 2024