
Wen restart aggregate last voted fork slots #33892

Conversation

wen-coding (Contributor)

Problem

Aggregate RestartLastVotedForkSlots in wen_restart and repair if necessary.

Summary of Changes

  • Aggregate RestartLastVotedForkSlots
  • Repair all blocks with more than 42% of the stake
  • If 80% of the validators have reported RestartLastVotedForkSlots and all blocks with more than 42% of the stake have been repaired, proceed to the next step (see the sketch below)
  • Write progress into a protobuf file for further debugging

The Gossip changes are being reviewed separately in #33613.
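
For reference, a minimal sketch of the exit condition described in the summary, with illustrative names (none of these identifiers are necessarily the ones used in the PR):

use std::collections::HashMap;

type Slot = u64;

const REPAIR_THRESHOLD: f64 = 0.42;
const WAIT_FOR_SUPERMAJORITY_THRESHOLD: f64 = 0.80;

// Proceed once at least 80% of the stake has reported RestartLastVotedForkSlots
// and every slot backed by more than 42% of the stake is fully repaired locally.
fn can_proceed(
    total_active_stake: u64,
    total_stake: u64,
    slot_stakes: &HashMap<Slot, u64>,             // slot -> stake voting for it
    is_full_in_blockstore: impl Fn(Slot) -> bool, // stand-in for a blockstore check
) -> bool {
    let enough_reports =
        total_active_stake as f64 / total_stake as f64 >= WAIT_FOR_SUPERMAJORITY_THRESHOLD;
    let must_have_slots_repaired = slot_stakes
        .iter()
        .filter(|(_, stake)| **stake as f64 / total_stake as f64 > REPAIR_THRESHOLD)
        .all(|(slot, _)| is_full_in_blockstore(*slot));
    enough_reports && must_have_slots_repaired
}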

codecov bot commented Oct 27, 2023

Codecov Report

Attention: Patch coverage is 94.21222%, with 54 lines in your changes missing coverage. Please review.

Project coverage is 81.8%. Comparing base (7399178) to head (59fd5ff).
Report is 10 commits behind head on master.

❗ Current head 59fd5ff differs from the pull request's most recent head 40c0fb6. Consider uploading reports for commit 40c0fb6 to get more accurate results.

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #33892    +/-   ##
========================================
  Coverage    81.7%    81.8%            
========================================
  Files         836      835     -1     
  Lines      224834   225690   +856     
========================================
+ Hits       183847   184687   +840     
- Misses      40987    41003    +16     

@AshwinSekar (Contributor) left a comment

Took an initial pass, will review the aggregate logic next.

I think, if you wanted to, you could split this PR into the repair changes that consume slots_to_repair_for_wen_restart and then the part that updates it.

Resolved review threads:
core/src/repair/repair_service.rs (outdated)
core/src/repair/repair_service.rs
core/src/repair/repair_service.rs (outdated)
core/src/shred_fetch_stage.rs (outdated)
wen-restart/src/wen_restart.rs (outdated)
wen-restart/src/wen_restart.rs (outdated)
- Avoid waiting after first shred when repair is in wen_restart
@AshwinSekar requested a review from jbiseda on November 15, 2023 20:36
@jbiseda (Contributor) left a comment

I took a first pass and have some suggestions.

@@ -630,7 +646,7 @@ impl RepairService {
                 * timestamp().saturating_sub(slot_meta.first_shred_timestamp)
                 / 1_000;
             if ticks_since_first_insert
-                < reference_tick.saturating_add(DEFER_REPAIR_THRESHOLD_TICKS)
+                < reference_tick.saturating_add(defer_repair_threshold_ticks)
Contributor

Should we bypass the reference_tick comparison completely if add_delay_after_first_shred == false?

Contributor Author

Done.
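
For context, a hedged sketch of what the resolved version might look like, reusing names from the diff above (DEFAULT_TICKS_PER_SECOND is assumed as the tick-rate constant; the PR's exact code may differ):

// Skip the deferral heuristic entirely when repairing for wen_restart;
// otherwise keep the original reference_tick comparison.
if add_delay_after_first_shred {
    let ticks_since_first_insert = DEFAULT_TICKS_PER_SECOND
        * timestamp().saturating_sub(slot_meta.first_shred_timestamp)
        / 1_000;
    if ticks_since_first_insert
        < reference_tick.saturating_add(DEFER_REPAIR_THRESHOLD_TICKS)
    {
        // Too soon after the first shred arrived; defer this repair.
        return vec![];
    }
}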

@@ -613,7 +622,14 @@ impl RepairService {
         slot: Slot,
         slot_meta: &SlotMeta,
         max_repairs: usize,
+        add_delay_after_first_shred: bool,
Contributor

nit: maybe change name to throttle_requests_by_shred_tick?

Contributor Author

Done.

@@ -668,6 +684,7 @@ impl RepairService {
                     slot,
                     &slot_meta,
                     max_repairs - repairs.len(),
+                    true,
Contributor

nit: could you add the param name in a comment like /*add_delay_after_first_shred*/ true for readability?

Contributor Author

Done.
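
For illustration, the suggested call-site style (the function name generate_repairs_for_slot is assumed from the diff context):

Self::generate_repairs_for_slot(
    blockstore,
    slot,
    &slot_meta,
    max_repairs - repairs.len(),
    /*add_delay_after_first_shred*/ true,
);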

@@ -213,6 +213,8 @@ pub struct RepairInfo {
     pub repair_validators: Option<HashSet<Pubkey>>,
     // Validators which should be given priority when serving
     pub repair_whitelist: Arc<RwLock<HashSet<Pubkey>>>,
+    // A given list of slots to repair when in wen_restart
+    pub slots_to_repair_for_wen_restart: Option<Arc<RwLock<Vec<Slot>>>>,
Contributor

suggestion: s/slots_to_repair_for_wen_restart/wen_restart_repair_slots/ for brevity

Contributor Author

Done.

core/src/tvu.rs (outdated)
@@ -205,6 +206,7 @@ impl Tvu {
             repair_whitelist: tvu_config.repair_whitelist,
             cluster_info: cluster_info.clone(),
             cluster_slots: cluster_slots.clone(),
+            slots_to_repair_for_wen_restart: slots_to_repair_for_wen_restart.clone(),
Contributor

do you need to clone() here?

Contributor Author

Actually I don't need to now; removed.

let mut slots_stake_map = HashMap::new();
for slot in last_voted_fork_slots {
    if slot <= &root_slot {
        break;
Contributor

is last_voted_fork_slots always sorted?

Contributor Author

Yes, it's sorted in the Gossip code.

Contributor

It would be good to add an assert here that it's sorted, to make the assumption explicit.

Contributor Author

The problem is that if it's not sorted, the assertion doesn't help you, because you've already exited early. I'm letting it loop over all slots instead; a loop over a Vec should be really fast.
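
A sketch of that defensive version (assuming Slot = u64 and that sender_stake is the reporting validator's stake; the variable names are illustrative):

// No early break: unsorted input can no longer silently drop slots, and a
// single pass over the Vec is cheap.
let mut slots_stake_map: HashMap<Slot, u64> = HashMap::new();
for slot in last_voted_fork_slots {
    if *slot > root_slot {
        slots_stake_map.insert(*slot, sender_stake);
    }
}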

Ok(())
}

fn aggregate_restart_last_voted_fork_slots(
Contributor

I think exit from Validator::new should be piped down to this level so the loop can be interrupted.

Contributor Author

Done.
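
A minimal sketch of the interruptible loop (the polling interval and the threshold closure are stand-ins, not the PR's exact code):

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};
use std::time::Duration;

fn aggregate_restart_last_voted_fork_slots(
    exit: Arc<AtomicBool>,
    mut threshold_reached: impl FnMut() -> bool, // stand-in for the real aggregation
) -> Result<(), String> {
    loop {
        if exit.load(Ordering::Relaxed) {
            // The validator is shutting down; bail out instead of looping forever.
            return Err("Exiting".to_string());
        }
        if threshold_reached() {
            return Ok(());
        }
        std::thread::sleep(Duration::from_secs(1));
    }
}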

RestartState::Init => RestartState::LastVotedForkSlots,
RestartState::LastVotedForkSlots => RestartState::HeaviestFork,
RestartState::HeaviestFork => RestartState::Done,
_ => panic!("Unexpected state {:?}", new_progress.state()),
Contributor

nit: s/_/RestartState::Done/ so compilation will break if a new enum value is added.

Contributor Author

Done.
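
A sketch of the exhaustive match being suggested (the extra variants are the ones visible elsewhere in this thread):

let next_state = match new_progress.state() {
    RestartState::Init => RestartState::LastVotedForkSlots,
    RestartState::LastVotedForkSlots => RestartState::HeaviestFork,
    RestartState::HeaviestFork => RestartState::Done,
    // No catch-all arm: adding a new RestartState variant now breaks
    // compilation here instead of panicking at runtime.
    RestartState::Done
    | RestartState::FinishedSnapshot
    | RestartState::GeneratingSnapshot
    | RestartState::WaitingForSupermajority => {
        panic!("Unexpected state {:?}", new_progress.state())
    }
};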

wen-restart/src/wen_restart.rs (outdated, resolved)
pub fn new(from: Pubkey, now: u64, last_voted_hash: Hash, shred_version: u16) -> Self {
Self {
// This number is MAX_CRDS_OBJECT_SIZE - empty serialized RestartLastVotedForkSlots.
const MAX_SPACE: usize = 824;
Contributor

You may be able to define this in terms of the other constants and use const_assert_eq!() to make this less fragile. There are some similar patterns in legacy.rs.

Contributor Author

Okay, I'll take a look.
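
Roughly the pattern being suggested (const_assert_eq! comes from the static_assertions crate; EMPTY_RESTART_LAST_VOTED_FORK_SLOTS_SIZE is a hypothetical name for the derived constant):

use static_assertions::const_assert_eq;

// Derive MAX_SPACE from the existing constants instead of hard-coding it,
// then pin the expected value so any drift breaks the build at compile time.
const MAX_SPACE: usize = MAX_CRDS_OBJECT_SIZE - EMPTY_RESTART_LAST_VOTED_FORK_SLOTS_SIZE;
const_assert_eq!(MAX_SPACE, 824);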

wen-restart/src/wen_restart.rs (outdated, resolved)
wen-restart/src/wen_restart.rs (outdated, resolved)
{
    *wen_restart_repair_slots.write().unwrap() = filtered_slots;
}
write_wen_restart_records(wen_restart_path, progress)?;
@carllin (Contributor) commented on Feb 4, 2024

Isn't this write handled by the caller in wait_for_wen_restart()?

increment_and_write_wen_restart_records(wen_restart_path, &mut progress)?;

For less complicated code and more modular testing, I think:

  1. We leave all the record keeping in pub fn wait_for_wen_restart(), so wen_restart_path never needs to be passed into these state transition functions. This also means that for testing we don't have to continually check, for every error, that records are written to wen_restart_path (we only need one test for that); we just have to test that the inner state transition function returns the expected state.
  2. Each of the state transition functions called here:

         RestartState::Init => send_restart_last_voted_fork_slots(
             last_vote.clone(),
             blockstore.clone(),
             cluster_info.clone(),
             &mut progress,
         )?,
         RestartState::LastVotedForkSlots => aggregate_restart_last_voted_fork_slots(
             wen_restart_path,
             wait_for_supermajority_threshold_percent,
             cluster_info.clone(),
             bank_forks.clone(),
             wen_restart_repair_slots.clone().unwrap(),
             exit.clone(),
             &mut progress,
         )?,
         RestartState::HeaviestFork => {
             warn!("Not implemented yet, make it empty to complete test")
         }
         RestartState::FinishedSnapshot => return Err("Not implemented!".into()),
         RestartState::GeneratingSnapshot => return Err("Not implemented!".into()),
         RestartState::WaitingForSupermajority => return Err("Not implemented!".into()),
         RestartState::Done => return Ok(()),
     }

     should return a custom error type. For instance, aggregate_restart_last_voted_fork_slots() should have an enum AggregateForksError. This achieves a few things:

  • For debugging, we know where each of the errors originated.
  • For testing, we can just test that every error returns the expected state. Then, from 1. above, we don't have to worry about testing that it's written to wen_restart_path.

Contributor Author

I originally put all the bookkeeping in wait_for_wen_restart, but then realized that AggregateLastVotedForkSlots might span several hours while we wait for validators to join the restart, so I decided we want to write the received LastVotedForkSlots inside this loop, to speed up internal state population in case the validator restarts.

What's your take on this? We can of course rely on others sending us the Gossip messages as well.

Contributor Author

Another reason we want to keep a record of the Gossip messages we received is that we can later debug why the restart didn't work.

Contributor Author

Instead of adding one error type per internal function, I added one WenRestartError for all functions in the crate, so we can use ? everywhere. How does this look?

@carllin (Contributor) commented on Feb 7, 2024

> I originally put all book keeping in wait_for_wen_restart, but then realized that AggregateLastVotedForkSlots might span several hours when we wait for validators to join the restart, then I decided we want to write received LastVotedForkSlots inside this loop so we can speed up internal state population in case the validator restarts.

You're totally right. Let's keep the record writing inside the function then, and test that those records are properly written.
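
A sketch of that flow, assuming a gossip accessor like cluster_info.get_restart_last_voted_fork_slots(&mut cursor) and an aggregate helper (both names are assumptions for illustration):

// Inside the aggregation loop: fold each batch of newly received gossip
// messages into `progress`, then flush the protobuf, so a validator that
// crashes and restarts can rebuild its aggregate state from disk.
let new_messages = cluster_info.get_restart_last_voted_fork_slots(&mut cursor);
if !new_messages.is_empty() {
    for message in &new_messages {
        aggregate.aggregate(message, &mut progress);
    }
    write_wen_restart_records(wen_restart_path, &progress)?;
}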

Contributor

> Instead of adding one error type per internal function, I added one WenRestartError for all functions in the crate, so we can do ?; everywhere. How does this look?

I think it's better to still organize per state transition for testing; see the comment below.

Comment on lines 43 to 62
#[error("Protobuf Decode error: {0}")]
DecodeError(#[from] prost::DecodeError),
#[error("Protobuf Encode error: {0}")]
EncodeError(#[from] prost::EncodeError),
#[error("Exiting")]
Exiting,
#[error("Invalid Last Vote Type: {0:?}")]
InvalidLastVoteType(VoteTransaction),
#[error("File IO error: {0}")]
IoError(#[from] std::io::Error),
#[error("Missing last voted fork slots")]
MissingLastVotedForkSlots,
#[error("Hash parse error: {0}")]
ParseHashError(#[from] solana_sdk::hash::ParseHashError),
#[error("Pubkey parse error: {0}")]
PubkeyParseError(#[from] solana_sdk::pubkey::ParsePubkeyError),
#[error("Gossip push RestartLastVotedForkSlotsError: {0}")]
RestartLastVotedForkSlotsError(#[from] RestartLastVotedForkSlotsError),
#[error("Unexpected state: {0:?}")]
UnexpectedState(wen_restart_proto::State),
Contributor

I think it would be better to organize these by state transition:

pub enum SendLastVotedForkError {

}

pub enum LastVotedForkSlotsError {

}
  1. Then wait_for_wen_restart() still returns Box<dyn std::error::Error>.
  2. For testing, for each state transition we make sure all of its particular errors write properly to the record and that we can restart.

Contributor Author

Hmm, when you say "write properly to the record", do you mean writing to the logs or to the protobuf as well? Since some of the errors might be transient, I didn't plan to write the errors to the protobuf.

@carllin (Contributor) commented on Feb 7, 2024

Oh I didn't mean write the error out. I mean the progress data structure you write is what you expect after the error happens.

For every single error, for every single state transition.

Contributor Author

I made the change, but it actually feels wrong IMHO, for several reasons:

  1. It complicates the public API for the sake of testing.
     pub enum xxxError is a public interface. If you think about how it's used, all people see from outside the module is wait_for_wen_restart, so it feels natural to return a single WenRestartError; it feels unnatural to return several Error enums, leaking our implementation details (which callers shouldn't care about).
  2. It makes tests harder to write.
     I could of course test each function individually, but I do have to test that they work in wait_for_wen_restart. If I only need to test is_err() then it's all fine, but if I want to test for a specific error (e.g. wrong file permissions), I have to downcast the dyn error and constantly look back at the error definitions to see what type of error I'm now expecting.
  3. It makes common functions harder to implement and change.
     The fact that we do need to write the protobuf means I need to call write_wen_restart() from several places, but they would return different xxxError types, so I'd need to create yet another WriteWenRestartError and convert it to the other Error types; that's code duplication and unnecessary conversion. And if we move part of a function into another internal method, do we need to create yet another Error struct?

All in all, this feels wrong; maybe I'm misunderstanding what you described.

@carllin (Contributor) commented on Feb 8, 2024

> pub enum xxxError is public interface, if you think about how it's used, all people see from outside the module is wait_for_wen_restart, so it feels natural to have WenRestartError returned, it feels unnatural to have several Error enum returned, leaking our implementation details (which they shouldn't care about).

Here we're not writing a standalone library so I think a boxed error is fine. The purpose of this error is to make attribution of errors/debugging easier, which categorizing the errors does. We can now trace exactly which step in the restart went wrong. I think we could also introduce a top level WenRestartError that wraps the errors like so:

enum WenRestartError {
    SendLastVotedForkError(SendLastVotedForkError),
    AggregateLastVotedForkSlotsError(AggregateLastVotedForkSlotsError),
    ...
}

> I could of course test each function individually, but I do have to test they work in wait_for_wen_restart. If I only need to test is_err() then it's all fine, but if I want to test for specific error (e.g. file permission wrong), I had to do a downcast on dyn error, I had to constantly look back to error definitions to see what type of error I'm now expecting

Hmm, I don't think you have to test every error case in wait_for_wen_restart(). I think if the state transition functions are a -> b -> c, then:

  1. You just have to test that each transition a, b, and c individually will run successfully to "completion" across all cases. "Completion" in this case means the ending output passes the verification for the initial input of the next state transition. Concretely this means testing in the happy case the state is correct, and that on all declared errors, they can recover and then run to completion.
  2. From 1, inductively, you can then conclude that a -> b -> c will run correctly.
  3. All you need to test then in wait_for_wen_restart() is that the order of calls a -> b -> c is correct and the arguments are being passed correctly. Concretely this means just testing one happy path.

> The fact that we do need to write the protobuf means I need to call write_wen_restart() from several places, but they will return different xxxError types, so I need to create yet another WriteWenRestartError and convert to other Error types, this is code duplication and unnecessary conversion. If we move part of the function to another internal method, we need to create yet another Error struct?

I think a unique error for write_wen_restart is fine, it is just a suberror of each state transition. This way we know which step the write failed in.

Contributor Author

Thinking about it more, I think I agree that failures should be tested on the individual functions rather than wait_for_wen_restart(), so I changed it accordingly.

The only thing I'm hesitant about is that I still don't see how splitting WenRestartError into individual errors helps us. If the proto file is suddenly not writable, we need to solve that problem no matter what stage WenRestart is in, so I don't think replicating file IO errors everywhere really helps the user; it just complicates the public API.

Contributor

How about this: we guarantee no errors on the state transitions 😄. I.e., we guarantee the state going in and the state coming out of the records are always valid.

  1. All of the current errors are verification errors on the initial state:

         return Err(WenRestartError::MissingLastVotedForkSlots);
             }
         } else {
             return Err(WenRestartError::InvalidLastVoteType(last_vote));
         }

     and https://github.com/solana-labs/solana/pull/33892/files#diff-c32a5c15bdd23820489e6f8af8b305aee344228d0cb6789e554a988d57d54bd7R70-R71. These validations can be done at the beginning, on entry into the state machine in wait_for_wen_restart, as a validate_state_1(), validate_state_2(), etc. These can then be unit tested. Ideally I think we would change WenRestartProgress to be an enum with a separate state for each step in this state machine, i.e.

         enum WenRestartProgress {
             Init,
             SendForkSlots(SendForkSlots),
             AggregateForkslots(AggregateForkSlots),
         }

     and then at the end of each state transition we just spit out the next iteration of the enum.

  2. Then the only acceptable error is logging errors, which we can then enforce is the only acceptable error.
  3. Then if we test that write_wen_restart_records() either writes a whole valid record OR doesn't write a record (i.e. errors), then we know the state must always be valid.

Contributor Author

I can move the initialization errors into initialize(), but I can't guarantee there are no other errors after initialize(). For example, push_restart_last_voted_fork_slots() could return an error. And we might want to throw errors in the future while running.

Contributor

you're right on this 😃

If we don't want a separate enum per state transition, then at minimum I would like to be able to print out the backtrace of where that error occurred. Maybe this: https://stackoverflow.com/questions/75060346/how-can-i-get-file-and-line-number-where-an-error-was-returned

Contributor Author

Changed to anyhow instead of thiserror for backtraces, and moved the initialization errors into initialize().
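
A sketch of the resulting shape (assuming the prost-generated WenRestartProgress type; anyhow attaches context and captures a backtrace that RUST_BACKTRACE=1 will print):

use anyhow::{Context, Result};
use prost::Message;

fn read_wen_restart_records(records_path: &std::path::Path) -> Result<WenRestartProgress> {
    let buffer = std::fs::read(records_path)
        .with_context(|| format!("failed to read {}", records_path.display()))?;
    let progress = WenRestartProgress::decode(buffer.as_slice())
        .context("failed to decode WenRestartProgress")?;
    Ok(progress)
}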

}
None => {
    // repair and restart option does not work without last voted slot.
    if let VoteTransaction::Vote(ref vote) = last_vote {
Contributor

Will this error for validators who have a default tower at startup and haven't voted yet?

Contributor Author

The default tower has an empty slots list, so it will not fail on line 91; instead it will fail on line 88.

Contributor

Is that what we want or do we want validators with a default tower to be able to perform this restart as well?

Contributor Author

Hmm, if you have no last vote information, I'm not sure you can contribute anything to the restart. I guess those validators can copy the generated snapshot like today. Or we could try to use information from the ledger if the tower is not available?

Contributor

I think that's fine; we should have a clear error message telling those validators they need to wait, then.

I think it would be a better user experience if you verified the tower beforehand and printed a clear error message if the tower is invalid or missing.

Contributor Author

That's a good point. I added an error log there; see if this helps?

@wen-coding force-pushed the wen_restart_aggregate_last_voted_fork_slots branch from 10d4afa to e3d0194 on February 29, 2024 04:18
@wen-coding requested a review from carllin on March 1, 2024 17:30
Comment on lines 321 to 325
Cannot find last voted slot in the tower storage, it either means that this node has never \
voted or the tower storage is corrupted. Unfotunately, since WenRestart is a consensus protocol \
depending on each participant to send their last voted fork slots, your validator cannot participate.\
Please wait in the discord channel for the result of WenRestart, then generate a snapshot and use \
--wait-for-supermajority to restart the validator.");
Contributor

Unfotunately->Unfortunately

Contributor Author

Done.

Cannot find last voted slot in the tower storage, it either means that this node has never \
voted or the tower storage is corrupted. Unfotunately, since WenRestart is a consensus protocol \
depending on each participant to send their last voted fork slots, your validator cannot participate.\
Please wait in the discord channel for the result of WenRestart, then generate a snapshot and use \
Contributor

Nit: Please check discord for the conclusion of the WenRestart protocol, then generate a snapshot and use...

Contributor Author

Done.

)
.err()
.unwrap()
.to_string()
@carllin (Contributor) commented on Mar 1, 2024

Can we remove this .err().unwrap().to_string() pattern from all the tests? We can just do an

assert_eq!(initialize(), Err()) or assert!(matches!(result, Err()))

Contributor Author

Unfortunately you can't directly use assert_eq! or even assert_matches! on anyhow::Error:
https://users.rust-lang.org/t/issues-in-asserting-result/61198/5

I'd have to use unwrap_err().downcast().unwrap(); would you prefer that instead?

Contributor

wow so ugly 🤮

yeah, I think let's downcast and match on the actual error.

Contributor Author

Done.
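
The resulting test pattern, sketched (this assumes WenRestartError derives Debug and PartialEq so assert_eq! applies after the downcast):

// Build an anyhow::Error the way the crate's functions return one...
let result: anyhow::Result<()> = Err(WenRestartError::MissingLastVotedForkSlots.into());
// ...then downcast back to the concrete error and match on it directly.
let err = result.unwrap_err();
assert_eq!(
    err.downcast::<WenRestartError>().unwrap(),
    WenRestartError::MissingLastVotedForkSlots,
);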

@wen-coding merged commit bfe44d9 into solana-labs:master on Mar 2, 2024
46 checks passed
@wen-coding deleted the wen_restart_aggregate_last_voted_fork_slots branch on March 2, 2024 02:52
@wen-coding self-assigned this on Apr 30, 2024