-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add wen_restart module #33344
Add wen_restart module #33344
Conversation
- Implement reading LastVotedForkSlots from blockstore. - Add proto file to record the intermediate results. - Also link wen_restart into validator. - Move recreation of tower outside replay_stage so we can get last_vote.
Codecov Report
@@ Coverage Diff @@
## master #33344 +/- ##
=======================================
Coverage 81.7% 81.7%
=======================================
Files 805 807 +2
Lines 218162 218256 +94
=======================================
+ Hits 178411 178493 +82
- Misses 39751 39763 +12 |
wen-restart/src/wen_restart.rs
Outdated
let last_vote_slot = last_vote.last_voted_slot().unwrap(); | ||
let mut last_vote_fork = vec![last_vote_slot]; | ||
let mut slot = last_vote_slot; | ||
for _ in 0..MAX_SLOTS_ON_VOTED_FORKS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to go this deep, can we stop at block_commitment_cache.highest_supermajority_root()
?
if i'm behind the smr my last voted fork info doesn't contribute anything to wen restart anyway right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is also an option, I think the current logic is trying to be more straightforward so more fault tolerant. What if there is a bug in code so highest_supermajority_root is wrong?
So when we started the initial design we tried to see what the minimum required info is, we had to send last_vote, and need to send the slots on last voted fork so validators with different progress can sync up again. Other than that, we know an outage is probably less than 9 hours before we intervene, and we avoid relying on all other information. Because software can have bugs, you can't predict what is going wrong in an outage.
Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know what the latest oom time is, can we actually survive 9 hours without making a root?
either way it seems alright to play it safe and send all these slots.
was selfishly hoping we could reduce the size so that we could include hashes for the inevitable duplicate block restart 😛
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in past outages validators lasted maybe 3 to 4 hours without making root, we decided to restart before 9 hours.
Yeah whether we can include hash had been debated, and the first version it looks like hash isn't going to make it. We could always add another set of consensus on bankhash in the future though.
@CriesofCarrots mind giving the protobufs stuff a once over for consistency and build reliability? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some stylistic suggestions and nits from me. The proto build script itself looks good.
wen-restart/protos/wen_restart.proto
Outdated
DONE = 6; | ||
} | ||
|
||
message InitRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not totally clear to me what an InitRecord is. Might be more clear if you spell out Init? I'm not sure if Init is a noun or an adjective (or a verb, I suppose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to MyLastVotedForkSlots.
Co-authored-by: Tyera <teulberg@gmail.com>
Co-authored-by: Tyera <teulberg@gmail.com>
Co-authored-by: Tyera <teulberg@gmail.com>
Co-authored-by: Tyera <teulberg@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ final nits. (I only reviewed the proto build, usage, and related)
Co-authored-by: Tyera <teulberg@gmail.com>
Co-authored-by: Tyera <teulberg@gmail.com>
already larger than MAX_SLOTS_ON_VOTED_FORKS.
* Add wen_restart module: - Implement reading LastVotedForkSlots from blockstore. - Add proto file to record the intermediate results. - Also link wen_restart into validator. - Move recreation of tower outside replay_stage so we can get last_vote. * Update lock file. * Fix linter errors. * Fix depencies order. * Update wen_restart explanation and small fixes. * Generate tower outside tvu. * Update validator/src/cli.rs Co-authored-by: Tyera <teulberg@gmail.com> * Update wen-restart/protos/wen_restart.proto Co-authored-by: Tyera <teulberg@gmail.com> * Update wen-restart/build.rs Co-authored-by: Tyera <teulberg@gmail.com> * Update wen-restart/src/wen_restart.rs Co-authored-by: Tyera <teulberg@gmail.com> * Rename proto directory. * Rename InitRecord to MyLastVotedForkSlots, add imports. * Update wen-restart/Cargo.toml Co-authored-by: Tyera <teulberg@gmail.com> * Update wen-restart/src/wen_restart.rs Co-authored-by: Tyera <teulberg@gmail.com> * Move prost-build dependency to project toml. * No need to continue if the distance between slot and last_vote is already larger than MAX_SLOTS_ON_VOTED_FORKS. * Use 16k slots instead of 81k slots, a few more wording changes. * Use AncestorIterator which does the same thing. * Update Cargo.lock * Update Cargo.lock --------- Co-authored-by: Tyera <teulberg@gmail.com>
std::io::Result, | ||
}; | ||
|
||
fn main() -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe within this main function you'll want to add
println!("cargo:rerun-if-changed=proto/wen_restart.proto");
This causes the generated files to get regenerated if the protos change.
More info:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for checking, sending another PR to fix it.
// Copied and adapted from | ||
// https://github.com/Kimundi/rustc-version-rs/blob/1d692a965f4e48a8cb72e82cda953107c0d22f47/README.md#example | ||
// Licensed under Apache-2.0 + MIT | ||
match version_meta().unwrap().channel { | ||
Channel::Stable => { | ||
println!("cargo:rustc-cfg=RUSTC_WITHOUT_SPECIALIZATION"); | ||
} | ||
Channel::Beta => { | ||
println!("cargo:rustc-cfg=RUSTC_WITHOUT_SPECIALIZATION"); | ||
} | ||
Channel::Nightly => { | ||
println!("cargo:rustc-cfg=RUSTC_WITH_SPECIALIZATION"); | ||
} | ||
Channel::Dev => { | ||
println!("cargo:rustc-cfg=RUSTC_WITH_SPECIALIZATION"); | ||
// See https://github.com/solana-labs/solana/issues/11055 | ||
// We may be running the custom `rust-bpf-builder` toolchain, | ||
// which currently needs `#![feature(proc_macro_hygiene)]` to | ||
// be applied. | ||
println!("cargo:rustc-cfg=RUSTC_NEEDS_PROC_MACRO_HYGIENE"); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the wen-restart
module, I'm not seeing anything that requires specialization. Is this block needed? (Usually I see this code to enable the AbiExample
testing macros.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably not, I'll send you another PR to remove it.
@@ -1381,6 +1381,35 @@ pub fn app<'a>(version: &'a str, default_args: &'a DefaultArgs) -> App<'a, 'a> { | |||
.possible_values(BlockProductionMethod::cli_names()) | |||
.help(BlockProductionMethod::cli_message()) | |||
) | |||
.arg( | |||
Arg::with_name("wen_restart") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a quick look at this because of #33633. I believe (but could be mistaken) that this feature is not yet production ready. If that is correct, i think we should add this
.hidden(hidden_unless_forced())
with the idea being to keep the in-development features hidden from solana-validator --help
until they're ready
Problem
Add wen_restart module
Summary of Changes