
set-identity causes undesirable behavior when used while validator is waiting for supermajority. #35152

Closed
AshwinSekar opened this issue Feb 8, 2024 · 8 comments · Fixed by #35173

Comments

@AshwinSekar
Contributor

AshwinSekar commented Feb 8, 2024

Problem

During a cluster restart, if the validator has entered the WaitingForSupermajority state, calls to set-identity will appear to succeed; however, the validator will later panic in ReplayStage::push_vote.

Upon validator initialization we create the ProcessBlockStore closure using the identity provided at startup. This closure creates the tower when invoked:

let mut process_blockstore = ProcessBlockStore::new(
    &id,
    vote_account,
    // ...

Afterwards the admin RPC service is initialized. From here onwards the set-identity command is functional:

*admin_rpc_service_post_init.write().unwrap() = Some(AdminRpcRequestMetadataPostInit {

We then enter the wait for supermajority loop, at which point validators might execute the set-identity command:

solana/core/src/validator.rs, lines 1086 to 1093 at 09e0300:

let waited_for_supermajority = wait_for_supermajority(
    config,
    Some(&mut process_blockstore),
    &bank_forks,
    &cluster_info,
    rpc_override_health_check,
    &start_progress,
)

Validators execute the set-identity command, and the running process is updated with the new identity.
However, when ProcessBlockStore is invoked, it populates the loaded tower with the initial identity.

When voting, we hit a problem: a tower built for the previous identity is being saved with the new identity's keypair, so SavedTower::new returns an error and the process exits:

let saved_tower = SavedTower::new(tower, identity_keypair).unwrap_or_else(|err| {
    error!("Unable to create saved tower: {:?}", err);
    std::process::exit(1);
});

Normally such a situation is avoided, as replay replaces the old tower with the new one:

if my_pubkey != cluster_info.id() {
    identity_keypair = cluster_info.keypair().clone();
    let my_old_pubkey = my_pubkey;
    my_pubkey = identity_keypair.pubkey();
    // Load the new identity's tower
    tower = Tower::restore(tower_storage.as_ref(), &my_pubkey)
    // ...

However, since set-identity was performed before ReplayStage was initialized, this logic never ran.

Proposed Solutions

Copy the behavior in master by not initializing the admin RPC service until after wait for supermajority has completed. This prevents set-identity from executing until the tower has been initialized.

However, this still leaves a small window between when the RPC service is initialized and when the first iteration of ReplayStage executes, during which the identity could be changed out from under us. A more complete solution would be to fail set-identity until we are sure ReplayStage is running.
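For illustration, a minimal sketch of that more complete solution, assuming a shared flag that ReplayStage flips at the top of its first loop iteration (the flag and handler names here are hypothetical, not the actual admin RPC API):

use std::sync::{
    atomic::{AtomicBool, Ordering},
    Arc,
};

// Hypothetical flag, shared with ReplayStage, which would set it to true at
// the top of its first loop iteration (i.e. after any tower reload can happen).
fn handle_set_identity(replay_stage_running: &Arc<AtomicBool>) -> Result<(), String> {
    if !replay_stage_running.load(Ordering::Acquire) {
        // Refuse the swap instead of letting a stale tower kill the process later.
        return Err("ReplayStage is not running yet; retry set-identity later".to_string());
    }
    // ... proceed with the usual identity swap ...
    Ok(())
}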

@t-nelson
Contributor

t-nelson commented Feb 8, 2024

proposal seems reasonable for a stopgap. i think we'd ideally want to support set-identity during WFSM for the sake of convenience

@AshwinSekar
Contributor Author

agreed, we can have WFSM reinitialize the tower if set-identity is detected, or make ReplayStage smart enough to reinitialize the tower before the first iteration.

However that would clash with #33865. @ryleung-solana is it feasible to initialize the admin rpc service before the tpu?
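A rough sketch of the WFSM variant, assuming the wait loop has access to the tower and tower_storage (this is not the actual code):

// Inside the wait_for_supermajority loop: if set-identity ran while we were
// waiting, reload the tower for whatever identity cluster_info now holds.
if tower.node_pubkey != cluster_info.id() {
    let new_pubkey = cluster_info.id();
    tower = Tower::restore(tower_storage.as_ref(), &new_pubkey).unwrap_or_else(|err| {
        error!("Failed to load tower for {}: {:?}", new_pubkey, err);
        std::process::exit(1);
    });
}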

@t-nelson
Contributor

t-nelson commented Feb 8, 2024

theoretically it should be feasible to start an admin interface immediately. the way we have it designed today turns it into a dependency nightmare (both package-wise and initialization-wise). it'd be more flexible if we redesigned it around channels rather than actually holding Arcs
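A rough sketch of that channel-based shape, assuming the admin service owns only a Sender and validator components drain requests at points where a swap is safe (all of these names are hypothetical):

use std::sync::mpsc::{Receiver, Sender};
use solana_sdk::signature::Keypair;

// Hypothetical message type: the admin RPC pushes requests through a channel
// instead of mutating state behind shared Arcs.
enum AdminRequest {
    SetIdentity(Keypair),
}

// The admin service can start immediately; it only needs the Sender.
fn handle_set_identity(tx: &Sender<AdminRequest>, keypair: Keypair) {
    let _ = tx.send(AdminRequest::SetIdentity(keypair));
}

// Components (e.g. ReplayStage) drain the channel wherever a swap is safe.
fn drain_admin_requests(rx: &Receiver<AdminRequest>) {
    for req in rx.try_iter() {
        match req {
            AdminRequest::SetIdentity(new_keypair) => {
                // apply the swap: update cluster_info, reload the tower, etc.
                let _ = new_keypair;
            }
        }
    }
}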

@AshwinSekar
Contributor Author

that would definitely save us some hassle. having to poll for when cluster_info changes is less than ideal in this spot.

In terms of a minimum changeset to backport, i'm thinking we:

  • initialize the admin service after wait for supermajority
  • use the initial identity from Validator::new in ReplayStage for the first identity-change comparison, updating the tower if necessary. this happens right before the first vote is sent out (see the sketch below).
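A sketch of the second bullet; startup_pubkey stands for a hypothetical parameter carrying the identity Validator::new started with:

// Seed the comparison with the startup identity rather than whatever
// cluster_info holds now; if set-identity ran during WFSM, the very first
// iteration then takes the existing tower-reload branch shown above.
let mut my_pubkey = startup_pubkey;

loop {
    if my_pubkey != cluster_info.id() {
        identity_keypair = cluster_info.keypair().clone();
        my_pubkey = identity_keypair.pubkey();
        tower = Tower::restore(tower_storage.as_ref(), &my_pubkey)
            .unwrap_or_else(|err| {
                error!("Failed to load tower for {}: {:?}", my_pubkey, err);
                std::process::exit(1);
            });
    }
    // ... rest of the replay loop ...
}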

@ryleung-solana
Contributor

> agreed, we can have WFSM reinitialize the tower if set-identity is detected, or make ReplayStage smart enough to reinitialize the tower before the first iteration.
>
> However that would clash with #33865. @ryleung-solana is it feasible to initialize the admin rpc service before the tpu?

I mean, we could initialize it once before the TPU is initialized, then reinitialize it afterwards... I guess the only thing this means is that set-identity calls before the second initialization will not get propagated to the quic layer. Then again, right now, any calls to set-identity will not get propagated to the quic layer before the initialization anyway...

@diman-io
Contributor

> However, this still leaves a small window between when the RPC service is initialized and when the first iteration of ReplayStage executes, during which the identity could be changed out from under us.

Actually, this is a significant problem in everyday validator operation. Until replay_stage starts, changing the identity leads to a crash. The main issue here is that the startup script has to check whether replay_stage has begun before setting a new identity.

If any validators are looking for a solution right now, they can use this patch. I'm using it, and it prevents the crash (it actually saved me during the last restart, though I still lost several hundred credits); afterwards you just need to change the identity twice more (set the placeholder back and then set the real one).

         generate_time.stop();
         replay_timing.generate_vote_us += generate_time.as_us();
         if let Some(vote_tx) = vote_tx {
+            if tower.node_pubkey != identity_keypair.pubkey() {
+                error!(
+                    "Most likely, the identity was changed from {} to {} before the voting started.",
+                    tower.node_pubkey,
+                    identity_keypair.pubkey()
+                );
+                return;
+            }
             tower.refresh_last_vote_tx_blockhash(vote_tx.message.recent_blockhash);
 
             let saved_tower = SavedTower::new(tower, identity_keypair).unwrap_or_else(|err| {

> A more complete solution would be to fail set-identity until we are sure ReplayStage is running.

To be honest, I thought about saving the desired identity in a new cluster_info variable and then triggering the change from replay_stage, because, as I understand it, there is still a very small window of time in which changing the identity can lead to such an error. I don't remember why I didn't implement it; either I ran out of time or I didn't like the architecture of the solution.
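A sketch of that deferred-swap idea, assuming a hypothetical pending-identity slot that set-identity writes and replay_stage drains at a safe point (none of these names exist in the codebase):

use std::sync::RwLock;
use solana_sdk::signature::Keypair;

// Hypothetical: set-identity parks the keypair here instead of swapping it
// into cluster_info immediately.
pub struct PendingIdentity {
    keypair: RwLock<Option<Keypair>>,
}

impl PendingIdentity {
    // Called from the admin RPC handler.
    pub fn request(&self, keypair: Keypair) {
        *self.keypair.write().unwrap() = Some(keypair);
    }

    // Called from replay_stage at the top of each iteration, where the tower
    // can be reloaded before any vote is built or signed.
    pub fn take(&self) -> Option<Keypair> {
        self.keypair.write().unwrap().take()
    }
}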

@AshwinSekar
Contributor Author

@diman-io ack will think about this some more

out of curiosity could you describe your use case for calling set-identity during startup? I was under the impression it was mainly used for hot swap setups on already running validators

@diman-io
Contributor

> out of curiosity could you describe your use case for calling set-identity during startup? I was under the impression it was mainly used for hot swap setups on already running validators

I don't store keys with providers. Unfortunately, I've sometimes received "new" machines that had recoverable partitions from previous users. This also eliminates, for example, the risk associated with the unknown fate of a disk if it fails. Overall, I sleep better this way :)
It also lets me treat primary and backup machines as truly equivalent (using completely identical configurations, for instance), with the actual assignment of roles (primary/backup) happening exclusively through external tools.

Not all developers are aware that you can set the identity this way:

ssh node /path/to/solana-validator -l /path/to/ledger set-identity < ~/keys/identity.json

Here, ~/keys/identity.json lives on the local machine.

In reality, of course, it's more complicated.
