set-identity causes undesirable behavior when used while validator is waiting for supermajority. #35152
Comments
proposal seems reasonable for a stopgap. i think we'd ideally want to support |
agreed, we can have WFSM reinitialize the tower if However that would clash with #33865. @ryleung-solana is it feasible to initialize the admin rpc service before the tpu? |
theoretically it should be feasible to start an admin interface immediately. the way we have it designed today turns it into a dependency nightmare (both package-wise and initialization-wise). it'd be more flexible if we redesigned it around channels rather than actually holding |
that would definitely save us some hassle. having to poll around when In terms of a minimum changeset to backport, i'm thinking we:
|
I mean, we could initialize it once before the TPU is initialized, then reinitialize it afterwards... I guess the only thing this means is that |
Actually, this is a significant problem in the everyday use of the validator. Until the replay_stage starts, changing the identity leads to a crash. The main issue here is that the startup script should check whether the replay_stage has begun before setting a new identity. If any validators are looking for a solution right now, they can use this patch. I'm using it, and it saves me from crashing (it actually saved me during the last restart, although I lost several hundred credits). After that, you just need to change the identity twice more (restore the fake one, then set the real one).
To be honest, I thought about saving the desired identity in a new cluster_info variable and then triggering the change from the replay_stage, because, as I understand, there is still a very small (in time) window when changing the identity can lead to such an error. I don't remember why I didn't implement it, either because of time or I didn't like the architecture of the solution. |
@diman-io ack, will think about this some more. Out of curiosity, could you describe your use case for calling set-identity during startup? I was under the impression it was mainly used for hot swap setups on already running validators. |
I don't store keys with providers. Unfortunately, I've sometimes received "new" machines that had recoverable partitions from previous users. Also, this eliminates, for example, the risk associated with the unknown fate of a disk if it fails. Overall, I sleep better this way :) Not all developers are aware that you can set up identity in this way: ssh node /path/to/solana-validator -l /path/to/ledger set-identity < ~/keys/identity.json here, In reality, of course, it's more complicated. |
Problem
During cluster restart, if a validator has entered the WaitingForSupermajority state, calls to set-identity will appear to succeed; however, later, upon ReplayStage::push_vote, the validator will panic.
Upon validator initialization we create the ProcessBlockstore closure using the identity provided at startup. This closure creates the tower when invoked:
solana/core/src/validator.rs
Lines 835 to 837 in eeb0cf1
Afterwards the admin rpc service is initialized. From here onwards the set-identity command will be functional:
solana/core/src/validator.rs
Line 1079 in 09e0300
We then enter the wait for supermajority loop, at which point validators might execute the set-identity command:
solana/core/src/validator.rs
Lines 1086 to 1093 in 09e0300
Validators execute the set-identity command, and the running process is updated with the new identity.
However, when ProcessBlockstore is executed, it populates the loaded tower with the initial identity.
When voting, we encounter an issue, as we have a tower with the previous identity attempting to be saved with the new identity:
solana/core/src/replay_stage.rs
Lines 2525 to 2527 in 09e0300
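The ordering problem can be sketched in isolation. This is a minimal model, not the validator code: Pubkey, Tower, TowerError, save_tower and startup_sequence are hypothetical stand-ins, and the sketch returns an error at the point where the issue reports a panic.

```rust
// Simplified stand-ins; these are NOT the real solana types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pubkey(pub u8);

#[derive(Debug)]
pub struct Tower {
    pub node_pubkey: Pubkey,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TowerError {
    WrongTower { tower: Pubkey, identity: Pubkey },
}

/// Hypothetical save: refuses to persist a tower under a different identity.
pub fn save_tower(tower: &Tower, identity: Pubkey) -> Result<(), TowerError> {
    if tower.node_pubkey != identity {
        return Err(TowerError::WrongTower {
            tower: tower.node_pubkey,
            identity,
        });
    }
    Ok(())
}

/// Replays the startup ordering described in the issue.
pub fn startup_sequence() -> Result<(), TowerError> {
    let startup_identity = Pubkey(1);
    // 1. The closure captures the identity supplied at startup.
    let process_blockstore = move || Tower {
        node_pubkey: startup_identity,
    };
    // 2. The admin RPC is up; set-identity runs while waiting for supermajority.
    let current_identity = Pubkey(2);
    // 3. The closure runs afterwards and still builds the tower for identity 1.
    let tower = process_blockstore();
    // 4. The first vote tries to save that tower under identity 2.
    save_tower(&tower, current_identity)
}
```

Calling startup_sequence() yields the WrongTower error, because the closure captured the identity by value before set-identity changed it.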
Normally such a situation is avoided as replay will replace the old tower with the new tower:
solana/core/src/replay_stage.rs
Lines 944 to 950 in 09e0300
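A rough sketch of that replacement step, assuming replay remembers the identity it last saw and compares it against the current one on each iteration. The names Pubkey, Tower, load_tower_for and refresh_tower are hypothetical, not the replay_stage.rs API.

```rust
// Simplified stand-ins; these are NOT the real solana types.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pubkey(pub u8);

#[derive(Debug, PartialEq, Eq)]
pub struct Tower {
    pub node_pubkey: Pubkey,
}

/// Hypothetical loader: stands in for restoring or creating a tower
/// that belongs to `identity`.
pub fn load_tower_for(identity: Pubkey) -> Tower {
    Tower {
        node_pubkey: identity,
    }
}

/// One identity check, as a replay loop iteration might perform it.
/// Swaps the tower only when the identity differs from the one replay
/// last saw. Returns true when a swap happened.
pub fn refresh_tower(
    tower: &mut Tower,
    last_identity: &mut Pubkey,
    current_identity: Pubkey,
) -> bool {
    if *last_identity == current_identity {
        return false;
    }
    *tower = load_tower_for(current_identity);
    *last_identity = current_identity;
    true
}
```

Under this assumption, a change made before replay starts goes unnoticed: replay begins with last_identity already equal to the new identity, so refresh_tower returns false and the tower built for the old identity stays in place.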
However, since the set-identity was performed before ReplayStage was initialized, this logic was not performed.
Proposed Solutions
Copy the behavior in master by delaying initialization of the admin rpc service until after wait for supermajority has completed. This prevents set-identity from executing until the tower has been initialized. This would mean bringing the core/src/validator.rs changes from https://github.com/solana-labs/solana/pull/33865/files# to v1.17.
However, this still leaves a small window, from when the rpc service is initialized to when the first iteration of ReplayStage is executed, in which the identity could be changed from under us. A more complete solution would be to fail set-identity until we are sure ReplayStage is running.
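One way the "fail set-identity until ReplayStage is running" idea could look is a readiness flag that the handler checks first. This is only a sketch under assumptions: AdminState, mark_replay_running and the error string are hypothetical and do not reflect the actual admin rpc service.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

// Simplified stand-in; NOT the real solana Pubkey.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pubkey(pub u8);

/// Hypothetical state shared between the admin handler and replay.
pub struct AdminState {
    replay_running: AtomicBool,
    identity: RwLock<Pubkey>,
}

impl AdminState {
    pub fn new(initial: Pubkey) -> Self {
        Self {
            replay_running: AtomicBool::new(false),
            identity: RwLock::new(initial),
        }
    }

    /// Replay would call this once its loop has started.
    pub fn mark_replay_running(&self) {
        self.replay_running.store(true, Ordering::Release);
    }

    pub fn identity(&self) -> Pubkey {
        *self.identity.read().unwrap()
    }

    /// Rejects the request, instead of appearing to succeed, while
    /// replay has not started yet.
    pub fn set_identity(&self, new_identity: Pubkey) -> Result<(), String> {
        if !self.replay_running.load(Ordering::Acquire) {
            return Err("set-identity rejected: replay has not started yet".to_string());
        }
        *self.identity.write().unwrap() = new_identity;
        Ok(())
    }
}
```

With this shape, a call made while waiting for supermajority returns an error the operator can retry on, rather than succeeding and causing a failure at the first vote.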