[EFM] Recoverable Random Beacon State Machine #6771
```diff
@@ -0,0 +1,5 @@
+The `storage.RecoverableRandomBeaconStateMachine` formalizes the life-cycle of the Random Beacon keys for each epoch. On the happy path, each consensus participant for the next epoch takes part in a DKG to obtain a threshold key for participating in Flow's Random Beacon. After successfully finishing the DKG protocol, the node obtains a random beacon private key, which is stored in the database along with the current DKG state `flow.DKGStateCompleted`. If the DKG fails for any reason, the private key will be `nil` and the current DKG state is set to `flow.DKGStateFailure`.
+If the epoch switchover fails, the network goes into Epoch Fallback Mode [EFM]. The governance committee can recover the network via a special `EpochRecover` transaction. In this case, the set of threshold keys is specified by the governance committee.
+The current implementation focuses on the scenario where the governance committee re-uses the threshold key set from the last successful epoch transition. While injecting other threshold keys into the nodes is conceptually possible and supported, the utilities for this recovery path are not yet implemented.
+
+[diagram](https://drive.google.com/file/d/1UnJLlTIs8IDOIHZhNUhXakeP_5Re10S4/view?usp=sharing)
```
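To make the life-cycle concrete, below is a minimal, self-contained Go sketch of the transition rules that the README and this PR imply. This is not the actual `storage.RecoverableRandomBeaconStateMachine` implementation: the `DKGStateUninitialized` starting state, the `allowedTransition` helper, and the exact transition set are illustrative assumptions.

```go
package main

import "fmt"

// DKGState mirrors the states named in this PR. The enum values and the
// initial DKGStateUninitialized state are assumptions for illustration.
type DKGState int

const (
	DKGStateUninitialized    DKGState = iota // hypothetical pre-DKG state
	DKGStateStarted                          // node has begun the DKG for the epoch
	DKGStateCompleted                        // DKG finished; private key stored, not yet committed
	DKGStateFailure                          // DKG failed; private key is nil
	RandomBeaconKeyCommitted                 // terminal: canonical key vector committed
)

// allowedTransition sketches the rules implied by this PR:
// RandomBeaconKeyCommitted is terminal, DKGStateFailure permits a
// self-transition, and epoch recovery may commit a key off the happy path.
func allowedTransition(from, to DKGState) bool {
	if from == RandomBeaconKeyCommitted {
		return false // terminal state: no further transitions
	}
	switch to {
	case DKGStateStarted:
		return from == DKGStateUninitialized
	case DKGStateCompleted:
		return from == DKGStateStarted
	case DKGStateFailure:
		return true // reachable from any non-terminal state, incl. itself
	case RandomBeaconKeyCommitted:
		return true // happy path, or EpochRecover by the governance committee
	}
	return false
}

func main() {
	fmt.Println(allowedTransition(DKGStateCompleted, RandomBeaconKeyCommitted)) // true
	fmt.Println(allowedTransition(RandomBeaconKeyCommitted, DKGStateFailure))   // false
}
```

The one deliberate asymmetry, that failure permits a self-transition while the committed state permits none, is what the `end()` handler further down relies on to distinguish "recovery already happened" from a genuine storage error.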
```diff
@@ -110,7 +110,7 @@ func (e *ReactorEngine) Ready() <-chan struct{} {
 		if phase == flow.EpochPhaseSetup {
 			e.startDKGForEpoch(currentCounter, first)
 		} else if phase == flow.EpochPhaseCommitted {
-			// If we start up in EpochCommitted phase, ensure the DKG end state is set correctly.
+			// If we start up in EpochCommitted phase, ensure the DKG current state is set correctly.
 			e.handleEpochCommittedPhaseStarted(currentCounter, first)
 		}
 	})
```
```diff
@@ -155,23 +155,23 @@ func (e *ReactorEngine) startDKGForEpoch(currentEpochCounter uint64, first *flow
 		Logger()

 	// if we have started the dkg for this epoch already, exit
-	started, err := e.dkgState.GetDKGStarted(nextEpochCounter)
+	started, err := e.dkgState.IsDKGStarted(nextEpochCounter)
 	if err != nil {
 		// unexpected storage-level error
 		// TODO use irrecoverable context
 		log.Fatal().Err(err).Msg("could not check whether DKG is started")
 	}
 	if started {
 		log.Warn().Msg("DKG started before, skipping starting the DKG for this epoch")
 		return
 	}

 	// flag that we are starting the dkg for this epoch
-	err = e.dkgState.SetDKGStarted(nextEpochCounter)
+	err = e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateStarted)
 	if err != nil {
 		// unexpected storage-level error
 		// TODO use irrecoverable context
-		log.Fatal().Err(err).Msg("could not set dkg started")
+		log.Fatal().Err(err).Msg("could not transition DKG state machine into state DKGStateStarted")
 	}

 	curDKGInfo, err := e.getDKGInfo(firstID)
```
```diff
@@ -246,14 +246,17 @@ func (e *ReactorEngine) startDKGForEpoch(currentEpochCounter uint64, first *flow
 // handleEpochCommittedPhaseStarted is invoked upon the transition to the EpochCommitted
 // phase, when the canonical beacon key vector is incorporated into the protocol state.
 // Alternatively we invoke this function preemptively on startup if we are in the
 // EpochCommitted Phase, in case the `EpochCommittedPhaseStarted` event was missed
 // due to a crash.
 //
 // This function checks that the local DKG completed and that our locally computed
 // key share is consistent with the canonical key vector. When this function returns,
-// an end state for the just-completed DKG is guaranteed to be stored (if not, the
+// the current state for the just-completed DKG is guaranteed to be stored (if not, the
 // program will crash). Since this function is invoked synchronously before the end
 // of the current epoch, this guarantees that when we reach the end of the current epoch
-// we will either have a usable beacon key (successful DKG) or a DKG failure end state
-// stored, so we can safely fall back to using our staking key.
+// we will either have a usable beacon key committed (state [flow.RandomBeaconKeyCommitted])
+// or we persist [flow.DKGStateFailure], so we can safely fall back to using our staking key.
 //
 // CAUTION: This function is not safe for concurrent use. This is not enforced within
 // the ReactorEngine - instead we rely on the protocol event emission being single-threaded
```
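The guarantee described in this doc comment is spread across the three hunks that follow. The condensed sketch below restates it with stand-in types (the real flow-go storage and crypto interfaces are not part of this diff, so everything here is an illustrative assumption): a completed DKG is verified against the canonical public key, and the state either advances to committed or falls to failure.

```go
package main

import "fmt"

// Stand-in types so the sketch compiles on its own.
type PublicKey string

type PrivateKey struct{ pub PublicKey }

func (k PrivateKey) PublicKey() PublicKey { return k.pub }

// dkgStore is a toy stand-in for the recoverable random beacon state machine.
type dkgStore struct {
	keys   map[uint64]PrivateKey // my beacon key share per epoch
	states map[uint64]string     // current DKG state per epoch
}

// checkBeaconKeyConsistency condenses the happy-path check documented above:
// if the local DKG completed, the locally computed key share is verified
// against the canonical public key, and the state advances to
// "RandomBeaconKeyCommitted" or falls to "DKGStateFailure".
func (s *dkgStore) checkBeaconKeyConsistency(epoch uint64, canonical PublicKey) {
	if s.states[epoch] != "DKGStateCompleted" {
		return // DKG never completed locally; nothing to verify
	}
	key, ok := s.keys[epoch]
	if !ok || key.PublicKey() != canonical {
		s.states[epoch] = "DKGStateFailure" // missing or inconsistent key share
		return
	}
	s.states[epoch] = "RandomBeaconKeyCommitted" // safe for the random beacon
}

func main() {
	s := &dkgStore{
		keys:   map[uint64]PrivateKey{43: {pub: "0xabc"}},
		states: map[uint64]string{43: "DKGStateCompleted"},
	}
	s.checkBeaconKeyConsistency(43, "0xabc")
	fmt.Println(s.states[43]) // RandomBeaconKeyCommitted
}
```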
```diff
@@ -267,13 +270,22 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
 		Uint64("next_epoch", nextEpochCounter). // the epoch the just-finished DKG was preparing for
 		Logger()

-	// Check whether we have already set the end state for this DKG.
+	// Check whether we have already set the current state for this DKG.
 	// This can happen if the DKG failed locally, if we failed to generate
 	// a local private beacon key, or if we crashed while performing this
 	// check previously.
-	endState, err := e.dkgState.GetDKGEndState(nextEpochCounter)
-	if err == nil {
-		log.Warn().Msgf("checking beacon key consistency: exiting because dkg end state was already set: %s", endState.String())
+	currentState, err := e.dkgState.GetDKGState(nextEpochCounter)
+	if err != nil {
+		if errors.Is(err, storage.ErrNotFound) {
+			log.Warn().Msg("failed to get dkg state, assuming this node has skipped epoch setup phase")
+		} else {
+			log.Fatal().Err(err).Msg("failed to get dkg state")
+		}
 		return
 	}
+	if currentState != flow.DKGStateCompleted {
+		log.Warn().Msgf("checking beacon key consistency: exiting because dkg didn't reach completed state: %s", currentState.String())
+		return
+	}
```
```diff
@@ -289,13 +301,13 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
 		return
 	}

-	myBeaconPrivKey, err := e.dkgState.RetrieveMyBeaconPrivateKey(nextEpochCounter)
+	myBeaconPrivKey, err := e.dkgState.UnsafeRetrieveMyBeaconPrivateKey(nextEpochCounter)
 	if errors.Is(err, storage.ErrNotFound) {
 		log.Warn().Msg("checking beacon key consistency: no key found")
-		err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateNoKey)
+		err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 		if err != nil {
 			// TODO use irrecoverable context
-			log.Fatal().Err(err).Msg("failed to set dkg end state")
+			log.Fatal().Err(err).Msg("failed to set dkg state")
 		}
 		return
 	} else if err != nil {
```
```diff
@@ -312,25 +324,25 @@ func (e *ReactorEngine) handleEpochCommittedPhaseStarted(currentEpochCounter uin
 	}
 	localPubKey := myBeaconPrivKey.PublicKey()

-	// we computed a local beacon key but it is inconsistent with our canonical
+	// we computed a local beacon key, but it is inconsistent with our canonical
 	// public key - therefore it is unsafe for use
 	if !nextDKGPubKey.Equals(localPubKey) {
 		log.Warn().
 			Str("computed_beacon_pub_key", localPubKey.String()).
 			Str("canonical_beacon_pub_key", nextDKGPubKey.String()).
 			Msg("checking beacon key consistency: locally computed beacon public key does not match beacon public key for next epoch")
-		err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateInconsistentKey)
+		err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 		if err != nil {
 			// TODO use irrecoverable context
-			log.Fatal().Err(err).Msg("failed to set dkg end state")
+			log.Fatal().Err(err).Msg("failed to set dkg current state")
 		}
 		return
 	}

-	err = e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateSuccess)
+	err = e.dkgState.SetDKGState(nextEpochCounter, flow.RandomBeaconKeyCommitted)
 	if err != nil {
 		// TODO use irrecoverable context
-		e.log.Fatal().Err(err).Msg("failed to set dkg end state")
+		e.log.Fatal().Err(err).Msg("failed to set dkg current state")
 	}
 	log.Info().Msgf("successfully ended DKG, my beacon pub key for epoch %d is %s", nextEpochCounter, localPubKey)
 }
```

> **Review comment** on `e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)`: Intuitively this seems like overly relaxed behavior. For failure states we allow self-transitions, but for everything else where we deviate from the happy path and hit an unexpected error, I would be inclined to return an error so we can figure out what went wrong. For your particular scenario I am not very worried; it means the operator has to try again.
```diff
@@ -423,16 +435,17 @@ func (e *ReactorEngine) end(nextEpochCounter uint64) func() error {
 		if crypto.IsDKGFailureError(err) {
 			// Failing to complete the DKG protocol is a rare but expected scenario, which we must handle.
 			// By convention, if we are leaving the happy path, we want to persist the _first_ failure symptom
-			// in the `dkgState`. If the write yields a `storage.ErrAlreadyExists`, we know the overall protocol
-			// has already abandoned the happy path, because on the happy path the ReactorEngine is the only writer.
-			// Then this function just stops and returns without error.
+			// in the `dkgState`. If the write yields a [storage.InvalidDKGStateTransitionError], the state machine
+			// must already be in its terminal state [flow.RandomBeaconKeyCommitted], since every other transition
+			// (even the self-transition [flow.DKGStateFailure] -> [flow.DKGStateFailure]) is allowed. A failure
+			// symptom in the terminal state means recovery happened before the DKG ended, so we return without error.
 			e.log.Warn().Err(err).Msgf("node %s with index %d failed DKG locally", e.me.NodeID(), e.controller.GetIndex())
-			err := e.dkgState.SetDKGEndState(nextEpochCounter, flow.DKGEndStateDKGFailure)
+			err := e.dkgState.SetDKGState(nextEpochCounter, flow.DKGStateFailure)
 			if err != nil {
-				if errors.Is(err, storage.ErrAlreadyExists) {
-					return nil // DKGEndState already being set is expected in case of epoch recovery
+				if storage.IsInvalidDKGStateTransitionError(err) {
+					return nil
 				}
-				return fmt.Errorf("failed to set dkg end state following dkg end error: %w", err)
+				return fmt.Errorf("failed to set dkg current state following dkg end error: %w", err)
 			}
 			return nil // local DKG protocol has failed (the expected scenario)
 		} else if err != nil {
```
> **Review comment:** If that is easy to do, I'd prefer if we can check that the random beacon key matches the information in the Epoch Commit event ... just to be safe against human configuration errors.
>
> **Reply:** Will take care of this in a follow-up PR.
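The rewritten failure handling in `end()` above checks `storage.IsInvalidDKGStateTransitionError(err)`, whose definition is not part of this diff. The sketch below shows one idiomatic way such a helper is commonly written in Go, using a typed error and `errors.As`; the struct fields are assumptions for illustration, not the actual flow-go type.

```go
package storage // illustrative only; not the actual flow-go implementation

import (
	"errors"
	"fmt"
)

// InvalidDKGStateTransitionError is sketched as a typed error so callers can
// detect it with errors.As. The fields are assumptions for illustration.
type InvalidDKGStateTransitionError struct {
	From, To string // names of the attempted source and target states
}

func (e InvalidDKGStateTransitionError) Error() string {
	return fmt.Sprintf("invalid DKG state transition from %s to %s", e.From, e.To)
}

// IsInvalidDKGStateTransitionError mirrors the check used in end() above: it
// returns true if err is, or wraps, an InvalidDKGStateTransitionError.
func IsInvalidDKGStateTransitionError(err error) bool {
	var target InvalidDKGStateTransitionError
	return errors.As(err, &target)
}
```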