-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Khalil/6959 Recover Epoch transaction args script #5576
Khalil/6959 Recover Epoch transaction args script #5576
Conversation
- update signatures - update usages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First round of comments
cmd/util/cmd/epochs/cmd/recover.go
Outdated
_, clusters, err := common.ConstructClusterAssignment(log, partnerNodes, internalNodes, flagCollectionClusters) | ||
if err != nil { | ||
log.Fatal().Err(err).Msg("unable to generate cluster assignment") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is generating the cluster assignment based on the partner and internal node config files from the spork bootstrapping process. These files should accurately represent the identity table state at the beginning of the spork, but may not be accurate at the time a RecoveryEpoch
needs to be generated.
In order to guarantee we generate valid clusters (and don't re-insert an old node which has unstaked or omit a new node which has joined after the spork) we will need to use the identity table from the snapshot we retrieve from the network as the data source for constructing cluster assignments.
We will still need to retrieve the internal nodes info from disk (step ReadInternalNodeInfos
) in order to get the private keys necessary to produce the QCs. However I don't think it is useful to retrieve the partner node info from disk, because this will be completely replaced by the snapshot. I think we should:
- Exit with an error if we observe a node in
InternalNodeInfos
that doesn't exist in the snapshot - Populate
internalNodes
based on what we read from disk - Populate
partnerNodes
based onsnapshot.Identities - internalNodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Exit with an error if we observe a node in
InternalNodeInfos
that doesn't exist in the snapshot- Populate
internalNodes
based on what we read from disk- Populate
partnerNodes
based onsnapshot.Identities - internalNodes
Very much possible that we are suggesting the same thing, but I am not sure. I generally agree with Jordan's comment, but I would frame it slightly differently:
- The
snapshot
is the core object for this logic here:- It specifies the set of nodes
$S$ which nodes are currently allowed to participate in the network. Note that ejected nodes are not part of$S$ . So you cannot useInitialIdentities
⚠️ as you do here! Instead, you should use theIdentities
method from the snapshot to retrieve the set of participating nodes as of the snapshot's block. - If you use a selector that drops ejected nodes and nodes with zero weight, you receive the set
$S$ of all participating nodes as of this block.
- It specifies the set of nodes
- The nodes in the recovery epoch should be a subset of
$S$ -
So we select the subset of all collectors
$s_c$ from$S$ . This are the collectors that can participate in the recovery epoch and we can partition them across the different clusters (methodConstructClusterAssignment
) -
For a cluster to work, we need a root QC, for the cluster to start its local HotStuff. Hence, each cluster must contain more than 2/3 of collectors (by weight), whose private staking key we know.
-
Therefore, we split
$s_c$ into two subsets-
$s^{(i)}_c$ are the nodes, where we know the private staking key: -
$s^{(p)}_c$ are the nodes, where we do not know the private staking key
We can proceed as follows:
- We load the old bootstrapping data for our internal nodes info from disk (step
ReadInternalNodeInfos
). Then we iterate over all nodes$n \in s_c$ . If collector node$n$ is listed in the old bootstrapping data, we have a private staking key for it and it goes into$s^{(i)}_c$ . Otherwise, the collector is added to$s^{(p)}_c$ (no private key). - we can then call (similar to line 124)
common.ConstructClusterAssignment(
log
,$s^{(p)}_c$ ,$s^{(i)}_c$ ,flagCollectionClusters
)
-
-
To the best of my understanding, this is the most general algorithm and works for the broadest possible range of situations. In particular:
-
If is possible for us to add or remove internal collectors after the original spork.
In comparison, removing one of our collectors would yield an error in Jordan's suggestion:
Exit with an error if we observe a node in InternalNodeInfos that doesn't exist in the snapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@AlexHentschel thanks for the in depth algorithm clarity
cmd/util/cmd/epochs/cmd/recover.go
Outdated
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch") | ||
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStakingEndView, "staking-end-view", 0, "end view of the staking phase of the recovery epoch") | ||
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEndView, "end-view", 0, "end view of the recovery epoch") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be easier from an ops perspective if we used the same representation for these values as we do in bootstrapping, in particular using the lengths as the input rather than the end view. I suggest we use the same flag names as the rootblock
comand:
--epoch-length
--epoch-staking-phase-length
Then we can compute endView
and stakingEndView
as we do in the bootstrapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/util/cmd/epochs/cmd/recover.go
Outdated
" in the JSON file: Role, Address, NodeID, NetworkPubKey, StakingPubKey)") | ||
generateRecoverEpochTxArgsCmd.Flags().StringVar(&flagPartnerWeights, "partner-weights", "", "path to a JSON file containing "+ | ||
"a map from partner node's NodeID to their stake") | ||
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagStartView, "start-view", 0, "start view of the recovery epoch") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Start view sensitivity
The startView
parameter is very sensitive, because in order for a EpochRecover
service event to be valid, it must be exactly one view after the last EpochExtension
. Otherwise we would have conflicting/missing definitions for who is the leader for a view.
Suggestions:
- Retrieve the "extended final view" from the snapshot (
FinalView
from the latest extension of the current epoch) - Validate that
--start-view
is equal toextendedFinalView+1
Alternatively we can revisit the strictness of the requirement, but would need to carefully think through the safety implications.
Race condition
In addition, there is a race condition here to be aware of. The design specifies that we will add a new extension periodically. If, in between retrieving the snapshot while executing this command and submitting the corresponding admin transaction, we add a new epoch extension, the RecoverEpoch
event would be discarded as invalid.
This isn't the end of the world because we can just do it again, but it might be worth having the tooling help avoid this situation, for example by reporting the number of view remaining before another extension would be added. To be honest I'm kind of conflicted: on the one hand it would be useful but on the other hand, this is already an infrequent, heavily involved process where I'd expect humans to be double-checking these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is related to the comment above it and the same commit applies 884fdbf , as far as the race condition it's unlikely to happen and we would only be able to log a useful message if we were able to get the current view from the network somehow then we can could log how many views left until the start-view .
cmd/util/cmd/epochs/cmd/recover.go
Outdated
|
||
dkgPubKeys := make([]cadence.Value, 0) | ||
nodeIds := make([]cadence.Value, 0) | ||
ids.Map(func(skeleton flow.IdentitySkeleton) flow.IdentitySkeleton { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest just using a for loop here. Map
is used to translate a list through a function into a new list, but here we're discarding the output list, which defeats the purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log.Fatal().Err(err).Msg("failed to get random source cadence string") | ||
} | ||
|
||
dkgPubKeys := make([]cadence.Value, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The first element here will need to be the group public key (see https://github.com/onflow/flow-go/blob/master/model/convert/service_event.go#L360-L362).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/util/cmd/common/clusters.go
Outdated
identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID) | ||
constraint[i%nClusters] -= 2 | ||
constraint[i%numCollectionClusters] -= 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per construction the length of identifierLists
is numCollectionClusters
:
flow-go/cmd/util/cmd/common/clusters.go
Line 52 in 7d09f47
identifierLists := make([]flow.IdentifierList, numCollectionClusters) |
above, we have used this relation already:
flow-go/cmd/util/cmd/common/clusters.go
Lines 56 to 60 in 7d09f47
// first, round-robin internal nodes into each cluster | |
for i, node := range internals { | |
identifierLists[i%numCollectionClusters] = append(identifierLists[i%numCollectionClusters], node.NodeID) | |
constraint[i%numCollectionClusters] += 1 | |
} |
algorithmically, we are doing exactly the same thing here: distribute collector nodes amongst the clusters. So the code should have the same structure. Otherwise, people reading this will get confused.
identifierLists[i%len(identifierLists)] = append(identifierLists[i%len(identifierLists)], node.NodeID) | |
constraint[i%nClusters] -= 2 | |
constraint[i%numCollectionClusters] -= 2 | |
clusterIndex := i % numCollectionClusters | |
identifierLists[clusterIndex] = append(identifierLists[clusterIndex], node.NodeID) | |
constraint[clusterIndex] -= 2 |
cmd/util/cmd/common/node_info.go
Outdated
networkPubKey := cmd.ValidateNetworkPubKey(partner.NetworkPubKey) | ||
stakingPubKey := cmd.ValidateStakingPubKey(partner.StakingPubKey) | ||
weight, valid := cmd.ValidateWeight(weights[partner.NodeID]) | ||
if !valid { | ||
log.Error().Msgf("weights: %v", weights) | ||
log.Fatal().Msgf("partner node id %x has no weight", nodeID) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are mixing panics and boolean return for indicating invalid node information. Ideally, it would all be the same: sentinel errors.
If it is a huge refactoring, feel free to skip. But if we can with moderate amount of work, it would be nice to have consistent code here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cmd/util/cmd/common/snapshot.go
Outdated
// GetSnapshotAtEpochAndPhase will get the latest finalized protocol snapshot and check the current epoch and epoch phase. | ||
// If we are past the target epoch and epoch phase we exit the retry mechanism immediately. | ||
// If not check the snapshot at the specified interval until we reach the target epoch and phase. | ||
func GetSnapshotAtEpochAndPhase(ctx context.Context, log zerolog.Logger, startupEpoch uint64, startupEpochPhase flow.EpochPhase, retryInterval time.Duration, getSnapshot GetProtocolSnapshot) (protocol.Snapshot, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to document the meaning of some of the inputs:
startupEpoch
and startupEpochPhase
and getSnapshot
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5576 +/- ##
==========================================
+ Coverage 55.65% 56.55% +0.89%
==========================================
Files 1036 629 -407
Lines 101131 60238 -40893
==========================================
- Hits 56287 34065 -22222
+ Misses 40518 23675 -16843
+ Partials 4326 2498 -1828
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
- add filter for valid protocol participant
- emit fatal log if identity is present in internal node info from disc but missing from snap shot identities list
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
- infer start view, staking phase end view, and epoch end view from curr epoch final view
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few final small suggestions, but otherwise looks good!
s.DKGPhaseLen = 50 | ||
s.EpochLen = 250 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note: When we get to making this integration test case, we'll likely be able to set these more aggressively (shorter). We can discuss later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great work. largely cosmetic suggestions.
Also wanted to say thanks for cleanup and documentation improvement also to the adjacent code. Continuous efforts on a smaller scale really add up to a big difference over time for clarity and low tech debt in our code base. Thanks, really appreciate your contributions. 💚
cmd/util/cmd/epochs/cmd/recover.go
Outdated
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInEpoch, "epoch-length", 4000, "length of each epoch measured in views") | ||
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagNumViewsInStakingAuction, "epoch-staking-phase-length", 100, "length of the epoch staking phase measured in views") | ||
generateRecoverEpochTxArgsCmd.Flags().Uint64Var(&flagEpochCounter, "epoch-counter", 0, "the epoch counter used to generate the root cluster block") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my understanding, I would think these default values are drastically to short. would also like to get @jordanschalm's thoughts ...
My thoughts:
-
Default values should provide a sensible setting for mainnet.
-
We are talking about an epoch here, with the limitation that it is mainly for recovery, but aren't we planning to run a regular DKG in this recovery epoch? I think a day would be reasonable
- For reference, we decided on 3000 views for each DKG phase (3 in total) for the mainnet consensus timing, plus
EpochCommitSafetyThreshold
(which should also be minimally 1000 blocks), plus staking and Epoch commit phase that also belong to an epoch on the happy path.
With the new consensus timing [👉 reference], one day would correspond to to 108,000 views for mainnet
- For reference, we decided on 3000 views for each DKG phase (3 in total) for the mainnet consensus timing, plus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice the CLI will require all of these to be explicitly specified in flags when run, because of the MarkFlagRequired
blocks below. I think it is preferable to force operators to provide specific values and fail rather than using a default value (current behaviour).
To prevent confusion, we could set the "default" value parameter when defining the flag to 0.
Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org> Co-authored-by: Jordan Schalm <jordan@dapperlabs.com>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
Co-authored-by: Alexander Hentschel <alex.hentschel@flowfoundation.org>
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
…b.com:onflow/flow-go into khalil/6959-efm-recvery-epoch-data-generation
log.Fatal().Err(cdcErr).Msg("failed to get dkg group key cadence string") | ||
} | ||
dkgPubKeys = append(dkgPubKeys, dkgGroupKeyCdc) | ||
for _, id := range currentEpochIdentities { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry to comment on a closed PR. I noticed some sanity checks we should add here while reviewing onflow/flow-core-contracts#420.
We should check that currentEpochDKG.Size() == len(currentEpochIdentities.Filter(filter.HasRole(flow.RoleConsensus)))
We already check that there is a DKG key for every consensus node, but not that there is a consensus node for every DKG key.
Added a reminder to the design doc for this.
This PR adds a new cmd
recover-epoch-tx-args
to theepochs
CLI command that generates the transaction args for theRecoverEpoch
tx. This command reuses logic from the sporking utility tool to generate the following transaction args.startView
: start view of the recovery epochstakingStartView
: start view of the staking phase of the recovery epochendView
: end view of the recovery epochdkgPubKeys
: list of dkg key shares for each consensus node from the dkg of the last successful epochnodeIds
: initial identity list of the last successful epochclusters
: node clustersclusterQcs
: cluster qcs generated from the clustersNote: Significant changes are in this file cmd/util/cmd/epochs/cmd/recover.go, other changes are due to relocating reusable funcs