
Safe BP failover solution #265

Open
arhag opened this issue May 17, 2022 · 0 comments
Labels
enhancement New feature or request

Comments


arhag commented May 17, 2022

Summary

Active block producers need to run multiple nodeos instances synchronized to the network, with the ability to quickly switch on block signing for their block producer (BP) account when they intend a particular nodeos instance to be their active BP signer. This need comes from the expectation that they not miss their time slot for producing a block. But it is critical that they also never have more than one active BP signer for their BP account at a time, as that opens up the possibility of double-confirming conflicting blocks, leading to a finality violation.

Current options for BPs are to either:

  • manually take care to only enable block production on at most one node at a time among their nodeos instances that all have access to their BP's block signing private key(s);
  • or, use different keys on their different nodeos instances and when they want to switch which nodeos instance is their active BP signer, they change their registered block signing key through an on-chain action and wait until the BP schedule changes (takes around 6 minutes on EOS) to reflect those changes.

The latter option is safer but has its downsides.

This feature enhancement proposes two solutions (either or both can be adopted) to improve the process of switching which nodeos instance is the active block signer for a BP.

The first solution is focused on addressing some of the limitations of the latter current option mentioned above. It allows BPs to register multiple block signing key candidates (separate from the WTMSIG block signing authority feature of EOSIO) with one action, and allows selecting which one of those is the active block signing authority with another action.

The second solution is to implement an off-chain consensus between replicas of the BP block signing nodes which ensures the cluster of replicas will not double confirm conflicting blocks and which automatically handles BP failover (according to some policy of preferred nodes to make active if possible).

Solution 1: Decouple block signing key registration from selecting the active key

The latter current option that was mentioned earlier, which is to use the on-chain consensus to change the active BP signer key, is the safest approach to switching which BP block signing node is active. Safety is considered in the sense that it does not contribute towards two conflicting blocks both being considered final (aka irreversible). However, that option does have some downsides.

One disadvantage is that it takes considerable time (around 6 minutes on EOS) for the producer schedule change to take effect. Solution 1 does not actually address this disadvantage. However, this time should be reduced when the new EOSIO consensus algorithm enabling faster finality is completed. The time it takes would be a problem if the prior key was not active at the time the request to change it was processed, since it would mean that the BP would not be able to sign for the duration of that transition (e.g. a 6-minute duration means a couple of missed BP rounds). Fortunately, the old key still works until the producer schedule change is finished, so there is no need for the BP to miss any blocks if they follow the correct process. It does, however, mean that the BP must wait several minutes (at least until fast finality arrives) before they can shut down the old node, e.g. to change configuration options.

Another disadvantage is that each time a producer changes keys, it forces an on-chain producer schedule change. That creates additional burden for any light clients trying to sync up quickly using sparse block header validation. Again, solution 1 does not actually address this disadvantage either.

There is another disadvantage that solution 1 does address, however. With the current system contract, if BPs wish to automate the active node change process as part of a larger automated BP failover solution, they are forced to provide their scripts with a private key that can call the eosio::regproducer (or eosio::regproducer2) action. That key can also execute other operations, like changing the BP's URL or location, or adding new keys and removing existing keys. Those are operations the BP may not want to risk getting into the hands of an attacker if the script were compromised.

So solution 1 introduces the idea of registering multiple keys for a single BP and then separately selecting the active key among the set.

Note that this is separate from the WTMSIG block signing authority, which allows for multiple keys to be represented in a single authority. It could also be possible to allow registering multiple WTMSIG block signing authorities with a single BP and then select the active authority among the set. However, there is some discussion, particularly sparked by the cryptographic requirements of the new fast finality consensus algorithm, to rethink whether WTMSIG block signing authorities are even a desirable feature to keep going forward. The main advantage of WTMSIG block signing authorities is to allow the multiple block signing nodeos instances to each have their own hardware signing key. But it appears there is no desire by the BPs to use hardware keys for block signing or block finality confirmation (and this also gets further complicated with fast finality if the new consensus algorithm requires Schnorr or BLS signatures for block finality confirmation). A secondary benefit would be to have accountability in case of key compromise, but BPs would likely take the cautious route and change all the keys anyway in the event of such a compromise.

Because the operation of selecting the active key among the set of already registered block signing keys is provided by an action that is separate from the action of changing this set, it becomes possible to use the EOSIO permission system to tie that active key selection action to a low-privilege key that can safely be kept on a server with the BP failover script. If that key is compromised, the attacker can only switch between the set of already approved BP signing keys (each of which should presumably have an associated nodeos instance that is usually ready to produce).
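To make the split concrete, here is a minimal toy model of the two proposed actions. The action names (regsigningkeys, setactivekey), the custom "failover" permission, and all fields are illustrative assumptions, not the actual system contract interface:

```python
# Toy model of the two hypothetical actions from solution 1. Names and
# fields are assumptions for illustration only.

class ProducerEntry:
    def __init__(self, owner):
        self.owner = owner
        self.candidate_keys = []   # pre-approved block signing keys
        self.active_key = None     # currently selected signing key

class ProducerTable:
    def __init__(self):
        self.rows = {}

    def regsigningkeys(self, owner, keys, auth):
        # High-privilege action: requires the producer's active authority.
        assert auth == (owner, "active"), "missing authority"
        row = self.rows.setdefault(owner, ProducerEntry(owner))
        row.candidate_keys = list(keys)
        if row.active_key not in keys:
            row.active_key = keys[0]  # fall back to the first candidate

    def setactivekey(self, owner, key, auth):
        # Low-privilege action: a hypothetical custom "failover" permission
        # suffices, so this key can live on the failover script's server.
        assert auth in ((owner, "active"), (owner, "failover")), "missing authority"
        row = self.rows[owner]
        assert key in row.candidate_keys, "key not pre-registered"
        row.active_key = key  # schedule change still applies on-chain later
```

The point of the split is visible in the last assertion: even a compromised failover key can only select among keys the BP already approved via the higher-privilege action.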

These changes can be fully implemented in the system contract and there are no changes to the core protocol or even to nodeos required for solution 1. Though faster finality (which does involve a change to the core protocol) would make it nicer to use solution 1 since it would reduce the time to switch.

If the decision is to proceed with solution 1 (whether or not solution 2 is also considered), then another issue can be opened up in https://github.com/eosnetworkfoundation/mandel-contracts to just capture the system contract changes of solution 1.

Solution 2: Consensus-based automatic BP failover solution

Solution 2 takes an alternative approach of using a consensus algorithm to automatically coordinate between the block signing nodeos instances (aka replicas) of a given BP to safely carry out actions like intentionally switching from one node to another as the active producer or automatically handling failover. Specifically, by "safely" it is meant that the nodeos instances work to ensure they never double confirm two conflicting blocks even though they have access to the on-chain registered key(s) that could technically allow any one of them to do so.

However, this consensus algorithm is an off-chain consensus algorithm among the BP replicas only which is independent of the on-chain consensus algorithm among the separate BPs. While there is a connection between the off-chain consensus algorithm and the on-chain one (since the goal of the off-chain one is to not violate the finalization rules of the on-chain consensus algorithm), there is a lot of flexibility allowed in choosing the off-chain consensus algorithm.

In particular, there is no reason that the off-chain consensus algorithm has to be Byzantine fault tolerant (BFT), because the replicas are all owned and operated by the same organization (the BP). It is sufficient for the off-chain consensus algorithm to simply be crash fault tolerant (CFT). By accepting a CFT consensus algorithm, some advantages can be gained, such as reduced latency to reach a consensus decision; furthermore, a simple majority of functioning replicas is sufficient to make progress (unlike the two-thirds majority required by BFT). A good candidate for this off-chain consensus algorithm to consider is Raft. For the remainder of the description of solution 2, a variant of the Raft consensus algorithm will be assumed, though other options are also acceptable.

There is some minimal state related to the on-chain consensus algorithm that needs to be tracked and durably stored to be used in the algorithm to protect block finalizers (those contributing towards finalization of a block, which for the sake of this discussion we can assume are equivalent to the block producers) from double confirmation of conflicting blocks while still not being too restrictive to compromise liveness.

While the specifics depend on the particular consensus algorithm, for the sake of this discussion I will look towards what is likely to be the new fast finality consensus algorithm of EOSIO and assume that this minimal state can be reduced down to a single view number. View numbers would be involved in the BFT on-chain consensus algorithm to allow the correct block finalizers to always make progress (liveness) while not risking a violation of finality (safety). There is a certain set of rules that correct block finalizers must not violate (these may be referred to as slashing conditions), such that at least one rule in the set must have been violated if a faulty block finalizer contributed towards a finality violation. These slashing conditions are typically very limited in number and relatively simple for the correct block finalizers to avoid violating. Again, with an eye towards the future consensus algorithm of EOSIO, it is likely safe to assume for the purposes of this discussion that the rule that should not be violated is that a block finalizer should not sign a confirmation more than once for a distinct combination of view number and phase.
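The assumed slashing condition can be sketched as a small guard a finalizer would keep locally (names are illustrative; the real rule belongs to the future consensus algorithm):

```python
# Sketch of the assumed slashing-condition guard: a correct finalizer must
# never sign two confirmations for the same (view, phase) combination.

class FinalizerGuard:
    def __init__(self):
        self._signed = set()  # (view, phase) pairs already confirmed

    def try_sign(self, view, phase):
        key = (view, phase)
        if key in self._signed:
            return False  # refusing the second signature avoids slashing
        self._signed.add(key)
        return True
```

Note this per-pair bookkeeping is exactly what the minimal durable state, discussed next, is allowed to avoid tracking.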

The minimal state does not need to track the individual view number and phase combinations that the block finalizer has signed confirmations for. It is allowed to be more coarse-grained than that. The trade-off may be against liveness, however. But a temporary disruption of a particular block finalizer's ability to participate in block finalization in rare scenarios may be acceptable, as long as it does not compromise the ability of the network as a whole to eventually make progress with finality advancement (without too long of a delay, of course).

So instead, the minimal state can be a single view number which defines an exclusive upper bound on the view numbers for which the replica that committed that view number into the state machine is allowed to sign confirmation messages. In other words, using Raft terminology for clarity: if a particular replica is the leader of a term, and the committed view number in the state was committed as of the current term, then that replica is allowed to sign confirmation messages as part of the on-chain consensus mechanism for view numbers strictly less than the committed view number. The replica must also meet the on-chain consensus algorithm requirements, tracked separately (by the finality module), that further restrict which view numbers (and phases) it can sign confirmation messages for. However, the idea is that the state used for those further restrictions as part of the finality module can be ephemeral and kept in process memory, while the committed view number in the Raft state machine (part of the BP failover module) is durably stored before being considered committed by the replica. If the committed view number in state was not committed as of the current term, then the leader replica of the current term must first commit an update to that view number (it must be monotonically increasing) before it is allowed to use it as an upper bound on the view numbers for which it may sign confirmation messages.
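The signing rule above reduces to a simple predicate over the committed bound and the Raft term in which it was committed (all names are assumptions for illustration; the on-chain finality-module checks are tracked separately and omitted here):

```python
# Sketch of the off-chain signing rule: a replica may sign confirmations
# only for view numbers strictly below the bound committed in the Raft
# state machine, and only if that bound was committed in the replica's
# own current term as leader.

def may_sign(view, committed_bound, bound_term, current_term):
    if bound_term != current_term:
        # A bound committed in an older term must first be re-committed
        # (monotonically increased) by the new leader before use.
        return False
    return view < committed_bound  # exclusive upper bound
```

The term check is what makes failover safe: a deposed leader's bound becomes unusable the moment a new term starts, even if its process is still running.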

Then by periodically committing a view number in the Raft state machine that is a little ahead of the current view number (frequently enough that the current view number in the on-chain consensus algorithm never catches up to that committed view number), the added latency of Raft consensus does not necessarily add to the overall latency of the finalization process (assuming the leader remains stable). It does mean that there is added artificial latency when the leader is changed (either by choice or automatically due to timeouts in the failover logic), since it may take some time for the on-chain consensus to reach a view number greater than the view number committed by the prior Raft leader. That is part of the trade-off in trying to reduce latency in the normal case. I think some sensible choices can be made in how view numbers are designed in the on-chain consensus algorithm to allow fast advancement in those view numbers (with gaps as necessary) to improve liveness and reduce latency, but with some restrictions on how quickly they advance to protect against liveness attacks by malicious block finalizers and so that the off-chain consensus discussed in this solution can reliably predict how much time it has before a later view number is reached or passed. And with those sensible choices in the on-chain consensus algorithm, coupled with appropriate matching choices in the off-chain consensus algorithm regarding how far ahead the committed view number in the Raft state machine can get compared to the current on-chain consensus view number, I believe it should be feasible to achieve a Raft-based automatic BP failover solution which is safe*, does not add to the latency of block finalization (except in very rare unhealthy network conditions), and which allows the leader BP replica to change in less than 5 seconds (regardless of how long it takes for a block to become final).
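The "stay ahead" policy a stable leader would run can be sketched as follows. The margin and low-water-mark values are arbitrary assumptions; in practice they would be tuned against how fast on-chain view numbers can advance:

```python
# Sketch of the leader's periodic bound-advancement policy: keep the
# committed bound a margin ahead of the current on-chain view number so
# the Raft round trip stays off the finalization critical path.

MARGIN = 100  # assumed headroom between current view and committed bound

def next_bound(current_view, committed_bound):
    # The committed bound must be monotonically increasing: never lower it.
    return max(committed_bound, current_view + MARGIN)

def needs_advance(current_view, committed_bound, low_water=MARGIN // 2):
    # Re-commit well before the on-chain view approaches the bound.
    return committed_bound - current_view < low_water
```

On leader change, the cost described above shows up as the gap between the old leader's committed bound and the current view, which the on-chain consensus must cross before the new leader's signatures matter.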

*Note: Safety still assumes the operators are not doing reckless things. For example, if they want to add or remove BP replicas from the cluster, they would need to do that as operations within the Raft state machine and wait until the replica set changes take effect (which shouldn't take long). They obviously should not just start a nodeos instance with access to the BP block signing key in a mode where it acts as an independent node (or as part of a brand new Raft cluster). In addition, if they delete the files capturing the durable Raft state machine data on half or more of the existing replicas, then all bets are off, since this would essentially degrade back to the current situation of manually ensuring all nodes that may have access to the block signing key are shut down (or at least have block production disabled) before bringing up a new Raft cluster with access to the block signing key. In that (hopefully) very rare situation, it would probably be wise to instead just change to a new block signing key on-chain.

On top of the basic foundation for a BP failover solution described above, one can then add all sorts of policies that help manage the failover process. All of these could simply be augmentations to the Raft state machine. For example, a prioritized list of Raft replicas can be committed to the state machine. This list would determine which nodes are preferred to act as leaders. Instead of selecting Raft leaders based on purely random timeouts, the off-chain consensus algorithm could be augmented to prefer certain replicas as leaders (using the prioritized list in the state), while still respecting timeouts as a means to determine whether the current leader is active before moving on to the next one in the list, and to periodically check whether a preferred replica that was previously skipped over for not responding in time may now be available and ready to take over as leader. In addition, the state machine can be augmented to manually mark some of the replicas in the list as temporarily unavailable (e.g. for maintenance), so that the cluster doesn't waste time trying to determine whether they are still alive.
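The leader-preference policy sketched above boils down to a selection function over the committed list (the field names are illustrative assumptions about what the augmented state machine would hold):

```python
# Sketch of the preferred-leader policy: pick the highest-priority replica
# that responded within the timeout and is not marked down for maintenance.

def pick_leader(priority_list, responsive, maintenance):
    """priority_list: replica ids, most preferred first.
    responsive:   set of ids that answered within the timeout.
    maintenance:  set of ids manually marked unavailable in the state."""
    for replica in priority_list:
        if replica in maintenance:
            continue  # skip without even probing it
        if replica in responsive:
            return replica
    return None  # fall back to ordinary randomized Raft election
```

Re-running this periodically is what lets a previously skipped, more-preferred replica reclaim leadership once it becomes responsive again.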

Finally, the discussion above was regarding block finalization (since that is what creates the safety issue that a consensus-based failover solution greatly helps with). But it would be preferable to have the same off-chain consensus mechanism also apply to block proposals (aka block production). Even if finality violations were a non-issue, it would of course be highly desirable for a BP not to produce multiple blocks in its time slot, which can lead to confusion, wasted bandwidth and computation, and reduced performance of the network. So it would make sense to have only the Raft leader replica produce a block during that BP's time slot.
