IF: Document correct process for switching to a backup producer #93

ericpassmore · 2023-08-18T16:24:42Z

Depends on AntelopeIO/reference-contracts#24 for decision on how finalizers register their keys.

Need to document a new process for switching to backup.

arhag · 2023-08-18T17:56:55Z

We should clarify how BP fail-over for block proposals can be kept separate from the process for block finalization.

The pause/resume endpoints in the producer plugin can still remain relevant for quickly switching block production over to a backup node. With the HotStuff transition, produced blocks (and the BP signature included with them) would no longer convey an attestation with regard to the finality algorithm, so there is less risk to the network if the BP signs two conflicting blocks. BPs could therefore still use an off-chain method leveraging pause/resume to quickly switch block production over to a backup node. However, with the HotStuff changes, we can and should also reduce the latency of a producer schedule change, so it would not be burdensome to handle that on-chain as well.
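A minimal sketch of the off-chain pause/resume switchover described above, for a planned switch where the primary is still reachable. It assumes both nodes run with the producer API plugin enabled and expose `/v1/producer/pause`, `/v1/producer/resume` and `/v1/producer/paused`, that these accept an empty POST body, and that `paused` replies with a JSON boolean. The function name, the node URLs and the injectable `post` transport are illustrative and not part of any existing tool.

```python
import json
from typing import Callable
from urllib.request import Request, urlopen

# Endpoint paths assumed to be served by nodeos with producer_api_plugin enabled.
PAUSE = "/v1/producer/pause"
RESUME = "/v1/producer/resume"
PAUSED = "/v1/producer/paused"

# A transport takes a full URL and returns the decoded JSON reply.
Post = Callable[[str], object]


def http_post(url: str) -> object:
    """Default transport: POST an empty body and decode the JSON reply."""
    with urlopen(Request(url, data=b"", method="POST"), timeout=5) as resp:
        return json.loads(resp.read() or b"null")


def switch_production(primary: str, backup: str, post: Post = http_post) -> None:
    """Pause the primary, confirm it is paused, then resume the backup.

    The backup is only resumed after the primary confirms the pause, so the
    two nodes are not deliberately left producing at the same time.
    """
    post(primary + PAUSE)
    if post(primary + PAUSED) is not True:
        raise RuntimeError("primary did not confirm pause; backup left paused")
    post(backup + RESUME)
```

Called as `switch_production("http://primary:8888", "http://backup:8888")`, where both URLs are placeholders. This only moves block production; it does nothing about which finalizer key is active.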

In the case of a BP switching over block finalization from one of their nodes to another, we want to strongly encourage the BPs to use separate finalizer keys for each machine and to handle the switchover using an on-chain action that changes their active finalizer key. Due to the speed of IF, this process should be fast (seconds).

matthewdarwin · 2023-08-18T18:54:15Z

Not sure what exactly the plan is here, but keep in mind that in some BP operations the person managing the infrastructure is different from the one who can execute an on-chain transaction. The ability to fail over nodes without touching the chain is a highly desirable feature to keep.

arhag · 2023-08-21T21:30:44Z

@matthewdarwin:

Great feedback! Would it be an acceptable compromise to use linkauth and a custom permission, so that a dedicated key can sign the transaction that switches the finalizer key on-chain and is only able to authorize that one on-chain action?
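A rough sketch of what that compromise could look like, written as plain action payloads rather than calls to any particular SDK. `updateauth` and `linkauth` are the existing native actions on the `eosio` account. The key-switching action name `actfinkey` and its `finalizer_name` / `finalizer_key` fields are assumptions taken from the finalizer key documentation linked later in this thread, since the design was still undecided at this point. The helper names, the permission name and the key strings are hypothetical.

```python
def authority(public_key: str) -> dict:
    """Single-key authority with threshold 1."""
    return {
        "threshold": 1,
        "keys": [{"key": public_key, "weight": 1}],
        "accounts": [],
        "waits": [],
    }


def restricted_permission_actions(bp: str, permission: str, ops_key: str,
                                  action: str = "actfinkey") -> list:
    """Actions that create a custom permission and link it to one action only.

    `action` is the assumed name of the system contract action that changes
    the active finalizer key.
    """
    auth = [{"actor": bp, "permission": "active"}]
    return [
        {   # create (or update) the dedicated permission under "active"
            "account": "eosio", "name": "updateauth", "authorization": auth,
            "data": {"account": bp, "permission": permission,
                     "parent": "active", "auth": authority(ops_key)},
        },
        {   # allow that permission to authorize only eosio::<action>
            "account": "eosio", "name": "linkauth", "authorization": auth,
            "data": {"account": bp, "code": "eosio",
                     "type": action, "requirement": permission},
        },
    ]


def switch_key_action(bp: str, permission: str, new_finalizer_key: str,
                      action: str = "actfinkey") -> dict:
    """Key-switch action signed by the dedicated permission (assumed fields)."""
    return {
        "account": "eosio", "name": action,
        "authorization": [{"actor": bp, "permission": permission}],
        "data": {"finalizer_name": bp, "finalizer_key": new_finalizer_key},
    }
```

With this layout, the person running the infrastructure would hold only the key behind the custom permission, which cannot authorize anything other than the linked action.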

The problem with not doing it on-chain is that it isn't technically safe unless another consensus algorithm is used to ensure safety among the replicas of each BP (which would add additional latency as well). We are trying to be more rigorous with consensus safety going forward with the Instant Finality switchover. The high time-to-finality of the current algorithm has reluctantly pushed us to accept the current "unsafe" approach that producers use since the probability of enough BPs messing up at the same time to cause an actual finality violation is low. But with Instant Finality, this latency penalty shouldn't exist. So I am hoping to encourage BPs to adopt best practices.

Obviously, BPs are ultimately the ones who get to decide how they manage their keys and operations. In the future, perhaps economic disincentives (e.g. automatically slashing bonds for double signing which the new consensus algorithm enables) if adopted by the BPs could change the playing field enough to convince each BP it is best for their net economic outcome to adopt those best practices (e.g. balancing risk of missed income due to loss of availability versus risk of lost money due to slashing). But it would be ideal if we could remove as many of the obstacles BPs currently face with adopting the best practice so that BPs can comfortably adopt them for post-IF operation as soon as possible.

matthewdarwin · 2023-08-21T21:47:26Z

Could we discuss this at the next node operator round table? @bhazzard @heifner

arhag · 2023-08-25T03:24:09Z

Other reading material relevant to this discussion is provided in this old issue: eosnetworkfoundation/mandel#265

Note that Solution 1 described in that issue is essentially the current path we are considering for the system contract accompanying Leap 5.0 and the launch of Instant Finality (issue for that work is tracked in AntelopeIO/reference-contracts#24).

That old issue also describes a "Solution 2" which provides an alternative "off-chain" mechanism to prevent BP backup nodes from double-signing that still remains safe (the high-level idea is still good, but the details of the design need updates to reflect the specific constraints imposed by the on-chain consensus algorithm of HotStuff that was selected for Instant Finality). It requires using a completely different consensus algorithm within an internal network composed of just the BP nodes, so it is not under consideration for the Leap 5.0 timeframe. The additional development complexity does however provide some benefits that may be of interest to the BPs:

- it is a safe way of preventing double signing without forcing the BPs to be dependent on the overall liveness of the blockchain itself;
- it does not require expending on-chain resources (CPU/NET) to change the active key/machine;
- and, it should have slightly lower latencies for switching the active key/machine than the on-chain method.

I personally believe that with the very low time-to-finality provided by IF, the latencies involved with the on-chain method are already low enough to not cause any significant risk of the BP nodes being unavailable to contribute to block finalization for any noticeable amount of time. The other risks, limitations, or costs with the on-chain solution also appear to me to be negligible given the safety win it enables for the entire EOS network compared to risks currently imposed on the network due to the typical (theoretically unsafe) methods used to handle failover between BP machines now.

However, if there is still significant concern with the long-term use of an on-chain BP failover method, perhaps that should influence prioritization of the development of "Solution 2", to be delivered some time in the future as a way to eventually improve upon the limitations of the recommended on-chain BP failover method without giving up safety.

bhazzard · 2023-09-07T17:12:03Z

Labelled as pending discussion after relevant decisions are made as part of AntelopeIO/reference-contracts#24.

arhag · 2024-04-15T19:23:52Z

Documentation should also capture how BPs should be connected to other BPs (including potentially standby nodes) and how they should configure their vote-threads to ensure that all the BP nodes that may participate in consensus send/receive vote messages so that finality can still advance.

arhag · 2024-08-21T21:08:47Z

See:
https://github.com/eosnetworkfoundation/docs/blob/main/native/60_advanced-topics/20_introduction-finalizers-voting.md
https://github.com/eosnetworkfoundation/docs/blob/main/native/60_advanced-topics/21_managing-finalizer-keys.md
https://github.com/AntelopeIO/spring/wiki/Rotate-Finalizer-Keys

Between these three documents, we cover the intended documentation of this issue.

Improvements and reorganization can be covered in future issues.

ericpassmore added the documentation label Aug 18, 2023

enf-ci-bot added the triage label Aug 18, 2023

arhag mentioned this issue Aug 18, 2023

IF: Ensure pause and resume work with Instant Finality AntelopeIO/leap#1529

Closed

arhag mentioned this issue Aug 18, 2023

Instant Finality eosnetworkfoundation/product#39

Open

bhazzard mentioned this issue Sep 7, 2023

IF Prototype Stage 1A Implementation in Leap AntelopeIO/leap#1508

Closed

bhazzard added discussion and removed triage labels Sep 7, 2023

arhag transferred this issue from AntelopeIO/leap Apr 29, 2024

arhag closed this as completed Aug 21, 2024
