Nodes may get stuck due to parent availability issue with transition block #2732

Closed
mkalinin opened this issue Nov 22, 2021 · 11 comments

Comments

@mkalinin
Collaborator

mkalinin commented Nov 22, 2021

Problem

Suppose the merge fork has happened and the beacon chain network is producing blocks with empty payloads while waiting for TTD to be reached. A malicious proposer builds a block with a non-empty payload whose parent_hash field is set to a random value, i.e. a payload built atop a block that is unavailable. According to the current spec, EL must switch to SYNCING upon receiving such a payload, which would turn all nodes in the network into syncing mode and prevent them from attesting to and proposing new blocks.

The Interop spec attempted to solve this edge case (see item 5 in the Engine API spec here), and @g11tech brought it up on Discord (thanks a lot!), as the Kintsugi spec seems to miss the handling of this case.

Potential Solution

The Interop spec proposes that EL switch to SYNCING only after the parent block has been pulled from the network and proved to be a PoS block. If the parent has not yet been pulled, or turns out to be a PoW block, EL should stay silent and try to pull the missing blocks from the wire and execute them. If the parent block is indeed unavailable, EL would try to resolve the dependency forever and never respond to CL; CL wouldn't treat the beacon block containing this payload as fully validated and thus would orphan this block and move on. Additionally, this requires EL to properly handle the case when it is forced to sync with an unavailable chain. This is assumed to be resolved already, as it may happen on Mainnet (by receiving NewBlock with an unavailable parent), but there could be implications in the new context of CL/EL communication.

Note that when a node syncs from scratch and EL starts syncing before hearing from CL (regular sync in the PoW network), it will respond with SYNCING to any executePayload call. Suppose CL sends executePayload with an unavailable parent block before EL starts its sync process: EL following the Interop spec would wait until it pulls and executes the parent block and all its ancestors, which would keep CL in limbo for a few hours in the case of Mainnet. While in this state, CL can't attest to or propose new blocks.

A solution that seems to work, but not always:

  • Add isMergeBlock: bool to executePayload to clearly distinguish the transition block from the others (EL could hijack forkchoiceUpdated -- if there were no forkchoiceUpdated calls before an executePayload call, then this must be a transition block).
  • Add an UNKNOWN_PARENT response status to executePayload. EL returns UNKNOWN_PARENT when isMergeBlock: True, the parent is unknown, and EL isn't already SYNCING. Additionally, EL initiates the sync process in an attempt to sync up to (and including) the parent block.
  • CL handles UNKNOWN_PARENT as SYNCING during the optimistic sync, and as if there were a missed slot when no sync process is happening -- this allows CL to attest to the previous block and propose yet another block on top of the previous one.
  • If the parent is truly unavailable, honest nodes will orphan the block.
  • If the parent PoW block exists but a node didn't receive it in time due to synchrony issues, things get worse. CL will have to pull the parent beacon block and try to re-import it, potentially with the same UNKNOWN_PARENT result if the sync process on the EL side hasn't resolved the dependency yet.
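
The proposal in the list above can be sketched roughly as follows. This is a minimal illustration, not spec code: the `ExecutionLayer` stub, the `pending_parents` set, the `cl_action` helper, and its return strings are all hypothetical names, and payload execution itself is elided (a payload with a known parent is simply reported as VALID).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Status(Enum):
    VALID = "VALID"
    SYNCING = "SYNCING"
    UNKNOWN_PARENT = "UNKNOWN_PARENT"  # proposed new status


@dataclass
class ExecutionLayer:
    """Hypothetical EL stub: tracks known block hashes and a regular-sync flag."""
    known_hashes: Set[bytes] = field(default_factory=set)
    is_syncing: bool = False
    # Parents the EL is trying to fetch after answering UNKNOWN_PARENT.
    pending_parents: Set[bytes] = field(default_factory=set)

    def execute_payload(self, block_hash: bytes, parent_hash: bytes,
                        is_merge_block: bool) -> Status:
        if parent_hash in self.known_hashes:
            # Real execution is elided in this sketch.
            self.known_hashes.add(block_hash)
            return Status.VALID
        if is_merge_block and not self.is_syncing:
            # Try to sync up to (and including) the parent, and tell CL
            # that this is a missing-parent case, not a regular sync.
            self.pending_parents.add(parent_hash)
            return Status.UNKNOWN_PARENT
        return Status.SYNCING


def cl_action(status: Status, optimistic_sync_in_progress: bool) -> str:
    """Hypothetical CL reaction to an executePayload response."""
    if status is Status.VALID:
        return "import"
    if status is Status.UNKNOWN_PARENT and not optimistic_sync_in_progress:
        # Behave as if the slot was missed: keep attesting to and
        # building on the previous block.
        return "treat_as_missed_slot"
    return "treat_as_syncing"
```

Re-importing the same block before the EL has fetched the parent yields UNKNOWN_PARENT again in this sketch, which is the weakness described in the last bullet.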
@djrtwo
Contributor

djrtwo commented Nov 22, 2021

which would turn all nodes in the network into syncing mode and prevent them from attesting to and proposing new blocks.

I don't see why this is true? A CL node would not incorporate this block into its non-optimistic block tree and would build upon an available TTD block as long as this isn't resolved.

@mkalinin
Collaborator Author

I don't see why this is true?

This is true if a SYNCING status in the response switches CL from online into syncing mode, where it can't propose or attest because it understands that it has not yet caught up with the head of the observable chain. If this switch doesn't happen, then we're safe in this edge case. Indeed, if EL responds with SYNCING in this case, the next proposer may still build yet another transition block with its own terminal PoW block, and validators will be able to attest to it if in this case the parent is available.

A SYNCING status in the response signals that EL is missing (at least) the parent, and post-Merge this must never happen with an online node, as the block trees of CL and EL are tightly coupled and no data availability issue may occur between the layers. CL adding a beacon block to its optimistic tree upon processing, and then moving it to the fully verified state once it hears from EL that the payload is VALID, sounds fine. But doing the same when EL is signalling that some data is missing doesn't feel right. Deciding what CL should do when it receives SYNCING post-Merge is out of the scope of this issue and is probably worth a discussion.

IMO, the right behaviour of a node in this edge case is similar to what it would be if the payload execution took forever. Missing data is a different kind of thing and we might want to avoid mixing it into this case. I am not saying that this is entirely wrong though -- definitely worth discussion.

@djrtwo
Contributor

djrtwo commented Nov 22, 2021

Right, SYNCING from EL on chains with an unknown PoW source at the transition cannot block CL from making decisions. Doing so would allow for trivially stopping the transition process.

IMO, the right behaviour of a node in this edge case is similar to what it would be if the payload execution took forever.

For simplicity's sake (not changing APIs or anything), I think that we should not go into a place where EL can hang forever (just attempting to fill in the unavailable PoW parent) nor should we specify new return values. CL can discern between SYNCING on chains where some PoS ancestor is validated and chains where they are rooted in an unknown PoW parent. Thus they have enough information to act accordingly (not halt block and attestation production).

I think the correct thing is to just note that SYNCING return value on transition chains MUST NOT halt block and attestation production.

We can discuss other "halt" conditions in optimistic sync discussions/specs elsewhere.
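
One way a CL could make that distinction is sketched below. It is only an illustration under assumptions: `BlockInfo`, its fields, and both function names are hypothetical and not taken from any client or spec. The idea is to walk the ancestors of the block that got SYNCING and check whether any of them carries a payload the EL has already validated.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlockInfo:
    parent_root: Optional[str]  # None at the root of this sketch's block tree
    has_payload: bool           # False for blocks with empty payloads
    payload_validated: bool     # True once EL returned VALID for the payload


def has_validated_pos_ancestor(blocks: Dict[str, BlockInfo], root: str) -> bool:
    """Return True if some ancestor of `root` has an EL-validated payload."""
    parent = blocks[root].parent_root
    while parent is not None:
        info = blocks[parent]
        if info.has_payload and info.payload_validated:
            return True
        parent = info.parent_root
    return False


def syncing_must_not_halt_duties(blocks: Dict[str, BlockInfo], root: str) -> bool:
    """True when SYNCING was returned for a chain rooted in an unknown PoW parent.

    A False result only means this rule does not apply; whether to halt in
    that case is left to the optimistic sync discussion.
    """
    return not has_validated_pos_ancestor(blocks, root)
```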

@g11tech
Contributor

g11tech commented Nov 23, 2021

CL can discern between SYNCING on chains where some PoS ancestor is validated and chains where they are rooted in an unknown PoW parent. Thus they have enough information to act accordingly (not halt block and attestation production).

This distinction made by CL at transition time should resolve the situation and get the right/available/popular PoW block into the chain by allowing the local PoW block to be used to build the new merge block. Once the new terminal PoW block is out there in the network, validators should be able to vote on it.

@mkalinin
Collaborator Author

I think the correct thing is to just note that SYNCING return value on transition chains MUST NOT halt block and attestation production.

I tend to agree. We need a proper place for this statement. I guess it should be in the optimistic sync document.

We can discuss other "halt" conditions in optimistic sync discussions/specs elsewhere

I've created a separate issue #2735

@djrtwo
Contributor

djrtwo commented Nov 23, 2021

I think we can elevate that as a note into the CL specs

Chains with transition blocks with unavailable PoW parents MUST be queued until the source PoW chain becomes available. Block and attestation production MUST NOT be halted due to anything in that queue
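
A rough sketch of how such a queue might look, assuming a hypothetical `TransitionBlockQueue` keyed by the missing PoW parent hash (none of these names come from the specs). The point of `can_perform_duties` is that it deliberately does not look at the queue contents.

```python
from collections import defaultdict
from typing import Dict, List


class TransitionBlockQueue:
    """Hypothetical holding area for transition blocks with an unavailable PoW parent."""

    def __init__(self) -> None:
        # PoW parent hash -> roots of beacon blocks waiting for it
        self._waiting: Dict[bytes, List[str]] = defaultdict(list)

    def enqueue(self, pow_parent_hash: bytes, beacon_block_root: str) -> None:
        self._waiting[pow_parent_hash].append(beacon_block_root)

    def on_pow_block_available(self, pow_block_hash: bytes) -> List[str]:
        """Release the beacon blocks that were waiting for this PoW block."""
        return self._waiting.pop(pow_block_hash, [])

    def __len__(self) -> int:
        return sum(len(roots) for roots in self._waiting.values())


def can_perform_duties(queue: TransitionBlockQueue) -> bool:
    # Nothing in the queue may halt block or attestation production,
    # so the queue is intentionally ignored here.
    return True
```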

@ajsutton
Contributor

What's the limit on that though? For example if I'm following a chain and get a transition block that I can't yet verify, it makes sense to not have that block me. But if I can optimistically follow the chain past that transition block for another 100 epochs then it seems wrong to produce a block that would create a fork.

@djrtwo
Contributor

djrtwo commented Nov 23, 2021

An easy limit is if an unavailable chain (wrt PoW source) finalizes. This clearly looks like a networking/EL failure and should probably be bubbled up to the user

@ajsutton
Contributor

An easy limit is if an unavailable chain (wrt PoW source) finalizes. This clearly looks like a networking/EL failure and should probably be bubbled up to the user

But we'd expect that to happen if we're initially syncing the chain so it wouldn't be an error we'd report to the user. Not performing duties if our finalized checkpoint is only optimistically synced probably does make sense - though I'd be tempted to base it on the justified checkpoint instead since attesting with an invalid justified checkpoint can be very problematic and it's still a very strong signal that a lot of validators think that's the real chain and the execution block should turn up eventually.
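
The two candidate limits could be compared with a small sketch like the one below. `ChainView` and `should_perform_duties` are hypothetical names, and the flags stand for "this checkpoint is only optimistically synced"; how a client would derive them is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChainView:
    justified_is_optimistic: bool  # justified checkpoint not yet verified by EL
    finalized_is_optimistic: bool  # finalized checkpoint not yet verified by EL


def should_perform_duties(view: ChainView, use_justified: bool = True) -> bool:
    """Skip duties once the chosen checkpoint is only optimistically synced."""
    if use_justified:
        return not view.justified_is_optimistic
    return not view.finalized_is_optimistic


# Justified is optimistic but finalized is verified: the stricter
# justified-based rule already stops duties, the finalized-based one does not.
view = ChainView(justified_is_optimistic=True, finalized_is_optimistic=False)
stop_early = not should_perform_duties(view, use_justified=True)
stop_late = not should_perform_duties(view, use_justified=False)
```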

@g11tech
Contributor

g11tech commented Nov 24, 2021

+1 for justified, as it will signal early on that the user needs to follow up/escalate for social consensus to kick in, hopefully before the chain finalizes.

@mkalinin
Collaborator Author

Addressed in #2770:

The current slot (as per the system clock) is at least SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY ahead of the slot of the block being imported.

A node will not optimistically import a merge transition block until it's safe enough to do so. This prevents the system from getting stuck due to all nodes being kicked out by a transition block atop an unavailable terminal block. If a terminal block is available but hasn't been disseminated in time, it will eventually be disseminated and picked up by one of the next proposers to build another transition block atop it, and this block will likely be accepted by the network.

If a block is indeed unavailable, an honest chain built during SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY will be long enough to outperform a malicious chain built atop the unavailable terminal block. Most likely, the honest chain segment will have justified and finalized checkpoints produced during this period of time.
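
The slot condition quoted above can be sketched as a simple check. The function name and the `is_merge_transition_block` flag are hypothetical, and the value 128 for SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY is an assumption used for illustration, since this thread does not state it; 32 slots per epoch is the mainnet setting.

```python
SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY = 128  # assumed value, for illustration only
SLOTS_PER_EPOCH = 32                       # mainnet


def may_import_optimistically(block_slot: int, current_slot: int,
                              is_merge_transition_block: bool) -> bool:
    """Delay optimistic import of a merge transition block until it is old enough."""
    if not is_merge_transition_block:
        # The delay described here only concerns the transition block.
        return True
    return current_slot >= block_slot + SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY


# Under the assumed value, the waiting period spans this many full epochs,
# which is why justification and finalization are likely within it.
delay_in_epochs = SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY // SLOTS_PER_EPOCH
```

With these assumed numbers the delay covers 4 epochs, leaving room for the honest chain to justify and finalize checkpoints before the transition block becomes importable.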
