Nodes may get stuck due to parent availability issue with transition block #2732

mkalinin · 2021-11-22T14:28:12Z

Problem

Suppose the merge fork has happened and the beacon chain network is producing blocks with empty payloads while waiting for TTD to be reached. A malicious proposer builds a block with a non-empty payload whose parent_hash field is set to a random value, i.e. it produces a payload on top of a block that is unavailable. According to the current spec, EL must switch to SYNCING upon receiving such a payload, which would turn all nodes in the network into syncing mode and prevent them from attesting to and proposing new blocks.

The Interop spec attempted to solve this edge case (see no. 5 in the Engine API spec here), and it was brought up on Discord by @g11tech (thanks a lot!), as the Kintsugi spec seems to miss the handling of this case.

Potential Solution

The Interop spec proposes that EL turn into SYNCING only after the parent block is pulled from the network and is proved to be a PoS block. If the parent is not yet pulled, or it turns out to be a PoW block, EL should stay silent and try to pull the missing blocks from the wire and execute them. If the parent block is indeed unavailable, EL would try to resolve the dependency forever and never respond to CL; CL wouldn't treat the beacon block containing this payload as fully validated and thus would orphan this block and move on. Additionally, this requires EL to properly handle the case when it's forced to sync with an unavailable chain -- this is assumed to be resolved already, as it may happen on the Mainnet (by receiving NewBlock with an unavailable parent) -- but there could be implications in the new context of CL/EL communication.

Note that when a node syncs from scratch and EL starts syncing before hearing from CL (regular sync in the PoW network), it will respond with SYNCING to any executePayload call. Suppose CL sends executePayload with an unavailable parent block before EL starts its sync process. EL following the Interop spec would wait until it pulls and executes the parent block and all its ancestors -- this would keep CL in limbo for a few hours in the case of the Mainnet. While in this state, CL can't attest to or propose new blocks.

A solution that seems to work, but not always:

Add isMergeBlock: bool to executePayload to clearly distinguish the transition block from the others (EL could instead hijack forkchoiceUpdated -- if there were no forkchoiceUpdated calls before an executePayload call, then this must be a transition block).
Add an UNKNOWN_PARENT response status to executePayload. EL returns UNKNOWN_PARENT when isMergeBlock: True, the parent is unknown, and EL isn't already SYNCING. Additionally, EL initiates the sync process in an attempt to sync up to (and including) the parent block.
CL handles UNKNOWN_PARENT as SYNCING during the optimistic sync, and as if there was a missing slot when no sync process is happening -- this allows CL to attest to the previous block and propose yet another block on top of the previous one.
In the case when the parent is truly unavailable, honest nodes will orphan the block.
In the case when the parent PoW block exists but a node didn't receive it in time due to synchrony issues, things get worse. CL will have to pull the parent beacon block and try to re-import it once again, with potentially the same UNKNOWN_PARENT result if the sync process on the EL side hasn't resolved the dependency yet.
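To make the steps above concrete, here is a minimal Python sketch of the proposed decision logic. It is only an illustration under stated assumptions: the function names, the string return values of cl_reaction and the use of a plain set for known blocks are all hypothetical, and payload validation itself is elided.

```python
from enum import Enum
from typing import Set


class Status(Enum):
    VALID = "VALID"
    SYNCING = "SYNCING"
    UNKNOWN_PARENT = "UNKNOWN_PARENT"  # status proposed in this issue, hypothetical


def execute_payload_status(parent_hash: str, known_blocks: Set[str],
                           is_merge_block: bool, el_is_syncing: bool) -> Status:
    """Hypothetical EL-side status for executePayload under the proposal."""
    if parent_hash in known_blocks:
        # Stand-in for "execute and validate the payload"; validation is elided.
        return Status.VALID
    if is_merge_block and not el_is_syncing:
        # Unknown parent of a transition block: report it; EL would also
        # start syncing towards the parent in the background.
        return Status.UNKNOWN_PARENT
    return Status.SYNCING


def cl_reaction(status: Status, cl_in_optimistic_sync: bool) -> str:
    """Hypothetical CL-side handling of the returned status."""
    if status is Status.VALID:
        return "import"
    if status is Status.UNKNOWN_PARENT and not cl_in_optimistic_sync:
        # Behave as if the slot was missed: keep attesting to the previous block.
        return "treat_as_missing_slot"
    return "treat_as_syncing"
```

With this shape, a transition payload pointing at a random parent_hash yields UNKNOWN_PARENT on an online node, and CL keeps performing duties on the previous block instead of switching to syncing mode.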


djrtwo · 2021-11-22T15:06:22Z

which would turn all nodes in the network into syncing mode and prevent them from attesting to and proposing new blocks.

I don't see why this is true? A CL node would not incorporate this block into its non-optimistic block-tree and would build upon an available TTD block as long as this isn't resolved.

mkalinin · 2021-11-22T17:04:19Z

I don't see why this is true?

This is true if the SYNCING status in the response turns CL from online mode into syncing mode, where it can't propose and attest because it understands that it has not yet caught up with the head of the observable chain. If this switch doesn't happen then we're safe in this edge case. Indeed, if EL responds with SYNCING in this case, the next proposer may still build yet another transition block with its own terminal PoW block, and validators will be able to attest to it if the parent is available in that case.

A SYNCING status in the response signals that EL is missing (at least) the parent, and post-Merge this must never happen with an online node, as the block trees of CL and EL are tightly coupled and no data availability issue may occur between the layers. CL adding a beacon block into its optimistic tree upon processing, and then moving it to the fully verified state once it hears from EL that the payload is VALID, sounds fine. But doing the same when EL is signalling that some data is missing doesn't feel right. Deciding on what CL should do when it receives SYNCING post-Merge is out of the scope of this issue and probably worth discussion.

IMO, the right behaviour of a node in this edge case is similar to what it would be if the payload execution took forever. Missing data is a different kind of thing and we might want to avoid mixing it into this case. I am not saying that this is entirely wrong though -- definitely worth discussion.

djrtwo · 2021-11-22T20:51:37Z

Right, SYNCING from EL on chains with an unknown PoW source at the transition cannot block CL from making decisions. Doing so would allow for trivially stopping the transition process.

IMO, the right behaviour of a node in this edge case is similar to what it would be if the payload execution took forever.

For simplicity's sake (not changing APIs or anything), I think that we should not go into a place where EL can hang forever (just attempting to fill in the unavailable PoW parent) nor should we specify new return values. CL can discern between SYNCING on chains where some PoS ancestor is validated and chains where they are rooted in an unknown PoW parent. Thus they have enough information to act accordingly (not halt block and attestation production).

I think the correct thing is to just note that SYNCING return value on transition chains MUST NOT halt block and attestation production.

We can discuss other "halt" conditions in optimistic sync discussions/specs elsewhere.

g11tech · 2021-11-23T04:11:22Z

CL can discern between SYNCING on chains where some PoS ancestor is validated and chains where they are rooted in an unknown PoW parent. Thus they have enough information to act accordingly (not halt block and attestation production).

Making this distinction in CL at transition time should resolve the issue and get the right/available/popular PoW block into the chain, by allowing the proposer to use its local PoW block to build a new merge block. Once the new terminal PoW block is out there in the network, validators should be able to vote on it.

mkalinin · 2021-11-23T06:35:02Z

I think the correct thing is to just note that SYNCING return value on transition chains MUST NOT halt block and attestation production.

I tend to agree. We need a proper place for this statement. I guess it should be in the optimistic sync document.

We can discuss other "halt" conditions in optimistic sync discussions/specs elsewhere

I've created a separate issue #2735

djrtwo · 2021-11-23T15:29:27Z

I think we can elevate that as a note into the CL specs

Chains with transition blocks with unavailable PoW parents MUST be queued until the source PoW chain becomes available. Block and attestation production MUST NOT be halted due to anything in that queue
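One way to picture the note above is a small holding queue keyed by the missing PoW parent. The sketch below is an assumption-laden illustration, not spec code: TransitionBlockQueue and its method names are hypothetical, and block roots and hashes are plain strings.

```python
from typing import Dict, List


class TransitionBlockQueue:
    """Hypothetical holder for transition blocks whose PoW parent is unavailable."""

    def __init__(self) -> None:
        # Maps a missing PoW parent hash to the beacon block roots waiting on it.
        self._pending: Dict[str, List[str]] = {}

    def enqueue(self, block_root: str, pow_parent_hash: str) -> None:
        self._pending.setdefault(pow_parent_hash, []).append(block_root)

    def on_pow_block_available(self, pow_block_hash: str) -> List[str]:
        """Release the beacon block roots that can now be re-processed."""
        return self._pending.pop(pow_block_hash, [])

    def __len__(self) -> int:
        return sum(len(roots) for roots in self._pending.values())
```

Nothing in this structure is consulted when deciding whether to propose or attest, which is the point of the note: queued blocks wait for their PoW source, and duties continue on the available chain.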

ajsutton · 2021-11-23T22:33:54Z

What's the limit on that though? For example if I'm following a chain and get a transition block that I can't yet verify, it makes sense to not have that block me. But if I can optimistically follow the chain past that transition block for another 100 epochs then it seems wrong to produce a block that would create a fork.

djrtwo · 2021-11-23T23:10:45Z

An easy limit is if an unavailable chain (wrt PoW source) finalizes. This clearly looks like a networking/EL failure and should probably be bubbled up to the user

ajsutton · 2021-11-23T23:42:51Z

An easy limit is if an unavailable chain (wrt PoW source) finalizes. This clearly looks like a networking/EL failure and should probably be bubbled up to the user

But we'd expect that to happen if we're initially syncing the chain so it wouldn't be an error we'd report to the user. Not performing duties if our finalized checkpoint is only optimistically synced probably does make sense - though I'd be tempted to base it on the justified checkpoint instead since attesting with an invalid justified checkpoint can be very problematic and it's still a very strong signal that a lot of validators think that's the real chain and the execution block should turn up eventually.
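As a rough sketch of the two candidate limits discussed here, the following compares gating duties on the finalized versus the justified checkpoint. The NodeView type, its field names and the gate_on parameter are hypothetical and only serve to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    # True when the checkpoint block has only been optimistically imported,
    # i.e. its execution payload has not been fully validated.
    justified_is_optimistic: bool
    finalized_is_optimistic: bool


def should_perform_duties(view: NodeView, gate_on: str = "justified") -> bool:
    """Hypothetical rule: skip duties when the chosen checkpoint is optimistic."""
    if gate_on == "justified":
        return not view.justified_is_optimistic
    if gate_on == "finalized":
        return not view.finalized_is_optimistic
    raise ValueError(f"unknown gate: {gate_on}")
```

Gating on the justified checkpoint trips earlier: a node whose justified checkpoint is optimistic but whose finalized checkpoint is verified stops performing duties under the "justified" rule and continues under the "finalized" rule.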

g11tech · 2021-11-24T07:58:04Z

+1 for justified, as it will signal early on that the user needs to follow up/escalate for social consensus to kick in, hopefully before the chain finalizes.

mkalinin · 2021-12-22T18:22:15Z

Addressed in #2770:

The current slot (as per the system clock) is at least SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY ahead of the slot of the block being imported.

A node will not optimistically import a merge transition block until it's safe enough to do so. This prevents the system from getting stuck due to all nodes being kicked out by a transition block on top of an unavailable terminal block. If a terminal block is available but hasn't been disseminated in time, it will eventually be disseminated and picked up by one of the next proposers to build another transition block on top of it, and this block will likely be accepted by the network.

If a block is indeed unavailable, an honest chain built during SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY will be long enough to outperform a malicious chain built on top of an unavailable terminal block. Most likely, an honest chain segment will have justified and finalized checkpoints produced during this period of time.
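The slot-distance condition quoted above can be sketched as a simple check. This is an illustration only: the function name is hypothetical, and the constant's value of 128 is a placeholder assumption for the example, not a value taken from this thread.

```python
# Placeholder value for illustration; the real value is whatever the spec defines.
SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY = 128


def is_safe_to_import_optimistically(block_slot: int, current_slot: int) -> bool:
    """True when the current slot is at least SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY
    ahead of the slot of the block being imported."""
    return current_slot >= block_slot + SAFE_SLOTS_TO_IMPORT_OPTIMISTICALLY
```

For a freshly proposed transition block, block_slot is close to current_slot, so the check fails and the block is not optimistically imported; a node catching up long after the transition passes the check.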

g11tech mentioned this issue Nov 22, 2021

Kintsugi 🍵 (the merge) devnets 2022 tracker ChainSafe/lodestar#3452

Closed

27 tasks

mkalinin mentioned this issue Nov 23, 2021

CL reaction to SYNCING response status post-Merge #2735

Closed

ajsutton mentioned this issue Nov 25, 2021

[Merge] Handle conflicting optimistic head Consensys/teku#4650

Closed

mkalinin mentioned this issue Dec 1, 2021

Consensus-layer Call 77 ethereum/pm#429

Closed

mkalinin closed this as completed Dec 22, 2021
