Clarify `get_pow_block` block-not-found case #2676

hwwhww · 2021-10-18T06:08:26Z

Issue

Discussed in #2636
It is possible that the requested block hash is not found via get_pow_block.

How did I fix it

Make get_pow_block returns Optional[PowBlock]
If it returns None, the operation is invalid.
Add some notes about the edge case.

specs/merge/fork-choice.md

Co-authored-by: Hsiao-Wei Wang <hsiaowei.eth@gmail.com>

mkalinin

We might want to add more clarity around transition block in the context of optimistic sync. But this could be done in a separate PR

specs/merge/validator.md

specs/merge/fork-choice.md

mkalinin · 2021-10-20T06:03:52Z

specs/merge/fork-choice.md

@@ -110,8 +117,12 @@ def on_block(store: Store, signed_block: SignedBeaconBlock) -> None:

    # [New in Merge]
    if is_merge_block(pre_state, block.body):
+        # Note: unavailable PoW block(s) may later become available.
+        # Nodes should queue such beacon blocks for later processing.


This case is a bit more tricky and we should think what would be a better way of specifying it. For the case when CL and EL are in sync with the network, current specification works. But for the case when they are syncing we might want to add more clarity to the spec.

Queueing these blocks could raise a deadlock in an optimistic sync process. Suppose, the optimistic sync is happening and EL uses state sync to catch up with the head and the network passed the transition point a while ago, so the state for transition block is no longer available due to state trie pruning. By queueing beacon blocks at this point, CL will stuck and won't inform EL about the most recent head that is available in the network.

The right behaviour during the sync stage would be as follows:

process transition block optimistically without terminal block verification

once fully synced, get back to the transition block and check if it's been finalized or not

verify terminal block if the transition block hasn't been finalized

do re-org if the terminal block is invalid

We also may add this clarity in a separate PR.

I consider the optimistic sync process as independent from the consensus specs.

"queuing blocks for later processing" in optimistic beacon chain sync means that you will eventually go back to fully validate them later. The architecture of CL/EL split client is independent of core consensus specs and I want to find the right balance here to not taint the specs with implementation details.

That said, I could see how this is confusing. But I do think that more notes about optimistic processing can/should be included in the Engine API as that is where this implemntation specific architecture design lives

I agree with the intention in general. But would also like to keep the Engine API spec lean. Probably, it should be described in an optimistic sync specification or an architecture design doc of a split client.

What confuses me is that with this change the spec considers the case when PowBlock is unavailable after state_transition have passed, which means that process_execution_payload have succeeded with unavailable payload. This could only happen in the optimistic sync scenario. So, the spec already supports an optimistic sync implicitly. I would prefer not supporting it at all, i.e. keeping everything as it is in the spec, or add an explicit notion of it. IMO, something in the middle is a source of confusion for implementers.

So we can keep the function as assuming availability here implicitly just blocking until it can be found. To avoid it all together.

Also committing to specifying optimistic sync wrt engineAPI somewhere

I consider the optimistic sync process as independent from the consensus specs.

A thought on that. Optimistic nature of the sync process is caused by the CL/EL separation which is deeper than CL/EL client software split hence could be reflected in the specs. Imagine the client that is represented by a single software unit or the world where there is only one physical blockchain, i.e. consensus and execution blockchains have been merged. In both cases optimistic sync would be still needed to be able to pull the head of the chain and then pull the EL state.

The analog is [fast | snap] sync today. It optimistically downloads the state using the information about the head from the wire, at the same time it pulls the consensus state which is currently a chain of headers originated from the genesis. And for the sync procedure to get converged it is required that headers constitute a valid PoW chain and the state has been pulled. We can replace headers with beacon state and blocks required for it verification in this analog and get the sync procedure that we are working on at the moment. Arguably we already have optimistic sync today that is heavily used by the network.

There is a couple of changes appear when we move from the analog to our case. First, we can't initiate the EL state pulling process until the consensus state is downloaded and verified, and it gives us an advantage because pulling the consensus state is cheaper and gives us more guarantee than in the PoW network. So, the optimistic sync in our case is rather applied to the consensus state than to the EL state. Second, we can do a sync only in an optimistic way as an EL block sync without CL isn't reliable anymore.

Recap, IMO it's worth considering an optimistic nature of the sync in the specs. As this nature is unlikely a subject to change until we have statelessness. Even in a stateless world it depends on the kind of the statelessness (weak vs strong) that is maintained and the type of the node (proposer/producer vs user).

I see your point and agree with it in general. But we have a specific case here. The logic we're changing in this proposal and are arguing about is Merge specific; it doesn't and shouldn't apply to a Ethereum Networks in general as some of them are started from scratch as a PoS networks and this is where your argument make a lot of sense to me.

But the Merge is a specific upgrade on the Mainnet and some time after and during the Merge Mainnet will rely on optimistic sync. I think it worth writing the spec around get_pow_block more clear and close to the Mainnet rather than to the general network as clarity is always good for implementers.

Your point on lean specs made me think about the spec clean up after the Merge, i.e. removing transition logic from spec as a part of clean-up fork.

Off the topic, I think it's difficult to generate test vectors for syncing scenarios from consensus-specs. What do you guys think about adding "client teams must implement the following integration testing cases: ..." to merge readiness for the known edge cases?

I think that this tests should be a part of Hive test scenarios and we definitely want some test plan to exist. I would link a document with test plan to the merge readiness. what do you think?

On a side topic, what do you think about moving the code under if is_merge_block(pre_state, block.body): condition in a separate function? This code becomes more and more sophisticated and enclosing it in a function should aid unit test writing where we can cover with tests this particular function with e.g. mocked get_pow_block.

This method could probably be called validate_merge_block, for instance:

def validate_merge_block(block: BeaconBlock) -> None: # Check the parent PoW block of execution payload is a valid terminal PoW block. # Note: unavailable PoW block(s) may later become available. # Nodes should queue such beacon blocks for later processing. pow_block = get_pow_block(block.body.execution_payload.parent_hash) assert pow_block is not None pow_parent = get_pow_block(pow_block.parent_hash) assert pow_parent is not None assert is_valid_terminal_pow_block(pow_block, pow_parent) # If `TERMINAL_BLOCK_HASH` is used as an override, the activation epoch must be reached. if TERMINAL_BLOCK_HASH != Hash32(): assert compute_epoch_at_slot(block.slot) >= TERMINAL_BLOCK_HASH_ACTIVATION_EPOCH if is_merge_block(pre_state, block.body): validate_merge_block(block)

Back to the main discussion, validate_merge_block is basically what CL must do after its fully synced. CL will have to scan its block tree (basically, every fork) for the merge block(s) and call to validate_merge_block over this block. There are three possible cases:

Such a block isn't available -- either it's behind the WS checkpoint the node has started with or it's not yet happened; already no need OR nothing yet to validate

Such a block exist and it's valid -- great the transition went smooth on this chain

Such a block exist and it's invalid -- invalidate the chain containing this block and do re-org and other according actions; if such a chain has been finalized then signal critical failure to the user and stop processing the chain

Validating this block doesn't require the BeaconState object which is good and can be done with the merge beacon block only. But is_merge_block function would probably need to be replaced to remove the dependency on the state.

And we could probably state something like that: "Note: unavailable PoW block(s) may later become available and a client software MAY delay a call to validate_merge_block until the PoW block(s) become available" and remove "Nodes should queue such beacon blocks for later processing". As the latter will likely cause a deadlock in an optimistic sync.

I like the validate_merge_block! It makes the warning more clear. :)
I implemented it in 6016347

Co-authored-by: Mikhail Kalinin <noblesse.knight@gmail.com>

mkalinin

Looks good to me! 👍

djrtwo

lgtm

Clarify get_pow_block block-not-found case

cd5cf60

hwwhww requested a review from mkalinin October 18, 2021 06:08

djrtwo reviewed Oct 18, 2021

View reviewed changes

specs/merge/fork-choice.md Outdated Show resolved Hide resolved

hwwhww added the Bellatrix CL+EL Merge label Oct 19, 2021

djrtwo reviewed Oct 19, 2021

View reviewed changes

specs/merge/fork-choice.md Outdated Show resolved Hide resolved

Apply suggestions from code review

e787da3

Co-authored-by: Hsiao-Wei Wang <hsiaowei.eth@gmail.com>

mkalinin reviewed Oct 20, 2021

View reviewed changes

hwwhww and others added 5 commits October 26, 2021 21:37

Merge branch 'dev' into get_pow_block-not-found

5b9652b

Fix typo

71d3159

Co-authored-by: Mikhail Kalinin <noblesse.knight@gmail.com>

Merge branch 'dev' into get_pow_block-not-found

79bf348

PR feedback from @mkalinin

6016347

Minor comment fix

8ecbffa

mkalinin approved these changes Oct 28, 2021

View reviewed changes

hwwhww requested a review from djrtwo October 29, 2021 04:11

djrtwo approved these changes Oct 29, 2021

View reviewed changes

djrtwo merged commit 6d1be4a into dev Oct 29, 2021

djrtwo deleted the get_pow_block-not-found branch October 29, 2021 19:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify `get_pow_block` block-not-found case #2676

Clarify `get_pow_block` block-not-found case #2676

hwwhww commented Oct 18, 2021

mkalinin left a comment

mkalinin Oct 20, 2021

djrtwo Oct 20, 2021

mkalinin Oct 21, 2021

djrtwo Oct 21, 2021

mkalinin Oct 22, 2021

mkalinin Oct 26, 2021

hwwhww Oct 26, 2021

mkalinin Oct 27, 2021

mkalinin Oct 27, 2021

hwwhww Oct 28, 2021

mkalinin left a comment

djrtwo left a comment

Clarify get_pow_block block-not-found case #2676

Clarify get_pow_block block-not-found case #2676

Conversation

hwwhww commented Oct 18, 2021

Issue

How did I fix it

mkalinin left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mkalinin left a comment

Choose a reason for hiding this comment

djrtwo left a comment

Choose a reason for hiding this comment

Clarify `get_pow_block` block-not-found case #2676

Clarify `get_pow_block` block-not-found case #2676