
Optimistic post-merge sync #2691

Closed
paulhauner opened this issue Oct 8, 2021 · 9 comments
Labels
RFC Request for comment

Comments

@paulhauner
Member

paulhauner commented Oct 8, 2021

Description

This is a tracking and discussion issue for implementing "optimistic" Beacon Chain (BC) sync once the merge-fork-epoch has passed. It aims to collate the lessons learned and information shared in the following two Lighthouse PRs:

This is a work-in-progress effort to maintain my notes in an organised fashion.

Terminology

  • Merge Fork: I use this to refer to the point at which the BC passes the MERGE_FORK_EPOCH. PoW Ethereum can (theoretically) exist indefinitely beyond this point.
  • Terminal Block (TB) Inclusion: I use this to refer to the point in the BC where get_terminal_pow_block
    first returns Some(pow_block) and it is included by reference as the parent of an ExecutionPayload in the BC.
    This must happen either at or after the Merge Fork. PoW Ethereum ends here.
  • Execution Layer (EL) Clients: existing "eth1" clients modified to work with the merge. Think EthereumJS, Nethermind, Besu, Erigon, Geth, etc.
  • Consensus Layer (CL) Clients: existing "eth2" clients modified to work with the merge. Think Lodestar, Nimbus, Teku, Lighthouse, Prysm.

Optimistic Sync

After TB inclusion on the BC, if we follow the specs exactly then we are simply unable to import beacon blocks without a connection to an EL client that is synced to our head (or later).

Whilst this is nice from a specification point of view, it's not great in practice. EL clients have developed very advanced ways of syncing the Ethereum state across long block-spans. Being spoon-fed block-by-block from a CL client is a major step backwards.

In order for EL clients to be able to use their fancy sync mechanisms, the CL clients need to zoom ahead and obtain all the valid beacon blocks they can and send the execution payloads to the EL clients. Ideally, the CL clients zoom to the head of the BC and are able to start sharing the latest, tip-of-the-chain execution_payloads with the EL. This gives the EL a nice big, juicy chain segment to sync.

Since the CL needs to reach the head of the BC before the EL can sync to an equivalent head, the CL must import beacon blocks without verifying the execution payloads. This is, technically, a violation of the BC specification. Some might call it "unsafe", but we call it "optimistic".

In summary, optimistic sync is where a CL syncs the BC without verifying all the execution payload values with an EL.
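
To make this concrete, here is a minimal sketch of the import decision in Rust (hypothetical types, not Lighthouse's actual API): when the EL reports that it is still syncing, the beacon block is imported anyway and its payload status is recorded as unknown, to be verified later.

```rust
// Hypothetical types for illustration only.
enum ElResponse {
    Valid,
    Invalid,
    Syncing, // the EL has not yet synced far enough to verify this payload
}

enum PayloadStatus {
    Verified,
    Unknown,
}

fn import_block(el_response: ElResponse) -> Result<PayloadStatus, &'static str> {
    match el_response {
        ElResponse::Valid => Ok(PayloadStatus::Verified),
        // Optimistic: accept the block now, verify the payload later.
        ElResponse::Syncing => Ok(PayloadStatus::Unknown),
        // A demonstrably invalid payload is always rejected.
        ElResponse::Invalid => Err("invalid execution payload"),
    }
}
```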

From Optimism to Realism

Syncing a CL client without verifying the execution payload values at all is simply unsafe (at least as far as I'm concerned). So, once we manage to get our EL synced, we should go back and verify all of the execution payloads we imported along the way.

Thankfully, this is not as tedious as it sounds. If one execution payload is valid, then all the ancestors must be valid. So, as long as we've ensured that the execution payloads we've imported all form a chain, if all the chain-heads (chain-tips) are valid, then all of our beacon blocks become fully verified and we're no longer an optimistic client (a realistic client?).
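
As a rough sketch (hypothetical structures, not Lighthouse's fork-choice code), this back-propagation is just a walk from a newly verified tip towards the root, stopping at the first block that is already verified:

```rust
use std::collections::HashMap;

type Hash = u64; // stand-in for a real block-root hash

struct Node {
    parent: Option<Hash>,
    payload_verified: bool,
}

// If a tip's payload is valid, every ancestor's payload is valid too.
fn mark_chain_verified(tree: &mut HashMap<Hash, Node>, tip: Hash) {
    let mut cursor = Some(tip);
    while let Some(hash) = cursor {
        match tree.get_mut(&hash) {
            Some(node) if !node.payload_verified => {
                node.payload_verified = true;
                cursor = node.parent;
            }
            // Unknown block or already-verified ancestor: nothing more to do.
            _ => break,
        }
    }
}
```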

But what if one of those execution payloads is invalid? Well, we just need to invalidate that block and its descendants. That sounds easy, but there are two scenarios to consider:

  1. A finalized execution payload is invalid.
  2. A non-finalized execution payload is invalid.

In the case of (1), we're in serious trouble. As I understand it, there aren't any CL clients prepared to handle a reversion in the finalized chain (Lighthouse won't). So, in this case I think we simply need to shut down, log critical errors and request the user to re-sync on a trusted internet connection.

In the case of (2), this is going to be much simpler. All the CL clients are prepared for re-orgs in the non-finalized chain. What they would do is go and remove the invalid block (and descendants) from their fork-choice tree and then run the fork-choice algorithm to find a new head that does not include any invalid execution payloads.
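
A minimal sketch of scenario (2), again with hypothetical structures: remove the invalid block and all of its descendants, after which the caller re-runs fork choice over the surviving nodes.

```rust
use std::collections::{HashMap, HashSet};

type Hash = u64; // stand-in for a real block-root hash

struct Node {
    parent: Option<Hash>,
}

// Remove an invalid block and every descendant from the fork-choice tree.
fn invalidate(tree: &mut HashMap<Hash, Node>, invalid_root: Hash) {
    let mut dead: HashSet<Hash> = HashSet::new();
    dead.insert(invalid_root);
    // Sweep repeatedly for nodes whose parent is dead. (Quadratic in the
    // worst case, but non-finalized fork-choice trees are small.)
    loop {
        let newly_dead: Vec<Hash> = tree
            .iter()
            .filter(|(hash, node)| {
                !dead.contains(*hash)
                    && node.parent.map_or(false, |p| dead.contains(&p))
            })
            .map(|(hash, _)| *hash)
            .collect();
        if newly_dead.is_empty() {
            break;
        }
        dead.extend(newly_dead);
    }
    for hash in &dead {
        tree.remove(hash);
    }
    // The caller now re-runs the fork-choice algorithm to find a new head
    // that does not include any invalid execution payloads.
}
```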

Dealing with Uncertainty

There are various different things a CL client needs to do with the blocks in their database:

  • Build new blocks atop them
  • Attest to them
  • Reference them in a sync committee
  • Serve them to API consumers
  • Serve them to P2P peers

When it comes to blocks with a valid payload, it's clear that we're free to do any of those tasks. However, when it comes to invalid blocks, I'd say it's clear that we shouldn't do any of those things.

But what about when we have blocks with an unknown execution payload status? I.e., the blocks we imported optimistically and haven't yet verified? At this point, I think I'm also of the opinion that we should not do any of those things either. Notably, it would be impossible to produce a block atop a block with an "unknown" status, since our EL can't build a new block atop one it doesn't know!
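
For illustration, a tiny sketch (hypothetical names) of the gating rule this implies: blocks with an unknown payload status are treated like invalid ones for all five duties.

```rust
// Hypothetical names for illustration only.
#[derive(PartialEq)]
enum PayloadStatus {
    Valid,
    Invalid,
    Unknown, // imported optimistically; not yet verified against the EL
}

// Block production, attestation, sync-committee duties, API responses and
// P2P responses are all withheld unless the payload is known-valid.
fn eligible_for_duties(status: &PayloadStatus) -> bool {
    *status == PayloadStatus::Valid
}
```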

So, if we know that our head has an unknown status we can't build atop it. But should we try to fork around it and build atop the best verified head (our "safe head")? I'm not sure of the correct behaviour here, but I think that we should not try to build around it, since we would be forking the chain when we know that we have an incomplete picture. I really need to think deeply about this and whether it will cause liveness failures.

Additional Resources

  • A doc by Danny Ryan on this topic: https://notes.ethereum.org/@djrtwo/BJxKBaqNF
    • This includes the addition of a most_recent_correct_ancestor to engine_processPayload, which would make it very easy for us to find all the invalid ancestors of a block in our fork-choice tree (see the sketch below).
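
For illustration, a hypothetical sketch of how such a field might be consumed (the field name comes from the doc above; it is not a finalized engine API): everything strictly between the last correct ancestor and the rejected payload is invalid.

```rust
type Hash = u64; // stand-in for a real block hash

// Hypothetical response shape; not a finalized engine API.
struct ProcessPayloadResponse {
    valid: bool,
    most_recent_correct_ancestor: Option<Hash>,
}

// Walk from the rejected payload up to the last correct ancestor, collecting
// every invalid block along the way.
fn invalid_ancestors(
    response: &ProcessPayloadResponse,
    rejected: Hash,
    parent_of: impl Fn(Hash) -> Option<Hash>,
) -> Vec<Hash> {
    let mut invalid = vec![rejected];
    if let Some(ancestor) = response.most_recent_correct_ancestor {
        let mut cursor = rejected;
        while let Some(parent) = parent_of(cursor) {
            if parent == ancestor {
                break; // this ancestor and everything below it is valid
            }
            invalid.push(parent);
            cursor = parent;
        }
    }
    invalid
}
```
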
@ralexstokes
Contributor

Terminal Block (TB) Inclusion: I use this to refer to the point in the BC where get_terminal_pow_block first returns Some(pow_block) and it is included as an ExecutionPayload in the BC. This must happen either at or after the Merge Fork. PoW Ethereum ends here.

Do you mean here that the parent_hash of the ExecutionPayload in the BC is a reference to the pow_block (wrapped in Some)? The way this reads to me is that the pow_block is copied as a (duplicate) ExecutionPayload into the first post-merge beacon block. Unless I'm confused on this, I'd suggest updating this description so it is clearer:

Terminal Block (TB) Inclusion: I use this to refer to the point in the BC where `get_terminal_pow_block`
 first returns Some(pow_block) and it is included by reference as the parent of an ExecutionPayload in the BC. 
This must happen either at or after the Merge Fork. PoW Ethereum ends here.

@djrtwo

djrtwo commented Oct 8, 2021

I concur that items 1 through 5 should not be performed on optimistic BC heads. CERTAINLY items 1 through 3 should not be performed on an unsafe/optimistic head. These are simply dangerous for attesters.

An attestation for an incorrect chain could result in the attester getting stuck on such a chain (in the event that two chains had 2/3 and conflicting FFG info), and building on incorrect beacon blocks is (a) currently just bad behaviour for the network and (b) when we have an execution proof of custody (which we expect sooner rather than later), it could result in slashing in some cases.

As for APIs, I don't think it makes sense to serve an optimistic head. The user would not be able to then go look at the EL contents of such a head and would thus have a broken view of what may be the head. If the EL isn't resolved for some stretch, then the aggregate EL+CL client is essentially still "syncing" that segment, and it is natural to treat it as such (even though one of the two layers is resolved).

As for P2P, it's a bit less straightforward. I don't think you should serve optimistic beacon blocks in blocks-by-range or status responses. Your sync status and local head are still behind the optimistic head, in a sense.

For gossip, though, it's a bit less clear. In many SYNCING situations, EL might be near the head so you want to still get new CL blocks so you can quickly resolve segments when EL finishes SYNCING. If you look at the Merge p2p beacon_block validations, you can do all of the execution_payload validations without issue. It seems like the worst case is that all CL clients (not just SYNCING ones) could be tricked to gossip a block that has a good signature and structure but bad EL execution. The non-SYNCING nodes would quickly drop the block because it fails full EL validations and the SYNCING nodes would also drop the block when eventually sync'd.
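
A sketch of that gossip rule (hypothetical names; not the actual p2p spec functions): a SYNCING node runs every beacon_block gossip check except payload execution, and still propagates the block.

```rust
// Hypothetical names for illustration only.
enum GossipVerdict {
    Propagate,
    Reject,
}

fn validate_gossip_block(
    signature_and_structure_ok: bool,
    el_is_syncing: bool,
    payload_executes_ok: impl FnOnce() -> bool,
) -> GossipVerdict {
    if !signature_and_structure_ok {
        return GossipVerdict::Reject;
    }
    if el_is_syncing {
        // Cannot execute the payload yet; optimistically propagate so that
        // segments resolve quickly once the EL finishes syncing.
        return GossipVerdict::Propagate;
    }
    // Fully synced nodes drop blocks whose payloads fail execution.
    if payload_executes_ok() {
        GossipVerdict::Propagate
    } else {
        GossipVerdict::Reject
    }
}
```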

@vbuterin

vbuterin commented Oct 8, 2021

CERTAINLY items 1 through 3 should not be performed on an unsafe/optimistic head.

This seems risky. If attesters do not attest to unsafe heads, then how would an unsafe head ever become safe?

(I'm sure in some situations it would, because attestations later show up, but not all)

@djrtwo

djrtwo commented Oct 8, 2021

This only happens if your EL is syncing.

*THIS IS NOT safe/unsafe wrt making decisions about beacon blocks and chance of re-org. This is unsafe because the CL has been validated but not the EL.

We conflated "unsafe" in two different Merge convos and conventions. "Optimistic" CL is probably a better term here

@sauliusgrigaitis

We experimented with a similar concept in Grandine for a different purpose: unlimited parallelization of block signature verification. It was a similar situation, as the latest chunk of the chain consisted of semi-verified blocks too (everything checked except signatures). In our case, fork choice built the chain from blocks whose signature verification was skipped, in order to advance the state far enough to spin up a large number (at least hundreds) of block signature verification tasks.

After we implemented it, the whole thing looked so terrible and unsafe that we dropped the idea: it added quadratic complexity to an already complex, optimized fork choice. Looking forward to your solution, as it would solve unlimited block signature verification parallelization too.

@paulhauner
Member Author

paulhauner commented Oct 11, 2021

Do you mean here that the parent_hash of the ExecutionPayload in the BC is a reference to the pow_block (wrapped in Some)?

Good point @ralexstokes, thanks. I've added your suggestion :)

For gossip, though, it's a bit less clear. In many SYNCING situations, EL might be near the head so you want to still get new CL blocks so you can quickly resolve segments when EL finishes SYNCING

Indeed, gossip is a good point. I also tend to think that we should continue to gossip blocks on an optimistic head.

Looking forward to your solution as it will solve the unlimited block signatures verification parallelization too.

It's important to note that my scheme fails hard (i.e. client shutdown, delete the database) if an invalid block is finalized. The primary reason I would be comfortable implementing such a scheme is because we verify signatures along the way. In order to get a failure in this optimistic execution-payload scheme you need to get 2/3rds (of a random distribution) of active validators signing across invalid blocks.

If we were to delay signature verification across an unlimited number of blocks, some batches would contain blocks that finalize blocks earlier in the batch (mainnet usually finalizes every 64 blocks). Since there is no signature verification, it would be trivial for anyone to construct a chain of blocks that looks like it finalizes.

So, to do unlimited parallelization of block signatures, you need a client design that makes it possible to revert finality. That is not something I plan to implement here unfortunately.

@sauliusgrigaitis

If we were to delay signature verification across an unlimited number of blocks, some batches would contain blocks that finalize blocks earlier in the batch (mainnet usually finalizes every 64 blocks). Since there is no signature verification, it would be trivial for anyone to construct a chain of blocks that looks like it finalizes.

This can be solved optimistically by doing a quick check of the proposer signature, so that's not too big a problem, especially if reverting is implemented.

So, to do unlimited parallelization of block signatures, you need a client design that makes it possible to revert finality. That is not something I plan to implement here unfortunately.

Grandine doesn't couple persistence and finalization. It can run in memory for a very long time; we dump the state only to avoid a full resync after a restart. However, as I mentioned before, the implementation we did back then felt too hacky.

Anyway, as long as signatures are checked, the only problem is avoiding a situation where 2/3rds finalize an invalid payload. This means that the unsafe head should be an isolated optimization and should not be exposed elsewhere; otherwise we may find that users use it in creative ways that lead to 2/3rds finalizing an invalid payload.

@paulhauner
Member Author

I've done some more thinking on this and my latest collection of information lives here:

https://hackmd.io/Ic7VpkY3SkKGgYLg2p9pMg

@paulhauner
Member Author

I'll close this since we've already implemented optimistic sync (and done the merge 🎉)
