
op-supervisor-integration #49

Closed · wants to merge 4 commits

Conversation

protolambda
Contributor

@protolambda commented Jul 18, 2024

Description

Proposes how we extend the op-supervisor with an additional DB to handle reorgs and RPC queries.

@protolambda changed the title op-supervisor-integration [work in progress] op-supervisor-integration Jul 19, 2024
@protolambda marked this pull request as ready for review July 19, 2024 00:55
@protolambda requested a review from ajsutton July 19, 2024 00:56
Contributor

@ajsutton left a comment


So to run through my favourite L1 reorg scenario...

op-node would be calling NextDeriveTask, and would pull in the original set of blocks with the invalid dependencies. op-node then calls OnDerived with the block missing a dependency. It then calls NextDeriveTask and will get unknown until the other chain has derived a local-safe block at the required block number. Then it will get a new task to build a new depositOnly block on top of the latestCrossSafe. It builds that block and calls OnDerived, and eventually the cross-safe head advances to include that deposit-only block.

Then L1 reorgs, so op-supervisor will roll back its super-safe db and the cross-safe head to prior to the deposit-only block. op-node calls NextDeriveTask and is told to build a new, non-depositOnly block on the rolled-back cross-safe head, so it derives the original block with the dependencies. This time the other chain eventually derives a block that meets that dependency, and the cross-safe head advances to include the full block. So that checks out well.
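As a rough illustration of the task handshake in this scenario (names like TaskUnknown/TaskDepositOnly follow the discussion above, but the decision logic here is a hypothetical sketch, not the actual op-supervisor code):

```go
package main

import "fmt"

// Hypothetical task kinds, following the scenario above: the op-node either
// waits ("unknown"), builds a deposit-only replacement block, or derives
// the next block normally. Illustrative only, not the real op-supervisor API.
type Task string

const (
	TaskUnknown     Task = "unknown"      // dependency not resolvable yet: retry later
	TaskDepositOnly Task = "deposit-only" // replace the invalid block on top of cross-safe
	TaskNormal      Task = "normal"       // derive the next block as usual
)

// nextDeriveTask decides the next task for a chain, given how far the other
// chain's local-safe head has progressed and whether the dependency of the
// pending block was invalidated.
func nextDeriveTask(otherLocalSafe, requiredNum uint64, depInvalid bool) Task {
	if otherLocalSafe < requiredNum {
		return TaskUnknown // wait for the other chain to reach the required block
	}
	if depInvalid {
		return TaskDepositOnly
	}
	return TaskNormal
}

func main() {
	fmt.Println(nextDeriveTask(5, 7, true))  // unknown: other chain is behind
	fmt.Println(nextDeriveTask(7, 7, true))  // deposit-only: dependency invalid
	fmt.Println(nextDeriveTask(7, 7, false)) // normal
}
```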

I'm going to have to think more about having multiple op-nodes. It might work but definitely introduces a lot of complexity to reason about.

Comment on lines +14 to +16
We extend the op-supervisor with an append-only database, similar to that of the event-database,
to track the interdependent safety-level increments across the superchain.
This data can then be exchanged with the op-node to promote the safety of its respective L2 blocks.

TODO(Adrian): Think through how we ensure these databases are both in sync in the case of crashes or inconsistent writes.
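For illustration, a minimal sketch of what an append-only safety DB per the quoted section could look like; SafetyEntry and SafetyDB are hypothetical names, not the real schema:

```go
package main

import (
	"errors"
	"fmt"
)

// SafetyEntry is a hypothetical row in the append-only safety DB: one entry
// per (L1 block, chain), recording both the local-safe and cross-safe L2
// pointers, so a local-safe block can later be promoted to cross-safe.
type SafetyEntry struct {
	L1Number  uint64
	ChainID   uint64
	LocalSafe uint64
	CrossSafe uint64
}

// SafetyDB enforces the append-only property: entries may only be added in
// non-decreasing L1-block order, never rewritten in place.
type SafetyDB struct {
	entries []SafetyEntry
}

func (db *SafetyDB) Append(e SafetyEntry) error {
	if n := len(db.entries); n > 0 && e.L1Number < db.entries[n-1].L1Number {
		return errors.New("append-only: L1 number went backwards")
	}
	db.entries = append(db.entries, e)
	return nil
}

func main() {
	db := &SafetyDB{}
	fmt.Println(db.Append(SafetyEntry{L1Number: 10, ChainID: 1, LocalSafe: 100, CrossSafe: 99}))
	fmt.Println(db.Append(SafetyEntry{L1Number: 9, ChainID: 1})) // rejected: L1 went backwards
}
```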

@ajsutton
Contributor

Generally I like this approach a lot and I think it solves a lot of the problems we've had. We just need to think through, and be really clear about, what happens if op-node is ahead of or behind op-supervisor, because with multiple op-nodes using the same supervisor, only one of them will be the source of new log data for the supervisor. So the others will often be either ahead or behind.

@tynes
Contributor

tynes commented Jul 22, 2024

Two thoughts:

  • Can this safely handle changes in the dependency set, i.e. if we add a new chain?
  • Can this handle chains with different block times?

@protolambda
Contributor Author

@tynes

Can this safely handle changes in the dependency set? ie we add a new chain

Yes, the search over L1 blocks can discover any number of L2 chain entries per L1 block. This is not a fixed number.
We should be able to return error codes on the derive-task and related RPC endpoints if a chainID is used at a time when it shouldn't be part of the dependency set.

Can this handle chains with different blocktimes?

Yes, since the primary key-space is the L1 block that the L2 blocks are derived from. The L2 blocks are still ordered, but there may be gaps, and the L2 timestamp / block-number can be binary-searched to find the closest relevant entry.
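A sketch of that binary search over L1-keyed entries, using illustrative Entry/closestAtOrBefore names (gaps and repeated timestamps are allowed):

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical row keyed by the L1 block the L2 state was derived
// from. L2 timestamps are non-decreasing, but gaps and repeats are allowed
// (a chain with a longer block time just repeats its last L2 block).
type Entry struct {
	L1Number    uint64
	L2Timestamp uint64
}

// closestAtOrBefore returns the index of the last entry whose L2 timestamp
// is at or below the target, or -1 if every entry is later than the target.
func closestAtOrBefore(entries []Entry, target uint64) int {
	// sort.Search finds the first index whose timestamp exceeds the target.
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].L2Timestamp > target
	})
	return i - 1
}

func main() {
	entries := []Entry{{10, 1000}, {11, 1000}, {12, 1004}, {13, 1008}}
	fmt.Println(closestAtOrBefore(entries, 1005)) // 2: the entry at L1 block 12
	fmt.Println(closestAtOrBefore(entries, 999))  // -1: before the first entry
}
```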

Comment on lines +105 to +106
To query whether an L2 block is `cross-safe`, binary search for the L2 block, to determine when (w.r.t. L1 block)
it became cross-safe. If this L1 block exists and is canonical, then the L2 block is cross-safe.

Just thinking through the two different requirements we have here:

  1. We don't write an L2 block to the super-safe-db until it is cross-safe, and we write the L2 blocks that are cross-safe for all chains in that L1 block at the same time.
  2. We progress local-safe derivation on each chain L1 block by L1 block, so the chains move forward and eventually all chains have local-safe blocks at the same timestamp, letting us determine whether all dependencies are met or we need to replace with a deposit-only block.

Since chains may submit batches at very different intervals, it's possible that when the supervisor is up to L1 block 10, chain A can derive the next block but chain B can't yet. So the supervisor has to move forward along L1, potentially up to a full sequencer window, trying to find the next block from chain B. In the meantime, chain A might derive a lot of L2 blocks.

I think this might work if the entries for each L1 always describe the current cross-safe L2 block for the chain at that L1. So chain A would record the same L2 block for all of those L1 blocks as it waits for the next chain B block. supervisor would still index all those new blocks from chain A because it's indexing up to unsafe head.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

But I don't think this behaviour matches what's currently in this doc. The supervisor would need to tell all chains to start processing the next L1 block and just track the latest block they've reported back for each block. It can record a new L1-block set of entries in the DB once it has a response from each chain for that L1 block, even if they didn't derive a new head. So it has to track the L1 block it's up to, rather than always returning the L1 block of the current safe head, and it would return the L2 block the chain last reported it was up to.

Eventually, as the L1 moves forward, op-supervisor winds up having a block from every chain that builds on top of that chain's current cross-safe. It can then check if dependencies are met and advance the safe head, or start assigning tasks to replace local-safe blocks with deposit-only blocks. It would need to reset the L1 block it's asking nodes to derive from back to when the cross-safe head last advanced. Except at that point the local-safe for chain A could still be ahead of cross-safe (i.e. the batches for those blocks are in L1 blocks prior to when the cross-safe advanced, even though they're after the cross-safe block for chain A). Those batches are probably invalid now, but there may have been others that are now valid and were ignored because the local-safe head had already advanced. The fact that different channels can overlap makes it even more complex to know which L1 block we need to go back to.
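The bookkeeping described here (record the entry set for an L1 block only once every chain has reported back, even when a chain derived no new head) might be sketched like this; the names are hypothetical, not the actual supervisor code:

```go
package main

import "fmt"

// l1Round is hypothetical bookkeeping for one L1 block: the supervisor asks
// every chain to process that block and only writes the block's entry set to
// the DB once each chain has reported back, even if a chain derived no new
// L2 head.
type l1Round struct {
	l1Number uint64
	chains   []uint64
	reported map[uint64]uint64 // chainID -> latest L2 block reported
}

func newRound(l1 uint64, chains []uint64) *l1Round {
	return &l1Round{l1Number: l1, chains: chains, reported: map[uint64]uint64{}}
}

// report records a chain's response; it returns true once every chain has
// responded, i.e. the entry set for this L1 block can be recorded.
func (r *l1Round) report(chainID, l2Block uint64) bool {
	r.reported[chainID] = l2Block
	return len(r.reported) == len(r.chains)
}

func main() {
	r := newRound(10, []uint64{900, 901})
	fmt.Println(r.report(900, 1234)) // false: chain 901 still pending
	fmt.Println(r.report(901, 5678)) // true: all chains responded
}
```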

Contributor

Just realised: with steady batch derivation we don't need to go back and look for possible other batches. If the first one turns out to be invalid, it is replaced with a deposit-only block, and any others for that same block remain invalid. Will have to think it through again, but that should let us rewind the L1 to the block where the cross-safe last updated and try to go forward from there again.

Contributor Author

I think this might work if the entries for each L1 always describe the current cross-safe L2 block for the chain at that L1. So chain A would record the same L2 block for all of those L1 blocks as it waits for the next chain B block. supervisor would still index all those new blocks from chain A because it's indexing up to unsafe head.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

Right, we don't want a "weakest link batch submission" problem, and that's why I pivoted from using a timestamp (as in the original Notion doc draft) to an L1 block as the primary key.

If an L2 chain does not make progress on safety, it just repeats the latest cross-safe block in the DB for those L1 blocks where it did not make progress.
That way, at any point in L1 time, we have a notion of where the L2s were in terms of derived blocks.

Upon reorgs of L1, it'll be easy to roll back the DB by primary key, towards an L1 block that is still canonical, and get a coherent older understanding of where the L2s were. And then continue from there, following the new L1 chain.
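The rollback-by-primary-key idea can be sketched as follows (illustrative types, not the real DB code):

```go
package main

import "fmt"

// Entry is a hypothetical DB row keyed by L1 block number (the primary key),
// carrying the cross-safe L2 pointer recorded at that L1 block.
type Entry struct {
	L1Number  uint64
	CrossSafe uint64
}

// rollbackTo drops every entry derived from an L1 block that is no longer
// canonical, leaving a coherent older view of where the L2 chain was.
func rollbackTo(db []Entry, canonicalL1 uint64) []Entry {
	i := len(db)
	for i > 0 && db[i-1].L1Number > canonicalL1 {
		i--
	}
	return db[:i]
}

func main() {
	db := []Entry{{8, 100}, {9, 100}, {10, 101}, {11, 101}}
	db = rollbackTo(db, 9) // L1 blocks 10 and 11 reorged away
	fmt.Println(len(db), db[len(db)-1].CrossSafe) // 2 100
}
```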

Also note that we store both the "local" safe and "cross" safe L2 pointers per L1 block; this makes it easy to promote a "local" safe block once its dependencies are checked.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

Yes, no weakest link: the DB should always progress. If an L1 block did not produce new L2 blocks, then just repeat the last known L2 block (for both the cross-safe and local-safe labels).

If derivation is not being processed, i.e. if we are unable to determine what L2 block is being derived from L1 at all, then we are forced to stop the DB progression. This is unavoidable: the inactivity of op-nodes cannot be attributed to a particular L1 block, so any action would be premature.

But I don't think this behaviour matches what's currently in this doc. The supervisor would need to tell all chains to start processing the next L1 block and just track the latest block they've reported back for each block. It can record a new L1-block set of entries in the DB once it has a response from each chain for that L1 block, even if they didn't derive a new head. So it has to track the L1 block it's up to, rather than always returning the L1 block of the current safe head, and it would return the L2 block the chain last reported it was up to.

The behavior you are describing here matches my understanding, and what I tried to express in the doc.

Eventually, as the L1 moves forward, op-supervisor winds up having a block from every chain that builds on top of that chain's current cross-safe. It can then check if dependencies are met and advance the safe head, or start assigning tasks to replace local-safe blocks with deposit-only blocks. It would need to reset the L1 block it's asking nodes to derive from back to when the cross-safe head last advanced. Except at that point the local-safe for chain A could still be ahead of cross-safe (i.e. the batches for those blocks are in L1 blocks prior to when the cross-safe advanced, even though they're after the cross-safe block for chain A). Those batches are probably invalid now, but there may have been others that are now valid and were ignored because the local-safe head had already advanced. The fact that different channels can overlap makes it even more complex to know which L1 block we need to go back to.

The steady-batch-derivation rule solves the problem you outlined here. Once we invalidate a "local safe" block that was not yet promoted to cross-safe, while staying on the same L1 chain, we do not have to reconsider prior batches anymore. Once invalid, always invalid, when anchored to the same L1 chain: "steady" means no retries, no if-this-then-that backtracking in L1 history.

Just realised - with steady batch derivation we don't need to go back and look for possible other batches. If the first one turns out to be invalid it is replaced with a deposit-only block and any others for that same block remain invalid. Will have to think it through again but that should let use rewind the L1 to the block the cross safe last updated and try and go forward from there again.

Exactly. We can always drop the "local" safe and revert to the last "cross safe" while progressing on the same L1 chain, without re-processing older L1 blocks. This does mean the "local safe" of the L2 is reorged, but the event-logs DB can be updated by rolling it back to cross-safe and syncing the new post-interop-reorg L2 blocks.

@protolambda
Contributor Author

This is stale; the op-supervisor has a safety-index and work is in progress to handle reorgs for the next devnet.
