
op-supervisor-integration #49

Closed · wants to merge 4 commits

Conversation

protolambda
Contributor

@protolambda commented Jul 18, 2024

Description

Proposes how we extend the op-supervisor with an additional DB to handle reorgs and RPC queries.

@protolambda changed the title op-supervisor-integration [work in progress] op-supervisor-integration Jul 19, 2024
@protolambda marked this pull request as ready for review July 19, 2024 00:55
@protolambda requested a review from ajsutton July 19, 2024 00:56
Contributor

@ajsutton left a comment


So to run through my favourite L1 reorg scenario...

op-node would be calling NextDeriveTask, and would pull in the original set of blocks with the invalid dependencies. op-node then calls OnDerived with the block missing a dependency. It then calls NextDeriveTask and will get unknown until the other chain has derived a local-safe block at the required block number. Then it will get a new task to build a new depositOnly block on top of the latestCrossSafe. It builds that block and calls OnDerived, and eventually the cross-safe head advances to include that deposit-only block.

Then L1 reorgs, so op-supervisor will roll back its super-safe db and the cross-safe head to prior to the deposit-only block. op-node calls NextDeriveTask and is told to build a new, non-depositOnly block on the rolled-back cross-safe head, so it derives the original block with the dependencies. This time the other chain eventually derives a block that meets that dependency, and the cross-safe head advances to include the full block. So that checks out well.
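As a rough illustration of the task handshake in this scenario (names like TaskUnknown/TaskDepositOnly follow the discussion above, but the decision logic here is a hypothetical sketch, not the actual op-supervisor code):

```go
package main

import "fmt"

// Hypothetical task kinds, following the scenario above: the op-node either
// waits ("unknown"), builds a deposit-only replacement block, or derives
// the next block normally. Illustrative only, not the real op-supervisor API.
type Task string

const (
	TaskUnknown     Task = "unknown"      // dependency not resolvable yet: retry later
	TaskDepositOnly Task = "deposit-only" // replace the invalid block on top of cross-safe
	TaskNormal      Task = "normal"       // derive the next block as usual
)

// nextDeriveTask decides the next task for a chain, given how far the other
// chain's local-safe head has progressed and whether the dependency of the
// pending block was invalidated.
func nextDeriveTask(otherLocalSafe, requiredNum uint64, depInvalid bool) Task {
	if otherLocalSafe < requiredNum {
		return TaskUnknown // wait for the other chain to reach the required block
	}
	if depInvalid {
		return TaskDepositOnly
	}
	return TaskNormal
}

func main() {
	fmt.Println(nextDeriveTask(5, 7, true))  // unknown: other chain is behind
	fmt.Println(nextDeriveTask(7, 7, true))  // deposit-only: dependency invalid
	fmt.Println(nextDeriveTask(7, 7, false)) // normal
}
```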

I'm going to have to think more about having multiple op-nodes. It might work but definitely introduces a lot of complexity to reason about.

Comment on lines +14 to +16
We extend the op-supervisor with an append-only database, similar to that of the event-database,
to track the interdependent safety-level increments across the superchain.
This data can then be exchanged with the op-node to promote the safety of its respective L2 blocks.

TODO(Adrian): Think through how we ensure these databases are both in sync in the case of crashes or inconsistent writes.
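For illustration, a minimal sketch of what an append-only safety DB per the quoted section could look like; SafetyEntry and SafetyDB are hypothetical names, not the real schema:

```go
package main

import (
	"errors"
	"fmt"
)

// SafetyEntry is a hypothetical row in the append-only safety DB: one entry
// per (L1 block, chain), recording both the local-safe and cross-safe L2
// pointers, so a local-safe block can later be promoted to cross-safe.
type SafetyEntry struct {
	L1Number  uint64
	ChainID   uint64
	LocalSafe uint64
	CrossSafe uint64
}

// SafetyDB enforces the append-only property: entries may only be added in
// non-decreasing L1-block order, never rewritten in place.
type SafetyDB struct {
	entries []SafetyEntry
}

func (db *SafetyDB) Append(e SafetyEntry) error {
	if n := len(db.entries); n > 0 && e.L1Number < db.entries[n-1].L1Number {
		return errors.New("append-only: L1 number went backwards")
	}
	db.entries = append(db.entries, e)
	return nil
}

func main() {
	db := &SafetyDB{}
	fmt.Println(db.Append(SafetyEntry{L1Number: 10, ChainID: 1, LocalSafe: 100, CrossSafe: 99}))
	fmt.Println(db.Append(SafetyEntry{L1Number: 9, ChainID: 1})) // rejected: L1 went backwards
}
```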

@ajsutton
Contributor

Generally I like this approach a lot and I think it solves a lot of the problems we've had. We just need to think through, and be really clear about, what happens if op-node is ahead of or behind op-supervisor, because with multiple op-nodes using the same supervisor, only one of them will be the source of new log data for the supervisor. So the others will often be either ahead or behind.

@tynes
Contributor

tynes commented Jul 22, 2024

Two thoughts:

  • Can this safely handle changes in the dependency set, i.e. if we add a new chain?
  • Can this handle chains with different block times?

@protolambda
Contributor Author

@tynes

Can this safely handle changes in the dependency set? ie we add a new chain

Yes, the search over L1 blocks can discover any number of L2 chain entries per L1 block. This is not a fixed number.
We should be able to return error codes on the derive-task and related RPC endpoints if a chainID is used at a time when it shouldn't be part of the dependency set.

Can this handle chains with different blocktimes?

Yes, since the primary key-space is the L1 block that the L2 blocks are derived from. The L2 blocks are still ordered, but there may be gaps, and the L2 timestamp / block-number can be binary-searched to find the closest relevant entry.
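A sketch of that binary search over L1-keyed entries, using illustrative Entry/closestAtOrBefore names (gaps and repeated timestamps are allowed):

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical row keyed by the L1 block the L2 state was derived
// from. L2 timestamps are non-decreasing, but gaps and repeats are allowed
// (a chain with a longer block time just repeats its last L2 block).
type Entry struct {
	L1Number    uint64
	L2Timestamp uint64
}

// closestAtOrBefore returns the index of the last entry whose L2 timestamp
// is at or below the target, or -1 if every entry is later than the target.
func closestAtOrBefore(entries []Entry, target uint64) int {
	// sort.Search finds the first index whose timestamp exceeds the target.
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].L2Timestamp > target
	})
	return i - 1
}

func main() {
	entries := []Entry{{10, 1000}, {11, 1000}, {12, 1004}, {13, 1008}}
	fmt.Println(closestAtOrBefore(entries, 1005)) // 2: the entry at L1 block 12
	fmt.Println(closestAtOrBefore(entries, 999))  // -1: before the first entry
}
```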

Comment on lines +105 to +106
To query whether an L2 block is `cross-safe`, binary search for the L2 block, to determine when (w.r.t. L1 block)
it became cross-safe. If this L1 block exists and is canonical, then the L2 block is cross-safe.

Just thinking through the two different requirements we have here:

  1. We don't write an L2 block to the super-safe-db until it is cross-safe, and we write the L2 blocks that are cross-safe for all chains in that L1 block at the same time.
  2. We progress local-safe derivation on each chain L1 block by L1 block, so the chains move forward and eventually all chains have local-safe blocks at the same timestamp, letting us determine whether all dependencies are met or we need to replace with a deposit-only block.

Since chains may submit batches at very different intervals, it's possible that when the supervisor is up to L1 block 10, chain A can derive the next block but chain B can't yet. So the supervisor has to move forward along L1, potentially up to a full sequencer window, trying to find the next block from chain B. In the meantime, chain A might derive a lot of L2 blocks.

I think this might work if the entries for each L1 always describe the current cross-safe L2 block for the chain at that L1. So chain A would record the same L2 block for all of those L1 blocks as it waits for the next chain B block. supervisor would still index all those new blocks from chain A because it's indexing up to unsafe head.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

But I don't think this behaviour matches what's currently in this doc. The supervisor would need to tell all chains to start processing the next L1 block and just track the latest block they've reported back for each block. It can record a new L1-block set of entries in the DB once it has a response from each chain for that L1 block, even if they didn't derive a new head. So it has to track the L1 block it's up to, rather than always returning the L1 block of the current safe head, and it would return the L2 block the chain last reported it was up to.

Eventually, as the L1 moves forward, op-supervisor winds up having a block from every chain that builds on top of that chain's current cross-safe. It can then check if dependencies are met and advance the safe head, or start assigning tasks to replace local-safe blocks with deposit-only blocks. It would need to reset the L1 block it's asking nodes to derive from back to when the cross-safe head last advanced. Except at that point the local-safe for chain A could still be ahead of cross-safe (i.e. the batches for those blocks are in L1 blocks prior to when the cross-safe advanced, even though they're after the cross-safe block for chain A). Those batches are probably invalid now, but there may have been others that are now valid and were ignored because the local-safe head had already advanced. The fact that different channels can overlap makes it even more complex to know which L1 block we need to go back to.
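The bookkeeping described here (record the entry set for an L1 block only once every chain has reported back, even when a chain derived no new head) might be sketched like this; the names are hypothetical, not the actual supervisor code:

```go
package main

import "fmt"

// l1Round is hypothetical bookkeeping for one L1 block: the supervisor asks
// every chain to process that block and only writes the block's entry set to
// the DB once each chain has reported back, even if a chain derived no new
// L2 head.
type l1Round struct {
	l1Number uint64
	chains   []uint64
	reported map[uint64]uint64 // chainID -> latest L2 block reported
}

func newRound(l1 uint64, chains []uint64) *l1Round {
	return &l1Round{l1Number: l1, chains: chains, reported: map[uint64]uint64{}}
}

// report records a chain's response; it returns true once every chain has
// responded, i.e. the entry set for this L1 block can be recorded.
func (r *l1Round) report(chainID, l2Block uint64) bool {
	r.reported[chainID] = l2Block
	return len(r.reported) == len(r.chains)
}

func main() {
	r := newRound(10, []uint64{900, 901})
	fmt.Println(r.report(900, 1234)) // false: chain 901 still pending
	fmt.Println(r.report(901, 5678)) // true: all chains responded
}
```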

Contributor

Just realised: with steady batch derivation we don't need to go back and look for possible other batches. If the first one turns out to be invalid, it is replaced with a deposit-only block, and any others for that same block remain invalid. Will have to think it through again, but that should let us rewind the L1 to the block where the cross-safe last updated and try to go forward from there again.

Contributor Author

I think this might work if the entries for each L1 always describe the current cross-safe L2 block for the chain at that L1. So chain A would record the same L2 block for all of those L1 blocks as it waits for the next chain B block. supervisor would still index all those new blocks from chain A because it's indexing up to unsafe head.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

Right, we don't want a "weakest link batch submission" problem, and that's why I pivoted from using a timestamp (as in the original Notion doc draft) to an L1 block as the primary key.

If an L2 chain does not make progress on safety, it just repeats the latest cross-safe block in the DB for those L1 blocks where it did not make progress.
That way, at any point in L1 time, we have a notion of where the L2s were in terms of derived blocks.

Upon reorgs of L1, it'll be easy to roll back the DB by primary key, towards an L1 block that is still canonical, and get a coherent older understanding of where the L2s were. And then continue from there, following the new L1 chain.
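The rollback-by-primary-key idea can be sketched as follows (illustrative types, not the real DB code):

```go
package main

import "fmt"

// Entry is a hypothetical DB row keyed by L1 block number (the primary key),
// carrying the cross-safe L2 pointer recorded at that L1 block.
type Entry struct {
	L1Number  uint64
	CrossSafe uint64
}

// rollbackTo drops every entry derived from an L1 block that is no longer
// canonical, leaving a coherent older view of where the L2 chain was.
func rollbackTo(db []Entry, canonicalL1 uint64) []Entry {
	i := len(db)
	for i > 0 && db[i-1].L1Number > canonicalL1 {
		i--
	}
	return db[:i]
}

func main() {
	db := []Entry{{8, 100}, {9, 100}, {10, 101}, {11, 101}}
	db = rollbackTo(db, 9) // L1 blocks 10 and 11 reorged away
	fmt.Println(len(db), db[len(db)-1].CrossSafe) // 2 100
}
```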

Also note that we store both the "local" safe and "cross" safe L2 pointers per L1 block; this makes it easy to promote a "local" safe block once its dependencies are checked.

I'm assuming that would be the case even if there were no dependencies from chain A blocks because we want to move the L2 blocks ahead in sync?

Yes, no weakest link: the DB should always progress. If an L1 block did not produce new L2 blocks, then just repeat the last known L2 block (for both the cross-safe and local-safe labels).

If derivation is not being processed, i.e. if we are unable to determine what L2 block is being derived from L1 at all, then we are forced to stop the DB progression. This is unavoidable: the inactivity of op-nodes cannot be attributed to a particular L1 block, so any action would be premature.

But I don't think this behaviour matches what's currently in this doc. The supervisor would need to tell all chains to start processing the next L1 block and just track the latest block they've reported back for each block. It can record a new L1-block set of entries in the DB once it has a response from each chain for that L1 block, even if they didn't derive a new head. So it has to track the L1 block it's up to, rather than always returning the L1 block of the current safe head, and it would return the L2 block the chain last reported it was up to.

The behavior you are describing here matches my understanding, and what I tried to express in the doc.

Eventually, as the L1 moves forward, op-supervisor winds up having a block from every chain that builds on top of that chain's current cross-safe. It can then check if dependencies are met and advance the safe head, or start assigning tasks to replace local-safe blocks with deposit-only blocks. It would need to reset the L1 block it's asking nodes to derive from back to when the cross-safe head last advanced. Except at that point the local-safe for chain A could still be ahead of cross-safe (i.e. the batches for those blocks are in L1 blocks prior to when the cross-safe advanced, even though they're after the cross-safe block for chain A). Those batches are probably invalid now, but there may have been others that are now valid and were ignored because the local-safe head had already advanced. The fact that different channels can overlap makes it even more complex to know which L1 block we need to go back to.

The steady-batch-derivation rule solves the problem you outlined here. Once we invalidate a "local safe" block that was not yet promoted to cross-safe, while staying on the same L1 chain, we do not have to reconsider prior batches anymore. Once invalid, always invalid, when anchored to the same L1 chain: "steady" means no retries, no if-this-then-that backtracking in L1 history.

Just realised - with steady batch derivation we don't need to go back and look for possible other batches. If the first one turns out to be invalid it is replaced with a deposit-only block and any others for that same block remain invalid. Will have to think it through again but that should let use rewind the L1 to the block the cross safe last updated and try and go forward from there again.

Exactly. We can always drop the "local" safe and revert to the last "cross safe" while progressing on the same L1 chain, without re-processing older L1 blocks. This does mean the "local safe" of the L2 is reorged, but the event-logs DB can be updated by rolling it back to cross-safe and syncing the new post-interop-reorg L2 blocks.

@protolambda
Contributor Author

This is stale; the op-supervisor has a safety-index and work is in progress to handle reorgs for the next devnet.
