# `op-supervisor` integration

# Purpose

The previously described [`op-supervisor` backend design](./superchain-backend.md)
serves the resolution of cross-L2 dependencies, i.e. the "cross" part of safety,
but not the resolution of superchain-wide L1-inclusion and the handling of reorgs, i.e. the "safe" part.

This document describes how to complete the supervisor functionality,
such that it can be integrated with the op-node and op-geth functionality.

# Summary

We extend the op-supervisor with an append-only database, similar to that of the event-database,
to track the interdependent safety-level increments across the superchain.
This data can then be exchanged with the op-node to promote the safety of its respective L2 blocks.

# Problem Statement + Context
## op-supervisor super-safe-db

The main challenge for the op-supervisor, and the op-node, is reorgs.

With interop, due to cross-chain dependencies, reorgs may be triggered on a previously locally derived series of L2 blocks (single chain).
This may happen due to a missing event-log on another chain, which may have been independently reorged (either due to an L1 change, or due to another L2 interop reorg).

Flip-flops in reorgs, non-deterministic ordering of op-node interactions with the supervisor, and replay of events / L2 blocks from different batches on L1, all make single-chain-perspective solutions infeasible. The safety changes of L2 chains need to be tracked atomically across the superchain, to consistently apply resolution of invalid dependencies.

Taking a global perspective on safety-increments, with each increment anchored to a view of the L1 chain, is identical to the way the interop Fault-Proof resolution is proposed to work.

Offchain, however, we need to book-keep the different increments, so that we can replicate the view at each one of them in case of a reorg back to that point. Likewise, an FP game is initiated with this same data, artificially constraining the view of the L1 chain and the L2 chain derived from it thus far.

The "safe db" functionality of the op-node maintains this exact data, except only for a single chain, in an unoptimized manner, as an optional add-on.

With interop, we propose to introduce a "super safe db", maintained by the op-supervisor, to book-keep this information as part of interop verification. With this, the supervisor will be able to handle reorgs and provide accurate cross-L2 safety views to op-nodes.

# Alternatives Considered

## Only using op-supervisor event-logs and in-memory heads data

We started without the `super-safe-db`, with an attempt to resolve safety and reorgs against the event-logs data.

However, the event-logs data does not capture:
- Where L2 blocks were derived from on L1: determining which L2 block needs to be reorged out upon invalid executing messages
  requires an understanding of L1-ordering.
- When we choose to go with a deposits-only block, it may invalidate other L2 data at the same timestamp.
- Relations between L2 cross-safe increments, needed to set up an FP pre-state view.

# Proposed Solution

The solution consists of 3 parts:
- super-safe-db: data in the supervisor to resolve safety.
- op-supervisor RPC: serve queries to op-node / op-geth.
- integration: op-node / op-geth access the op-supervisor RPC to maintain a view of safety.

## Super-safe-db

### Primary key

With L1 block number as primary key:

- Key-space may leave gaps in L2 blockhashes.
- Not superchain-timestamp aligned.
- Binary search to find which L1 blocks we need to determine a specific L2 timestamp.
- Binary search to determine L2 safety.
- Easy to roll back on L1 reorg.

With superchain timestamp as primary key:

- All L2 blockhashes can be found.
- We store more duplicate derived-from information; presumably 1 block-number per L2 block, and the L1 blockhash of at least the highest L1 block number.
- Binary search to find what L2 data was derived from, up to a specific L1 block.
- More difficult to roll back on L1 reorg.
- Easy to capture the interop timestamp invariant, thus easy to maintain horizon properties.
- Weakest-link safety problem: cross-safe becomes global, and increments all at once for everyone.
  (This forms a delayed safety risk when 1 chain is not batch-submitting.)
- Consistent; roll-back on invalidated executing messages is always straightforward.
  - While the same as the FP bisection step, the L1-view is now inconsistent,
    or completely delayed/unavailable, between different L2 chains.

With each L1-block or timestamp increment, we need to capture the addition
of any L2 block that is failing executing-message checks.

Note that additions are dynamic-size (based on the chain interop set), and so every lookup will always be some kind of search.

### DB updates by op-supervisor

This DB can then be used to query when each L2 block became cross-safe.

The same DB can be used by the challenger, to replace the existing safe-db. This is a must when starting the interop-FP work.

And the same DB can be used to improve `FindSyncStart` performance, since the super-safe-db will have the exact data to determine where to start syncing from, when given an L1 block and chainID.

#### finalized (always cross)

To query the latest finalized L2 block,
simply binary search for the last L2 block fully derived from the finalized L1 block.
This can simplify existing op-node functionality, and make finality more reliable.
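
A minimal sketch of this query, assuming an in-memory slice of per-chain records sorted by their derived-from L1 number (the `Entry` shape and names are illustrative, not the actual supervisor types):

```go
import "sort"

// Entry is a simplified stand-in for one super-safe-db record of a single chain.
type Entry struct {
	DerivedFromL1 uint64 // L1 block number this record was derived up to (primary key)
	CrossSafeL2   uint64 // cross-safe L2 block number as of that L1 block
}

// lastFinalized binary-searches for the last entry fully derived from an L1
// block at or below the finalized L1 height. entries must be sorted ascending
// by DerivedFromL1.
func lastFinalized(entries []Entry, finalizedL1 uint64) (Entry, bool) {
	// Find the first entry derived from an L1 block beyond the finalized one...
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].DerivedFromL1 > finalizedL1
	})
	if i == 0 {
		return Entry{}, false // nothing derived from finalized L1 yet
	}
	// ...then the entry just before it is the finalized L2 tip.
	return entries[i-1], true
}
```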

#### cross-safe

To query whether an L2 block is `cross-safe`, binary search for the L2 block, to determine when (w.r.t. L1 block)
it became cross-safe. If this L1 block exists and is canonical, then the L2 block is cross-safe.
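
Reusing the illustrative `Entry` type from the sketch above, the cross-safe check is the mirror image: find the first entry that covers the L2 block, and let the caller verify that the corresponding L1 block is canonical:

```go
// crossSafeSince returns the L1 block number at which the given L2 block
// first became cross-safe. If that L1 block exists and is canonical,
// the L2 block is cross-safe.
func crossSafeSince(entries []Entry, l2Num uint64) (l1Num uint64, ok bool) {
	i := sort.Search(len(entries), func(i int) bool {
		return entries[i].CrossSafeL2 >= l2Num
	})
	if i == len(entries) {
		return 0, false // not (yet) cross-safe in our view
	}
	return entries[i].DerivedFromL1, true
}
```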

> **Review discussion** (on the cross-safe query):
>
> **Reviewer:** Just thinking through the two different requirements we have here: Since chains may submit batches at very different intervals, it's possible that when the supervisor is up to L1 block 10, chain A can derive the next block, but chain B can't yet. So the supervisor has to move forward along L1, potentially up to a full sequencer window, trying to find the next block from chain B. In the meantime, chain A might derive a lot of L2 blocks. I think this might work if the entries for each L1 always describe the current cross-safe L2 block for the chain at that L1. So chain A would record the same L2 block for all of those L1 blocks as it waits for the next chain B block. The supervisor would still index all those new blocks from chain A because it's indexing up to the unsafe head. I'm assuming that would be the case even if there were no dependencies from chain A blocks, because we want to move the L2 blocks ahead in sync? But I don't think this behaviour matches what's currently in this doc. The supervisor would need to tell all chains to start processing the next L1 block and just track the latest block they've reported back for each block. It can record a new L1-block set of entries in the db once it has a response from each chain for that L1 block, even if they didn't derive a new head. So it has to track the L1 block it's up to, rather than always returning the L1 block of the current safe head, and it would return the L2 block the chain last reported it was up to. Eventually, as the L1 moves forward, op-supervisor winds up having a block from every chain that builds on top of that chain's current cross-safe. It can then check if dependencies are met and advance the safe head, or start assigning tasks to replace local-safe blocks with deposit-only blocks. It would need to reset the L1 block it's asking nodes to derive from back to when the cross-safe head last advanced. Except at that point the local-safe for chain A could still be ahead of cross-safe (i.e. the batches for those blocks are in L1 blocks prior to when the cross-safe advanced, even though they're after the cross-safe block for chain A). Those batches are probably invalid now, but there may have been others that are now valid and were ignored because the local-safe head had already advanced. The fact that different channels can overlap makes it even more complex to know which L1 block we need to go back to.
>
> **Reviewer:** Just realised: with steady batch derivation we don't need to go back and look for possible other batches. If the first one turns out to be invalid, it is replaced with a deposit-only block, and any others for that same block remain invalid. Will have to think it through again, but that should let us rewind the L1 to the block where the cross-safe last updated, and try to go forward from there again.
>
> **Author:** Right, we don't want a "weakest link batch submission" problem, and that's why I pivoted from using a timestamp (as in the original notion doc draft) to an L1 block as primary key. If an L2 chain does not make progress on safety, it just repeats the latest cross-safe block in the DB, for those L1 blocks that it did not make progress for. Upon reorgs of L1, it'll be easy to roll back the DB by primary key, towards an L1 block that is still there, and get a coherent older understanding of where the L2s were at. And then continue from there, to follow the new L1 chain. Also note that we are storing both the "local" safe and "cross" safe L2-pointers per L1 block; this makes it easy to promote a "local" safe once the dependencies are checked.
>
> **Author:** Yes, no weakest-link, the DB should always progress. If the L1 did not produce new L2 blocks, then just repeat the last known L2 block (for the cross-safe, and also local-safe, labels). If derivation is not being processed, i.e. if we are unable to determine what L2 block is being derived from L1 at all, then we are forced to stop the DB progression however. This is unavoidable, because inactive op-nodes cannot be attributed to L1, so any action would be premature.
>
> **Author:** The behavior you are describing here matches my understanding, and what I tried to express in the doc.
>
> **Author:** The steady-batch-derivation solves the problem you outlined here. Once we invalidate a "local safe" block that was not promoted to cross-safe yet, while staying on the same L1 chain, it means we do not have to reconsider prior batches anymore. Once invalid, always invalid, when anchored to the same L1 chain. "Steady": no retries, no if-this-then-that backtracking in L1 history.
>
> **Author:** Exactly. We can always drop the "local" safe, revert to the last "cross safe", while progressing on the same L1 chain without re-processing older L1 blocks. This does mean the "local safe" of the L2 is reorged, but the event-logs-DB can be updated, by rolling it back to cross-safe, and syncing the new post-interop-reorg L2 blocks.

#### local-safe

The op-nodes determine this for themselves, and exchange it with the op-supervisor. The op-supervisor includes the "local safe" increment in the entry, but it may not be promoted to "cross safe" until all dependencies are met.

#### cross-unsafe

Optimistically, the supervisor may fetch the event-logs of unsafe blocks. Blocks with transitively safe dependencies may be marked as cross-unsafe.

The cross-unsafe data is not persisted across restarts: this can be re-computed from the event-log DB quickly,
and recomputing makes it much less complicated to stay consistent with the database.

#### unsafe

The sync-status exchange triggers the op-supervisor to fetch the event-logs of new unsafe blocks.

If the new unsafe block does not build on the previous unsafe block (parent-hash check), then the event-log DB needs to be rewound to the last blockhash checkpoint still consistent with the post-reorg L2 chain.
This may not revert deeper than cross-safe, as the L1 chain is the source of truth for reversals that deep.
The cross-safe has to rewind first; when unsafe blocks still "don't fit" on the remaining unsafe data, the unsafe-data reorg is triggered again, and can be rewound further.
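
A sketch of this rewind rule, with a hypothetical minimal `EventDB` interface standing in for the real event-log DB:

```go
import "errors"

// EventDB is a hypothetical, minimal view of the event-log DB for this sketch.
type EventDB interface {
	// LastConsistentWith returns the highest checkpointed L2 block at or above
	// floor whose hash is consistent with the post-reorg chain.
	LastConsistentWith(parentHash [32]byte, floor uint64) (l2Num uint64, ok bool)
	// RewindTo truncates the DB back to the checkpoint of the given L2 block.
	RewindTo(l2Num uint64) error
	// CrossSafe returns the current cross-safe L2 block number.
	CrossSafe() uint64
}

// handleUnsafeReorg rewinds the unsafe event-log data, but never deeper than
// cross-safe: deeper mismatches have to be resolved by the L1 chain first,
// which rewinds cross-safe and re-triggers this unsafe reorg handling.
func handleUnsafeReorg(db EventDB, newParentHash [32]byte) error {
	floor := db.CrossSafe()
	target, ok := db.LastConsistentWith(newParentHash, floor)
	if !ok {
		return errors.New("unsafe reorg reaches below cross-safe; awaiting L1 resolution")
	}
	return db.RewindTo(target)
}
```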

### Super-safe-DB format

The DB is append-only, with truncation on rollback:
this makes it easy to recover from corruption, and fast and efficient.

The entries are fixed-size:
```go
<version>                                            // 1 byte identifier of entry
<derived-up-to-and-including L1 block number>        // 8 bytes, primary key
<derived-up-to-and-including L1 truncated blockhash> // 20 bytes

<chain idx> // 4 bytes, same indexing as executing event-log entry

<l2 local-safe number>              // 8 bytes. Local-safe will be promoted to cross-safe once all dependencies transitively verify.
<l2 local-safe truncated blockhash> // 20 bytes. May refer to an invalid non-canonical L2 block, as dependencies are unverified.

<l2 cross-safe number>              // 8 bytes. Each L2 has its own. Also implied to be finalized, if the derived-from L1 is finalized.
<l2 cross-safe truncated blockhash> // 20 bytes.

<flags> // 1 byte, to indicate if cross-safe was a deposit-only block

// total: 1+8+20+4+8+20+8+20+1 = 90 bytes
```
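
Since every field is fixed-size, the entry maps directly onto a flat Go struct; a sketch of the (de)serialization, where the field names and big-endian byte order are assumptions of this sketch:

```go
import (
	"bytes"
	"encoding/binary"
)

const entrySize = 90

// superSafeEntry mirrors the 90-byte layout above.
type superSafeEntry struct {
	Version       uint8    // 1 byte entry-type identifier
	L1Number      uint64   // 8 bytes, primary key
	L1Hash        [20]byte // truncated L1 blockhash
	ChainIdx      uint32   // 4 bytes, same indexing as executing event-log entry
	LocalSafeNum  uint64
	LocalSafeHash [20]byte
	CrossSafeNum  uint64
	CrossSafeHash [20]byte
	Flags         uint8 // e.g. deposit-only marker
}

// encode packs the entry into exactly entrySize bytes: all fields are
// fixed-size, so binary.Write emits them back-to-back without padding.
func (e *superSafeEntry) encode() ([]byte, error) {
	var buf bytes.Buffer
	buf.Grow(entrySize)
	if err := binary.Write(&buf, binary.BigEndian, e); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func decodeEntry(raw []byte) (e superSafeEntry, err error) {
	err = binary.Read(bytes.NewReader(raw), binary.BigEndian, &e)
	return e, err
}
```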

For each chain in the dependency set, for each L1 block,
we append one of these entries to the super-safe-db.
If an L2 chain has not traversed to the next L1 block yet, the DB writing work halts until it does.

In a future iteration we may add other `<version>` data,
and fit it into `90`-byte entries (using multiple entries if necessary) to register additional data.

TODO: We modify the `event` DB to checkpoint every L2 block,
rather than only forcefully placing such an entry every 256 entries.
This will help verify any gaps between L2 blocks,
if the `cross-safe` or `local-safe` jumps by more than 1 block at a time.

## op-supervisor RPC

### `CheckMessage`

```go
type SafetyLevel string

const (
	Finalized   SafetyLevel = "finalized"
	Safe        SafetyLevel = "safe"
	CrossUnsafe SafetyLevel = "cross-unsafe"
	Unsafe      SafetyLevel = "unsafe"
	Unknown     SafetyLevel = "unknown"
	Conflicts   SafetyLevel = "conflicts"
)

type RPC interface {
	CheckMessage(identifier types.Identifier,
		payloadHash common.Hash,
	) (types.SafetyLevel, error)
}
```

This method is used to verify that initiating messages exist, e.g. by the tx-pool and block-building, to verify user input before inclusion into the chain.

API implementation (sketched below):

1. Find the event log-db entry at the given identifier.
2. Reconstruct a `TruncatedHash` from the `identifier` and `payloadHash`.
3. Check if the log-db entry matches the reconstructed `TruncatedHash`:
   - If it matches, return the safety-level of the entry.
     - The safety is determined by the first (and thus strongest) safety level of a chain head that is past or equal to the entry.
   - If it does not match, return the `conflicts` safety level.
   - If the entry is not yet known, return the `unknown` safety level.
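
A sketch of these steps (the `LogDB` interface and `Identifier` fields are simplified assumptions; hash reconstruction is left to the caller, since the exact hashing scheme lives in the supervisor's log-db):

```go
// Identifier locates an initiating message (simplified for this sketch).
type Identifier struct {
	ChainID  uint64
	BlockNum uint64
	LogIdx   uint32
}

// LogDB is a hypothetical minimal view of the event-log DB and its heads.
type LogDB interface {
	EntryAt(id Identifier) (truncatedHash [24]byte, ok bool)
	FinalizedHead(chainID uint64) uint64
	SafeHead(chainID uint64) uint64
	CrossUnsafeHead(chainID uint64) uint64
}

func checkMessage(db LogDB, id Identifier, want [24]byte) SafetyLevel {
	got, ok := db.EntryAt(id) // 1. find the log-db entry at the identifier
	if !ok {
		return Unknown // entry not yet known
	}
	// 2.+3. compare against the TruncatedHash the caller reconstructed
	// from the identifier and payloadHash.
	if got != want {
		return Conflicts
	}
	// The safety level is the strongest head at or past the entry's block.
	switch {
	case id.BlockNum <= db.FinalizedHead(id.ChainID):
		return Finalized
	case id.BlockNum <= db.SafeHead(id.ChainID):
		return Safe
	case id.BlockNum <= db.CrossUnsafeHead(id.ChainID):
		return CrossUnsafe
	default:
		return Unsafe
	}
}
```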

### `GetCrossUnsafe`

```go
type RPC interface {
	GetCrossUnsafe(chainID types.ChainID, maxNumber uint64) (crossUnsafe eth.BlockID, err error)
}
```

API implementation (sketched below):
1. Compare the `maxNumber` to the locally known `unsafe` tip of the event-DB:
   - If newer:
     1. Initiate fetching of new event-log data.
     2. Return the latest known `cross-unsafe`.
   - If equal or older:
     1. Return the latest known `cross-unsafe` block before `maxNumber`.
        This is the last block checkpoint `<= maxNumber`.

By clipping with `maxNumber`, the op-node is able to verify the returned `cross-unsafe`, reject it if it's non-canonical, and optionally query for older cross-unsafe blocks to try to bump to.
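
A sketch of the clipping behaviour (the `crossUnsafeIndex` interface and `BlockID` type are hypothetical stand-ins for the supervisor's internals and `eth.BlockID`):

```go
// BlockID pairs a block hash with its number (stand-in for eth.BlockID).
type BlockID struct {
	Hash   [32]byte
	Number uint64
}

type crossUnsafeIndex interface {
	UnsafeTip(chainID uint64) uint64 // latest unsafe block known to the event-DB
	TriggerFetch(chainID uint64)     // start fetching newer event-log data
	CrossUnsafeAtOrBefore(chainID, num uint64) (BlockID, error)
}

func getCrossUnsafe(idx crossUnsafeIndex, chainID, maxNumber uint64) (BlockID, error) {
	if tip := idx.UnsafeTip(chainID); maxNumber > tip {
		// The caller knows newer blocks than we do: catch up in the
		// background, and return the latest cross-unsafe we have.
		idx.TriggerFetch(chainID)
		return idx.CrossUnsafeAtOrBefore(chainID, tip)
	}
	// Clip to maxNumber: the op-node can then verify the result is canonical
	// in its own view, and query older blocks if it is not.
	return idx.CrossUnsafeAtOrBefore(chainID, maxNumber)
}
```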

### `NextDeriveTask`, `OnDerived`, `CrossSafe`

```go
type RPC interface {
	// NextDeriveTask returns where the next L2 block has to be derived from.
	NextDeriveTask(chainID types.ChainID) (parentL2 eth.BlockID, depositOnly bool, l1 eth.BlockID, err error)

	// OnDerived signals what has been derived, considered to be local-safe in the op-node.
	OnDerived(chainID types.ChainID, derived eth.BlockID, parentL2 eth.BlockID, depositOnly bool, l1 eth.BlockID) error

	// CrossSafe returns the cross-safe block, and where in L1 traversal it was recognized as cross-safe.
	CrossSafe(chainID types.ChainID, l1Number uint64) (crossSafe eth.BlockID, derivedFrom eth.BlockID, err error)
}
```
`NextDeriveTask`: Given the chainID, check if we have a local-safe block that builds on top of the latest cross-safe (see the sketch after this list).
- If we do not have such a block, then create the task to derive one:
  - `{parentL2: latestCrossSafe, depositOnly: false, l1: l1OfLatestCrossSafe}`
- If we do, then determine if the local-safe block has been verified:
  1. Check that all dependencies are:
     - known
     - cross-safe, if at a previous height
     - local-safe, if at the same height
     - valid intra-block dependencies, if at the same height
  2. If there are unknowns, then return an error to defer the task till later.
  3. If there are invalid dependencies, then create a task to replace the block:
     - `{parentL2: latestCrossSafe, depositOnly: true, l1: l1OfLatestCrossSafe}`
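
A sketch of this decision logic, reusing the illustrative `BlockID` from above, with hypothetical helpers (`chainState`, `depStatus`) standing in for the super-safe-db lookups:

```go
import "errors"

type depStatus int

const (
	depsValid depStatus = iota
	depsUnknown
	depsInvalid
)

// chainState is a hypothetical view of one chain's super-safe-db state.
type chainState interface {
	LatestCrossSafe() (parentL2 BlockID, l1 BlockID)
	LocalSafeOnTopOfCrossSafe() (BlockID, bool)
	CheckDependencies(b BlockID) depStatus
}

type DeriveTask struct {
	ParentL2    BlockID
	DepositOnly bool
	L1          BlockID
}

var errDeferTask = errors.New("dependencies unknown, retry later")

func nextDeriveTask(st chainState) (DeriveTask, error) {
	parentL2, l1 := st.LatestCrossSafe()
	local, ok := st.LocalSafeOnTopOfCrossSafe()
	if !ok {
		// No local-safe candidate yet: derive the next block normally.
		return DeriveTask{ParentL2: parentL2, DepositOnly: false, L1: l1}, nil
	}
	switch st.CheckDependencies(local) {
	case depsUnknown:
		return DeriveTask{}, errDeferTask // defer the task till later
	case depsInvalid:
		// Replace the invalid candidate with a deposit-only block.
		return DeriveTask{ParentL2: parentL2, DepositOnly: true, L1: l1}, nil
	default: // depsValid: the candidate can be promoted, no new derive work
		return DeriveTask{}, nil
	}
}
```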

`OnDerived`: the op-node signals the result of a previous `NextDeriveTask`:
- If `parentL2`, `depositOnly`, `l1` match the current task:
  1. Accept it, and merge it into the `super-safe-db` head entry.
  2. The head-entry should be flushed to the DB, and tasks should be updated, once every L2 chain has been seen.
     - TODO: multiple L2 blocks from the same L1 block. We need some way to report exhaustion of an L1 block.
     - TODO: there may not be any L2 blocks derived from the L1 block. Maybe we can repeat `parentL2` as "derived" in that case?
- If it does not match the task, then check if it matches a past entry.
  - If it does, silently accept the result. There may be multiple op-nodes running the same tasks.
  - If it does not, then return an error. The op-node will need to adjust to the latest task.

Note that the op-node can skip reporting task-results,
if the `CrossSafe()` is canonical to its view, and ahead of where the op-node is.

`CrossSafe`: returns the current cross-safe block, as of L1 block number `l1Number`.
The op-node should check that the returned L1 block is canonical before accepting the result.
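
On the op-node side, the canonical-check looks roughly like this (the `SupervisorRPC` and `L1Chain` interfaces are illustrative, not the actual op-node types):

```go
type SupervisorRPC interface {
	CrossSafe(chainID uint64, l1Number uint64) (crossSafe BlockID, derivedFrom BlockID, err error)
}

type L1Chain interface {
	BlockIDByNumber(num uint64) (BlockID, error)
}

// pollCrossSafe accepts the supervisor's cross-safe block only if the L1
// block it was derived from is canonical in the op-node's own L1 view.
func pollCrossSafe(sup SupervisorRPC, l1 L1Chain, chainID, l1Num uint64) (BlockID, bool) {
	crossSafe, derivedFrom, err := sup.CrossSafe(chainID, l1Num)
	if err != nil {
		return BlockID{}, false
	}
	canonical, err := l1.BlockIDByNumber(derivedFrom.Number)
	if err != nil || canonical != derivedFrom {
		return BlockID{}, false // different L1 view; retry with an older l1Num
	}
	return crossSafe, true
}
```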

### `TryFinalize`

```go
type RPC interface {
	TryFinalize(chainID types.ChainID, finalizedL1 eth.BlockID) (finalizedL2 eth.BlockID, err error)
}
```

- `TryFinalize`: this simply returns the latest L2 block of the specified `chainID` to finalize, which must be fully derived from the specified `finalizedL1` block.

## Integration

### op-node safety

- Track `cross-unsafe` as a new safety-level in the `EngineController`, similar to pseudo-levels like `backup-unsafe-reorg` and `pending-safe`.
- Modify the `engine` payload-processing step to not immediately mark a derived-from-L1 block as `safe`, but rather emit an event that indicates the `unsafe` block was derived and is ready to promote. Don't emit the pending-safe update immediately either.
- Add a new interop event-deriver (see the sketch after this list):
  - Monitor the derive tasks of the supervisor with `NextDeriveTask(...)`
  - Perform the tasks, report results with `OnDerived(...)`
  - Poll `CrossSafe(...)` for cross-safe changes.
- Replace the finalizer deriver, to simply take the L1 finality signal, forward it to the supervisor,
  and get back a signal of what to finalize on the L2 accordingly.
  If the L1 finalization signal is too new, that's ok: eventually we'll have recent data,
  or we can pass an older L1 block as input to the supervisor, to get an older L2 block to finalize.
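
A sketch of one iteration of the proposed interop event-deriver, tying the three RPCs together (the `FullSupervisorRPC` and `Engine` interfaces are illustrative stand-ins):

```go
type FullSupervisorRPC interface {
	NextDeriveTask(chainID uint64) (parentL2 BlockID, depositOnly bool, l1 BlockID, err error)
	OnDerived(chainID uint64, derived, parentL2 BlockID, depositOnly bool, l1 BlockID) error
	CrossSafe(chainID uint64, l1Number uint64) (crossSafe, derivedFrom BlockID, err error)
}

// Engine is a stand-in for the op-node's engine controller.
type Engine interface {
	Derive(parentL2 BlockID, l1 BlockID, depositOnly bool) (BlockID, error)
	PromoteCrossSafe(b BlockID) error
}

func interopStep(sup FullSupervisorRPC, eng Engine, chainID, l1Tip uint64) error {
	// 1. Fetch and perform the next derive task.
	parentL2, depositOnly, l1, err := sup.NextDeriveTask(chainID)
	if err != nil {
		return err // e.g. task deferred; retry later
	}
	derived, err := eng.Derive(parentL2, l1, depositOnly)
	if err != nil {
		return err
	}
	// 2. Report the result as local-safe.
	if err := sup.OnDerived(chainID, derived, parentL2, depositOnly, l1); err != nil {
		return err
	}
	// 3. Poll for cross-safe changes (canonical-check omitted, see above).
	crossSafe, _, err := sup.CrossSafe(chainID, l1Tip)
	if err != nil {
		return err
	}
	return eng.PromoteCrossSafe(crossSafe)
}
```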

TODO:
- We could add a mode to `op-node` where it ignores the derivation tasks,
  and just takes `CrossSafe` to find a recent sync-target, and then triggers engine-sync with that.
- We could expose a public `CrossSafe` endpoint, without the `NextDeriveTask`, `OnDerived`,
  for low-resource users / replicas to have something trusted to maintain a safety view with.

### op-geth safety

During block-building (see the sketch after this list):

- After adding a tx to an in-progress block-building job, inspect all new event-logs emitted by said tx:
  - For each executing message, query the op-supervisor `CheckMessage`, to verify that the message is sufficiently safe to include.
  - Undo the tx if the supervisor returns `conflicts`, and apply back-pressure to the tx origin.
  - Undo the tx if the supervisor returns `unknown`, and apply back-pressure to the inclusion of interop txs as a whole. The supervisor needs to catch up, but we can continue to build regular blocks.
  - Accept the tx if the supervisor indicates it is `unsafe` / `safe` / `finalized`, depending on the sequencer setting. If the setting does not allow `unsafe`, then re-try the tx at a later point, when the safety may have been bumped up.
- For txs in the tx-pool: periodically check the safety, and prune the pool accordingly. This is out of scope for devnet 1.
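
A sketch of the block-builder hook, reusing the illustrative `SafetyLevel` and `Identifier` types from above; the `parseExecutingMessage` decoder and the safety ranking are assumptions of this sketch:

```go
type txDecision int

const (
	includeTx       txDecision = iota
	dropTx          // undo; back-pressure the tx origin
	deferInteropTxs // undo; back-pressure interop txs until the supervisor catches up
	retryTxLater    // undo; retry when safety may have been bumped up
)

// rank orders safety levels from weakest to strongest for the threshold check.
var rank = map[SafetyLevel]int{Unsafe: 0, CrossUnsafe: 1, Safe: 2, Finalized: 3}

type MessageChecker interface {
	CheckMessage(id Identifier, payloadHash [32]byte) (SafetyLevel, error)
}

type Log struct{ /* raw event-log fields omitted */ }

// parseExecutingMessage is a hypothetical decoder for executing-message logs.
func parseExecutingMessage(l Log) (Identifier, [32]byte, bool) {
	return Identifier{}, [32]byte{}, false
}

func checkTxLogs(sup MessageChecker, logs []Log, minSafety SafetyLevel) (txDecision, error) {
	for _, l := range logs {
		id, payloadHash, isExec := parseExecutingMessage(l)
		if !isExec {
			continue // not an executing message
		}
		lvl, err := sup.CheckMessage(id, payloadHash)
		if err != nil {
			return deferInteropTxs, err
		}
		switch {
		case lvl == Conflicts:
			return dropTx, nil
		case lvl == Unknown:
			return deferInteropTxs, nil
		case rank[lvl] < rank[minSafety]:
			return retryTxLater, nil
		}
	}
	return includeTx, nil
}
```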

# Risks & Uncertainties

TODO(Adrian): Think through how we ensure these databases are both kept in sync in the case of crashes and inconsistent writes.