batcher: keep blocks, channels and frames in strict order & simplify reorg handling #12390

Merged: 71 commits into develop from gk/batcher-cursors-rebased on Nov 18, 2024

Contributor

@geoknee geoknee commented Oct 9, 2024

Closes #12123

Design doc https://www.notion.so/oplabs/op-batcher-re-architecture-114f153ee162803d943ff4628ab6578f?pvs=4

Changes (at the reviewer's request I could split this PR up):

  1. Use the op-service/queue type in a couple more places in the batcher
  2. Introduce cursors and reduce the amount of queue operations so that we can be sure things stay in order
  3. Simplify reorg handling (it was causing the channel queue to be spliced up which is contrary to the aims of (2))
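
To make change (2) concrete, here is a minimal sketch of a queue with a cursor. The names (`blockQueue`, `Enqueue`, `Next`, `Rewind`, `Prune`) and the use of bare block numbers are assumptions made for illustration; they are not the types or methods used in op-batcher. The idea is that blocks are only ever appended at the back or removed from the front, and reprocessing is done by moving the cursor rather than by splicing the queue.

```go
package main

// blockQueue is a hypothetical, simplified stand-in for the batcher's block
// queue. Blocks are appended at the back and pruned from the front; the cursor
// marks the next block that still has to be added to a channel.
type blockQueue struct {
	blocks []uint64 // block numbers, oldest first
	cursor int      // index of the next block to process
}

// Enqueue appends a block at the back of the queue.
func (q *blockQueue) Enqueue(n uint64) { q.blocks = append(q.blocks, n) }

// Next returns the block under the cursor and advances the cursor.
// The block itself stays in the queue until it is pruned.
func (q *blockQueue) Next() (uint64, bool) {
	if q.cursor >= len(q.blocks) {
		return 0, false
	}
	n := q.blocks[q.cursor]
	q.cursor++
	return n, true
}

// Rewind moves the cursor back to an already-processed block so that it and
// all later blocks are processed again. It reports whether the block was found.
func (q *blockQueue) Rewind(n uint64) bool {
	for i, b := range q.blocks[:q.cursor] {
		if b == n {
			q.cursor = i
			return true
		}
	}
	return false
}

// Prune dequeues up to k already-processed blocks from the front of the queue
// and shifts the cursor so that it keeps pointing at the same block.
func (q *blockQueue) Prune(k int) {
	if k < 0 {
		k = 0
	}
	if k > q.cursor {
		k = q.cursor
	}
	q.blocks = q.blocks[k:]
	q.cursor -= k
}
```

Because no operation inserts or removes a block in the middle, the relative order of the remaining blocks cannot change, which is the property the ordering checks below rely on.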

From the issue:

We should check

  • normal operation always enqueues blocks in order (This was already the case, but now we only ever enqueue and dequeue, never prepend blocks).
  • keeps channels in the channel queue in order (We now only ever enqueue, dequeue or delete from the back of the queue in a way which preserves the order -- by rewinding the block cursor)
  • sends transactions in order (This was already true, the txmgr has a queue which we add txs to in order)
  • also frames within blob transactions in order (This was already true, frames are appended to the transaction in order)
  • reorg case (We were already clearing out the batcher's state and starting again when detecting this. Continuing to do that, only now we don't bother finishing any tx submission work before doing so)
  • re-queueing case when a channel fails to submit in time -> need to preserve order across channels (also future pending) (When this happens, we now clear out the channel queue and rewind the block cursor to point at the first block of the timed out channel).
  • re-queueing case when da-type changes (This was already OK, but now simplified to make it clear we only ever operate with a single channel, the current channel. Again we rewind the block cursor so that the blocks can be reprocessed).
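
The channel timeout case can be sketched as follows. This is an illustration under stated assumptions, not the code in `channel_manager.go`: the `manager` and `channel` types, the `buildChannel` and `handleTimeout` names, and the choice to drop the timed-out channel together with every newer channel from the back of the queue are all hypothetical. When the timed-out channel is the oldest one in the queue, this clears the whole queue, as described above.

```go
package main

// channel is a hypothetical record of which blocks went into a channel.
type channel struct {
	id     string
	blocks []uint64 // block numbers in this channel, in order
}

// manager is a hypothetical, simplified channel manager.
type manager struct {
	blocks   []uint64  // all blocks held in memory, oldest first
	cursor   int       // index of the next block not yet in any channel
	channels []channel // channel queue, oldest first
}

// buildChannel takes up to n blocks starting at the cursor, in order, and
// enqueues them as a new channel at the back of the channel queue.
func (m *manager) buildChannel(id string, n int) {
	if n < 0 {
		n = 0
	}
	end := m.cursor + n
	if end > len(m.blocks) {
		end = len(m.blocks)
	}
	ch := channel{id: id, blocks: append([]uint64(nil), m.blocks[m.cursor:end]...)}
	m.cursor = end
	m.channels = append(m.channels, ch)
}

// handleTimeout drops the timed-out channel and every newer channel from the
// back of the queue, then rewinds the cursor to the first block of the
// timed-out channel so those blocks are reprocessed in their original order.
func (m *manager) handleTimeout(id string) bool {
	for i, ch := range m.channels {
		if ch.id != id {
			continue
		}
		dropped := 0
		for _, c := range m.channels[i:] {
			dropped += len(c.blocks)
		}
		m.cursor -= dropped
		m.channels = m.channels[:i]
		return true
	}
	return false
}
```

Rewinding the cursor instead of prepending the blocks back onto the block queue means the blocks are never moved, so a rebuilt channel sees them in exactly the order they were first enqueued.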

geoknee added 30 commits October 8, 2024 14:04
this is strange, I don't think we should expect channels with frames but no blocks...
and simplify channel.TxConfirmed API
Instead of optimizing for a clean shutdown (which we couldn't guarantee anyway), this change optimizes for code simplicity.

This change also helps us restrict the amount of code which mutates the channelQueue (removePendingChannel was doing removal of channels at arbitrary positions in the queue).

The downside is that we may end up needlessly resubmitting some data after the reset.

Reorgs are rare, so it makes sense to optimize for correctness rather than DA costs here.
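
The reset-on-reorg behaviour described in these commit messages can be sketched as below. The sketch assumes a reorg is detected when an incoming block's parent hash does not match the latest block held in memory; the `block` and `state` types, the string hashes, and the `errReorg`, `add` and `reset` names are hypothetical and chosen only for illustration.

```go
package main

import "errors"

// errReorg is a hypothetical sentinel error for a detected reorg.
var errReorg = errors.New("block does not extend the latest block in memory")

// block is a hypothetical, minimal block header.
type block struct {
	hash       string
	parentHash string
}

// state is a hypothetical stand-in for the batcher's in-memory state.
type state struct {
	blocks []block
}

// add appends a block if it descends from the latest block in memory,
// and returns errReorg otherwise. The first block is always accepted.
func (s *state) add(b block) error {
	if n := len(s.blocks); n > 0 && s.blocks[n-1].hash != b.parentHash {
		return errReorg
	}
	s.blocks = append(s.blocks, b)
	return nil
}

// reset discards everything and starts again, without trying to finish
// pending work first. Some data may be resubmitted after the reset.
func (s *state) reset() {
	s.blocks = nil
}
```

A caller would typically call `reset` when `add` returns `errReorg` and then reload blocks from the new chain, trading possible resubmission for simpler code.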
@geoknee geoknee added the A-op-batcher Area: op-batcher label Nov 12, 2024
Member

@sebastianst sebastianst left a comment

There's one confusion in handleChannelTimeout that we need to sort out. Otherwise lgtm, just minor comments and improvement proposals.

op-batcher/batcher/channel_manager.go
op-batcher/batcher/channel_manager_test.go
op-batcher/readme.md
@geoknee geoknee enabled auto-merge November 18, 2024 09:44
@geoknee geoknee added this pull request to the merge queue Nov 18, 2024
Merged via the queue into develop with commit c91fe2f Nov 18, 2024
49 checks passed
@geoknee geoknee deleted the gk/batcher-cursors-rebased branch November 18, 2024 09:52


### Reorgs
When an L2 unsafe reorg is detected, the batch submitter will reset its state, and wait for any in flight transactions to be ingested by the verifier nodes before starting work again.
Contributor

What is an L2 unsafe reorg wrt: [image]

Even safe L2 blocks (submitted on L1) could be reorged out by an L1 reorg right? Did you mean L1 unsafe (or safe) reorg here?

Contributor Author

Yes safe L2 blocks can be reorged by an L1 reorg.

What I mean here by an L2 reorg, is when the unsafe blocks being pulled from the sequencer don't descend from the previous L2 blocks in the batcher's memory. Does that make sense?

Contributor

Yep that makes sense! Could we rephrase to just say "ingested by the sequencer's op-node" here? Or are there cases the batcher is really waiting for nodeS?

And just to be sure, this kind of reorg can only happen when an L1 reorg happens right..?

Contributor Author

Yes we can make that edit.

An unsafe reorg can happen when e.g. the batcher is down for a long time such that the sequencing window is breached https://docs.optimism.io/stack/rollup/derivation-pipeline.

Contributor

On reading the code more here, this same logic would also be applied for safe reorgs, right? So I think this should say both unsafe and safe reorgs?

Contributor Author

Yes good point. Could you send a PR for these readme changes?

Also, the logic for handling these reorgs is being refactored here #13060

Contributor

Sounds good, I'll wait for that PR to land and then refactor my PR on top of it. Also I'll include the readme changes in my PR to force you to review it :D

Successfully merging this pull request may close these issues.

Holocene-D: op-batcher: Check existing code paths for ordering violations
4 participants