batcher: keep blocks, channels and frames in strict order & simplify reorg handling #12390

geoknee · 2024-10-09T14:24:34Z

Closes #12123

Design doc https://www.notion.so/oplabs/op-batcher-re-architecture-114f153ee162803d943ff4628ab6578f?pvs=4

Changes (at the reviewer's request I could split this PR up):

1. Use the op-service/queue type in a couple more places in the batcher.
2. Introduce cursors and reduce the number of queue operations so that we can be sure things stay in order.
3. Simplify reorg handling (it was causing the channel queue to be spliced up, which is contrary to the aims of (2)).
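
To illustrate the cursor idea in the changes above, here is a minimal sketch, not the code from this PR. The `blockQueue` type and its method names are hypothetical, and blocks are reduced to plain numbers. The point it shows is that blocks are only ever appended at the back or removed from the front, and reprocessing is done by moving a cursor backwards rather than by re-inserting blocks.

```go
package main

import "fmt"

// blockQueue is a hypothetical FIFO of block numbers with a read cursor.
// Blocks before the cursor have already been consumed (e.g. put into a
// channel); blocks at or after the cursor are still waiting.
type blockQueue struct {
	blocks []uint64
	cursor int
}

// Enqueue appends a block at the back. Nothing is ever prepended.
func (q *blockQueue) Enqueue(b uint64) {
	q.blocks = append(q.blocks, b)
}

// Next returns the block under the cursor and advances the cursor.
// The block stays in the queue so it can be replayed after a rewind.
func (q *blockQueue) Next() (uint64, bool) {
	if q.cursor >= len(q.blocks) {
		return 0, false
	}
	b := q.blocks[q.cursor]
	q.cursor++
	return b, true
}

// Rewind moves the cursor back to an already consumed block.
// It only searches before the cursor, so it never moves the cursor forward.
func (q *blockQueue) Rewind(b uint64) error {
	for i := 0; i < q.cursor; i++ {
		if q.blocks[i] == b {
			q.cursor = i
			return nil
		}
	}
	return fmt.Errorf("block %d not found before cursor", b)
}

// DequeueN drops up to n blocks from the front (for example once they are
// safe) and shifts the cursor, clamping at zero to avoid underflow.
func (q *blockQueue) DequeueN(n int) {
	if n > len(q.blocks) {
		n = len(q.blocks)
	}
	q.blocks = q.blocks[n:]
	q.cursor -= n
	if q.cursor < 0 {
		q.cursor = 0
	}
}

func main() {
	q := &blockQueue{}
	for _, b := range []uint64{100, 101, 102} {
		q.Enqueue(b)
	}
	first, _ := q.Next()
	second, _ := q.Next()
	fmt.Println("processed:", first, second)
	if err := q.Rewind(100); err != nil {
		fmt.Println("rewind failed:", err)
	}
	again, _ := q.Next()
	fmt.Println("after rewind:", again)
}
```

Because the underlying slice is never spliced in the middle, the relative order of blocks cannot change; only the cursor position does.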

From the issue:

We should check

- normal operation always enqueues blocks in order (This was already the case, but now we only ever enqueue and dequeue, never prepend blocks).
- keeps channels in the channel queue in order (We now only ever enqueue, dequeue or delete from the back of the queue in a way which preserves the order, by rewinding the block cursor).
- sends transactions in order (This was already true, the txmgr has a queue which we add txs to in order).
- also frames within blob transactions in order (This was already true, frames are appended to the transaction in order).
- reorg case (We were already clearing out the batcher's state and starting again when detecting this. We continue to do that, only now we don't bother finishing any tx submission work before doing so).
- re-queueing case when a channel fails to submit in time -> need to preserve order across channels (also future pending) (When this happens, we now clear out the channel queue and rewind the block cursor to point at the first block of the timed-out channel).
- re-queueing case when da-type changes (This was already OK, but now simplified to make it clear we only ever operate with a single channel, the current channel. Again we rewind the block cursor so that the blocks can be reprocessed).
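
The channel timeout case can be sketched roughly as follows. This is an illustration under assumptions, not the implementation in this PR: the `manager` and `channel` types are hypothetical stand-ins, blocks are plain numbers, and the behaviour for a channel with no blocks (leave the cursor where it is) is a guess. It follows the later fix recorded in this PR's commits, where newer channels are trimmed and older ones are kept.

```go
package main

import "fmt"

// channel is a hypothetical stand-in that only records which blocks it holds.
type channel struct {
	id     string
	blocks []uint64 // block numbers in ascending order
}

// manager keeps blocks and channels in strict order, oldest first.
type manager struct {
	blocks      []uint64
	blockCursor int // index of the next block to be put into a channel
	channels    []*channel
}

// indexOf returns the position of b in blocks, or -1 if it is absent.
func indexOf(blocks []uint64, b uint64) int {
	for i, x := range blocks {
		if x == b {
			return i
		}
	}
	return -1
}

// handleChannelTimeout drops the timed-out channel and every newer channel
// from the back of the queue, then rewinds the block cursor to the first
// block of the timed-out channel so those blocks are processed again.
func (m *manager) handleChannelTimeout(id string) error {
	for i, ch := range m.channels {
		if ch.id != id {
			continue
		}
		rewindTo := m.blockCursor
		if len(ch.blocks) > 0 {
			idx := indexOf(m.blocks, ch.blocks[0])
			if idx < 0 {
				return fmt.Errorf("first block %d of channel %s is not in the block queue", ch.blocks[0], id)
			}
			if idx < rewindTo {
				rewindTo = idx // only ever move the cursor backwards
			}
		}
		m.channels = m.channels[:i] // older channels are untouched
		m.blockCursor = rewindTo
		return nil
	}
	return fmt.Errorf("unknown channel %s", id)
}

func main() {
	m := &manager{
		blocks:      []uint64{1, 2, 3, 4, 5, 6},
		blockCursor: 6,
		channels: []*channel{
			{id: "A", blocks: []uint64{1, 2}},
			{id: "B", blocks: []uint64{3, 4}},
			{id: "C", blocks: []uint64{5, 6}},
		},
	}
	if err := m.handleChannelTimeout("B"); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("channels left:", len(m.channels), "cursor:", m.blockCursor)
}
```

Since channels are only removed from the back and the cursor only moves backwards, the blocks of the dropped channels are rebuilt into new channels in their original order.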

Instead of optimizing for a clean shutdown (which we couldn't guarantee anyway), this change optimizes for code simplicity. This change also helps us restrict the amount of code which mutates the channelQueue (removePendingChannel was doing removal of channels at arbitrary positions in the queue). The downside is that we may end up needlessly resubmitting some data after the reset. Reorgs are rare, so it makes sense to optimize for correctness rather than DA costs here.

op-batcher/batcher/channel.go

sebastianst

There's one confusion in handleChannelTimeout that we need to sort out. Otherwise lgtm, just minor comments and improvement proposals.

op-batcher/batcher/channel_manager.go

op-batcher/batcher/channel_manager_test.go

op-batcher/readme.md

samlaf · 2024-11-26T13:36:14Z

op-batcher/readme.md

+
+
+### Reorgs
+When an L2 unsafe reorg is detected, the batch submitter will reset its state, and wait for any in flight transactions to be ingested by the verifier nodes before starting work again.


What is an L2 unsafe reorg wrt:

Even safe L2 blocks (submitted on L1) could be reorged out by an L1 reorg right? Did you mean L1 unsafe (or safe) reorg here?

geoknee · 2024-11-26

Yes, safe L2 blocks can be reorged by an L1 reorg.

What I mean here by an L2 reorg is when the unsafe blocks being pulled from the sequencer don't descend from the previous L2 blocks in the batcher's memory. Does that make sense?

samlaf · 2024-11-26

Yep that makes sense! Could we rephrase to just say "ingested by the sequencer's op-node" here? Or are there cases where the batcher is really waiting for multiple nodes?

And just to be sure, this kind of reorg can only happen when an L1 reorg happens, right?

geoknee · 2024-11-26

Yes, we can make that edit.

An unsafe reorg can happen when e.g. the batcher is down for a long time such that the sequencing window is breached: https://docs.optimism.io/stack/rollup/derivation-pipeline

samlaf · 2024-12-05

On reading the code more here, this same logic would also be applied for safe reorgs, right? So I think this should say both unsafe and safe reorgs?

geoknee · 2024-12-05

Yes, good point. Could you send a PR for these readme changes?

Also, the logic for handling these reorgs is being refactored in #13060.

samlaf · 2024-12-05

Sounds good, I'll wait for that PR to land and then refactor my PR on top of it. Also I'll include the readme changes in my PR to force you to review it :D

geoknee added 30 commits October 8, 2024 14:04

use a queue.Queue for channelBuilder.frames

75928f8

remove pop and push terminology

40d5deb

proliferate queue.Queue type

9309689

simplify requeue method

762057b

undo changes to submodule

74e53f6

sketch out new arch

7a020d6

https://www.notion.so/oplabs/op-batcher-re-architecture-114f153ee162803d943ff4628ab6578f

add TODO

06b52bd

add channelManager.pruneSafeBlocks method and integrate into main loop

c949c2a

fix frameCursor semantics

d0e98f9

fixup tests

431e681

avoid Rewind() in tests

fdd675f

only rewind cursor in rewind (never move it forward)

0f4cc5c

fix assertions

ea88715

prune channels whose blocks are now safe

5ed15db

handle case when rewinding a channel with no blocks

2ea2dc8

this is strange, I don't think we should expect channels with frames but no blocks...

add clarification

a33d34e

implement channelManager.pendinBlocks() method

aafd290

fix pruning logic

22351ae

simplify pruneChannels

6717664

simplify pruneSafeBlocks

c463d53

add unit tests for pruneSafeBlocks

c8121b3

fix pruneSafeBlocks to avoid underflow

737a229

improve test

71901a2

add unit tests for pruneChannels

cd6d19c

introduce handleChannelTimeout

867335c

and simplify channel.TxConfirmed API

factor out channelManager.rewindToBlockWithHash

358c6a8

change test expectation

78d1f30

do more pruning in test

26f0040

Add readme and architecture diagram

a0b0e37

geoknee added 10 commits November 12, 2024 14:07

update panic message

da89ebc

extend test coverage and fix bug

71065c6

rename test blocks

fe7eb67

simplify HasPendingFrame() method

81aa115

simplify implementation of RewindFrameCursor

0ba74df

activate dormant test

23da037

ensure pending_blocks_bytes_current metric is tracked properly

4cd021a

cover metrics behaviour in test

7a45ecd

using new TestMetrics struct

extend test coverage to channelManager.handleChannelTimeout

f642cda

add comment to TxFailed

753ad8e

geoknee added the A-op-batcher Area: op-batcher label Nov 12, 2024

sebastianst reviewed Nov 13, 2024

sebastianst mentioned this pull request Nov 14, 2024

CursorQueue abstraction #12926

Open

sebastianst reviewed Nov 14, 2024

geoknee added 7 commits November 14, 2024 11:37

rename test fn

42727dc

point to e2e tests in readme.

869b745

readme: performance -> throughput

9ceff07

improve channel_manager_test to assert old channels are not affected by requeue or timeout

2d025e8

fix handleChannelTimeout behaviour

74ed8b8

We were trimming older channels and keeping new ones. We need to trim newer channels and keep old ones. Fixes associated test (see previous commit).

tighten up requirements for invalidating a channel

7b17b1c

replace requeue with handleChannelInvalidated

62bd303

sebastianst approved these changes Nov 18, 2024

Merge remote-tracking branch 'origin/develop' into gk/batcher-cursors-rebased

941669e

geoknee enabled auto-merge November 18, 2024 09:44

geoknee mentioned this pull request Nov 18, 2024

txmgr: use sendAsync in Queue #12947

Closed

geoknee added this pull request to the merge queue Nov 18, 2024

Merged via the queue into develop with commit c91fe2f Nov 18, 2024
49 checks passed

geoknee deleted the gk/batcher-cursors-rebased branch November 18, 2024 09:52

samlaf reviewed Nov 26, 2024



