Improve range sync with PeerDAS #6258
Comments
This does seem necessary to me, also because requesting all custody columns from the same peer means you have to find a peer that exactly matches your custody, right? So retries would happen frequently.
+1 to the above. One of the bugs we had earlier on was related to this. The proposed change here would allow us to start requesting blocks and columns without having to wait for peers to be available across all custody subnets (for supernodes that would mean requests would be delayed until they have peers across all 128 subnets!).
Noting another issue with backfill sync: backfill sync sources blocks and blobs / data columns from peers with the `_by_range` RPCs. We bundle those results in an `RpcBlock` and send it to the processor. Since the block and column peers may now be different, we need to attribute fault to the right peer. Change the function to something like:

```rust
fn process_backfill_blocks(downloaded_blocks: &[RpcBlock]) -> Result<(), Error> {
    // A failure here is the fault of the peer(s) that served the blocks.
    check_hash_chain(downloaded_blocks)?;
    // A failure here is the fault of the peer(s) that served the blobs / data columns.
    check_availability(downloaded_blocks)?;
    import_historical_block_batch(downloaded_blocks)?;
    Ok(())
}
```
Part of #6258.

To address PeerDAS sync issues we need to make individual by_range requests within a batch retriable. We should adopt the same pattern as lookup sync, where each request (block/blobs/columns) is tracked individually within a "meta" request that groups them all and handles retry logic.

Building on #6398, the second step is to add individual request accumulators for `blocks_by_range`, `blobs_by_range`, and `data_columns_by_range`. This will allow each request to progress independently and be retried separately. Most of the logic is just piping, excuse the large diff. This PR does not change the logic of how requests are handled or retried. That will be done in a future PR changing the logic of `RangeBlockComponentsRequest`.

### Before

- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::range_block_components_requests`
- (If received stream terminators of all requests)
  - Return `Vec<RpcBlock>`, and insert into `range_sync`

### Now

- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::blocks_by_range_requests`
- (If received stream terminator of this request)
  - Return `Vec<SignedBlock>`, and insert into `SyncNetworkContext::components_by_range_requests`
- (If received a result for all requests)
  - Return `Vec<RpcBlock>`, and insert into `range_sync`
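A minimal, self-contained sketch of this two-level accumulation, using simplified stand-in types: the struct, field, and function names only mirror the description above and are not Lighthouse's actual APIs.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types; illustrative only.
type RequestId = u64;
type BatchId = u64;
type SignedBlock = String;
type RpcBlock = String;

#[derive(Default)]
struct SyncNetworkContextSketch {
    /// Level 1: one accumulator per individual `blocks_by_range` request.
    blocks_by_range_requests: HashMap<RequestId, (BatchId, Vec<SignedBlock>)>,
    /// Level 2: per-batch accumulator collecting the results of completed
    /// requests (only blocks shown; blobs / columns would sit alongside).
    components_by_range_requests: HashMap<BatchId, Vec<SignedBlock>>,
}

impl SyncNetworkContextSketch {
    /// Handle one `blocks_by_range` response item; `None` is the stream terminator.
    /// Returns `Some(rpc_blocks)` once the batch is ready to hand to range sync.
    fn on_blocks_by_range_response(
        &mut self,
        id: RequestId,
        block: Option<SignedBlock>,
    ) -> Option<Vec<RpcBlock>> {
        match block {
            // Accumulate the response under its own request id.
            Some(block) => {
                self.blocks_by_range_requests.get_mut(&id)?.1.push(block);
                None
            }
            // Stream terminator: this request completed on its own; move its
            // result into the per-batch accumulator.
            None => {
                let (batch_id, blocks) = self.blocks_by_range_requests.remove(&id)?;
                self.components_by_range_requests
                    .entry(batch_id)
                    .or_default()
                    .extend(blocks);
                // Simplification: the real flow would only couple into
                // `Vec<RpcBlock>` after blobs / data columns also completed.
                self.components_by_range_requests.remove(&batch_id)
            }
        }
    }
}
```

The key property is that each by_range request keeps its own accumulator, so it can be dropped and re-issued without disturbing the per-batch state.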
### Roadmap

- Make `RangeBlockComponentsRequest` aware of the request IDs to safely trigger retries. We don't want the result of a prior by_range request to affect the state of a future retry. Lookup sync uses this mechanism.

### Description
Currently range sync and backfill sync fetch blocks and blobs from the network with this sequence:

- `blocks_by_range` and `blobs_by_range` to the SAME ONE peer

This strategy is not optimal but good enough for now. However, with PeerDAS the worst-case number of requests per batch increases from 2 (blocks + blobs) to 2 + `DATA_COLUMN_SIDECAR_SUBNET_COUNT / CUSTODY_REQUIREMENT` = 2 + 32 (if not connected to any larger node). If we extend the current paradigm, a single failure on a columns_by_range request will trigger a retry of all 34 requests. Not optimal 😅
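Spelling out the arithmetic above as a minimal sketch; the constant values are those implied by the numbers quoted in this thread (128 column subnets, quotient of 32):

```rust
// Parameter values as implied by the numbers quoted in this issue.
const DATA_COLUMN_SIDECAR_SUBNET_COUNT: u64 = 128;
const CUSTODY_REQUIREMENT: u64 = 4;

// Today: blocks_by_range + blobs_by_range to a single peer.
const CURRENT_REQUESTS_PER_BATCH: u64 = 2;

// Worst case with PeerDAS, if not connected to any larger node; under the
// current paradigm a single failed columns_by_range request retries all of these.
const PEERDAS_WORST_CASE: u64 =
    2 + DATA_COLUMN_SIDECAR_SUBNET_COUNT / CUSTODY_REQUIREMENT; // = 2 + 32 = 34
```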
A solution is to make the "components_by_range" request able to retry each individual request. This is what block lookup requests do, where each component (block, blobs, custody) has its own state and retry count.
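As a rough sketch of that direction (hypothetical names, not the existing block-lookup code), each component of a batch could carry its own peer, state, and retry count, so a failure retries only that component:

```rust
const MAX_RETRIES: usize = 3; // illustrative limit, not a Lighthouse constant

/// Per-component request state, mirroring the lookup-sync pattern described above.
enum ComponentState<T> {
    Downloading { peer: String, retries: usize },
    Downloaded(Vec<T>),
    Failed,
}

impl<T> ComponentState<T> {
    /// Retry only this component on a new peer; sibling requests are untouched.
    fn on_failure(&mut self, new_peer: String) {
        match self {
            ComponentState::Downloading { peer, retries } if *retries < MAX_RETRIES => {
                *peer = new_peer;
                *retries += 1;
            }
            ComponentState::Downloading { .. } => *self = ComponentState::Failed,
            _ => {}
        }
    }
}

/// The "components_by_range" request grouping the individual component states.
struct ComponentsByRangeRequest<Block, Blob, Column> {
    blocks: ComponentState<Block>,
    blobs: ComponentState<Blob>,
    /// One entry per custody-column request; each fails and retries on its own.
    columns: Vec<ComponentState<Column>>,
}
```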
Your feedback requested here 👇
Going that direction would add a bunch of lines to sync, so first of all I want to check that you agree with the problem, and that it's actually necessary to make each request retry-able.