Improve range sync with PeerDAS #6258
Comments
This does seem necessary to me, also because requesting all custody columns from the same peer means you have to find a peer that exactly matches your custody, right? So retries would happen frequently.
+1 to the above. One of the bugs we had earlier on was related to this. The proposed change here would allow us to start requesting blocks and columns without having to wait for peers to be available across all custody subnets (for supernodes that would mean requests would be delayed until they have peers across all 128 subnets!).
Noting another issue with backfill sync: backfill sync sources blocks and blobs / data columns from peers with the `_by_range` RPCs. We bundle those results in an `RpcBlock` and send it to the processor. Since the block and column peers may now be different, we need to attribute fault to the right peer. Change the function to something like:

```rust
fn process_backfill_blocks(downloaded_blocks: &[RpcBlock]) -> Result<(), Error> {
    // A failure here is the fault of the peer(s) that served the blocks.
    check_hash_chain(downloaded_blocks)?;
    // A failure here is the fault of the peer(s) that served the blobs / data columns.
    check_availability(downloaded_blocks)?;
    import_historical_block_batch(downloaded_blocks)?;
    Ok(())
}
```
Part of #6258.

To address PeerDAS sync issues we need to make individual by_range requests within a batch retriable. We should adopt the same pattern as lookup sync, where each request (block/blobs/columns) is tracked individually within a "meta" request that groups them all and handles retry logic.

Building on #6398, the second step is to add individual request accumulators for `blocks_by_range`, `blobs_by_range`, and `data_columns_by_range`. This will allow each request to progress independently and be retried separately. Most of the logic is just piping, excuse the large diff. This PR does not change the logic of how requests are handled or retried. That will be done in a future PR changing the logic of `RangeBlockComponentsRequest`.

### Before

- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::range_block_components_requests`
- (If received stream terminators of all requests)
  - Return `Vec<RpcBlock>`, and insert into `range_sync`

### Now

- Sync manager receives block with `SyncRequestId::RangeBlockAndBlobs`
- Insert block into `SyncNetworkContext::blocks_by_range_requests`
- (If received stream terminator of this request)
  - Return `Vec<SignedBlock>`, and insert into `SyncNetworkContext::components_by_range_requests`
- (If received a result for all requests)
  - Return `Vec<RpcBlock>`, and insert into `range_sync`
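A minimal, self-contained sketch of this two-level accumulation, using simplified stand-in types: the struct, field, and function names only mirror the description above and are not Lighthouse's actual APIs.

```rust
use std::collections::HashMap;

// Simplified stand-ins for the real types; illustrative only.
type RequestId = u64;
type BatchId = u64;
type SignedBlock = String;
type RpcBlock = String;

#[derive(Default)]
struct SyncNetworkContextSketch {
    /// Level 1: one accumulator per individual `blocks_by_range` request.
    blocks_by_range_requests: HashMap<RequestId, (BatchId, Vec<SignedBlock>)>,
    /// Level 2: per-batch accumulator collecting the results of completed
    /// requests (only blocks shown; blobs / columns would sit alongside).
    components_by_range_requests: HashMap<BatchId, Vec<SignedBlock>>,
}

impl SyncNetworkContextSketch {
    /// Handle one `blocks_by_range` response item; `None` is the stream terminator.
    /// Returns `Some(rpc_blocks)` once the batch is ready to hand to range sync.
    fn on_blocks_by_range_response(
        &mut self,
        id: RequestId,
        block: Option<SignedBlock>,
    ) -> Option<Vec<RpcBlock>> {
        match block {
            // Accumulate the response under its own request id.
            Some(block) => {
                self.blocks_by_range_requests.get_mut(&id)?.1.push(block);
                None
            }
            // Stream terminator: this request completed on its own; move its
            // result into the per-batch accumulator.
            None => {
                let (batch_id, blocks) = self.blocks_by_range_requests.remove(&id)?;
                self.components_by_range_requests
                    .entry(batch_id)
                    .or_default()
                    .extend(blocks);
                // Simplification: the real flow would only couple into
                // `Vec<RpcBlock>` after blobs / data columns also completed.
                self.components_by_range_requests.remove(&batch_id)
            }
        }
    }
}
```

The key property is that each by_range request keeps its own accumulator, so it can be dropped and re-issued without disturbing the per-batch state.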
### Roadmap

- Make `RangeBlockComponentsRequest` aware of the request IDs to safely trigger retries. We don't want the result of a prior by_range request to affect the state of a future retry. Lookup sync uses this mechanism.

### Description
Currently range sync and backfill sync fetch blocks and blobs from the network with this sequence:

- `blocks_by_range` and `blobs_by_range` to the SAME ONE peer

This strategy is not optimal but good enough for now. However, with PeerDAS the worst-case number of requests per batch increases from 2 (blocks + blobs) to 2 + `DATA_COLUMN_SIDECAR_SUBNET_COUNT / CUSTODY_REQUIREMENT` = 2 + 32 (if not connected to any larger node). If we extend the current paradigm, a single failure on a columns_by_range request will trigger a retry of all 34 requests. Not optimal 😅
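Spelling out the arithmetic above as a minimal sketch; the constant values are those implied by the numbers quoted in this thread (128 column subnets, quotient of 32):

```rust
// Parameter values as implied by the numbers quoted in this issue.
const DATA_COLUMN_SIDECAR_SUBNET_COUNT: u64 = 128;
const CUSTODY_REQUIREMENT: u64 = 4;

// Today: blocks_by_range + blobs_by_range to a single peer.
const CURRENT_REQUESTS_PER_BATCH: u64 = 2;

// Worst case with PeerDAS, if not connected to any larger node; under the
// current paradigm a single failed columns_by_range request retries all of these.
const PEERDAS_WORST_CASE: u64 =
    2 + DATA_COLUMN_SIDECAR_SUBNET_COUNT / CUSTODY_REQUIREMENT; // = 2 + 32 = 34
```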
A solution is to make the "components_by_range" request able to retry each individual request. This is what block lookup requests do, where each component (block, blobs, custody) has its own state and retry count.
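As a rough sketch of that direction (hypothetical names, not the existing block-lookup code), each component of a batch could carry its own peer, state, and retry count, so a failure retries only that component:

```rust
const MAX_RETRIES: usize = 3; // illustrative limit, not a Lighthouse constant

/// Per-component request state, mirroring the lookup-sync pattern described above.
enum ComponentState<T> {
    Downloading { peer: String, retries: usize },
    Downloaded(Vec<T>),
    Failed,
}

impl<T> ComponentState<T> {
    /// Retry only this component on a new peer; sibling requests are untouched.
    fn on_failure(&mut self, new_peer: String) {
        match self {
            ComponentState::Downloading { peer, retries } if *retries < MAX_RETRIES => {
                *peer = new_peer;
                *retries += 1;
            }
            ComponentState::Downloading { .. } => *self = ComponentState::Failed,
            _ => {}
        }
    }
}

/// The "components_by_range" request grouping the individual component states.
struct ComponentsByRangeRequest<Block, Blob, Column> {
    blocks: ComponentState<Block>,
    blobs: ComponentState<Blob>,
    /// One entry per custody-column request; each fails and retries on its own.
    columns: Vec<ComponentState<Column>>,
}
```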
Your feedback requested here 👇
Going that direction would add a bunch of lines to sync, so first of all I want to check that you agree with the problem, and that it's actually necessary to make each request retry-able.