state sync: improvements from mocknet testing #12507
Conversation
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff            @@
##           master    #12507    +/-  ##
=========================================
  Coverage    70.12%    70.12%
=========================================
  Files          841       841
  Lines       169839    169901     +62
  Branches    169839    169901     +62
=========================================
+ Hits        119099    119149     +50
- Misses       45597     45614     +17
+ Partials      5143      5138      -5
```
Looks good!
```rust
    }
    res
}
.instrument(tracing::debug_span!("StateSyncDownloader::ensure_shard_part_downloaded"))
```
Change name of debug span
This PR contains several fixes which improve the speed and robustness of state sync:

- **All part requests to peers are now made before all cloud attempts.** Previously we focused on obtaining specific parts one by one, which could cause a thread to block for a long time until a particular part was uploaded to cloud storage. It takes tens of minutes after the epoch ends for dumper nodes to write all state parts to cloud storage, whereas peer hosts are ready to serve all requests as soon as the epoch ends.
- **Part request order is randomized at each syncing node**, preventing spikes in demand to specific hosts.
- **An unnecessary check for state headers when serving state parts is removed.** In some cases this check prevented peer hosts which do not track all shards from responding successfully to part requests.

Before these changes, it took up to 75 minutes for nodes to download parts for the largest shard (38.8 GiB in 1324 parts). After these changes:

* Nodes consistently finish downloading parts in under 15 min,
* State requests to peer hosts have a failure rate below 1%,
* and 100% of parts are successfully obtained from peer hosts within three requests.

<img width="1374" alt="Screenshot 2024-11-24 at 7 17 39 AM" src="https://github.com/user-attachments/assets/90537548-514b-49b6-87aa-e08b21a24f86">

-----

Additional minor improvements:

- Adds a separate config parameter to specify how long to wait after a failed cloud download. This allows nodes to avoid spamming requests to cloud storage before parts have been uploaded.
- Adds metrics recording the number of cache hits and misses when serving state part requests.
- Distinguishes different types of errors collected in the `near_state_sync_download_result` metric.

Closes issues #12497, #12498, #12499

Once merged this PR should be cherry-picked into the 2.4 release. cc @staffik
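The randomized part-request ordering can be illustrated with a minimal sketch. This is not the PR's actual code (nearcore has its own RNG and download machinery); `shuffle_part_ids` is a hypothetical helper, written against std only so the example is self-contained. The point is that each syncing node requests the same set of parts, but in its own random order, so demand is spread evenly across peer hosts:

```rust
use std::collections::hash_map::RandomState;
use std::hash::{BuildHasher, Hasher};

/// Shuffle the order in which state part IDs will be requested.
/// Uses a Fisher-Yates shuffle driven by a small xorshift PRNG seeded
/// from `RandomState`, so no external crates are needed.
fn shuffle_part_ids(part_ids: &mut [u64]) {
    // `| 1` guarantees a nonzero seed, which xorshift requires.
    let mut seed = RandomState::new().build_hasher().finish() | 1;
    let mut next = || {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        seed
    };
    for i in (1..part_ids.len()).rev() {
        let j = (next() as usize) % (i + 1);
        part_ids.swap(i, j);
    }
}

fn main() {
    // e.g. 1324 parts in the largest shard, as mentioned in the PR.
    let mut ids: Vec<u64> = (0..1324).collect();
    shuffle_part_ids(&mut ids);

    // Every part is still requested exactly once, just in a random order.
    let mut sorted = ids.clone();
    sorted.sort();
    assert_eq!(sorted, (0..1324).collect::<Vec<u64>>());
    println!("first five requests: {:?}", &ids[..5]);
}
```

Because each node draws its own seed, two nodes syncing the same shard almost certainly hit peer hosts in different orders, which is what flattens the request spikes.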
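For readers configuring a node, the new wait-after-failed-cloud-download setting would live alongside the existing `state_sync` section of the node's `config.json`. The fragment below is purely illustrative: the field name `retry_backoff_seconds_after_cloud_failure` and its placement are hypothetical, chosen here only to show the shape of such a setting; consult the PR diff for the actual name and default.

```json
{
  "state_sync": {
    "retry_backoff_seconds_after_cloud_failure": 60
  }
}
```

The intent, per the description above, is that a node which fails a cloud download waits this long before retrying, rather than hammering cloud storage during the window (tens of minutes) before dumper nodes finish uploading parts.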