
state sync: improvements from mocknet testing #12507

Merged
merged 5 commits into near:master from saketh-state-sync on Dec 4, 2024

Conversation

saketh-are
Collaborator

@saketh-are saketh-are commented Nov 24, 2024

This PR contains several fixes which improve the speed and robustness of state sync:

  • All part requests to peers are now made before all cloud attempts. Previously we focused on obtaining specific parts one by one, which could cause a thread to block for a long time until a particular part was uploaded to cloud storage. It takes tens of minutes after the epoch ends for dumper nodes to write all state parts to cloud storage, whereas peer hosts are ready to serve all requests as soon as the epoch ends. (A rough sketch of this request flow follows the list.)
  • Part request order is randomized at each syncing node, preventing spikes in demand to specific hosts.
  • Removes an unnecessary check for state headers when serving state parts. In some cases this was preventing peer hosts which do not track all shards from responding successfully to part requests.
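
Below is a minimal, hypothetical Rust sketch of the peer-first, randomized request flow described in the first two bullets. It is not the nearcore implementation; `StatePart`, `request_from_peer`, and `request_from_cloud` are placeholder names introduced only for illustration.

```rust
// Hypothetical sketch of the new download strategy: request every part from
// peer hosts first, in randomized order, and fall back to cloud storage only
// for parts that peers could not provide.
use rand::seq::SliceRandom;

/// Placeholder for a downloaded state part.
struct StatePart(u64);

/// Stand-in for a request to a peer host (may fail or time out).
fn request_from_peer(part_id: u64) -> Option<StatePart> {
    Some(StatePart(part_id))
}

/// Stand-in for a download from external (cloud) storage.
fn request_from_cloud(part_id: u64) -> Option<StatePart> {
    Some(StatePart(part_id))
}

fn download_parts(num_parts: u64) -> Vec<StatePart> {
    // Randomize the request order so that syncing nodes do not all ask the
    // same hosts for the same parts at the same time.
    let mut part_ids: Vec<u64> = (0..num_parts).collect();
    part_ids.shuffle(&mut rand::thread_rng());

    let mut downloaded = Vec::new();
    let mut missing = Vec::new();

    // Phase 1: try every part against peer hosts, which can serve requests
    // as soon as the epoch ends.
    for &id in &part_ids {
        match request_from_peer(id) {
            Some(part) => downloaded.push(part),
            None => missing.push(id),
        }
    }

    // Phase 2: fetch only the remaining parts from cloud storage, which can
    // lag the end of the epoch by tens of minutes.
    for id in missing {
        if let Some(part) = request_from_cloud(id) {
            downloaded.push(part);
        }
    }

    downloaded
}
```

Compared to the previous part-by-part approach, a part that has not yet been uploaded to cloud storage no longer blocks progress on the remaining parts.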

Before these changes, it took up to 75 minutes for nodes to download parts for the largest shard (38.8 GiB in 1324 parts). After these changes:

  • Nodes consistently finish downloading parts in under 15 min,
  • State requests to peer hosts have a failure rate below 1%,
  • and 100% of parts are successfully obtained from peer hosts within three requests.
<img width="1374" alt="Screenshot 2024-11-24 at 7 17 39 AM" src="https://github.com/user-attachments/assets/90537548-514b-49b6-87aa-e08b21a24f86">

Additional minor improvements:

  • Adds a separate config parameter to specify how long to wait after a failed cloud download. This allows nodes to avoid spamming requests to cloud storage before parts have been uploaded.
  • Adds metrics recording the number of cache hits and misses when serving state part requests.
  • Distinguishes different types of errors collected in the `near_state_sync_download_result` metric. (A metrics sketch follows this list.)
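
As a rough illustration of the last two bullets, here is a hypothetical sketch using the `prometheus` crate directly (nearcore defines its metrics through its own wrappers); apart from `near_state_sync_download_result`, which the PR names, the metric name, label names, and label values below are invented for the example.

```rust
// Hypothetical metrics sketch: a counter for cache hits/misses when serving
// state part requests, and a download-result counter labeled by result kind.
use once_cell::sync::Lazy;
use prometheus::{register_int_counter_vec, IntCounterVec};

static STATE_PART_CACHE_REQUESTS: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "state_part_cache_requests_total", // invented name for the sketch
        "State part requests served, by cache outcome",
        &["outcome"]
    )
    .unwrap()
});

static STATE_SYNC_DOWNLOAD_RESULT: Lazy<IntCounterVec> = Lazy::new(|| {
    register_int_counter_vec!(
        "near_state_sync_download_result", // metric named in the PR
        "State part download results, by source and result kind",
        &["source", "result"]
    )
    .unwrap()
});

/// Record whether a state part request was served from the cache.
fn record_cache_lookup(hit: bool) {
    let outcome = if hit { "hit" } else { "miss" };
    STATE_PART_CACHE_REQUESTS.with_label_values(&[outcome]).inc();
}

/// Record the outcome of one part download; `result` can distinguish error
/// kinds (e.g. "success", "timeout", "invalid_part") instead of collapsing
/// every failure into a single bucket.
fn record_download_result(source: &str, result: &str) {
    STATE_SYNC_DOWNLOAD_RESULT
        .with_label_values(&[source, result])
        .inc();
}
```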

Closes issues #12497, #12498, #12499

Once merged, this PR should be cherry-picked into the 2.4 release. cc @staffik


codecov bot commented Nov 24, 2024

Codecov Report

Attention: Patch coverage is 86.17886% with 17 lines in your changes missing coverage. Please review.

Project coverage is 70.12%. Comparing base (fc9aaa3) to head (0dd7a1c).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| chain/chain/src/metrics.rs | 50.00% | 7 Missing ⚠️ |
| chain/client/src/sync/state/network.rs | 0.00% | 4 Missing ⚠️ |
| chain/client/src/sync/state/external.rs | 84.21% | 2 Missing and 1 partial ⚠️ |
| chain/client/src/sync/state/downloader.rs | 91.66% | 1 Missing and 1 partial ⚠️ |
| chain/chain/src/chain.rs | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff           @@
##           master   #12507   +/-   ##
=======================================
  Coverage   70.12%   70.12%           
=======================================
  Files         841      841           
  Lines      169839   169901   +62     
  Branches   169839   169901   +62     
=======================================
+ Hits       119099   119149   +50     
- Misses      45597    45614   +17     
+ Partials     5143     5138    -5     
| Flag | Coverage Δ |
| --- | --- |
| backward-compatibility | 0.16% <0.00%> (-0.01%) ⬇️ |
| db-migration | 0.16% <0.00%> (-0.01%) ⬇️ |
| genesis-check | 1.29% <4.95%> (+<0.01%) ⬆️ |
| linux | 69.36% <80.48%> (+0.01%) ⬆️ |
| linux-nightly | 69.72% <86.17%> (+<0.01%) ⬆️ |
| pytests | 1.59% <4.95%> (+<0.01%) ⬆️ |
| sanity-checks | 1.40% <4.95%> (+<0.01%) ⬆️ |
| unittests | 69.94% <86.17%> (+<0.01%) ⬆️ |
| upgradability | 0.20% <0.00%> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown.

@saketh-are saketh-are marked this pull request as ready for review November 24, 2024 17:41
@saketh-are saketh-are requested a review from a team as a code owner November 24, 2024 17:41
Contributor

@VanBarbascu VanBarbascu left a comment

Looks good!

}
res
}
.instrument(tracing::debug_span!("StateSyncDownloader::ensure_shard_part_downloaded"))
Contributor

Change name of debug span

@saketh-are saketh-are enabled auto-merge December 1, 2024 20:54
@saketh-are saketh-are added this pull request to the merge queue Dec 2, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 2, 2024
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 4, 2024
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Dec 4, 2024
@saketh-are saketh-are enabled auto-merge December 4, 2024 14:13
@saketh-are saketh-are added this pull request to the merge queue Dec 4, 2024
Merged via the queue into near:master with commit 71e0447 Dec 4, 2024
27 checks passed
@saketh-are saketh-are deleted the saketh-state-sync branch December 4, 2024 14:59
staffik pushed a commit that referenced this pull request Dec 4, 2024