
Observe mainnet IO traffic during release window for nearcore version 1.29 #7635

Closed
Tracked by #7634
jakmeier opened this issue Sep 19, 2022 · 1 comment
Labels
A-storage Area: storage and databases



jakmeier commented Sep 19, 2022

During a release, all nodes are restarted to run on the new binary. This clears the in-memory shard cache for trie nodes.

Make sure block processing time stays reasonable during that window. Most importantly, we expect the prefetcher hit rate on shard 3 to be high for the first few hours and to drop only once the shard cache hit rate approaches 95%.
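That expectation can be phrased as a simple invariant over two metrics. The sketch below is hypothetical (the names `shard_cache_hit_rate` and `prefetch_share`, and the 5% threshold, are mine, not nearcore metric names):

```rust
/// Hypothetical check mirroring the expectation above: while the shard
/// cache is still warming up (hit rate below ~95%), the prefetcher
/// should be covering a meaningful share of reads; once the cache is
/// warm, the prefetcher share is allowed to fall off.
fn cache_warming_as_expected(shard_cache_hit_rate: f64, prefetch_share: f64) -> bool {
    if shard_cache_hit_rate < 0.95 {
        // Cold-ish cache: the prefetcher should be doing real work.
        prefetch_share > 0.05
    } else {
        // Warm cache: a low prefetcher share is fine.
        true
    }
}

fn main() {
    // First hours after a restart: cache cold, prefetcher busy -> ok.
    assert!(cache_warming_as_expected(0.86, 0.12));
    // Cache cold but prefetcher idle would be a red flag.
    assert!(!cache_warming_as_expected(0.86, 0.01));
}
```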

jakmeier (Author) commented

The release has not happened yet, but we have already simulated the same behavior using canary nodes on mainnet with version 1.29.0-rc.4 (the version currently on testnet).

My takeaways so far:

Looking good, as expected:

  • High prefetcher hit rate in the first 24 hours after a restart: around 12% of all chunk cache misses are prefetched, and 86% hit in the shard cache.
  • The prefetcher hit rate drops in the following 24 hours: about 91% hit in the shard cache, 9% are prefetched.
  • The trend continues for the next 24 hours: here I see 93% in the shard cache, 7% prefetched.
  • Running with a small shard cache, we only hit 62% in the shard cache and prefetch 34%. These numbers are very stable right from the start.
  • No prefetch failures.
  • No prefetches requested on shards 0, 1, and 2 when only the sweatcoin prefetcher is enabled.
  • Memory requirements of the prefetcher are very low (capped at 200 MB, but it never exceeds ~40 MB).
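For reference, the percentages above are shares of total trie-node reads, split by where the read was served. A small sketch of that arithmetic, with a hypothetical counter struct (nearcore exposes comparable numbers as Prometheus metrics, but these names are illustrative only):

```rust
// Hypothetical counters; the struct and field names are illustrative,
// not nearcore's actual metrics API.
struct TrieReadCounters {
    shard_cache_hits: u64, // served from the in-memory shard cache
    prefetch_hits: u64,    // served from prefetched staging data
    db_reads: u64,         // had to go to RocksDB on the main thread
}

impl TrieReadCounters {
    fn total(&self) -> u64 {
        self.shard_cache_hits + self.prefetch_hits + self.db_reads
    }
    /// Fraction of trie-node reads served by the shard cache.
    fn shard_cache_rate(&self) -> f64 {
        self.shard_cache_hits as f64 / self.total() as f64
    }
    /// Fraction of trie-node reads served by the prefetcher.
    fn prefetch_rate(&self) -> f64 {
        self.prefetch_hits as f64 / self.total() as f64
    }
}

fn main() {
    // Roughly the "first 24 hours" split reported above.
    let c = TrieReadCounters { shard_cache_hits: 86, prefetch_hits: 12, db_reads: 2 };
    println!("shard cache: {:.0}%", 100.0 * c.shard_cache_rate());
    println!("prefetched:  {:.0}%", 100.0 * c.prefetch_rate());
}
```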

Not quite as expected:

  • We see the main-thread retries counter going up on all shards, even when no prefetch requests are sent. I can only explain this with forks, which cause two chunks on the same shard to be processed simultaneously. The second thread accessing the same trie node then waits for the other to fetch the data first. The numbers in the "prefetch pending" counter confirm this theory. But it exposes an inefficiency in this case: we read from the DB twice, even though the second thread could read from the shard cache. (It's probably okay, as we are still reading from the RocksDB cache.)
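The wait-on-pending behavior described above can be modeled with a staging map guarded by a mutex and condvar. This is an illustrative sketch, not nearcore's actual implementation; in this simplified model the waiting thread reuses the fetched value, whereas the inefficiency observed above is that the real code ends up reading from the DB twice:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Illustrative model of a "prefetch pending" staging area; the names
// and structure are hypothetical, not nearcore's actual types.
#[derive(Clone)]
enum Slot {
    Pending,       // some thread is already fetching this trie node
    Done(Vec<u8>), // the fetched value
}

struct Staging {
    map: Mutex<HashMap<u64, Slot>>,
    cond: Condvar,
}

impl Staging {
    /// Fetch `key`, deduplicating concurrent requests: the first caller
    /// marks the slot Pending and fetches; later callers wait on the
    /// condvar until the slot is Done.
    fn get_or_fetch(&self, key: u64, fetch: impl Fn(u64) -> Vec<u8>) -> Vec<u8> {
        let mut map = self.map.lock().unwrap();
        loop {
            match map.get(&key).cloned() {
                Some(Slot::Done(v)) => return v,
                Some(Slot::Pending) => {
                    // Second chunk on the same shard: wait for the
                    // in-flight fetch instead of starting another.
                    map = self.cond.wait(map).unwrap();
                }
                None => {
                    map.insert(key, Slot::Pending);
                    drop(map); // release the lock during the slow fetch
                    let v = fetch(key); // simulated DB read
                    let mut map = self.map.lock().unwrap();
                    map.insert(key, Slot::Done(v.clone()));
                    self.cond.notify_all();
                    return v;
                }
            }
        }
    }
}

fn main() {
    let staging = Arc::new(Staging {
        map: Mutex::new(HashMap::new()),
        cond: Condvar::new(),
    });
    // Two threads racing for the same trie node, as with forked chunks.
    let handles: Vec<_> = (0..2)
        .map(|_| {
            let s = Arc::clone(&staging);
            thread::spawn(move || s.get_or_fetch(7, |k| k.to_le_bytes().to_vec()))
        })
        .collect();
    for h in handles {
        assert_eq!(h.join().unwrap(), 7u64.to_le_bytes().to_vec());
    }
}
```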
