Determine impact of missing back-off #2622
20 validators without parachains have been running for ~20 hours, producing ~9000 unfinalized blocks. Memory usage growth is roughly linear and homogeneous across all the validators (I'm not averaging the graphs because a few outliers would skew the average significantly). Taking one of the validators as a sample, allocated memory grew from 650 MB to 1.27 GB, roughly doubling, over 20 hours. Metrics used:
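A quick back-of-the-envelope check of those numbers, assuming the extra memory is mostly attributable to the accumulated unfinalized blocks (an assumption, not something the metrics confirm):

```rust
// Rough arithmetic on the figures reported above:
// ~650 MB -> ~1.27 GB over ~20 h while ~9000 unfinalized blocks piled up.
fn main() {
    let start_mb = 650.0_f64;
    let end_mb = 1270.0_f64;
    let hours = 20.0_f64;
    let unfinalized_blocks = 9000.0_f64;

    let growth_mb = end_mb - start_mb; // ~620 MB
    println!("growth per hour: ~{:.0} MB", growth_mb / hours); // ~31 MB/h
    println!(
        "overhead per unfinalized block: ~{:.0} KB",
        growth_mb * 1024.0 / unfinalized_blocks // ~70 KB/block
    );
}
```

So if the growth really scales with the unfinalized chain, we are looking at on the order of 70 KB of retained allocations per unfinalized block.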
Out of curiosity, in the …
After this test, the Versi validators were reset to the master branch. The finality lag was ~9000 blocks; it started to catch up quickly but stalled again at ~1000 unfinalized blocks, and the lag started to grow once more. Also, while the vast majority of validators reported the same finalized block height, some of them stalled at lower heights and didn't progress (see graph). We decided to reset the network for now, but that behavior should be investigated further later.
We deployed a modified version of the latest release on a Kusama validator we use to burn in changes. The image we used is built by CI in #2649. Surprisingly, the node didn't OOM as expected; instead it stopped allocating memory and went out of sync due to I/O wait times after starting to hammer RocksDB and using 100% of the disk bandwidth for reads. Quoting @BulatSaif's findings from a different ticket:

> The 100% CPU usage was caused by iowait. Possible scenario: the application ran out of memory, which evicted all the free memory that was being used as disk cache for RocksDB, so the node essentially lost its connection to RocksDB since all requests would time out. Interestingly, the application did not allocate more memory; if it had, it would simply have been restarted.

CPU usage looked like this, most likely due to a greatly reduced page cache (around 600 MB):
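To confirm the page-cache squeeze described above on a burn-in node, one option is to watch `/proc/meminfo` alongside the node's RSS. A minimal Linux-only sketch (a standalone helper, not part of the node or our dashboards):

```rust
use std::{collections::HashMap, fs, thread, time::Duration};

/// Parse /proc/meminfo into a map of field name -> value in KiB.
fn meminfo_kb() -> HashMap<String, u64> {
    fs::read_to_string("/proc/meminfo")
        .unwrap_or_default()
        .lines()
        .filter_map(|line| {
            let mut parts = line.split_whitespace();
            let key = parts.next()?.trim_end_matches(':').to_string();
            let value = parts.next()?.parse().ok()?;
            Some((key, value))
        })
        .collect()
}

fn main() {
    // Print the page cache and available memory every 10 seconds; if "Cached"
    // keeps shrinking while the validator's RSS grows, RocksDB reads will
    // increasingly hit the disk and iowait will climb.
    loop {
        let mem = meminfo_kb();
        println!(
            "Cached: {} MB, MemAvailable: {} MB",
            mem.get("Cached").unwrap_or(&0) / 1024,
            mem.get("MemAvailable").unwrap_or(&0) / 1024
        );
        thread::sleep(Duration::from_secs(10));
    }
}
```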
First warnings appear: https://grafana.teleport.parity.io/goto/fOxsnBDSR?orgId=1
Runtime errors
Logs: https://grafana.teleport.parity.io/goto/O5174fvSg?orgId=1
Block pinning fills the cache
Logs: https://grafana.teleport.parity.io/goto/7Csv4BvIg?orgId=1
We only pin one block per session, so it should be limited to at most 6 pinned blocks (see the sketch after this comment):
Trie cache warnings
Logs: https://grafana.teleport.parity.io/goto/_uTP4fvIR?orgId=1
03:52:16.141 WARN tokio-runtime-worker trie-cache: Timeout while trying to acquire a write lock for the shared trie cache

Libp2p kademlia errors
Logs: https://grafana.teleport.parity.io/goto/OT2zSBvSR?orgId=1
19:40:14.627 WARN tokio-runtime-worker libp2p_kad::behaviour: Locating closest peers for replication failed: Err(Timeout { key: Key(b"\x9a\x9b\x99G>\x13\xea\x16\x07\xd0\x0e9\x91\x91\xce\xa4G\xe6&6\x866R\xec\xda\0\xbf\xb8\xa2\xa2;%"), success: [], quorum: 20 })

Subsystem task usage
https://grafana.teleport.parity.io/goto/DT1MIfDIg?orgId=1
Things look normal until 19:00, when there is a huge gap of missing metric data points. This is the exact point where we reach maximum memory usage for the node: https://grafana.teleport.parity.io/goto/uQ0IIBDSg?orgId=1
This is explained by the latency introduced by DB reads no longer being serviced from the page cache.

IO wait
Next step:
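Regarding the pinning limit mentioned above, here is a hypothetical illustration of the invariant "at most one pinned block per session, at most 6 sessions retained". The type and method names are made up for the example; this is not the actual client pinning code:

```rust
use std::collections::BTreeMap;

type SessionIndex = u32;
type BlockHash = [u8; 32];

/// How many recent sessions keep a pinned block (assumed retention window).
const MAX_PINNED_SESSIONS: usize = 6;

#[derive(Default)]
struct SessionPins {
    /// At most one pinned block hash per session index.
    pins: BTreeMap<SessionIndex, BlockHash>,
}

impl SessionPins {
    /// Pin a block for `session`, replacing any previous pin for that session
    /// and dropping pins for sessions that fall out of the retention window.
    fn pin(&mut self, session: SessionIndex, hash: BlockHash) {
        self.pins.insert(session, hash);
        while self.pins.len() > MAX_PINNED_SESSIONS {
            // BTreeMap keys are ordered, so the first entry is the oldest session.
            let oldest = *self.pins.keys().next().expect("len > 0; qed");
            self.pins.remove(&oldest);
        }
    }

    fn pinned_count(&self) -> usize {
        self.pins.len()
    }
}
```

If the cache nevertheless fills up, either blocks are being pinned more than once per session or the unpinning side is not keeping up.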
In #2510, block authoring backoff was disabled for production networks. This is the tracking issue for tests in which we try to determine how that change affects validator memory usage at different network scales and in different scenarios.
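For context, here is a simplified sketch of the kind of finality-lag backoff that was disabled: once the best block is more than a slack's worth of blocks ahead of the last finalized block, the validator skips an increasing number of authoring slots. The parameter names follow Substrate's `BackoffAuthoringOnFinalizedHeadLagging`, but this is an illustration of the idea, not the actual implementation:

```rust
struct FinalityLagBackoff {
    /// How many unfinalized blocks are tolerated before backing off at all.
    unfinalized_slack: u64,
    /// Larger bias means the backoff grows more slowly with the lag.
    authoring_bias: u64,
    /// Upper bound on how many slots may be skipped between authored blocks.
    max_interval: u64,
}

impl FinalityLagBackoff {
    /// Returns true if the validator should skip authoring in `slot_now`.
    fn should_backoff(
        &self,
        best_number: u64,
        finalized_number: u64,
        slot_of_best_block: u64,
        slot_now: u64,
    ) -> bool {
        let lag = best_number.saturating_sub(finalized_number);
        if lag <= self.unfinalized_slack {
            return false;
        }
        // The further finality lags behind, the more slots get skipped,
        // capped at `max_interval`.
        let interval = ((lag - self.unfinalized_slack) / self.authoring_bias.max(1))
            .min(self.max_interval);
        slot_now <= slot_of_best_block + interval
    }
}
```

Without this brake, validators keep authoring at full speed while finality lags, which is exactly the condition under which the unfinalized chain, and with it memory usage, can grow without bound.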
Validators are deployed from the #2574 image with the following changes to the usual logic: