Reports of bursts of high disk R/Ws on v0.12.4
#3216
Seeing reports from BN node runners of high disk R/Ws on v0.12.4 in bursts of ~12 hours.

Comments
The logs below were taken at a moment of 85% CPU load and 97% I/O:

Feb 27 19:37:46 qupra-7900x celestia[290693]: 2024-02-27T19:37:46.979Z INFO header/store store/store.go:375 new head {"height": 879415, "hash": "E31E753744D9EC1F8EDE8421E39B9D4CF309509E465E3ADEDD3D9054335A6F1C"}
I think the above logs don't tell much; more verbose logs or metrics would be required. We are pushing our metrics with otel-collector (our node ID is below). If you need any particular logs, screenshots, or data, please tell us. I'll post some screenshots from our node-exporter below. The server is solely dedicated to the Celestia bridge node, with no other services running on it. Specs: 7900X (12-core) and a fast NVMe drive.
The first time we detected this issue was on 05 Jan, when we were running celestia-node v0.12.1. Our findings back then (note: we're using NVMe drives): in our case, open files are not the issue; it seems to us that some peer or peers are the cause. We checked the logs from that time and found the following relevant errors:
A 4-month graph shows a clear improvement in resource usage on our bridge since we blocked incoming p2p traffic (red line = ufw deny).
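For illustration, a minimal sketch of what that mitigation would look like at the libp2p layer instead of the firewall: a ConnectionGater that refuses every inbound connection while still allowing outbound dials. This is not celestia-node code and celestia-node does not necessarily expose such a knob; the firewall rule above remains the simpler operational fix.

```go
package main

import (
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/connmgr"
	"github.com/libp2p/go-libp2p/core/control"
	"github.com/libp2p/go-libp2p/core/network"
	"github.com/libp2p/go-libp2p/core/peer"
	ma "github.com/multiformats/go-multiaddr"
)

// denyInboundGater allows all outbound dials but rejects every inbound
// connection attempt, roughly what "ufw deny" on the p2p port achieves.
type denyInboundGater struct{}

var _ connmgr.ConnectionGater = (*denyInboundGater)(nil)

func (denyInboundGater) InterceptPeerDial(peer.ID) bool               { return true }
func (denyInboundGater) InterceptAddrDial(peer.ID, ma.Multiaddr) bool { return true }

// InterceptAccept is called for incoming connections; returning false drops them.
func (denyInboundGater) InterceptAccept(network.ConnMultiaddrs) bool { return false }

func (denyInboundGater) InterceptSecured(network.Direction, peer.ID, network.ConnMultiaddrs) bool {
	return true
}

func (denyInboundGater) InterceptUpgraded(network.Conn) (bool, control.DisconnectReason) {
	return true, 0
}

func main() {
	// Build a libp2p host that never accepts inbound p2p traffic.
	h, err := libp2p.New(libp2p.ConnectionGater(denyInboundGater{}))
	if err != nil {
		panic(err)
	}
	defer h.Close()
	fmt.Println("peer ID:", h.ID())
}
```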
We had a spike in our I/O too, especially on the read side. Below are the bridge node logs during the high I/O reads. We are using 2 NVMe drives in RAID 0. We are seeing repeated entries like the ones below in the logs:
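For reference, a small sketch of how such a read spike can be quantified outside the node: it polls /proc/diskstats and prints per-interval read and write throughput for one device. The device name ("nvme0n1") and the 5-second interval are assumptions; this only illustrates what the node-exporter disk panels measure and is not something celestia-node provides. Linux only.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// sectorsFor returns the cumulative sectors read and written for the given
// block device, parsed from /proc/diskstats (fields 5 and 9, 512-byte sectors).
func sectorsFor(dev string) (read, written uint64, err error) {
	data, err := os.ReadFile("/proc/diskstats")
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(data), "\n") {
		f := strings.Fields(line)
		if len(f) < 10 || f[2] != dev {
			continue
		}
		read, _ = strconv.ParseUint(f[5], 10, 64)
		written, _ = strconv.ParseUint(f[9], 10, 64)
		return read, written, nil
	}
	return 0, 0, fmt.Errorf("device %q not found", dev)
}

func main() {
	const dev = "nvme0n1" // assumed device name; adjust for your setup
	const interval = 5 * time.Second

	prevR, prevW, err := sectorsFor(dev)
	if err != nil {
		panic(err)
	}
	for range time.Tick(interval) {
		r, w, err := sectorsFor(dev)
		if err != nil {
			panic(err)
		}
		// Sectors are 512 bytes; convert the per-interval delta to MiB/s.
		toMiBps := func(delta uint64) float64 {
			return float64(delta) * 512 / interval.Seconds() / (1 << 20)
		}
		fmt.Printf("%s read %.1f MiB/s, write %.1f MiB/s\n",
			dev, toMiBps(r-prevR), toMiBps(w-prevW))
		prevR, prevW = r, w
	}
}
```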
TCP connections on the server over that period:
This is unlikely to be a peering issue; it is most likely a compaction run triggered by indexing in badger. There is not much we can do about it at this point, as it's a fundamental flaw in the current storage mechanism. The proper solution is in active implementation and will need a few months to land. Meanwhile, if this is really an issue that kills disks, we can explore pruning for the inverted index.
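To make the "pruning for the inverted index" idea concrete, here is a rough sketch against the badger API: scan a key prefix, delete its entries through a write batch, then run value-log GC so the space is actually reclaimed. The store path and the "idx/" prefix are made up for illustration; this is not celestia-node code, and running anything like it against a real store would break the node.

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	db, err := badger.Open(badger.DefaultOptions("/tmp/example-badger")) // assumed path
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	prefix := []byte("idx/") // hypothetical inverted-index key prefix

	// Collect keys under the prefix and delete them via a write batch,
	// so large indexes don't hit transaction size limits.
	wb := db.NewWriteBatch()
	defer wb.Cancel()

	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // keys only; values are not needed for deletion
		opts.Prefix = prefix
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			if err := wb.Delete(it.Item().KeyCopy(nil)); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}

	// Deletes only write tombstones; value-log GC is what rewrites log files
	// and frees disk space (ErrNoRewrite just means there was nothing to collect).
	if err := db.RunValueLogGC(0.5); err != nil && err != badger.ErrNoRewrite {
		log.Fatal(err)
	}
}
```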
Progress on storage upgrades here: #2971
In our case the problem shows up only as CPU load. A restart always solves it. We tested on very high-performance bare metal (Intel + NVMe Gen4) and on VMware clusters with 8x32Gbit FC NVMe SAN storage in private data centers.

Note: the current graphs refer to testnet, as the mainnet bridge node was recently shut down because we were not selected again for delegation (despite the details of our infrastructure, with 6/8 dedicated enterprise nodes). Here are the chain details and the connections graph (without changes). We are available to provide further details.
I can confirm the problem on both chains, but as I indicated, I was forced to shut down the mainnet bridge node and no longer have its history. Regarding the problem on testnet, a restart of the process fixes it every time. These could be unrelated or connected issues; one possibility is to study it on testnet first and then verify it on mainnet, where the problem is most intense.
This issue is still unresolved for us; our server is currently affected by it, and its CPU usage is not normal.
Closing because Shwap addressed such issues. Thanks!