
Reports of bursts of high disk R/Ws on v0.12.4 #3216

Closed
renaynay opened this issue Feb 27, 2024 · 12 comments
Labels
area:storage bug Something isn't working

Comments

@renaynay
Member

Seeing reports from bridge node (BN) runners of high disk R/Ws on v0.12.4, in bursts of ~12 hours.

@renaynay renaynay added bug Something isn't working area:storage labels Feb 27, 2024
@Suleymann85

The logs below were taken at a moment of 85% CPU load and 97% I/O.

Feb 27 19:37:46 qupra-7900x celestia[290693]: 2024-02-27T19:37:46.979Z INFO header/store store/store.go:375 new head {"height": 879415, "hash": "E31E753744D9EC1F8EDE8421E39B9D4CF309509E465E3ADEDD3D9054335A6F1C"}
Feb 27 19:37:58 qupra-7900x celestia[290693]: 2024-02-27T19:37:58.753Z INFO header/store store/store.go:375 new head {"height": 879416, "hash": "25B3D05E50BA5C1F96DDFF5DD7EE4BF7E21442088BF0BD729C0EE60BEC9CB136"}
Feb 27 19:38:09 qupra-7900x celestia[290693]: 2024-02-27T19:38:09.826Z INFO header/store store/store.go:375 new head {"height": 879417, "hash": "0F87FFC4A2E61BF0C6E81315412079B6BDC660FB094A643C19F214A9D623E6D3"}
Feb 27 19:38:21 qupra-7900x celestia[290693]: 2024-02-27T19:38:21.835Z INFO header/store store/store.go:375 new head {"height": 879418, "hash": "9B5687A2A1C94D1CF69EA46F29B44D72997B23806ECD05F484046683C328C7CC"}
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.046Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:33 qupra-7900x celestia[290693]: 2024-02-27T19:38:33.739Z INFO header/store store/store.go:375 new head {"height": 879419, "hash": "528D22D54B2B8D16BEA06BDFB5BD638AEB024CDC1DA796EE66374853C35E991A"}
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 121107130 miss: 4241640680 keys-added: 285626328 keys-updated: 483835 keys-evicted: 285626311 cost-added: 2291943524773320 cost-evicted: 2291943262078825 sets-dropped: 31297 sets-rejected: 3810601313 gets-dropped: 111778176 gets-kept: 4250854976 gets-total: 4362747810 hit-ratio: 0.03
Feb 27 19:38:45 qupra-7900x celestia[290693]: 2024-02-27T19:38:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:38:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:38:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:38:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:38:45 qupra-7900x celestia[290693]: Count: 285624116
Feb 27 19:38:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:38:45 qupra-7900x celestia[290693]: [0, 2) 285196410 99.85% 99.85%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:38:45 qupra-7900x celestia[290693]: --
Feb 27 19:38:46 qupra-7900x celestia[290693]: 2024-02-27T19:38:46.064Z INFO header/store store/store.go:375 new head {"height": 879420, "hash": "407EB57A22E6260C4A3F0D2AA1E2AE0642C744B8500DDFCFEE9CD5403612C89D"}
Feb 27 19:38:52 qupra-7900x celestia[290693]: 2024-02-27T19:38:52.055Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:38:57 qupra-7900x celestia[290693]: 2024-02-27T19:38:57.919Z INFO header/store store/store.go:375 new head {"height": 879421, "hash": "6400B3E90304777540288AB6362912756AB0F9ED06D160EA25A117CBCEFCBC64"}
Feb 27 19:39:09 qupra-7900x celestia[290693]: 2024-02-27T19:39:09.922Z INFO header/store store/store.go:375 new head {"height": 879422, "hash": "16CB14378210F6249F3F09E3B5C50E8AC5E3DBD3866C53D39EB70B48CB9349BA"}
Feb 27 19:39:21 qupra-7900x celestia[290693]: 2024-02-27T19:39:21.733Z INFO header/store store/store.go:375 new head {"height": 879423, "hash": "B666D118EDF9214952D86C31F1C2ECF82554B7629801D580C09E430DAB8D91E0"}
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.197Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: received a stateless reset with token b9caffa2f9721c3b7362f27cc154618a
Feb 27 19:39:31 qupra-7900x celestia[290693]: 2024-02-27T19:39:31.693Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: stream reset
Feb 27 19:39:33 qupra-7900x celestia[290693]: 2024-02-27T19:39:33.743Z INFO header/store store/store.go:375 new head {"height": 879424, "hash": "0B17EB891D6B08E6D8CEC1282C254A766011CA157D2BDC78475EF1CB63675987"}
Feb 27 19:39:39 qupra-7900x celestia[290693]: 2024-02-27T19:39:39.642Z INFO bitswap-server server/server.go:547 Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460 Block cache might be too small. Metrics: hit: 121163085 miss: 4244080664 keys-added: 285820179 keys-updated: 483976 keys-evicted: 285820162 cost-added: 2293655197376295 cost-evicted: 2293654954690637 sets-dropped: 31297 sets-rejected: 3812847305 gets-dropped: 111778560 gets-kept: 4253350464 gets-total: 4365243749 hit-ratio: 0.03
Feb 27 19:39:45 qupra-7900x celestia[290693]: 2024-02-27T19:39:45.602Z WARN badger v4@v4.0.0-20231125230536-2b9e13346f75/db.go:461 Cache life expectancy (in seconds):
Feb 27 19:39:45 qupra-7900x celestia[290693]: -- Histogram:
Feb 27 19:39:45 qupra-7900x celestia[290693]: Min value: 0
Feb 27 19:39:45 qupra-7900x celestia[290693]: Max value: 150
Feb 27 19:39:45 qupra-7900x celestia[290693]: Count: 285817967
Feb 27 19:39:45 qupra-7900x celestia[290693]: 50p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 75p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: 90p: 2.00
Feb 27 19:39:45 qupra-7900x celestia[290693]: [0, 2) 285390261 99.85% 99.85%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [2, 4) 279923 0.10% 99.95%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [4, 8) 105346 0.04% 99.99%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [8, 16) 32620 0.01% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [16, 32) 8313 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [32, 64) 1380 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [64, 128) 118 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: [128, 256) 6 0.00% 100.00%
Feb 27 19:39:45 qupra-7900x celestia[290693]: --
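
For context, the "Block cache might be too small" warning refers to Badger's block cache, which is sized through Badger's Options. celestia-node manages its own Badger instance, so the snippet below is only a minimal sketch of the knob the warning points at; the store path and sizes are hypothetical, not the node's actual configuration:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Hypothetical store path; this only illustrates the options involved.
	opts := badger.DefaultOptions("/tmp/example-badger").
		// Block cache: the cache the warning is about. Raising it above the
		// default (256 MiB in Badger v4's DefaultOptions) can improve the
		// hit ratio reported in the warning, at the cost of more RAM.
		WithBlockCacheSize(1 << 30). // 1 GiB
		// Index cache (table indices and bloom filters) is a separate knob.
		WithIndexCacheSize(512 << 20) // 512 MiB

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```

A hit ratio of 0.03 like the one logged above means almost every block lookup misses the cache and goes to disk, which is consistent with the high read I/O being reported.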

@ilhanu

ilhanu commented Feb 28, 2024

I don't think the above logs tell much; more verbose logs or metrics would be required. We are posting our metrics with otel-collector; see our node ID below. If you need anything in terms of logs/screenshots/data, please tell us.

I'll post some screenshots from our node-exporter below. The server is dedicated solely to the Celestia bridge node; no other services are running on it.

Specs: 7900X (12-core) and a fast NVMe drive.
node-id: 12D3KooWRH2pkayUDqmCAXUpQSjX1pmmftPPNEaedEdUSEBRfLJH
OS: 22.04
kernel: 6.5.0-15-generic

Range: Last 2 Days
image

Range: Last 2 Days
image

@rares-e2s

rares-e2s commented Feb 28, 2024

The first time we detected this issue was on 05 Jan; we were using celestia-node v0.12.1.

Our findings back then (note: we're using NVMe drives):

In our case, open files are not the issue.
Yes, it has ~7k open files, but the CPU and load increase are extreme.
It all started at 9:30 UTC (11:30 in our Grafana, which uses GMT+2).
CPU load increased from an average of 1 to 12, a 12x increase.
Disk: it was at 542 read IOPS and 44 write IOPS; now it has 17k (yes, thousands) read IOPS and 4.6k write IOPS.
Since 9:30 UTC we also noticed an increase in active TCP ESTABLISHED connections, which I correlated to active P2P connections.
Then I blocked all incoming traffic towards the P2P port in the firewall and restarted the node. CPU/disk/network load is back to "low" values (also visible in the graphs: the baseline is much lower after this step).

image
image
image
image
image

It seems to us that one or more peers are the cause.

We checked the logs from that time and found the following relevant errors:

2024-01-05T11:28:59+02:00	2024-01-05T11:28:59.488+0200	ERROR	header/p2p	p2p/server.go:157	server: writing header to stream	{"err": "i/o deadline reached"}
2024-01-05T11:29:00+02:00	2024-01-05T11:29:00.302+0200	INFO	bitswap-server	server/server.go:547	Bitswap Client ReceiveError: stream reset
2024-01-05T11:29:00+02:00	2024-01-05T11:29:00.613+0200	ERROR	header/p2p	p2p/server.go:157	server: writing header to stream	{"err": "stream reset"} 

A 4-month graph shows a clear improvement in resource usage on our bridge since we blocked incoming p2p traffic (red line = ufw deny).

image

@Staking7pc

We had a spike in our I/O too, especially the reads. Below are the bridge node logs during the high I/O reads. We are using 2 NVMe drives in RAID 0.

IO graphs
image
image

Seeing repeated entries like the following in the logs:

Feb 28 11:09:41 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:41.019+0100        INFO        header/store        store/store.go:375        new head        {"height": 883855, "hash": "762641C9AB7F543B835063483575DFE627FFC874B9D8E92555765776702E2676"}
Feb 28 11:09:53 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:09:53.224+0100        INFO        header/store        store/store.go:375        new head        {"height": 883856, "hash": "BF8BBEF57BC796672B48BF4B09790ABC3D346C3D857FAB00584DAC66712655FE"}
Feb 28 11:10:02 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:02.274+0100        INFO        bitswap-server        server/server.go:547        Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.044+0100        INFO        header/store        store/store.go:375        new head        {"height": 883857, "hash": "2154691F4B77A2CBE1C44BBD90A3B83A554B4C38646E0EA08A7521EC27A2EE07"}
Feb 28 11:10:07 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:07.452+0100        INFO        bitswap-server        server/server.go:547        Bitswap Client ReceiveError: Application error 0x0 (remote)
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.538+0100        INFO        bitswap-server        server/server.go:547        Bitswap Client ReceiveError: stream reset
Feb 28 11:10:18 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:18.798+0100        INFO        header/store        store/store.go:375        new head        {"height": 883858, "hash": "7D70166FBD419C3A99192C44E6914657896802E14577C05CB34D8B756EEEE812"}
Feb 28 11:10:22 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:22.874+0100        INFO        bitswap-server        server/server.go:547        Bitswap Client ReceiveError: stream reset
Feb 28 11:10:30 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:30.805+0100        INFO        header/store        store/store.go:375        new head        {"height": 883859, "hash": "3AC06314404321D2784A65E711714F3C8126EDD6D17C1C51DAD756472FF0451D"}
Feb 28 11:10:33 BsSui-mainRPC celestia[2497452]: 2024-02-28T11:10:33.626+0100        WARN        badger        v4@v4.0.0-20231125230536-2b9e13346f75/db.go:460        Block cache might be too small. Metrics: hit: 187054252 miss: 5161187805 keys-added: 544306772 keys-updated: 1442029 keys-evicted: 544306743 cost-added: 5447645463552796 cost-evicted: 5447645199982723 sets-dropped: 1408 sets-rejected: 4483552722 gets-dropped: 25108032 gets-kept: 5321904256 gets-total: 5348242073 hit-ratio: 0.03

TCP connections on the server over that period:

image

@Aderks

Aderks commented Mar 26, 2024

On v0.13.0 we're also seeing logs and resource usage similar to the above, mainly very high read usage.

image

Blocking p2p port 2121 didn't seem to help our bridge node. Logs resemble what has already been posted.

@Wondertan
Member

This is unlikely to be a peering issue; it's more likely a compaction run triggered by indexing in Badger. There is not much we can do about it at this point, as it's a fundamental flaw in the current storage mechanism. The solution exists and is in an active implementation phase, but it will need a few months to land.

Meanwhile, if this is really an issue that's killing disks, we can explore pruning for the inverted index.
cc @renaynay and @walldiss
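
To make the pruning idea concrete, here is a minimal sketch of what dropping stale inverted-index entries could look like at the Badger level. The key prefix and store path are made up for illustration; this is not celestia-node's actual index layout or the planned implementation:

```go
package main

import (
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Hypothetical store path, for illustration only.
	db, err := badger.Open(badger.DefaultOptions("/tmp/example-badger"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// "Pruning" at the Badger level: drop every key under a given prefix in
	// one call. DropPrefix removes the keys and lets compaction reclaim the
	// space, which is much cheaper than deleting entries one by one in
	// transactions. The prefix below is hypothetical.
	if err := db.DropPrefix([]byte("/index/pruned-range/")); err != nil {
		log.Fatal(err)
	}
}
```

The point is that shrinking the inverted index reduces the amount of data Badger has to rewrite during compaction, which is where the bursts of read/write I/O are suspected to come from.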

@MSevey
Member

MSevey commented May 2, 2024

Progress on storage upgrades here: #2971

@activenodes

In our case the problem manifests by affecting only the CPU load.

image

A restart always solves the problem.

Tested on very high-performance bare metal (Intel + NVMe Gen4) and on VMware clusters with 8x32Gbit FC NVMe SAN storage in private data centers.

image

Note:
Memory has been reduced from 128GB to 32GB (since it is not being used).

image

Current:
image

One last consideration: the current graphs refer to the testnet, as the mainnet bridge node was recently shut down because we were not selected again for delegation (despite the details of our infrastructure with 6/8 dedicated enterprise nodes).

Here are the chain details and connections graph (without changes):

image

We are available to provide further details...

@mogoll92

mogoll92 commented May 7, 2024

We don't see this kind of load on the Mocha bridge; there are some spikes, but that's okay. With the mainnet bridge, however, the situation is terrible.
Screenshot 2024-05-07 at 12 16 19
Screenshot 2024-05-07 at 12 16 01
Screenshot 2024-05-07 at 12 15 46
Screenshot 2024-05-07 at 12 15 27

The logs don't say a lot; we just noticed that this Bitswap error appears much more frequently than on testnet:

INFO	bitswap-server	server/server.go:547	Bitswap Client ReceiveError: Application error 0x0 (remote)

@activenodes

I can confirm the problem on both chains, but as I indicated, I was forced to shut down the mainnet bridge node and no longer have the history.

Regarding the problem on testnet, I can say that restarting the process fixes it every time.

They could be unrelated or connected issues; a possible approach is to study it on the testnet setup first and then verify it on mainnet, where the problem is most intense.

@Jeongseup

This issue still isn't resolved; our server is currently affected by it, and server CPU usage is not normal.

@cristaloleg
Contributor

Closing because Shwap addressed such issues. Thanks!
