RPC node (kusama) stuck following Error importing block. #4064
Comments
There were no logs between
No, exactly. I think that something is causing a hiccup here.
At 09:08:06: Unknown error: Client(UnknownBlock("State already discarded for BlockId::Hash(0x68c9231a95d26c65887e540f51373ede8e2d1ffcb59cc412e86edf992e9dae7d)")). This seems like a lead to me.
No, that should not be a problem. I mean, 10 minutes with no output sounds like the entire machine was halted?
I think we might have found something that could have caused this. We had started this RPC node in pruning mode after being hit earlier by a mysterious bad-block bug (which forced us to use a snapshot database). We have now changed that back and see much better behaviour. We will provide more details to you soon, since I guess this can be of future relevance anyway. The machine itself is fine even though the service seems blocked.
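For anyone comparing the two modes discussed above, a rough sketch of the corresponding start-up commands (flags taken from the polkadot CLI of that era; the port and the trimmed flag set are illustrative):

# Pruned mode: only the state of the last 1000 blocks is kept, so RPC queries
# for older blocks fail with "State already discarded" errors like the one above.
polkadot --chain kusama --pruning=1000 --unsafe-pruning --ws-external --ws-port 9966

# Archive mode: the full state history is kept, at the cost of a much larger database.
polkadot --chain kusama --pruning=archive --ws-external --ws-port 9966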
Just want to confirm that the problem still exists when running in archive mode.
@Maharacha what problem? Without any logs we cannot do anything.
Yeah, sorry for that. The thing is that we don't really have any useful logs. It's like the RPC service gets overloaded as soon as we start allowing connections to it. Already at ~30 connections we get this behavior. We have tried to configure the proxy a lot. We have tried with both
To give some more information: the node seems to sync fine when looking at Telemetry. The block time is good.
In the normal log we can sometimes see these messages:
I have attached a log in debug mode as well. We are only using one RPC node at a time (no load balancing). We can see about ~200 connections to it.
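For context, numbers like these can be gathered roughly as follows (a sketch; the port matches our --ws-port 9966 setting, and the log target names are an assumption rather than confirmed against the exact binary):

# Count established TCP connections to the node's WebSocket port.
ss -tn state established '( sport = :9966 )' | tail -n +2 | wc -l

# Restart the node with more verbose logging to produce a debug log like the attached one.
polkadot --chain kusama --ws-external --ws-port 9966 -l rpc=debug,sync=debug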
Hey @bkchr - we are experiencing some serious performance issues with the RPC node, which at times almost comes to a halt, and we have been in contact with others experiencing performance issues, although perhaps not as serious as ours. We have been trying to reach out for help for some time to get to the bottom of these issues. We end up in a situation where the RPC service chokes almost to death as early as around 100 connections, sometimes even at 20, sometimes at 200+. We have collected a lot of intel and are going to run the service on bare metal to see if this makes any difference compared to running in containers. But given our setup, we have reason to suspect this is very much related to the RPC service itself... We are really in need of getting in contact with dev/ops on this matter. How can we find assistance here to look deeper with us? @Maharacha
This is how it looks in the logs:
Oct 26 23:17:40 juju-ce0f07-4 polkadot[970]: 2021-10-26 23:17:40 💤 Idle (32 peers), best: #9828969 (0x7b8f…71f8), finalized #9828966 (0x98be…276f), ⬇ 311.4kiB/s ⬆ 679.7kiB/s
We have now tested the RPC service on totally different setups: one on AWS and one on OVH. The machines were less powerful than our own, since it is clearly not a performance issue at the hardware level, but they had 8 cores and 32 GB of RAM, which should be more than enough. They were clean Ubuntu machines (we tried both focal and xenial) with an ext4 filesystem; no ZFS or LXD involved at all. We get the same behavior: the node is in sync all the time (one block every 6 s), both in the log and on telemetry. However, using the polkadot.js Network->Explorer, it can sometimes take 1 minute to receive new blocks.
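One simple way to see the same symptom without the polkadot.js UI is to time a single RPC call against the node (a sketch assuming the default HTTP RPC port 9933 that --rpc-external exposes; a healthy node answers this in milliseconds):

# Fetch the latest block header over JSON-RPC and time the round trip.
time curl -s -H 'Content-Type: application/json' \
  -d '{"id":1,"jsonrpc":"2.0","method":"chain_getHeader","params":[]}' \
  http://127.0.0.1:9933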
We have pursued this issue to the furthest possible degree now. We're providing here a summary/report on the issue.

Steps taken:
We first suspected there was a misconfiguration or native bottleneck in our own infrastructure. But from extensive analysis across multiple different, independent setups we concluded that the performance bottleneck is in the RPC code itself. The current code simply cannot make efficient use of the available machine resources. After trying hard to mitigate the badly performing system, we (Dwellir) consulted with community experts and OnFinality (who also run RPC services). That meeting validated our suspicion that a performant solution can currently only be achieved by brute-force addition of system resources to work around the core performance issues. Generally, RPC availability under load is achieved by balancing relatively few clients across multiple RPC instances, i.e. a brute-force approach to scalability (see the sketch after this report). This solution is expensive, as each instance needs to synchronize its own database or rely on complex infrastructure to mitigate that. Today, this is the only way to provide a scaling RPC service.

Next steps:
Dwellir is fully committed to providing a performant RPC service to the community. We have invested in dedicated hardware resources to operate the service going forward. We are in the process of tuning and tailoring a deployment in our private data centers. We are refactoring our automation framework to fit the brute-force method of running the RPC nodes and finding smart ways to do this as cost-efficiently as possible. This is non-trivial, but we have evaluated a few approaches and homed in on a solution we can start testing at scale. We have also brought in a new team member to get familiar with the codebase and start doing performance analysis. We hope to discover whether it is feasible to improve the performance of the RPC service to make it more efficient. Improving the RPC code will benefit the whole community and other public RPC providers. We hope to collaborate closely with the developer community to achieve this, if possible. We (Dwellir) will post an updated Kusama Treasury proposal that takes into account the current brute-force solution needed to run an RPC service at scale. If we discover it is possible to improve the RPC code to scale better, we will apply for funding to drive that development at full speed. We think this issue can now be closed, as the underlying problem has been identified.

Attribution:
Many thanks to OnFinality, Parity and Paradox at Paranodes.io for taking the time to assist and help us understand better the nature of this issue. Feel free to close this issue.
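As an illustration of the brute-force approach described above, running several independent RPC instances behind a load balancer looks roughly like this (a sketch; paths and ports are placeholders, the balancer itself is not shown, and each instance keeps its own copy of the database):

# Instance 1: its own database directory and WebSocket port.
polkadot --chain kusama --base-path /data/ksm-rpc-1 --ws-external --ws-port 9966 --ws-max-connections 400 &

# Instance 2: same chain, separate database and port; haproxy (or similar)
# then spreads client connections across the instances.
polkadot --chain kusama --base-path /data/ksm-rpc-2 --ws-external --ws-port 9967 --ws-max-connections 400 &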
For reference: the problem was a too-low cache size and is automatically solved by this PR.
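Until a binary containing that fix is deployed, one possible workaround is to raise the cache size by hand, along these lines (a sketch; the flag and value are illustrative and may not be the exact cache the PR adjusts):

# Start the node with a larger database cache (value in MiB) than the old default.
polkadot --chain kusama --db-cache 4096 --ws-external --ws-port 9966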
We are running the polkadot binary in RPC mode, providing the service publicly at: kusama-rpc.dwellir.com
We experienced a problem today with the service getting "stuck" and not responding. We are trying to get to the bottom of why.
The polkadot RPC service is behind an SSL-terminating haproxy and worked as normal until today.
The event starts at 09:09, when we get the first alarm from our monitoring stack: "Service unavailable". We expect about 1 minute of delay, so the event likely occurred at about 09:08. (Attaching logs from the polkadot service from that time.) The haproxy logs are silent and normal.
09:20 - Start investigation.
Can't find any obvious reason why.
At 09:40 - Restart of polkadot service.
It takes significant time for the polkadot service to recover: about 10 minutes before the service is OK again.
At 10:00 - The service is OK.
wss://kusama-rpc.dwellir.com
The polkadot service is started as:
polkadot --name "🛡 DWELLIR KSM RPC 🛡" --chain kusama --prometheus-external --pruning=1000 --unsafe-pruning --ws-external --ws-port 9966 --ws-max-connections 400 --rpc-external --rpc-cors all --rpc-methods Safe
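Since --rpc-external also exposes the HTTP RPC endpoint, a liveness probe of the kind our monitoring relies on can be approximated like this (a sketch assuming the default HTTP RPC port 9933; the real check goes through haproxy and the public wss endpoint):

# Ask the node whether it is healthy and still has peers; the "Service unavailable"
# alarm corresponds to a request like this no longer getting an answer.
curl -s -H 'Content-Type: application/json' \
  -d '{"id":1,"jsonrpc":"2.0","method":"system_health","params":[]}' \
  http://127.0.0.1:9933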
Monitoring screenshot below.