Incident: Kusama - Mangata Parachain halt for ~30 mins #6758
Which Polkadot branch are you using?
At that time there were quite a few disputes happening. But looking at the metrics, they seem to have been handled well. Load looks normal, and ToFs are very good as well. So while this might be related, it is not obvious how.
In the logs, worrying but before the incident time frame:
Also interesting, there are a couple of:
so we seem to be having three forks, but even more interesting there is no Seems like the only thing that could go wrong here without an immediate log (maybe higher level) is produce candidate. There are PoV size messages much later, but way less than We have up to 5 forks:
Oh no, it gets way worse later:
Interestingly, those excessive forks occur after it had already recovered. Also those
How was CPU load on that collator during that time frame?
If they are using AURA and it isn't the turn of their collator (slot-wise), nothing more is printed. So, this is completely fine.
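To illustrate why a collator stays silent outside its slot, here is a minimal sketch of AURA's round-robin author selection: the slot number is the timestamp divided by the slot duration, and the author is the authority at index `slot % number_of_authorities`. The function names, the 12-second slot duration, and the five-collator set are assumptions for illustration only, not values taken from Mangata's configuration.

```python
# Sketch of AURA round-robin slot assignment.
# Names and parameter values are hypothetical, chosen for illustration.

def aura_slot(timestamp_ms: int, slot_duration_ms: int) -> int:
    """Slot number for a given Unix timestamp in milliseconds."""
    return timestamp_ms // slot_duration_ms


def aura_author_index(slot: int, num_authorities: int) -> int:
    """Index of the authority that is allowed to author in this slot."""
    return slot % num_authorities


def is_my_turn(timestamp_ms: int, slot_duration_ms: int,
               num_authorities: int, my_index: int) -> bool:
    """True only when the collator at my_index owns the current slot."""
    slot = aura_slot(timestamp_ms, slot_duration_ms)
    return aura_author_index(slot, num_authorities) == my_index


if __name__ == "__main__":
    SLOT_DURATION_MS = 12_000  # assumed slot duration
    NUM_COLLATORS = 5          # assumed collator set size
    for t in range(0, 60_000, SLOT_DURATION_MS):
        slot = aura_slot(t, SLOT_DURATION_MS)
        print(t, slot, aura_author_index(slot, NUM_COLLATORS))
```

With five collators, any single collator owns only one slot in five, so its log shows no block production for the other four slots even when everything is healthy.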
These are no forks.
These are block
Sorry, I do not have such info. But I think this is not about a single collator machine, since all the collators stopped producing blocks. I am going to try asking some other collators if they have such metrics. We have these, if that helps:
Thanks for the investigation, @eskimor, @bkchr. 🙏
Here is a description of the same incident on Shiden side.
Collators are running 0.9.31? I think I have an idea of what was going on. This release does not yet contain: #6440 - this issue is even worse on collators than on validators, because the accumulated rate limit will always be Upgrading nodes to a more recent release will very likely fix the issue. What I think happened is:
Most of the time the dispute rate was around 0.15, which means each active leaf update took around 8.1 seconds, right below the 10s timeout. The worst case of 0.2 was the peak: It recovered afterwards. I believe this is the reason we did not see any subsystem unresponsive errors.
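The numbers above can be checked with a small back-of-the-envelope sketch. It assumes the time per active leaf update scales linearly with the dispute rate, which is an assumption inferred from the two figures quoted in this comment (0.15 and 8.1 seconds), not a formula taken from the Polkadot code. All names are hypothetical.

```python
# Back-of-the-envelope check of the figures quoted in the comment.
# ASSUMPTION: delay per active leaf update grows linearly with dispute rate.

TIMEOUT_S = 10.0        # timeout mentioned in the comment
OBSERVED_RATE = 0.15    # typical dispute rate during the incident
OBSERVED_DELAY_S = 8.1  # delay per active leaf update at that rate


def estimated_delay_s(rate: float) -> float:
    """Linear extrapolation of the per-update delay from the observed point."""
    return OBSERVED_DELAY_S * rate / OBSERVED_RATE


def rate_at_timeout() -> float:
    """Dispute rate at which the extrapolated delay reaches the timeout."""
    return TIMEOUT_S * OBSERVED_RATE / OBSERVED_DELAY_S


if __name__ == "__main__":
    print(f"delay at 0.15: {estimated_delay_s(0.15):.1f} s")
    print(f"delay at 0.20: {estimated_delay_s(0.20):.1f} s")
    print(f"rate hitting the timeout: {rate_at_timeout():.3f}")
```

Under that assumption, the peak rate of 0.2 corresponds to roughly 10.8 seconds per update, slightly over the 10s timeout, while the typical rate of 0.15 stays just under it. That is consistent with a stall that only briefly crossed the threshold and then recovered.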
TL;DR: With #6440 in place, this should no longer happen. Please upgrade your nodes. I will also prioritize this one, because there is no reason those subsystems should be active on collator nodes at all.
Ahh that is a nice explanation @eskimor! I had also seen in the logs that the delay between importing a relay chain block and triggering a collation increased over time, which should also perfectly match your explanation!
We had a small incident and we cannot determine its root cause.
The incident happened between 2023-02-20 23:25:01 - 2023-02-20 23:41:10 CET.
We had a chain halt during that time, and we do not understand why.
Attached are my collator logs, in case they help!
logs_incident_23_25_to_23_41.txt
Network: Kusama
Parachain: Mangata - 2110