Disputes Causing Unresponsiveness #6412
Comments
When I restarted my node the first time, I had this log (it wasn't responding):
Logs from our validators. First I had thousands of these:
and shortly after:
another validator:
I only posted these two because the rest seem to be identical.
Thanks for the logs, folks. The working hypothesis is that the dispute-coordinator got overloaded and is becoming unresponsive. We now have a few corroborations of this theory from a few different places. If anyone has more logs which contain the phrase "appears unresponsive", please post them here. And if by any chance there are folks running with
It is interesting to note in both @tugytur's examples that the dispute coordinator was considered unresponsive and then afterwards issued a log "Fetch for approval votes cancelled" followed by "New dispute initiated for candidate". This might indicate the code branch where the dispute coordinator is stalling.
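For anyone who wants to check their own journals for that phrase, a minimal sketch (assuming the node runs as a systemd unit named polkadot.service, as mentioned later in this thread; the log file path is only an example, adjust to your setup):

```bash
# Search the systemd journal for the overseer warning mentioned above.
sudo journalctl -u polkadot.service --no-pager | grep -i "appears unresponsive"

# Same search against a plain log file, with a few lines of context
# before and after each hit (path is an assumption).
grep -i -B 5 -A 5 "appears unresponsive" /var/log/polkadot.log
```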
Thanks @CertHum-Jim - hiding to reduce noise. To avoid noise, please do not post logs unless your node has specifically gone down or has initiated a dispute; but if in doubt, please post and we will hide if not useful.
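If you are unsure whether your node actually initiated a dispute before posting, a quick sketch of the kind of check that helps with the triage above (the phrase matches the dispute log lines quoted elsewhere in this thread; the unit name is an assumption):

```bash
# Count how many disputes this node has initiated since the last boot.
sudo journalctl -u polkadot.service -b --no-pager \
  | grep -c "New dispute initiated for candidate"
```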
I have posted logs below. I had 2 of 5 active-set validators seize up and fail to continue working or exit out with a failure. The log sets below are for the 2 nodes that went offline.
@senseless Thanks. The nodes here didn't go down, but did seize up and failed to sync at a reasonable rate. Presumably related to the "Flooding" logs.
Posting from one node,
I invoked the stop manually when the error was detected. I should have logs from two other nodes; shall I post those as well?
Yes, please.
When they seized they failed to output any additional logs in the log file until I restarted the node.
Were those the absolute last logs in the log file? If so, something must have been cut off...
@paradox-tt, to confirm, the node outputting those logs did not actually shut down but just hung for 10 minutes until stopped manually?
Yes, the last line was the last line in the log file.
To reduce noise I used
This node hung to the point that it did not produce a block or heartbeat in the session and was slashed. I was able to restart and re-declare the intent for the next session. Timing:
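For pulling out just the window around an incident like the one above, a sketch using journalctl's time filters (the timestamps here are placeholders, not the actual incident times; the unit name is an assumption):

```bash
# Export roughly the window in which the node hung, for posting or inspection.
sudo journalctl -u polkadot.service \
  --since "2022-12-08 23:30:00" --until "2022-12-09 00:10:00" \
  --no-pager > incident-window.log
```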
That is correct.
One more, very similar to the above:
I'm staying up for another 10 minutes, then heading to sleep (it's 3:40 AM). Don't hesitate to post any follow-up questions; I'll glance at them and take action before retiring for the night.
2022-12-08 23:40:56 New dispute initiated for candidate. candidate_hash=0x5799d2aff22f22b837e6fa474724ba95587de8f65ae84c3afc5d348f59eb6e7c session=26550 traceID=11644153029854107226582139140959769> |
and the service won't start now |
@StakeHulk What logs do you get when starting the service?
Now it has started successfully. Seeing this log now:
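For anyone else whose service won't start, a sketch of how to capture the startup logs being asked about above (unit name assumed to be polkadot.service):

```bash
# Try to start the service, then show the most recent log lines it produced.
sudo systemctl restart polkadot.service
sudo journalctl -u polkadot.service -b --no-pager -n 200
```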
@StakeHulk Thanks for posting. This appears largely unrelated to this issue directly, but is worth filing as a separate issue. I will hide those comments in this thread.
I had a crashed node (Kusama, ParityDB) around 01 am with
It started with a lot of disputes, then runtime request not supported:
Then it functioned almost normally until a real crash:
I noticed the first dispute logs appeared at 00:25; at 00:40 the node completely hung. Uploading the log dump from that time period, hope it helps.
Logs from our validator:
Logs for our slashed validator:
We had two nodes in the active set, both of them crashed and one of them was also subsequently slashed as a result of this event. The slashed one
The other one
Also, both nodes show quite a lot of lines exactly every hour saying - and this keeps repeating itself for around one minute
Caught a small portion of logs before the service shutdown, before the server was rebooted.
Log from block #15675064 to #15675448.
Problems started around 00:29 (GMT+1) after block #1567503.
Here are my 5 cents, experienced much the same chaos as the rest. Good luck with resolving this, guys! 🤞🤞
Panic after the incident, times are UTC.
To be noted: the node did not crash, it was not chilled, and it was not slashed (I think the last two go hand in hand, but just for clarity); it resumed afterwards and is still running fine at the moment without a restart.
Hey team, is there any reason some validators went offline while others didn't? Were these disputes handled evenly by each validator, or is it that some handled these disputes while others didn't?
Not all validators are parachain validators; I guess that's the reason. Only those will actually participate in disputes, the others will just include votes in blocks.
I suspected that not all validators participate in disputes, thanks for confirming.
For the warning "Missing block weight for new head", it looks like it is a symptom, not a cause. On one of our validators:
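A sketch of how to check that ordering (symptom vs. cause) in your own journal, using the two phrases quoted in this thread (unit name is an assumption):

```bash
# Show the last occurrences of both messages so their relative timing is visible.
sudo journalctl -u polkadot.service --no-pager \
  | grep -E "Missing block weight for new head|New dispute initiated for candidate" \
  | tail -n 50
```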
Does https://github.com/paritytech/polkadot/releases/tag/v0.9.35 address this issue?
No, this is a runtime-only release; the fix for this issue will come in the client.
Who typically uses the runtime-only release? Should validators wait for a node client release? When is that intended?
This runtime will soon be proposed for upgrade. The client fix is currently being developed; the PR doing so will link to this issue.
#6440 is a potential fix for the issue. Initial testing is looking good (on Versi), but we need to bump up the dispute rate more aggressively to be sure.
Fix confirmed.
I'm unable to determine if what follows directly relates to this GitHub issue; however, the errors are similar, so I thought it may be beneficial to post here. I'm getting the 'dropping stream because buffer is full' error repeatedly/constantly on both of my validators immediately after upgrading from v0.9.33 to v0.9.36:
"dropping (Stream a5718300/101) because buffer is full"
Additionally, the polkadot service is in an "activating (auto-restart)" state, so this leads me to believe both validators are effectively down.
Dec 05 09:58:53 host15 polkadot[79378]: 2022-12-05 09:58:53 dropping (Stream ea849616/1089) because buffer is full
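A sketch of how to confirm whether the unit is genuinely crash-looping rather than just logging noisily (unit name is an assumption; the NRestarts property needs a reasonably recent systemd):

```bash
# Current state of the unit, including whether it is stuck in auto-restart.
systemctl status polkadot.service --no-pager

# How many times systemd has restarted it since boot, plus its current state.
systemctl show polkadot.service --property=NRestarts,ActiveState,SubState

# How often the buffer-full message has appeared since boot.
sudo journalctl -u polkadot.service -b --no-pager | grep -c "because buffer is full"
```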
I was able to get around this issue. While stopping and restarting the polkadot service had no effect on either node, rebooting the server itself appears to have resolved the issue on both of my nodes, which is quite interesting. It's as if something is left in memory after the polkadot service is stopped.
Why is your date stamp on these messages December 5th? A
Thanks for the call-out, I don't know how I failed to notice that. I recall the time was incrementing and matching the current time each time I ran 'sudo journalctl -u polkadot.service -b --no-pager', so I believe my log output was current/real-time. Perhaps the date was incorrect on the server itself and the restart fixed that. I will look into this more, thank you.
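A sketch of how to double-check the system clock and watch the log timestamps live (unit name as in the command above):

```bash
# Show the current system time and whether it is NTP-synchronized.
timedatectl

# Follow the unit's logs in real time with ISO timestamps for easy comparison.
sudo journalctl -u polkadot.service -f -o short-iso
```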
This issue has been mentioned on Polkadot Forum. There might be relevant details there: https://forum.polkadot.network/t/polkadot-release-analysis-v0-9-36/1529/1
Many had logs similar to this, with some parachain subsystems failing, but the node didn't crash.