All nodes in major sync, all nodes being stuck #1353
Comments
What node software are you using? Any custom modifications? And what are the CLI commands to start it?
Thanks for the report and even the initial investigation. My knowledge about the ChainSync is a bit rudimentary, but from my mental model the node should not stall even if it goes into major sync momentarily?
Using your test I was able to reproduce this; the node switches to major sync in between but still makes some progress. Latency of 2 seconds.
@fixxxedpoint I haven't had any time to look into this yet but I'll do so over the weekend.
I am not sure what you mean by
Our current implementation of
Ah yes, sorry, this was a misinterpretation of the logs on my part; the node is indeed stuck.
Did you have time to try this? Did it bring any change?
I tried it today with our sync disabled (link) and with 500ms latency. It failed in a similar fashion. I'll make sure today that it doesn't interact in any other way with substrate's sync.
Sometimes all of our nodes switch into `major sync` mode and get stuck at some block. After this, none of them create new blocks (due to being in `major sync`), finalization is also blocked, etc.

The easiest way to trigger this behavior is to introduce some network latency between nodes. We use AURA, so introducing network latency might force our nodes to create forks. If after some time a `Reorg` happens (after late block finalization), it can force all of our nodes to incorrectly assume that they are in `major sync`/catching up.

We tried to investigate further and found a possible cause for this behavior. We think it is caused by how ChainSync's `status` method computes its return value after a `Reorg`. When enough forks are introduced and a `Reorg` happens, it doesn't trigger any change in ChainSync's `PeerSync` collection (the `peers` field in ChainSync). Furthermore, another component of substrate changes the head of the chain after that `Reorg`. After this, the `status` method of `ChainSync` sees the new value of the head (smaller than previously seen) but still uses the old values from `PeerSync`. This way the gap between `median_seen` and `best_block` in `status` becomes bigger than `MAJOR_SYNC_BLOCKS` and forces the node into `major sync`.

A quick and dirty fix would be something like this: fix. Another easy and dirty fix would be to drop connections (it should reinitialize the `peers` field for each node...?).

This issue might be related to this one (just after a quick glance, so I might be totally wrong): #1157.