All nodes in `major sync`, all nodes being stuck #1353

fixxxedpoint · 2023-09-01T13:04:01Z

Sometimes all of our nodes switch into major sync mode and get stuck at some block. After this, none of them creates new blocks (because they are in major sync), finalization is also blocked, and so on. The easiest way to trigger this behavior is to introduce some network latency between nodes. We use AURA, so network latency can force our nodes to create forks. If a reorg happens after some time (after a late block finalization), it can make all of our nodes incorrectly assume that they are in major sync/catching up.

We investigated further and found a possible cause. We think it comes from how `ChainSync`'s `status` method computes its return value after a reorg. When enough forks are introduced and a reorg happens, nothing changes in `ChainSync`'s `PeerSync` collection (the `peers` field in `ChainSync`). Then, after that reorg, another Substrate component changes the head of the chain. The `status` method of `ChainSync` then sees the new head (lower than the one seen previously) but still uses the old values from `PeerSync`. As a result, the gap between `median_seen` and `best_block` in `status` becomes larger than `MAJOR_SYNC_BLOCKS`, which forces the node into major sync.

A quick and dirty fix would be something like this: fix. Another easy and dirty fix would be to drop connections (it should reinitialize the `peers` field for each node...?). This issue might be related to #1157 (based on a quick glance only, so I might be totally wrong).
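To make the suspected mechanism concrete, here is a minimal Rust sketch. It is not Substrate's actual code: `PeerSync` is a hypothetical stand-in with a single field, the threshold value is chosen only for illustration, and `is_major_syncing` is a made-up helper that mimics the comparison between `median_seen` and `best_block` described above.

```rust
/// Illustrative threshold only; the real constant is defined in Substrate's sync code.
const MAJOR_SYNC_BLOCKS: u64 = 5;

/// Hypothetical stand-in for a peer entry: the best block number the peer last reported.
struct PeerSync {
    best_number: u64,
}

/// Median of the best block numbers reported by peers, or `None` without peers.
fn median_seen(peers: &[PeerSync]) -> Option<u64> {
    let mut numbers: Vec<u64> = peers.iter().map(|p| p.best_number).collect();
    if numbers.is_empty() {
        return None;
    }
    numbers.sort_unstable();
    Some(numbers[numbers.len() / 2])
}

/// Assumed check: the node counts as major syncing when the peers' median
/// is ahead of the local best block by more than `MAJOR_SYNC_BLOCKS`.
fn is_major_syncing(peers: &[PeerSync], best_block: u64) -> bool {
    match median_seen(peers) {
        Some(target) => target > best_block && target - best_block > MAJOR_SYNC_BLOCKS,
        None => false,
    }
}

fn main() {
    // Peer entries recorded before the reorg; nothing refreshes them afterwards.
    let peers = vec![
        PeerSync { best_number: 110 },
        PeerSync { best_number: 111 },
        PeerSync { best_number: 112 },
    ];

    // Before the reorg the local head is close to the peers' median.
    assert!(!is_major_syncing(&peers, 109));

    // After the reorg the local head is lower, but the peer values are stale,
    // so the gap exceeds the threshold and the node reports major sync.
    assert!(is_major_syncing(&peers, 100));

    println!("stale peer data pushes the node into major sync");
}
```

If every node holds similarly stale peer entries, every node would reach the same conclusion at once, which would match the "all nodes stuck" symptom reported here.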


ggwpez · 2023-09-01T13:08:13Z

What node software are you using? Any custom modifications? And what are the CLI commands to start it?
cc @paritytech/sdk-node

fixxxedpoint · 2023-09-01T13:14:12Z

aleph-node. You can try to reproduce it using one of our tests. Link to test code. README describing how to use it: link. Setting latency to something larger than 1s should trigger it.

skunert · 2023-09-01T15:59:52Z

Thanks for the report and even initial investigation. My knowledge about the ChainSync is a bit rudimentary, but from my mental model the node should not stall even if it goes into major sync momentarily?
Adding this to the networking project. cc @altonen

skunert · 2023-09-01T16:30:14Z

Using your test I was able to reproduce this, the node is switching to major sync in between but still makes some progress. Latency of 2 seconds.
Logs:
026b7cd234e4.log
30d386a82c51.log
061e5b351f2e.log
68f55be5277b.log
97a5b262a020.log

altonen · 2023-09-01T16:37:29Z

@fixxxedpoint
I understand you're working on your own syncing implementation. If you've made any changes so far, have you ruled out that this issue is caused by those changes (if any) and confirmed that it is an issue in Substrate?

I haven't had any time to look into this yet but I'll do so over the weekend.

fixxxedpoint · 2023-09-07T15:25:27Z

> the node is switching to major sync in between but still makes some progress.

I am not sure what you mean by "still makes some progress". All the logs you provided end at a moment where everything is stuck (not in Sync mode, but still in Importing/Preparing).

> I understand you're working on your own syncing implementation. If you've made any changes so far, have you ruled out that this issue is caused by those changes (if any) and confirmed that it is an issue in Substrate?

Our current sync implementation runs alongside Substrate's sync and complements it. I will turn off our sync service and check whether I can still trigger the issue.

skunert · 2023-09-12T08:23:07Z

> I am not sure what you mean by "still makes some progress". All the logs you provided end at a moment where everything is stuck (not in Sync mode, but still in Importing/Preparing).

Ah yes, sorry, this was a misinterpretation of the logs on my part; the node is indeed stuck.

> Our current sync implementation runs alongside Substrate's sync and complements it. I will turn off our sync service and check whether I can still trigger the issue.

Did you have time to try this, did it bring any change?

fixxxedpoint · 2023-09-22T12:20:47Z

> Did you have time to try this, did it bring any change?

I tried it today with our sync disabled (link) and with 500ms latency. It failed in a similar fashion. Today I'll make sure that our code doesn't interact with Substrate's sync in any other way.


github-actions bot added the I10-unconfirmed Issue might be valid, but it's not yet known. label Sep 1, 2023

ggwpez added this to SDK Node Sep 1, 2023

github-project-automation bot moved this to backlog in SDK Node Sep 1, 2023

skunert added this to Networking Sep 1, 2023

altonen moved this to In Progress 🛠 in Networking Sep 1, 2023
