
All nodes in major sync, all nodes being stuck #1353

Open
fixxxedpoint opened this issue Sep 1, 2023 · 8 comments
Labels
I10-unconfirmed Issue might be valid, but it's not yet known.

Comments

@fixxxedpoint

Sometimes all of our nodes switch into major sync mode and get stuck at some block. After this, none of them create new blocks (since they are in major sync), finalization is also blocked, etc. The easiest way to trigger this behavior is to introduce some network latency between nodes. We use AURA, so introducing network latency may force our nodes to create forks. If a reorg then happens (after late block finalization), it can force all of our nodes to incorrectly assume that they are in major sync / catching up.

We tried to investigate further and found a possible cause for this behavior. We think it is caused by how ChainSync's status method computes its return value after a reorg. When enough forks are introduced and a reorg happens, it does not trigger any change in ChainSync's PeerSync collection (the peers field in ChainSync). Furthermore, after that reorg, another component of Substrate changes the head of the chain. The status method of ChainSync then sees the new value of the head (smaller than previously seen), but still uses the old values in PeerSync. This way the gap between median_seen and best_block in status becomes bigger than MAJOR_SYNC_BLOCKS and forces the node into major sync.

A quick and dirty fix would be something like this: fix. Another easy, dirty fix would be to drop connections (it should reinitialize the peers field for each node...?). This issue might be related to #1157 (just after a quick glance, so I might be totally wrong).
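For illustration only, a minimal, self-contained sketch of the kind of check described above (block numbers reduced to plain u64; MAJOR_SYNC_BLOCKS, median_seen, and best_block are modeled after the names mentioned in this report, not the actual Substrate code):

```rust
/// Simplified model of how a status-style check can decide the node is in
/// "major sync". In the real code the peer data lives in ChainSync's
/// `peers` collection and block numbers use the runtime's number type.
const MAJOR_SYNC_BLOCKS: u64 = 5;

/// Median of the best block numbers reported by connected peers.
fn median_seen(peer_best_numbers: &mut [u64]) -> Option<u64> {
    if peer_best_numbers.is_empty() {
        return None;
    }
    peer_best_numbers.sort_unstable();
    Some(peer_best_numbers[peer_best_numbers.len() / 2])
}

/// Returns true when the node would classify itself as being in major sync.
/// After a reorg, `best_block` can shrink while stale peer entries still
/// report the (now abandoned) fork heads, inflating the gap.
fn is_major_syncing(mut peer_best_numbers: Vec<u64>, best_block: u64) -> bool {
    match median_seen(&mut peer_best_numbers) {
        Some(median) => median.saturating_sub(best_block) > MAJOR_SYNC_BLOCKS,
        None => false,
    }
}

fn main() {
    // Peers still advertise heads from an abandoned fork...
    let peers = vec![120, 121, 122];
    // ...while our best block dropped back after the reorg.
    let best_block = 110;
    assert!(is_major_syncing(peers, best_block));
}
```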

@github-actions github-actions bot added the I10-unconfirmed Issue might be valid, but it's not yet known. label Sep 1, 2023
@ggwpez ggwpez added this to SDK Node Sep 1, 2023
@github-project-automation github-project-automation bot moved this to backlog in SDK Node Sep 1, 2023
@ggwpez
Member

ggwpez commented Sep 1, 2023

What node software are you using? Any custom modifications? And what are the CLI commands to start it?
cc @paritytech/sdk-node

@fixxxedpoint
Author

fixxxedpoint commented Sep 1, 2023

aleph-node. You can try to reproduce it using one of our tests. Link to test code. A README describing how to use it: link. Setting the latency to something larger than 1s should trigger it.

@skunert
Contributor

skunert commented Sep 1, 2023

Thanks for the report and even the initial investigation. My knowledge of ChainSync is a bit rudimentary, but based on my mental model the node should not stall even if it goes into major sync momentarily?
Adding this to the networking project. cc @altonen
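For context on why (incorrectly) being in major sync stalls block production: slot-based consensus in Substrate consults a sync oracle before authoring and skips the slot while it reports major sync. A rough, schematic sketch, assuming a SyncOracle-style trait with an is_major_syncing method (simplified, not the actual sc-consensus-slots code):

```rust
// Schematic illustration of why a node that believes it is in major sync
// stops authoring blocks: the slot worker bails out early on every slot
// while the oracle keeps reporting major sync.
trait SyncOracle {
    fn is_major_syncing(&self) -> bool;
}

// Models a ChainSync that is stuck reporting major sync after a reorg.
struct StuckOracle;

impl SyncOracle for StuckOracle {
    fn is_major_syncing(&self) -> bool {
        true
    }
}

fn on_slot<O: SyncOracle>(oracle: &O, slot: u64) {
    if oracle.is_major_syncing() {
        println!("slot {slot}: skipping block authoring, node reports major sync");
        return;
    }
    println!("slot {slot}: authoring a block");
}

fn main() {
    let oracle = StuckOracle;
    for slot in 0..3 {
        on_slot(&oracle, slot);
    }
}
```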

@skunert
Contributor

skunert commented Sep 1, 2023

Using your test I was able to reproduce this; the node is switching to major sync in between but still makes some progress. Latency of 2 seconds.
Logs:
026b7cd234e4.log
30d386a82c51.log
061e5b351f2e.log
68f55be5277b.log
97a5b262a020.log

@altonen altonen moved this to In Progress 🛠 in Networking Sep 1, 2023
@altonen
Contributor

altonen commented Sep 1, 2023

@fixxxedpoint
I understand you're working on your own syncing implementation. If you've made any changes so far, have you ruled out that this issue is caused by those changes and confirmed that it is an issue in Substrate?

I haven't had any time to look into this yet but I'll do so over the weekend.

@fixxxedpoint
Author

fixxxedpoint commented Sep 7, 2023

the node is switching to major sync in between but still makes some progress.

I am not sure what you mean by still makes some progress. All the logs you provided just end at the point where everything is stuck (not in Sync mode, but still in Importing/Preparing).

I understand you're working on your own syncing implementation. If you've made any changes so far, have you ruled out that this issue is caused by those changes and confirmed that it is an issue in Substrate?

Our current implementation of sync runs simultaneously with and complementary to Substrate's sync. I will turn off our sync service and check whether I can still trigger the issue.

@skunert
Contributor

skunert commented Sep 12, 2023

I am not sure what you mean by still makes some progress. All the logs you provided just end at the point where everything is stuck (not in Sync mode, but still in Importing/Preparing).

Ah yes, sorry, this was a misinterpretation of the logs on my part; the node is indeed stuck.

Our current implementation of sync runs simultaneously with and complementary to Substrate's sync. I will turn off our sync service and check whether I can still trigger the issue.

Did you have time to try this? Did it bring any change?

@fixxxedpoint
Author

Did you have time to try this? Did it bring any change?

I tried it today with our sync disabled (link) and with 500 ms latency. It failed in a similar fashion. I'll verify today that it doesn't interact with Substrate's sync in any other way.

bkchr pushed a commit that referenced this issue Apr 10, 2024
* cumulus: 4e952282914719fafd2df450993ccc2ce9395415
polkadot: 975e780
substrate: 89fcb3e

* fix refs

* sync changes from paritytech/polkadot#3828

* sync changes from paritytech/polkadot#4387

* sync changes from paritytech/polkadot#3940

* sync with changes from paritytech/polkadot#4493

* sync with changes from paritytech/polkadot#4958

* sync with changes from paritytech/polkadot#3889

* sync with changes from paritytech/polkadot#5033

* sync with changes from paritytech/polkadot#5065

* compilation fixes

* fixed prometheus endpoint startup (it now requires to be spawned within tokio context)