
Free more full node slots on the network #519

Open
dmitry-markin opened this issue Feb 20, 2023 · 2 comments
dmitry-markin commented Feb 20, 2023

Problem statement

Due to a bug (#526, fixed by paritytech/substrate#13396 and later reverted), nodes currently do not detect being kicked off by a remote peer during warp sync. Because of this, the node stays connected on other protocols, including the block request protocol, and continues syncing even though the remote node has all of its full node slots occupied. Once the warp sync is over, the local node sends a block announcement, instantly learns that the set 0 (block announcements) notification stream was closed by the remote node, and only then discovers that it was rejected. This leads to the peer count dropping after the warp sync, as described in #528. Because the disconnect on the local side happens with a delay (only after sending out a block announcement), our node keeps counting peers that have in fact rejected it, so the reported peer count is higher than it should be, and communication continues on non-default protocols (non-zero peersets).
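
To make the failure mode concrete, here is a hypothetical, self-contained model of that timeline (plain Rust, not Substrate code; all names are illustrative): the remote closes the block-announce (set 0) substream during warp sync, but the local node only corrects its peer count when it first tries to announce a block.

```rust
// Hypothetical model of the delayed-disconnect problem; not Substrate code.
use std::collections::HashSet;

struct LocalNode {
    /// Peers we believe we are connected to (what the peer count reports).
    connected: HashSet<&'static str>,
    /// Peers that already closed the block-announce (set 0) substream on us.
    rejected_by: HashSet<&'static str>,
}

impl LocalNode {
    /// During warp sync we never write to the block-announce substream, so a
    /// remote closing it (because its full node slots are occupied) goes
    /// unnoticed and the peer stays in `connected`.
    fn remote_closed_set0(&mut self, peer: &'static str) {
        self.rejected_by.insert(peer);
    }

    /// Only when warp sync finishes and we announce a block do we learn that
    /// the substream is gone; the peer count drops at this point, with a delay.
    fn announce_block(&mut self) {
        let gone: Vec<_> = self
            .connected
            .iter()
            .filter(|p| self.rejected_by.contains(*p))
            .copied()
            .collect();
        for peer in gone {
            self.connected.remove(peer);
        }
    }

    fn peer_count(&self) -> usize {
        self.connected.len()
    }
}

fn main() {
    let mut node = LocalNode {
        connected: ["peer-a", "peer-b"].into_iter().collect(),
        rejected_by: HashSet::new(),
    };
    node.remote_closed_set0("peer-b"); // happens during warp sync
    assert_eq!(node.peer_count(), 2);  // over-reported: peer-b already rejected us
    node.announce_block();             // warp sync done, first block announcement
    assert_eq!(node.peer_count(), 1);  // only now does the count drop
}
```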

After merging the fix for #526 (paritytech/substrate#13396), it turned out that the local node can't actually sync: it now disconnects from the remote as soon as it is kicked off, and there are not enough peers left to sync from. The fix was therefore reverted in paritytech/substrate#13409. As investigated by @altonen, our local node is kicked off because the remote has all of its full node slots occupied.

In order for sync to work after merging paritytech/substrate#13396, there need to be more full node slots available on the network (see the previous "everyone is full" crisis, paritytech/substrate#12434).

Solution proposed

One way of increasing the number of available full node slots on the network is to reduce outbound connections from nodes that don't really need them. A high number of connections is needed to speed up the initial sync, but a node that is only doing keep-up sync can get by with fewer. It is therefore proposed to reduce the number of outgoing connections once the initial sync is finished, freeing slots up for other full nodes (see the sketch below).
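
A minimal, self-contained sketch of that idea follows. The names (`SyncPhase`, `PeerSlots`) and the concrete limits are illustrative assumptions, not the actual Substrate/Polkadot API; the only point is that the outbound connection budget shrinks once the node leaves initial sync.

```rust
// Illustrative sketch only: the names below are hypothetical and do not
// correspond to the real Substrate networking code.

#[derive(Debug, PartialEq)]
enum SyncPhase {
    /// Downloading the chain from scratch: many outbound peers speed this up.
    Initial,
    /// Following the chain head ("keep-up" sync): a few peers are enough.
    KeepUp,
}

struct PeerSlots {
    max_out_peers: usize,
}

impl PeerSlots {
    /// Outbound connection budget for a given sync phase (numbers are made up).
    fn target_out_peers(phase: &SyncPhase) -> usize {
        match phase {
            SyncPhase::Initial => 25,
            SyncPhase::KeepUp => 8,
        }
    }

    /// Called when the sync state machine changes phase. A real implementation
    /// would also ask the peerset to drop the excess outbound connections.
    fn on_phase_change(&mut self, phase: SyncPhase) {
        self.max_out_peers = Self::target_out_peers(&phase);
    }
}

fn main() {
    let mut slots = PeerSlots { max_out_peers: 25 };
    // Initial sync has finished; shrink the outbound budget to free full node
    // slots for other peers on the network.
    slots.on_phase_change(SyncPhase::KeepUp);
    assert_eq!(slots.max_out_peers, 8);
    println!("outbound peer budget now {}", slots.max_out_peers);
}
```

In the real node this logic would have to hook into the sync state machine and the peerset, and the concrete limits would need to be measured rather than hard-coded.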

@drskalman (Contributor) commented:
After upgrading to 0.9.38, one of our validators would constantly (every hour or so) get stuck at some block and stop syncing, even though it reported having 40 peers. I'm not sure if it is related to this issue: when it gets stuck it reports a stable 40 peers and rejects all new peers with a "too many full nodes" reason. We were not able to downgrade, so the only way out was to increase the number of peers to 50. Since then we haven't experienced the sync problem anymore.

One observation I had was that as long as we saw this type of mass disconnect event in the log, the node kept going; as you can see in the log, there is no such disconnect event when the node is stuck:

Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.965 DEBUG tokio-runtime-worker sync: 12D3KooWFDFsbLsjuSqxniefb8h4cjJ1ZkpMKAQMFwg4STHUn7Y2 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooW9xWbHxRQnEAAXN8Eq5Wfijknx65gdGEgV8ZnG42aX6CX disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWRMHpoE6J67eRrLvdyy2E9R9y82QA7pxWn1Sr2RHd5fj8 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWNpUSZB8wwnTDrVVKS7qNy5mD1fg1yoUkewxGpnvJaC4i disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWR5DsTqptjCygWoWUXzm12Got2zXdaPdXowgATGhYhTSs disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooW9wUx4QG5enP2Qpamb8qjk2AB2LPdGzSB18M7SZcMNZMr disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWB2JSrJXjgVDBt4C6Pfxq9ymRfxsf6DgQLwTFEYLF1gbe disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWJFWd5d42pFovcVnXpfz414Gy87W6L72GvDJXZ3TVczYK disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWQhxiQWhFGYzxMTR9nRL3RkYY1kvf8bh7KQwQUm5A5mH7 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWFCW2mVro6b7oZoQo1DRpeKRfSA1RzgF35PjnPNsYjPae disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWFzo9oNaEN5nhScYHWUSoibnXRmEsuXqV3EPwTnPtgcNV disconnected

One explanation would be that upgraded nodes were dropping us for some unknown reason/bug, while stale nodes would keep the connection forever, which would eventually result in our peer slots being fully saturated by stale nodes. So I'm not sure if this might actually be a consequence of #528.

node-got-stuck-log.txt.gz

@dmitry-markin (Contributor, Author) commented:

@drskalman thanks for the logs! I must admit I also had an issue on the latest master with a node reporting 40 peers during gap sync (the "Block history" phase of warp sync) but making no progress (the currently downloading block number was stuck). The issue went away after I restarted the node.

@altonen transferred this issue from paritytech/substrate Aug 24, 2023
bkchr pushed a commit that referenced this issue Apr 10, 2024
* Update some docs
* Add derived account origin
* Add tests for derived origin
* Do a little bit of cleanup
* Change Origin type to use AccountIds instead of Public keys
* Update (most) tests to use new Origin types
* Remove redundant test
* Update `runtime-common` tests to use new Origin types
* Remove unused import
* Fix documentation around origin verification
* Update config types to use AccountIds in runtime
* Update Origin type used in message relay
* Use correct type when verifying message origin
* Make CallOrigin docs more consistent
* Use AccountIds instead of Public keys in Runtime types
* Introduce trait for converting AccountIds
* Bring back standalone function for deriving account IDs
* Remove AccountIdConverter configuration trait
* Remove old bridge_account_id derivation function
* Handle target ID decoding errors more gracefully
* Update message-lane to use new AccountId derivation
* Update merged code to use new Origin types
* Use explicit conversion between H256 and AccountIds
* Make relayer fund account a config option in `message-lane` pallet
* Add note about deriving the same account on different chains
* Fix test weight
* Use AccountId instead of Public key when signing Calls
* Semi-hardcode relayer fund address into Message Lane pallet
Labels: None yet
Projects: Status: Backlog 🗒
Development: No branches or pull requests
3 participants