
Free more full node slots on the network #519

Open
dmitry-markin opened this issue Feb 20, 2023 · 2 comments
dmitry-markin commented Feb 20, 2023

Problem statement

Due to a bug (#526, fixed by paritytech/substrate#13396 and later reverted), nodes currently do not detect being kicked off by a remote peer during warp sync. Because of this, the node stays connected on other protocols, including the block request protocol, and continues syncing even though the remote node has all of its full node slots occupied. Once the warp sync is over, the local node sends a block announcement, instantly learns that the set 0 (block announcements) notification stream was closed by the remote node, and only then discovers that it was rejected. This leads to the peer count dropping after the warp sync, as described in #528. Because the disconnect on the local side happens with a delay (only after sending out a block announcement), our node keeps counting peers that have in fact rejected it, so the reported peer count is higher than it should be, and communication continues on non-default protocols (non-zero peersets).
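
To make the failure mode concrete, here is a hypothetical, self-contained model of that timeline (plain Rust, not Substrate code; all names are illustrative): the remote closes the block-announce (set 0) substream during warp sync, but the local node only corrects its peer count when it first tries to announce a block.

```rust
// Hypothetical model of the delayed-disconnect problem; not Substrate code.
use std::collections::HashSet;

struct LocalNode {
    /// Peers we believe we are connected to (what the peer count reports).
    connected: HashSet<&'static str>,
    /// Peers that already closed the block-announce (set 0) substream on us.
    rejected_by: HashSet<&'static str>,
}

impl LocalNode {
    /// During warp sync we never write to the block-announce substream, so a
    /// remote closing it (because its full node slots are occupied) goes
    /// unnoticed and the peer stays in `connected`.
    fn remote_closed_set0(&mut self, peer: &'static str) {
        self.rejected_by.insert(peer);
    }

    /// Only when warp sync finishes and we announce a block do we learn that
    /// the substream is gone; the peer count drops at this point, with a delay.
    fn announce_block(&mut self) {
        let gone: Vec<_> = self
            .connected
            .iter()
            .filter(|p| self.rejected_by.contains(*p))
            .copied()
            .collect();
        for peer in gone {
            self.connected.remove(peer);
        }
    }

    fn peer_count(&self) -> usize {
        self.connected.len()
    }
}

fn main() {
    let mut node = LocalNode {
        connected: ["peer-a", "peer-b"].into_iter().collect(),
        rejected_by: HashSet::new(),
    };
    node.remote_closed_set0("peer-b"); // happens during warp sync
    assert_eq!(node.peer_count(), 2);  // over-reported: peer-b already rejected us
    node.announce_block();             // warp sync done, first block announcement
    assert_eq!(node.peer_count(), 1);  // only now does the count drop
}
```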

After merging the fix for #526 (paritytech/substrate#13396), it turned out that the local node can't actually sync: it now disconnects from the remote as soon as it is kicked off, and there are not enough peers left to sync from. The fix was therefore reverted in paritytech/substrate#13409. As investigated by @altonen, our local node is kicked off because the remote has all of its full node slots occupied.

In order for sync to work after merging paritytech/substrate#13396, there need to be more full node slots available on the network (see the previous "everyone is full" crisis, paritytech/substrate#12434).

Solution proposed

One way of increasing the number of available full node slots on the network is to reduce outbound connections from nodes that don't really need them. A high number of connections is needed to speed up the initial sync, but a node that is only doing keep-up sync can get by with fewer. It is therefore proposed to reduce the number of outgoing connections once the initial sync is finished, freeing slots up for other full nodes (see the sketch below).
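
A minimal, self-contained sketch of that idea follows. The names (`SyncPhase`, `PeerSlots`) and the concrete limits are illustrative assumptions, not the actual Substrate/Polkadot API; the only point is that the outbound connection budget shrinks once the node leaves initial sync.

```rust
// Illustrative sketch only: the names below are hypothetical and do not
// correspond to the real Substrate networking code.

#[derive(Debug, PartialEq)]
enum SyncPhase {
    /// Downloading the chain from scratch: many outbound peers speed this up.
    Initial,
    /// Following the chain head ("keep-up" sync): a few peers are enough.
    KeepUp,
}

struct PeerSlots {
    max_out_peers: usize,
}

impl PeerSlots {
    /// Outbound connection budget for a given sync phase (numbers are made up).
    fn target_out_peers(phase: &SyncPhase) -> usize {
        match phase {
            SyncPhase::Initial => 25,
            SyncPhase::KeepUp => 8,
        }
    }

    /// Called when the sync state machine changes phase. A real implementation
    /// would also ask the peerset to drop the excess outbound connections.
    fn on_phase_change(&mut self, phase: SyncPhase) {
        self.max_out_peers = Self::target_out_peers(&phase);
    }
}

fn main() {
    let mut slots = PeerSlots { max_out_peers: 25 };
    // Initial sync has finished; shrink the outbound budget to free full node
    // slots for other peers on the network.
    slots.on_phase_change(SyncPhase::KeepUp);
    assert_eq!(slots.max_out_peers, 8);
    println!("outbound peer budget now {}", slots.max_out_peers);
}
```

In the real node this logic would have to hook into the sync state machine and the peerset, and the concrete limits would need to be measured rather than hard-coded.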

@drskalman (Contributor) commented:
After upgrading to 0.9.38, one of our validators would constantly (every hour or so) get stuck at some block and stop syncing, even though it reported having 40 peers. I'm not sure if it is related to this issue: when it gets stuck it reports a stable 40 peers and rejects all new peers with a "too many full nodes" reason. We were not able to downgrade, so the only way out was to increase the number of peers to 50. Since then we haven't experienced the sync problem anymore.

One observation I had was that as long as we saw this type of mass disconnect event in the log, the node kept going; as you can see in the log, there is no such disconnect event when the node is stuck:

Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.965 DEBUG tokio-runtime-worker sync: 12D3KooWFDFsbLsjuSqxniefb8h4cjJ1ZkpMKAQMFwg4STHUn7Y2 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooW9xWbHxRQnEAAXN8Eq5Wfijknx65gdGEgV8ZnG42aX6CX disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWRMHpoE6J67eRrLvdyy2E9R9y82QA7pxWn1Sr2RHd5fj8 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWNpUSZB8wwnTDrVVKS7qNy5mD1fg1yoUkewxGpnvJaC4i disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWR5DsTqptjCygWoWUXzm12Got2zXdaPdXowgATGhYhTSs disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooW9wUx4QG5enP2Qpamb8qjk2AB2LPdGzSB18M7SZcMNZMr disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWB2JSrJXjgVDBt4C6Pfxq9ymRfxsf6DgQLwTFEYLF1gbe disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.966 DEBUG tokio-runtime-worker sync: 12D3KooWJFWd5d42pFovcVnXpfz414Gy87W6L72GvDJXZ3TVczYK disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWQhxiQWhFGYzxMTR9nRL3RkYY1kvf8bh7KQwQUm5A5mH7 disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWFCW2mVro6b7oZoQo1DRpeKRfSA1RzgF35PjnPNsYjPae disconnected
Feb 21 19:54:54 kusama-6-dionysus polkadot[2657165]: 2023-02-21 19:54:54.967 DEBUG tokio-runtime-worker sync: 12D3KooWFzo9oNaEN5nhScYHWUSoibnXRmEsuXqV3EPwTnPtgcNV disconnected

One explanation would be that upgraded nodes were dropping us for some unknown reason/bug, while stale nodes would keep the connection forever, which would eventually result in our peer slots being fully saturated by stale nodes. So I'm not sure if this might actually be a consequence of #528.

node-got-stuck-log.txt.gz

@dmitry-markin (Contributor, Author) commented:

@drskalman thanks for the logs! I must admit I also had an issue on the latest master with a node reporting 40 peers during gap sync (the "Block history" phase of warp sync) but making no progress (the currently downloading block number was stuck). The issue went away after I restarted the node.

@altonen transferred this issue from paritytech/substrate Aug 24, 2023
bkchr pushed a commit that referenced this issue Apr 10, 2024
* Update some docs
* Add derived account origin
* Add tests for derived origin
* Do a little bit of cleanup
* Change Origin type to use AccountIds instead of Public keys
* Update (most) tests to use new Origin types
* Remove redundant test
* Update `runtime-common` tests to use new Origin types
* Remove unused import
* Fix documentation around origin verification
* Update config types to use AccountIds in runtime
* Update Origin type used in message relay
* Use correct type when verifying message origin
* Make CallOrigin docs more consistent
* Use AccountIds instead of Public keys in Runtime types
* Introduce trait for converting AccountIds
* Bring back standalone function for deriving account IDs
* Remove AccountIdConverter configuration trait
* Remove old bridge_account_id derivation function
* Handle target ID decoding errors more gracefully
* Update message-lane to use new AccountId derivation
* Update merged code to use new Origin types
* Use explicit conversion between H256 and AccountIds
* Make relayer fund account a config option in `message-lane` pallet
* Add note about deriving the same account on different chains
* Fix test weight
* Use AccountId instead of Public key when signing Calls
* Semi-hardcode relayer fund address into Message Lane pallet
Labels: None yet
Projects: Status: Backlog 🗒
Development: No branches or pull requests
3 participants