
Kusama: Beefy gossip validator is processing messages too slowly #3390

Closed · bkchr opened this issue Feb 19, 2024 · 11 comments · Fixed by #3435

bkchr commented Feb 19, 2024

Validators on Kusama report the following:

Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 13:42:50 The number of unprocessed messages in channel `mpsc_beefy_gossip_validator` exceeded 10000.
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: The channel was created at:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:  Last message was sent from:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <sc_consensus_beefy::communication::gossip::GossipValidator<B> as sc_network_gossip::validator::Validator<B>>::validate
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: sc_consensus_beefy::wait_for_runtime_pallet::{{closure}}::{{closure}}::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   11: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   12: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   13: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:
bkchr added the I2-bug (The node fails to follow expected behavior.) and I3-annoyance (The node behaves within expectations, however this “expected behaviour” itself is at issue.) labels Feb 19, 2024

acatangiu commented Feb 19, 2024

I am guessing this is a non-issue: 1000 validators gossiping, with multiple messages getting enqueued without being processed during node restart. Maybe we should increase the limit for this warning on Kusama to 100k instead of 10k.

Besides seeing this shortly after restart, are you experiencing any other issues?

Also, is your node finalizing BEEFY blocks? Do you see logs like INFO tokio-runtime-worker beefy: 🥩 Round #21946827 concluded?
Or you can also check via RPC against your node: RPC::beefy::getFinalizedHead().
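For the RPC check, here is a minimal sketch (not part of the original discussion) of querying a node's BEEFY finalized head over JSON-RPC. It assumes the node exposes the beefy_getFinalizedHead method on http://localhost:9944 and that the reqwest (blocking + json features) and serde_json crates are available; adjust the URL to your setup.

```rust
// Hedged sketch: ask a local node for its latest BEEFY-finalized block hash.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // JSON-RPC request body for the `beefy_getFinalizedHead` method.
    let request = serde_json::json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "beefy_getFinalizedHead",
        "params": []
    });

    // Assumed local RPC endpoint; adjust host/port for your node.
    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:9944")
        .json(&request)
        .send()?
        .json()?;

    // `result` should hold the hash of the latest BEEFY-finalized block;
    // a hash that keeps advancing over time means BEEFY finality is healthy.
    println!("BEEFY finalized head: {}", response["result"]);
    Ok(())
}
```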


paulormart commented Feb 19, 2024

From my side, besides seeing a similar message after a restart, all seems to be good; it's finalizing BEEFY blocks:

polkadot[208721]: 2024-02-19 14:42:47 🥩 Round #21946707 concluded, finality_proof: V1(SignedCommitment { ...


burdges commented Feb 19, 2024

Why? If it's CPU usage, we do not check the BLS signature when gossiping, right?


acatangiu commented Feb 19, 2024

Why? If it's CPU usage, we do not check the BLS signature when gossiping, right?

This happens only during node restart, while gossiped messages pile up (the BEEFY voter task starts late in the startup process, while the network gossip subsystem starts early).

Once the node restart process is complete and the BEEFY voter/worker task gets crunching, it consumes the pending gossip messages and carries on nicely.

From my side, besides seeing a similar message after a restart, all seems to be good; it's finalizing BEEFY blocks

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than they are being consumed, in which case RAM usage would reflect this very visibly over, say, a 24h window.
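For context, the warning in the original report comes from the tracing unbounded channel the gossip validator uses to hand messages to the voter. Below is a minimal sketch (not the node's actual code) of the restart pile-up described above, assuming the two-argument sc_utils::mpsc::tracing_unbounded(name, queue_size_warning) signature and the futures/tokio crates, with the message type simplified to u64:

```rust
use futures::StreamExt;
use sc_utils::mpsc::tracing_unbounded;

#[tokio::main]
async fn main() {
    // The gossip validator forwards messages over a tracing unbounded channel;
    // once the backlog exceeds `queue_size_warning` (10_000 here, matching the
    // affected nodes), the "unprocessed messages ... exceeded" warning is logged.
    let (tx, rx) = tracing_unbounded::<u64>("mpsc_beefy_gossip_validator", 10_000);

    // While the BEEFY voter is still waiting to initialize, nothing reads from
    // `rx`, so gossiped votes simply pile up and eventually trip the warning.
    for vote in 0..20_000u64 {
        let _ = tx.unbounded_send(vote);
    }
    drop(tx);

    // Once the voter task starts, the backlog drains quickly and the node
    // carries on, matching what operators report after a restart.
    let drained = rx.collect::<Vec<_>>().await.len();
    println!("drained {drained} queued votes");
}
```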

acatangiu self-assigned this Feb 19, 2024

kureus commented Feb 19, 2024

Like Paulo, I am seeing finalised BEEFY blocks:

Feb 19 16:45:55 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 16:45:55 🥩 Round #21947934 concluded, finality_proof: V1(SignedCommitment { commitment: Commitment { payload: ...


dbpatty commented Feb 19, 2024

I had the same issue with v1.7.1, and not only at the restart of the node.

@infrachris

On both KSM nodes I updated, I received the error "The number of unprocessed messages in channel mpsc_beefy_gossip_validator exceeded 10000." upon startup of the node after installing v1.7.1.

@stakeworld
Contributor

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than they are being consumed, in which case RAM usage would reflect this very visibly over, say, a 24h window.

RAM & CPU seem OK, but network usage has doubled since the activation of beefy


burdges commented Feb 19, 2024

This happens only during node restart, while gossiped messages pile up (the BEEFY voter task starts late in the startup process, while the network gossip subsystem starts early).

GRANDPA was designed so you only need to pay attention to the current round. In principle BEEFY should continue this, so maybe this requires some look-ahead that realizes older messages could now be discarded. Alternatively, BEEFY rounds could simply be run less frequently, so even though this happens it winds up irrelevant.

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but I'm not really sure.

@stakeworld
Contributor

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but I'm not really sure.

You are right, a lot of things have happened recently (runtime upgrade, new version, BEEFY), so it could also be async backing kicking in.

github-merge-queue bot pushed a commit that referenced this issue Feb 22, 2024
…on conditions (#3435)

As part of BEEFY worker/voter initialization the task waits for certain
chain and backend conditions to be fulfilled:
- BEEFY consensus enabled on-chain & GRANDPA best finalized higher than
on-chain BEEFY genesis block,
- backend has synced headers for BEEFY mandatory blocks between best
BEEFY and best GRANDPA.

During this waiting time, any messages gossiped on the BEEFY topic for
current chain get enqueued in the gossip engine, leading to RAM bloating
and output warning/error messages when the wait time is non-negligible
(like during a clean sync).

This PR adds logic to pump the gossip engine while waiting for other
things to make sure gossiped messages get consumed (practically
discarded until worker is fully initialized).

Also raises the warning threshold for enqueued messages from 10k to
100k. This is in line with the other gossip protocols on the node.

Fixes #3390

---------

Signed-off-by: Adrian Catangiu <adrian@parity.io>
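A minimal sketch (not the actual #3435 code) of the "pump while waiting" idea from the commit message above, using only the futures crate: poll the gossip engine concurrently with the initialization wait so queued gossip keeps being consumed. The function name and signatures here are illustrative assumptions.

```rust
use futures::future::{select, Either, Future};

/// Illustrative only: run the gossip engine future alongside the future that
/// waits for the BEEFY initialization conditions (on-chain genesis reached,
/// mandatory headers synced), so gossiped messages keep being drained instead
/// of piling up in the validator's channel.
async fn wait_with_gossip_pump<W, G>(wait_for_conditions: W, gossip_engine: G) -> Option<W::Output>
where
    W: Future + Unpin,
    G: Future + Unpin,
{
    match select(wait_for_conditions, gossip_engine).await {
        // Conditions met: hand back the result and start the BEEFY worker proper.
        Either::Left((ready, _gossip_engine)) => Some(ready),
        // Gossip engine terminated while waiting: treat it as node shutdown.
        Either::Right((_engine_finished, _wait)) => None,
    }
}
```

As the commit message notes, messages drained this way are practically discarded until the worker is fully initialized.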
@acatangiu
Contributor

Fixed by #3435 and will be released with node version 1.8.0.

EgorPopelyaev pushed a commit that referenced this issue Feb 22, 2024
…on conditions (#3435)
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
…on conditions (paritytech#3435)