
Kusama: Beefy gossip validator is processing messages too slowly #3390

Closed · bkchr opened this issue Feb 19, 2024 · 11 comments · Fixed by #3435

bkchr commented Feb 19, 2024

Validators on Kusama report the following:

Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 13:42:50 The number of unprocessed messages in channel `mpsc_beefy_gossip_validator` exceeded 10000.
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]: The channel was created at:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:  Last message was sent from:
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    0: sc_utils::mpsc::TracingUnboundedSender<T>::unbounded_send
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    1: <sc_consensus_beefy::communication::gossip::GossipValidator<B> as sc_network_gossip::validator::Validator<B>>::validate
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    2: <sc_network_gossip::bridge::GossipEngine<B> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    3: sc_consensus_beefy::wait_for_runtime_pallet::{{closure}}::{{closure}}::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    4: sc_consensus_beefy::start_beefy_gadget::{{closure}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    5: <futures_util::future::future::Map<Fut,F> as core::future::future::Future>::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    6: tokio::runtime::task::raw::poll
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    7: std::sys_common::backtrace::__rust_begin_short_backtrace
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    8: core::ops::function::FnOnce::call_once{{vtable.shim}}
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:    9: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   10: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/alloc/src/boxed.rs:2007:9
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   11: std::sys::unix::thread::Thread::new::thread_start
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:              at ./rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/sys/unix/thread.rs:108:17
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   12: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:   13: <unknown>
Feb 19 13:42:50 ovh-adv2-fireproof polkadot[437166]:
bkchr added the I2-bug (The node fails to follow expected behavior.) and I3-annoyance (The node behaves within expectations, however this “expected behaviour” itself is at issue.) labels Feb 19, 2024

acatangiu commented Feb 19, 2024

I am guessing this is a non-issue: 1000 validators gossiping, with multiple messages getting enqueued without being processed during node restart. Maybe we should increase the limit for this warning on Kusama to 100k instead of 10k.

Besides seeing this shortly after restart, are you experiencing any other issues?

Also, is your node finalizing BEEFY blocks? Do you see logs like INFO tokio-runtime-worker beefy: 🥩 Round #21946827 concluded?
Or you can also check via RPC against your node: RPC::beefy::getFinalizedHead().
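For the RPC check, here is a minimal sketch (not part of the original discussion) of querying a node's BEEFY finalized head over JSON-RPC. It assumes the node exposes the beefy_getFinalizedHead method on http://localhost:9944 and that the reqwest (blocking + json features) and serde_json crates are available; adjust the URL to your setup.

```rust
// Hedged sketch: ask a local node for its latest BEEFY-finalized block hash.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // JSON-RPC request body for the `beefy_getFinalizedHead` method.
    let request = serde_json::json!({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "beefy_getFinalizedHead",
        "params": []
    });

    // Assumed local RPC endpoint; adjust host/port for your node.
    let response: serde_json::Value = reqwest::blocking::Client::new()
        .post("http://localhost:9944")
        .json(&request)
        .send()?
        .json()?;

    // `result` should hold the hash of the latest BEEFY-finalized block;
    // a hash that keeps advancing over time means BEEFY finality is healthy.
    println!("BEEFY finalized head: {}", response["result"]);
    Ok(())
}
```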


paulormart commented Feb 19, 2024

From my side, besides seeing a similar message after a restart, all seems to be good; it's finalizing BEEFY blocks:

polkadot[208721]: 2024-02-19 14:42:47 🥩 Round #21946707 concluded, finality_proof: V1(SignedCommitment { ...


burdges commented Feb 19, 2024

Why? If it's CPU usage, we do not check the BLS signature when gossiping, right?


acatangiu commented Feb 19, 2024

Why? If it's CPU usage, we do not check the BLS signature when gossiping, right?

This happens only during node restart, while gossiped messages pile up (the BEEFY voter task starts late in the startup process, while the network gossip subsystem starts early).

Once the node restart process is complete and the BEEFY voter/worker task gets crunching, it consumes the pending gossip messages and carries on nicely.

From my side, besides seeing a similar message after a restart, all seems to be good; it's finalizing BEEFY blocks

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than they are being consumed, in which case RAM usage would reflect this very visibly over, say, a 24h window.
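For context, the warning in the original report comes from the tracing unbounded channel the gossip validator uses to hand messages to the voter. Below is a minimal sketch (not the node's actual code) of the restart pile-up described above, assuming the two-argument sc_utils::mpsc::tracing_unbounded(name, queue_size_warning) signature and the futures/tokio crates, with the message type simplified to u64:

```rust
use futures::StreamExt;
use sc_utils::mpsc::tracing_unbounded;

#[tokio::main]
async fn main() {
    // The gossip validator forwards messages over a tracing unbounded channel;
    // once the backlog exceeds `queue_size_warning` (10_000 here, matching the
    // affected nodes), the "unprocessed messages ... exceeded" warning is logged.
    let (tx, rx) = tracing_unbounded::<u64>("mpsc_beefy_gossip_validator", 10_000);

    // While the BEEFY voter is still waiting to initialize, nothing reads from
    // `rx`, so gossiped votes simply pile up and eventually trip the warning.
    for vote in 0..20_000u64 {
        let _ = tx.unbounded_send(vote);
    }
    drop(tx);

    // Once the voter task starts, the backlog drains quickly and the node
    // carries on, matching what operators report after a restart.
    let drained = rx.collect::<Vec<_>>().await.len();
    println!("drained {drained} queued votes");
}
```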

acatangiu self-assigned this Feb 19, 2024

kureus commented Feb 19, 2024

Like Paulo, I am seeing finalised BEEFY blocks:

Feb 19 16:45:55 ovh-adv2-fireproof polkadot[437166]: 2024-02-19 16:45:55 🥩 Round #21947934 concluded, finality_proof: V1(SignedCommitment { commitment: Commitment { payload: ...


dbpatty commented Feb 19, 2024

I had the same issue with v1.7.1, and not only at the restart of the node.

@infrachris

On both KSM nodes I updated, I received the error "The number of unprocessed messages in channel mpsc_beefy_gossip_validator exceeded 10000." upon startup of the node after installing v1.7.1.

@stakeworld
Contributor

For completeness of sanity checks, please also check node RAM usage over time. The bad scenario to invalidate here is that the gossip messages keep piling up faster than they are being consumed, in which case RAM usage would reflect this very visibly over, say, a 24h window.

RAM & CPU seem OK, but network usage has doubled since the activation of beefy


burdges commented Feb 19, 2024

This happens only during node restart, while gossiped messages pile up (the BEEFY voter task starts late in the startup process, while the network gossip subsystem starts early).

GRANDPA was designed so you only need to pay attention to the current round. In principle BEEFY should continue this, so maybe this requires some look-ahead that realizes older messages could now be discarded. Alternatively, BEEFY rounds could simply be run less frequently, so even though this happens it winds up irrelevant.

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but I'm not really sure.

@stakeworld
Contributor

network usage has doubled since the activation of beefy

Do we know if BEEFY caused this? I'd think network usage doubling sounds more like async backing, but I'm not really sure.

You are right, a lot of things have happened recently (runtime upgrade, new version, BEEFY), so it could also be async backing kicking in.

github-merge-queue bot pushed a commit that referenced this issue Feb 22, 2024
…on conditions (#3435)

As part of BEEFY worker/voter initialization the task waits for certain
chain and backend conditions to be fulfilled:
- BEEFY consensus enabled on-chain & GRANDPA best finalized higher than
on-chain BEEFY genesis block,
- backend has synced headers for BEEFY mandatory blocks between best
BEEFY and best GRANDPA.

During this waiting time, any messages gossiped on the BEEFY topic for
current chain get enqueued in the gossip engine, leading to RAM bloating
and output warning/error messages when the wait time is non-negligible
(like during a clean sync).

This PR adds logic to pump the gossip engine while waiting for other
things to make sure gossiped messages get consumed (practically
discarded until worker is fully initialized).

Also raises the warning threshold for enqueued messages from 10k to
100k. This is in line with the other gossip protocols on the node.

Fixes #3390

---------

Signed-off-by: Adrian Catangiu <adrian@parity.io>
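A minimal sketch (not the actual #3435 code) of the "pump while waiting" idea from the commit message above, using only the futures crate: poll the gossip engine concurrently with the initialization wait so queued gossip keeps being consumed. The function name and signatures here are illustrative assumptions.

```rust
use futures::future::{select, Either, Future};

/// Illustrative only: run the gossip engine future alongside the future that
/// waits for the BEEFY initialization conditions (on-chain genesis reached,
/// mandatory headers synced), so gossiped messages keep being drained instead
/// of piling up in the validator's channel.
async fn wait_with_gossip_pump<W, G>(wait_for_conditions: W, gossip_engine: G) -> Option<W::Output>
where
    W: Future + Unpin,
    G: Future + Unpin,
{
    match select(wait_for_conditions, gossip_engine).await {
        // Conditions met: hand back the result and start the BEEFY worker proper.
        Either::Left((ready, _gossip_engine)) => Some(ready),
        // Gossip engine terminated while waiting: treat it as node shutdown.
        Either::Right((_engine_finished, _wait)) => None,
    }
}
```

As the commit message notes, messages drained this way are practically discarded until the worker is fully initialized.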
@acatangiu
Contributor

Fixed by #3435 and will be released with node version 1.8.0.

EgorPopelyaev pushed a commit that referenced this issue Feb 22, 2024
…on conditions (#3435)
bgallois pushed a commit to duniter/duniter-polkadot-sdk that referenced this issue Mar 25, 2024
…on conditions (paritytech#3435)