
bitfield-distribution: subsystem queue seems to get full #5657

Closed
sandreim opened this issue Sep 10, 2024 · 1 comment · Fixed by #5787
Assignees
Labels
T0-node This PR/Issue is related to the topic “node”.

Comments

@sandreim
Contributor

On Kusama we can observe that the channel gets full every once in a while, leading to a brief stall of the network bridge. This has been happening since we increased the number of validators, which increased the number of bitfield messages in the network.

The bitfield gossip is bursty: all nodes set up a 1.5s timer when they import a block, and when the timer expires they all send out their bitfields to the other validators.
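
A minimal sketch of that pattern, assuming a tokio runtime; `broadcast_bitfield` is a hypothetical stand-in for the real gossip call, not the subsystem's actual API:

```rust
use std::time::Duration;

// Hypothetical stand-in for the real gossip call.
async fn broadcast_bitfield(_bitfield: Vec<u8>) {
    // The real subsystem sends the signed bitfield to grid neighbours
    // and a few random peers.
}

// Every validator runs this on block import, so all of them fire their
// sends ~1.5s later, producing a burst of messages at the receivers.
async fn on_block_import(bitfield: Vec<u8>) {
    tokio::time::sleep(Duration::from_millis(1500)).await;
    broadcast_bitfield(bitfield).await;
}
```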

We need to investigate this further and see if it is a potential problem when we scale up to 1k validators. We might want to optimize this a bit, or maybe a larger subsystem channel size is enough to absorb these bursts.

[Screenshot attached: 2024-09-10 at 11:12:10]
@sandreim sandreim added the T0-node This PR/Issue is related to the topic “node”. label Sep 10, 2024
@alexggh alexggh self-assigned this Sep 19, 2024
@alexggh
Contributor

alexggh commented Sep 20, 2024

I did some investigation into bitfield-distribution occasionally getting clogged, and all data points to the throughput of the subsystem being, on average, more than sufficient. This is backed by multiple data sources:

  • Subsystem benchmarks show that processing all bitfields for 500 validators takes around 50ms of CPU time on the reference hardware.
  • CPU usage of the subsystem on Kusama nodes does not go over 4%.

This leads me to think that these rare occasions of the subsystem being clogged are just bursts of messages caused by all validators deciding to send their bitfields at roughly the same time.

Doing some math on the total number of messages, it looks like this:

  1. We have 500 validators, so there are 500 unique bitfield messages.
  2. Each unique message can be received by a node up to 6 times (2 times via the X and Y grid neighbours, plus 4 times because every message is also gossiped to 4 random peers).
  3. Hence a node can receive 3000 (500 * 6) bitfield messages per relay chain block.
  4. The clogging seems to be correlated with relay-chain forks. I see cases on Kusama with 3 or 4-way forks; there a node can receive up to 3000 * 4 = 12_000 bitfield messages, all arriving around the same time. The messages themselves are processed really fast: we don't see them accumulating, and the time of flight is almost always below 100ms, with most of them below 100 microseconds.
  5. Bitfield distribution uses the default message_capacity=2048, so I think that's why the queue gets full during these fork-induced bursts (see the arithmetic sketch after this list). Important to note: this happens very rarely, around 4 times a day.
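
A quick back-of-the-envelope check of these numbers against the default queue capacity (a standalone sketch, not code from the subsystem):

```rust
// Back-of-the-envelope arithmetic for the numbers above.
const VALIDATORS: usize = 500;
// 2 copies via the X/Y grid neighbours + 4 copies via random gossip.
const COPIES_PER_MESSAGE: usize = 2 + 4;
const DEFAULT_MESSAGE_CAPACITY: usize = 2048;

fn main() {
    let per_block = VALIDATORS * COPIES_PER_MESSAGE; // 3_000
    let four_way_fork = per_block * 4; // 12_000
    println!("bitfield messages per relay-chain block: {per_block}");
    println!("bitfield messages during a 4-way fork:   {four_way_fork}");
    // ~5.9x the default capacity, so a single fork burst can fill the queue.
    println!(
        "burst / default queue capacity: {:.1}",
        four_way_fork as f64 / DEFAULT_MESSAGE_CAPACITY as f64
    );
}
```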

Clogging the subsystem queue, even briefly, is really bad because it blocks the sender, and in this case the sender is network-bridge-rx, which dispatches communication for all the other subsystems. We want to avoid it entirely, or at least minimize it, and there are two low-hanging fruits we should pick:

  1. Increase the message_capacity. I propose setting it to 8192. The only downside is a slightly larger memory footprint when the queue is at max capacity: our messages are around 1 KiB each, so the theoretical maximum for this subsystem queue goes from 2 MiB to 8 MiB. I think that's a perfectly acceptable trade-off, because production nodes are supposed to run with at least 32 GiB of RAM, so this is negligible.

  2. Make the subsystem run on a blocking task. This has two benefits. First, it should make the subsystem quicker to react, because it gets its own thread rather than sharing the task pool with everyone else. Second, the subsystem does signature checking (`let signed_availability = match bitfield.try_into_checked(&signing_context, &validator)`), which is a CPU-intensive operation, and running it on the blocking pool is the recommended way to reduce the impact on the other tasks in the tokio pool (see the sketch after this list).
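
A minimal sketch of the second idea, using tokio's generic blocking pool rather than the subsystem's actual spawning machinery; `UncheckedBitfield`, `CheckedBitfield` and `check_signature` are hypothetical stand-ins:

```rust
use tokio::task;

// Hypothetical stand-ins for the real types; only the spawn_blocking
// pattern is the point of this sketch.
struct UncheckedBitfield;
struct CheckedBitfield;

fn check_signature(_bitfield: UncheckedBitfield) -> Option<CheckedBitfield> {
    // CPU-intensive signature verification would happen here.
    Some(CheckedBitfield)
}

async fn handle_bitfield(bitfield: UncheckedBitfield) -> Option<CheckedBitfield> {
    // Off-load the CPU-heavy verification so it does not stall the other
    // futures that share the async executor's worker threads.
    task::spawn_blocking(move || check_signature(bitfield))
        .await
        .expect("the signature-check task panicked")
}
```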

Proposed fix: #5787

github-merge-queue bot pushed a commit that referenced this issue Sep 23, 2024
…5787)

## Description

Details and rationale explained here:
#5657 (comment)

Fixes: #5657

---------

Signed-off-by: Alexandru Gheorghe <alexandru.gheorghe@parity.io>