
Brainstorm: Expose keep-alive information to user #1701

Closed
mxinden opened this issue Aug 13, 2020 · 1 comment
Comments

mxinden (Member) commented Aug 13, 2020

Background

For paritytech/polkadot#1532 I am investigating why connections initiated through Kademlia are kept alive for longer than the expected 10-second idle timeout. As far as I can tell, there is no way to learn why a connection is kept alive without intrusive changes to the libp2p source code. Ideally, I would like to expose this as a Prometheus metric in Substrate, e.g.:

substrate_sub_libp2p_connections_total{keep_alive_protocol="kademlia"} 10
substrate_sub_libp2p_connections_total{keep_alive_protocol="legacy"} 11
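For illustration only, such a metric could be registered with the prometheus crate roughly as follows; the function name and help text are mine, not actual Substrate code.

use prometheus::{IntGaugeVec, Opts, Registry};

// Hypothetical sketch: a gauge vector keyed by the protocol that keeps
// connections alive. Substrate's real metric registration looks different.
fn register_keep_alive_metric(registry: &Registry) -> prometheus::Result<IntGaugeVec> {
    let gauge = IntGaugeVec::new(
        Opts::new(
            "substrate_sub_libp2p_connections_total",
            "Number of connections, labelled by the protocol keeping them alive",
        ),
        &["keep_alive_protocol"],
    )?;
    registry.register(Box::new(gauge.clone()))?;
    // e.g. gauge.with_label_values(&["kademlia"]).set(10);
    Ok(gauge)
}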

Hacky solution

To help with debugging, I extended KeepAlive with a protocol id indicating the protocol that would like to keep the connection alive.

diff --git a/swarm/src/protocols_handler.rs b/swarm/src/protocols_handler.rs
index 9721e9db..1fcc3fac 100644
--- a/swarm/src/protocols_handler.rs
+++ b/swarm/src/protocols_handler.rs
@@ -498,12 +498,12 @@ where T: ProtocolsHandler
 }
 
 /// How long the connection should be kept alive.
 pub enum KeepAlive {
     /// If nothing new happens, the connection should be closed at the given `Instant`.
-    Until(Instant),
+    Until(Instant, Cow<'static, [u8]>),
     /// Keep the connection alive.
-    Yes,
+    Yes(Cow<'static, [u8]>),
     /// Close the connection as soon as possible.
     No,
 }

ProtocolsHandlers that delegate to other ProtocolsHandlers aggregate the KeepAlive values and pass the protocol id of the highest KeepAlive upwards.
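A rough sketch of what that aggregation could look like with the extended enum from the diff above; the merge helper and its ordering are my own illustration, not existing libp2p code:

use std::borrow::Cow;
use std::time::Instant;

pub enum KeepAlive {
    Until(Instant, Cow<'static, [u8]>),
    Yes(Cow<'static, [u8]>),
    No,
}

// Hypothetical helper: return whichever of the two values keeps the
// connection alive the longest, carrying its protocol id upwards.
fn merge(a: KeepAlive, b: KeepAlive) -> KeepAlive {
    use KeepAlive::*;
    match (a, b) {
        // Yes always wins; keep the id of the protocol requesting it.
        (Yes(id), _) | (_, Yes(id)) => Yes(id),
        // Prefer the later deadline, together with its protocol id.
        (Until(t1, id1), Until(t2, id2)) => {
            if t1 >= t2 { Until(t1, id1) } else { Until(t2, id2) }
        }
        (Until(t, id), No) | (No, Until(t, id)) => Until(t, id),
        (No, No) => No,
    }
}

A delegating handler would fold the KeepAlive values of its inner handlers with merge and report the result upwards.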

Within node_handler.rs I could then log the id of the protocol that keeps the connection alive.

This led to #1698, which triggered the investigation for #1700.

Way forward

First of all: Do people feel the need to surface keep-alive information to the user? Or is the on-demand debugging through log lines good enough?

If we do want to expose that information, we need to find a consistent way to do so. One suggestion from my side would be to bubble up a KeepAlive event from the NodeHandler for each connection at a regular interval, recording the id of the protocol keeping the connection alive.
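To make that concrete, such an event could look roughly like the following; this is purely a sketch, and neither the event nor its field names exist in libp2p:

use std::borrow::Cow;
use std::time::Instant;

// Hypothetical event, emitted per connection at a regular interval,
// reporting which protocol currently keeps the connection alive.
pub enum ConnectionKeepAliveEvent {
    /// The connection is kept alive until `until` by `protocol`.
    KeptAliveUntil { protocol: Cow<'static, [u8]>, until: Instant },
    /// The connection is kept alive indefinitely by `protocol`.
    KeptAlive { protocol: Cow<'static, [u8]> },
    /// No handler currently asks to keep the connection alive.
    Idle,
}

A user could then count these events per protocol and feed them into a metric like the one sketched under Background.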

romanb (Contributor) commented Aug 13, 2020

Relates to #1478.

This issue was moved to a discussion. You can continue the conversation there.
