litep2p: Introduce metrics to reflect libp2p metrics #4681

Open
lexnv opened this issue Jun 3, 2024 · 0 comments

lexnv commented Jun 3, 2024

Some metrics are currently used to check the availability of our nodes, and they can trigger alarms for the on-call engineers.

One such example is incoming_connections_total, which has no counterpart in the litep2p backend:

    SwarmEvent::IncomingConnection { local_addr, send_back_addr } => {
        trace!(target: "sub-libp2p", "Libp2p => IncomingConnection({},{}))",
            local_addr, send_back_addr);
        if let Some(metrics) = self.metrics.as_ref() {
            metrics.incoming_connections_total.inc();
        }
    },

End goals:

  • Introduce metrics to reflect libp2p metrics (where possible)
  • Make the naming of the metrics backend-agnostic, so that the same metrics can be reused for dashboards, alarms, and monitoring
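
As a rough illustration of the second goal, the sketch below keeps one metrics struct that both backends update. Everything in it is an assumption made for illustration: the `Counter` type is a stand-in for a real Prometheus counter, the event enums and handler functions are hypothetical, and the metric name constant is invented rather than taken from the codebase.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical backend-agnostic metric name: it mentions neither `libp2p`
/// nor `litep2p`, so dashboards and alarms survive a backend switch.
pub const INCOMING_CONNECTIONS_TOTAL: &str = "substrate_network_incoming_connections_total";

/// Minimal stand-in for a Prometheus counter (assumption, std-only).
#[derive(Default)]
pub struct Counter(AtomicU64);

impl Counter {
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Metrics shared by both network backends.
#[derive(Default)]
pub struct NetworkMetrics {
    pub incoming_connections_total: Counter,
}

/// Hypothetical, heavily simplified event types, one per backend.
pub enum Libp2pEvent {
    IncomingConnection,
    Other,
}

pub enum Litep2pEvent {
    ConnectionAccepted,
    Other,
}

/// Each backend maps its own event onto the same shared counter.
pub fn on_libp2p_event(metrics: Option<&NetworkMetrics>, event: &Libp2pEvent) {
    if let (Some(metrics), Libp2pEvent::IncomingConnection) = (metrics, event) {
        metrics.incoming_connections_total.inc();
    }
}

pub fn on_litep2p_event(metrics: Option<&NetworkMetrics>, event: &Litep2pEvent) {
    if let (Some(metrics), Litep2pEvent::ConnectionAccepted) = (metrics, event) {
        metrics.incoming_connections_total.inc();
    }
}
```

With this shape, an alarm defined on the shared counter does not need to know which backend produced the event.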

cc @paritytech/networking @paritytech/sdk-node

@lexnv lexnv added this to Networking Jun 3, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 3, 2024
This PR exposes the `RandomKademliaStarted` event from the litep2p
network backend, and then increments the appropriate metrics.

This is part of: #4681.
However, it is mainly an effort to debug the low peer count.

### Testing Done
- Started a node and fetched queries:
`substrate_sub_libp2p_kademlia_random_queries_total` produces results
for litep2p backend

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
TomaszWaszczyk pushed a commit to TomaszWaszczyk/polkadot-sdk that referenced this issue Jul 7, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 22, 2024
This PR improves the metrics reported by litep2p on request-response
errors.

Discovered while investigating:
- #4985


We are seeing many more requests reported as `Refused` by litep2p than
by libp2p.
The `Refused` count roughly approximates the sum of the other failure
reasons reported by libp2p.
This PR aims to provide more insight into those failures.

```
{__name__="substrate_sub_libp2p_requests_out_failure_total", chain="ksmcc3", instance="localhost:9615", job="substrate_node", protocol="/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/sync/2", reason="Remote has closed the substream before answering, thereby signaling that it considers the request as valid, but refused to answer it."}

    Last *: 3365
    Min: 3363
    Max: 3365
    Mean: 3365
    
    
{__name__="substrate_sub_libp2p_requests_out_failure_total", chain="ksmcc3", instance="localhost:9615", job="substrate_node", protocol="/b0a8d493285c2df73290dfb7e61f870f17b41801197a149ca93654499ea3dafe/beefy/justifications/1", reason="Remote has closed the substream before answering, thereby signaling that it considers the request as valid, but refused to answer it."}

    Last *: 3461
    Min: 3461
    Max: 3461
    Mean: 3461
```
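
To show why a finer-grained `reason` label helps, here is a minimal sketch of a failure counter keyed by protocol and reason. The `FailureReason` variants and the `RequestFailures` type are hypothetical and std-only. They do not mirror the actual litep2p or libp2p error enums or the Prometheus types used by the node.

```rust
use std::collections::HashMap;

/// Hypothetical failure reasons; the real backend error types differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum FailureReason {
    Refused,
    Timeout,
    ConnectionClosed,
}

impl FailureReason {
    /// Short, stable label value to use for the `reason` label.
    pub fn as_label(self) -> &'static str {
        match self {
            FailureReason::Refused => "refused",
            FailureReason::Timeout => "timeout",
            FailureReason::ConnectionClosed => "connection-closed",
        }
    }
}

/// Stand-in for a counter vector with `protocol` and `reason` labels.
#[derive(Default)]
pub struct RequestFailures {
    counts: HashMap<(String, &'static str), u64>,
}

impl RequestFailures {
    /// Record one failed outbound request.
    pub fn observe(&mut self, protocol: &str, reason: FailureReason) {
        *self
            .counts
            .entry((protocol.to_owned(), reason.as_label()))
            .or_insert(0) += 1;
    }

    /// Failures for one protocol and one reason.
    pub fn get(&self, protocol: &str, reason: FailureReason) -> u64 {
        self.counts
            .get(&(protocol.to_owned(), reason.as_label()))
            .copied()
            .unwrap_or(0)
    }

    /// Failures for one protocol summed over all reasons, which is all
    /// that a single catch-all reason would expose.
    pub fn total_for(&self, protocol: &str) -> u64 {
        self.counts
            .iter()
            .filter(|(key, _)| key.0 == protocol)
            .map(|(_, count)| *count)
            .sum()
    }
}
```

Splitting one catch-all reason into several labels keeps the per-protocol total unchanged while making it possible to see which failure mode dominates.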

Part of:
- #4681

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
drskalman pushed a commit to w3f/polkadot-sdk that referenced this issue Jul 23, 2024
…ble litep2p metrics (paritytech#4977)

This PR extends the metrics exposed by the peerstore with the total
number of banned peers.

The new metric is exposed under
`substrate_sub_libp2p_peerset_num_banned_peers`.

To easily extend metrics in the future, the `fn num_known_peers` is
removed in favor of `fn status`.
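
A minimal sketch of what a single `status` accessor could look like is shown here. The `PeerStore` and `PeerStoreStatus` types, the `report_peer` method, and the ban threshold are all assumptions for illustration and are not the actual peerstore implementation.

```rust
use std::collections::HashMap;

/// Hypothetical ban threshold; the real peerstore has its own constants.
const BANNED_THRESHOLD: i32 = -100;

/// Snapshot returned by `status`. New fields can be added here without
/// adding another accessor to the peerstore.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PeerStoreStatus {
    pub num_known_peers: usize,
    pub num_banned_peers: usize,
}

/// Simplified peerstore that tracks one reputation value per peer.
#[derive(Default)]
pub struct PeerStore {
    reputations: HashMap<String, i32>,
}

impl PeerStore {
    /// Apply a reputation change, inserting the peer if it is unknown.
    pub fn report_peer(&mut self, peer: &str, change: i32) {
        let reputation = self.reputations.entry(peer.to_owned()).or_insert(0);
        *reputation = reputation.saturating_add(change);
    }

    /// One call that yields every value the metrics layer needs.
    pub fn status(&self) -> PeerStoreStatus {
        PeerStoreStatus {
            num_known_peers: self.reputations.len(),
            num_banned_peers: self
                .reputations
                .values()
                .filter(|reputation| **reputation < BANNED_THRESHOLD)
                .count(),
        }
    }
}
```

The metrics layer can then poll `status()` once per update and set both gauges from the same snapshot.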

While at it, enable the metrics for litep2p:
- total number of peers from peerstore (needed to debug memory
consumption)
- total number of banned peers from peerstore (needed to debug
reputation bans and disconnects)

I have added a couple of tests to validate that the number of banned
peers is exposed properly.

Part of: paritytech#4681


### Testing Done
Using [subp2p-explorer](https://github.com/lexnv/subp2p-explorer), I
submitted random data on the tx protocol.
The peer gets banned, the number of banned peers is incremented, and
then the peer is disconnected.

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Co-authored-by: Dmitry Markin <dmitry@markin.tech>
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
…h#5077)

TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this issue Aug 2, 2024
…ble litep2p metrics (paritytech#4977)

github-merge-queue bot pushed a commit that referenced this issue Sep 10, 2024
This release introduces several new features, improvements, and fixes to
the litep2p library. Key updates include enhanced error handling,
configurable connection limits, and a new API for managing public
addresses.

For a detailed set of changes, see [litep2p
changelog](https://github.com/paritytech/litep2p/blob/master/CHANGELOG.md#070---2024-09-05).

This PR makes use of:
- connection limits to optimize network throughput
- better errors that are propagated to substrate metrics 
- public addresses API to report healthy addresses to the Identify
protocol

### Warp sync time improvement

Measuring warp sync time is somewhat inaccurate, since the network is
not deterministic and we might end up using faster peers (peers with
more resources to handle our requests). However, I no longer saw warp
sync times of 16 minutes. Instead, they roughly stabilized between 8
and 10 minutes.

For measuring warp-sync time, I've used
[sub-triage-logs](https://github.com/lexnv/sub-triage-logs/?tab=readme-ov-file#warp-time)

### Litep2p

Phase | Time
 -|-
Warp  | 426.999999919s
State | 99.000000555s
Total | 526.000000474s

### Libp2p

Phase | Time
 -|-
Warp  | 731.999999837s
State | 71.000000882s
Total | 803.000000719s
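
As a quick check of the two tables above, the totals and the relative improvement can be recomputed from the per-phase values. The type and function names below are illustrative only, and the numbers come from a single non-deterministic run.

```rust
/// Phase timings in seconds, copied from the tables above.
pub struct SyncTimes {
    pub warp: f64,
    pub state: f64,
}

impl SyncTimes {
    /// Total sync time is the sum of the warp and state phases.
    pub fn total(&self) -> f64 {
        self.warp + self.state
    }
}

pub const LITEP2P: SyncTimes = SyncTimes { warp: 426.999999919, state: 99.000000555 };
pub const LIBP2P: SyncTimes = SyncTimes { warp: 731.999999837, state: 71.000000882 };

/// How many times faster `candidate` is than `baseline`.
pub fn speedup(baseline: &SyncTimes, candidate: &SyncTimes) -> f64 {
    baseline.total() / candidate.total()
}

/// Percentage of the baseline time saved by `candidate`.
pub fn time_saved_percent(baseline: &SyncTimes, candidate: &SyncTimes) -> f64 {
    (1.0 - candidate.total() / baseline.total()) * 100.0
}
```

On these numbers litep2p finishes the total sync roughly 1.5 times faster than libp2p (about a third less time), even though its state phase alone is slower (99s versus 71s).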

Closes: #4986


### Low peer count

After exposing the `litep2p::public_addresses` interface, we can report
confirmed external addresses to litep2p. This should mitigate or at
least improve: #4925.
I will keep the issue open to confirm this.


### Improved metrics

We are one step closer to exposing metrics similar to those of libp2p:
#4681.

cc @paritytech/networking 

### Next Steps
- [x] Use public address interface to confirm addresses to identify
protocol

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
mordamax pushed a commit to paritytech-stg/polkadot-sdk that referenced this issue Sep 11, 2024
lexnv added a commit that referenced this issue Nov 15, 2024