metrics: Expose litep2p metrics in an agnostic manner #294

Open
wants to merge 32 commits into
base: master
from

Conversation

lexnv
Collaborator

@lexnv lexnv commented Nov 29, 2024

This PR optionally exposes litep2p metrics in an agnostic manner.

API Design

Litep2p currently supports the following primitives for registering and updating metrics:

/// A registry for metrics.
pub trait MetricsRegistryT: Send + Sync {
    /// Register a new counter.
    fn register_counter(&self, name: String, help: String) -> Result<MetricCounter, Error>;

    /// Register a new gauge.
    fn register_gauge(&self, name: String, help: String) -> Result<MetricGauge, Error>;
}


/// Represents a metric that can only go up.
pub trait MetricCounterT: Send + Sync {
    /// Increment the counter by `value`.
    fn inc(&self, value: u64);
}

/// Represents a metric that can arbitrarily go up and down.
pub trait MetricGaugeT: Send + Sync {
    /// Set the gauge to `value`.
    fn set(&self, value: u64);

    /// Increment the gauge.
    fn inc(&self);
...
}

Substrate can build on these primitives to expose Prometheus metrics and, if needed, switch to another metrics crate (such as prometheus-client), since litep2p itself is not tied to a specific backend.
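To make the shape of the API concrete, here is a minimal sketch of a backend that implements the traits with plain atomics. It is not the Substrate integration. The `MetricCounter` / `MetricGauge` aliases, the `Error` type, and the `AtomicMetric` and `InMemoryRegistry` names are assumptions made for this example, and only the gauge methods shown in the description are included.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Traits as quoted in the description (gauge limited to the methods shown there).
pub trait MetricCounterT: Send + Sync {
    fn inc(&self, value: u64);
}

pub trait MetricGaugeT: Send + Sync {
    fn set(&self, value: u64);
    fn inc(&self);
}

// Assumption: metric handles are shared trait objects.
pub type MetricCounter = Arc<dyn MetricCounterT>;
pub type MetricGauge = Arc<dyn MetricGaugeT>;

// Assumption: a placeholder error type for this sketch.
#[derive(Debug)]
pub struct Error(pub String);

pub trait MetricsRegistryT: Send + Sync {
    fn register_counter(&self, name: String, help: String) -> Result<MetricCounter, Error>;
    fn register_gauge(&self, name: String, help: String) -> Result<MetricGauge, Error>;
}

/// Hypothetical metric backed by a single atomic integer.
#[derive(Default)]
pub struct AtomicMetric(AtomicU64);

impl MetricCounterT for AtomicMetric {
    fn inc(&self, value: u64) {
        self.0.fetch_add(value, Ordering::Relaxed);
    }
}

impl MetricGaugeT for AtomicMetric {
    fn set(&self, value: u64) {
        self.0.store(value, Ordering::Relaxed);
    }

    fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }
}

/// Hypothetical registry that keeps every metric in a map keyed by name.
#[derive(Default)]
pub struct InMemoryRegistry {
    metrics: Mutex<HashMap<String, Arc<AtomicMetric>>>,
}

impl InMemoryRegistry {
    // Create and store a metric, rejecting duplicate names.
    fn register(&self, name: String) -> Result<Arc<AtomicMetric>, Error> {
        let mut metrics = self.metrics.lock().unwrap();
        if metrics.contains_key(&name) {
            return Err(Error(format!("metric `{name}` already registered")));
        }
        let metric = Arc::new(AtomicMetric::default());
        metrics.insert(name, metric.clone());
        Ok(metric)
    }

    /// Read the current value of a metric, if it exists.
    pub fn value(&self, name: &str) -> Option<u64> {
        let metrics = self.metrics.lock().unwrap();
        metrics.get(name).map(|m| m.0.load(Ordering::Relaxed))
    }
}

impl MetricsRegistryT for InMemoryRegistry {
    // The help text is ignored by this toy backend.
    fn register_counter(&self, name: String, _help: String) -> Result<MetricCounter, Error> {
        self.register(name).map(|m| m as MetricCounter)
    }

    fn register_gauge(&self, name: String, _help: String) -> Result<MetricGauge, Error> {
        self.register(name).map(|m| m as MetricGauge)
    }
}
```

Code that only holds a `MetricCounter` or `MetricGauge` handle never sees the backend type, which is what would allow the backend to be swapped without touching the call sites.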

Metrics Exposed

The exposed metrics inform the user about the state of litep2p components (such as the number of elements in the Kademlia store or the number of substreams Identify is negotiating) and help developers detect abnormal behavior (such as memory leaks, unbounded growth, or stalls in some protocols).

Transport Manager

  • number of incoming connections
  • number of outgoing connections
  • number of opening errors
  • number of managed peers
  • number of pending connections

Transport Layer (TCP + Websocket)

  • number of pending inbound connections
  • number of pending open connections
  • number of raw unnegotiated connections
  • number of pending substreams

Kademlia

  • number of total peers
  • number of engine running queries
  • number of executor queries
  • number of memory store records
  • number of memory store local providers
  • number of memory store providers
  • number of memory store provider refreshes

Identify / Ping

  • number of total peers
  • number of pending inbound substreams
  • number of pending outbound substreams

Request Response Protocol

These metrics are exposed for every req-resp protocol.

  • number of connected peers
  • number of pending dials
  • number of pending inbound substreams
  • number of pending inbound requests
  • number of pending outbound
  • number of outbound cancels
  • number of pending outbound responses

Notification Protocol

These metrics are exposed for every notification protocol.

  • number of connected peers
  • number of outbound initiated handshakes
  • number of outbound substreams
  • number of validation
  • number of ready substream handshakes
  • number of timers

Dashboards

[Four dashboard screenshots, taken 2024-12-03 between 15:02 and 15:03]

Review notes: Metric traits are defined in src/metrics.rs; they should give enough background for the metric registration and updates happening in the rest of the code.

lexnv added 12 commits November 29, 2024 13:18
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Ideally this should be Into<String>, but that way
the trait would not be object safe
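
The trade-off mentioned in this commit message can be illustrated with a small sketch. The `Registry` trait, the `NameLog` type, and the free `register_counter` helper below are hypothetical stand-ins, not litep2p items. A method with a generic type parameter (and no `where Self: Sized` bound) prevents a trait from being used as `dyn Trait`, so the trait method takes `String`, and a free function can restore the `Into<String>` ergonomics if desired.

```rust
use std::sync::Mutex;

// Hypothetical stand-in for a registry trait that must stay object safe.
pub trait Registry: Send + Sync {
    // Concrete `String` parameters keep the trait usable as `dyn Registry`.
    fn register_counter(&self, name: String, help: String) -> usize;

    // A generic method such as
    //     fn register_counter<S: Into<String>>(&self, name: S, help: S) -> usize;
    // would make `dyn Registry` a compile error, because generic methods
    // cannot be dispatched through a vtable.
}

/// Hypothetical implementation that records names and returns their index.
#[derive(Default)]
pub struct NameLog(Mutex<Vec<String>>);

impl Registry for NameLog {
    fn register_counter(&self, name: String, _help: String) -> usize {
        let mut names = self.0.lock().unwrap();
        names.push(name);
        names.len() - 1
    }
}

/// Free helper: generic at the call site, converts before the dynamic call.
pub fn register_counter(
    registry: &dyn Registry,
    name: impl Into<String>,
    help: impl Into<String>,
) -> usize {
    registry.register_counter(name.into(), help.into())
}
```

With this split, callers can pass `&str` literals through the helper while the registry is still stored behind a trait object.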

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv added the enhancement New feature or request label Nov 29, 2024
@lexnv lexnv self-assigned this Nov 29, 2024
lexnv added 10 commits November 29, 2024 17:43
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added 2 commits December 3, 2024 10:17
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added a commit that referenced this pull request Dec 3, 2024
Similar to #296, there is a
possibility of leaking memory in the following edge-case:
- T0: Connection is established and outbound substream is initiated with
peer
  - This maps the substream ID to the request bytes information
- T1: Connection is closed before the service has a chance to report
`TransportEvent::SubstreamOpened` or
`TransportEvent::SubstreamOpenFailure`

In this case, if we connect and immediately disconnect with a request in
flight, we are effectively leaking the request bytes.


Detected by:
- #294


### Dashboard

- We are leaking ~111 requests over a 3-day timespan:

<img width="1484" alt="Screenshot 2024-12-03 at 10 41 01"
src="https://github.com/user-attachments/assets/f6701017-4add-4aa1-aee1-e1f8d33d54f3">


cc @paritytech/networking

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
lexnv added 6 commits December 3, 2024 13:58
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>
@lexnv lexnv changed the title wip: Metrics metrics: Expose litep2p metrics in an agnostic manner Dec 3, 2024
lexnv added a commit that referenced this pull request Dec 3, 2024
This PR fixes a subtle memory leak that can happen in the following
edge-case situation:
- connection is established and an outbound substream is initiated with
the remote peer
- the substream ID is tracked until the substream either completes
successfully or fails
- the connection is closed soon after, leading to no substream events
ever being generated

For this edge case, we need to remove the tracking of the substream ID
when the connection is reported as closed.
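
The cleanup described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the fix: the `PendingRequests` type, its method names, the integer ID aliases, and the idea of storing the connection ID next to the request bytes are all hypothetical.

```rust
use std::collections::HashMap;

// Hypothetical ID types for this sketch.
type ConnectionId = u64;
type SubstreamId = u64;

/// Tracks request bytes for outbound substreams that have not yet
/// reported success or failure.
#[derive(Default)]
pub struct PendingRequests {
    pending: HashMap<SubstreamId, (ConnectionId, Vec<u8>)>,
}

impl PendingRequests {
    /// An outbound substream was initiated: remember the request bytes.
    pub fn on_substream_initiated(
        &mut self,
        connection: ConnectionId,
        substream: SubstreamId,
        request: Vec<u8>,
    ) {
        self.pending.insert(substream, (connection, request));
    }

    /// The substream opened or failed to open: stop tracking it.
    pub fn on_substream_event(&mut self, substream: SubstreamId) -> Option<Vec<u8>> {
        self.pending.remove(&substream).map(|(_, request)| request)
    }

    /// The connection closed: drop every entry that belongs to it, since no
    /// substream event will arrive for those entries anymore.
    /// Returns how many entries were removed.
    pub fn on_connection_closed(&mut self, connection: ConnectionId) -> usize {
        let before = self.pending.len();
        self.pending.retain(|_, (conn, _)| *conn != connection);
        before - self.pending.len()
    }

    /// Number of tracked entries; a gauge set to this value would expose
    /// unbounded growth on a dashboard.
    pub fn tracked(&self) -> usize {
        self.pending.len()
    }
}
```

Without the `on_connection_closed` step, an entry inserted just before a disconnect would stay in the map indefinitely.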

This has been detected after running a node for more than 2 days with
the following generic metrics PR:
- #294

Closes: #295

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <alexandru.vasile@parity.io>