Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RPC Collator: Introduce Useful Metrics #5409

Closed
skunert opened this issue Aug 19, 2024 · 0 comments · Fixed by #5572
Closed

RPC Collator: Introduce Useful Metrics #5409

skunert opened this issue Aug 19, 2024 · 0 comments · Fixed by #5572
Assignees

Comments

@skunert
Copy link
Contributor

skunert commented Aug 19, 2024

When starting a collator node there is an option to connect to external relay chain nodes via --relay-chain-rpc-urls.

The spawned RelayChainRpcInterface performs the RPC calls to the relay chain through RelayChainRpcClient. It is used internally by the collator as well as in the minimal-nodes subsystems.

In the past we had instances where a huge amount of requests in combination with longer answer times led to complications in the subsystems. To allow better monitoring, we need some metrics on the RelayChainRpcClient.

We should expose metrics for the average timing information as well as a aggregates of the counts.

A nice starting point is to spin up a network using this zombienet test: https://github.com/paritytech/polkadot-sdk/blob/master/cumulus/zombienet/tests/0006-rpc_collator_builds_blocks.toml

From there you can discover the prometheus metrics that are exposed.

github-merge-queue bot pushed a commit that referenced this issue Sep 19, 2024
)

# Description

When we start a node with connections to external RPC servers (as a
minimal node), we lack metrics around how many individual calls we're
doing to the remote RPC servers and their duration. This PR adds metrics
that measure durations of each RPC call made by the minimal nodes, and
implicitly how many calls there are.

Closes #5409 
Closes #5689

## Integration

Node operators should be able to track minimal node metrics and decide
appropriate actions according to how the metrics are interpreted/felt.
The added metrics can be observed by curl'ing the prometheus metrics
endpoint for the ~relaychain~ parachain (it was changed based on the
review). The metrics are represented by
~`polkadot_parachain_relay_chain_rpc_interface`~
`relay_chain_rpc_interface` namespace (I realized lining up
`parachain_relay_chain` in the same metric might be confusing :).
Excerpt from the curl:

```
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24
relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24
relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075
relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24
```

## Review Notes

The way we measure durations/hits is based on `HistogramVec` struct
which allows us to collect timings for each RPC client method called
from the minimal node., It can be extended to measure the RPCs against
other dimensions too (status codes, response sizes, etc). The timing
measuring is done at the level of the `relay-chain-rpc-interface`, in
the `RelayChainRpcClient` struct's method 'request_tracing'. A single
entry point for all RPC requests done through the
relay-chain-rpc-interface. The requests durations will fall under
exponential buckets described by start `0.001`, factor `4` and count
`9`.

---------

Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: done
Development

Successfully merging a pull request may close this issue.

2 participants