-
Notifications
You must be signed in to change notification settings - Fork 690
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple collators started on same machine conflict on prometheus exporter default port 9616 #5689
Comments
Hi @iulianbarbu, I think you can also pass the |
Hi @pepoviola ! Thanks for jumping in. Would be great if you can reference this issue on the associated issue/PR on zombienet. I can also take a look/help with it once my bandwidth allows it, if not fixed by then. It is not urgent to be fixed in zombienet either, I've opened this ticket to clarify what's the best way to go around this default 9616 for prometheus exporter of the relay chain rpc interface (when managed through zombienet initially, and to a bigger extent if it applies to other usecases). Separately, I think I just understood the codepaths related to configuring the prometheus params for the relay chain part. Somehow the relay chain args make their way up to What should remain on this ticket is the the error logging, and opening an issue on zombienet side. Do you agree? |
Seems you already did it. Just the error logging then. |
) # Description When we start a node with connections to external RPC servers (as a minimal node), we lack metrics around how many individual calls we're doing to the remote RPC servers and their duration. This PR adds metrics that measure durations of each RPC call made by the minimal nodes, and implicitly how many calls there are. Closes #5409 Closes #5689 ## Integration Node operators should be able to track minimal node metrics and decide appropriate actions according to how the metrics are interpreted/felt. The added metrics can be observed by curl'ing the prometheus metrics endpoint for the ~relaychain~ parachain (it was changed based on the review). The metrics are represented by ~`polkadot_parachain_relay_chain_rpc_interface`~ `relay_chain_rpc_interface` namespace (I realized lining up `parachain_relay_chain` in the same metric might be confusing :). Excerpt from the curl: ``` relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.001"} 15 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.004"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.016"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.064"} 23 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="0.256"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="1.024"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="4.096"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="16.384"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="65.536"} 24 relay_chain_rpc_interface_bucket{method="chain_getBlockHash",chain="rococo_local_testnet",le="+Inf"} 24 relay_chain_rpc_interface_sum{method="chain_getBlockHash",chain="rococo_local_testnet"} 0.11719075 relay_chain_rpc_interface_count{method="chain_getBlockHash",chain="rococo_local_testnet"} 24 ``` ## Review Notes The way we measure durations/hits is based on `HistogramVec` struct which allows us to collect timings for each RPC client method called from the minimal node., It can be extended to measure the RPCs against other dimensions too (status codes, response sizes, etc). The timing measuring is done at the level of the `relay-chain-rpc-interface`, in the `RelayChainRpcClient` struct's method 'request_tracing'. A single entry point for all RPC requests done through the relay-chain-rpc-interface. The requests durations will fall under exponential buckets described by start `0.001`, factor `4` and count `9`. --------- Signed-off-by: Iulian Barbu <iulian.barbu@parity.io>
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
Description
I started zombienet natively with
cumulus/zombienet/tests/0006_rpc-collator-builds-blocks.toml
. The network topology has two collators with minimal relay chain nodes which try to setup prometheus by creating a thin hyper server that responds to/metrics
requests to localhost:9616. The hyper server is separate from the similar hyper server for the parachain part of the same collator. Both collators will race towards binding to 9616, resulting in a failure for one of them, which is not necessarily obvious when looking at the logs of the collator which can not bind. The relaychain minimal node's prometheus exporter is somewhat hidden and can be found in the logs if successful when setting it up, or through open ports exploration, but either way, not very well documented and straightfoward to refer to.To consider for a possible fix (maybe others too)
1. Expose the relay chain prometheus exporter ports through a CLI arg to be able to set it to specific unused ports. This is relevant for:(We have support to pass aa. testing with multiple collators on the same machine.
b. node operators that want to set the relaychain prometheus exporter port to something else than 9616.
c. node operators/devs who don't know the relaychain prometheus exporter exists and find it by reading the code (or by reading the logs if its setup succeeds). If this is documented a bit more clearly in public docs this point can be ignored.
--prometheus-port
flag to the relaychain part of the args received by a polkadot-parachain node already, which makes this point irrelevant).2. We could also bind to a random unused port, don't act on 1 and let devs/operators rely on code reading and logs. It isn't very DX friendly. I would avoid this considering the first point would fix this the good way and it isn't very complex either.(First point is not relevant and this one doesn't matter anymore).Steps to reproduce
This will start only the relay chain minimal node's prometheus exporter of
dave
collator, whileeve
exporter fails with no way of telling why (and more problematic if not knowing that it should be started, given it can be found out only if reading the code* for the relay chain rpc interface setup).* I am not completely sure if the relaychain rpc interface's prometheus exporter startup is documented clearly outside of code though.(there is a way to pass relaychain args including the prometheus port when starting a polkadot-parachain node)The text was updated successfully, but these errors were encountered: