Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

An error has occurred while serving routing_http_client metrics #9891

Open
Tracked by #10000
lidel opened this issue May 23, 2023 · 5 comments
Open
Tracked by #10000

An error has occurred while serving routing_http_client metrics #9891

lidel opened this issue May 23, 2023 · 5 comments
Labels
exp/intermediate Prior experience is likely helpful kind/bug A bug in existing code (including security flaws) P2 Medium: Good to have, but can wait until someone steps up topic/IPNI Issues related to InterPlanetary Network Indexer topic/routing Topic routing

Comments

@lidel
Copy link
Member

lidel commented May 23, 2023

Version

0.20.0

Config

"Routing": {
		"Methods": null,
		"Routers": null,
		"Type": "auto"
	},

Description

http://127.0.0.1:5001/debug/metrics/prometheus ends up in broken state after a multiple days of uptime:

An error has occurred while serving metrics:

14 error(s) occurred:
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"404" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17741 sample_sum:1.5591060000000058e+06 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:3130 upper_bound:10 > bucket:<cumulative_count:6917 upper_bound:20 > bucket:<cumulative_count:13167 upper_bound:50 > bucket:<cumulative_count:13207 upper_bound:100 > bucket:<cumulative_count:14781 upper_bound:200 > bucket:<cumulative_count:17124 upper_bound:500 > bucket:<cumulative_count:17739 upper_bound:1000 > bucket:<cumulative_count:17740 upper_bound:2000 > bucket:<cumulative_count:17741 upper_bound:5000 > bucket:<cumulative_count:17741 upper_bound:10000 > bucket:<cumulative_count:17741 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Canceled" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:490 sample_sum:33772.00000000005 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:3 upper_bound:20 > bucket:<cumulative_count:251 upper_bound:50 > bucket:<cumulative_count:415 upper_bound:100 > bucket:<cumulative_count:470 upper_bound:200 > bucket:<cumulative_count:490 upper_bound:500 > bucket:<cumulative_count:490 upper_bound:1000 > bucket:<cumulative_count:490 upper_bound:2000 > bucket:<cumulative_count:490 upper_bound:5000 > bucket:<cumulative_count:490 upper_bound:10000 > bucket:<cumulative_count:490 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"200" > label:<name:"error" value:"None" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:19 sample_sum:4727 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:5 upper_bound:200 > bucket:<cumulative_count:18 upper_bound:500 > bucket:<cumulative_count:19 upper_bound:1000 > bucket:<cumulative_count:19 upper_bound:2000 > bucket:<cumulative_count:19 upper_bound:5000 > bucket:<cumulative_count:19 upper_bound:10000 > bucket:<cumulative_count:19 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DeadlineExceeded" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:79 sample_sum:789989 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:23 upper_bound:10000 > bucket:<cumulative_count:79 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:30 sample_sum:41185 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:9 upper_bound:10 > bucket:<cumulative_count:19 upper_bound:20 > bucket:<cumulative_count:21 upper_bound:50 > bucket:<cumulative_count:21 upper_bound:100 > bucket:<cumulative_count:21 upper_bound:200 > bucket:<cumulative_count:21 upper_bound:500 > bucket:<cumulative_count:21 upper_bound:1000 > bucket:<cumulative_count:22 upper_bound:2000 > bucket:<cumulative_count:26 upper_bound:5000 > bucket:<cumulative_count:30 upper_bound:10000 > bucket:<cumulative_count:30 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"DNSTimeout" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:2 sample_sum:17000 bucket:<cumulative_count:0 upper_bound:1 > bucket:<cumulative_count:0 upper_bound:2 > bucket:<cumulative_count:0 upper_bound:5 > bucket:<cumulative_count:0 upper_bound:10 > bucket:<cumulative_count:0 upper_bound:20 > bucket:<cumulative_count:0 upper_bound:50 > bucket:<cumulative_count:0 upper_bound:100 > bucket:<cumulative_count:0 upper_bound:200 > bucket:<cumulative_count:0 upper_bound:500 > bucket:<cumulative_count:0 upper_bound:1000 > bucket:<cumulative_count:0 upper_bound:2000 > bucket:<cumulative_count:0 upper_bound:5000 > bucket:<cumulative_count:2 upper_bound:10000 > bucket:<cumulative_count:2 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_length" { label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:17760 sample_sum:67.00000000000001 bucket:<cumulative_count:17741 upper_bound:1 > bucket:<cumulative_count:17742 upper_bound:2 > bucket:<cumulative_count:17753 upper_bound:5 > bucket:<cumulative_count:17760 upper_bound:10 > bucket:<cumulative_count:17760 upper_bound:11 > bucket:<cumulative_count:17760 upper_bound:12 > bucket:<cumulative_count:17760 upper_bound:15 > bucket:<cumulative_count:17760 upper_bound:20 > bucket:<cumulative_count:17760 upper_bound:50 > bucket:<cumulative_count:17760 upper_bound:100 > bucket:<cumulative_count:17760 upper_bound:200 > bucket:<cumulative_count:17760 upper_bound:500 > > } was collected before with the same name and label values

Seems we have a bug in boxo/routing/http/client, needs analysis.

@lidel lidel added kind/bug A bug in existing code (including security flaws) need/triage Needs initial labeling and prioritization labels May 23, 2023
@lidel lidel moved this to 🥞 Todo in IPFS Shipyard Team May 23, 2023
@lidel lidel added P0 Critical: Tackled by core team ASAP topic/routing Topic routing exp/intermediate Prior experience is likely helpful topic/IPNI Issues related to InterPlanetary Network Indexer and removed need/triage Needs initial labeling and prioritization labels May 23, 2023
@BigLep BigLep mentioned this issue May 23, 2023
@lidel
Copy link
Member Author

lidel commented May 23, 2023

I was unable to reproduce this with vanilla Docker image for 0.20.0,
but I can reproduce it every time with my own repo by asking for a CID without any providers or setting /etc/hosts to use 127.0.0.1 for cid.contact so it always fails:

An error has occurred while serving metrics:

2 error(s) occurred:
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values
* collected metric "routing_http_client_latency" { label:<name:"code" value:"0" > label:<name:"error" value:"Net" > label:<name:"host" value:"cid.contact" > label:<name:"operation" value:"FindProviders" > histogram:<sample_count:36 sample_sum:0 bucket:<cumulative_count:36 upper_bound:1 > bucket:<cumulative_count:36 upper_bound:2 > bucket:<cumulative_count:36 upper_bound:5 > bucket:<cumulative_count:36 upper_bound:10 > bucket:<cumulative_count:36 upper_bound:20 > bucket:<cumulative_count:36 upper_bound:50 > bucket:<cumulative_count:36 upper_bound:100 > bucket:<cumulative_count:36 upper_bound:200 > bucket:<cumulative_count:36 upper_bound:500 > bucket:<cumulative_count:36 upper_bound:1000 > bucket:<cumulative_count:36 upper_bound:2000 > bucket:<cumulative_count:36 upper_bound:5000 > bucket:<cumulative_count:36 upper_bound:10000 > bucket:<cumulative_count:36 upper_bound:20000 > > } was collected before with the same name and label values

Will try to poke more, want to confirm if this is or is not something that impacts ipfs.io bifrost-infra.

@lidel lidel self-assigned this May 23, 2023
@lidel
Copy link
Member Author

lidel commented May 23, 2023

bifrost-infra box running 0.20 did not experience this, which was a good sign.

I've focused on the differences in config, and the problem only occurs if RPC API is exposed on more than one port.
Each time we expose a listener, it re-register metrics, which then creates the surface for this bug.

TLDR:

  • The bug can be reproduced every time ONLY with custom config that has more than one listener defined in Addresses.API
  • There is no bug with the default config, so moving this to P2.

@lidel lidel added P2 Medium: Good to have, but can wait until someone steps up and removed P0 Critical: Tackled by core team ASAP labels May 23, 2023
@lidel lidel removed their assignment May 23, 2023
@Mayeu
Copy link

Mayeu commented Jan 24, 2024

Hello,

We are getting this error on our two nodes as well (running Kubo 0.24.0), but we don't have multiple listener setup:

"Addresses": {
    "API": "/ip4/127.0.0.1/tcp/5001",
}

When I restart kubo, there is a few seconds during which the metrics are printed, and then this error shows up.

Anything I can provide to help debug this?

@Mayeu
Copy link

Mayeu commented Jan 25, 2024

Same situation with Kubo 0.25 (I did not yet tried 0.26 since I'm waiting for it to be released in the unstable branch of NixOS).

@gammazero
Copy link
Contributor

This can be reproduced in v0.29.0 by:

ipfs init
edit ~/.ipfs/config
  "Addresses": {
    "API": ["/ip4/127.0.0.1/tcp/5001", "/ip6/::1/tcp/5001"],
ipfs daemon
echo hello | ipfs add
open http://127.0.0.1:5001/debug/metrics/prometheus

gammazero added a commit that referenced this issue Aug 21, 2024
When there are multiple listeners configured for Addresses.API, serving metrics results in an errors: "<metric> was collected before with the same name and label values".

This PR fixes this by maintaining a global map of metrics handlers, and only creating and reginstering them once. The same metrics handlers are provided to the mux for every listener.

Fixes #9891
Fixes #9397

Unblocks #9637
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
exp/intermediate Prior experience is likely helpful kind/bug A bug in existing code (including security flaws) P2 Medium: Good to have, but can wait until someone steps up topic/IPNI Issues related to InterPlanetary Network Indexer topic/routing Topic routing
Projects
No open projects
Status: 🥞 Todo
Development

Successfully merging a pull request may close this issue.

3 participants