Possibility of using a more lenient handling of errors in individual samples #1543

thoraage · 2024-06-21T06:57:46Z

When the client find that a help description have changed it creates an error that from my otel-collectors perspective look like this:

2024-06-21T06:17:07.853Z        error   prometheusexporter@v0.99.0/log.go:23    error gathering metrics: 2 error(s) occurred:
* collected metric undertow_request_count_total label:{name:"app"  value:"xxxear-7.185.0-SNAPSHOT.ear"}  label:{name:"deployment"  value:"xxxear-7.185.0-SNAPSHOT.ear"}  label:{name:"job"  value:"wildfly"}  label:{name:"name"  value:"jakarta.ws.rs.core.Application"}  label:{name:"subdeployment"  value:"web-resources-7.185.0-SNAPSHOT.war"}  label:{name:"type"  value:"servlet"}  counter:{value:0} has help "Number of all requests" but should have "The number of requests this listener has served"
* collected metric undertow_request_count_total label:{name:"app"  value:"xxxear-7.185.0-SNAPSHOT.ear"}  label:{name:"deployment"  value:"xxxear-7.185.0-SNAPSHOT.ear"}  label:{name:"job"  value:"wildfly"}  label:{name:"name"  value:"com.yyy.xxx.api.XxxRestApplication"}  label:{name:"subdeployment"  value:"internt-tjenestelag-7.185.0-SNAPSHOT.war"}  label:{name:"type"  value:"servlet"}  counter:{value:484} has help "Number of all requests" but should have "The number of requests this listener has served"
        {"kind": "exporter", "data_type": "metrics", "name": "prometheus"}
github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter.(*promLogger).Println
        github.com/open-telemetry/opentelemetry-collector-contrib/exporter/prometheusexporter@v0.99.0/log.go:23
github.com/prometheus/client_golang/prometheus/promhttp.HandlerForTransactional.func1
        github.com/prometheus/client_golang@v1.19.0/prometheus/promhttp/http.go:144
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2166
net/http.(*ServeMux).ServeHTTP
        net/http/server.go:2683
go.opentelemetry.io/collector/config/confighttp.(*decompressor).ServeHTTP
        go.opentelemetry.io/collector/config/confighttp@v0.99.0/compression.go:160
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.(*middleware).serveHTTP
        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:214
go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp.NewMiddleware.func1.1
        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.50.0/handler.go:72
net/http.HandlerFunc.ServeHTTP
        net/http/server.go:2166
go.opentelemetry.io/collector/config/confighttp.(*clientInfoHandler).ServeHTTP
        go.opentelemetry.io/collector/config/confighttp@v0.99.0/clientinfohandler.go:26
net/http.serverHandler.ServeHTTP
        net/http/server.go:3137
net/http.(*conn).serve
        net/http/server.go:2039

The error is raised here:

client_golang/prometheus/registry.go

Line 640 in 33697e6

"collected metric %s %s has help %q but should have %q",

It's an error originally by the application server (in this case wildfly), but it abrupts the metric collection for one bad metric.

Is it possible to issue a warning on the logs instead?

Concerning the metric sample there is a choice whether to stick to the original description, adopt the new one or simply drop the metric sample.

The text was updated successfully, but these errors were encountered:

ArthurSens · 2024-06-22T14:51:37Z

Hey 👋

In the Prometheus ecosystem, a single scrape from a single target is seen as a transaction. Client_golang was designed and implemented with the pull-based model in mind, where all metrics from a registry will be exposed together in an HTTP request. Client_golang follows the same transaction philosophy, which means that we serve all metrics currently in-store, or we don't serve any at all. In the example you've shown, a single MetricFamily undertow_request_count is not following the specification when it shows 2 different HELP metadata. Since one metric cannot be processed, we fail the whole transaction here.

I can see how this philosophy might not make sense in a push-based system, but I don't think we can change the current behavior without breaking the pull-based behavior 😬

bwplotka · 2024-06-25T12:17:27Z

While I agree with what @ArthurSens said, I also think it's ok for client_golang to help with the use case like exposing metrics that process does not control (aka exporter model in Prometheus world) e.g. KSM, cadivsor, node exporter etc. It can happen that underlying data e.g. cgroup metrics (let's imagine they have some help) have inconsistent help description.

Ideally help is "unified" before it's registered. However to help with this case perhaps registry could have some options (or registry wrapper should exists) that either ignores this error and picks first help OR allows custom logic. Help wanted to enhance that (in non-breaking change). 🤗 I wouldn't change current error in current flow.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possibility of using a more lenient handling of errors in individual samples #1543

Possibility of using a more lenient handling of errors in individual samples #1543

thoraage commented Jun 21, 2024

ArthurSens commented Jun 22, 2024

bwplotka commented Jun 25, 2024

Possibility of using a more lenient handling of errors in individual samples #1543

Possibility of using a more lenient handling of errors in individual samples #1543

Comments

thoraage commented Jun 21, 2024

ArthurSens commented Jun 22, 2024

bwplotka commented Jun 25, 2024