
Metric names between OpenCensus & Prometheus exporters are different #2174

Open
dprotaso opened this issue Jun 28, 2021 · 14 comments
Assignees
Labels
area/monitoring kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@dprotaso
Member

dprotaso commented Jun 28, 2021

/area monitoring
/kind cleanup

We currently prepend a 'domain' and the 'component' name to the metrics. This is done for our OpenCensus exporter, which might have been cargo-culted from our Stackdriver exporter implementation. The Prometheus exporter only prefixes the 'component' name to the metric.

```go
metricsPrefix := path.Join(config.domain, config.component)
```

We should determine whether these should be consistent, whether there should be a prefix at all, and what its format should be.
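
To make the difference concrete, here is a minimal, self-contained Go sketch of the two behaviours described above (the config fields, helper functions, and example domain are illustrative, not the actual knative.dev/pkg code):

```go
// Illustrative only – not the actual knative.dev/pkg code. Shows how the two
// exporters end up with different names for the same metric.
package main

import (
	"fmt"
	"path"
)

type metricsConfig struct {
	domain    string // e.g. "knative.dev/internal/serving" (example value)
	component string // e.g. "activator"
}

// openCensusName mimics the OpenCensus exporter behaviour described above:
// both the domain and the component are joined into the metric name.
func openCensusName(c metricsConfig, metric string) string {
	return path.Join(c.domain, c.component, metric)
}

// prometheusName mimics the Prometheus exporter behaviour: only the component
// is used as the metric namespace.
func prometheusName(c metricsConfig, metric string) string {
	return c.component + "_" + metric
}

func main() {
	c := metricsConfig{domain: "knative.dev/internal/serving", component: "activator"}
	fmt.Println(openCensusName(c, "request_count")) // knative.dev/internal/serving/activator/request_count
	fmt.Println(prometheusName(c, "request_count")) // activator_request_count
}
```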

cc @upodroid @skonto @evankanderson @MontyCarter @pianohacker

@knative-prow-robot knative-prow-robot added area/monitoring kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jun 28, 2021
@dprotaso
Member Author

From @pianohacker re: prepending a domain prefix for Prometheus: https://knative.slack.com/archives/CEPGNPHP1/p1624558123099100

This level of namespacing on the metric name isn't typically done for Prometheus metrics; see https://prometheus.io/docs/practices/naming/
The source (as opposed to the kind) of a Prometheus-exposed metric is intended to be distinguished by its labels (in this case, namespace).
In that vein, could we expose the ability to configure one or more labels on the exposed metrics?
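
As a rough sketch of that label-based approach, here is what carrying the source as an OpenCensus tag (which the Prometheus exporter emits as a label) could look like. The `namespace` key and the metric definition are purely illustrative, not existing knative.dev/pkg configuration:

```go
// Sketch only: distinguish the source of a metric with a tag/label instead of
// a name prefix. Not existing knative.dev/pkg API.
package main

import (
	"context"
	"log"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

var (
	// A configurable label, e.g. the suggested "namespace".
	namespaceKey = tag.MustNewKey("namespace")

	requestCount = stats.Int64("request_count", "Number of requests", stats.UnitDimensionless)
)

func main() {
	if err := view.Register(&view.View{
		Name:        "request_count", // no domain/component prefix in the name
		Measure:     requestCount,
		Aggregation: view.Count(),
		TagKeys:     []tag.Key{namespaceKey}, // source carried as a label instead
	}); err != nil {
		log.Fatal(err)
	}

	// Record one request attributed to the "default" namespace.
	ctx, _ := tag.New(context.Background(), tag.Insert(namespaceKey, "default"))
	stats.Record(ctx, requestCount.M(1))
}
```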

@dprotaso dprotaso changed the title Metric names between OpenCensus & Prometheus are different Metric names between OpenCensus & Prometheus exporters are different Jun 28, 2021
@dprotaso
Member Author

cc @evankanderson who reviewed the OpenCensus exporter changes

@skonto
Contributor

skonto commented Jun 29, 2021

Regarding Prometheus, this is what I read, and what we have does not seem to violate the recommendations unless I misread something:

...should have a (single-word) application prefix relevant to the domain the metric belongs to. The prefix is sometimes referred to as namespace by client libraries. For metrics specific to an application, the prefix is usually the application name itself. Sometimes, however, metrics are more generic, like standardized metrics exported by client libraries. Examples:
prometheus_notifications_total (specific to the Prometheus server)
process_cpu_seconds_total (exported by many client libraries)
http_request_duration_seconds (for all HTTP requests)

I am not sure what value the domain adds here; we already use the component name as the prefix, and we know the domain for each component. It is straightforward for the control-plane ones, e.g. Serving, Eventing.
Maybe there is some value for user services in being able to correlate queue-proxy metrics with some domain, e.g. my_business_domain, but then Prometheus queries would have to detect that with some regex, which does not make life easier (especially if we have arbitrary domains as prefixes), e.g. for building generic dashboards. Also note, though, that having queue_proxy as a metric prefix has worked well so far.
However, there is a valid need to be able to configure existing labels or add custom ones, e.g. tag metrics with special labels to avoid joining metric info from kube-state-metrics or other sources. I am facing this downstream, as I would like to tag metrics with the func tag for Boson funcs, since not all services are functions.
What issues do we see with the current status? Are there users of the OpenCensus exporter that benefit from the extra domain info?

@dprotaso
Member Author

@skonto @evankanderson during the operations WG meeting it was mentioned that you wrote a doc about stable metric names and a potential deprecation policy. Do you have a link to that?

Whatever we decide with this change we'll want to roll it out gradually with a flag.

@evankanderson
Member

evankanderson commented Jun 29, 2021

The domain name is used today by at least Stackdriver, though they could set up a mapping to enrich the content:

https://cloud.google.com/monitoring/api/metrics_knative

While Prometheus tends to prefer short, non-namespaced metric names, some other products like Stackdriver and Wavefront tend to prefer somewhat more nested names (deeply nested for Stackdriver, moderately for Wavefront). Based on the principle that it's easier to remove data than to add it, I incorporated the domain and component labels in OpenCensus, though I'd be happy with an alternate mechanism that didn't discard these labels.

Note that there are some metrics-exporting components from both Serving and Eventing that run in a user's namespace, so having some mechanism (metric name or labels) to distinguish metrics between the two would be useful. At the same time, common metrics like Go GC information should have the same metric name for both.

Maybe a few examples from our existing documentation:

| Metric (OpenCensus name) | Type | Prefix reason |
| --- | --- | --- |
| serving/activator/request_count | HTTP request count | Prefix with serving component, because activator request count may be double-counted with e.g. queue-proxy counts |
| serving/revision/request_count | HTTP request count | Prefix with revision; counted at the queue-proxy. May double-count with activator. |
| go_alloc | Go memory stats | No prefix; common Go instrumentation across components. |
| eventing/broker/event_count | Events received count | Prefix with eventing component, because events are counted at several places in the system. |
| eventing/broker/request_count | HTTP request count | Prefix with eventing component; not all HTTP requests may result in an event receipt. |
| eventing/trigger/event_count | Post-filter event delivery counts | Prefix with eventing component, because events are counted at several places in the system. |
| eventing/trigger/http_requests | Post-filter HTTP request counts | Prefix with eventing component; not all HTTP requests may |

@MontyCarter
Contributor

Yes, Stackdriver is using the domain, and the domain is different based on the component. However, the metric domains are pretty static, so it would be manageable to keep a map in an OTel collector config and prepend the domain to the metric name based on the sender.

With that said, I think it would be much nicer to pass the domain as a label rather than simply removing the domain from the metric name. This would make it easy to rewrite the name from the domain label in an OTel collector config.

@evankanderson
Member

Summary of discussion from 6 July:

@dprotaso
Member Author

dprotaso commented Jul 6, 2021

Direct link to the meeting notes (2021/07/06)

@dprotaso
Member Author

dprotaso commented Jul 6, 2021

We should also look at the standardization going on in https://github.com/open-telemetry/opentelemetry-specification/tree/main/specification/metrics/semantic_conventions, particularly http-metrics.md.

Probably would be part of #855
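
For reference, a minimal sketch of recording an HTTP server metric under those conventions, using the current opentelemetry-go metric API (the instrument name and attribute keys are taken from the linked http-metrics.md and may change as the spec evolves; the meter name is illustrative):

```go
// Sketch only: an OTel-convention HTTP server duration metric.
package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

func main() {
	// Without an SDK MeterProvider registered this is a no-op meter,
	// which is fine for illustrating the naming.
	meter := otel.Meter("knative.dev/example")

	duration, err := meter.Float64Histogram(
		"http.server.duration", // convention name from http-metrics.md
		metric.WithUnit("ms"),
		metric.WithDescription("Duration of inbound HTTP requests"),
	)
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	// ... handle the request ...
	duration.Record(context.Background(), float64(time.Since(start).Milliseconds()),
		metric.WithAttributes(
			attribute.String("http.method", "GET"),
			attribute.Int("http.status_code", 200),
		),
	)
}
```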

@dprotaso
Member Author

/assign @dprotaso

@github-actions
Contributor

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 18, 2021
@upodroid
Member

/remove-lifecycle stale

@knative-prow-robot knative-prow-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 20, 2021
@github-actions
Contributor

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2022
@dprotaso
Member Author

/reopen
/lifecycle frozen

@dprotaso dprotaso reopened this Feb 18, 2022
@knative-prow-robot knative-prow-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 18, 2022