Metric names are very unusual and inconvenient. #732

Closed
Monnoroch opened this issue Mar 8, 2018 · 13 comments

@Monnoroch

Metrics exported from the "production" images (not the all-in-one image) are not "namespaced". If you have one Prometheus instance that monitors dozens of apps, you can have thousands of metrics, and names like IndexCreate:attempts are not very friendly in that situation. Normally I would expect a binary named jaeger-collector to export metrics starting with jaeger_collector_. This is the case with the all-in-one binary, but not with the other ones.

Other less important issues:

  • It is common to export metrics with underscores, not CamelCase. This is also not an issue in the all-in-one binary, afaik.
  • The : in metric names is unique to Jaeger.
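
For illustration, a minimal sketch (assuming the Go Prometheus client library and a made-up metric name, not Jaeger's actual code) of how the Namespace and Subsystem options produce the kind of prefixed, underscore-separated names requested above:

    package main

    import (
        "log"
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        // Namespace and Subsystem are joined to the name with "_", so the
        // exported metric is "jaeger_collector_index_create_attempts_total".
        indexCreateAttempts := prometheus.NewCounter(prometheus.CounterOpts{
            Namespace: "jaeger",
            Subsystem: "collector",
            Name:      "index_create_attempts_total",
            Help:      "Attempts to create a storage index (hypothetical name).",
        })
        prometheus.MustRegister(indexCreateAttempts)

        indexCreateAttempts.Inc()
        http.Handle("/metrics", promhttp.Handler())
        log.Fatal(http.ListenAndServe(":8080", nil))
    }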
@yurishkuro
Member

yurishkuro commented Mar 8, 2018

+1 to add service name as a prefix

The CamelCase should be easy to fix - it just comes from the table names.

The : is used as a namespace separator, e.g. cassandra:Read:ReadTraces:latency_err_count. With your first suggestion it would become jaeger_query:cassandra:Read:ReadTraces:latency_err_count. However, this is because we are still using hierarchical names (#395) instead of tags for dimensions. I am not a fan of going all the way to tags, e.g.

latency{service="jaeger_query",storage="cassandra",path="read",action="ReadTraces",result="err"}

because the metric name can easily clash with something else that uses different dimensions.

Suggestions are welcome.

@Monnoroch
Author

because the metric name can easily clash with something else that uses different dimensions.

That was my original point: metrics should be "namespaced". I would expect the collector to export jaeger_latency{service="collector",storage="cassandra",path="read",action="ReadTraces",result="err"} and the agent to export jaeger_latency{service="agent",storage="cassandra",path="read",action="ReadTraces",result="err"} (btw I can't find the URL to get agent metrics on 1.2.0, but that's a whole other story).

That way metrics don't clash with other monitored software. For example, you can look at the metrics exported by official Prometheus products: Prometheus itself, node_exporter, and a few other "official" exporters from this page. My suggestion is to follow the standard Prometheus metric naming style, because most software does, and this will allow Jaeger to coexist with it very nicely. Also, the all-in-one binary already has nice metrics (except for the use of : instead of the standard _).

@yurishkuro
Member

These two examples, jaeger_latency{service="collector"... and jaeger_latency{service="agent"..., are insufficient to dedupe the metrics. The collector can very easily have another jaeger_latency metric with completely different dimensions (e.g. something about HTTP endpoints, nothing to do with storage), which will blow up the Prometheus client. So we'd instead do something like jaeger_collector:storage:latency{...} and jaeger_collector:http_server:latency{...}.
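
To make the clash concrete, here is a minimal sketch (assumed metric and label names, not Jaeger's actual instrumentation code) of what happens in the Go Prometheus client library when two collectors reuse the same metric name with different label sets:

    package main

    import (
        "fmt"

        "github.com/prometheus/client_golang/prometheus"
    )

    func main() {
        reg := prometheus.NewRegistry()

        // Storage latency, labeled by backend and operation.
        storageLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name: "jaeger_latency",
            Help: "Latency of operations.",
        }, []string{"service", "storage", "action", "result"})

        // HTTP handler latency: same name, different label set.
        httpLatency := prometheus.NewHistogramVec(prometheus.HistogramOpts{
            Name: "jaeger_latency",
            Help: "Latency of operations.",
        }, []string{"service", "handler", "status_code"})

        fmt.Println(reg.Register(storageLatency)) // <nil>: first registration succeeds
        fmt.Println(reg.Register(httpLatency))    // error: same name with different label names
    }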

@Monnoroch
Author

That's totally fine; it's easy to ensure metrics are named differently within the same project. Right now the bigger issue is ensuring they're unique between projects, so that different software can be scraped by a single Prometheus instance. I literally have a metric called IndexCreate:attempts right now, and this is not very user-friendly.

@yurishkuro
Member

I literally have a metric called IndexCreate:attempts now and this is not very user-friendly.

doh, that's just a bug - #733

@Monnoroch
Author

Nice bug-fixing latency, thanks! But still, you might want to think about aligning with the standard Prometheus metric naming conventions.

yurishkuro pushed a commit that referenced this issue Mar 9, 2018
yurishkuro added a commit that referenced this issue Mar 9, 2018
@ghost ghost removed the review label Mar 9, 2018
@yurishkuro yurishkuro reopened this Mar 9, 2018
@jpkrohling
Contributor

I really should learn more about Prometheus, but in the meantime, I asked @simonpasquier to share his thoughts, as he's a Prometheus developer.

I'm copying here his answer to me, hoping that it will bring some light to this issue:

Replacing ':' with '_' and not using CamelCase: I agree with the OP that this is the most common pattern. It isn't crucial, but it is still nice for consistency.

Taking the "cassandra:Read:ReadTraces:latency_err_count" example, let's look at the different options:
  • "jaeger_query:cassandra:Read:ReadTraces:latency_err_count" doesn't work well because it can't be aggregated. Parts of the metric name should definitely be labels.
  • latency{service="jaeger_query",storage="cassandra",path="read",action="ReadTraces",result="err"} isn't correct either. You will probably never want to aggregate latency metrics irrespective of the labels (e.g. "sum(rate(latency[5m]))"). As Yuri points out, you can have latency metrics for Cassandra and for HTTP endpoints, and it doesn't make sense to combine them.
  • A middle ground would be jaeger_query_cassandra_latency_err_count{path="read",action="ReadTraces"}. But as I understand it, Jaeger has a wrapper around the Prometheus instrumentation library, so it may not be easy to change that.

Also regarding these metrics in your blog post:
jaeger:baggage_updates_total{result="err",} 0.0
jaeger:baggage_updates_total{result="ok",} 0.0

It is usually more convenient to have one counter for the errors and another one for all the events including errors (see [1]):

jaeger:baggage_updates_error_total 0.0
jaeger:baggage_updates_total 0.0

[1] https://docs.google.com/presentation/d/1aRtot7Zv_7MGS2nB96MLtzVRICEVFQCd5H_Y3TzR_-4/edit?usp=sharing
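
As a rough sketch of the "separate error counter plus overall counter" convention described above (the metric names and the helper function here are made up for illustration, not what Jaeger emits):

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    var (
        // All baggage update attempts, including the failed ones.
        baggageUpdatesTotal = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "jaeger_baggage_updates_total",
            Help: "Baggage update attempts, including failures (hypothetical name).",
        })
        // Failed baggage updates only.
        baggageUpdateErrors = prometheus.NewCounter(prometheus.CounterOpts{
            Name: "jaeger_baggage_update_errors_total",
            Help: "Failed baggage updates (hypothetical name).",
        })
    )

    func init() {
        prometheus.MustRegister(baggageUpdatesTotal, baggageUpdateErrors)
    }

    func recordBaggageUpdate(err error) {
        baggageUpdatesTotal.Inc()
        if err != nil {
            baggageUpdateErrors.Inc()
        }
    }

The error ratio then stays a simple PromQL expression, e.g. rate(jaeger_baggage_update_errors_total[5m]) / rate(jaeger_baggage_updates_total[5m]).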

@jpkrohling
Contributor

@yurishkuro , could you please assign this to me?

@svenwltr

Talking about aggregation: it is also very hard to use the jaeger_collector:jaeger:spans_by_svc:my_service metrics, because it is not easy to aggregate them or to display all (or a subset) of them in Grafana. IMO the metrics should be named something like jaeger_collector:jaeger:spans and the service should just be a separate label (e.g. service), as in the sketch below.

Not sure if this belongs in a separate issue.
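
A sketch of what that could look like with the Go Prometheus client (a hypothetical example; the metric, label, and function names are assumptions, not the names Jaeger ended up using):

    package metrics

    import "github.com/prometheus/client_golang/prometheus"

    // One counter for all services, with the service name as a label,
    // instead of one metric name per service (...:spans_by_svc:my_service).
    var spansReceived = prometheus.NewCounterVec(prometheus.CounterOpts{
        Namespace: "jaeger",
        Subsystem: "collector",
        Name:      "spans_received_total",
        Help:      "Spans received by the collector, partitioned by service (hypothetical name).",
    }, []string{"svc"})

    func init() {
        prometheus.MustRegister(spansReceived)
    }

    func countSpan(serviceName string) {
        spansReceived.WithLabelValues(serviceName).Inc()
    }

With a single label-partitioned metric, Grafana queries can aggregate or filter freely, e.g. sum by (svc) (rate(jaeger_collector_spans_received_total[5m])).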

@jpkrohling
Contributor

@svenwltr, could you please check the Gist with the new metric names as per #776? Is the new scheme acceptable?

@svenwltr

Yes it is. This is exactly what I meant. Thank you!

@yurishkuro
Member

can this be closed post 1.5 release, or is there something remaining?

@jpkrohling
Contributor

I'm closing this one, as there's been no response since end of May. If you believe this issue is still happening, feel free to reopen.
