Why doesn't gauges support exemplars? #241
Comments
Gauges are a snapshot of something like memory used, whereas exemplars relate to an event that changed them. Only counter-type metrics have events which could then have exemplars. If you think you want an exemplar for a gauge, then you probably want event logging, which is a non-goal of OpenMetrics. |
@brian-brazil The case is as follows: I'm not sure what you mean by event logging, could you clarify? |
OpenMetrics and Prometheus are the wrong tools for the job here, as what you're describing is the output of a log processing pipeline rather than a metrics use case. |
I would disagree. The output of such a model is a time series parallel (with a one-to-one mapping) to the trace durations. It's a number, not a log output. In some sense, this is indeed a metrics processing pipeline, but the result is another metric that needs to be visualised and correlated. I'm not sure what other tool you propose I would use? If this is still outside of OpenMetrics, that's another matter. |
Traces are fundamentally logs. Logs can and often do contain numbers, but that they contain numbers doesn't make them metrics. What an appropriate data format is for bulk log/trace processing is out of scope for OpenMetrics, whatever you'd usually use I guess (e.g. JSON, protobuf). |
That seems to clash with the spec which states:
|
I don't see the point you're trying to make, I see no conflict there. |
I would like to see exemplars attached to gauges as well. My use case is attaching links to flamegraphs when a gauge crosses a threshold (CPU usage over a period of time in my case). I can hack around it by using a counter as a gauge, but it feels more natural to add support for exemplars in gauges instead. Having an exemplar allows presenting an end user with an immediate link to the flamegraph, which doesn't seem possible with logs, as one needs to look in multiple places. |
Good news: The Prometheus team decided at the dev summit on 10 November that Prometheus will ingest Exemplars on all time series. We just need to implement it on the Prometheus server side and on the client library side. PRs are welcome :) |
The OpenMetrics 1.0 specification supports exemplars on only counters and histograms, per our last discussions and agreement on the topic. Purposefully supporting them on other types violates that, so if the Prometheus server and clients do this with an OM content type they will not be compliant with OM. I'm saddened to read in the above comment and the linked meeting notes that the Prometheus project has seemingly decided to abandon OpenMetrics as the vendor-neutral standards project it was established to be. This has all happened without consulting or even informing the OpenMetrics team, with only those on the Prometheus team involved. The above comment is the first communication on the matter we have received from the Prometheus project, over three months after the decision was made. @bobrik Thank you for providing your use case, we can discuss it when (if ever) the OpenMetrics team next has a meeting. It's not the exact sort of logging use case previously brought up, though my first instinct is that what you really have is an alert firing, which may be better handled in your alerting/notification/dashboarding system rather than in metrics. What do these metrics look like currently? |
@brian-brazil the metric currently looks like this:
This indicates the For some threads there would be a flamegraph attached. My hope is that an exemplar on the graph of saturating threads is easier to click than looking elsewhere (
This can be generalized as a change detection mechanism: whenever you see a significant gauge change, you attach some context. In our case it's a flamegraph. If somebody detects a temperature change on a sensor, they might want to attach a photo (of a squirrel or a fire, depending on the reason) in exemplars. |
Hmm, interesting. The thread id is already there so you could use that to adjust what the dashboard shows, there's no need to involve exemplars for what you already have sitting in the label. Keep in mind also that you've only 128 characters to play with in an exemplar, that limit is there to discourage misusing it for logging or anything too much beyond a trace id. What exactly is generating the flamegraph and the link to it in this case?
One general principle of metrics is that you always expose relevant metrics, not just when an alert is firing. If you need more fine grained information, you can dig through logs/traces/trailcams. |
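To make the 128-character limit mentioned above concrete, here is a minimal sketch of a length check. It assumes the limit applies to the combined length, in Unicode code points, of all label names and values in the exemplar's label set. The helper names, the `flamegraph_url` label, and the example URL are hypothetical and only for illustration; they are not part of any client library.

```python
# Assumed limit: combined length of exemplar label names and values.
EXEMPLAR_LABEL_LIMIT = 128


def exemplar_label_length(labels: dict[str, str]) -> int:
    """Return the combined length of all label names and values.

    len() on a Python str counts Unicode code points.
    """
    return sum(len(name) + len(value) for name, value in labels.items())


def fits_exemplar_limit(labels: dict[str, str]) -> bool:
    """Check whether a label set stays within the assumed exemplar limit."""
    return exemplar_label_length(labels) <= EXEMPLAR_LABEL_LIMIT


# A 32-character trace id fits comfortably.
trace_exemplar = {"trace_id": "0123456789abcdef0123456789abcdef"}

# A hypothetical flamegraph link with a long URL does not.
url_exemplar = {"flamegraph_url": "https://example.invalid/" + "x" * 100}
```

Under these assumptions a trace id passes, while a long URL would need to be shortened (for example to an opaque id that the dashboard expands into a link) before it could be carried in an exemplar.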
The thread id isn't in the labels, the cgroup id is. There is still no flamegraph URL there, which is what I want to surface.
Not sure if I gave the impression that the metrics are only exported when an alert is firing, but that's not the case.
Can't the same be said about trace exemplars in histograms? Why have exemplars at all? I must be missing something here. |
What is generating the flamegraph and the link to it? What exactly would be in the exemplar?
What's determining saturation here? My presumption is that there's some form of alert-y logic going on. I'm not quite sure a gauge is the appropriate metric type here, and say a counter of how many seconds threads were saturated for may be better. This would avoid missing for example threads briefly being saturated/unsaturated when the scrape happens, and allow the scraper to choose the analysis period. This is the same sort of reasoning as to why you expose cpu usage as a counter rather than a gauge.
The argument provided at the time was that doing so is unworkably inefficient for traces, though that argument was backtracked on subsequently. |
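The counter suggestion in this comment can be sketched as follows. This is an illustration only, assuming the exporter keeps one cumulative counter per cgroup of thread-seconds spent saturated; the `Sample` type and function name are hypothetical. The scraper then picks the analysis period by choosing which two samples to compare.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # scrape time in seconds
    value: float      # cumulative thread-seconds spent saturated


def average_saturated_threads(earlier: Sample, later: Sample) -> float:
    """Average number of saturated threads between two scrapes.

    The increase of the counter divided by the elapsed time gives the
    mean number of threads that were saturated during the window.
    Counter resets are not handled in this sketch.
    """
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("later sample must be newer than earlier sample")
    return (later.value - earlier.value) / elapsed


# Two scrapes one minute apart: 120 thread-seconds of saturation accrued.
first = Sample(timestamp=0.0, value=100.0)
second = Sample(timestamp=60.0, value=220.0)
average = average_saturated_threads(first, second)
```

Because the counter accumulates between scrapes, brief saturation that starts and ends between two scrapes still contributes to the result, which a gauge sampled at scrape time would miss.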
The exporter that is generating the number of threads near saturation is also generating flamegraphs for the exemplar.
The exporter itself determines saturation by looking at how long the thread was on CPU in a wall clock period.
I want to know the number of threads saturating per cgroup and I don't see a way to do this with a counter without getting a timeseries per thread.
There are tens of thousands of threads with lots of churn, so it doesn't seem practical to apply the same measurement.
What does this mean? Is there no argument for having exemplars anywhere then? It seems weird to have exemplars in counters but not in gauges. You can turn a counter into a gauge at query time with |
As the title says :)