This repository was archived by the owner on Dec 3, 2024. It is now read-only.

Switch counter metrics to use cumulative or delta #18

Open
prog8 opened this issue Aug 10, 2021 · 10 comments

Comments

prog8 commented Aug 10, 2021

It turns out that counter metrics use a gauge metric type, as here:

MetricKind: metricpb.MetricDescriptor_GAUGE,

This doesn't seem right. According to the Aligner documentation, I cannot calculate deltas unless the metric is defined as delta or cumulative. This is a big limitation. Should we consider switching counters to cumulative?

prog8 commented Aug 10, 2021

^ @tam7t

tam7t commented Aug 10, 2021

Just so I'm understanding correctly: you want to create an alert on the rate of change of a counter metric, and it looks like that is not possible because the counter is actually a GAUGE metric?

prog8 commented Aug 10, 2021

Not only to alert on deltas, but primarily to draw charts with deltas. As far as I can see, delta can be used as an Aligner, but only for delta and cumulative metrics.

Imagine I want to count function calls or requests. I can either calculate the request rate or call rate in code and expose it as a gauge, or I can use a cumulative counter and compute deltas at query time. The latter is a very common scenario: many open source services expose cumulative values, which are later converted to deltas either by metrics agents or by the monitoring tools themselves. As far as I can see, Google Cloud Monitoring can also calculate deltas, but it does not allow this for gauges. The docs for both ALIGN_DELTA and ALIGN_RATE say: "This aligner is valid for CUMULATIVE and DELTA metrics with numeric values."

I mean more or less this: #19

prog8 commented Aug 23, 2021

@tam7t any thoughts here?

rf commented Aug 30, 2021

Hi! I just started using this package for some custom metrics and was very disappointed to find that counters are incorrectly created in stackdriver as gauges.

To be clear, there is a SetGauge method which should be used for gauges. IncrCounter should be creating a metric of type cumulative. If it does the same thing as SetGauge then why does it have a different name, right?
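The semantic difference being pointed out can be sketched with a toy in-memory sink. This is not the library's Stackdriver sink: `toySink` is hypothetical, and its methods take a plain string name as a simplification, so treat it only as an illustration of "overwrite" versus "accumulate".

```go
package main

import "fmt"

// toySink is a hypothetical in-memory store used only for illustration.
type toySink struct {
	gauges   map[string]float32
	counters map[string]float32
}

func newToySink() *toySink {
	return &toySink{
		gauges:   map[string]float32{},
		counters: map[string]float32{},
	}
}

// SetGauge overwrites the stored value: a point-in-time measurement.
func (s *toySink) SetGauge(name string, val float32) {
	s.gauges[name] = val
}

// IncrCounter adds to a running total: a cumulative measurement.
func (s *toySink) IncrCounter(name string, val float32) {
	s.counters[name] += val
}

func main() {
	s := newToySink()
	s.SetGauge("queue_depth", 5)
	s.SetGauge("queue_depth", 3) // replaces 5
	s.IncrCounter("requests", 1)
	s.IncrCounter("requests", 1) // accumulates to 2
	fmt.Println(s.gauges["queue_depth"], s.counters["requests"])
}
```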

tam7t commented Sep 1, 2021

From https://cloud.google.com/blog/products/management-tools/stackdriver-tips-and-tricks-understanding-metrics-and-building-charts

A cumulative metric measures a value that constantly increases, such as “sent bytes count” for Firebase. Cumulative metrics are never drawn directly in practice; you always use aligners (discussed below) to turn them into gauge or delta metrics first. If you could draw the raw data for “sent bytes count,” you would see an ever-increasing line going up as the total number of sent bytes grows without bound.

What this library does for the go-metrics counter type is report the actual raw data (ever-increasing line, except where I turned off and restarted the program):

Running the example/main.go and plotting the counter metric using MQL:

fetch generic_task
| metric 'custom.googleapis.com/go-metrics/baz_counter'
| group_by [], [value_baz_counter_mean: aggregate(value.baz_counter)]
| window 1m

Produces a graph like this:

image

If we want to plot (or alert) on the rate of change of the counter we can use the following MQL:

fetch generic_task
| metric 'custom.googleapis.com/go-metrics/baz_counter'
| align delta_gauge()
| group_by [], [value_baz_counter_mean: aggregate(value.baz_counter)]
| window 1m

image

Here we see a steady 1200/minute rate (which is consistent with running 2 instances of example/main.go, each incrementing the counter once every 100ms).

I'm honestly not sure why the GUI does not provide the DELTA aligner on gauge metrics, but it is possible to create an alert from these counters using MQL:

fetch generic_task
| metric 'custom.googleapis.com/go-metrics/baz_counter'
| align delta_gauge()
| group_by [], [value_baz_counter_mean: aggregate(value.baz_counter)]
| window 1m
| condition val() > 605

I can see the desire to switch the type to CUMULATIVE, as it appears that is how OpenTelemetry and other built-in Stackdriver counter metrics are reported, but I just want to demonstrate how to build alerts on the existing implementation.

prog8 commented Sep 1, 2021

@tam7t Wow. Thank you. I think this in fact solves most of the problems. I got the impression that one cannot use a delta aligner for gauges. It seems I have to invest time in getting familiar with MQL and use it instead of clicking through the UI.

tam7t commented Sep 1, 2021

Glad to hear that!

Be aware that in my tests | align delta_gauge() was needed. I tried a few times with | align delta(), and while the resulting graphs looked the same, I could not get alerts to fire.

My strategy to find those queries was to use the UI to build the initial aggregation, then click the MQL button to see what the MQL looks like and edit from there.

I think switching to CUMULATIVE may also change behavior on overflows. I don't think I've actually tested what happens when a counter rolls over, but that may be another thing to look at to ensure it doesn't cause an extra or missing alert.

prog8 commented Sep 3, 2021

@tam7t Thank you for all your support. Do you also have suggestions regarding time metrics sent by calling MeasureSinceWithLabels? I am trying to visualize the max, but I get an error that max cannot be used for distributions. I have to use a high percentile as a workaround. You can see what I do in the example below. I'd appreciate any suggestions if there is a better way (without a workaround) to find min/max values. Thanks

fetch generic_task
| metric custom.googleapis.com/blabla/refresh_time
| align delta()
| group_by
    [metric.source],
    [refresh_time: percentile(value.refresh_time, 99.99)]
| window 1m

rf commented Sep 22, 2021

When my process restarts I end up with huge negative spikes in the delta, as the value goes from some high number to some very low number. Is there anything I can do about that?
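The thread does not record an answer to this question. One common convention in other monitoring systems (Prometheus's rate() handles counter resets this way, for example) is to treat any decrease as a restart rather than as a negative change. A minimal Go sketch of that idea follows, assuming the counter restarts from zero; `ResetAwareDeltas` is a hypothetical helper, not something this library or Cloud Monitoring is known to provide.

```go
package main

import "fmt"

// ResetAwareDeltas converts raw counter readings into per-interval increases.
// A reading lower than its predecessor is treated as a process restart.
// Assumption: the counter restarted from zero, so the increase for that
// interval is taken to be the new reading itself.
func ResetAwareDeltas(readings []float64) []float64 {
	if len(readings) < 2 {
		return nil
	}
	out := make([]float64, 0, len(readings)-1)
	for i := 1; i < len(readings); i++ {
		prev, cur := readings[i-1], readings[i]
		if cur < prev {
			// Counter went backwards: assume a restart instead of
			// reporting a large negative delta.
			out = append(out, cur)
			continue
		}
		out = append(out, cur-prev)
	}
	return out
}

func main() {
	// The drop from 2200 to 50 stands in for a process restart.
	readings := []float64{1000, 1600, 2200, 50, 650}
	fmt.Println(ResetAwareDeltas(readings))
}
```

Increments that happened between the last reading before the restart and the restart itself are lost under this scheme, so the result is a lower bound for that one interval.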
