Fix PromQL expression for coredns average packet size #43

geekofalltrades · 2023-04-19T23:44:35Z

The method for taking the average of a histogram metric is documented here:
https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

Pull Request

Required Fields

🔎 What kind of change is it?

fix

🎯 What has been changed and why do we need it?

Fix the PromQL expression for the CoreDNS average request packet size. The value shown was meaningless: the average, across all CoreDNS servers, of each server's summed request packet sizes over $__rate_interval. That is not the average size of packets received by CoreDNS servers over $__rate_interval, which I imagine is what was intended.

The old query, explained another way:

coredns_dns_request_size_bytes_sum is the sum of sizes of all request packets ever received.
rate(coredns_dns_request_size_bytes_sum[$__rate_interval]), by itself, is the per-second rate of increase of the sum of sizes of request packets received during $__rate_interval. This isn't a meaningful value. This query returns a series containing this value for each CoreDNS server deployed.
avg(rate(coredns_dns_request_size_bytes_sum[$__rate_interval])) takes the average of this value across all series returned -- in other words, to calculate average, it's dividing by the number of CoreDNS servers, not the number of packets. An operation that doesn't make sense on top of a value that already doesn't make sense.
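To make the problem with the old query concrete, here is a small numeric sketch. It is not taken from the dashboard: the per-server numbers, the 60-second window, and the helper name `old_panel_value` are all invented for illustration, and it treats rate as a plain "increase divided by window length", ignoring the extrapolation Prometheus applies at window edges.

```python
# Hypothetical per-server increases of the histogram's _sum and _count
# over one 60-second window. All numbers are invented for illustration.
WINDOW_SECONDS = 60


def old_panel_value(servers):
    """Mimic avg(rate(..._sum[window])): the mean over servers of bytes per second."""
    per_server_rates = [s["sum_increase"] / WINDOW_SECONDS for s in servers]
    # Note the divisor: the number of servers, not the number of packets.
    return sum(per_server_rates) / len(per_server_rates)


# Every packet is exactly 100 bytes in both scenarios.
quiet = [
    {"sum_increase": 60_000, "count_increase": 600},
    {"sum_increase": 30_000, "count_increase": 300},
]
# Same packet size, ten times as many requests.
busy = [{key: value * 10 for key, value in s.items()} for s in quiet]

print(old_panel_value(quiet))  # 750.0
print(old_panel_value(busy))   # 7500.0
```

In this sketch the true average packet size is 100 bytes in both scenarios, yet the old-style value goes from 750 to 7500 because it scales with request volume.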

The new query:

coredns_dns_request_size_bytes_count is the other major component of this histogram. (Histograms in general have a _count, a _sum, and some number of _buckets.) This metric is the number of observations -- in other words, the number of packets whose sizes were recorded.
We use sum to add up the summed packet sizes and packet counts across all CoreDNS servers.
Divide the summed packet sizes by the number of packets. (That's the average, alright!)
The rate function cancels out in this division -- you get the same value if you use the increase function in place of rate. I personally think increase is almost always easier to understand than rate, but I stuck with rate here because it's what the Prometheus docs use.
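The steps above follow the pattern from the Prometheus docs, which would give an expression of the shape `sum(rate(coredns_dns_request_size_bytes_sum[$__rate_interval])) / sum(rate(coredns_dns_request_size_bytes_count[$__rate_interval]))`; the exact expression in the merged diff may differ, for example in its label selectors. The sketch below works through that arithmetic with invented numbers and shows why dividing by the window length (rate) or not (increase) gives the same result. The helper `average_packet_size` and the 60-second window are assumptions for illustration only.

```python
# Hypothetical per-server increases of _sum and _count over one window.
WINDOW_SECONDS = 60


def average_packet_size(servers, per_second=True):
    """Mimic sum(f(..._sum[w])) / sum(f(..._count[w])), with f = rate or increase."""
    # rate is modelled as increase / window length; increase leaves values as-is.
    divisor = WINDOW_SECONDS if per_second else 1
    total_bytes = sum(s["sum_increase"] / divisor for s in servers)
    total_packets = sum(s["count_increase"] / divisor for s in servers)
    return total_bytes / total_packets


servers = [
    {"sum_increase": 60_000, "count_increase": 600},  # 100-byte packets
    {"sum_increase": 30_000, "count_increase": 60},   # 500-byte packets
]

with_rate = average_packet_size(servers, per_second=True)
with_increase = average_packet_size(servers, per_second=False)

print(f"{with_rate:.1f}")      # 136.4
print(f"{with_increase:.1f}")  # 136.4
```

Both variants yield 90,000 bytes over 660 packets, about 136.4 bytes per packet, because the window length appears in both the numerator and the denominator.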

geekofalltrades · 2023-04-19T23:50:27Z

I got suspicious about this when investigating hourly spikes of DNS requests in our servers. The shape of the average packet size graph was the same as the shape of the number of requests graph. Intuitively it seemed like it could be possible for this to be accurate, but it would be an odd coincidence.

The fixed query paints a very different picture:

We're actually being flooded with really tiny request packets at the top of each hour. That's validated by comparing against the request packet size heatmap I previously added 😃 :


dotdc · 2023-04-20T07:01:31Z

Really nice find, thank you for your contribution @geekofalltrades !

dotdc · 2024-04-25T21:09:48Z

🎉 This PR is included in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

geekofalltrades requested a review from dotdc as a code owner April 19, 2023 23:44

geekofalltrades force-pushed the fix-coredns-average-packet-size-query branch from b28ab57 to 3a62cca Compare April 19, 2023 23:57

Fix PromQL expression for coredns average packet size

292a920

The method for taking the average of a histogram metric is documented here: https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

geekofalltrades force-pushed the fix-coredns-average-packet-size-query branch from 3a62cca to 292a920 Compare April 19, 2023 23:59

dotdc merged commit b796909 into dotdc:master Apr 20, 2023

dotdc added the released label Apr 25, 2024
