Fix PromQL expression for coredns average packet size #43
The method for taking the average of a histogram metric is documented here:
https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations
Pull Request
Required Fields
🔎 What kind of change is it?
🎯 What has been changed and why do we need it?
Fix the PromQL expression for CoreDNS request average packet size. The value being shown was meaningless: the average across all CoreDNS servers of the sum of all request packet sizes in each server over `$__rate_interval`. This is not the average size of packets received by CoreDNS servers over `$__rate_interval`, which I imagine is what was intended.

The old query, explained another way:
- `coredns_dns_request_size_bytes_sum` is the sum of sizes of all request packets ever received.
- `rate(coredns_dns_request_size_bytes_sum[$__rate_interval])`, by itself, is the per-second rate of increase of the sum of sizes of request packets received during `$__rate_interval`. This isn't a meaningful value. This query returns a series containing this value for each CoreDNS server deployed.
- `avg(rate(coredns_dns_request_size_bytes_sum[$__rate_interval]))` takes the average of this value across all series returned -- in other words, to calculate average, it's dividing by the number of CoreDNS servers, not the number of packets. An operation that doesn't make sense on top of a value that already doesn't make sense.

The new query:
- `coredns_dns_request_size_bytes_count` is the other major component of this histogram. (Histograms in general have a `_count`, a `_sum`, and some number of `_bucket`s.) This metric is the number of observations -- in other words, the number of packets whose sizes were recorded. Dividing the sum of sizes by this count gives the average packet size.
- `sum` adds up the summed packet sizes and packet counts across all CoreDNS servers before the division.
- The `rate` function cancels out in this division -- you get the same value if you use the `increase` function in place of `rate`. I personally think `increase` is almost always easier to understand than `rate`, but I stuck with `rate` here because it's what the Prometheus docs use.
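
For readers who want to see the shape of the new query without opening the diff, here is a sketch reconstructed from the description above and from the pattern in the Prometheus histogram docs. The exact expression in the diff may differ, for example in label matchers or grouping, so treat this as an illustration rather than the merged change:

```promql
# Average request packet size over $__rate_interval across all CoreDNS servers:
# total bytes observed divided by total packets observed.
  sum(rate(coredns_dns_request_size_bytes_sum[$__rate_interval]))
/
  sum(rate(coredns_dns_request_size_bytes_count[$__rate_interval]))
```

`increase` is `rate` multiplied by the number of seconds in the range window, and both sides of the division use the same window, so that factor cancels. The `increase` variant should therefore return the same value:

```promql
# Same average, written with increase(): the window-length factor
# appears in both numerator and denominator and cancels out.
  sum(increase(coredns_dns_request_size_bytes_sum[$__rate_interval]))
/
  sum(increase(coredns_dns_request_size_bytes_count[$__rate_interval]))
```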