Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix PromQL epxression for coredns average packet size #43

Merged

Conversation

geekofalltrades
Copy link
Contributor

The method for taking the average of a histogram metric is documented here:
https://prometheus.io/docs/practices/histograms/#count-and-sum-of-observations

Pull Request

Required Fields

🔎 What kind of change is it?

  • fix

🎯 What has been changed and why do we need it?

Fix the PromQL expression for CoreDNS request average packet size. The value being shown was a meaningless value: the average across all CoreDNS servers of the sum of all request packet sizes in each server over $__rate_interval. This is not the average size of packets received by CoreDNS servers over $__rate_interval, which I imagine is what was intended.

The old query, explained another way:

  • coredns_dns_request_size_bytes_sum is the sum of sizes of all request packets ever received.
  • rate(coredns_dns_request_size_bytes_sum[$__rate_interval]), by itself, is the per-second rate of increase of the sum of sizes of request packets received during $__rate_interval. This isn't a meaningful value. This query returns a series containing this value for each CoreDNS server deployed.
  • avg(rate(coredns_dns_request_size_bytes_sum[$__rate_interval])) takes the average of this value across all series returned -- in other words, to calculate average, it's dividing by the number of CoreDNS servers, not the number of packets. An operation that doesn't make sense on top of a value that already doesn't make sense.

The new query:

  • coredns_dns_request_size_bytes_count is the other major component of this histogram. (Histograms in general have a _count, a _sum, and some number of _buckets.) This metric is the number of observations -- in other words, the number of packets whose sizes were recorded.
  • We use sum to add up the summed packet sizes and packet counts across all CoreDNS servers.
  • Divide the summed packet sizes by the number of packets. (That's the average, alright!)
  • The rate function cancels out in this division -- you get the same value if you use the increase function in place of rate. I personally think increase is almost always easier to understand than rate, but I stuck with rate here because it's what the Prometheus docs use.

@geekofalltrades geekofalltrades requested a review from dotdc as a code owner April 19, 2023 23:44
@geekofalltrades
Copy link
Contributor Author

geekofalltrades commented Apr 19, 2023

I got suspicious about this when investigating hourly spikes of DNS requests in our servers. The shape of the average packet size graph was the same as the shape of the number of requests graph. Intuitively it seemed like it could be possible for this to be accurate, but it would be an odd coincidence.

Screenshot from 2023-04-19 16-49-15

The fixed query paints a very different picture:

Screenshot from 2023-04-19 16-48-40

We're actually being flooded with really tiny request packets at the top of each hour. That's validated by comparing against the request packet size heatmap I previously added 😃 :

Screenshot from 2023-04-19 16-52-57

@geekofalltrades geekofalltrades force-pushed the fix-coredns-average-packet-size-query branch from b28ab57 to 3a62cca Compare April 19, 2023 23:57
@geekofalltrades geekofalltrades force-pushed the fix-coredns-average-packet-size-query branch from 3a62cca to 292a920 Compare April 19, 2023 23:59
@dotdc dotdc merged commit b796909 into dotdc:master Apr 20, 2023
@dotdc
Copy link
Owner

dotdc commented Apr 20, 2023

Really nice find, thank you for your contribution @geekofalltrades !

@dotdc
Copy link
Owner

dotdc commented Apr 25, 2024

🎉 This PR is included in version 1.1.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

@dotdc dotdc added the released label Apr 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants