
Prometheus choice of bucket boundaries for http_request_duration_seconds #3196

Closed
luong-komorebi opened this issue Feb 25, 2021 · 3 comments · Fixed by #3214

Comments

@luong-komorebi
Contributor

Hi, first of all, thanks for OPA. Awesome work.

I am creating a Grafana dashboard for OPA. When visualizing http_request_duration_seconds, I ran into a problem: the average response time of all HTTP requests to OPA doesn't match the request duration percentiles.

Here are some images showing the problem.
You can see the same spike in both graphs, which is a good sign. On the average request duration graph, I can read the average request duration down to microseconds.

[screenshot: average request duration graph]

However, the quantile resolution stops at seconds

[screenshot: request duration quantile graph]

and the value doesn't change much over time

[screenshot: quantile values over time]

which leads me to think that we don't have the right bucket configuration for http_request_duration_seconds in our OPA.

From what I can see, #1638 is where the work for http_request_duration_seconds was done. It uses the default bucket configuration that Prometheus provides, but those buckets are meant for a typical web application, and in OPA's case they are not granular enough.
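To illustrate why the default buckets hide sub-millisecond latencies, here is a minimal sketch (not Prometheus's actual implementation) of the linear interpolation that `histogram_quantile()` performs. With the default buckets, any workload faster than 5 ms lands entirely in the first bucket, so the estimated median comes out around 2.5 ms no matter what the true latencies are:

```python
import math

# Prometheus default histogram bucket upper bounds, in seconds, plus +Inf.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5,
                   1.0, 2.5, 5.0, 10.0, math.inf]

def histogram_quantile(q, per_bucket_counts, bounds=DEFAULT_BUCKETS):
    # Prometheus stores buckets cumulatively (`le` series); for readability
    # this sketch takes per-bucket counts and accumulates them itself.
    cumulative, total = [], 0
    for c in per_bucket_counts:
        total += c
        cumulative.append(total)
    rank = q * total
    for i, (upper, cum) in enumerate(zip(bounds, cumulative)):
        if cum >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            if math.isinf(upper):
                return lower  # quantile falls into the +Inf bucket
            # Linear interpolation within the bucket, as histogram_quantile does.
            return lower + (upper - lower) * (rank - prev) / (cum - prev)
    return bounds[-1]

# 1000 requests that all finished in under 5 ms land in the first bucket,
# so the estimated median is 2.5 ms regardless of the true durations.
print(histogram_quantile(0.5, [1000] + [0] * 11))  # -> 0.0025
```

With microsecond-scale evaluations, every observation falls into the first bucket, which would explain the flat quantile graphs above.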

It is also possible that my queries are wrong, but they are fairly standard, so I don't think I made a mistake there. Feel free to view the code at https://github.com/luong-komorebi/opa-grafana-dashboard

Steps to Reproduce the Problem

  • OPA version: 0.26.0
  • Have OPA and Prometheus configured and running
  • Make some HTTP calls to OPA
  • If you have Grafana up and running, install my Grafana dashboard from GitHub or Grafana.com
  • If you don't, go to Prometheus and query the average request duration as well as, for example, the 50th percentile of http_request_duration_seconds over a 5-minute interval
  • Compare the average request duration with the http_request_duration_seconds quantiles and see how much they differ
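For reference, the queries in the last step would look roughly like this (assuming the standard `_sum`/`_count`/`_bucket` series that Prometheus histograms expose):

```promql
# Average request duration over a 5-minute window
rate(http_request_duration_seconds_sum[5m])
  / rate(http_request_duration_seconds_count[5m])

# 50th percentile over the same window
histogram_quantile(0.5,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```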

Additional Info

If possible, and if my assumption is right, I suggest we lower the bucket boundaries.

@srenatus
Contributor

srenatus commented Mar 2, 2021

Thanks for bringing this up! 👏 We've started discussing this internally, the current thinking is that ditching some higher granularity buckets in favour of adding a few smaller ones is probably the right move. You'd agree, I presume? 😃

@luong-komorebi
Contributor Author

Yes, I totally agree @srenatus

@luong-komorebi
Contributor Author

luong-komorebi commented Mar 3, 2021

I tried to turn my idea into code, and it looks like this: #3214. Maybe this is what you are looking for.

luong-komorebi added a commit to luong-komorebi/opa that referenced this issue Mar 3, 2021
This pull request ditches some higher-granularity buckets in favour of adding a
few smaller ones. The buckets I chose were based on https://www.openpolicyagent.org/docs/latest/policy-performance/#high-performance-policy-decisions
where the expectation is that "policy evaluation has a budget on the order of 1 millisecond".
Also, I tried to stay within Prometheus's default of 10 buckets.

This fixes open-policy-agent#3196

Signed-off-by: Luong Vo <vo.tran.thanh.luong@gmail.com>
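For illustration, a sub-millisecond-oriented layout in the spirit of that commit message might look like the sketch below. These values are hypothetical and not necessarily the ones merged in #3214:

```python
# Hypothetical bucket upper bounds in seconds, skewed toward the ~1 ms
# policy-evaluation budget while staying within 10 buckets total.
buckets = [1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 0.01, 0.1, 1.0]

assert len(buckets) == 10          # match Prometheus's default bucket count
assert buckets == sorted(buckets)  # bounds must be strictly increasing
```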
tsandall pushed a commit that referenced this issue Mar 5, 2021