Skip to content
This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

monitoring: use histograms for api latency and cycle time metrics #164

Merged
merged 2 commits into from
Jun 28, 2018

Conversation

alexbrand
Copy link
Contributor

@alexbrand alexbrand commented Jun 22, 2018

Fixes #163

  • Changed the API latency and OpenStack cycle time metrics to use a histogram instead of a gauge.
  • Updated the dashboards to display 50th,75th, 90th, and 100th percentiles.
  • Updated metrics documentation

image

I suspect we'll have to make the bucket sizes configurable, but I think we can follow up on that when time comes.

@stevesloka
Copy link
Member

I don't quite understand how the histograms are implemented in this visualization. Typically I see histograms like this: http://docs.grafana.org/features/panels/heatmap/#histograms-and-buckets

What does the current metric mean for each bucket? If we had multiple discoverers, then they you would get 4 more data series?

@alexbrand
Copy link
Contributor Author

This visualization shows the 50, 75, 90 and max percentiles for each discoverer. It was confusing for me at first, but after reading a bit I now understand it better. The way I read the graphs are:

"50% (or 75%, or 90%) of the requests are taking less than X milliseconds".

If we have multiple discoverers, then we would indeed have 4 more data series on the graph.

I'll let @rosskukulinski weigh in as well as this is somewhat new for me still.

@rosskukulinski
Copy link
Contributor

"50% (or 75%, or 90%) of the requests are taking less than X milliseconds".

@alexbrand is correct about what's being visualized

That said, @stevesloka - you're right that these are not actually histogram visualizations like the ones you linked to. This is partly because it's only in the latest release of Grafana (#148) is there a proper Prometheus histograph panel. It's also a common pattern to display the latency percentiles (especially

@alexbrand recommend:

  • changing legend for each graph to a table underneath the graph and display the 'avg' and 'current' values.
  • Recommend only showing 50th and 90th percentiles by default to minimize clutter when there are many clusters.

@stevesloka
Copy link
Member

What do the current values in the legend mean? Also, Why does it show 100% as a bucket?

@rosskukulinski Should we update the Grafana version and use the proper histogram panel?

@rosskukulinski
Copy link
Contributor

100% is basically Inf -- all values are less than this number. Its the same thing as maximum latency value recorded.

current is the most recent calculated value. This is as opposed to max which is the max value for the rolling window Grafana is currently displaying, min, which is the min value for the rolling window, and avg which is the average over the rolling window.

I do think we should update Grafana, but I do not think it's necessary for this PR. FWIW, I prefer line-graph representation of latencies because you can display many different lines. Doing that with true histograms is more complex and confusing to understand.

@alexbrand
Copy link
Contributor Author

@rosskukulinski @stevesloka If we remove the 100th percentile we would most likely be oblivious to the fact that some requests are taking a very long time. For example, in the attached screenshot, we can see that the 90th percentile of the Load Balancers Endpoint is below 20 seconds, but the 100th percentile goes up to 50 secs.

What are your thoughts on showing 50th and 99th percentiles?

@rosskukulinski
Copy link
Contributor

+1 for 50th and 99th

@alexbrand
Copy link
Contributor Author

Updated with 50th and 99th percentile, and also updated legend to show up as table with avg and current values.

screen shot 2018-06-25 at 4 43 12 pm

@rosskukulinski
Copy link
Contributor

I don't have an OS system up to test with, but latest attached image LGTM.

@stevesloka
Copy link
Member

LGTM, @alexbrand could you rebase to update?

Signed-off-by: Alexander Brand <alexbrand09@gmail.com>
Signed-off-by: Alexander Brand <alexbrand09@gmail.com>
@alexbrand
Copy link
Contributor Author

Rebased. Thanks @stevesloka!

@stevesloka stevesloka merged commit e2fa4aa into projectcontour:master Jun 28, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants