This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

Add metrics and latency dashboard per service and namespace #130

Merged
3 commits merged into master on May 30, 2018

Conversation

rosskukulinski
Contributor

This pull request improves the default envoy-metrics Grafana dashboard to provide visualizations of the following:

  • Per-Envoy Requests per Second (RPS), Connections per Second (CPS), Latency, and Total Connections
  • Per Upstream RPS, CPS, Latency, Total Connections broken down by Namespace and/or Service
  • Per Upstream 2xx, 3xx, 4xx, 5xx response counts broken down by Namespace and/or Service
  • Grafana templated variables pulled from Envoy metrics

To support the additional breakdown by service and namespace, this PR modifies both the prometheus-statsd exporter running alongside Envoy AND the Prometheus pods job to split the cluster_name field into its subcomponents: namespace, service, and port.

Note: the service field here is actually the backend-name + servicename from the discovery system. In future work, we should find an efficient way to isolate the backend-name from the servicename so we can provide latency/RPS metrics per backend cluster.
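To make the splitting concrete, here is a minimal sketch of the idea. It assumes cluster names of the form `<namespace>/<service>/<port>`; the actual naming scheme is whatever the discovery system emits, and the real splitting is done by regex in the statsd-exporter and Prometheus configs, not in Python:

```python
import re

# Hypothetical pattern: assumes Envoy cluster names look like
# "<namespace>/<service>/<port>". The real split is performed by the
# statsd-exporter and Prometheus regex rules changed in this PR.
CLUSTER_NAME_RE = re.compile(r"^(?P<namespace>[^/]+)/(?P<service>[^/]+)/(?P<port>\d+)$")

def split_cluster_name(name: str) -> dict:
    """Split a cluster name into namespace/service/port labels."""
    m = CLUSTER_NAME_RE.match(name)
    if not m:
        # Statically configured clusters (e.g. "contour") don't match
        # the pattern and keep their original name with no extra labels.
        return {"cluster": name}
    return m.groupdict()

print(split_cluster_name("team-a/kuard/80"))
# {'namespace': 'team-a', 'service': 'kuard', 'port': '80'}
print(split_cluster_name("contour"))
# {'cluster': 'contour'}
```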

envoy-metrics dashboard

Signed-off-by: Ross Kukulinski <ross@kukulinski.com>
@rosskukulinski rosskukulinski added this to the v0.3 milestone May 25, 2018
@alexbrand
Contributor

Some of the graphs' legends display "namespace/service" instead of the actual namespace and service; the Upstream RPS graph is an example.

I did some digging and it looks like the metrics for envoy_cluster_name="contour" do not have the namespace and service prometheus labels. Same with the metrics for envoy_cluster_name="service_stats".


I am actually blanking on how the envoy metrics end up with service and namespace labels. How are those added?

@rosskukulinski
Contributor Author

@alexbrand I added some regex in the statsd-exporter config (e.g. https://github.com/heptio/gimbal/pull/130/files#diff-5d97ec68ef0a6b5e22a5e67fa697e23bR18) that splits out to the different labels.
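For readers without the diff open, a statsd-exporter mapping rule of this shape is what does the splitting. This is an illustrative sketch, not the exact rule from the PR: the metric name and group order are assumptions, and the real rules are in the linked config diff:

```yaml
# Hypothetical statsd-exporter mapping sketch. A regex match pulls the
# namespace/service/port components out of the cluster name embedded in
# the statsd metric name and emits them as Prometheus labels.
mappings:
  - match: 'envoy\.cluster\.([^.]+)\.([^.]+)\.([^.]+)\.upstream_rq_time'
    match_type: regex
    name: "envoy_cluster_upstream_rq_time"
    labels:
      namespace: "$1"
      service: "$2"
      port: "$3"
```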

I don't know what to do about the contour ones that show up as namespace/service

@rosskukulinski
Contributor Author

Oh, and there's corresponding regex in the Prometheus config: https://github.com/heptio/gimbal/pull/130/files#diff-63627bfdd1800e3898caa1eb81a4dd58R301
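The Prometheus side looks roughly like the following. Again a hedged sketch, assuming cluster names of the form `namespace/service/port`; the label names and regex are illustrative, and the authoritative version is in the linked diff:

```yaml
# Hypothetical metric_relabel_configs sketch: derive namespace, service,
# and port labels from the envoy_cluster_name label. Clusters whose names
# don't match the regex (e.g. "contour") are left untouched.
metric_relabel_configs:
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: namespace
    replacement: '$1'
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: service
    replacement: '$2'
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: port
    replacement: '$3'
```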

@alexbrand
Contributor

Aaaah that makes more sense now. I was looking at the existing statsd-exporter config to try to figure it out. We'll have to be careful if we ever change the envoy cluster naming scheme in contour, as it will affect these stats.

I am also unsure about what we can do for the clusters that are statically configured in envoy.

Signed-off-by: Ross Kukulinski <ross@kukulinski.com>
@rosskukulinski
Contributor Author

@alexbrand I realized that I had been overzealous with my Grafana dashboard variables. I had added a custom "All" selector of `.*` when I didn't actually want a custom one. Statically configured clusters no longer show up in the metrics.

@alexbrand
Contributor

@rosskukulinski Are your Namespace and Service filters getting populated with values?

I found that I needed to update the variable query to get it to work. For example, I had to change the Namespace variable's query from `label_values(envoy_cluster_upstream_rq_time_bucket, namespace)` to `label_values(namespace)`.

@alexbrand
Contributor

The reason I was not getting anything in the dropdown was because I had not sent an initial request to the backend. Once I did, the metric that drives the Grafana Variables became available, and the Namespace & Service filters got populated as expected.

This LGTM.

@alexbrand alexbrand merged commit be13e93 into master May 30, 2018
@alexbrand alexbrand deleted the dashboard-latencies branch May 30, 2018 18:24