This repository has been archived by the owner on Feb 27, 2023. It is now read-only.

Add metrics and latency dashboard per service and namespace #130

Merged
3 commits merged into master on May 30, 2018

Conversation

rosskukulinski
Contributor

This pull request improves the default envoy-metrics Grafana dashboard to provide visualizations of the following:

  • Per-Envoy Requests per Second (RPS), Connections per Second (CPS), Latency, and Total Connections
  • Per Upstream RPS, CPS, Latency, Total Connections broken down by Namespace and/or Service
  • Per Upstream 2xx, 3xx, 4xx, 5xx response counts broken down by Namespace and/or Service
  • Grafana templated variables pulled from Envoy metrics

To support the additional breakdown by service and namespace, this PR modifies both the prometheus-statsd exporter running alongside Envoy AND the Prometheus pods job to split the cluster_name field into its subcomponents: namespace, service, and port.

Note: the service field here is actually the backend-name + servicename from the discovery system. In future work, we should find an efficient way to isolate the backend-name from the servicename so we can provide latency/RPS metrics per backend cluster.
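To make the splitting concrete, here is a minimal sketch of the idea. It assumes cluster names of the form `<namespace>/<service>/<port>`; the actual naming scheme is whatever the discovery system emits, and the real splitting is done by regex in the statsd-exporter and Prometheus configs, not in Python:

```python
import re

# Hypothetical pattern: assumes Envoy cluster names look like
# "<namespace>/<service>/<port>". The real split is performed by the
# statsd-exporter and Prometheus regex rules changed in this PR.
CLUSTER_NAME_RE = re.compile(r"^(?P<namespace>[^/]+)/(?P<service>[^/]+)/(?P<port>\d+)$")

def split_cluster_name(name: str) -> dict:
    """Split a cluster name into namespace/service/port labels."""
    m = CLUSTER_NAME_RE.match(name)
    if not m:
        # Statically configured clusters (e.g. "contour") don't match
        # the pattern and keep their original name with no extra labels.
        return {"cluster": name}
    return m.groupdict()

print(split_cluster_name("team-a/kuard/80"))
# {'namespace': 'team-a', 'service': 'kuard', 'port': '80'}
print(split_cluster_name("contour"))
# {'cluster': 'contour'}
```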

envoy-metrics dashboard

Signed-off-by: Ross Kukulinski <ross@kukulinski.com>
@rosskukulinski rosskukulinski added this to the v0.3 milestone May 25, 2018
@alexbrand
Contributor

Some of the graphs' legends display "namespace/service" instead of the actual namespace and service; the Upstream RPS graph is an example.

I did some digging and it looks like the metrics for envoy_cluster_name="contour" do not have the namespace and service prometheus labels. Same with the metrics for envoy_cluster_name="service_stats".


I am actually blanking on how the envoy metrics end up with service and namespace labels. How are those added?

@rosskukulinski
Contributor Author

@alexbrand I added some regex in the statsd-exporter config (e.g. https://github.com/heptio/gimbal/pull/130/files#diff-5d97ec68ef0a6b5e22a5e67fa697e23bR18) that splits out to the different labels.
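For readers without the diff open, a statsd-exporter mapping rule of this shape is what does the splitting. This is an illustrative sketch, not the exact rule from the PR: the metric name and group order are assumptions, and the real rules are in the linked config diff:

```yaml
# Hypothetical statsd-exporter mapping sketch. A regex match pulls the
# namespace/service/port components out of the cluster name embedded in
# the statsd metric name and emits them as Prometheus labels.
mappings:
  - match: 'envoy\.cluster\.([^.]+)\.([^.]+)\.([^.]+)\.upstream_rq_time'
    match_type: regex
    name: "envoy_cluster_upstream_rq_time"
    labels:
      namespace: "$1"
      service: "$2"
      port: "$3"
```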

I don't know what to do about the contour ones that show up as namespace/service

@rosskukulinski
Contributor Author

Oh, and there's corresponding regex in the Prometheus config: https://github.com/heptio/gimbal/pull/130/files#diff-63627bfdd1800e3898caa1eb81a4dd58R301
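The Prometheus side looks roughly like the following. Again a hedged sketch, assuming cluster names of the form `namespace/service/port`; the label names and regex are illustrative, and the authoritative version is in the linked diff:

```yaml
# Hypothetical metric_relabel_configs sketch: derive namespace, service,
# and port labels from the envoy_cluster_name label. Clusters whose names
# don't match the regex (e.g. "contour") are left untouched.
metric_relabel_configs:
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: namespace
    replacement: '$1'
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: service
    replacement: '$2'
  - source_labels: [envoy_cluster_name]
    regex: '([^/]+)/([^/]+)/(\d+)'
    target_label: port
    replacement: '$3'
```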

@alexbrand
Contributor

Aaaah that makes more sense now. I was looking at the existing statsd-exporter config to try to figure it out. We'll have to be careful if we ever change the envoy cluster naming scheme in contour, as it will affect these stats.

I am also unsure about what we can do for the clusters that are statically configured in envoy.

Signed-off-by: Ross Kukulinski <ross@kukulinski.com>
@rosskukulinski
Contributor Author

@alexbrand I realized that I had been overzealous with my Grafana dashboard variables. I had added a custom "All" selector of `.*` when I didn't actually want a custom one. Statically configured clusters no longer show up in the metrics.

@alexbrand
Contributor

@rosskukulinski Are your Namespace and Service filters getting populated with values?

I found that I needed to update the variable query to get it to work. For example, I had to change the Namespace variable's query from `label_values(envoy_cluster_upstream_rq_time_bucket, namespace)` to `label_values(namespace)`.

@alexbrand
Contributor

The reason I was not getting anything in the dropdown was because I had not sent an initial request to the backend. Once I did, the metric that drives the Grafana Variables became available, and the Namespace & Service filters got populated as expected.

This LGTM.

@alexbrand alexbrand merged commit be13e93 into master May 30, 2018
@alexbrand alexbrand deleted the dashboard-latencies branch May 30, 2018 18:24