This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Aggregated metrics for consumption by Azure monitor and Traffic metrics team #2498

Closed
Tracked by #3704
rashmichandrashekar opened this issue Feb 9, 2021 · 3 comments

@rashmichandrashekar

Azure Monitor is working on scraping metrics from the Envoy sidecar for scenarios such as latency, error rate, and success rate, to power front-end experiences that make troubleshooting issues and understanding mesh behavior easier. As part of this work we analyzed the cost and performance of scraping these metrics and building these scenarios. Below are the issues we see with scraping metrics directly from the Envoy sidecar with no aggregation:

  1. The Azure Monitor agent that scrapes the data may run into resource-utilization issues when collecting all of the raw metrics at high scale.
  2. The Log Analytics (LA) data store that holds these metrics may incur significant cost for customers, since no aggregation is in place.
  3. The UX queries that compute all of the aggregates on page load may see high latencies and performance problems.

After talking to the Traffic Metrics team, we learned that they are building a query layer on top of the Prometheus store that translates Kubernetes API queries into Prometheus queries to produce these aggregated metrics. This will not work for us, for the following reasons:

  1. We do not want to take a dependency on the Kubernetes API server: the additional calls would add load to the API server, potentially making it unresponsive and affecting other functionality on the cluster and its workloads.
  2. It also takes a dependency on Prometheus. As a monitoring platform ourselves, we cannot depend on Prometheus; doing so defeats the purpose of managed monitoring and means more work for the customer.

With OSM exposing aggregated metrics, we can query for them directly, which solves the first set of problems; the traffic metrics layer can leverage the same functionality with no dependency on Prometheus.
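
For illustration, this is roughly the aggregation that the query layer would otherwise have to run against Prometheus on every request. A minimal sketch using the Prometheus Go client, assuming a reachable Prometheus server at the address shown and Envoy's standard `envoy_cluster_upstream_rq_time` histogram; if OSM exposed the pre-aggregated percentile instead, this per-query work (and the Prometheus dependency) would disappear:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumption: a Prometheus server scraping the mesh is reachable here.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// P99 request latency derived from Envoy's raw histogram buckets.
	// This is exactly the per-query aggregation cost that pre-aggregated
	// metrics from OSM would avoid.
	query := `histogram_quantile(0.99,
	    sum(rate(envoy_cluster_upstream_rq_time_bucket[1m])) by (le))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```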

Here is the list of metrics for which Azure Monitor would require aggregates, and which Traffic Metrics could also use (a rough sketch of what exposing these could look like follows the list):

  1. Request latencies (P50, P90, P99)
  2. Aggregates from which we can derive topology (see https://github.com/servicemeshinterface/smi-spec/blob/main/apis/traffic-metrics/v1alpha1/traffic-metrics.md#topologies)
  3. Request error count and percentage
  4. Request success count and percentage
  5. Bytes sent per second
  6. Bytes received per second
  7. Number of inbound connections to a service
  8. Number of outbound connections to a service
  9. Requests per second
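
To make the ask concrete, here is a minimal sketch (not OSM's actual design; all metric and label names are hypothetical) of how pre-aggregated latency percentiles and request counts could be exposed with the Prometheus Go client library, so that Azure Monitor scrapes a handful of service-level series instead of raw per-pod histograms:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// All metric and label names here are hypothetical; OSM would define its own.
var (
	// 1. Pre-computed latency quantiles (P50/P90/P99) instead of raw
	// histogram buckets, so the scraper never has to aggregate them.
	requestLatency = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name:       "osm_request_duration_seconds",
		Help:       "Request latency aggregated per destination service.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	}, []string{"destination_service"})

	// 3/4. Request counts by outcome; error and success percentages can
	// be derived from these two counters at display time.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "osm_requests_total",
		Help: "Requests aggregated per destination service and outcome.",
	}, []string{"destination_service", "outcome"}) // outcome: success|error
)

func main() {
	prometheus.MustRegister(requestLatency, requestsTotal)

	// Record a sample observation (in a real control plane this would be
	// fed from the sidecars' stats).
	requestLatency.WithLabelValues("bookstore").Observe(0.042)
	requestsTotal.WithLabelValues("bookstore", "success").Inc()

	// Expose the aggregated metrics on a single endpoint that Azure
	// Monitor (or the traffic metrics layer) can scrape directly.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

Rates and percentages (items 3-6 and 9) could similarly be derived from a small set of service-level counters at query or display time rather than stored as separate series.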

Scope (please mark with X where applicable)

  • New Functionality [x]
  • Install [ ]
  • SMI Traffic Access Policy [ ]
  • SMI Traffic Specs Policy [ ]
  • SMI Traffic Split Policy [ ]
  • Permissive Traffic Policy [ ]
  • Ingress [ ]
  • Egress [ ]
  • Envoy Control Plane [ ]
  • CLI Tool [ ]
  • Metrics [x]
  • Certificate Management [ ]
  • Sidecar Injection [ ]
  • Logging [ ]
  • Debugging [ ]
  • Tests [ ]
  • CI System [ ]
  • Project Release [ ]

Possible use cases
Described above.

@eduser25 (Contributor) commented Feb 23, 2021

Hey Rashmi, thanks for leaving this here for us. I have some follow-up questions:

(1) I think offline we both agreed, based on pragmatic experience as well, that the latency histograms Envoy provides on the Prometheus endpoint are very hard to scale per cluster. For this we should explore exposing the latency percentiles directly; they are already computed internally by Envoy, just not exposed on the Prometheus endpoint:
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats

(2) For the others, we need to better understand what "aggregation" means here. I think the SMI metrics operator is implied, but I'd like to understand, for a given metric (say, "5. Bytes per second"), what level of aggregation is expected (per service? per pod? per destination? per mesh?).
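
For reference on (1), those percentiles are already visible in the admin endpoint's text output today. A minimal sketch of reading them, assuming the sidecar's admin interface is reachable on localhost:15000 (OSM's default admin port):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// The text output of the Envoy admin /stats endpoint includes
	// histogram summaries with internally computed percentiles
	// (P50, P90, P99, ...) that the Prometheus endpoint does not expose.
	resp, err := http.Get("http://localhost:15000/stats?filter=upstream_rq_time")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Histogram lines look like:
		//   cluster.foo.upstream_rq_time: P0(nan,0) ... P50(nan,2.0) ... P99(...)
		if strings.Contains(line, "P50(") {
			fmt.Println(line)
		}
	}
}
```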

@github-actions

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions bot added the stale label Feb 23, 2022

github-actions bot commented Mar 2, 2022

Issue closed due to inactivity.

github-actions bot closed this as completed Mar 2, 2022