This repository has been archived by the owner on Jul 11, 2023. It is now read-only.
Azure Monitor is working on scraping metrics from the Envoy sidecar for several scenarios (latency, error rate, success rate, etc.) to light up front-end experiences that make it easier to troubleshoot issues and understand mesh behavior. As part of this work we analyzed the cost and performance of scraping these metrics and building these scenarios. Below are the issues we see with scraping the metrics directly from the Envoy sidecar with no aggregation.
The Azure Monitor agent that scrapes the data might run into resource utilization issues when scraping all the raw metrics at high scale.
The Log Analytics (LA) data store that holds these metrics might incur a huge cost for customers, since there is no aggregation in place.
The UX queries that compute all of the aggregates on page load might see high latency and perform poorly.
After talking to the traffic metrics team, we learned that they are building a query layer on top of the Prometheus store that translates Kubernetes API queries into Prometheus queries to get these aggregated metrics. This will not work for us, for the following reasons.
We do not want to take a dependency on the Kubernetes API server, since these calls add load to the API server, which could make it unresponsive and affect other functionality on the cluster and its workloads.
It also takes a dependency on Prometheus. As a monitoring platform ourselves, we cannot depend on Prometheus: that defeats the purpose of managed monitoring and means more work for the customer.
With OSM exposing aggregated metrics, we can query them directly, which solves the first set of problems, and the traffic metrics layer can use the same functionality with no dependency on Prometheus.
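To make the request concrete, here is a minimal sketch of the kind of roll-up we have in mind: raw per-pod request counts collapsed into one record per service. The field names (`service`, `pod`, `code_class`, `count`) and the sample values are made up for illustration; they are not an existing OSM or Envoy schema, and treating only `2xx` as success is an assumption.

```python
from collections import defaultdict

# Hypothetical raw samples, one per pod and response-code class.
# Field names and values are illustrative only.
raw_samples = [
    {"service": "bookstore", "pod": "bookstore-1", "code_class": "2xx", "count": 90},
    {"service": "bookstore", "pod": "bookstore-1", "code_class": "5xx", "count": 10},
    {"service": "bookstore", "pod": "bookstore-2", "code_class": "2xx", "count": 100},
    {"service": "bookbuyer", "pod": "bookbuyer-1", "code_class": "2xx", "count": 50},
]


def aggregate_by_service(samples):
    """Collapse per-pod counters into a per-service total and success rate."""
    totals = defaultdict(lambda: {"total": 0, "success": 0})
    for sample in samples:
        entry = totals[sample["service"]]
        entry["total"] += sample["count"]
        # Assumption: only 2xx responses count as successful.
        if sample["code_class"] == "2xx":
            entry["success"] += sample["count"]
    return {
        service: {
            "total": t["total"],
            "success_rate": t["success"] / t["total"] if t["total"] else None,
        }
        for service, t in totals.items()
    }


aggregates = aggregate_by_service(raw_samples)
```

With aggregation done at the source, the agent would ship one record per service instead of one per pod and response-code class, which is where the scrape, storage and query savings would come from.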
Here is a list of metrics for which aggregates would be required by Azure Monitor and which could be used by Traffic metrics.
Hey Rashmi, thanks for leaving this here for us. I have some follow-up questions:
(1) I think we both agreed offline, also based on practical experience, that the latency histograms Envoy provides on its Prometheus endpoint are very hard to scale per cluster. For this we should explore exposing the latency percentiles directly; they are at least already calculated internally in Envoy but not exposed on the Prometheus endpoint: https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats
(2) For the other ones, we should better understand what "aggregation" we mean. I think the SMI metrics operator is implied here, but I'd like to understand, for a given metric (say, "5. Bytes per second"), what level of aggregation is expected (per service? per pod? per destination? per mesh?).
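For (1), a rough sketch of what consuming the pre-computed percentiles could look like. It assumes the histogram layout that Envoy's admin `/stats?format=json` output uses (`supported_quantiles` plus `computed_quantiles` with `interval` and `cumulative` values); the payload below, including the stat name and numbers, is invented, and `extract_percentiles` is a hypothetical helper, not something OSM or Envoy provides.

```python
import json

# Invented payload in the shape assumed for Envoy's /stats?format=json output.
SAMPLE = """
{
  "stats": [
    {"name": "cluster.bookstore.upstream_rq_total", "value": 200},
    {
      "histograms": {
        "supported_quantiles": [0.0, 50.0, 90.0, 99.0, 100.0],
        "computed_quantiles": [
          {
            "name": "cluster.bookstore.upstream_rq_time",
            "values": [
              {"interval": null, "cumulative": 1.0},
              {"interval": null, "cumulative": 12.5},
              {"interval": null, "cumulative": 40.0},
              {"interval": null, "cumulative": 95.0},
              {"interval": null, "cumulative": 120.0}
            ]
          }
        ]
      }
    }
  ]
}
"""


def extract_percentiles(stats_json, wanted=(50, 90, 99)):
    """Return {histogram_name: {"p50": ..., ...}} from cumulative quantiles."""
    result = {}
    for entry in stats_json.get("stats", []):
        hist = entry.get("histograms")
        if not hist:
            continue  # plain counters and gauges carry no quantiles
        quantiles = hist["supported_quantiles"]
        for h in hist["computed_quantiles"]:
            # Pair each supported quantile with its cumulative value.
            by_q = {q: v["cumulative"] for q, v in zip(quantiles, h["values"])}
            result[h["name"]] = {f"p{q}": by_q[q] for q in wanted if q in by_q}
    return result


percentiles = extract_percentiles(json.loads(SAMPLE))
```

The point is only that a handful of percentile values per histogram is far cheaper to ship than every bucket of the Prometheus histogram; which aggregation level these would be reported at is still the open question in (2).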
Scope (please mark with X where applicable)
Possible use cases
Described above.