This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Aggregated metrics for consumption by Azure monitor and Traffic metrics team #2498

Closed
Tracked by #3704
rashmichandrashekar opened this issue Feb 9, 2021 · 3 comments

@rashmichandrashekar

Azure Monitor is working on scraping metrics from the Envoy sidecar for scenarios such as latency, error rate, and success rate, to power front-end experiences that make troubleshooting issues and understanding mesh behavior easier. As part of this work we analyzed the cost and performance of scraping these metrics and building these scenarios. Below are the issues we see with scraping metrics directly from the Envoy sidecar with no aggregation:

  1. The Azure Monitor agent that scrapes the data may run into resource-utilization issues when collecting all of the raw metrics at high scale.
  2. The Log Analytics (LA) data store that holds these metrics may incur significant cost for customers, since no aggregation is in place.
  3. The UX queries that compute all of the aggregates on page load may see high latencies and performance problems.

After talking to the Traffic Metrics team, we learned that they are building a query layer on top of the Prometheus store that translates Kubernetes API queries into Prometheus queries to produce these aggregated metrics. This will not work for us, for the following reasons:

  1. We do not want to take a dependency on the Kubernetes API server: the additional calls would add load to the API server, potentially making it unresponsive and affecting other functionality on the cluster and its workloads.
  2. It also takes a dependency on Prometheus. As a monitoring platform ourselves, we cannot depend on Prometheus; doing so defeats the purpose of managed monitoring and means more work for the customer.

With OSM exposing aggregated metrics, we can query for them directly, which solves the first set of problems; the traffic metrics layer can leverage the same functionality with no dependency on Prometheus.
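
For illustration, this is roughly the aggregation that the query layer would otherwise have to run against Prometheus on every request. A minimal sketch using the Prometheus Go client, assuming a reachable Prometheus server at the address shown and Envoy's standard `envoy_cluster_upstream_rq_time` histogram; if OSM exposed the pre-aggregated percentile instead, this per-query work (and the Prometheus dependency) would disappear:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumption: a Prometheus server scraping the mesh is reachable here.
	client, err := api.NewClient(api.Config{Address: "http://prometheus:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	// P99 request latency derived from Envoy's raw histogram buckets.
	// This is exactly the per-query aggregation cost that pre-aggregated
	// metrics from OSM would avoid.
	query := `histogram_quantile(0.99,
	    sum(rate(envoy_cluster_upstream_rq_time_bucket[1m])) by (le))`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```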

Here is the list of metrics for which Azure Monitor would require aggregates, and which Traffic Metrics could also use (a rough sketch of what exposing these could look like follows the list):

  1. Request latencies (P50, P90, P99)
  2. Aggregates from which we can derive topology (see https://github.com/servicemeshinterface/smi-spec/blob/main/apis/traffic-metrics/v1alpha1/traffic-metrics.md#topologies)
  3. Request error count and percentage
  4. Request success count and percentage
  5. Bytes sent per second
  6. Bytes received per second
  7. Number of inbound connections to a service
  8. Number of outbound connections to a service
  9. Requests per second
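
To make the ask concrete, here is a minimal sketch (not OSM's actual design; all metric and label names are hypothetical) of how pre-aggregated latency percentiles and request counts could be exposed with the Prometheus Go client library, so that Azure Monitor scrapes a handful of service-level series instead of raw per-pod histograms:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// All metric and label names here are hypothetical; OSM would define its own.
var (
	// 1. Pre-computed latency quantiles (P50/P90/P99) instead of raw
	// histogram buckets, so the scraper never has to aggregate them.
	requestLatency = prometheus.NewSummaryVec(prometheus.SummaryOpts{
		Name:       "osm_request_duration_seconds",
		Help:       "Request latency aggregated per destination service.",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	}, []string{"destination_service"})

	// 3/4. Request counts by outcome; error and success percentages can
	// be derived from these two counters at display time.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "osm_requests_total",
		Help: "Requests aggregated per destination service and outcome.",
	}, []string{"destination_service", "outcome"}) // outcome: success|error
)

func main() {
	prometheus.MustRegister(requestLatency, requestsTotal)

	// Record a sample observation (in a real control plane this would be
	// fed from the sidecars' stats).
	requestLatency.WithLabelValues("bookstore").Observe(0.042)
	requestsTotal.WithLabelValues("bookstore", "success").Inc()

	// Expose the aggregated metrics on a single endpoint that Azure
	// Monitor (or the traffic metrics layer) can scrape directly.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

Rates and percentages (items 3-6 and 9) could similarly be derived from a small set of service-level counters at query or display time rather than stored as separate series.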

Scope (please mark with X where applicable)

  • New Functionality [x]
  • Install [ ]
  • SMI Traffic Access Policy [ ]
  • SMI Traffic Specs Policy [ ]
  • SMI Traffic Split Policy [ ]
  • Permissive Traffic Policy [ ]
  • Ingress [ ]
  • Egress [ ]
  • Envoy Control Plane [ ]
  • CLI Tool [ ]
  • Metrics [x]
  • Certificate Management [ ]
  • Sidecar Injection [ ]
  • Logging [ ]
  • Debugging [ ]
  • Tests [ ]
  • CI System [ ]
  • Project Release [ ]

Possible use cases
Described above.

@eduser25 (Contributor) commented Feb 23, 2021

Hey Rashmi, thanks for leaving this here for us. I have some follow-up questions:

(1) I think offline we both agreed, based on pragmatic experience as well, that the latency histograms Envoy provides on the Prometheus endpoint are very hard to scale per cluster. For this we should explore exposing the latency percentiles directly; they are already computed internally by Envoy, just not exposed on the Prometheus endpoint:
https://www.envoyproxy.io/docs/envoy/latest/operations/admin#get--stats

(2) For the others, we need to better understand what "aggregation" means here. I think the SMI metrics operator is implied, but I'd like to understand, for a given metric (say, "5. Bytes per second"), what level of aggregation is expected (per service? per pod? per destination? per mesh?).
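
For reference on (1), those percentiles are already visible in the admin endpoint's text output today. A minimal sketch of reading them, assuming the sidecar's admin interface is reachable on localhost:15000 (OSM's default admin port):

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// The text output of the Envoy admin /stats endpoint includes
	// histogram summaries with internally computed percentiles
	// (P50, P90, P99, ...) that the Prometheus endpoint does not expose.
	resp, err := http.Get("http://localhost:15000/stats?filter=upstream_rq_time")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// Histogram lines look like:
		//   cluster.foo.upstream_rq_time: P0(nan,0) ... P50(nan,2.0) ... P99(...)
		if strings.Contains(line, "P50(") {
			fmt.Println(line)
		}
	}
}
```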

@github-actions

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions bot added the stale label Feb 23, 2022

github-actions bot commented Mar 2, 2022

Issue closed due to inactivity.

github-actions bot closed this as completed Mar 2, 2022