How do I find envoy only overhead btwn two services? #37718
Comments
cc @wbpcode who probably has more knowledge on service-mesh tracing.
When a request flows from service-a to service-b under an Istio service mesh, it passes through the source application, the source Envoy sidecar, the network, the destination Envoy sidecar, and finally the destination application.

The trace you’re seeing (e.g., in Zipkin) may report a total time that includes both network traversal and Envoy overhead, while your application logs (e.g., from Splunk) show the pure service processing time (1ms). The difference typically comes from proxy processing in the two sidecars (routing, telemetry, mTLS) plus the network hop between the pods.

To quantify just the Envoy overhead, you need metrics from Envoy’s perspective. Istio integrates Envoy’s metrics into Prometheus and provides Grafana dashboards out of the box. By comparing source-reporter metrics and destination-reporter metrics, you can approximate how much time is spent in the proxies and network before/after hitting the application.

Relevant Metrics

Key Istio metric: istio_request_duration_milliseconds, which both sidecars record and label with reporter="source" or reporter="destination". By looking at these two views, you can break down where the latency is introduced.

Key Envoy-specific metric: envoy_cluster_upstream_rq_time, the time Envoy spends waiting on the upstream for a response.

How to Compare and Calculate Overhead

Latency as seen by the source sidecar:

```
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="source", destination_service="service-b..svc.cluster.local"}[1m])) by (le))
```

This gives you the 95th percentile latency as measured by the source sidecar. It includes the source Envoy’s processing, the network hop, the destination Envoy’s processing, and the service’s own processing time.
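If service-b has several callers and you only care about the service-a → service-b path, the same metric can be narrowed with Istio’s standard source labels. A minimal sketch, assuming the calling workload is literally named service-a and lives in the default namespace (adjust both values to your deployment):

```
# p95 of service-a -> service-b requests, as recorded by service-a's sidecar
histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="source",
    source_workload="service-a",
    destination_service="service-b.default.svc.cluster.local"
  }[1m])) by (le))
```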
Similarly, from the destination side:

```
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service="service-b..svc.cluster.local"}[1m])) by (le))
```

This measures latency from when the request hits the destination Envoy to when the response is returned. It includes the destination Envoy’s processing plus the service’s own processing time, but not the source sidecar or the network hop.
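To put a rough number on the source-proxy-plus-network share, you can subtract the two views directly in PromQL. This is only an approximation (the difference of two p95s is not the p95 of the difference), and the default namespace in destination_service below is an assumption:

```
# Approximate source-sidecar + network overhead at p95:
# (latency seen by the caller's sidecar) - (latency seen by the callee's sidecar)
histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="source",
    destination_service="service-b.default.svc.cluster.local"
  }[1m])) by (le))
-
histogram_quantile(0.95,
  sum(rate(istio_request_duration_milliseconds_bucket{
    reporter="destination",
    destination_service="service-b.default.svc.cluster.local"
  }[1m])) by (le))
```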
You mentioned that application logs (e.g., Splunk logs) show only ~1ms of service processing time. If the reporter="destination" metric shows something like ~2ms, that suggests roughly 1ms of Envoy overhead on the destination side.

You can also look at Envoy’s own view of the upstream request time:

```
histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="inbound|||service-b..svc.cluster.local"}[1m])) by (le))
```

Compare this to the reporter="destination" measurements. If envoy_cluster_upstream_rq_time closely matches the source Envoy’s total time (minus the small difference at the destination), much of the latency is likely network or TLS overhead rather than the service itself.

Putting It All Together

To determine Envoy-only overhead, compare three views of the same requests: the source-reporter latency (source proxy + network + destination proxy + application), the destination-reporter latency (destination proxy + application), and the application’s own processing time from its logs. The differences between these views isolate the proxy and network contributions. By iterating through these comparisons, you can pinpoint where the overhead is introduced and quantify Envoy’s contribution to the latency between two services.
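As a purely illustrative calculation (only the ~1ms application time and ~2ms destination-side figure come from the discussion above; the 5ms source-side figure is hypothetical):

- source-reporter p95: 5ms (source proxy + network + destination proxy + application)
- destination-reporter p95: 2ms (destination proxy + application)
- application log time: 1ms
- destination Envoy overhead ≈ 2 - 1 = 1ms
- source Envoy + network overhead ≈ 5 - 2 = 3ms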
If you are reporting any crash or any potential security issue, do not
open an issue in this repo. Please report the issue via emailing
envoy-security@googlegroups.com where the issue will be triaged appropriately.
Title: How do I find envoy only overhead btwn two services?
Description:
[optional Relevant Links:]