
How do I find the Envoy-only overhead between two services? #37718

Open
nitinware opened this issue Dec 17, 2024 · 2 comments
Labels
area/tracing, question (Questions that are neither investigations, bugs, nor enhancements)

Comments

@nitinware


Title: How do I find the Envoy-only overhead between two services?

Description:

We have two services on-boarded onto the service mesh: service-a and service-b.
service-a calls service-b, and the request goes over the mesh.
The call path is: service-a envoy-egress --> service-b envoy-ingress.
We see high latency (34ms) reported for the call from service-a to service-b in the Zipkin trace; screenshot below.
[screenshot: Zipkin trace of the service-a -> service-b call]
The actual latency on service-b in Splunk logs is only 1ms, but the Zipkin trace shows 34ms.
We want to know where this latency is coming from: service-a envoy-egress or service-b envoy-ingress?
Can you please share the Envoy Grafana metrics and queries that can be used to find the cause of the latency? Also, how do we calculate the Envoy-only overhead between two services? Appreciate your inputs. Thanks.


@nitinware added the triage label on Dec 17, 2024
@adisuissa added the question and area/tracing labels and removed the triage label on Dec 18, 2024
@adisuissa
Contributor

cc @wbpcode who probably has more knowledge on service-mesh tracing.

@akhilsingh-git

When a request flows from service-a to service-b under an Istio service mesh, it passes through:
1. The egress sidecar of service-a (source Envoy).
2. The ingress sidecar of service-b (destination Envoy).
3. The actual service-b application.

The trace you’re seeing (e.g., Zipkin) may report a total time that includes both network traversal and Envoy overhead, while your application logs (e.g., from Splunk) show the pure service processing time (1ms). The difference often comes from:
• Envoy proxy overhead at source and/or destination.
• Network latency between the two pods.

To quantify just the Envoy overhead, you need metrics from Envoy’s perspective. Istio integrates Envoy’s metrics into Prometheus and provides Grafana dashboards out-of-the-box. By comparing source-reporter metrics and destination-reporter metrics, you can approximate how much time is spent in the proxies and network before/after hitting the application.
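
As a quick sanity check before comparing latencies, confirm that telemetry is being reported from both sidecars. A minimal sketch (the destination_service value is a placeholder; substitute service-b's actual namespace and adjust the window to your scrape interval):

sum(rate(istio_requests_total{destination_service="service-b.<namespace>.svc.cluster.local"}[5m])) by (reporter, response_code)

You should see series for both reporter="source" and reporter="destination"; if one side is missing, that sidecar is not emitting telemetry and the comparisons below will not line up.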

Relevant Metrics

Key Istio Metrics:
• istio_requests_total and istio_request_duration_milliseconds
These metrics are recorded by both the source and destination Envoy proxies.
• reporter="source": Measures time from the perspective of the client sidecar. It includes the time Envoy takes to route the request, TLS handshake (if any), and network transit until it gets a response back.
• reporter="destination": Measures time from the perspective of the server sidecar. It starts measuring from when the request arrives at the server Envoy until the response is sent back to the client Envoy.

By looking at these two views, you can break down where the latency is introduced.
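
For example, one way to put the two views side by side is to chart the average request duration per reporter (a sketch using the same placeholder destination_service as above):

sum(rate(istio_request_duration_milliseconds_sum{destination_service="service-b.<namespace>.svc.cluster.local"}[5m])) by (reporter)
/
sum(rate(istio_request_duration_milliseconds_count{destination_service="service-b.<namespace>.svc.cluster.local"}[5m])) by (reporter)

The source series should track the client span you see in Zipkin (~34ms in your case), while the destination series should sit close to the application's own processing time.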

Key Envoy-Specific Metrics:
These can provide deeper insights into the Envoy overhead itself:
• envoy_cluster_upstream_rq_time (histogram)
Measures the time taken for requests routed upstream, as seen by Envoy. This includes network and server-side processing time from Envoy’s perspective.
• envoy_cluster_upstream_cx_connect_ms
Measures upstream connection-establishment time (when Envoy has to open new connections); a sample query is sketched after this list.
• envoy_tcp_downstream_cx_* and envoy_tcp_upstream_cx_* metrics (if applicable)
These provide insight into TCP-level overhead. They are not always necessary for HTTP-based workloads, but can help if TLS or connection-setup overhead is suspected.
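
As an illustration of the connection-setup metric above, here is a sketch of a P95 connect-time query against the source sidecar's stats. The cluster-name regex is an assumption based on Istio's outbound cluster naming (outbound|<port>||<host>), and note that Istio exposes only a subset of Envoy's cluster-level stats by default, so these histograms may need to be enabled via the proxy stats configuration:

histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name=~"outbound.*service-b.*"}[5m])) by (le))

If connect times are high or new connections are being established frequently, a per-request TLS handshake is likely part of the gap you are seeing.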

How to Compare and Calculate Overhead
1. Find Latency Reported by Source Envoy (Client-Side)
Use a query like:

histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="source", destination_service="service-b.<namespace>.svc.cluster.local"}[1m])) by (le))

This gives you the 95th percentile latency as measured by the source sidecar. It includes:
• Source Envoy overhead
• Network latency
• Destination Envoy overhead
• Destination service processing time

2. Find Latency Reported by Destination Envoy (Server-Side)

Similarly:

histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service="service-b.<namespace>.svc.cluster.local"}[1m])) by (le))

This measures latency from when the request hits the destination Envoy to when the response is returned. It includes:
• Destination Envoy overhead
• Destination service processing time

3. Compare Destination Envoy Latency with Actual Application Latency

You mentioned that application logs (e.g., Splunk logs) show only ~1ms service processing time. If the reporter="destination" metric shows something like ~2ms, that suggests roughly 1ms of Envoy overhead on the destination side.
4. Calculate Approximate Envoy + Network Overhead
The difference between the source-reporter and destination-reporter metrics gives you the combined overhead of:
• Source Envoy overhead
• Network latency between the two pods
• Any additional cost, such as TLS handshakes on connections that are not reused
For example, if the reporter="source" P95 is 34ms and the reporter="destination" P95 is 2ms, then ~32ms is coming from either source Envoy overhead or network latency. A sketch of this comparison as a single PromQL expression follows below.
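
A sketch of this step as a single expression, subtracting the two P95s the same way (same placeholder destination_service; keep in mind that subtracting quantiles only approximates the per-request overhead):

histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="source", destination_service="service-b.<namespace>.svc.cluster.local"}[1m])) by (le))
-
histogram_quantile(0.95, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination", destination_service="service-b.<namespace>.svc.cluster.local"}[1m])) by (le))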
5. Drilling Down Further Using Envoy Cluster Metrics
Look at envoy_cluster_upstream_rq_time from the source Envoy’s perspective for the cluster that represents service-b. This metric measures how long the Envoy proxy at service-a waited for a response after sending the request upstream.
For example:

histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{envoy_cluster_name="outbound|<port>||service-b.<namespace>.svc.cluster.local"}[1m])) by (le))

Compare this to the reporter="destination" measurements. If envoy_cluster_upstream_rq_time closely matches the source Envoy's total time (minus the small difference at the destination), most of the latency is likely network or TLS overhead rather than the service itself. A destination-side counterpart of this query is sketched after this list.
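
As a destination-side counterpart, on service-b's sidecar the same envoy_cluster_upstream_rq_time metric for the inbound cluster measures how long the destination Envoy waits on the local application, which you can cross-check against the ~1ms you see in Splunk. This is only a sketch: the inbound cluster name format varies across Istio versions, and the namespace selector assumes your Prometheus scrape config attaches that label to sidecar stats:

histogram_quantile(0.95, sum(rate(envoy_cluster_upstream_rq_time_bucket{namespace="<service-b-namespace>", envoy_cluster_name=~"inbound.*"}[1m])) by (le))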

Putting It All Together
1. Identify Baseline Application Latency (From Splunk or App Logs):
Application internal logs: ~1ms
2. Check Destination Envoy Metrics:
If reporter="destination" metrics show ~2-3ms, the destination-side Envoy overhead is minimal (~1-2ms on top of the 1ms service time).
3. Check Source Envoy Metrics:
If reporter="source" metrics show ~34ms, and the destination side only accounts for ~2ms total, the extra ~32ms could be from:
• Network latency
• Source Envoy overhead (e.g., waiting for DNS, TLS handshake, or connection pooling issues)
4. Use Envoy Cluster Metrics to Narrow Down:
If envoy_cluster_upstream_rq_time is close to 33-34ms, then the majority of the time is in transit (network, TLS, or connection-level overhead), not at the destination side.

Additional Tips
• Check the Istio Dashboards:
Istio’s default dashboards (e.g., istio-service-dashboard or istio-workload-dashboard in Grafana) provide pre-built panels for reporter=source and reporter=destination latencies.
• Enable Additional Envoy Debug Logs (Only for Non-Prod):
If you need more precision, consider enabling Envoy debug logs in a test environment to see connection lifecycle events and TLS handshake durations.
• Look for Connection Reuse:
High latency can come from Envoy repeatedly creating new connections. If Envoy cannot reuse connections (e.g., due to misconfiguration), you pay a TLS handshake cost on every request; a sample ratio query for checking this follows below.
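
As referenced in the last tip, a rough sketch for checking connection reuse from the source sidecar, using standard Envoy cluster counters (same outbound cluster-name assumption as earlier):

sum(rate(envoy_cluster_upstream_cx_total{envoy_cluster_name=~"outbound.*service-b.*"}[5m]))
/
sum(rate(envoy_cluster_upstream_rq_total{envoy_cluster_name=~"outbound.*service-b.*"}[5m]))

A ratio close to 1 means roughly one new connection per request (poor reuse, so a TLS handshake on nearly every call); a ratio close to 0 means connections are being pooled effectively.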

Summary

To determine Envoy-only overhead:
1. Compare reporter="source" and reporter="destination" latencies using istio_request_duration_milliseconds.
2. Subtract known service processing time (from application logs) from the destination-reported latency to estimate Envoy overhead at the destination.
3. The gap between the source-reported and destination-reported latencies points to network plus source-side Envoy overhead.
4. Use Envoy cluster metrics to further isolate whether the source delay is network-induced or related to Envoy’s handling of the request.

By iterating through these comparisons, you can pinpoint where the overhead is introduced and quantify Envoy’s contribution to the latency between two services.
