
YARP latency - less requests = more latency, more requests = less latency #2730

Closed
4865783a5d opened this issue Jan 30, 2025 · 9 comments
Labels
Type: Bug Something isn't working

Comments

@4865783a5d

4865783a5d commented Jan 30, 2025

Describe the bug

We are running Duende Identity Server in a production environment with YARP in front. We started seeing increased latency beginning on the 8th of January; see DuendeSoftware/Support#1565 for the Identity Server side details. The STS endpoint basically returns a cached JSON response (the OpenID Discovery Document).

Further technical details

Yarp.ReverseProxy version: 2.2.0
.NET 8
Azure App Services (Windows)

We run Identity Server load-balanced across two regions behind Azure Front Door. Each region has its own YARP instance with idsrv as its single cluster, and they're connected through private endpoints.

PaaS resource usage is below ~5% across the board (DB, App Service Plan CPU, memory, etc.).

The 99th percentile latency of the two specific endpoints over a span of 7 days and ~2.6 million requests is almost exactly 1.03 seconds.
Image

When we increased requests through a synthetic source, the latency went back to normal.

Image

Seeing this, we wondered whether keep-alive behavior or the overhead of establishing new TCP connections could be involved.
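If connection churn were the cause, one thing we considered (but have not deployed) is tuning the outbound SocketsHttpHandler via YARP's ConfigureHttpClient hook. A rough sketch with arbitrary values:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    // Customize the SocketsHttpHandler YARP uses for outbound connections.
    .ConfigureHttpClient((context, handler) =>
    {
        // Keep idle pooled connections around longer so a low request rate
        // doesn't force new TCP/TLS handshakes (values are arbitrary).
        handler.PooledConnectionIdleTimeout = TimeSpan.FromMinutes(5);

        // For HTTP/2, send PING frames so intermediaries don't drop idle connections.
        handler.KeepAlivePingPolicy = HttpKeepAlivePingPolicy.Always;
        handler.KeepAlivePingDelay = TimeSpan.FromSeconds(30);
        handler.KeepAlivePingTimeout = TimeSpan.FromSeconds(10);
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();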

Further request details: the YARP instance is represented by the first two spans; the request is then forwarded to idsrv.

A fast request

Image

Captured diagnostics:

{
  "StartTime": "2025-01-30T14:38:59.9374068Z",
  "RouteInvokeOffset": 0.0115,
  "ProxyStartOffset": 0.0149,
  "HttpRequestStartOffset": 0.0436,
  "HttpConnectionEstablishedOffset": 0,
  "HttpRequestLeftQueueOffset": 0,
  "HttpRequestHeadersStartOffset": 0.0611,
  "HttpRequestHeadersStopOffset": 0.1117,
  "HttpRequestContentStartOffset": 0,
  "HttpRequestContentStopOffset": 0,
  "HttpResponseHeadersStartOffset": 0.1137,
  "HttpResponseHeadersStopOffset": 19.1955,
  "HttpResponseContentStopOffset": 19.5173,
  "HttpRequestStopOffset": 19.3123,
  "ProxyStopOffset": 19.5248,
  "Error": 0,
  "RequestBodyLength": 0,
  "ResponseBodyLength": 2228,
  "RequestContentIops": 0,
  "ResponseContentIops": 2,
  "DestinationId": "destination1",
  "ClusterId": "identity_server",
  "RouteId": "identity_server"
}

A slow request

Image

Captured diagnostics:

{
  "StartTime": "2025-01-30T14:22:59.994231Z",
  "RouteInvokeOffset": 0.0218,
  "ProxyStartOffset": 0.0267,
  "HttpRequestStartOffset": 0.0758,
  "HttpConnectionEstablishedOffset": 0,
  "HttpRequestLeftQueueOffset": 0,
  "HttpRequestHeadersStartOffset": 0.1095,
  "HttpRequestHeadersStopOffset": 0.2264,
  "HttpRequestContentStartOffset": 0,
  "HttpRequestContentStopOffset": 0,
  "HttpResponseHeadersStartOffset": 0.2295,
  "HttpResponseHeadersStopOffset": 1195.2714,
  "HttpResponseContentStopOffset": 1195.5575,
  "HttpRequestStopOffset": 1195.3947,
  "ProxyStopOffset": 1195.5646,
  "Error": 0,
  "RequestBodyLength": 0,
  "ResponseBodyLength": 2228,
  "RequestContentIops": 0,
  "ResponseContentIops": 2,
  "DestinationId": "destination1",
  "ClusterId": "identity_server",
  "RouteId": "identity_server"
}

Let me know if you need further information. We're using a standard YARP setup with 1:1 forwarding over a cluster / route config.
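For reference, the config is equivalent to a single catch-all route pointing at one destination. Expressed in code instead of appsettings it would look roughly like this (a sketch; the host name is redacted as in the logs below):

using Yarp.ReverseProxy.Configuration;

var routes = new[]
{
    new RouteConfig
    {
        RouteId = "identity_server",
        ClusterId = "identity_server",
        Match = new RouteMatch { Path = "/{**catchall}" }
    }
};

var clusters = new[]
{
    new ClusterConfig
    {
        ClusterId = "identity_server",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            // Single destination: the Identity Server App Service behind a private endpoint.
            ["destination1"] = new DestinationConfig { Address = "https://weu-x.azurewebsites.net/" }
        }
    }
};

builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);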

4865783a5d added the Type: Bug (Something isn't working) label on Jan 30, 2025
@MihaZupan
Member

Hey, thank you for including the timings upfront :)

I'm guessing that the logic to collect them is about the same as the metrics sample in this repo?

IMO these are the interesting bits to look at

// Fast
"HttpRequestLeftQueueOffset": 0,
"HttpResponseHeadersStartOffset": 0.1137,
"HttpResponseHeadersStopOffset": 19.1955,

// Slow
"HttpRequestLeftQueueOffset": 0,
"HttpResponseHeadersStartOffset": 0.2295,
"HttpResponseHeadersStopOffset": 1195.2714,

The fact that you never saw a RequestLeftQueue event means that both of these requests were able to immediately pick up an existing idle connection from the connection pool.

The ResponseHeadersStart in this case will be logged after request headers are flushed to the network, but before we see an initial response.
For both requests, these offsets are low, meaning the requests were sent out immediately.

The only thing that differs is the ResponseHeadersStop, which indicates that it was the server behind YARP that took a long time to respond.
Do you have any timing information from that server for these requests?

@MihaZupan
Member

MihaZupan commented Jan 30, 2025

Another guess, since your latency is suspiciously close to 1 second: that happens to be the default Expect100ContinueTimeout for SocketsHttpHandler.

Is it possible that the server behind YARP doesn't support the mechanism?
You could try stripping out the header in a transform and see if that affects your P99.

Might not be the case since your requests don't appear to have a request body though.
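If you do want to test it, something along these lines should work (a sketch; lowering the handler's Expect100ContinueTimeout via ConfigureHttpClient is the other knob):

using Yarp.ReverseProxy.Transforms;

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .AddTransforms(context =>
    {
        // Strip the Expect header so the proxied request never waits for a
        // "100 Continue" from the destination.
        context.AddRequestHeaderRemove("Expect");
    })
    .ConfigureHttpClient((context, handler) =>
    {
        // Alternative: keep the header but wait far less than the 1 second default.
        handler.Expect100ContinueTimeout = TimeSpan.FromMilliseconds(200);
    });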

@4865783a5d
Author

4865783a5d commented Jan 30, 2025

My pleasure :)

Yep, I copied your sample and used the same middleware code to write the timings to App Insights.
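In essence it's a trimmed-down version of the sample's consumers, e.g. (a sketch showing only the response-header timings; the real middleware records all of the offsets above and writes them to App Insights):

using Yarp.Telemetry.Consumption;

// Registered with: builder.Services.AddTelemetryConsumer<ResponseTimingConsumer>();
public sealed class ResponseTimingConsumer : IHttpTelemetryConsumer
{
    // As in the metrics sample, an AsyncLocal is used to correlate events
    // belonging to the current proxied request.
    private static readonly AsyncLocal<DateTime> _responseHeadersStart = new();

    public void OnResponseHeadersStart(DateTime timestamp) =>
        _responseHeadersStart.Value = timestamp;

    public void OnResponseHeadersStop(DateTime timestamp)
    {
        var elapsed = timestamp - _responseHeadersStart.Value;
        // In our setup this goes to App Insights rather than the console.
        Console.WriteLine($"Response headers took {elapsed.TotalMilliseconds:F1} ms");
    }
}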

The backend currently doesn't have additional ASP.NET Core timings enabled, other than OTel.

Identity Server and YARP each run as a .NET 8 web API in Azure. Both services run on a P1v3 SKU scaled out to 2 instances (not very busy). I'll have a look at that header, but I assume it's supported on Azure App Service for Windows with .NET 8?

In production, requests come from a wide range of clients, which might indeed send that specific header. Our synthetic client doesn't; it's a C# console app using an HttpClient.
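That client is essentially just a loop like this (simplified sketch; the URL is a placeholder for our real STS host):

using var client = new HttpClient();

while (true)
{
    // Hit the discovery endpoint at a fixed interval to keep traffic flowing.
    using var response = await client.GetAsync(
        "https://x.azurewebsites.net/.well-known/openid-configuration");

    Console.WriteLine($"{DateTime.UtcNow:O} {(int)response.StatusCode}");
    await Task.Delay(TimeSpan.FromSeconds(1));
}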

Having a 99th percentile of 1.03 seconds over more than 2 million requests smells like something internal.

@4865783a5d
Author

@MihaZupan I'm closing this one as YARP does not seem to be the culprit. Thanks for your help!

@MihaZupan
Member

Were you able to track down where the latency was coming from?

@4865783a5d
Author

4865783a5d commented Feb 12, 2025

@MihaZupan Not quite. After enabling some more logging, we noticed that the latency increase happens just after the "Message: Proxying to https:" entry:

10:06:43.036 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 1.5 s, URL: https://x/.well-known/openid-configuration/jwks
10:06:43.036 | Trace | Severity level: Information, Message: Request starting HTTP/1.1 GET https://x.azurewebsites.net/.well-known/openid-configuration/jwks - - -
10:06:43.036 | Trace | Message: All hosts are allowed.
10:06:43.036 | Trace | Message: Adding HSTS header to response.
10:06:43.036 | Trace | Message: 1 candidate(s) found for the request path '/.well-known/openid-configuration/jwks'
10:06:43.036 | Trace | Message: Endpoint 'identity_server' with route pattern '/{**catchall}' is valid for the request path '/.well-known/openid-configuration/jwks'
10:06:43.036 | Trace | Message: Request matched endpoint 'identity_server'
10:06:43.036 | Trace | Message: The endpoint does not specify the IRequestSizeLimitMetadata.
10:06:43.036 | Trace | Severity level: Information, Message: Executing endpoint 'identity_server'
10:06:43.036 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:06:44.551 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 875.1 μs, URL: https://x/.well-known/openid-configuration/jwks
10:06:44.551 | Trace | Severity level: Information, Message: Request starting HTTP/1.1 GET https://x.azurewebsites.net/.well-known/openid-configuration/jwks - - -
10:06:44.551 | Trace | Message: All hosts are allowed.
10:06:44.551 | Trace | Message: Adding HSTS header to response.
10:06:44.551 | Trace | Message: No candidates found for the request path '/.well-known/openid-configuration/jwks'
10:06:44.551 | Trace | Message: Request did not match any endpoints
10:06:44.551 | Trace | Message: AuthenticationScheme: Identity.Application was not authenticated.
10:06:44.551 | Trace | Message: AuthenticationScheme: Identity.Application was not authenticated.
10:06:44.551 | Dependency | Name: IdentityServerProtocolRequest, Type: Other, Call status: true, Duration: 143.8 μs
10:06:44.551 | Dependency | Name: DiscoveryEndpoint, Type: Other, Call status: true, Duration: 1.3 μs
10:06:44.551 | Dependency | Name: WriteJson, Type: Other, Call status: true, Duration: 104.1 μs
10:06:44.551 | Trace | Severity level: Information, Message: Request finished HTTP/1.1 GET https://x/.well-known/openid-configuration/jwks - 200 - application/json;+charset=UTF-8 0.7978ms
10:06:44.556 | Trace | Severity level: Information, Message: Received HTTP/2.0 response 200.
10:06:44.556 | Trace | Severity level: Information, Message: Executed endpoint 'identity_server'
10:06:44.556 | Trace | Severity level: Information, Message: Request finished HTTP/1.1 GET https://x/.well-known/openid-configuration/jwks - 200 - application/json;+charset=utf-8 1520.1950ms

Some more samples

10:05:42.610 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:05:45.739 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 911.1 μs, URL: https://x/.well-known/openid-configuration/jwks
10:04:42.757 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:04:44.512 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 981.3 μs, URL: https://x/.well-known/openid-configuration/jwks

That log statement is emitted here, which is probably among the last few lines of code before the to-be-proxied request is sent downstream:

Image

We'll be involving Azure Support as well, since there seems to be an issue between the two App Services. We're also trying to reproduce the latency with a Linux App Service (currently Windows).
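(For reference, the "more logging" mentioned above amounts to raising the log levels for the proxy and hosting categories, roughly along these lines:)

// Surface the "Proxying to ..." and request lifecycle/routing messages in App Insights.
builder.Logging.AddFilter("Yarp", LogLevel.Debug);
builder.Logging.AddFilter("Microsoft.AspNetCore.Routing", LogLevel.Debug);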

4865783a5d reopened this on Feb 12, 2025
@MihaZupan
Member

Double-checking the tracing implementation, I misspoke before:

// Slow
"HttpResponseHeadersStartOffset": 0.2295,
"HttpResponseHeadersStopOffset": 1195.2714,

ResponseHeadersStart is logged after we've received the status line, but before we've received all the headers.
Seeing a large time delta between these is unusual; normally those would be sent as a single chunk, and the time delta would just be the couple of nanoseconds it took to parse out the headers.
I don't see why any HTTP server would do it differently.

Where this could change is if the server is sending back interim 1xx responses right away but the actual response takes a while, or if something delayed the headers portion of the traffic for some reason.
In your case, are YARP's connections going through some other server (e.g. custom proxies, firewalls, ...), or do you have some custom VPN / antivirus solution intercepting traffic?

Given that this is happening at P99, it should still be relatively common. Would you be able to capture a network trace that includes such slow requests?
Alternatively, it's also possible to capture trace-level info within .NET itself (note that this would also include PII).

With these timings, this is very unlikely to be an issue in YARP itself.
It could be in the underlying .NET networking stack, but I think it's most likely something specific to your setup.
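For the in-process .NET option, a minimal EventListener subscribed to the System.Net event sources would capture the handler-level events (a rough sketch; dotnet-trace with the same providers is the out-of-process equivalent):

using System.Diagnostics.Tracing;

// Keep an instance alive for the lifetime of the app, e.g. in Program.cs:
// var networkListener = new SystemNetEventListener();
public sealed class SystemNetEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Covers System.Net.Http, System.Net.Sockets, System.Net.Security, System.Net.NameResolution.
        if (eventSource.Name.StartsWith("System.Net", StringComparison.Ordinal))
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.TimeStamp:O} {eventData.EventSource.Name}/{eventData.EventName} {payload}");
    }
}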

@4865783a5d
Author

@MihaZupan Sorry for the late reply. A typical request flow looks like this in our infra (we're using Azure VWAN and Private Endpoints):

client --> Azure Front Door --Private Endpoint--> Azure App Service (YARP) --Private Endpoint--> Azure App Service (Cluster)

We do have Network Security Groups in place, but disabling them had no impact. I'll be running a few network traces in Azure, both over a direct route to the cluster and through YARP, to see whether there is anything significant in between (e.g. Azure networking).

@4865783a5d
Author

We ran network traces from the apps running YARP to the destination and will open a case with Azure Support, as the application layer (based on the logs) does not seem to be responsible for the latency.
