
YARP latency - less requests = more latency, more requests = less latency #2730

Closed
4865783a5d opened this issue Jan 30, 2025 · 9 comments
Labels
Type: Bug Something isn't working

Comments

@4865783a5d

4865783a5d commented Jan 30, 2025

Describe the bug

We are running Duende Identity Server in a production environment with YARP in front. We started seeing increased latency beginning on the 8th of January; see DuendeSoftware/Support#1565 for the Identity Server side details. The STS endpoint basically returns a cached JSON response (the OpenID Discovery Document).

Further technical details

Yarp.ReverseProxy version: 2.2.0
.NET 8
Azure App Services (Windows)

We run Identity Server load-balanced across two regions behind Azure Front Door. Each region has its own YARP instance with idsrv as its single cluster, and they're connected through private endpoints.

PaaS resource usage is below ~5% across the board (DB, App Service Plan CPU, memory, etc.).

The 99th percentile latency of the two specific endpoints over a span of 7 days and ~2.6 million requests is almost exactly 1.03 seconds.
Image

When we increased requests through a synthetic source, the latency went back to normal.

Image

Seeing this, we wondered whether keep-alive behavior or the overhead of establishing new TCP connections could be involved.
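If connection churn were the cause, one thing we considered (but have not deployed) is tuning the outbound SocketsHttpHandler via YARP's ConfigureHttpClient hook. A rough sketch with arbitrary values:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    // Customize the SocketsHttpHandler YARP uses for outbound connections.
    .ConfigureHttpClient((context, handler) =>
    {
        // Keep idle pooled connections around longer so a low request rate
        // doesn't force new TCP/TLS handshakes (values are arbitrary).
        handler.PooledConnectionIdleTimeout = TimeSpan.FromMinutes(5);

        // For HTTP/2, send PING frames so intermediaries don't drop idle connections.
        handler.KeepAlivePingPolicy = HttpKeepAlivePingPolicy.Always;
        handler.KeepAlivePingDelay = TimeSpan.FromSeconds(30);
        handler.KeepAlivePingTimeout = TimeSpan.FromSeconds(10);
    });

var app = builder.Build();
app.MapReverseProxy();
app.Run();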

Further request details: the YARP instance is represented by the first two spans; the request is then forwarded to idsrv.

A fast request

Image

Captured diagnostics:

{
  "StartTime": "2025-01-30T14:38:59.9374068Z",
  "RouteInvokeOffset": 0.0115,
  "ProxyStartOffset": 0.0149,
  "HttpRequestStartOffset": 0.0436,
  "HttpConnectionEstablishedOffset": 0,
  "HttpRequestLeftQueueOffset": 0,
  "HttpRequestHeadersStartOffset": 0.0611,
  "HttpRequestHeadersStopOffset": 0.1117,
  "HttpRequestContentStartOffset": 0,
  "HttpRequestContentStopOffset": 0,
  "HttpResponseHeadersStartOffset": 0.1137,
  "HttpResponseHeadersStopOffset": 19.1955,
  "HttpResponseContentStopOffset": 19.5173,
  "HttpRequestStopOffset": 19.3123,
  "ProxyStopOffset": 19.5248,
  "Error": 0,
  "RequestBodyLength": 0,
  "ResponseBodyLength": 2228,
  "RequestContentIops": 0,
  "ResponseContentIops": 2,
  "DestinationId": "destination1",
  "ClusterId": "identity_server",
  "RouteId": "identity_server"
}

A slow request

Image

Captured diagnostics:

{
  "StartTime": "2025-01-30T14:22:59.994231Z",
  "RouteInvokeOffset": 0.0218,
  "ProxyStartOffset": 0.0267,
  "HttpRequestStartOffset": 0.0758,
  "HttpConnectionEstablishedOffset": 0,
  "HttpRequestLeftQueueOffset": 0,
  "HttpRequestHeadersStartOffset": 0.1095,
  "HttpRequestHeadersStopOffset": 0.2264,
  "HttpRequestContentStartOffset": 0,
  "HttpRequestContentStopOffset": 0,
  "HttpResponseHeadersStartOffset": 0.2295,
  "HttpResponseHeadersStopOffset": 1195.2714,
  "HttpResponseContentStopOffset": 1195.5575,
  "HttpRequestStopOffset": 1195.3947,
  "ProxyStopOffset": 1195.5646,
  "Error": 0,
  "RequestBodyLength": 0,
  "ResponseBodyLength": 2228,
  "RequestContentIops": 0,
  "ResponseContentIops": 2,
  "DestinationId": "destination1",
  "ClusterId": "identity_server",
  "RouteId": "identity_server"
}

Let me know if you need further information. We're using a standard YARP setup with 1:1 forwarding over a cluster / route config.
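For reference, the config is equivalent to a single catch-all route pointing at one destination. Expressed in code instead of appsettings it would look roughly like this (a sketch; the host name is redacted as in the logs below):

using Yarp.ReverseProxy.Configuration;

var routes = new[]
{
    new RouteConfig
    {
        RouteId = "identity_server",
        ClusterId = "identity_server",
        Match = new RouteMatch { Path = "/{**catchall}" }
    }
};

var clusters = new[]
{
    new ClusterConfig
    {
        ClusterId = "identity_server",
        Destinations = new Dictionary<string, DestinationConfig>
        {
            // Single destination: the Identity Server App Service behind a private endpoint.
            ["destination1"] = new DestinationConfig { Address = "https://weu-x.azurewebsites.net/" }
        }
    }
};

builder.Services.AddReverseProxy().LoadFromMemory(routes, clusters);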

4865783a5d added the Type: Bug (Something isn't working) label on Jan 30, 2025
@MihaZupan
Member

Hey, thank you for including the timings upfront :)

I'm guessing that the logic to collect them is about the same as the metrics sample in this repo?

IMO these are the interesting bits to look at

// Fast
"HttpRequestLeftQueueOffset": 0,
"HttpResponseHeadersStartOffset": 0.1137,
"HttpResponseHeadersStopOffset": 19.1955,

// Slow
"HttpRequestLeftQueueOffset": 0,
"HttpResponseHeadersStartOffset": 0.2295,
"HttpResponseHeadersStopOffset": 1195.2714,

The fact that you never saw a RequestLeftQueue event means that both of these requests were able to immediately pick up an existing idle connection from the connection pool.

The ResponseHeadersStart in this case will be logged after request headers are flushed to the network, but before we see an initial response.
For both requests, these offsets are low, meaning the requests were sent out immediately.

The only thing that differs is the ResponseHeadersStop, which indicates that it was the server behind YARP that took a long time to respond.
Do you have any timing information from that server for these requests?

@MihaZupan
Member

MihaZupan commented Jan 30, 2025

Another guess, since your latency is suspiciously close to 1 second: that happens to be the default Expect100ContinueTimeout for SocketsHttpHandler.

Is it possible that the server behind YARP doesn't support the mechanism?
You could try stripping out the header in a transform and see if that affects your P99.

Might not be the case since your requests don't appear to have a request body though.
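If you do want to test it, something along these lines should work (a sketch; lowering the handler's Expect100ContinueTimeout via ConfigureHttpClient is the other knob):

using Yarp.ReverseProxy.Transforms;

builder.Services.AddReverseProxy()
    .LoadFromConfig(builder.Configuration.GetSection("ReverseProxy"))
    .AddTransforms(context =>
    {
        // Strip the Expect header so the proxied request never waits for a
        // "100 Continue" from the destination.
        context.AddRequestHeaderRemove("Expect");
    })
    .ConfigureHttpClient((context, handler) =>
    {
        // Alternative: keep the header but wait far less than the 1 second default.
        handler.Expect100ContinueTimeout = TimeSpan.FromMilliseconds(200);
    });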

@4865783a5d
Author

4865783a5d commented Jan 30, 2025

My pleasure :)

Yep, I copied your sample and used the same middleware code to write the timings to App Insights.
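In essence it's a trimmed-down version of the sample's consumers, e.g. (a sketch showing only the response-header timings; the real middleware records all of the offsets above and writes them to App Insights):

using Yarp.Telemetry.Consumption;

// Registered with: builder.Services.AddTelemetryConsumer<ResponseTimingConsumer>();
public sealed class ResponseTimingConsumer : IHttpTelemetryConsumer
{
    // As in the metrics sample, an AsyncLocal is used to correlate events
    // belonging to the current proxied request.
    private static readonly AsyncLocal<DateTime> _responseHeadersStart = new();

    public void OnResponseHeadersStart(DateTime timestamp) =>
        _responseHeadersStart.Value = timestamp;

    public void OnResponseHeadersStop(DateTime timestamp)
    {
        var elapsed = timestamp - _responseHeadersStart.Value;
        // In our setup this goes to App Insights rather than the console.
        Console.WriteLine($"Response headers took {elapsed.TotalMilliseconds:F1} ms");
    }
}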

The backend currently doesn't have additional ASP.NET Core timings enabled, other than OTel.

Identity Server and YARP each run as a .NET 8 web API in Azure. Both services run on a P1v3 SKU scaled out to 2 instances (not very busy). I'll have a look at that header, but I assume it's supported on Azure App Service for Windows with .NET 8?

In production, requests come from a wide range of clients, which might indeed send that specific header. Our synthetic client doesn't; it's a C# console app using an HttpClient.
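That client is essentially just a loop like this (simplified sketch; the URL is a placeholder for our real STS host):

using var client = new HttpClient();

while (true)
{
    // Hit the discovery endpoint at a fixed interval to keep traffic flowing.
    using var response = await client.GetAsync(
        "https://x.azurewebsites.net/.well-known/openid-configuration");

    Console.WriteLine($"{DateTime.UtcNow:O} {(int)response.StatusCode}");
    await Task.Delay(TimeSpan.FromSeconds(1));
}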

Having a 99th percentile of 1.03 seconds over more than 2 million requests smells like something internal.

@4865783a5d
Author

@MihaZupan I'm closing this one as YARP does not seem to be the culprit. Thanks for your help!

@MihaZupan
Member

Were you able to track down where the latency was coming from?

@4865783a5d
Author

4865783a5d commented Feb 12, 2025

@MihaZupan Not quite. After enabling some more logging, we noticed that the latency increase happens just after the "Message: Proxying to https:" entry:

10:06:43.036 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 1.5 s, URL: https://x/.well-known/openid-configuration/jwks
10:06:43.036 | Trace | Severity level: Information, Message: Request starting HTTP/1.1 GET https://x.azurewebsites.net/.well-known/openid-configuration/jwks - - -
10:06:43.036 | Trace | Message: All hosts are allowed.
10:06:43.036 | Trace | Message: Adding HSTS header to response.
10:06:43.036 | Trace | Message: 1 candidate(s) found for the request path '/.well-known/openid-configuration/jwks'
10:06:43.036 | Trace | Message: Endpoint 'identity_server' with route pattern '/{**catchall}' is valid for the request path '/.well-known/openid-configuration/jwks'
10:06:43.036 | Trace | Message: Request matched endpoint 'identity_server'
10:06:43.036 | Trace | Message: The endpoint does not specify the IRequestSizeLimitMetadata.
10:06:43.036 | Trace | Severity level: Information, Message: Executing endpoint 'identity_server'
10:06:43.036 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:06:44.551 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 875.1 μs, URL: https://x/.well-known/openid-configuration/jwks
10:06:44.551 | Trace | Severity level: Information, Message: Request starting HTTP/1.1 GET https://x.azurewebsites.net/.well-known/openid-configuration/jwks - - -
10:06:44.551 | Trace | Message: All hosts are allowed.
10:06:44.551 | Trace | Message: Adding HSTS header to response.
10:06:44.551 | Trace | Message: No candidates found for the request path '/.well-known/openid-configuration/jwks'
10:06:44.551 | Trace | Message: Request did not match any endpoints
10:06:44.551 | Trace | Message: AuthenticationScheme: Identity.Application was not authenticated.
10:06:44.551 | Trace | Message: AuthenticationScheme: Identity.Application was not authenticated.
10:06:44.551 | Dependency | Name: IdentityServerProtocolRequest, Type: Other, Call status: true, Duration: 143.8 μs
10:06:44.551 | Dependency | Name: DiscoveryEndpoint, Type: Other, Call status: true, Duration: 1.3 μs
10:06:44.551 | Dependency | Name: WriteJson, Type: Other, Call status: true, Duration: 104.1 μs
10:06:44.551 | Trace | Severity level: Information, Message: Request finished HTTP/1.1 GET https://x/.well-known/openid-configuration/jwks - 200 - application/json;+charset=UTF-8 0.7978ms
10:06:44.556 | Trace | Severity level: Information, Message: Received HTTP/2.0 response 200.
10:06:44.556 | Trace | Severity level: Information, Message: Executed endpoint 'identity_server'
10:06:44.556 | Trace | Severity level: Information, Message: Request finished HTTP/1.1 GET https://x/.well-known/openid-configuration/jwks - 200 - application/json;+charset=utf-8 1520.1950ms

Some more samples

10:05:42.610 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:05:45.739 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 911.1 μs, URL: https://x/.well-known/openid-configuration/jwks
10:04:42.757 | Trace | Severity level: Information, Message: Proxying to https://weu-x.azurewebsites.net/.well-known/openid-configuration/jwks HTTP/2 RequestVersionOrLower
10:04:44.512 | Request | Name: GET /.well-known/openid-configuration/jwks, Successful request: true, Response time: 981.3 μs, URL: https://x/.well-known/openid-configuration/jwks

That log statement is emitted here, which is probably among the last few lines of code before the to-be-proxied request is sent downstream:

Image

We'll be involving Azure Support as well, since there seems to be an issue between the two App Services. We're also trying to reproduce the latency with a Linux App Service (currently Windows).
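(For reference, the "more logging" mentioned above amounts to raising the log levels for the proxy and hosting categories, roughly along these lines:)

// Surface the "Proxying to ..." and request lifecycle/routing messages in App Insights.
builder.Logging.AddFilter("Yarp", LogLevel.Debug);
builder.Logging.AddFilter("Microsoft.AspNetCore.Routing", LogLevel.Debug);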

4865783a5d reopened this on Feb 12, 2025
@MihaZupan
Member

Double-checking the tracing implementation, I misspoke before:

// Slow
"HttpResponseHeadersStartOffset": 0.2295,
"HttpResponseHeadersStopOffset": 1195.2714,

ResponseHeadersStart is logged after we've received the status line, but before we've received all the headers.
Seeing a large time delta between these is unusual; normally those would be sent as a single chunk, and the time delta would just be the couple of nanoseconds it took to parse out the headers.
I don't see why any HTTP server would do it differently.

Where this could change is if the server is sending back interim 1xx responses right away but the actual response takes a while, or if something delayed the headers portion of the traffic for some reason.
In your case, are YARP's connections going through some other server (e.g. custom proxies, firewalls, ...), or do you have some custom VPN / antivirus solution intercepting traffic?

Given that this is happening at P99, it should still be relatively common. Would you be able to capture a network trace that includes such slow requests?
Alternatively, it's also possible to capture trace-level info within .NET itself (note that this would also include PII).

With these timings, this is very unlikely to be an issue in YARP itself.
It could be in the underlying .NET networking stack, but I think it's most likely something specific to your setup.
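For the in-process .NET option, a minimal EventListener subscribed to the System.Net event sources would capture the handler-level events (a rough sketch; dotnet-trace with the same providers is the out-of-process equivalent):

using System.Diagnostics.Tracing;

// Keep an instance alive for the lifetime of the app, e.g. in Program.cs:
// var networkListener = new SystemNetEventListener();
public sealed class SystemNetEventListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        // Covers System.Net.Http, System.Net.Sockets, System.Net.Security, System.Net.NameResolution.
        if (eventSource.Name.StartsWith("System.Net", StringComparison.Ordinal))
        {
            EnableEvents(eventSource, EventLevel.LogAlways);
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        var payload = eventData.Payload is null ? "" : string.Join(", ", eventData.Payload);
        Console.WriteLine($"{eventData.TimeStamp:O} {eventData.EventSource.Name}/{eventData.EventName} {payload}");
    }
}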

@4865783a5d
Author

@MihaZupan Sorry for the late reply. A typical request flow looks like this in our infra (we're using Azure VWAN and Private Endpoints):

client --> Azure Front Door --Private Endpoint--> Azure App Service (YARP) --Private Endpoint--> Azure App Service (Cluster)

We do have Network Security Groups in place, but disabling them had no impact. I'll be running a few network traces in Azure, both over a direct route to the cluster and through YARP, to see whether there is anything significant in between (e.g. Azure networking).

@4865783a5d
Author

We ran network traces from the apps running YARP to the destination and will open a case with Azure Support, as the application layer (based on the logs) does not seem to be responsible for the latency.
