Http2: Fix remote server tests occasionally timing out in CI #39588
Conversation
Should hold off merge until CI infra issues are mitigated.
We can try increasing the timeout, but 100 seconds is already a very long time. If they continue to fail sporadically after this increase, there is almost certainly something more fundamental going wrong that needs to be investigated.
/azp run
Azure Pipelines successfully started running 4 pipeline(s).
I'm seeing this error in the Debian9 build:
Not related to SocketsHttpHandler, but thought I'd point it out. Cf. https://dev.azure.com/dnceng/public/_build/results?buildId=270647&view=ms.vss-test-web.build-test-results-tab&runId=7436544&resultId=100020&paneView=debug
/azp run
Azure Pipelines successfully started running 4 pipeline(s).
I agree it's worth understanding better. That said, at least in the case of PostAsync_RedirectWith307_LargePayload, this has been a longstanding issue even without HTTP2. See e.g. https://github.com/dotnet/corefx/issues/23768 and the discussion and linked issues there. In fact I saw this test fail once in CI for HTTP/1.1 on my PR here. For that test at least, I think we should just increase the timeout; we understand why it runs slowly.
For the other test (PostRewindableContentUsingAuth_NoPreAuthenticate_Success), I don't think we understand why it runs slowly, so that seems worth investigating. Looking at this again, I'm slightly worried there's a product issue here related to receiving an early 401 response, but I have no idea if that's actually happening or not.
@@ -133,6 +133,11 @@ protected HttpClient CreateHttpClientForRemoteServer(Configuration.Http.RemoteSe

var client = new HttpClient(wrappedHandler);
SetDefaultRequestVersion(client, remoteServer.HttpVersion);
if (timeout.HasValue)
Rather than pass this in and set it here, can we just have callers set it directly on the HttpClient? This is generally what we do for other settings on HttpClient.
Good point, that should simplify the changes significantly.
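For illustration, a minimal sketch of the suggested pattern: rather than threading a timeout parameter through the test helper, the caller sets `HttpClient.Timeout` directly. The plain `HttpClient` below stands in for the one returned by `CreateHttpClientForRemoteServer`, and the 150-second value is just an example:

```csharp
using System;
using System.Net.Http;

class TimeoutDemo
{
    static void Main()
    {
        // Plain HttpClient used here purely for illustration; in the tests
        // it would come from the CreateHttpClientForRemoteServer helper.
        using (var client = new HttpClient())
        {
            // The default request timeout is 100 seconds.
            Console.WriteLine(client.Timeout.TotalSeconds); // 100

            // Per the review suggestion, the caller raises it directly.
            client.Timeout = TimeSpan.FromSeconds(150);
            Console.WriteLine(client.Timeout.TotalSeconds); // 150
        }
    }
}
```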
@@ -2276,7 +2268,10 @@ public async Task PostAsync_Redirect_LargePayload_Helper(Configuration.Http.Remo

// Compute MD5 of request body data. This will be verified by the server when it receives the request.
content.Headers.ContentMD5 = TestHelper.ComputeMD5Hash(contentBytes);

using (HttpClient client = CreateHttpClientForRemoteServer(remoteServer))
// Issue #39545: test occasionally timing out in CI using the default timeout of 100 seconds.
Unless we're leaving the issue open, we shouldn't tag the issue like this. Otherwise it looks like it's an ActiveIssue. It's fine to link to the issue for more info, but we should comment something like:
"Because [reasons], test can take very long to run, particularly in CI. See github issue [link] for more information."
That said, there isn't really any additional info in the github issue other than "this test takes a long time" so I'm not sure it's actually worth linking. The comment itself is pretty self explanatory.
Same comment below in the other test.
(edited for clarity)
Thanks, will adjust wording.
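For context on the `ContentMD5` line in the diff above: `TestHelper.ComputeMD5Hash` is a repo-specific helper, but the same hash can be produced with the standard library. A self-contained sketch (the "hello" payload is purely illustrative):

```csharp
using System;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;

class Md5Demo
{
    static void Main()
    {
        // Illustrative payload; the real test uses a large generated body.
        byte[] contentBytes = Encoding.UTF8.GetBytes("hello");

        byte[] hash;
        using (MD5 md5 = MD5.Create())
        {
            hash = md5.ComputeHash(contentBytes);
        }

        var content = new ByteArrayContent(contentBytes);
        // The server recomputes the hash of the received body and verifies
        // it matches this header.
        content.Headers.ContentMD5 = hash;

        Console.WriteLine(Convert.ToBase64String(content.Headers.ContentMD5));
        // XUFAKrxLKna5cZ2REBfFkg==
    }
}
```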
I think we should merge the changes so that the test can run a few times in CI. That should give us more data on what could be happening.
@geoffkizer, can you clarify? I'm not seeing the explanation in the issue, though I'm probably just missing it. What about the test causes it to take > 100 seconds?
Tests are failing on Debian and Red Hat despite the increase in timeouts. The surfaced errors are CurlExceptions.
I should clarify; we know why this test is at least somewhat slow. Per linked issue:
This doesn't fully explain why it takes > 100 seconds. That said, it seems like there is a lot of general slowness that comes from (a) CI and (b) remote servers that, combined with the above, could reasonably make this test take > 100 seconds on occasion. It would be interesting to run the test in a loop locally (but still against the remote server) and see how long it takes, on average and worst case. If these times look reasonable, then I think we can conclude that it's reasonable to bump up the time to deal with CI slowness. If they don't, then we should probably investigate where the slowness is -- first suspect being remote servers.
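The loop measurement suggested above could be sketched as follows. Everything here is illustrative: `Task.Delay` stands in for one real end-to-end request against the remote server, and the iteration count is arbitrary:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class LoopTiming
{
    // Stand-in for one full test run against the remote server; the real
    // body would be the PostAsync_Redirect_LargePayload logic.
    static async Task RunOnceAsync() => await Task.Delay(10);

    static async Task Main()
    {
        const int Iterations = 20;
        double worstMs = 0, totalMs = 0;

        for (int i = 0; i < Iterations; i++)
        {
            var sw = Stopwatch.StartNew();
            await RunOnceAsync();
            sw.Stop();

            totalMs += sw.Elapsed.TotalMilliseconds;
            worstMs = Math.Max(worstMs, sw.Elapsed.TotalMilliseconds);
        }

        Console.WriteLine($"average: {totalMs / Iterations:F0} ms, worst: {worstMs:F0} ms");
    }
}
```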
I wonder what sort of bandwidth limitations the remote servers in Azure have. Given the way CI works -- i.e. launch the same tests in parallel for a bunch of different platforms -- it may well be that CI is inducing some bandwidth limits that lead to more frequent timeouts. @davidsh Do you know?
There are no bandwidth limitations that I'm aware of. And the Azure AppInsights logs don't show any particular errors lately.
I thought every Azure box came with some sort of max networking bandwidth. For example, there's some description of networking bandwidth for different VM sizes here: https://docs.microsoft.com/en-us/azure/virtual-machines/windows/sizes-general Do you know what sort of VMs we are using for these servers? If we were being limited by Azure VM bandwidth, would this show in logs?
What do you see for the 99th percentile? If we're occasionally hitting bandwidth limits due to CI slamming the server all at once, I'd expect to see some bad numbers in the 99th percentile.
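To make the percentile point concrete, here is a small nearest-rank percentile sketch (one of several percentile conventions). The latency data is fabricated purely to illustrate how a 1% tail of 100-second timeouts can hide behind a healthy-looking p99:

```csharp
using System;
using System.Globalization;
using System.Linq;

class PercentileDemo
{
    // Nearest-rank percentile: one common convention among several.
    static double Percentile(double[] samples, double p)
    {
        double[] sorted = samples.OrderBy(x => x).ToArray();
        int rank = (int)Math.Ceiling(p / 100.0 * sorted.Length);
        return sorted[Math.Max(rank - 1, 0)];
    }

    static void Main()
    {
        // Fabricated data: 99 fast requests (~1.3s) plus one 100-second timeout.
        double[] latencies = Enumerable.Repeat(1.3, 99).Append(100.0).ToArray();

        foreach (double p in new[] { 50.0, 99.0, 100.0 })
        {
            // p50 and p99 both look fine; only the max exposes the timeout.
            Console.WriteLine(Percentile(latencies, p).ToString(CultureInfo.InvariantCulture));
        }
    }
}
```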
I put it in a loop. In < 100 iterations for each of a PlatformHandler, SocketsHttpHandler 1.1, and SocketsHttpHandler 2.0 test, I got hangs, e.g. taking greater than 10 minutes. They're all stuck trying to write to the network, which suggests (beyond the fact that this timeout increase isn't going to help) that the connection is full because the server has simply stopped reading (or is running slowly enough to be equivalent). @davidsh, where's the code for the endpoint these talk to? Or maybe there is some kind of per-minute ingestion limit after which it simply stops receiving? Or something like that?
Actually, ignore my previous comment. I think I misinterpreted some data from xunit. Will follow up...
Been running it in a loop for around 20 minutes. It's oddly bimodal, with every request taking ~1.3 seconds or ~8.5 seconds (it starts out mostly the former with a few of the latter thrown in, and then over time switches to being mostly the latter with a few of the former thrown in). I've not seen anything more than 9 seconds.
These endpoints: deployed this source code:
Keep in mind that there is a lot of network infrastructure in front of our endpoints. Azure App Service runs a load-balanced front-end proxy, etc. So any "slowdown" or "congestion" could come from a variety of places in the path from client to server.
It would be interesting to run a bunch of instances in parallel. That's what CI is effectively doing.
I did yesterday, and it made little difference (the average was still the same, worst case increased by a couple of seconds; the highest I saw with 10 loops all banging on this was ~13 seconds). It's of course possible that so much load could be thrown at it that the worst case grew to 100 seconds, but I'm not sure why this test would be distinct in that regard. It's also possible I'm closer to the relevant data center, or my machine is different, or some such thing.
I just finished running
It basically seems that most requests have latencies within the p99 figure, and the rest just time out at 100 seconds.
What happens if you set HttpClient.Timeout to be infinite?
Let me give it a try.
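Disabling the client-side timeout, as suggested, is a one-liner via `Timeout.InfiniteTimeSpan`; a minimal sketch:

```csharp
using System;
using System.Net.Http;
using System.Threading;

class InfiniteTimeoutDemo
{
    static void Main()
    {
        using (var client = new HttpClient())
        {
            // Disable HttpClient's own 100-second cap entirely; any hang
            // will then run until the request is cancelled some other way
            // (e.g. via a CancellationToken).
            client.Timeout = Timeout.InfiniteTimeSpan;

            Console.WriteLine(client.Timeout == Timeout.InfiniteTimeSpan); // True
        }
    }
}
```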
It yielded no responses after 20 minutes. I will see if I can reproduce with different handlers and operating systems.
Here's what I got so far for
EDIT: for clarification, this is running 10K requests throttled to 50 at a time. Every request is effectively a single test run and sets up the client context from scratch.
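A throttled load loop like the one described (N requests, at most 50 in flight at once) can be sketched with `SemaphoreSlim`. Here `Task.Delay` stands in for a full test run, and the request count is scaled down from 10K for the sketch:

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class ThrottledLoad
{
    static async Task Main()
    {
        const int TotalRequests = 200;   // scaled down from 10K for the sketch
        const int MaxConcurrency = 50;
        int completed = 0;
        var gate = new SemaphoreSlim(MaxConcurrency);

        Task[] tasks = Enumerable.Range(0, TotalRequests).Select(async i =>
        {
            await gate.WaitAsync();       // block when 50 requests are in flight
            try
            {
                // Stand-in for one full test run (fresh client, request, dispose).
                await Task.Delay(5);
                Interlocked.Increment(ref completed);
            }
            finally
            {
                gate.Release();
            }
        }).ToArray();

        await Task.WhenAll(tasks);
        Console.WriteLine(completed);
    }
}
```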
Increases the HttpClient timeout from the default 100 seconds to 150 seconds, in order to address functional tests occasionally timing out in CI. Fixes #39545.