System.Net.Http.Functional - HTTP/3 timeouts (mostly on NativeAOT) #75493
Tagging subscribers to this area: @dotnet/ncl
Issue Details
Frequency in last 30 days via Runfo as of 9/12:
Examples of failures
Console log - Helix timeout
@LakshanF @MichalStrehovsky this looks specific to NativeAOT, can you please take a look?
#75471 was opened for that.
The failures listed in #75471 are exactly the same, so this does not appear to be NativeAOT-specific. The frequency just comes and goes due to unrelated changes in timing. All failing tests seem to be HTTP/3 tests, which means Quic is the suspect.
This correlates with #74749 and #75163. The first attempt was merged on 9/1 and reverted on 9/2, which aligns with the first occurrence. The second attempt was merged on 9/9, which aligns with when the failures started hitting significantly. Take this with a grain of salt: the Quic unloading change might have just changed the timing such that the problem is hit more often.
Note that we never ran Quic tests on arm32 before. While the observed failure may be the same, the root cause may be different. I think #75471 should be investigated separately, and closed if we get a successful run.
I did another test run on arm32 with #75441 merged.
This issue seems to happen reliably on arm32 platforms: the connection establishment attempt times out.
Further digging points at a potential MsQuic issue on arm32. MsQuic sends only the first datagram, and the rest of the sends fail with EINVAL (MsQuic trace).
Submitted a fix for the above (arm32 non-NativeAOT timeouts) to MsQuic: microsoft/msquic#3065. I will look at the Arm64 failures next.
Not all timeouts occur in HTTP/3 tests. For example, in https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-75381-merge-abd43458a4a042cf95/System.Net.Http.Functional.Tests/1/console.d67787ff.log?helixlogtype=result the first failure occurred before we even got to send an HTTP/3 request (it failed during the HTTP/2 request that fetches the Alt-Svc header). Another failure, in an HTTP/2-only test: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-540330e989f54fc08b/System.Net.Http.Functional.Tests/1/console.95517a16.log?helixlogtype=result So far I have not been able to reproduce this locally by running the tests in a tight loop :/
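For context on why a failure in the HTTP/2 request blocks HTTP/3 entirely: a client typically learns that a server speaks HTTP/3 from the Alt-Svc response header of an earlier request, so if that request times out there is nothing to upgrade to. The following is a minimal Python sketch of reading such a header. It is not the .NET implementation; the function names and the sample header value are made up for illustration, and the parsing is simplified (it ignores parameters such as `ma` and does not handle commas inside quoted strings).

```python
def parse_alt_svc(value):
    """Parse an Alt-Svc header value into (protocol, host, port) tuples.

    Simplified sketch: only the protocol="host:port" part of each entry
    is read; parameters after ';' (ma=..., persist=...) are ignored.
    """
    if value.strip() == "clear":
        # "clear" means the server withdraws all advertised alternatives.
        return []
    services = []
    for entry in value.split(","):
        alternative = entry.split(";")[0].strip()
        protocol, _, authority = alternative.partition("=")
        authority = authority.strip('"')
        # An empty host (":443") means "same host as the origin".
        host, _, port = authority.rpartition(":")
        services.append((protocol, host, int(port)))
    return services


def advertises_http3(value):
    """Return True if any advertised alternative uses the 'h3' protocol id."""
    return any(protocol == "h3" for protocol, _, _ in parse_alt_svc(value))


# Hypothetical header value, not taken from the failing test logs.
sample = 'h3=":443"; ma=86400, h2="alt.example.com:8443"'
print(parse_alt_svc(sample))
print(advertises_http3(sample))
```

If the first request never completes, `advertises_http3` is never evaluated, which matches the observation that the failure happened before any HTTP/3 request was sent.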
The last occurrence of the issue on main on NativeAOT is 09-12, according to the following Kusto query:
let timeouts = (includePR : bool) {
cluster('engsrvprod.kusto.windows.net').database('engineeringdata').WorkItems
| where Status == "Timeout"
| where FriendlyName startswith "System.Net.Http.Functional"
| where Finished > now(-21d)
| distinct JobId, WorkItemId, Name, FriendlyName, ConsoleUri
| join kind=inner (cluster('engsrvprod.kusto.windows.net').database('engineeringdata').Jobs
| where (Branch == 'refs/heads/main') or (Branch == 'refs/heads/master') or (includePR and (Source startswith "pr/"))
| where Type startswith "test/functional/cli/"
and not(Properties contains "runtime-staging")
| summarize arg_max(Finished, Properties, Type, Branch, Source, Started, QueueName) by JobId
| project-rename JobType = Type) on JobId
| order by Finished desc
| where extract_json("$.['System.PhaseName']", Properties) contains "NativeAOT"
| extend DefinitionName = extractjson("$.['DefinitionName']", Properties)
| extend OS = replace_regex(extractjson("$.['operatingSystem']", Properties), @'\((.*)\).*|([^\(].*)', @'\1\2')
| extend TargetBranch = extractjson("$.['System.PullRequest.TargetBranch']", Properties)
//| project-keep Finished, FriendlyName, ConsoleUri, TargetBranch, OS, DefinitionName, Branch
};
timeouts(true);
There is one OSX timeout on 09-16, but that seems to be unrelated (no other NativeAOT timeouts in the last 3 weeks). This correlates with the merging of #75441 on 09-13. I think we can close this issue for now, and track the rest of the timeouts in #75611.
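The `replace_regex` step in the query above is the least obvious part, so here is a rough Python sketch of what that pattern does: if the value starts with a parenthesised name, keep only what is inside the parentheses; otherwise keep the value unchanged. The sample inputs are hypothetical and only illustrate the two branches. Python's `re` module is assumed to behave like Kusto's regex engine for this particular pattern, which has not been verified against real job data.

```python
import re

# Same pattern and replacement as in the Kusto query's replace_regex call.
OS_PATTERN = re.compile(r"\((.*)\).*|([^\(].*)")


def normalize_os(raw):
    """Reduce an operatingSystem property to a short OS name.

    Branch 1: "(Name)rest" -> "Name"   (group 1 captures the parenthesised part)
    Branch 2: "Name"       -> "Name"   (group 2 captures the whole value)
    The group that did not take part in the match expands to an empty string.
    """
    return OS_PATTERN.sub(r"\1\2", raw)


# Hypothetical values, not copied from the engineering database.
print(normalize_os("(Alpine.314.Amd64.Open)ubuntu.1804.amd64.open"))
print(normalize_os("OSX.1200.Amd64.Open"))
```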