.NET 9 Arm32 tests are randomly failing from PowerShell hang #104106

Closed
mthalman opened this issue Jun 21, 2024 · 21 comments

@mthalman
Member

mthalman commented Jun 21, 2024

There have been random test jobs failing with one of two issues: a timeout or an HttpRequestException. I've only seen this happening in the Noble and Bookworm jobs, not Alpine.

This first popped up with this PR: dotnet/dotnet-docker#5587. But I can't imagine it was the cause. The most recent .NET change prior to that was this: dotnet/dotnet-docker#5584.

I've only seen this in public builds, but there haven't been many internal builds yet to determine whether it's limited to public builds.

Example timeout build

Note

The rest of the information below is not relevant to this particular issue. It's kept here because the conversation below references it, but this error is not pervasive like the timeout issue is.

Example HttpRequestException build

Microsoft.DotNet.Docker.Tests.SdkImageTests.VerifyBlazorWasmScenario

Executing: docker run --name 9.0-app-run-133634530996233443 -d -t -p 8080 9.0-publish_fx_dependent-133634530130912981 dotnet run --urls http://*:8080
 80fef330b2b77c6817b3479302abfbcbcfdf3eff5d94d6ff58798d5ed3ff452a
 Execution Elapsed Time: 00:00:01.4260173
 Executing: docker inspect -f "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}" 9.0-app-run-133634530996233443
 172.17.0.3
 Execution Elapsed Time: 00:00:00.0211357
 Request to http://172.17.0.3:8080/ failed - retrying: System.Net.Http.HttpRequestException: Connection refused (172.17.0.3:8080)
  ---> System.Net.Sockets.SocketException (111): Connection refused
    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ConnectAsync(Socket socket)
    at System.Net.Sockets.Socket.ConnectAsync(EndPoint remoteEP, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecuteFromThreadPool(Thread threadPoolThread)
    at System.Threading.ThreadPoolWorkQueue.Dispatch()
    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
 --- End of stack trace from previous location ---
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    --- End of inner exception stack trace ---
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
    at Microsoft.DotNet.Docker.Tests.WebScenario.GetHttpResponseFromContainerAsync(String containerName, DockerHelper dockerHelper, ITestOutputHelper outputHelper, Int32 containerPort, String pathAndQuery, Action`1 validateCallback, AuthenticationHeaderValue authorizationHeader) in /repo/dotnet-docker/tests/Microsoft.DotNet.Docker.Tests/TestScenarios/WebScenario.cs:line 96
dotnet-issue-labeler bot added the area-Infrastructure and untriaged (New issue has not been triaged by the area owner) labels Jun 21, 2024
@mthalman
Member Author

Confirmed that I can still repro with this change reverted: dotnet/dotnet-docker#5587.

@mthalman
Member Author

I've also tried reverting dotnet/dotnet-docker#5584 and the issue still repros.

@richlander
Member

@wfurt @sbomer

@wfurt
Member

wfurt commented Jun 24, 2024

Could this be coincident with some other infrastructure changes? The connect is handled by the kernel, so it seems like a network infrastructure problem to me.
Is this IP related to Docker in any way? Connection refused (172.17.0.3:8080) looks suspicious to me.

@mthalman
Member Author

The IP address in question is the IP address of the aspnet container being tested. In other words, the app container isn't responding.

@mthalman
Member Author

I've attempted to isolate which set of tests was causing this: runtime, aspnet, sdk, etc. I ran jobs that tested each of those sets separately... and they all passed. 🤷

@wfurt
Member

wfurt commented Jun 24, 2024

It could be timing as well, e.g. the server is still starting on a slow platform. I don't know if there is any synchronization for that.

@mthalman
Member Author

It attempts to get a response from the container, retrying 5 times with a 2-second delay between retries. This test code hasn't changed in years. If it were a machine issue, I would expect to see it in other .NET versions and in the arm64 jobs (since arm32 and arm64 use the same machines). But it's very specific to 9.0 arm32.
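For context, that retry behavior amounts to a loop like the following. This is a simplified sketch rather than the actual test code; the class name, method name, and URL here are illustrative only:

    using System;
    using System.Net.Http;
    using System.Threading.Tasks;

    class RetrySketch
    {
        // Simplified illustration of the test's retry behavior: a handful of
        // attempts (5 in the test) with a 2 second delay between them,
        // retrying on connection failures while the container may still be starting.
        static async Task<string> GetWithRetriesAsync(HttpClient client, string url)
        {
            const int maxAttempts = 5;
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    return await client.GetStringAsync(url);
                }
                catch (HttpRequestException) when (attempt < maxAttempts)
                {
                    // The app container may not be listening yet; wait and retry.
                    await Task.Delay(TimeSpan.FromSeconds(2));
                }
            }
        }

        static async Task Main()
        {
            using var client = new HttpClient();
            // Hypothetical container address; the real test reads it from docker inspect.
            Console.WriteLine(await GetWithRetriesAsync(client, "http://172.17.0.3:8080/"));
        }
    }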

@mthalman
Member Author

Confirming this same behavior on internal builds. Example build (internal link)

@mthalman
Member Author

I think the BlazorWasmScenario test failure that I originally posted about isn't really the issue here. I can't remember how many times I saw it occur (maybe it was only once). But the real issue is the hang that occurs while running the tests, causing a timeout. That is very prevalent; probably more than 50% of the jobs fail with a timeout.

I added some logging to the tests and was able to identify a test that hangs in two separate jobs: SdkImageTests.VerifyPowerShellScenario_NonDefaultUser. In both cases, that is the first attempt to run PowerShell in the job. Example build

There hasn't been a new drop of PowerShell for 9.0 in quite a while, though. This is the last one: dotnet/dotnet-docker#5506. We would have seen this earlier if it were solely a PowerShell thing. My only guess is that it's related to the interaction between PowerShell and a new drop of .NET 9. @adaggarwal - are you aware of any behavior like this? To recap, it seems that PowerShell sporadically hangs when running in an Arm32 Debian/Ubuntu container environment.

@mthalman
Member Author

I just realized why this only occurs for Debian and Ubuntu and not Alpine or Azure Linux. For Alpine, we don't have PowerShell installed, since PowerShell doesn't release binaries for Arm32 linux-musl (which is further evidence that the issue is related to PowerShell). And for Azure Linux, we don't have any Arm32 images at all.

mthalman changed the title from ".NET 9 Arm32 tests are randomly failing" to ".NET 9 Arm32 tests are randomly failing from PowerShell hang" Jun 26, 2024
@mthalman
Member Author

mthalman commented Jun 27, 2024

Repro Steps

  1. Get a Linux Arm64/Arm32 VM. I created one with Azure DevTest Lab which did the trick. (I could not get this to repro with Apple Silicon. It would run into a QEMU error and crash. That behavior was common across all .NET versions and not specific to this issue.)
  2. Install Docker if it's not already installed.
  3. sudo docker pull mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
  4. sudo docker run --rm -it mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
  5. You're now in the shell of the container.
  6. pwsh (You've hit the repro if this hangs. If it doesn't hang and gets to the PowerShell prompt, run exit to close PowerShell and then run pwsh again. Continue until it hangs.)

@mthalman
Member Author

I attempted to collect a dump using dotnet-dump but the collect command would just hang. Instead, I used the createdump tool and collected both a minidump and full dump. The minidump is included here: coredump-mini.zip. The full dump is too large to attach so I can provide that offline if necessary.

mthalman transferred this issue from dotnet/dotnet-docker Jun 27, 2024
mthalman moved this from Backlog to Tracking in .NET Docker Jun 27, 2024
Contributor

Tagging subscribers to this area: @vitek-karas, @agocke, @VSadov
See info in area-owners.md if you want to be subscribed.

@mthalman
Member Author

I've moved this to the runtime repo to investigate further.

Contributor

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.

@elinor-fung
Member

Looks like the main thread is in System.RuntimeType+BoxCache.GetBoxInfo. I'm not set up to grab the native side right now, but it is a QCall into the runtime added in #101137. @jkoritzinsky, would you be able to take a look?

> clrstack
OS Thread Id: 0x9 (0)
Child SP       IP Call Site
FFE120F4 ffe12080 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120F4 e310f340 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120D8 E310F340 System.RuntimeType+BoxCache..ctor(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 54]
FFE12158 E310F274 System.RuntimeType+BoxCache.Create(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 17]
FFE12168 F64912A8 System.RuntimeType+IGenericCacheEntry`1[[System.__Canon, System.Private.CoreLib]].CreateAndCache(System.RuntimeType)
FFE121D8 EFFC37F6 System.Runtime.CompilerServices.RuntimeHelpers.Box(Byte ByRef, System.RuntimeTypeHandle) [/_/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.CoreCLR.cs @ 425]
FFE121F0 EFE0102C System.Enum.InternalBoxEnum(System.RuntimeTypeHandle, Int64)
FFE12208 EFDFE8E2 System.Enum.TryParse(System.Type, System.ReadOnlySpan`1<Char>, Boolean, Boolean, System.Object ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 772]
FFE12240 EFDFE71A System.Enum.Parse(System.Type, System.String, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 585]
FFE12260 EFDFE692 System.Enum.Parse(System.Type, System.String) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 551]
FFE12268 E396D4F6 System.Management.Automation.LanguagePrimitives.ConvertStringToEnum(System.Object, System.Type, Boolean, System.Management.Automation.PSObject, System.IFormatProvider, System.Management.Automation.Runspaces.TypeTable)
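For reference, the managed entry point in that stack is the non-generic Enum.Parse(Type, String) overload, which boxes the parsed value on return. A minimal sketch that drives the same boxing path is below; the enum type is hypothetical, and this won't reliably reproduce the hang, it just exercises the same first-use box-cache initialization:

    using System;

    // Hypothetical enum standing in for whatever type PowerShell is converting;
    // the hang is in the runtime's box cache, not in any particular enum.
    enum SampleOption { None, Verbose, Quiet }

    class BoxCacheSketch
    {
        static void Main()
        {
            // The non-generic Enum.Parse overload returns object, so the parsed
            // value is boxed via RuntimeHelpers.Box. On first use for a given
            // enum type this populates System.RuntimeType+BoxCache, which is the
            // GetBoxInfo QCall at the top of the stack above.
            object value = Enum.Parse(typeof(SampleOption), "Verbose");
            Console.WriteLine(value);
        }
    }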

@jkoritzinsky
Member

I believe I actually just fixed this with #104126. @mthalman, can you test it out with a nightly build, or do you need to wait until we have a preview with that fix?

jkoritzinsky self-assigned this Jul 2, 2024
@mthalman
Member Author

mthalman commented Jul 2, 2024

I ran some tests: dotnet/dotnet-docker#5643. This succeeded 3 times in a row. Normally that would for sure have run into this issue.

@jkoritzinsky
Member

Awesome! Would you prefer I keep this open until you can consume the fix or close it as fixed?

@mthalman
Member Author

mthalman commented Jul 2, 2024

I'm fine with closing it.

mthalman closed this as completed Jul 2, 2024
github-project-automation bot moved this from Tracking to Done in .NET Docker Jul 2, 2024
dotnet-policy-service bot removed the untriaged (New issue has not been triaged by the area owner) label Jul 2, 2024
github-actions bot locked and limited conversation to collaborators Aug 2, 2024