win-arm64 NativeAOT CI job frequently times out in PRs #70549
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label. |
The problem is that the workitems get stuck in the queue for too long. There is a pattern: a couple of workitems are queued and finish quickly (the tests complete fast because we only run them, we don't compile them), and then a workitem gets stuck in the queue for more than an hour (#70233 (comment)). I'll increase the timeout, but I wonder if anyone in @dotnet/runtime-infrastructure knows a way to find out why the workitem sits in the queue for so long when we obviously have Windows ARM64 machines available around that time and then suddenly don't. |
cc @dotnet/dnceng and @MattGal for @MichalStrehovsky's question. Please also see the data that he shared above: #70233 (comment). |
Hello. There are 31 OnPrem machines associated with |
Two things to add here:
|
Why is this no longer marked as blocking? System.Runtime.Tests.WorkItemExecution [Details] [Artifacts] [3.95% failure rate] |
It was supposed to be made non-blocking by #70551 which raised the timeout. Also this issue was specifically about win-arm64 and the failures you are linking seem to be linux-arm64, but perhaps that queue has similar issues. |
Same pattern. Here are the details about the workitems:
6 workitems started within one second and finished in 4-6 minutes each. The seventh workitem got deadlettered after 2.5 hours. @dotnet/runtime-infrastructure It looks like we define the helix queue to use here: runtime/eng/pipelines/libraries/helix-queues-setup.yml Lines 172 to 174 in a0b426d
Any objections to changing that to the windows.10.arm64v8.open queue, which has about 4 times as many machines (127 vs 31) per the above? |
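For anyone who wants to spot this pattern in their own run data, here is a minimal sketch that flags workitems taking far longer than the typical one. The `WorkItem` type, the `flag_stuck` helper, the workitem names, and the `factor` threshold are all hypothetical; the durations only mirror the numbers quoted above (six workitems at 4-6 minutes, one at roughly 2.5 hours) and do not come from a real Helix API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class WorkItem:
    name: str       # hypothetical label, not a real Helix workitem name
    minutes: float  # time from submission to completion or deadletter


def flag_stuck(items, factor=5.0):
    """Return names of workitems that took more than `factor` times the median."""
    typical = median(item.minutes for item in items)
    return [item.name for item in items if item.minutes > factor * typical]


# Illustrative durations: six quick workitems and one that took 150 minutes.
durations = [4, 5, 5, 6, 4, 6, 150]
items = [WorkItem(f"workitem-{n}", m) for n, m in enumerate(durations, start=1)]

print(flag_stuck(items))
```

Using the median rather than the mean keeps the single slow workitem from inflating the baseline it is compared against.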
The Linux issue that Andy saw is different: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70264-merge-d0da0d0858e74cb4b0/System.Runtime.Tests/1/console.006f8aba.log?helixlogtype=result The workitem itself hangs. @LakshanF has been disabling tests that hang like this on Linux. The current theory is that the hang is due to #67805 that is being worked on. We're trying to strike a balance between "this test is hanging too often" and "we don't have any test coverage for ARM64 Linux". I've not seen System.Runtime tests hang often but I can be proven wrong on that. |
@MichalStrehovsky This is probably worth the effort of getting a repro machine from @ilyas1974 and trying it out. I checked 3/3 of these machines: they're deadlettering because, after the timeout occurs at 45 minutes and we kill the running processes, the machine reboots before sending the final events (hence the retries). This makes it seem like a very investigate-able problem. |
Trying to get a local repro sounds like a good idea. At minimum we could get a dump and see what's running |
#67805 affects Windows ARM64 in the same way as Linux. The only suspension mechanism that is enabled on Win ARM64 is via polling. That only happens when calling some runtime helpers or allocating. Incomplete suspension implementation can cause pauses/hangs. Not sure if it is relevant to this issue as I can't tell what is timing out here - tests themselves or the code that runs them. |
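To illustrate why polling-only suspension can look like a hang, here is a toy model in Python. It is not how the NativeAOT runtime is implemented; `SuspensionPoint`, `worker`, and `try_suspend` are hypothetical names, and the timeouts are arbitrary. A worker that reaches a poll point parks promptly, while a worker spinning in code with no poll points never acknowledges the request.

```python
import threading
import time


class SuspensionPoint:
    """Toy stand-in for a suspension request (hypothetical, not a runtime API)."""

    def __init__(self):
        self.requested = threading.Event()
        self.parked = threading.Event()
        self.resumed = threading.Event()

    def poll(self):
        # Called by the worker only at "safe" places, e.g. in a helper or allocation.
        if self.requested.is_set():
            self.parked.set()      # acknowledge the suspension request
            self.resumed.wait()    # stay parked until resumed


def worker(point, polls, run_seconds):
    deadline = time.monotonic() + run_seconds
    while time.monotonic() < deadline:
        if polls:
            point.poll()           # a loop without this call never parks


def try_suspend(polls, wait_seconds=0.5, run_seconds=1.0):
    """Return True if the worker parked within `wait_seconds` of the request."""
    point = SuspensionPoint()
    thread = threading.Thread(target=worker, args=(point, polls, run_seconds), daemon=True)
    thread.start()
    point.requested.set()
    parked = point.parked.wait(timeout=wait_seconds)
    point.resumed.set()            # let a parked worker continue
    thread.join()
    return parked


print("polling worker parked:", try_suspend(polls=True))
print("non-polling worker parked:", try_suspend(polls=False))
```

In this model the non-polling worker eventually exits on its own deadline; in a real process the requester would keep waiting, which is the kind of pause or hang described above.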
#70740 enables return address hijacking on Windows ARM64. If it works, and it looks like it does since tests are passing, it should alleviate the pausing/hanging problem. |
#70769 enables redirection (asynchronous suspension) on Windows ARM64. If that works, win-arm64 NativeAOT will be at functional parity with win-x64 and with its CoreCLR counterparts in terms of GC suspension. |
Both ARM64 PRs for the missing suspension mechanisms have been merged. If there are other issues with tests pausing or hanging on Windows ARM64, they are likely different issues not related to suspension. |
Since this was very likely caused by win-arm64 suspension issues, which should be fixed now, I think we should close this, and start opening more granular bugs. |
Seemingly another instance: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70809-merge-038c72672ded403397/System.Runtime.Tests/1/console.87b58c26.log?helixlogtype=result |
The win-arm64 NativeAOT CI job is very frequently timing out on PRs. For example:
https://dev.azure.com/dnceng/public/_build/results?buildId=1815538&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1816221&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807960&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807065&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807071&view=results
The win-arm64 machines are not very powerful, so 2 hours is probably not enough.
Another thing I noticed is that these jobs only seem to run on PRs and do not seem to be running at all in the rolling runs.
cc @dotnet/ilc-contrib