win-arm64 NativeAOT CI job frequently times out in PRs #70549

Closed

jakobbotsch opened this issue Jun 10, 2022 · 18 comments
Labels
area-NativeAOT-coreclr, blocking-clean-ci

@jakobbotsch
Member

The win-arm64 NativeAOT CI job is very frequently timing out on PRs. For example:
https://dev.azure.com/dnceng/public/_build/results?buildId=1815538&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1816221&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807960&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807065&view=results
https://dev.azure.com/dnceng/public/_build/results?buildId=1807071&view=results

The win-arm64 machines are not very powerful, so 2 hours is probably not enough.

Another thing I noticed is that these jobs only seem to run on PRs and do not seem to be running at all in the rolling runs.

cc @dotnet/ilc-contrib

@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@ghost added the untriaged label Jun 10, 2022
@jakobbotsch added the area-NativeAOT-coreclr and blocking-clean-ci labels and removed the untriaged label Jun 10, 2022
@MichalStrehovsky
Member

The win-arm64 machines are not very powerful, so 2 hours is probably not enough.

The problem is that the workitems get stuck in the queue for too long. There is a pattern: a couple of the workitems get queued up and finish quickly (the tests finish very quickly because we just run them, we don't compile them), and then one workitem gets stuck in the queue for more than an hour (#70233 (comment)).

I'll increase the timeout, but I wonder if anyone in @dotnet/runtime-infrastructure knows a way to find out why a workitem sits in the queue for so long when we obviously have Windows ARM64 machines available around that time and then suddenly don't.

@ViktorHofer
Member

cc @dotnet/dnceng and @MattGal for @MichalStrehovsky's question. Please also see the data that he shared above: #70233 (comment).

@andriipatsula
Member

Hello. There are 31 OnPrem machines associated with the Windows.10.Arm64.Open queue. In the Kusto DB I see that 6 out of 8 tests were executed and the 7th was killed due to timeout. It looks like there are not enough machines to cover the needs, because the 7th started at 2022-06-09 23:23:58.6310000 and was killed at 2022-06-09 23:28:00.0140000.
I think increasing the timeout from 2 to 3 hours could help, but not in 100% of cases.

@jakobbotsch removed the blocking-clean-ci label Jun 10, 2022
@MattGal
Member

MattGal commented Jun 10, 2022

cc @dotnet/dnceng and @MattGal for @MichalStrehovsky's question. Please also see the data that he shared above: #70233 (comment).

Two things to add here:

  • windows.10.arm64v8.open has 127 machines and might be able to absorb this capacity a bit better.
  • I'd just note that the default developer instinct when a test run times out is to instantly resend the same job via any spammable retry button, so increasing your timeout when the run would otherwise have succeeded is almost always the right choice; you're waiting behind your own repo's runs a majority of the time.

@AndyAyersMS
Member

Why is this no longer marked as blocking?

System.Runtime.Tests.WorkItemExecution [Details] [Artifacts] [3.95% failure rate]

@jakobbotsch
Member Author

It was supposed to be made non-blocking by #70551, which raised the timeout. Also, this issue was specifically about win-arm64 and the failures you are linking seem to be linux-arm64, but perhaps that queue has similar issues.

@jakobbotsch added the blocking-clean-ci label Jun 12, 2022
@MichalStrehovsky
Member

Timed out again in https://dev.azure.com/dnceng/public/_build/results?buildId=1820303&view=results

Same pattern. Here are the details about the workitems:

"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:07.064+00:00","Finished":"2022-06-12T16:48:50.158+00:00","Delay":"00:00:00.9530000","Duration":"00:06:43.0940000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:06.986+00:00","Finished":"2022-06-12T16:46:48.342+00:00","Delay":"00:00:00.8750000","Duration":"00:04:41.3560000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T19:12:21.96+00:00","Finished":"2022-06-12T19:12:21.991+00:00","Delay":"02:30:15.8490000","Duration":"00:00:00.0310000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:07.174+00:00","Finished":"2022-06-12T16:48:17.431+00:00","Delay":"00:00:01.0630000","Duration":"00:06:10.2570000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:06.986+00:00","Finished":"2022-06-12T16:46:24.32+00:00","Delay":"00:00:00.8750000","Duration":"00:04:17.3340000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:06.955+00:00","Finished":"2022-06-12T16:50:16.332+00:00","Delay":"00:00:00.8440000","Duration":"00:08:09.3770000"
"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:06.971+00:00","Finished":"2022-06-12T16:46:43.888+00:00","Delay":"00:00:00.8600000","Duration":"00:04:36.9170000"

6 workitems started within one second and finished in 4-6 minutes each. The seventh workitem got deadlettered after 2.5 hours.
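As a side note, here is a minimal Python sketch (not part of any CI tooling; the sample records are trimmed to the three timestamp fields) showing how the delay and duration figures fall out of the records above:

```python
import json
from datetime import datetime

# Two of the records above, trimmed to the three timestamp fields.
# The remaining records parse the same way.
records = [
    '"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T16:42:07.064+00:00","Finished":"2022-06-12T16:48:50.158+00:00"',
    '"Queued":"2022-06-12T16:42:06.111+00:00","Started":"2022-06-12T19:12:21.96+00:00","Finished":"2022-06-12T19:12:21.991+00:00"',
]

def parse(ts):
    # Timestamps look like 2022-06-12T16:42:06.111+00:00; %f tolerates 1-6 fractional digits.
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")

for line in records:
    item = json.loads("{" + line + "}")  # each record is a brace-less JSON fragment
    delay = parse(item["Started"]) - parse(item["Queued"])
    duration = parse(item["Finished"]) - parse(item["Started"])
    print(f"delay={delay}  duration={duration}")

# First record:  delay of about 1 second, duration of about 6.7 minutes.
# Second record: delay of about 2.5 hours before a 31 ms "run" -- the deadlettered workitem.
```

The recomputed values match the Delay/Duration fields embedded in the records, so the 2.5-hour gap is time spent in the queue, not run time.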

@dotnet/runtime-infrastructure It looks like we define the helix queue to use here:

# windows arm64
- ${{ if eq(parameters.platform, 'windows_arm64') }}:
  - Windows.10.Arm64.Open

Any objections to changing that to the windows.10.arm64v8.open queue, which has about 4 times more machines (127 vs. 31) per the above?

@MichalStrehovsky
Member

The Linux issue that Andy saw is different: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70264-merge-d0da0d0858e74cb4b0/System.Runtime.Tests/1/console.006f8aba.log?helixlogtype=result

The workitem itself hangs. @LakshanF has been disabling tests that hang like this on Linux. The current theory is that the hang is due to #67805, which is being worked on. We're trying to strike a balance between "this test is hanging too often" and "we don't have any test coverage for ARM64 Linux". I've not seen the System.Runtime tests hang often, but I can be proven wrong on that.

@MattGal
Member

MattGal commented Jun 13, 2022

6 workitems started within one second and finished in 4-6 minutes each. The seventh workitem got deadlettered after 2.5 hours.

@MichalStrehovsky This is probably worth the effort of getting a repro machine from @ilyas1974 and trying it out... checking 3/3 of these machines, they're deadlettering because, after the timeout occurs at 45 minutes and we kill the running processes, the machine reboots before sending the final events (hence the retries). This makes it seem like a very investigable problem.

@agocke
Member

agocke commented Jun 13, 2022

Trying to get a local repro sounds like a good idea. At a minimum we could get a dump and see what's running.

@VSadov
Member

VSadov commented Jun 14, 2022

#67805 affects Windows ARM64 in the same way as Linux.

The only suspension mechanism enabled on Win ARM64 is via polling, and polling only happens when calling certain runtime helpers or allocating. An incomplete suspension implementation can cause pauses/hangs.

Not sure if it is relevant to this issue, as I can't tell what is timing out here: the tests themselves or the code that runs them.
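To make the mechanism above concrete, here is a minimal, language-agnostic sketch of polling-based (cooperative) suspension, written in Python with hypothetical names; it is not the NativeAOT implementation, only an illustration of why a thread that never reaches a poll point (no helper calls, no allocations) can stall the whole suspension:

```python
import threading

# Sketch of cooperative suspension: threads only notice a pending suspension
# when they reach an explicit poll point. All names here are hypothetical.
suspend_requested = threading.Event()     # the "GC" wants the world stopped
resume = threading.Event()                # the pause is over
reached_safepoint = threading.Semaphore(0)
stop = threading.Event()                  # shuts the demo workers down

def gc_poll():
    # Called at safepoints (runtime helper calls, allocations). A thread that
    # never gets here -- e.g. a tight loop with no helper calls -- blocks
    # suspend_all() forever, which looks like a hang from the outside.
    if suspend_requested.is_set():
        reached_safepoint.release()       # report "I'm parked at a safe point"
        resume.wait()

def worker():
    while not stop.is_set():
        # ... user code would run here ...
        gc_poll()

def suspend_all(n_threads):
    resume.clear()
    suspend_requested.set()
    for _ in range(n_threads):
        reached_safepoint.acquire()       # wait until every thread has polled and parked
    # ... GC work would happen here, with every thread stopped ...
    suspend_requested.clear()
    resume.set()

workers = [threading.Thread(target=worker) for _ in range(4)]
for t in workers:
    t.start()
suspend_all(len(workers))
print("all threads reached a poll point; the world was stopped and restarted")
stop.set()
for t in workers:
    t.join()
```

Return-address hijacking and redirection, discussed in the following comments, are the mechanisms that deal with threads that are not currently at a poll point.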

@VSadov
Member

VSadov commented Jun 14, 2022

#70740 enables return address hijacking on Windows ARM64.

If it works, and it looks like it does since tests are passing, it should alleviate the pausing/hanging problem.
That is assuming the problem here is pausing/hanging tests.

@VSadov
Member

VSadov commented Jun 15, 2022

#70769 enables redirection (asynchronous suspension) on Windows ARM64.

If that works, win-arm64 NativeAOT will be at functional parity with win-x64 and with its CoreCLR counterparts in terms of GC suspension.

@VSadov
Member

VSadov commented Jun 15, 2022

Both ARM64 PRs for the missing suspension mechanisms have been merged. If there are other issues with tests pausing/hanging on Windows ARM64, they are likely different issues not related to suspension.

@VSadov
Member

VSadov commented Jun 15, 2022

Since this was very likely caused by win-arm64 suspension issues, which should be fixed now, I think we should close this, and start opening more granular bugs.
Unless, of course, we still see lots of hangs.

@jkotas closed this as completed Jun 15, 2022
Repository owner moved this from High Priority to Done in AppModel Jun 15, 2022
@AndyAyersMS
Member

The Linux issue that Andy saw is different: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70264-merge-d0da0d0858e74cb4b0/System.Runtime.Tests/1/console.006f8aba.log?helixlogtype=result

The workitem itself hangs. @LakshanF has been disabling tests that hang like this on Linux. The current theory is that the hang is due to #67805, which is being worked on. We're trying to strike a balance between "this test is hanging too often" and "we don't have any test coverage for ARM64 Linux". I've not seen the System.Runtime tests hang often, but I can be proven wrong on that.

Seemingly another instance: https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-70809-merge-038c72672ded403397/System.Runtime.Tests/1/console.87b58c26.log?helixlogtype=result
