
TIME OUT - Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release #66083

Closed
thaystg opened this issue Mar 2, 2022 · 17 comments · Fixed by #66157
Assignees
Labels
area-Infrastructure-mono blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'

Comments

@thaystg
Member

thaystg commented Mar 2, 2022

We couldn't find any obvious change that could be causing it between these commits:
git diff a82d92c..c8da2fd

Log where the timeout started:
https://dev.azure.com/dnceng/public/_build/results?buildId=1633880&view=results

@thaystg thaystg added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Mar 2, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 2, 2022
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@thaystg
Member Author

thaystg commented Mar 2, 2022

@MattGal did we have any relevant infra change between Friday and Saturday?

@simonrozsival
Member

The timeout seems to be related to my PR which was merged on Friday (#65128). I'll try reverting it and see if it helps.

@MattGal
Member

MattGal commented Mar 2, 2022

@MattGal did we have any relevant infra change between Friday and Saturday?

Aside from serious "world-on-fire" hot-fixes, we will not make infrastructure changes on a Friday evening or Saturday morning. If something's going to roll out and affect you, it'll almost always be on a Wednesday morning, Pacific time.

If your investigation does make you think there are infra issues, please feel free to tag @dotnet/dnceng as well (or instead), since I sometimes take vacation days and other folks may reply faster on those days. Including things like the Job GUID (sometimes called Correlation id), the work item friendly name, and why you think we caused your timeout can also speed up investigation. Sounds like we're waiting for Simon to check it out first, though.

@vargaz
Contributor

vargaz commented Mar 2, 2022

This seems to be caused by more assemblies getting AOTed.
In particular, every test project now seems to include TestLibrary.dll, so we end up AOTing it like 2000 extra times. The original lane already took about 2 hours, so these extra AOT steps might push it over the timeout.

@ghost

ghost commented Mar 2, 2022

Tagging subscribers to this area: @directhex
See info in area-owners.md if you want to be subscribed.

Issue Details

We couldn't find any obvious change that could be causing it between these commits:
git diff a82d92c..c8da2fd

Log where the timeout started:
https://dev.azure.com/dnceng/public/_build/results?buildId=1633880&view=results

Author: thaystg
Assignees: -
Labels:

blocking-clean-ci, untriaged, area-Infrastructure-mono, in-pr

Milestone: -

@vargaz
Contributor

vargaz commented Mar 2, 2022

The extra TestLibrary dll was added by
572405a#diff-13390079908954088047032dd93629fec2bcfd68749c2c7422c9ed60511ea108L73,
which adds TestLibrary to every test project.
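For context, a blanket reference like this typically comes from a shared MSBuild file that every test project imports, so one line fans out into a dependency of thousands of projects. A minimal sketch of the pattern (the file name, property, and path below are illustrative, not the actual contents of 572405a):

```xml
<!-- Directory.Build.targets (illustrative name): MSBuild imports a file like
     this into every project beneath it, so this single ItemGroup makes each
     of the ~2000 test projects carry its own TestLibrary dependency. Mono
     full LLVM AOT then compiles that library once per project. -->
<Project>
  <ItemGroup>
    <!-- Hypothetical path; the real reference points at CoreCLRTestLibrary -->
    <ProjectReference Include="$(TestLibraryDir)CoreCLRTestLibrary.csproj" />
  </ItemGroup>
</Project>
```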

@lewing
Member

lewing commented Mar 2, 2022

cc @trylek

@naricc
Contributor

naricc commented Mar 2, 2022

@trylek @jkoritzinsky Did you complete the test grouping change for this lane? That should have made it faster, since there would be less redundant AOT work.

@jkoritzinsky
Member

We're part-way through the work. We added the TestLibrary reference to every test project centrally because that was simpler than adding the reference to each project individually. If that's causing issues, then we can move to manually adding the project reference to only the projects that require it.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 2, 2022
@SamMonoRT
Member

We're part-way through the work. We added the TestLibrary reference to every test project centrally because that was simpler than adding the reference to each project individually. If that's causing issues, then we can move to manually adding the project reference to only the projects that require it.

Yes, please switch to manually adding the project references. We need to keep this lane running in PR (in lieu of iOS device runtime tests).

@jkoritzinsky
Member

cc @trylek can you make this change? (I’m stretched a little thin on cycles right now)

@trylek
Member

trylek commented Mar 3, 2022

@jkoritzinsky - Sure, will do. I suppose that basically means we need to manually reference TestLibrary from each test that uses PlatformDetection.
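In other words, the fix amounts to dropping the shared reference and declaring it per project. A sketch of what each architecture-dependent test's .csproj would gain (the property name and path are illustrative, not the actual repo layout):

```xml
<!-- Only in the .csproj of tests that use TestLibrary.PlatformDetection with
     ConditionalFact; all other test projects lose the reference, so Mono
     full AOT no longer has to compile TestLibrary for them. -->
<ItemGroup>
  <!-- Illustrative path; the real project lives elsewhere in src/tests -->
  <ProjectReference Include="$(TestSourceDir)Common\CoreCLRTestLibrary\CoreCLRTestLibrary.csproj" />
</ItemGroup>
```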

trylek added a commit to trylek/runtime that referenced this issue Mar 3, 2022
When I was tagging several architecture-dependent runtime tests
with the ConditionalFact attribute, I added CoreCLRTestLibrary
as a general reference. This started causing timeouts in Mono AOT
tests because the library needs to be compiled many times.
As there's only a small number of architecture-dependent tests,
I have modified the change so that the test library is only added
explicitly to architecture-dependent tests.

Fixes: dotnet#66083

Thanks

Tomas
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 3, 2022
trylek added a commit that referenced this issue Mar 4, 2022
When I was tagging several architecture-dependent runtime tests
with the ConditionalFact attribute, I added CoreCLRTestLibrary
as a general reference. This started causing timeouts in Mono AOT
tests because the library needs to be compiled many times.
As there's only a small number of architecture-dependent tests,
I have modified the change so that the test library is only added
explicitly to architecture-dependent tests.

Fixes: #66083

Thanks

Tomas
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Mar 4, 2022
@fanyang-mono
Member

fanyang-mono commented Mar 4, 2022

With @trylek's PR, the Mono llvmfullaot Pri0 Runtime Tests Run Linux arm64 release lane still times out. The timeout now comes from the test-execution step on Helix taking too long to finish (10 min -> 1 h). @simonrozsival Could you re-open your revert PR (#66088) and merge main into it to see whether the testing time goes down? Since the timeout issue is not fully fixed, I am reopening it now.

@fanyang-mono fanyang-mono reopened this Mar 4, 2022
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Mar 4, 2022
@SamMonoRT SamMonoRT removed the untriaged New issue has not been triaged by the area owner label Mar 4, 2022
@SamMonoRT SamMonoRT removed the in-pr There is an active PR which will close this issue when it is merged label Mar 4, 2022
@SamMonoRT
Member

The AOT build-time issue seems to be fixed by the PR, but we are still seeing some tests running longer and exceeding the overall timeout of the lanes. Changing ownership of the issue until we determine what is causing the tests to run longer than expected.

@MattGal
Member

MattGal commented Mar 4, 2022

The AOT build-time issue seems to be fixed by the PR, but we are still seeing some tests running longer and exceeding the overall timeout of the lanes. Changing ownership of the issue until we determine what is causing the tests to run longer than expected.

Please don't forget that ARM64 test machines are a limited, non-scalable resource, and the runtime team is sending thousands of hours of work through them. Being more clever saves time, but we can't forget that we have a static number of limited resources.

@fanyang-mono
Member

The relevant CI lanes are back to normal now. Closing this issue.

@ghost ghost locked as resolved and limited conversation to collaborators Apr 6, 2022