Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Infra] SampleAnalyzerIntegrationTest randomly timeouting in CI #10036

Closed
Tracked by #10316
JanKrivanek opened this issue Apr 18, 2024 · 15 comments
Closed
Tracked by #10316

[Infra] SampleAnalyzerIntegrationTest randomly timeouting in CI #10036

JanKrivanek opened this issue Apr 18, 2024 · 15 comments
Assignees
Labels
Area: BuildCheck Area: Our Own Build Problems affecting the build or build infrastructure of the MSBuild repo itself. needs-investigation

Comments

@JanKrivanek
Copy link
Member

Context

System.TimeoutException : Test failed due to timeout: process 12042 is active for more than 30 sec.

   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.RunProcessAndGetOutput(String process, String parameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 139
   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.ExecBootstrapedMSBuild(String msbuildParameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 67
   at Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(Boolean buildInOutOfProcessNode, Boolean analysisRequested) in /home/vsts/work/1/s/src/BuildCheck.UnitTests/EndToEndTests.cs:line 122
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)

Sample run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=648080&view=ms.vss-test-web.build-test-results-tab&runId=16013174&resultId=100001&paneView=debug

@JanKrivanek JanKrivanek added the Area: Our Own Build Problems affecting the build or build infrastructure of the MSBuild repo itself. label Apr 18, 2024
@YuliiaKovalova
Copy link
Member

YuliiaKovalova commented Apr 19, 2024

discussed offline: the execution time can be extended for handling it. Please track it for some time if this approach solves the problem

@GangWang01
Copy link
Member

GangWang01 commented Apr 30, 2024

After observing 9 runs that failed with this test, the failure only happened to the following test cases. From the output logs attached inline, it looks like within 30 seconds the build check analyzer hadn't finished.

Checking with the output log Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log of fully passed test run, I found there were over 1000 warnings when the analysis was requested. But the warnings except BC0101 couldn't be reproduced locally. This may explain it looks like within 30 seconds the build check analyzer hadn't finished. No idea why CI run has so many extra warnings.

@YuliiaKovalova
Copy link
Member

@JanKrivanek , can this fix help with the test performance?
#10084

@JanKrivanek
Copy link
Member Author

Only slightly - it slashes only about a third of currently detected slowdown of the analyzers.
Though the detected slowdown was in range of about 14% - so it's puzzling that the process runs for over 30 second. It would be interesting to see if it fails with about the same or significantly decreased ratio if we increase the timeout to say - 120 seconds. That might bisect between the slowdown and deadlock theory

@JanKrivanek
Copy link
Member Author

@JanKrivanek
Copy link
Member Author

Might be totally red herring:

The timeouting cases seem to have TerminalLogger used (notice the control sequences):

�[33;1m   Microsoft.Build.BuildCheck.UnitTests: [Long Running Test] 'Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest', Elapsed: 00:01:02
�[mExecuting [/home/vsts/work/1/s/.dotnet/dotnet /home/vsts/work/1/s/artifacts/bin/bootstrap/net8.0/MSBuild/MSBuild.dll FooBar.csproj /m:1 -nr:False -restore -analyze]
==== OUTPUT ====
�[31;1m    Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(buildInOutOfProcessNode: True, analysisRequested: True) [FAIL]
�[m�[37m      System.TimeoutException : Test failed due to timeout: process 11430 is active for more than 120 sec.

Sample full log: TIMEOUT-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Wherease I do not see those appear in individual test cases for the success cases - sample log: OK-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Regardless of it being red herring or not - it is strange that TerminalLogger is used in CI. I believe @AR-May was looking into somthing similar

@JaynieBai
Copy link
Member

JaynieBai commented May 24, 2024

@JanKrivanek and @maridematte I repro this issue with the command "dotnet test ./artifacts/bin/Microsoft.Build.BuildCheck.UnitTests/Debug/net8.0/Microsoft.Build.BuildCheck.UnitTests.dll --filter "FullyQualifiedName~SampleAnalyzerIntegrationTest" randomly
" in linux machine.

Added -v:d with process command to output more info and found the build is stuck in target GetCopyToPublishDirectoryItems

Image

This is build log.txt.

@JanKrivanek
Copy link
Member Author

Perfect!

Can you collect 2 more cases to see if there is any pattern? Plus appart from the diag log it might be helpfull to have binlog as well.

Other than that I unfortunately do not have any idea now about what can bw causing this

@JaynieBai
Copy link
Member

JaynieBai commented May 29, 2024

Perfect!

Can you collect 2 more cases to see if there is any pattern? Plus appart from the diag log it might be helpfull to have binlog as well.

Other than that I unfortunately do not have any idea now about what can bw causing this

Here is log with -v:diag diaglog.txt

The binlog is not completed when failed. True.binlog.txt

Image

@JanKrivanek
Copy link
Member Author

Yeah - the binlog being cut aburptly is fine (and expected as test is killed after timeout) - the timeline from binlog will be interresting - it should show where was the time spent. At it'd be nice to see 2 or 3 cases - to be able to compare if the excesive time is spent in the same part of the build or if it is random.

Since it is Linux, we unfortunately cannot use ETW :-/
Maybe @AR-May or @ladipro have some top of the head ideas of what to focus on beyond the timing in binlog.

@JaynieBai
Copy link
Member

JaynieBai commented May 31, 2024

Yeah - the binlog being cut aburptly is fine (and expected as test is killed after timeout) - the timeline from binlog will be interresting - it should show where was the time spent. At it'd be nice to see 2 or 3 cases - to be able to compare if the excesive time is spent in the same part of the build or if it is random.

Since it is Linux, we unfortunately cannot use ETW :-/ Maybe @AR-May or @ladipro have some top of the head ideas of what to focus on beyond the timing in binlog.

I compare the logs of several failed cases, it's always the same part of the build. And unfortunately, there is no timeline shown in the binlog True.binlog.txt

@JanKrivanek
Copy link
Member Author

From offline discussion

Observations:

  • The repros seem to indicate that the test will get stuck in the same place - during _GenerateRestoreProjectPathWalk execution (which is invoked from restore) - on a FooBar-Copy.csproj (the second project) - while the previous call for a same project succeeded (from a recursive call of _GenerateRestoreProjectPathWalk on the FooBar.csproj)
  • The target is calling MSBuild task (to execute the same target recursively) - but that MSBuild call doesn't end

Recommendation on further steps:

As a side not - the issue happens during restore. So it might get resolved by skipping the restore - #9747. But it still would be very valuable to understand why is it happening

@GangWang01
Copy link
Member

Though this issue might be resolved in the fix above, I had some information about trying to debug the test running on linux remotely from VS on Windows. I tried the following ways. But they failed with different errors. I was not able to debug this test running on Linux.

  • Debug the test in WSL
    Remote debugging in WSL got the error MSB4062: The "CheckForImplicitPackageReferenceOverrides" task could not be loaded from the assembly /mnt/d/WS/fork/msbuild/artifacts/bin/bootstrap/net8.0/MSBuild/Sdks/Microsoft.NET.Sdk/targets/../tools/net8.0/Microsoft.NET.Build.Tasks.dll. An attempt was made to load a program with an incorrect format. See
    TestLog.txt for more details. After my investigation, I found the assembly Microsoft.NET.Build.Tasks.dll existed. But this assembly from clean build on Windows is not compatible to run in WSL due to its architecture x64, while the architecture of it from clean build on Linux is 64797.

  • Attach the debugger to the child process started by the test running on Linux
    I could attach the debugger to the started process(dotnet <MSBuild.dll in boostrap>) by the test on Linux for a while. There is was no call stack displayed and after a while the error below popped up. No Windows application event log was found either...
    image

@YuliiaKovalova
Copy link
Member

Though this issue might be resolved in the fix above, I had some information about trying to debug the test running on linux remotely from VS on Windows. I tried the following ways. But they failed with different errors. I was not able to debug this test running on Linux.

  • Debug the test in WSL
    Remote debugging in WSL got the error MSB4062: The "CheckForImplicitPackageReferenceOverrides" task could not be loaded from the assembly /mnt/d/WS/fork/msbuild/artifacts/bin/bootstrap/net8.0/MSBuild/Sdks/Microsoft.NET.Sdk/targets/../tools/net8.0/Microsoft.NET.Build.Tasks.dll. An attempt was made to load a program with an incorrect format. See
    TestLog.txt for more details. After my investigation, I found the assembly Microsoft.NET.Build.Tasks.dll existed. But this assembly from clean build on Windows is not compatible to run in WSL due to its architecture x64, while the architecture of it from clean build on Linux is 64797.
  • Attach the debugger to the child process started by the test running on Linux
    I could attach the debugger to the started process(dotnet <MSBuild.dll in boostrap>) by the test on Linux for a while. There is was no call stack displayed and after a while the error below popped up. No Windows application event log was found either...
    image

@GangWang01 , thank you for analysis.
I was able to catch it on Windows locally, after adding a new build-in rule. Using Parallel.ForEach in this case was causing deadlock, according to the callstack. So it has to be OK now.
Could you please reenable to tests in a separate PR now? I included it in my changes, but the review can take time :)

@JanKrivanek
Copy link
Member Author

Fixed by #10353

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: BuildCheck Area: Our Own Build Problems affecting the build or build infrastructure of the MSBuild repo itself. needs-investigation
Projects
None yet
Development

No branches or pull requests

5 participants