Attempt to make functional tests more reliable #5874

jeffkl · 2024-06-26T21:57:00Z

Bug

Fixes: https://github.com/NuGet/Client.Engineering/issues/2501
Fixes: https://github.com/NuGet/Client.Engineering/issues/2866

Regression? Last working version:

Description

The Functional tests seem to very unreliable with a few symptoms that this pull request hopefully addresses:

Test host process crashing
A particular command times out
A particular test or set of tests hang, causing ADO to cancel the test run

To attempt to address the issues, I've tweaked the parallelism in the test runner. There are three settings in the .runsettings for parallelism:

Setting	Purprose	Old value	New Value
`MaxCpuCount`	The number or processes that `vstest.console.exe` spawns to run tests in	0	4
`MaxParallelThreads`	The number of threads that the xunit runner can use at a time	0	4
`ParallelizeAssembly`	Whether or not the xunit test runner should run tests in assemblies in parallel	`true`	`false`

Since we are currently using the default values for MaxCpuCount and MaxParallelThreads, as well as enabling ParallelizeAssembly, I suspect that vstest.console.exe is launching 12 process, each one then runs the tests for each assembly with 12 threads, and if one process is running multiple assemblies, they're happening in parallel. My suspicion is that we're over parallelizing tests, leading to crashes. By changing the values to lower, I do not see any increase in test run time and in fact, sometimes the runs are faster. However, in my testing, I rarely see a test host process crashing anymore.

I also deleted build\TestShared\xunit.runner.json and moved the settings to build\xunit.runsettings so that everything is defined in one place. Some projects were opting into test parallelization with the UseParallelXunit property which would copy the xunit.runner.json to the output directory, while other projects were copying it themselves. I found it confusing to determine why some projects were allowed to run in parallel and others weren't so I changed how we run tests in parallel and made it uniform across the repo.

Finally, I've disabled the crash dump helper when running tests since all of the memory dumps it provided were always for the main vstest.console.exe which wasn't helpful since the tests themselves are running in a different process. Instead, I've added the --diag switch to the call to dotnet test and now there's a diagnostic log for the main process and all of the child processes with valuable information. These logs go to the binlog location and get uploaded when a test run fails or if you enable diagnostics on a particular pipeline run.

PR Checklist

kartheekp-ms · 2024-06-27T14:56:34Z

build/xunit.runsettings

+    <MaxParallelThreads>4</MaxParallelThreads>
+
+    <!-- Set this to true if this assembly is willing to participate in parallelization with other assemblies. Test runners can use this information to automatically enable parallelization across assemblies if all the assemblies agree to it. -->
+    <ParallelizeAssembly>false</ParallelizeAssembly>


I think this setting nullifies the MaxcpuCount (4) property because as per the comment it looks like MaxcpuCount specifies the number of distinct DLLs in the run determines the actual number of testhosts started.. Given that the multiple assemblies are not running parallel then my understanding is that only one test host process is started. I guess we should change to true so that 4 DLLs can run in parallel.

Since MaxCpuCount is set to 4 (here), there will be 4 vstest.console.exe processes spawned. I think it then launches each test assembly in those processes in parallel. Then with MaxParallelThreads set to 4, each test assembly runs with 4 threads (if there are enough CPUs).

This is really confusing since there are multiple test runners. You can run tests under xunit.runner.exe, vstest.console.exe, etc. I think the ParallelizeAssembly tells xunit specifically that it should run tests from multiple assemblies at the same time when running under certain test runners. But since we're running the tests under vstest.console.exe, it seems to me that we should not be using xunit's parallelization. I spent a lot of time read https://xunit.net/docs/runsettings but only experimenting with these numbers did I find a good balance between speed and reliability. Do you know more about how all of this works?

It looks like a similar issue has been discussed here. I think we are good because it appears that xUnit parallelization is ignored by the dotnet test command. One interesting thing I learned from this link is that executing dotnet test without specifying assemblies runs test assemblies in parallel, whereas dotnet test assembly1.dll assembly2.dll does not run test assemblies in parallel. I am not sure if this is accurate in .NET 8, but at least that was the case earlier because the linked comment explained the different behaviors with an example.

zivkan · 2024-07-02T01:24:49Z

test/TestUtilities/Test.Utility/CommandRunner.cs

@@ -53,6 +53,9 @@ public static CommandRunnerResult Run(string filename, string workingDirectory =
                    process.StartInfo.Environment["NUGET_SHOW_STACK"] = bool.TrueString;
                    process.StartInfo.Environment["NuGetTestModeEnabled"] = bool.TrueString;
                    process.StartInfo.Environment["UseSharedCompilation"] = bool.FalseString;
+                    process.StartInfo.Environment["DOTNET_SKIP_FIRST_TIME_EXPERIENCE"] = bool.TrueString;
+                    process.StartInfo.Environment["DOTNET_CLI_TELEMETRY_OPTOUT"] = bool.TrueString;
+                    process.StartInfo.Environment["SuppressNETCoreSdkPreviewMessage"] = bool.FalseString;


is this a mistake? Why wouldn't we want to suppress the .NET (Core) preview message?

🤦 Yes, thank you for catching that!

zivkan · 2024-07-02T01:29:19Z

eng/pipelines/pr.job.yml

@@ -78,7 +78,7 @@ jobs:
    condition: succeeded()
    inputs:
      command: test
-      arguments: --no-restore --restore:false --no-build --framework $(TestTargetFramework) --blame --logger trx --binarylogger:"$(BinlogDirectory)02-RunTests.binlog"


Finally, I've disabled the crash dump helper when running tests since all of the memory dumps it provided were always for the main vstest.console.exe which wasn't helpful since the tests themselves are running in a different process

Have you talked to the team who owns dotnet test? This sounds very surprising so sounds like something that should be fixed for all customers, or if we're doing something wrong, they can help us.

I see a few issues in the vstest repo:
microsoft/vstest#3936
microsoft/vstest#2593
microsoft/vstest#4376
microsoft/vstest#2952

Nothing specific to the problem I was seeing. Supposedly the blame stuff also doesn't supply enough command-line arguments to capture other exceptions than StackOverflowException and AccessViolationException

https://github.com/microsoft/vstest/blob/main/src/Microsoft.TestPlatform.Extensions.BlameDataCollector/ProcDumpArgsBuilder.cs#L33

Although now there's an env var we can set to include more arguments to proc dump. The main point in all of this is it seems like adding --diag to the vstest.console.exe execution is better. I don't think we have that many "test crashes". But if we feel like we need the memory dumps back, we can do so at a later time in my opinion.

zivkan · 2024-07-02T01:34:34Z

build/xunit.runsettings

+    <ParallelizeAssembly>false</ParallelizeAssembly>
+
+    <!-- Set this to true if the assembly is willing to run tests inside this assembly in parallel against each other. Tests in the same test collection will be run sequentially against each other, but tests in different test collections will be run in parallel against each other. Set this to false to disable all parallelization within this test assembly. -->
+    <ParallelizeTestCollections>false</ParallelizeTestCollections>


I compared the CI run for this PR to a different CI run, because I would have expected that setting ParallelizeTestCollections and ParallelizeAssembly both to false would make the tests run very slowly. But somehow this PR's CI run was margionally faster then the dev branch run.

It's totally not intuitive to me why, but I didn't see this mentioned in the PR description or anywhere else, so I wanted to point out that apparently the CI is not going to take a long longer with this PR merged.

I struggle to understand it as well. Its still running tests assemblies in parallel with each other, but I guess not running tests themselves in parallel? A machine with only 2 cores would only be able to run two tests at the same time anyway so I think the adjustments don't affect the maximum parallelization we could already achieve on CI agents. It really just lowers the amount of parallelization that's queued up waiting for CPU.

jeffkl added the Engineering Changes related to the build infrastructure of the repo and that do not change product code label Jun 26, 2024

jeffkl self-assigned this Jun 26, 2024

jeffkl requested a review from a team as a code owner June 26, 2024 21:57

kartheekp-ms added the Learning Opportunity Triggers an in-depth conversation in the weekly PR review meeting so that reviewers can learn more label Jun 27, 2024

kartheekp-ms reviewed Jun 27, 2024

View reviewed changes

jeffkl added 3 commits June 30, 2024 12:58

WIP

4c64e8a

No blame

4821ee8

Add some more environment variables to CommandRunner

d29b4eb

jeffkl force-pushed the dev-jeffkl-test-parallelism branch from 965bbb3 to e6aabcc Compare June 30, 2024 20:01

jeffkl added 2 commits July 1, 2024 10:45

Update tests

6d422fa

Increase timeout for pack tests

ce678f8

jeffkl force-pushed the dev-jeffkl-test-parallelism branch from e6aabcc to ce678f8 Compare July 1, 2024 21:52

zivkan previously approved these changes Jul 2, 2024

View reviewed changes

Address comments

e5553cb

jeffkl dismissed zivkan’s stale review via e5553cb July 2, 2024 02:56

zivkan approved these changes Jul 2, 2024

View reviewed changes

jeffkl merged commit 7a1e5b1 into dev Jul 2, 2024
27 of 28 checks passed

jeffkl deleted the dev-jeffkl-test-parallelism branch July 2, 2024 23:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to make functional tests more reliable #5874

Attempt to make functional tests more reliable #5874

jeffkl commented Jun 26, 2024 •

edited

Loading

kartheekp-ms Jun 27, 2024

jeffkl Jun 27, 2024

kartheekp-ms Jun 28, 2024 •

edited

Loading

zivkan Jul 2, 2024

jeffkl Jul 2, 2024

zivkan Jul 2, 2024

jeffkl Jul 2, 2024

zivkan Jul 2, 2024

jeffkl Jul 2, 2024

Attempt to make functional tests more reliable #5874

Attempt to make functional tests more reliable #5874

Conversation

jeffkl commented Jun 26, 2024 • edited Loading

Bug

Description

PR Checklist

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kartheekp-ms Jun 28, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jeffkl commented Jun 26, 2024 •

edited

Loading

kartheekp-ms Jun 28, 2024 •

edited

Loading