Test hang investigation support #3094

markwilkie · 2019-06-19T16:34:17Z

The following should be built into Helix:

• Understand what process is hanging.
• Understand what test is hanging.
• Take a dump of the process and have it be a downloadable artifact.

danmoseley · 2019-06-19T16:47:41Z

Bonus would be if we could dump all stacks to the console when the system grabs the dump. (This is what we've historically done for crashes of course). It's a lot of text, but then hangs are bad 😈 This can be really useful if it's an issue we're already investigating - saves pulling down another dump and cracking it when it's not new.

There are some extra SOS commands that are likely to be worth running on a hang, also. Eg., possibly related to in flight async operations.

markwilkie · 2019-06-19T16:59:53Z

From @davidfowl - ASP.NET Core has something for xunit https://github.com/aspnet/AspNetCore/blob/730db07d2905fdc536073c064dab321c9616d25a/src/Servers/Kestrel/test/FunctionalTests/UnixDomainSocketsTests.cs#L31 but it's per test and used for certain failures. I've been thinking more and more about ways to integrate the test runner experience in visual studio with CI system running tests in order to look at live progress and view things like "what tests are currently being executed" as it's too much of a black box today.

From @sharwell - We still use it. It’s part of the RunTests tool.
https://github.com/dotnet/roslyn/blob/e1146e5f51019f8539aedc1d3e67df207a45bc8e/src/Tools/Source/RunTests/Program.cs#L64-L76

From @tmat - Right, Roslyn has its own custom test runner that provides this functionality and also optimized scheduling. I did talk to the VSTest team (see attached thread) about the reasons we currently need custom test runners in Roslyn and in Arcade SDK.
I have not mentioned procdump integration but it would be certainly useful if VSTest provided it as well.

From @singhsarab - Here are key takeaways:
Looking at the diff, between the custom target already being used for Arcade and VSTest, following stand out
o Htmllogger (Anything which is cross plat actually, trx format does not fulfil that requirement)
o Also, if there is way to build AnyCpu and run for multiple architectures
o For Roslyn, there is config used to customize parallel, something like that would be good to have.

dotnet/msbuild#3658

singhsarab · 2019-09-17T11:02:16Z

@tmat @markwilkie We have added the support for the html logger in the VS Test platform.
Also, we have added the support for x86 architecture for .NET core.

Talking about diagnostics, and flaky behaviour like crash and hang, we have a feature called blame, which can be used to identify the problematic tests and also take dumps for such scenarios.

ViktorHofer · 2019-09-17T12:24:05Z

@singhsarab With xunit we have the ability to enable diagnostic message and set a timeout for long running tests detection (https://xunit.net/docs/configuration-files#longRunningTestSeconds). We have this mode enabled by default so that in case a test hangs we don't need to switch a flag before being able to detect the test. Is there a downside to the blame mode or can it be enabled by default?

singhsarab · 2019-09-17T12:38:23Z

@ViktorHofer Yes we have something called TestTimeout (https://github.com/microsoft/vstest/blob/041133fef03c327d3d759d215f868fdecebcdf45/src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Constants.cs#L81) that if configured can be used to track such things.

Blame comes in as a datacollector and monitors the test host process, hence can also take dumps (integrates with procdump for that). There will be some downside perf wise, but then it would not be huge.
Blame is already used in the VSTest task in Azure pipelines for crashes.
We recently added the support for hang (hence it is not documented yet, you can take a look at associated PR microsoft/vstest#2110)

singhsarab · 2019-09-17T12:42:00Z

@ViktorHofer Also, Since we are using apphost feature for .NET core scenario and also added the support for environment variables in runsettings, I thought it should unblock corefx scenarios that we had discussed. Can you please help validate those ?
The latest packages pushed to the myget feed have all these features that I have talked about. If you can help validate, I will then release them to public nuget feed.

ViktorHofer · 2019-09-17T13:56:24Z

The difference that I see here is that the timeout in xunit defines when diagnostic messages should be logged that shows what test is currently executed and how much time is already spent on it. The timeout in vstest is a hard kill cap. I don't want to argue what is the better solution but I really like a dump is collected automatically.

The TestTimeout is only honored in blame mode?
Can you provide more details on how much slower the blame mode is? We run 2.5M tests per CI run, which means that every percent in performance matters to us.

Can you please help validate those ?

Sure, the remaining items before we can use those features in corefx are:

Publish the packages to the correct BAR channel and to the dotnet feed (VSTest is pushing to the wrong channel in BAR microsoft/vstest#2169)
Enable apphost feature on other environments as well. We need that to enable code coverage. I will soon open a PR with the necessary changes for that in vstest. I have a POC already working locally, basically the apphost doesn't need to be packaged anymore individually and instead can be generated directly on the consumer side.

As soon as those two are in, corefx can take a dependency on VSTest and we will be able to dogfood test your changes.

singhsarab · 2019-09-18T14:43:09Z

@ViktorHofer Let me come back with numbers from perf side for blame option.
TestTimeout is configuration for blame given in the runsettings, so it is honored only by the blame data collector.

Eagerly waiting for your PR and so we can eventually see the things work end to end.

danmoseley · 2020-02-27T00:17:35Z

We have the long running test notification running and it's definitely useful.

What remains here is to have Helix create dumps on hangs. I'm closing this @markwilkie against an issue tracking that specifically dotnet/dnceng#1216

davidfowl · 2020-02-27T05:38:53Z

Can I see hanging tests in helix easily now?

danmoseley closed this as completed Feb 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test hang investigation support #3094

Test hang investigation support #3094

markwilkie commented Jun 19, 2019 •

edited

Loading

danmoseley commented Jun 19, 2019 •

edited

Loading

markwilkie commented Jun 19, 2019 •

edited

Loading

singhsarab commented Sep 17, 2019

ViktorHofer commented Sep 17, 2019

singhsarab commented Sep 17, 2019 •

edited

Loading

singhsarab commented Sep 17, 2019

ViktorHofer commented Sep 17, 2019

singhsarab commented Sep 18, 2019

danmoseley commented Feb 27, 2020

davidfowl commented Feb 27, 2020

Test hang investigation support #3094

Test hang investigation support #3094

Comments

markwilkie commented Jun 19, 2019 • edited Loading

danmoseley commented Jun 19, 2019 • edited Loading

markwilkie commented Jun 19, 2019 • edited Loading

singhsarab commented Sep 17, 2019

ViktorHofer commented Sep 17, 2019

singhsarab commented Sep 17, 2019 • edited Loading

singhsarab commented Sep 17, 2019

ViktorHofer commented Sep 17, 2019

singhsarab commented Sep 18, 2019

danmoseley commented Feb 27, 2020

davidfowl commented Feb 27, 2020

markwilkie commented Jun 19, 2019 •

edited

Loading

danmoseley commented Jun 19, 2019 •

edited

Loading

markwilkie commented Jun 19, 2019 •

edited

Loading

singhsarab commented Sep 17, 2019 •

edited

Loading