Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test hang investigation support #3094

Closed
markwilkie opened this issue Jun 19, 2019 · 10 comments
Closed

Test hang investigation support #3094

markwilkie opened this issue Jun 19, 2019 · 10 comments

Comments

@markwilkie
Copy link
Member

markwilkie commented Jun 19, 2019

The following should be built into Helix:

• Understand what process is hanging.
• Understand what test is hanging.
• Take a dump of the process and have it be a downloadable artifact.

@danmoseley
Copy link
Member

danmoseley commented Jun 19, 2019

Bonus would be if we could dump all stacks to the console when the system grabs the dump. (This is what we've historically done for crashes of course). It's a lot of text, but then hangs are bad 😈 This can be really useful if it's an issue we're already investigating - saves pulling down another dump and cracking it when it's not new.

There are some extra SOS commands that are likely to be worth running on a hang, also. Eg., possibly related to in flight async operations.

@markwilkie
Copy link
Member Author

markwilkie commented Jun 19, 2019

From @davidfowl - ASP.NET Core has something for xunit https://github.com/aspnet/AspNetCore/blob/730db07d2905fdc536073c064dab321c9616d25a/src/Servers/Kestrel/test/FunctionalTests/UnixDomainSocketsTests.cs#L31 but it's per test and used for certain failures. I've been thinking more and more about ways to integrate the test runner experience in visual studio with CI system running tests in order to look at live progress and view things like "what tests are currently being executed" as it's too much of a black box today.

From @sharwell - We still use it. It’s part of the RunTests tool.
https://github.com/dotnet/roslyn/blob/e1146e5f51019f8539aedc1d3e67df207a45bc8e/src/Tools/Source/RunTests/Program.cs#L64-L76

From @tmat - Right, Roslyn has its own custom test runner that provides this functionality and also optimized scheduling. I did talk to the VSTest team (see attached thread) about the reasons we currently need custom test runners in Roslyn and in Arcade SDK.
I have not mentioned procdump integration but it would be certainly useful if VSTest provided it as well.

From @singhsarab - Here are key takeaways:
Looking at the diff, between the custom target already being used for Arcade and VSTest, following stand out
o Htmllogger (Anything which is cross plat actually, trx format does not fulfil that requirement)
o Also, if there is way to build AnyCpu and run for multiple architectures
o For Roslyn, there is config used to customize parallel, something like that would be good to have.

dotnet/msbuild#3658

@singhsarab
Copy link
Contributor

@tmat @markwilkie We have added the support for the html logger in the VS Test platform.
Also, we have added the support for x86 architecture for .NET core.

Talking about diagnostics, and flaky behaviour like crash and hang, we have a feature called blame, which can be used to identify the problematic tests and also take dumps for such scenarios.

@ViktorHofer
Copy link
Member

@singhsarab With xunit we have the ability to enable diagnostic message and set a timeout for long running tests detection (https://xunit.net/docs/configuration-files#longRunningTestSeconds). We have this mode enabled by default so that in case a test hangs we don't need to switch a flag before being able to detect the test. Is there a downside to the blame mode or can it be enabled by default?

@singhsarab
Copy link
Contributor

singhsarab commented Sep 17, 2019

@ViktorHofer Yes we have something called TestTimeout (https://github.com/microsoft/vstest/blob/041133fef03c327d3d759d215f868fdecebcdf45/src/Microsoft.TestPlatform.Extensions.BlameDataCollector/Constants.cs#L81) that if configured can be used to track such things.

Blame comes in as a datacollector and monitors the test host process, hence can also take dumps (integrates with procdump for that). There will be some downside perf wise, but then it would not be huge.
Blame is already used in the VSTest task in Azure pipelines for crashes.
We recently added the support for hang (hence it is not documented yet, you can take a look at associated PR microsoft/vstest#2110)

@singhsarab
Copy link
Contributor

@ViktorHofer Also, Since we are using apphost feature for .NET core scenario and also added the support for environment variables in runsettings, I thought it should unblock corefx scenarios that we had discussed. Can you please help validate those ?
The latest packages pushed to the myget feed have all these features that I have talked about. If you can help validate, I will then release them to public nuget feed.

@ViktorHofer
Copy link
Member

The difference that I see here is that the timeout in xunit defines when diagnostic messages should be logged that shows what test is currently executed and how much time is already spent on it. The timeout in vstest is a hard kill cap. I don't want to argue what is the better solution but I really like a dump is collected automatically.

  • The TestTimeout is only honored in blame mode?
  • Can you provide more details on how much slower the blame mode is? We run 2.5M tests per CI run, which means that every percent in performance matters to us.

Can you please help validate those ?

Sure, the remaining items before we can use those features in corefx are:

  • Publish the packages to the correct BAR channel and to the dotnet feed (VSTest is pushing to the wrong channel in BAR microsoft/vstest#2169)
  • Enable apphost feature on other environments as well. We need that to enable code coverage. I will soon open a PR with the necessary changes for that in vstest. I have a POC already working locally, basically the apphost doesn't need to be packaged anymore individually and instead can be generated directly on the consumer side.

As soon as those two are in, corefx can take a dependency on VSTest and we will be able to dogfood test your changes.

@singhsarab
Copy link
Contributor

@ViktorHofer Let me come back with numbers from perf side for blame option.
TestTimeout is configuration for blame given in the runsettings, so it is honored only by the blame data collector.

Eagerly waiting for your PR and so we can eventually see the things work end to end.

@danmoseley
Copy link
Member

We have the long running test notification running and it's definitely useful.

What remains here is to have Helix create dumps on hangs. I'm closing this @markwilkie against an issue tracking that specifically dotnet/dnceng#1216

@davidfowl
Copy link
Member

Can I see hanging tests in helix easily now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants