Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CollectDumpOnTestSessionHang doesn't save dump file to TestResults directory when Azure DevOps pipeline job was canceled due to the timeout #2608

Closed
kartheekp-ms opened this issue Oct 28, 2020 · 10 comments
Assignees
Labels
needs-triage This item should be discussed in the next triage meeting.

Comments

@kartheekp-ms
Copy link
Member

kartheekp-ms commented Oct 28, 2020

Description

For a defect reproducable from the vstest command line, describe the issue you've observed.

We NuGet.Client team are working on running tests in blame mode to collect dumps when tests hang on the CI pipeline.

Linux tests job on the CI pipeline is configured to timeout after 45 mins. We publish build artifacts such as dump,sequence.xml and test results.xml from TestResultsDirectory for trouble shooting purposes only if job status is either failed or canceled for troubleshooting.

Tests are executed in parallel. one of the PR builds yesterday, blame detector identified a hang, collected the dump, and test framework started executing other test assemblies. In the mean-time Azure DevOps pipeline cancelled this job due to 45 min timeout. Unfortunately, only sequence.xml was uploaded to build artifacts but not dump file. As per below log dump of the hang was collected but not sure why the dump was not saved to TestResultsDirectory. Hence it was not published to the artifacts (see image below).

[Long Running Test] 'NuGet.Commands.Test.FallbackFolderRestoreTests.FallbackFolderRestore_AllPackagesFoundInFallback_NuspecModeAsync', Elapsed: 00:14:02
[Long Running Test] 'NuGet.Commands.Test.RuntimeTargetsTests.RestoreTargets_RestoreWithRuntimesAsync', Elapsed: 00:14:03
[Long Running Test] 'NuGet.Commands.Test.NETCoreProject2ProjectTests.NETCoreProject2Project_VerifyCompileForTransitiveSettings(expected: "BDXY", ab: True, bc: False, cd: True, ax: True, xy: True, yd: True)', Elapsed: 00:14:02
  Blame: Creating hang dump of process dotnet (5225).
  The active test run was aborted. Reason: Test host process crashed
  Data collector 'Blame' message: The specified inactivity time of 15 minute/s has elapsed. Collecting a dump and killing the test host process.

image

Here is the .runsettings file set to 15 mins test timeout to collect hang dump.

<RunSettings>
    <LoggerRunSettings>
        <Loggers>
            <Logger friendlyName="blame" enabled="True" />
        </Loggers>
    </LoggerRunSettings>
    <DataCollectionRunSettings>
        <DataCollectors>
            <DataCollector friendlyName="blame" enabled="True">
                <Configuration>
                    <CollectDumpOnTestSessionHang TestTimeout="15min" HangDumpType="full" />
                    <ResultsDirectory>%AGENT_TEMPDIRECTORY%</ResultsDirectory>
                </Configuration>
            </DataCollector>
        </DataCollectors>
    </DataCollectionRunSettings>
</RunSettings>

Expected behavior

Dump that was collected due to test hang saved to the TestResults Directory so that next steps in the pipeline could have published it to artifacts for troubleshooting.

Actual behavior

CollectDumpOnTestSessionHang doesn't save dump file to TestResults directory when Azure DevOps pipeline job was canceled due to the timeout

Environment

  • Tests are executed using .NET 5 Preview 7 SDK
  • As per this PR description, no additional tools are to collect hang dumps on Windows and Linux. Hence added only DataCollector > CollectDumpOnTestSessionHang in .runsettings file.
  • xUnit.net VSTest Adapter v2.4.1 for running tests

I appreciate any work around or additional guidance on this issue. Thanks.

@kartheekp-ms
Copy link
Member Author

kartheekp-ms commented Oct 28, 2020

I am unable to reproduce it locally on Windows. Here are the steps I followed.

  1. Execute dotnet test "a.dll" "b.dll" --settings "path to .runsettings file"
<RunSettings>
    <DataCollectionRunSettings>
        <DataCollectors>
            <DataCollector friendlyName="blame" enabled="True">
                <Configuration>
                    <CollectDumpOnTestSessionHang TestTimeout="1min" HangDumpType="full" />
                    <ResultsDirectory>%AGENT_TEMPDIRECTORY%</ResultsDirectory>
                </Configuration>
            </DataCollector>
        </DataCollectors>
    </DataCollectionRunSettings>
</RunSettings>
  1. dotnet test started executing tests a.dll in parallel. The process encountered a test which ran more than 1 min. A dump was collected.
The active test run was aborted. Reason: Test host process crashed
Data collector 'Blame' message: The specified inactivity time of 60 seconds has elapsed. Collecting hang dumps from testhost and its child processes.
Data collector 'Blame' message: Dumping 13360 - dotnet.
  1. I pressed Ctrl + c to exit the test execution immediately. Hence dotnet test stopped executing after dump was collected and did not get a chance to run tests in b.dll
  2. I found hang dump and sequence file in the local test results folder.

I am not sure why the dump was not saved to the TestsResultsDirectory on CI on Linux machine when the job timed out after 45 mins. I don't see Data collector 'Blame' message: Dumping 13360 - dotnet. message in the CI log but found in my local windows machine

@nohwnd
Copy link
Member

nohwnd commented Nov 2, 2020

The ResultsDirectory option under DataCollector > Configuration does not work, and never worked in this way. It specified a folder that was used as temporary location to store the dumps, before they were picked up by the attachments subsystem and moved to the "TestResults" folder where you have your xml reports, trx etc.

The folder specified in ResultsDirectory had to exist, but always ended up empty. With the new changes in Blame, the normal %TEMP% is used instead.

If you are interesting in saving the dumps into a specific location and not uploading them as attachments of the run then you can set VSTEST_DUMP_PATH to a directory that will be used to store the dump files.

I also might be misreading your issue, is it happening only in some cases, when you cancel the build? There should be a lot of info in the diagnostic logs, https://github.com/microsoft/vstest-docs/blob/master/docs/diagnose.md could you provide them if the stuff above does not solve your issue?

@kartheekp-ms
Copy link
Member Author

kartheekp-ms commented Nov 5, 2020

I found ResultsDirectory option in this doc. I thought it make sense to configure ResultsDirectory option to AGENT_TEMPDIRECTORY because the folder is cleaned after each pipeline job even if attachments subsystem fails to pick the dump files in the worst case.

If ResultsDirectory option under DataCollector > Configuration does not work, then we will consider removing it.

It specified a folder that was used as temporary location to store the dumps, before they were picked up by the attachments subsystem

We want these dump files in the Test Results directory. We have a downstream job that publishes these dump files from Test Results directory to artifacts for trouble shooting.

I also might be misreading your issue, is it happening only in some cases, when you cancel the build?

Yes, that is correct. Do we know at what point attachments subsystem picks up dump files and places them under TestResults directory? If this happens after all the test assemblies were executed, then it explains why dump files were not present in the expected location because the pipeline job was cancelled after timeout

@nohwnd
Copy link
Member

nohwnd commented Nov 5, 2020

I found ResultsDirectory option in this doc. I thought it make sense to configure ResultsDirectory option to AGENT_TEMPDIRECTORY because the folder is cleaned after each pipeline job even if attachments subsystem fails to pick the dump files in the worst case.

I thought that is done automatically, by setting the TEMP for the agent, at least it seemed to work that way when I tried it, but now I don't see it happen.

If ResultsDirectory option under DataCollector > Configuration does not work, then we will consider removing it.

Yeah that slipped during review.

We want these dump files in the Test Results directory. We have a downstream job that publishes these dump files from Test Results directory to artifacts for trouble shooting.

You can do that using the variable mentioned above.

Yes, that is correct. Do we know at what point attachments subsystem picks up dump files and places them under TestResults directory? If this happens after all the test assemblies were executed, then it explains why dump files were not present in the expected location because the pipeline job was cancelled after timeout

I thought it handles cancelling correctly. But might re-visit it and fix it. If it does not happen it would be a bug.

@kartheekp-ms
Copy link
Member Author

kartheekp-ms commented Nov 5, 2020

I thought it handles cancelling correctly. But might re-visit it and fix it. If it does not happen it would be a bug.

I even tried to mimic job timeout cancellation by manually cancelling the CI build while tests were running and after the dump was collected when a test timed out. I didn't see dump file published to TestResults directory. It looks like a bug to me unless I am missing something.

We want these dump files in the Test Results directory. We have a downstream job that publishes these dump files from Test Results directory to artifacts for trouble shooting.

You can do that using the variable mentioned above.

If VSTEST_DUMP_PATH setting causes dotnet test to create dumps directly in this folder instead of creating in %TEMP% and moving them later to desired location, then we can consider this option.

@kartheekp-ms
Copy link
Member Author

A work-around for this issue to increase the CI job timeout so that that dotnet test is not interrupted by a cancellation request and has sufficient time to publish dumps to Test Results directory even if there is a hang.

@nohwnd
Copy link
Member

nohwnd commented Nov 18, 2020

#2641 Added the fix to pickup the agent temp folder here. This fix will get also into 16.9

@kartheekp-ms
Copy link
Member Author

Thanks for the update. Looking at the PR, I think <ResultsDirectory>%AGENT_TEMPDIRECTORY%</ResultsDirectory> setting in .runsettings file is not required because temporary directory is chosen from environment variables with a fall back option on current user's temporary folder.

I thought it handles cancelling correctly. But might re-visit it and fix it. If it does not happen it would be a bug.

Is there any update on the issue discussed above to ensure that collected dump will always be moved from TEMP directory to TEST RESULTS directory even after azure pipeline sends a cancellation request?

@Evangelink Evangelink added the needs-triage This item should be discussed in the next triage meeting. label Aug 3, 2022
@Evangelink
Copy link
Member

@nohwnd What's the status of this issue? Is there anything else to do on our side?

@nohwnd
Copy link
Member

nohwnd commented Jan 17, 2023

I've setup two dlls both waiting for significant time, and setup hang dump. Then I tried cancelling the execution request after the first dll was dumped, and also killing the process tree after the first dll was dumped. In both cases I've got the resulting dump in the test results folder.

So to me this looks like it is working.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-triage This item should be discussed in the next triage meeting.
Projects
None yet
Development

No branches or pull requests

3 participants