Generate dumps for hanging tests #1216

danmoseley · 2020-02-27T00:16:54Z

Right now when a test hangs we do not get a dump file. Much of the time we need a dump (can't repro locally and code inspection doesn't show an obvious problem)

This mechanism needs to be a generic Helix mechanism - not tied to a particular repo's test infrastructure.

[Exception: In a few cases, in the libraries tests, we spawn a child process ("RemoteExec") and in those cases we can improve that infrastructure ourselves to get dumps on hangs. We have done this already on Windows and Linux is tracked here. ]

But in most cases, it's Helix infrastructure that needs to collect the dump.

Mono needs this also apparently: dotnet/runtime#32325

danmoseley · 2020-02-27T00:18:01Z

cc @stephentoub - I'm not really adding any value by opening this issue just FYI we now have one.

danmoseley · 2020-04-27T17:50:19Z

Perhaps requires https://github.com/dotnet/core-eng/issues/9533#issuecomment-610595478

ViktorHofer · 2020-04-27T17:51:23Z

cc @davidfowl

michellemcdaniel · 2022-02-17T22:06:49Z

Closing because the parent Epic was closed. If you believe this issue should still be worked on, please re-open.

agocke · 2022-06-22T21:14:18Z

Looks like this still hasn't been completed.

MattGal · 2022-06-22T21:33:28Z

Current ask: it's very cheap for us to call our timeout kill() with SIGQUIT on non-Windows systems; marking for triage discussion

davidfowl · 2022-06-22T22:29:51Z

AFAIK we did this. What doesn't work exactly?

MichalStrehovsky · 2022-06-23T01:10:02Z

AFAIK we did this. What doesn't work exactly?

E.g. this job: https://dev.azure.com/dnceng/public/_build/results?buildId=1838524&view=logs&j=b2c675c1-1c32-50e9-7c40-e762459300db&t=4f51db95-f7de-5a4e-cd10-f078782dd75b

ended with

...
[EXECUTION TIMED OUT]
Exit Code:-3Executor timed out after 2700 seconds and was killed
['System.Runtime.Tests' END OF WORK ITEM LOG: Command timed out, and was killed]

But there's no crashdump for it to see where the process got stuck.

davidfowl · 2022-06-23T02:18:22Z

Oh wait, the runtime uses a jank version of xunit right? It doesn't use the dotnet test runner.

MichalStrehovsky · 2022-06-23T02:23:45Z

Yes, there's a custom runner. But unless the timeouting logic of dotnet test is in native code, I don't think that would help in situations like this (we suspect this one is a hang in GC suspension, which means any timeouting logic written in managed code likely wouldn't get a chance to run). We basically need the process to be killed in a way that produced a crashdump.

davidfowl · 2022-06-23T02:26:27Z

@MichalStrehovsky the test host is a separate process so it would work just fine since it's the one collect a dump from the user process hosting test cases. The runtime doesn't use this currently and I thought there was a plan to move it over but I don't know where we are there.

cc @ViktorHofer

MichalStrehovsky · 2022-06-23T02:28:32Z

We won't be able to use the test host for Native AOT anyway, at least not in near future.

I'm trying to chase it in low tech way here: dotnet/runtime#71177. It's not great.

MattGal · 2022-07-08T00:54:23Z

We discussed this today in triage, and will not be moving forward with the "SIGQUIT" idea. While Helix machines will continue to happily collect dumps and provide them along with logs and result XMLs, the ROI and challenges of doing it from the Helix client remain too high to be worth it.

Some thoughts:

The entry point of the vast majority of helix work items is bash or cmd.exe. Getting a dump of this process is, like it was in the past, fairly useless.
Walking the entire tree of child processses and sigquit'ing each would produce copious dumps on every timeout in all of helix, which is undesirable from both a cost.

As runtime has already shown a way forward, and dotnet test can provide information (not just dumps) when tests hang, it does not make sense to try to force this behavior on every helix work item universally.

ViktorHofer · 2022-07-08T19:13:26Z

@davidfowl

The runtime doesn't use this currently and I thought there was a plan to move it over but I don't know where we are there.

Unfortunately as long as VSTest depends on functionality that we would like avoid to depend on, i.e. the networking stack, we won't be able to use it.

See microsoft/vstest#3595 (comment) in which other devs like @jkotas shared concerns with depending on VSTest in its current state and what we would need.

IMO in an ideal world we would have an in-proc test runner entry point, generated with a source generator based on the test framework's attributes (i.e. Fact/Theory for xunit) to make a test invocation fast and reflection free. In addition to that, there should be an option to just print to stdout to avoid the IO dependency when it's not strictly needed.

davidfowl · 2022-07-09T15:17:05Z

I don’t know if I agree with that. How does that work with VS? Is this a command line only story? Is this for customers or just the runtime (which has a unique problem)? Is this with things like nativeAOT in mind? I’m missing the big picture here

jkotas · 2022-07-09T18:54:07Z

How does that work with VS? Is this a command line only story?

VS test does not work for the inner loop if you work on the core runtime, for the reasons mentioned above. Command line only.

Is this for customers or just the runtime (which has a unique problem)? Is this with things like nativeAOT in mind?

It is primarily for the runtime, but the problems are not necessarily unique to the runtime.

For example, vstest does not have first class support for single-file testing (ie verifying that you library works in a single file deployment). We have custom solutions for number of these problems in the runtime.

ViktorHofer · 2022-07-11T15:50:15Z

VS test does not work for the inner loop if you work on the core runtime, for the reasons mentioned above. Command line only.

FWIW, VS Test Explorer which uses VSTest underneath does work well in the inner loop for libraries developers (working on source code in this repository under src/libraries).

danmoseley · 2022-07-25T20:00:28Z

As runtime has already shown a way forward,

I assume you mean dotnet/runtime#71177 .. @MichalStrehovsky did that work?

We still need a way to get dumps from hangs in runtime unit tests. Reactivating while we discuss that and maybe when we have clarity we can move this to runtime.

@hoyosjs @mikem8361 @elinor-fung let us say that we are limiting the problem to getting a dump in this situation -- "long running test"

   System.Threading.Tasks.Dataflow.Tests: [Long Running Test] 'System.Threading.Tasks.Dataflow.Tests.TransformManyBlockTests.TestProducerConsumerAsyncEnumerable', Elapsed: 00:42:19
   System.Threading.Tasks.Dataflow.Tests: [Long Running Test] 'System.Threading.Tasks.Dataflow.Tests.TransformManyBlockTests.TestProducerConsumerAsyncEnumerable', Elapsed: 00:44:20
['System.Threading.Tasks.Dataflow.Tests' END OF WORK ITEM LOG: Command timed out, and was killed]

If we could make xunit send a SIGQUIT to the worker process, would that make a dump? Would we need the createdump variables set?

hoyosjs · 2022-07-25T20:30:02Z

We'd need the variables set. dotnet test used the diagnostics netcore client library to collect dumps on hang, but as they noted this is a VSTest feature. The library could be used in XUnit too or whatever we desired to request a point-in-time dump. Now sure mono implements that one. Also, that solution doesn't work for NativeAOT. Native AOT would require ulimit and rely on the OSs capability for dump collection.

danmoseley · 2022-07-25T21:23:47Z

OK so work to make this happen for our unit tests would be

set env variables -- you are already working on that.
write code so that xunit could invoke the "diagnostics netcore client library" on hang
ensure "diagnostics netcore client library" is deployed ?

and vstest does all this, so we know it works?

If so - I can open an issue in dotnet/runtime.

This wouldn't help coreclr tests, right? If I understand right, those are standalone exe's so there is no harness to figure out something hung. (cc @trylek )

jkotas · 2022-07-25T21:32:29Z

This wouldn't help coreclr tests, right?

Also, it wouldn't help libraries tests in case the hang is in the runtime (GC, etc.).

hoyosjs · 2022-07-25T22:21:47Z

If the runtime is assumed to allow for event pipe requests and to be able to handle signals, the client library would work, or if there's a signal the environment variables could work.
If the runtime is assumed to be adversarial, calling createdump directly on the process ensures that all is captured. This requires SYS_CAP_PTRACE available on containers and elevation of an external watchdog (assuming the test runs in-proc). This is a coreclr-only solution. Often falling back on the OS for these cases is a good idea, although their dumps are not ideal.

hoyosjs · 2022-07-25T22:22:46Z

In runtime the one that launches the scripts acts as a watchdog of hangs and collects dumps on timeout.

danmoseley · 2022-07-26T20:21:41Z

@hoyosjs is it reasonable to move this to an issue in dotnet/runtime? unless and until it becomes clear that there's some way to handle it centrally

hoyosjs · 2022-08-05T20:27:38Z

CoreCLR already implements this for the runtime tests in here: https://github.com/dotnet/runtime/blob/543bcc5ee7d6a2b9471b016770227421c43a756e/src/tests/Common/Coreclr.TestWrapper/CoreclrTestWrapperLib.cs#L207-L254. The vstest solution is better for upstack repos. Libraries would either need to use vstest (which I am not sure I recommend given they use a live-built coreclr), or have an external monitor grab dumps. In-proc xunit running the patched runtime as we currently do is not resilient against runtime issues. CoreCLR will likely need this too with the wrapper based approach.

rzikm · 2024-07-31T11:31:51Z

I see this feature request has been opened for some time, is there an approximate timeline for this feature or some minimal subset? We (.NET Networking team) are mainly interested for hangs in .NET libraries test runs.

agocke · 2024-07-31T13:03:44Z

I would probably close this issue on the dnceng side. It doesn’t seem realistic to have shared dnceng test execution infrastructure any time soon. There are too many moving pieces and unique configurations in different repos.

If this gets done it probably needs to be done at the individual team level.

danmoseley mentioned this issue Feb 27, 2020

Test hang investigation support dotnet/arcade#3094

Closed

danmoseley mentioned this issue Apr 27, 2020

Long Running Test - System.Diagnostics.Tests.ProcessTests.Kill_ExitedChildProcess_DoesNotThrow dotnet/runtime#35506

Closed

michellemcdaniel closed this as completed Feb 17, 2022

agocke reopened this Jun 22, 2022

agocke mentioned this issue Jun 22, 2022

Feature request: create dumps for hangs in Helix dotnet/arcade#9719

Closed

MattGal added the Needs Triage A new issue that needs to be associated with an epic. label Jun 22, 2022

MichalStrehovsky mentioned this issue Jun 23, 2022

[NativeAOT] Enable Reflection library tests in Linux-ARM64 dotnet/runtime#70010

Closed

michellemcdaniel removed the Needs Triage A new issue that needs to be associated with an epic. label Jul 7, 2022

MattGal closed this as completed Jul 8, 2022

danmoseley reopened this Jul 25, 2022

danmoseley mentioned this issue Jul 25, 2022

System.Threading.Tasks.Dataflow tests timing out on coreclr Linux_musl arm dotnet/runtime#71475

Closed

missymessa transferred this issue from dotnet/arcade Oct 26, 2023

missymessa added this to the Improve Dev WF by providing shared test infrastructure milestone Oct 26, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Generate dumps for hanging tests #1216

Generate dumps for hanging tests #1216

danmoseley commented Feb 27, 2020

danmoseley commented Feb 27, 2020

danmoseley commented Apr 27, 2020

ViktorHofer commented Apr 27, 2020

michellemcdaniel commented Feb 17, 2022

agocke commented Jun 22, 2022

MattGal commented Jun 22, 2022

davidfowl commented Jun 22, 2022

MichalStrehovsky commented Jun 23, 2022

davidfowl commented Jun 23, 2022

MichalStrehovsky commented Jun 23, 2022

davidfowl commented Jun 23, 2022

MichalStrehovsky commented Jun 23, 2022

MattGal commented Jul 8, 2022

ViktorHofer commented Jul 8, 2022

davidfowl commented Jul 9, 2022

jkotas commented Jul 9, 2022

ViktorHofer commented Jul 11, 2022

danmoseley commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

danmoseley commented Jul 25, 2022

jkotas commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

danmoseley commented Jul 26, 2022

hoyosjs commented Aug 5, 2022

rzikm commented Jul 31, 2024

agocke commented Jul 31, 2024

Generate dumps for hanging tests #1216

Generate dumps for hanging tests #1216

Comments

danmoseley commented Feb 27, 2020

danmoseley commented Feb 27, 2020

danmoseley commented Apr 27, 2020

ViktorHofer commented Apr 27, 2020

michellemcdaniel commented Feb 17, 2022

agocke commented Jun 22, 2022

MattGal commented Jun 22, 2022

davidfowl commented Jun 22, 2022

MichalStrehovsky commented Jun 23, 2022

davidfowl commented Jun 23, 2022

MichalStrehovsky commented Jun 23, 2022

davidfowl commented Jun 23, 2022

MichalStrehovsky commented Jun 23, 2022

MattGal commented Jul 8, 2022

ViktorHofer commented Jul 8, 2022

davidfowl commented Jul 9, 2022

jkotas commented Jul 9, 2022

ViktorHofer commented Jul 11, 2022

danmoseley commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

danmoseley commented Jul 25, 2022

jkotas commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

hoyosjs commented Jul 25, 2022

danmoseley commented Jul 26, 2022

hoyosjs commented Aug 5, 2022

rzikm commented Jul 31, 2024

agocke commented Jul 31, 2024