Possible deadlock on process exit if GC is in progress #107800

Closed
VSadov opened this issue Sep 13, 2024 · 11 comments

@VSadov
Member

VSadov commented Sep 13, 2024

Bug reported by internal partners. In some relatively infrequent cases a worker process may get stuck while exiting.
Such "stuck" processes could become a nuisance, especially when the memory footprint is large.

The reason for the lockup is that an exiting thread tries to enter COOP mode to fix up its stack.
This should only be done when the thread itself exits cleanly (i.e. via ExitThread). When the detach is a result of process termination (i.e. a thread calls ExitProcess), entering COOP mode can block waiting for the GC state to clear, which may never happen, since the GC may no longer be running.
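
For illustration, the problematic path has roughly the following shape (a simplified sketch, not the actual CoreCLR sources; OnThreadDetach, EnterCoopMode and FixUpFrames are hypothetical names standing in for the real thread-detach logic):

    // Simplified sketch only - not the actual CoreCLR code.
    void OnThreadDetach(Thread* thread)
    {
        // The detaching thread left managed frames on its stack, so the
        // runtime wants to fix up the Frame chain, which requires COOP mode.
        if (thread->m_pFrame != FRAME_TOP)
        {
            // Entering COOP mode blocks until any in-progress GC completes.
            // During ExitProcess the OS has already terminated every other
            // thread, including the ones performing the GC, so the GC never
            // completes and this wait never returns - the process hangs.
            thread->EnterCoopMode();   // hypothetical; see RareDisablePreemptiveGC
            thread->FixUpFrames();     // hypothetical
        }
    }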

The bug seems to be old - I tried 9.0, 8.0, and 6.0 and saw it reproduce with a small directed repro.

The bug does not affect NativeAOT, only CoreCLR.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 13, 2024
@VSadov VSadov self-assigned this Sep 13, 2024
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Sep 13, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Sep 13, 2024
@VSadov
Member Author

VSadov commented Sep 13, 2024

A simple repro:
This program will often lock up and not exit.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    internal class Program
    {
        static void Main(string[] args)
        {
            // Hammer the GC from a background thread so a collection is
            // likely to be in progress at any given moment.
            Task.Run(() =>
            {
                for (; ; )
                {
                    GC.Collect();
                }
            });

            // Give the GC loop a moment to start, then terminate the process
            // while a collection is (probably) in flight.
            Thread.Sleep(10);
            Environment.Exit(0);
        }
    }

@VSadov
Member Author

VSadov commented Sep 13, 2024

There are several ways to fix this.
I am testing a simple enough fix that can be backported to older runtimes, if needed.

@jkotas
Member

jkotas commented Sep 14, 2024

I tried running your repro on .NET 9 RC1 for 10 minutes in a loop and it did not hang. What's the stacktrace of the hang?

@VSadov
Member Author

VSadov commented Sep 14, 2024

The stack is

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	KernelBase.dll!WaitForSingleObjectEx()	Unknown
>	[Inline Frame] coreclr.dll!GCEvent::Impl::Wait(unsigned int) Line 1372	C++
 	[Inline Frame] coreclr.dll!GCEvent::Wait(unsigned int) Line 1422	C++
 	coreclr.dll!WKS::GCHeap::WaitUntilGCComplete(bool bConsiderGCStart) Line 286	C++
 	coreclr.dll!Thread::RareDisablePreemptiveGC() Line 2161	C++
 	coreclr.dll!TlsDestructionMonitor::~TlsDestructionMonitor() Line 1748	C++
 	coreclr.dll!__dyn_tls_dtor(void * __formal, const unsigned long dwReason, void * __formal) Line 119	C++
 	ntdll.dll!LdrpCallInitRoutine()	Unknown
 	ntdll.dll!LdrpCallTlsInitializers()	Unknown
 	ntdll.dll!LdrShutdownProcess()	Unknown
 	ntdll.dll!RtlExitUserProcess()	Unknown
 	kernel32.dll!ExitProcessImplementation()	Unknown
 	coreclr.dll!SafeExitProcess(unsigned int exitCode, ShutdownCompleteAction sca) Line 74	C++
 	coreclr.dll!Environment_Exit(int exitcode) Line 65	C++

@VSadov
Member Author

VSadov commented Sep 14, 2024

I tried running your repro on .NET 9 RC1 for 10 minutes in a loop and it did not hang. What's the stacktrace of the hang?

Right, after updating to RC1 I no longer see this reproducing. It looks like it has been fixed as a side effect of #103877, which switched IsAtProcessExit to use RtlDllShutdownInProgress.

This is the case where g_fProcessDetach is false while RtlDllShutdownInProgress returns true, so the new check is more reliable and the hang no longer happens.
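
For reference, RtlDllShutdownInProgress is a documented ntdll export (declared in winternl.h) that reports whether process shutdown has started. A minimal sketch of the idea behind the new check (illustrative only, not the actual #103877 diff):

    // Illustrative sketch, not the actual #103877 change.
    #include <windows.h>
    #include <winternl.h>                  // declares RtlDllShutdownInProgress
    #pragma comment(lib, "ntdll.lib")      // the function is exported by ntdll.dll

    bool IsAtProcessExit()
    {
        // The loader reports shutdown as soon as ExitProcess starts tearing the
        // process down, which is earlier (and more reliable for this purpose)
        // than the runtime's own g_fProcessDetach flag, which is set too late.
        return RtlDllShutdownInProgress() != FALSE;
    }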

@VSadov
Member Author

VSadov commented Sep 14, 2024

So, I guess this is "fixed" for 9.0+, but what should we do about 8.0?
There are many changes in #103877. Maybe we can backport just the IsAtProcessExit/RtlDllShutdownInProgress portion?

My fix was adding a g_fEEShutDown condition to the if (thread->m_pFrame != FRAME_TOP) check in ~TlsDestructionMonitor(), but I think making IsAtProcessExit more reliable is a better fix.
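
The shape of that alternative fix would have been roughly this (an illustrative sketch of the idea, not an actual patch):

    // Illustrative sketch of the alternative described above, not an actual patch:
    // skip the COOP-mode frame fixup once the EE has started shutting down.
    if (thread->m_pFrame != FRAME_TOP && !g_fEEShutDown)
    {
        // enter cooperative mode and fix up the frame chain as before
    }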

@jkotas
Member

jkotas commented Sep 15, 2024

My fix was adding a condition for g_fEEShutDown

I have tried a number of variants of this when implementing #103877. I do not think it would work well. It would replace an intermittent shutdown hang with an intermittent shutdown crash. The PR description tried to explain it: "The existing g_fProcessDetach flag is set too late - using this flag to skip cooperative mode switch would lead to shutdown deadlocks, and the existing g_fEEShutDown flag is set too early - using this flag to skip cooperative mode switch would lead to shutdown crashes."

The RtlDllShutdownInProgress flag is set at the right time.

Maybe we can backport just the IsAtProcessExit/RtlDllShutdownInProgress portion?

Does the customer workload experience the shutdown hangs in just a few specific spots? I think that a targeted servicing-grade fix would fix just those few spots.

@VSadov
Member Author

VSadov commented Sep 16, 2024

Cooperative shutdown seems like quite a fragile, bug-farm-ish scenario. In particular, getting the timing wrong could indeed tip things between a deadlock and a crash (too late/too early).
#103877 seems to have closed quite a few potential ways to fail.

I was thinking - "why can't we just backport the whole thing?" - but my concerns are:

  • the change is fairly big, and in some places it is a mechanical upgrade of the g_fProcessDetach check to IsAtProcessExit; it could all be correct, but there are so many places...
  • some changes (TLS, allocation contexts, EH) are on top of 9.0 changes. Backporting would require thinking about the possible effects in the context of the 8.0 implementation.
  • we have one specific scenario reported that also appears to be easily reproducible. Backporting just enough to address that issue may be better on the benefit/risk scale for a servicing fix.

There is a possibility that after plugging one hole we will see another one reported, but at least we would minimize the chances of introducing new or worse issues.

@VSadov
Member Author

VSadov commented Sep 16, 2024

Does the customer workload experience the shutdown hangs in just a few specific spots?

It appears that for that report we always see the lockup in the same place - when TlsDestructionMonitor ends up waiting for the GC while we are detaching a thread with a nonempty stack (i.e. ExitThread/ExitProcess).
In the ExitThread case we need to fix up m_pFrame and therefore need coop mode, but in the ExitProcess case that causes a deadlock.

(Perhaps the m_pFrame fixup could be avoided or deferred, but that is probably not something to change in a servicing fix.)

@VSadov VSadov modified the milestones: 10.0.0, 8.0.x Sep 19, 2024
@VSadov
Member Author

VSadov commented Sep 19, 2024

This is fixed in 9.0+. We just need to track the fix for 8.0, so I changed the milestone to 8.0.x.

@VSadov
Member Author

VSadov commented Sep 19, 2024

Fixed in #107844

@VSadov VSadov closed this as completed Sep 19, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Oct 20, 2024