Possible deadlock on process exit if GC is in progress #107800

Closed
VSadov opened this issue Sep 13, 2024 · 11 comments

@VSadov
Member

VSadov commented Sep 13, 2024

Bug reported by internal partners. In some relatively infrequent cases a worker process may get stuck while exiting.
Such "stuck" processes could become a nuisance, especially when the memory footprint is large.

The reason for the lockup is that an exiting thread tries to enter COOP mode to fix up its stack.
This should only be done when the thread itself exits cleanly (i.e. via ExitThread). When the detach is a result of process termination (i.e. a thread calls ExitProcess), entering COOP mode can block waiting for the GC state to clear, which may never happen, since the GC may no longer be running.
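
For illustration, the problematic path has roughly the following shape (a simplified sketch, not the actual CoreCLR sources; OnThreadDetach, EnterCoopMode and FixUpFrames are hypothetical names standing in for the real thread-detach logic):

    // Simplified sketch only - not the actual CoreCLR code.
    void OnThreadDetach(Thread* thread)
    {
        // The detaching thread left managed frames on its stack, so the
        // runtime wants to fix up the Frame chain, which requires COOP mode.
        if (thread->m_pFrame != FRAME_TOP)
        {
            // Entering COOP mode blocks until any in-progress GC completes.
            // During ExitProcess the OS has already terminated every other
            // thread, including the ones performing the GC, so the GC never
            // completes and this wait never returns - the process hangs.
            thread->EnterCoopMode();   // hypothetical; see RareDisablePreemptiveGC
            thread->FixUpFrames();     // hypothetical
        }
    }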

The bug seems to be old - I tried 9.0, 8.0, and 6.0 and saw it reproduce with a small directed repro.

The bug does not affect NativeAOT, only CoreCLR.

@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Sep 13, 2024
@VSadov VSadov self-assigned this Sep 13, 2024
@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Sep 13, 2024
@mangod9 mangod9 added this to the 10.0.0 milestone Sep 13, 2024
@VSadov
Member Author

VSadov commented Sep 13, 2024

A simple repro:
This program will often lock up and not exit.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    internal class Program
    {
        static void Main(string[] args)
        {
            // Hammer the GC from a background thread so a collection is
            // likely to be in progress at any given moment.
            Task.Run(() =>
            {
                for (; ; )
                {
                    GC.Collect();
                }
            });

            // Give the GC loop a moment to start, then terminate the process
            // while a collection is (probably) in flight.
            Thread.Sleep(10);
            Environment.Exit(0);
        }
    }

@VSadov
Member Author

VSadov commented Sep 13, 2024

There are several ways to fix this.
I am testing a simple enough fix that can be backported to older runtimes, if needed.

@jkotas
Member

jkotas commented Sep 14, 2024

I tried running your repro on .NET 9 RC1 for 10 minutes in a loop and it did not hang. What's the stacktrace of the hang?

@VSadov
Member Author

VSadov commented Sep 14, 2024

The stack is

 	ntdll.dll!NtWaitForSingleObject()	Unknown
 	KernelBase.dll!WaitForSingleObjectEx()	Unknown
>	[Inline Frame] coreclr.dll!GCEvent::Impl::Wait(unsigned int) Line 1372	C++
 	[Inline Frame] coreclr.dll!GCEvent::Wait(unsigned int) Line 1422	C++
 	coreclr.dll!WKS::GCHeap::WaitUntilGCComplete(bool bConsiderGCStart) Line 286	C++
 	coreclr.dll!Thread::RareDisablePreemptiveGC() Line 2161	C++
 	coreclr.dll!TlsDestructionMonitor::~TlsDestructionMonitor() Line 1748	C++
 	coreclr.dll!__dyn_tls_dtor(void * __formal, const unsigned long dwReason, void * __formal) Line 119	C++
 	ntdll.dll!LdrpCallInitRoutine()	Unknown
 	ntdll.dll!LdrpCallTlsInitializers()	Unknown
 	ntdll.dll!LdrShutdownProcess()	Unknown
 	ntdll.dll!RtlExitUserProcess()	Unknown
 	kernel32.dll!ExitProcessImplementation()	Unknown
 	coreclr.dll!SafeExitProcess(unsigned int exitCode, ShutdownCompleteAction sca) Line 74	C++
 	coreclr.dll!Environment_Exit(int exitcode) Line 65	C++

@VSadov
Member Author

VSadov commented Sep 14, 2024

I tried running your repro on .NET 9 RC1 for 10 minutes in a loop and it did not hang. What's the stacktrace of the hang?

Right, after updating to RC1 I no longer see this reproducing. It looks like it has been fixed as a side effect of #103877, which switched IsAtProcessExit to use RtlDllShutdownInProgress.

This is the case where g_fProcessDetach is false while RtlDllShutdownInProgress returns true, so the new check is more reliable and the hang no longer happens.
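
For reference, RtlDllShutdownInProgress is a documented ntdll export (declared in winternl.h) that reports whether process shutdown has started. A minimal sketch of the idea behind the new check (illustrative only, not the actual #103877 diff):

    // Illustrative sketch, not the actual #103877 change.
    #include <windows.h>
    #include <winternl.h>                  // declares RtlDllShutdownInProgress
    #pragma comment(lib, "ntdll.lib")      // the function is exported by ntdll.dll

    bool IsAtProcessExit()
    {
        // The loader reports shutdown as soon as ExitProcess starts tearing the
        // process down, which is earlier (and more reliable for this purpose)
        // than the runtime's own g_fProcessDetach flag, which is set too late.
        return RtlDllShutdownInProgress() != FALSE;
    }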

@VSadov
Member Author

VSadov commented Sep 14, 2024

So, I guess this is "fixed" for 9.0+, but what should we do about 8.0?
There are many changes in #103877. Maybe we can backport just the IsAtProcessExit/RtlDllShutdownInProgress portion?

My fix was adding a g_fEEShutDown condition to the if (thread->m_pFrame != FRAME_TOP) check in ~TlsDestructionMonitor(), but I think making IsAtProcessExit more reliable is a better fix.
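
The shape of that alternative fix would have been roughly this (an illustrative sketch of the idea, not an actual patch):

    // Illustrative sketch of the alternative described above, not an actual patch:
    // skip the COOP-mode frame fixup once the EE has started shutting down.
    if (thread->m_pFrame != FRAME_TOP && !g_fEEShutDown)
    {
        // enter cooperative mode and fix up the frame chain as before
    }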

@jkotas
Member

jkotas commented Sep 15, 2024

My fix was adding a condition for g_fEEShutDown

I have tried a number of variants of this when implementing #103877. I do not think it would work well. It would replace an intermittent shutdown hang with an intermittent shutdown crash. The PR description tried to explain it: "The existing g_fProcessDetach flag is set too late - using this flag to skip cooperative mode switch would lead to shutdown deadlocks, and the existing g_fEEShutDown flag is set too early - using this flag to skip cooperative mode switch would lead to shutdown crashes."

The RtlDllShutdownInProgress flag is set at the right time.

Maybe we can backport just the IsAtProcessExit/RtlDllShutdownInProgress portion?

Does the customer workload experience the shutdown hangs in just a few specific spots? I think that a targeted servicing-grade fix would fix just those few spots.

@VSadov
Member Author

VSadov commented Sep 16, 2024

Cooperative shutdown seems like quite a fragile, bug-farm-ish scenario. In particular, getting the timing wrong could indeed tip things between a deadlock and a crash (too late/too early).
#103877 seems to have closed quite a few potential ways to fail.

I was thinking - "why can't we just backport the whole thing?" - but my concerns are:

  • the change is fairly big, and in some places it is a mechanical upgrade of the g_fProcessDetach check to IsAtProcessExit; it could all be correct, but there are so many places...
  • some changes (TLS, allocation contexts, EH) are on top of 9.0 changes. Backporting would require thinking about the possible effects in the context of the 8.0 implementation.
  • we have one specific scenario reported that also appears to be easily reproducible. Backporting just enough to address that issue may be better on the benefit/risk scale for a servicing fix.

There is a possibility that after plugging one hole we will see another one reported, but at least we would minimize the chances of introducing new or worse issues.

@VSadov
Member Author

VSadov commented Sep 16, 2024

Does the customer workload experience the shutdown hangs in just a few specific spots?

It appears that for that report we always see the lockup in the same place - when TlsDestructionMonitor ends up waiting for the GC while we are detaching a thread with a nonempty stack (i.e. ExitThread/ExitProcess).
In the ExitThread case we need to fix up m_pFrame and therefore need coop mode, but in the ExitProcess case that causes a deadlock.

(Perhaps the m_pFrame fixup could be avoided or deferred, but that is probably not something to change in a servicing fix.)

@VSadov VSadov modified the milestones: 10.0.0, 8.0.x Sep 19, 2024
@VSadov
Member Author

VSadov commented Sep 19, 2024

This is fixed in 9.0+. We just need to track the fix for 8.0, so I changed the milestone to 8.0.x.

@VSadov
Member Author

VSadov commented Sep 19, 2024

Fixed in #107844

@VSadov VSadov closed this as completed Sep 19, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Oct 20, 2024