Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Access violations around unmanaged LoaderAllocator when using dynamic (emitted with AssemblyBuilder) collectible assemblies #32171

Closed
ishimko opened this issue Feb 12, 2020 · 42 comments · Fixed by #33243
Assignees
Milestone

Comments

@ishimko
Copy link

ishimko commented Feb 12, 2020

After migration to .NET Core 3.1 from 2.2 we've started facing dotnet runtime crash during our test runs:
The active test run was aborted. Reason: Test host process crashed : Fatal error. Internal CLR error. (0x80131506). The problem also occurs with .NET Core 3.0, both on Windows and Linux.

The crash is not 100% reproducible (tests need to be rerun multiple times to catch this situation) and unfortunately we were not able to locate the problem and build a small reproducible example. But there are several crash dumps showing different aspects of the problem.

We're using AssemblyBuilder to emit collectible (AssemblyBuilderAccess.RunAndCollect) assemblies dynamically. Under some circumstances a try to use a type/object from this assembly produces access violation (reading memory near zero). When using AssemblyBuilderAccess.Run instead of RunAndCollect the issue is not reproduced. So it looks like the issue is related to introducing of collectible assemblies occurred in .NET Core 3.0.

There are several crash dumps collected on Windows we can share. I will add some thoughts on the analysis performed so far.

1. coreclr!LoaderAllocator::GetHandleValueFastPhase2

Dump: https://drive.google.com/file/d/1bZ80x7fQ_kkc3mus9KI1BT8Qi1fiNLpq/view?usp=sharing

Managed stack trace (copying only .NET part):

WinDbg output
0000001D68C3E1A8 00007fff9e7d6ab5 [HelperMethodFrame_PROTECTOBJ: 0000001d68c3e1a8] System.Signature.GetSignature(Void*, Int32, System.RuntimeFieldHandleInternal, System.IRuntimeMethodInfo, System.RuntimeType)
0000001D68C3E2E0 00007fff400c727e System.Reflection.RuntimeMethodInfo.FetchNonReturnParameters() [/_/src/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs @ 115]
0000001D68C3E330 00007fff4007be10 System.Reflection.RuntimeMethodInfo.GetParametersNoCopy() [/_/src/System.Private.CoreLib/src/System/Reflection/RuntimeMethodInfo.cs @ 357]
0000001D68C3E360 00007fff40079b5e System.RuntimeType.FilterApplyMethodBase(System.Reflection.MethodBase, System.Reflection.BindingFlags, System.Reflection.BindingFlags, System.Reflection.CallingConventions, System.Type[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2236]
0000001D68C3E3C0 00007fff400c7e4b System.RuntimeType.GetMethodCandidates(System.String, Int32, System.Reflection.BindingFlags, System.Reflection.CallingConventions, System.Type[], Boolean) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2455]
0000001D68C3E470 00007fff400ca4b2 System.RuntimeType.GetMethodImplCommon(System.String, Int32, System.Reflection.BindingFlags, System.Reflection.Binder, System.Reflection.CallingConventions, System.Type[], System.Reflection.ParameterModifier[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2750]
0000001D68C3E500 00007fff4007b488 System.RuntimeType.GetMethodImpl(System.String, System.Reflection.BindingFlags, System.Reflection.Binder, System.Reflection.CallingConventions, System.Type[], System.Reflection.ParameterModifier[]) [/_/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 2736]

From unmanaged side this crash was produced by the following line (handleTable is null):

*pValue = handleTable->GetAt(index);

By redoing some address arithmetic starting from mov rax [rdi+18h] it can be determined that LoaderAllocatorObject is 000001d8`1b940568:

WinDbg outout
00007fff`9e7d6a86 488b4718         mov     rax, qword ptr [rdi+18h]
00007fff`9e7d6a8a 488b4830         mov     rcx, qword ptr [rax+30h]
00007fff`9e7d6a8e 488b4140         mov     rax, qword ptr [rcx+40h]
00007fff`9e7d6a92 4885d2           test    rdx, rdx
00007fff`9e7d6a95 0f846c24ecff     je      coreclr!TypeHandle::GetManagedClassObject+0x3f (00007fff`9e698f07)
00007fff`9e7d6a9b 488b8040020000   mov     rax, qword ptr [rax+240h]
00007fff`9e7d6aa2 488b00           mov     rax, qword ptr [rax]
00007fff`9e7d6aa5 4885c0           test    rax, rax
00007fff`9e7d6aa8 0f845924ecff     je      coreclr!TypeHandle::GetManagedClassObject+0x3f (00007fff`9e698f07)
00007fff`9e7d6aae 488b4010         mov     rax, qword ptr [rax+10h]
00007fff`9e7d6ab2 48d1ea           shr     rdx, 1
ACCESS VIOLATION -> 00007fff`9e7d6ab5 488b5cd008       mov     rbx, qword ptr [rax+rdx*8+8] ds:00000000`00000230=????????????????
0:159> dt coreclr!LoaderAllocatorObject 000001d8`1b940568
   +0x000 m_pMethTab       : 0x00007fff`3ed1dc60 MethodTable
   +0x008 m_pLoaderAllocatorScout : 0x000001d8`1b19f198 LoaderAllocatorScoutObject
   +0x010 m_pSlots         : (null) 
   +0x018 m_slotsUsed      : 0n0
   +0x020 m_methodInstantiationsTable : (null) 

Going further into m_pLoaderAllocatorScout:

0:159> dt coreclr!LoaderAllocatorScoutObject 0x1d81b19f198
   +0x000 m_pMethTab       : 0x00007fff`3ec144c0 MethodTable
   +0x008 m_nativeLoaderAllocator : 0x000001d8`1afa2c08 LoaderAllocator

and to m_nativeLoaderAllocator, the memory looks a bit strange:

WinDbg output
0:159> dt coreclr!LoaderAllocator 0x000001d8`1afa2c08
. . .
   +0x288 m_fGCPressure    : 58 -- boolean?
   +0x289 m_fUnloaded      : 15 -- boolean?
   +0x28a m_fTerminated    : 74 -- boolean?
   +0x28b m_fMarked        : 16 -- boolean?
   +0x28c m_nGCCount       : 0n472
   +0x290 m_IsCollectible  : 0
. . .

From the CLR stack this dynamic assembly does not look like a one to be unloaded. Here is the AssemblyBuilder which owns the type on which the crash occurred:

0:159> !gcroot 000001d81afa46c8
Thread 1748:
    0000001D68C3E5E0 00007FFF3F3D3452 Nezaboodka.ClientTypeBinder..ctor(Nezaboodka.ClientTypeSystem, System.Collections.Generic.IEnumerable`1<System.Type>, System.Collections.Generic.IEnumerable`1<System.Type>, Nezaboodka.Nzon.NzonSettings)
        rdi: 
            ->  000001D81AFA46A8 Nezaboodka.Nzon.FormattersGenerator
            ->  000001D81AFA46C8 System.Reflection.Emit.AssemblyBuilder

2. coreclr!MethodTable::GetLoaderAllocatorObjectForGC

Dump: https://drive.google.com/file/d/1RSjdjAJ0TTjpH5QmcmGIheoHCk-il2al/view?usp=sharing

The line produced the crash (looks like it tries to dereference invalid handle returned from GetLoaderAllocatorObjectHandle()):

BYTE * retVal = *(BYTE**)GetLoaderAllocatorObjectHandle();

There is an interesting unmanaged stack trace here showing GC and JIT parts:

WinDbg output
00 0000002d`379bccf8 00007ff9`19ba335c coreclr!MethodTable::GetLoaderAllocatorObjectForGC+0x1e [f:\workspace\_work\1\s\src\vm\methodtable.cpp @ 10161] 
01 (Inline Function) --------`-------- coreclr!GCToEEInterface::GetLoaderAllocatorObjectForGC+0xc [f:\workspace\_work\1\s\src\vm\gcenv.ee.cpp @ 359] 
02 0000002d`379bcd00 00007ff9`19a4760e coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x1628ec [f:\workspace\_work\1\s\src\gc\gc.cpp @ 29019] 
03 0000002d`379bce50 00007ff9`19a43402 coreclr!WKS::gc_heap::mark_phase+0x9be [f:\workspace\_work\1\s\src\gc\gc.cpp @ 20795] 
04 0000002d`379be1c0 00007ff9`19a44a5c coreclr!WKS::gc_heap::gc1+0xb2 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 16346] 
05 (Inline Function) --------`-------- coreclr!GCToOSInterface::GetLowPrecisionTimeStamp+0x5 [f:\workspace\_work\1\s\src\vm\gcenv.os.cpp @ 946] 
06 0000002d`379be290 00007ff9`19a65dfc coreclr!WKS::gc_heap::garbage_collect+0x25c [f:\workspace\_work\1\s\src\gc\gc.cpp @ 17976] 
07 0000002d`379be320 00007ff9`19ad41fe coreclr!WKS::GCHeap::GarbageCollectGeneration+0x1fc [f:\workspace\_work\1\s\src\gc\gc.cpp @ 36597] 
08 (Inline Function) --------`-------- coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13832] 
09 0000002d`379be370 00007ff9`19ad6d58 coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13934] 
0a (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369] 
0b (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400] 
0c 0000002d`379be3f0 00007ff9`19a7c46e coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35827] 
0d (Inline Function) --------`-------- coreclr!Alloc+0x18b [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 240] 
0e (Inline Function) --------`-------- coreclr!AllocateObject+0x22d [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 1209] 
0f 0000002d`379be420 00007ff8`baf6819f coreclr!JIT_New+0x31e [f:\workspace\_work\1\s\src\vm\jithelpers.cpp @ 2724] 

LoaderAllocator address seems to be 00007ff8`ba020a88 (and its' boolean flags section looks suspicios too). m_hLoaderAllocatorObjectHandle has the value of 0x00000000`00000080 and it produces the crash:

WinDbg output
00007ff9`19cae933 488b4830           mov     rcx, qword ptr [rax+30h]
00007ff9`19cae937 488b4140           mov     rax, qword ptr [rcx+40h]
00007ff9`19cae93b 488b8040020000     mov     rax, qword ptr [rax+240h]
ACCESS VIOLATION -> 00007ff9`19cae942 488b00             mov     rax, qword ptr [rax] ds:00000000`00000080=????????????????
0:165> dt coreclr!LoaderAllocator 00007ff8`ba020a88
. . .
   +0x240 m_hLoaderAllocatorObjectHandle : 0x00000000`00000080 OBJECTHANDLE__
. . .
   +0x288 m_fGCPressure    : 0
   +0x289 m_fUnloaded      : 0
   +0x28a m_fTerminated    : 16
   +0x28b m_fMarked        : 1
   +0x28c m_nGCCount       : 0n1
   +0x290 m_IsCollectible  : ffffffffffffffe8
. . .

3. coreclr!LoaderAllocatorObject::GetHandleTable

Dump: https://drive.google.com/file/d/1n8p-JeOCrUS5qjBkOyiuQmP16kJYfAR4/view?usp=sharing

The crash is produced by the following line (loaderAllocator of type LoaderAllocatorObject is null):

PTRARRAYREF handleTable = loaderAllocator->GetHandleTable();

WinDbg output
00007ffc`54d2b8e8 488b8640020000   mov     rax, qword ptr [rsi+240h]
00007ffc`54d2b8ef 488b08           mov     rcx, qword ptr [rax]
00007ffc`54d2b8f2 488b4110         mov     rax, qword ptr [rcx+10h] ds:00000000`00000010=????????????????
0:020> ? rsi
Evaluate expression: 1846619047024 = 000001ad`f3128470
0:020> dt coreclr!LoaderAllocator 000001ad`f3128470
. . .
   +0x240 m_hLoaderAllocatorObjectHandle : 0x000001ad`ee457a10 OBJECTHANDLE__
. . .
   +0x288 m_fGCPressure    : 1
   +0x289 m_fUnloaded      : 0
   +0x28a m_fTerminated    : 0
   +0x28b m_fMarked        : 0
   +0x28c m_nGCCount       : 0n32763
   +0x290 m_IsCollectible  : 1
. . .

The most strange thing here is that this allocator is referenced from managed LoaderAllocatorScout.

By dumping all LoaderAllocatorScout objects from the managed heap and looking inside their m_nativeLoaderAllocator objects for address 000001ad`f3128470 (unmanaged LoaderAllocator) it can be determined that LoaderAllocatorScout presents in managed heap and is not collected. However the handle from unmanaged to managed (m_hLoaderAllocatorObjectHandle) is wrong.

WinDbg output
0:020> .foreach ($obj {!dumpheap -type LoaderAllocatorScout -short}){.echo $obj; !do $obj}
. . .
000001ad847a5dc8
Name:        System.Reflection.LoaderAllocatorScout
MethodTable: 00007ffbf59dfb10
EEClass:     00007ffbf59e39a8
Size:        24(0x18) bytes
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.1\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ffbf516ecc0  4000ec9        8        System.IntPtr  1 instance 000001ADF3128470 m_nativeLoaderAllocator
. . .
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Feb 12, 2020
@jeffschwMSFT jeffschwMSFT added this to the 5.0 milestone Feb 12, 2020
@jkotas
Copy link
Member

jkotas commented Feb 12, 2020

cc @janvorli

@janvorli
Copy link
Member

I'll take a look

@janvorli janvorli self-assigned this Feb 13, 2020
@janvorli
Copy link
Member

By redoing some address arithmetic starting from mov rax [rdi+18h] it can be determined that LoaderAllocatorObject is 000001d8`1b940568:

The interesting thing is that it is not a LoaderAllocator (named as LoaderAllocatorObject on the native side), so then what you've considered to be a LoaderAllocatorScout was something else too and thus the same was true for the m_nativeLoaderAllocator:

!DumpObj /d 000001d81b940568
Name:        System.RuntimeType+RuntimeTypeCache
MethodTable: 00007fff3ed1dc60
EEClass:     00007fff3ed25600
Size:        152(0x98) bytes
Type Name:   <Unloaded Type>
Type MT:     00007fff3fdc5388
File:        C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.1\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007fff3ec144c0  40001e5        8   System.RuntimeType  0 instance 000001d81b19f198 m_runtimeType
00007fff3ec144c0  40001e6       10   System.RuntimeType  0 instance 0000000000000000 m_enclosingType
00007fff3ed1d960  40001e7       88         System.Int32  1 instance                0 m_typeCode
00007fff3ecd1e18  40001e8       18        System.String  0 instance 0000000000000000 m_name
00007fff3ecd1e18  40001e9       20        System.String  0 instance 0000000000000000 m_fullname
00007fff3ecd1e18  40001ea       28        System.String  0 instance 0000000000000000 m_toString
00007fff3ecd1e18  40001eb       30        System.String  0 instance 0000000000000000 m_namespace
00007fff3ec171c8  40001ec       8c       System.Boolean  1 instance                0 m_isGlobal
00007fff3ec171c8  40001ed       8d       System.Boolean  1 instance                0 m_bIsDomainInitialized
00007fff3ed316b0  40001ee       38 ...Private.CoreLib]]  0 instance 0000000000000000 m_methodInfoCache
00007fff3ef8edb0  40001ef       40 ...Private.CoreLib]]  0 instance 000001d81b940600 m_constructorInfoCache
00007fff3ef87e30  40001f0       48 ...Private.CoreLib]]  0 instance 0000000000000000 m_fieldInfoCache
00007fff3eede150  40001f1       50 ...Private.CoreLib]]  0 instance 0000000000000000 m_interfaceCache
00007fff3eede150  40001f2       58 ...Private.CoreLib]]  0 instance 0000000000000000 m_nestedClassesCache
00007fff3ed1eac0  40001f3       60 ...Private.CoreLib]]  0 instance 0000000000000000 m_propertyInfoCache
00007fff3f38e0c0  40001f4       68 ...Private.CoreLib]]  0 instance 0000000000000000 m_eventInfoCache
00007fff3ecd1e18  40001f7       70        System.String  0 instance 0000000000000000 m_defaultMemberName
00007fff3ec10ae8  40001f8       78        System.Object  0 instance 0000000000000000 m_genericCache
00007fff3ec16610  40001f9       80      System.Object[]  0 instance 0000000000000000 _emptyArray
00007fff405b4b18  40001f5       f0 ...Private.CoreLib]]  1   static 000001d826741130 s_methodInstantiations
00007fff3ec10ae8  40001f6       f8        System.Object  0   static 000001d8187d3d48 s_methodInstantiationsLock

But the MethodTable at frame 3 points to that as if it was a LoaderAllocator instance (you can find the MethodTable address from frame 4: this->m_asMT):

!dumpmt 0x00007fff`3fd98808
EEClass:         00007fff3fd98690
Module:          00007fff3fcac080
Name:            
mdToken:         0000000002000044
File:            Unknown Module
LoaderAllocator: 000001d81b940568
BaseSize:        0x20
ComponentSize:   0x0
Slots in VTable: 11
Number of IFaces in IFaceMap: 0

My guess is that the real LoaderAllocator was collected somehow and a new object of type System.RuntimeType+RuntimeTypeCache got allocated at the same address.
Obviously, it should not have gotten collected as the MethodTable of an existing object still references it.
But, getting all objects from the heap with this MethodTable shows:

!dumpheap -mt 0x00007fff`3fd98808 
         Address               MT     Size
000001d81b93b428 00007fff3fd98808       32     

Statistics:
              MT    Count    TotalSize Class Name
00007fff3fd98808        1           32 <Unloaded Type>

 !dumpobj 000001d81b93b428 
Name:        <Unloaded Type>
MethodTable: 00007fff3fd98808
EEClass:     00007fff3fd98690
Size:        32(0x20) bytes
File:        Unknown Module
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007fff3f64a150  40000ab        8 ...zaboodka.Server]]  0 instance 000001d81af7b590 as__Timestamp
00007fff3f481038  40000ac       10 ...Private.CoreLib]]  0 instance 000001d81af73418 as__Boolean

!gcroot -all 000001d81b93b428 
Thread 1748:
    0000001D68C3E550 00007FFF3F3E0F58 Nezaboodka.Nzon.ObjectNzonType`1[[System.__Canon, System.Private.CoreLib]].BindFormatter(Nezaboodka.Nzon.FormattersGenerator)
        rbx: 
            ->  000001D81B93B428 <Unloaded Type>

So the object of that type is still referenced by a register in one of the stack frames of the current thread (frame 0b), yet the LoaderAllocator was unloaded as if there were no more references to it.
That seems to indicate that for a single GC collection, the object was not found to be referenced anywhere and then resurfaced in the rbx register in that frame or GC somehow didn't call the MethodTable::GetLoaderAllocatorObjectForGC for this object.

I keep looking, the other failure modes are also quite strange.

@ishimko
Copy link
Author

ishimko commented Feb 14, 2020

Thanks for looking into that @janvorli. That explains why ""LoaderAllocator"" fields look so strange, because it is not LoaderAllocator at all.

@janvorli
Copy link
Member

@vanashimko I wonder if you'd be able to create a simple repro of the problem that you would be able to share. That would make looking into it much easier.
I also wonder if your code uses GCHandle struct and its ToIntPtr / FromIntPtr methods. I can see a possible bug related to that that could accidentally result in freeing of the LoaderAllocator handle.

@ishimko
Copy link
Author

ishimko commented Feb 14, 2020

@janvorli I tried to create an example multiple times, but unfortunately without success. There is a lot of code involved there. I even tried to create a separate test (based on our code) to have something reproducible internally as a starting point for isolated example, but that didn't work. There are around 300+ tests that are running. And it can be the case when they are run successfully 2-3 times in a row without any issues. And than another restart produces this crash during the first test.

Because of "gc" in a stack trace I've also tried to insert GC.Collect here and there in the code, to try force this issue somehow, but that didn't work too.

As for GCHandle, the answer is no.

@janvorli
Copy link
Member

Ok, thank you for your answer, I'll keep digging in the dumps.

@janvorli
Copy link
Member

@vanashimko I got an idea that can help us diagnose it better. There is a feature called stress log that is able to log various low level details on the runtime into an in-memory buffer that is dumped as part of a crash dump. It can also log handle creation and destruction. I was wondering if you'd be able to enable that and once you hit the failures again, share the crash dumps again.
To enable that, you'd need to set the following environment variables in the command line before starting your application:

COMPlus_StressLog=1
COMPlus_LogLevel=6
COMPlus_LogFacility=1
COMPlus_StressLogSize=2000000
COMPlus_TotalStressLogSize=40000000

This enables stress log buffer, logs events of frequency level 6, logs GC events (LogFacility=1), the stress log size per thread is 0x2000000 bytes (32MB) and total stress log size is 0x40000000 bytes (1GB).

@janvorli
Copy link
Member

Actually, it would be good to add LoaderAllocator events - so instead of COMPlus_LogFacility=1, you'd use COMPlus_LogFacility=401

@ishimko
Copy link
Author

ishimko commented Feb 14, 2020

@janvorli Thanks. Actually I've tried stress log myself (enabled it using some registry changes found on the Internet). But was not able to make use of the results because not so familiar with all this stuff and just deleted those dumps.

Now I've used the approach with env variables you described (including 401 for LogFacility) and collected the following: https://drive.google.com/file/d/1kk_RVVix-IVyIBih22Doo9_MdmgPchSf/view?usp=sharing. Hope did not miss anything.

@janvorli
Copy link
Member

@vanashimko the dump and logs shows something interesting. First, the reference in a handle that should be a LoaderAllocator is again the System.RuntimeType+RuntimeTypeCache like in one of the crashes before. The stresslog proves that the handle was re-created without the old one being deleted, which I have no idea how it could happen. Here are copied out lines from the log that correspond to the handle slot 000001F2493291F8:

339c  20.132816500 : `GC`                 DestroyHandle: *000001F2493291F8->0000000000000000 <<< Some previous handle using the same slot was destroyed
38cc  20.576978600 : `GC`                 CreateHandle: 000001F2493291F8, type=1 << Here the handle for the LoaderAllocator got created
3d64  20.582316000 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F24FE01CA0 -> 000001F24FC30828  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
3144  20.606298600 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F24FC30828 -> 000001F24C9B5208  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
3700  20.693169400 : `GC`                 CreateHandle: 000001F2493291F8, type=1 <<< Handle in the same slot was created again, but the previous one was not destroyed

As you can see, there is no DestroyHandle between the two CreateHandle logs and still the same slot was reused.

@janvorli
Copy link
Member

The only explanation I can see is a memory corruption or a bug in the handle table code (which seems extremely unlikely).
@jkotas do you happen to have some other possible explanation?

@janvorli
Copy link
Member

@vanashimko I'll be OOF next week, so I won't be able to try to look into it more before I get back. If you'd get more dumps with the stresslog enabled with stack trace similar to the cases 2 and 3, please share them so that we can get more data.

@ishimko
Copy link
Author

ishimko commented Feb 15, 2020

@janvorli Thank you for your research. Some more dumps with stresslog enabled:

coreclr!LoaderAllocatorObject::GetHandleTable - probably similar to 3
https://drive.google.com/file/d/155rv89FbXHczPf0Rw5vEITJJdMjFqoG8/view?usp=sharing

00 (Inline Function) --------`-------- coreclr!LoaderAllocatorObject::GetHandleTable [f:\workspace\_work\1\s\src\vm\object.h @ 2468] 
01 0000007f`9647def0 00007ffc`4526fd5b coreclr!LoaderAllocator::CompareExchangeValueInHandle+0xfc6d6 [f:\workspace\_work\1\s\src\vm\loaderallocator.cpp @ 940] 
02 0000007f`9647df80 00007ffc`4526f1df coreclr!DomainFile::GetExposedModuleObject+0x147 [f:\workspace\_work\1\s\src\vm\domainfile.cpp @ 491] 
03 (Inline Function) --------`-------- coreclr!Module::GetExposedObject+0xf [f:\workspace\_work\1\s\src\vm\ceeload.cpp @ 3029] 
04 0000007f`9647e000 00007ffc`4526f09a coreclr!GetRuntimeModuleHelper+0x103 [f:\workspace\_work\1\s\src\vm\runtimehandles.cpp @ 180] 
05 0000007f`9647e180 00007ffb`e6c30458 coreclr!RuntimeTypeHandle::GetModule+0x3a [f:\workspace\_work\1\s\src\vm\runtimehandles.cpp @ 539] 

coreclr!MethodTable::GetParentMethodTableOrIndirection - looks like a new one
https://drive.google.com/file/d/1Q6N9iiqrC_rBW8u0NFtf65Iy7u3jfEW8/view?usp=sharing

00 (Inline Function) --------`-------- coreclr!MethodTable::GetParentMethodTableOrIndirection [f:\workspace\_work\1\s\src\vm\methodtable.h @ 2064] 
01 00000096`0e23d838 00007ffc`452f647e coreclr!MethodTable::CanCastToClassNoGC+0x31 [f:\workspace\_work\1\s\src\vm\methodtable.cpp @ 1733] 
02 00000096`0e23d840 00007ffc`44e8d2ee coreclr!JIT_Stelem_Ref__ObjIsInstanceOfNoGC_Helper+0x1e [F:\workspace\_work\1\s\src\vm\amd64\JitHelpers_Fast.asm @ 804] 
03 00000096`0e23d870 00007ffc`44e8d03e System_Private_CoreLib!System.Collections.Generic.ArraySortHelper`1[System.__Canon].PickPivotAndPartition(System.__Canon[], Int32, Int32, System.Comparison`1<System.__Canon>)$##6005554+0x17e
04 00000096`0e23d8f0 00007ffc`44e8cfb5 System_Private_CoreLib!System.Collections.Generic.ArraySortHelper`1[System.__Canon].IntroSort(System.__Canon[], Int32, Int32, Int32, System.Comparison`1<System.__Canon>)$##6005553+0x6e
05 00000096`0e23d970 00007ffc`44e8cb66 System_Private_CoreLib!System.Collections.Generic.ArraySortHelper`1[System.__Canon].IntrospectiveSort(System.__Canon[], Int32, Int32, System.Comparison`1<System.__Canon>)$##6005552+0x55
06 00000096`0e23d9e0 00007ffc`44cab072 System_Private_CoreLib!System.Collections.Generic.ArraySortHelper`1[System.__Canon].Sort(System.__Canon[], Int32, Int32, System.Collections.Generic.IComparer`1<System.__Canon>)$##600554C+0x96
07 00000096`0e23da50 00007ffb`e6b6d63c System_Private_CoreLib!System.Array.Sort[System.__Canon](System.__Canon[], Int32, Int32, System.Collections.Generic.IComparer`1<System.__Canon>)$##6000155+0xa2

I will post more if find (this GC trace from the case 2 is pretty rare, BTW, the dump from the original issue was the only time I faced it).

UPD:

coreclr!MethodTable::GetLoaderAllocatorObjectForGC - similar to 2
https://drive.google.com/file/d/1_y5Reql7FSNv1UqDaHFF6XnBMMyPbX2M/view?usp=sharing

00 00000095`90e7c0e8 00007ffc`4533335c coreclr!MethodTable::GetLoaderAllocatorObjectForGC+0x1e [f:\workspace\_work\1\s\src\vm\methodtable.cpp @ 10161] 
01 (Inline Function) --------`-------- coreclr!GCToEEInterface::GetLoaderAllocatorObjectForGC+0xc [f:\workspace\_work\1\s\src\vm\gcenv.ee.cpp @ 359] 
02 00000095`90e7c0f0 00007ffc`451d760e coreclr!WKS::gc_heap::mark_through_cards_for_segments+0x1628ec [f:\workspace\_work\1\s\src\gc\gc.cpp @ 29019] 
03 00000095`90e7c240 00007ffc`451d3402 coreclr!WKS::gc_heap::mark_phase+0x9be [f:\workspace\_work\1\s\src\gc\gc.cpp @ 20795] 
04 00000095`90e7d5b0 00007ffc`451d4a5c coreclr!WKS::gc_heap::gc1+0xb2 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 16346] 
05 (Inline Function) --------`-------- coreclr!GCToOSInterface::GetLowPrecisionTimeStamp+0x5 [f:\workspace\_work\1\s\src\vm\gcenv.os.cpp @ 946] 
06 00000095`90e7d680 00007ffc`451f5dfc coreclr!WKS::gc_heap::garbage_collect+0x25c [f:\workspace\_work\1\s\src\gc\gc.cpp @ 17976] 
07 00000095`90e7d710 00007ffc`452641fe coreclr!WKS::GCHeap::GarbageCollectGeneration+0x1fc [f:\workspace\_work\1\s\src\gc\gc.cpp @ 36597] 
08 (Inline Function) --------`-------- coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x12 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13832] 
09 00000095`90e7d760 00007ffc`45266d58 coreclr!WKS::gc_heap::try_allocate_more_space+0x24e [f:\workspace\_work\1\s\src\gc\gc.cpp @ 13934] 
0a (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x11 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14369] 
0b (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x58 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 14400] 
0c 00000095`90e7d7e0 00007ffc`4520bd9b coreclr!WKS::GCHeap::Alloc+0x88 [f:\workspace\_work\1\s\src\gc\gc.cpp @ 35827] 
0d (Inline Function) --------`-------- coreclr!Alloc+0x199 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 240] 
0e (Inline Function) --------`-------- coreclr!AllocateSzArray+0x356 [f:\workspace\_work\1\s\src\vm\gchelpers.cpp @ 543] 
0f 00000095`90e7d810 00007ffb`e6c3d569 coreclr!JIT_NewArr1+0x47b [f:\workspace\_work\1\s\src\vm\jithelpers.cpp @ 3145] 

@jkotas
Copy link
Member

jkotas commented Feb 15, 2020

@jkotas do you happen to have some other possible explanation?

One possibility to look for is double-free or use-after-free of a GC handle.

@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Feb 24, 2020
@janvorli
Copy link
Member

@davidwrighton has also told me today that he has recently seen a similar bug caused by double free of a GC handle that later results in double allocation of the same handle. So I've looked at the stress log again and I can see exactly that. Extending the list of relevant items from the stress log that I've shown before:

38cc  18.090732000 : `GC`                 CreateHandle: 000001F2493291F8, type=1
3164  18.109332900 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F2502B42D8 -> 000001F24FB5C338  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
38cc  18.141538400 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F24FB5C338 -> 000001F24D0E1140  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
339c  19.530733500 : `GC`                 DestroyHandle: *000001F2493291F8->000001F24D0E1140
339c  20.132816500 : `GC`                 DestroyHandle: *000001F2493291F8->0000000000000000 <<< Some previous handle using the same slot was destroyed
38cc  20.576978600 : `GC`                 CreateHandle: 000001F2493291F8, type=1 << Here the handle for the LoaderAllocator got created
3d64  20.582316000 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F24FE01CA0 -> 000001F24FC30828  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
3144  20.606298600 : `GC`GCROOTS`             GC Root 000001F2493291F8 RELOCATED 000001F24FC30828 -> 000001F24C9B5208  MT = 00007FFAE89DF8A8 (System.Reflection.LoaderAllocator)
3700  20.693169400 : `GC`                 CreateHandle: 000001F2493291F8, type=1 <<< Handle in the same slot was created again, but the previous one was not destroyed

As you can see at 19.530733500 and 20.132816500, the same handle was destroyed. The handle was pointing to a LoaderAllocator at the first call to destroy and to nothing at the second call to destroy, so it seems like we have an issue with double freeing the LoaderAllocator handle in some cases in the runtime.

I'll keep digging into it to pinpoint the case when it happens.

@janvorli
Copy link
Member

@vanashimko would you be able to try to run your application with a modified coreclr.dll? I could add some additional checks that would trigger right at the spot when handle is double-freed and share a source patch that you can apply and build your own coreclr. I could also share the binary if you'd prefer that.

@ishimko
Copy link
Author

ishimko commented Feb 28, 2020

@janvorli Yes sure I can do testing with modified coreclr.dll. I don't have experience with building it, so binary ready to go file is what I'd prefer.

@janvorli
Copy link
Member

Ok, great, thank you! I'll try to get you something on Monday.

@ishimko
Copy link
Author

ishimko commented Mar 2, 2020

@janvorli I've tried to switch one part of our code from emit using AssemblyBuilder to Roslyn and AssemblyLoadContext. And faced the same (I suppose) crash:

https://drive.google.com/file/d/1Q2ermVUhxF07JTZCwJuVOnXsgRVMA0mE/view?usp=sharing

managed trace:

System.Runtime.Loader.AssemblyLoadContext.InternalLoad(System.ReadOnlySpan`1, System.ReadOnlySpan`1) [/_/src/System.Private.CoreLib/src/System/Runtime/Loader/AssemblyLoadContext.CoreCLR.cs @ 47]
System.Runtime.Loader.AssemblyLoadContext.LoadFromStream(System.IO.Stream, System.IO.Stream)
System.Runtime.Loader.AssemblyLoadContext.LoadFromStream(System.IO.Stream) [/_/src/System.Private.CoreLib/shared/System/Runtime/Loader/AssemblyLoadContext.cs @ 340]

unmanaged trace:

coreclr!ArrayBase::GetNumComponents [f:\workspace\_work\1\s\src\vm\object.inl @ 213] 
coreclr!LoaderAllocator::AllocateHandle+0xfbd6f [f:\workspace\_work\1\s\src\vm\loaderallocator.cpp @ 789] 
coreclr!DomainAssembly::GetExposedAssemblyObject+0x185 [f:\workspace\_work\1\s\src\vm\domainfile.cpp @ 1365] 
coreclr!Assembly::GetExposedObject+0x23 [f:\workspace\_work\1\s\src\vm\assembly.cpp @ 1808] 
coreclr!AssemblyNative::LoadFromStream+0x130 [f:\workspace\_work\1\s\src\vm\assemblynative.cpp @ 354] 
System_Private_CoreLib!System.Runtime.Loader.AssemblyLoadContext.InternalLoad(System.ReadOnlySpan`1<Byte>, System.ReadOnlySpan`1<Byte>)$##600302D+0xb7
System_Private_CoreLib!System.Runtime.Loader.AssemblyLoadContext.LoadFromStream(System.IO.Stream, System.IO.Stream)$##600306E+0x11a
System_Private_CoreLib!System.Runtime.Loader.AssemblyLoadContext.LoadFromStream(System.IO.Stream)$##600306D+0xe

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

@vanashimko I have created a testing coreclr.dll that you can use as a drop-in replacement for the existing one. I have just added check for the handle double-free and let the process abort and generate crash dump when the condition is detected. This should enable us to see the call stack of the place where the second freeing happens in the crashdump. You can get it here: https://1drv.ms/u/s!AkLV4wRkyHYhyEcji1tMQVoUJXQ1?e=pO2rxZ

To use the modified coreclr.dll, you can either publish your app as self-contained rid specific (dotnet publish -r win-x64 --self-contained) or temporarily replace the coreclr.dll in the installed runtime (c:\Program Files\dotnet\shared\Microsoft.NetCore.App\3.1.1\coreclr.dll).

Please verify that the coreclr.dll you are replacing was built from commit 5558d18aafc84ecac3630ca6a31f52699cde0b66. You can check that by viewing file properties of the coreclr.dll in explorer and viewing the Product Version string on the Details tab.

@ishimko
Copy link
Author

ishimko commented Mar 3, 2020

@janvorli here is what I have: https://drive.google.com/file/d/13JmyLpup_0AYfx53-b3bG3den0R2D3fz/view?usp=sharing.

The stacktrace looks pretty strange (to me), because it contains "coreclr!coreclr_shutdown_2":

ntdll!NtTerminateProcess+0x14
ntdll!RtlExitUserProcess+0xb8
kernel32!ExitProcessImplementation+0xa
coreclr!GetCLRRuntimeHost+0x15c05
coreclr!GetCLRRuntimeHost+0x14564
coreclr!coreclr_shutdown_2+0x163c7
coreclr!coreclr_initialize+0x417ab
coreclr!GetCLRRuntimeHost+0x4bc9e
coreclr!GetCLRRuntimeHost+0x45596
coreclr!GetCLRRuntimeHost+0x4beb5
System_Private_CoreLib!System.Reflection.LoaderAllocatorScout.Destroy(IntPtr)$##600394B+0x41
System_Private_CoreLib!System.Reflection.LoaderAllocatorScout.Finalize()$##600394C+0x21
coreclr!coreclr_shutdown_2+0xcbf6

I've tried several times and always got shutdown there. Is it what we're looking for or I need to try more?

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

Most of the method names you are seeing on the stack trace are not correct as I've not shared pdb file for the coreclr.dll. Let me open your dump locally and see.

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

This is the real stack trace:

 # Child-SP          RetAddr           Call Site
00 000000c8`f0cfed58 00007fff`61a0a9b8 ntdll!ZwTerminateProcess+0x14 
01 000000c8`f0cfed60 00007fff`6167cd8a ntdll!RtlExitUserProcess+0xb8 
02 000000c8`f0cfed90 00007ffe`e0662045 kernel32!FatalExit+0xa 
03 (Inline Function) --------`-------- coreclr!EEPolicy::ExitProcessViaShim+0x8 [F:\git\coreclr2\src\vm\eepolicy.cpp @ 538] 
04 000000c8`f0cfedc0 00007ffe`e06609a4 coreclr!SafeExitProcess+0x1d9 [F:\git\coreclr2\src\vm\eepolicy.cpp @ 515] 
05 000000c8`f0cfef50 00007ffe`e059baf7 coreclr!EEPolicy::HandleFatalError+0x170 [F:\git\coreclr2\src\vm\eepolicy.cpp @ 1221] 
06 000000c8`f0cff560 00007ffe`e054fc4b coreclr!HandleTestDestroyHandle+0xbf [F:\git\coreclr2\src\vm\corhost.cpp @ 995] 
07 000000c8`f0cff5a0 00007ffe`e06980de coreclr!HndDestroyHandle+0x57 [F:\git\coreclr2\src\gc\handletable.cpp @ 380] 
08 000000c8`f0cff5e0 00007ffe`e06919d6 coreclr!AssemblyLoaderAllocator::CleanupHandles+0x3e [F:\git\coreclr2\src\vm\loaderallocator.cpp @ 1860] 
09 000000c8`f0cff610 00007ffe`e06982f5 coreclr!AppDomain::ShutdownFreeLoaderAllocators+0x52 [F:\git\coreclr2\src\vm\appdomain.cpp @ 848] 
0a 000000c8`f0cff650 00007ffe`dff43d01 coreclr!LoaderAllocator::Destroy+0x1a5 [F:\git\coreclr2\src\vm\loaderallocator.cpp @ 714] 
0b 000000c8`f0cff750 00007ffe`e007a861 System.Reflection.LoaderAllocatorScout.Destroy(IntPtr)
0c 000000c8`f0cff820 00007ffe`e0592326 System.Reflection.LoaderAllocatorScout.Finalize()
0d 000000c8`f0cff860 00007ffe`e04ea9b7 coreclr!FastCallFinalizeWorker+0x6 [F:\git\coreclr2\src\vm\amd64\CallDescrWorkerAMD64.asm @ 27] 
0e 000000c8`f0cff890 00007ffe`e04ea892 coreclr!MethodTable::CallFinalizer+0x97 [F:\git\coreclr2\src\vm\methodtable.cpp @ 4027] 
0f (Inline Function) --------`-------- coreclr!CallFinalizer+0x3d [F:\git\coreclr2\src\vm\finalizerthread.cpp @ 91] 
10 (Inline Function) --------`-------- coreclr!FinalizerThread::DoOneFinalization+0x3d [F:\git\coreclr2\src\vm\finalizerthread.cpp @ 108] 
11 000000c8`f0cff8d0 00007ffe`e04a9587 coreclr!FinalizerThread::FinalizeAllObjects+0xda [F:\git\coreclr2\src\vm\finalizerthread.cpp @ 137] 
12 000000c8`f0cff9c0 00007ffe`e04eb4b7 coreclr!FinalizerThread::FinalizerThreadWorker+0x97 [F:\git\coreclr2\src\vm\finalizerthread.cpp @ 353] 
13 (Inline Function) --------`-------- coreclr!ManagedThreadBase_DispatchInner+0xd [F:\git\coreclr2\src\vm\threads.cpp @ 7454] 
14 000000c8`f0cffa00 00007ffe`e04eb392 coreclr!ManagedThreadBase_DispatchMiddle+0x8f [F:\git\coreclr2\src\vm\threads.cpp @ 7498] 
15 000000c8`f0cffb10 00007ffe`e051346a coreclr!ManagedThreadBase_DispatchOuter+0xaa [F:\git\coreclr2\src\vm\threads.cpp @ 7663] 
16 (Inline Function) --------`-------- coreclr!ManagedThreadBase_NoADTransition+0x28 [F:\git\coreclr2\src\vm\threads.cpp @ 7734] 
17 (Inline Function) --------`-------- coreclr!ManagedThreadBase::FinalizerBase+0x28 [F:\git\coreclr2\src\vm\threads.cpp @ 7760] 
18 000000c8`f0cffbb0 00007ffe`e057b46a coreclr!FinalizerThread::FinalizerThreadStart+0x7a [F:\git\coreclr2\src\vm\finalizerthread.cpp @ 416] 
19 000000c8`f0cffcd0 00007fff`61677bd4 coreclr!Thread::intermediateThreadProc+0x8a 
1a 000000c8`f0cffd90 00007fff`61a0ced1 kernel32!BaseThreadInitThunk+0x14 
1b 000000c8`f0cffdc0 00000000`00000000 ntdll!RtlUserThreadStart+0x21 

Unfortunately it shows a path where the cleanup is expected to happen, so the problem seems to be that something else destroyed the handle incorrectly before. An additional possibility could be a race of multiple threads running the AssemblyLoaderAllocator::CleanupHandles and ending up popping the same handle from the cleanup list twice, but I don't think it is that one based on the fact that in the stresslog, the calls to the DestroyHandle were quite far from each other.

I'll need to figure out how to add some new stresslog events / instrumentation to the runtime to find out more details. I'll share another patched version of the coreclr with you once I have it.

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

Here is an updated coreclr.dll (and this time I've included the corresponding pdb too). @vanashimko can you please give it a try (with the stress log enabled)?
https://1drv.ms/u/s!AkLV4wRkyHYhyEkg3wPuUVYFCQ_H?e=p4W2KE

@ishimko
Copy link
Author

ishimko commented Mar 3, 2020

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

@vanashimko it seems that you have accidentally not replaced the coreclr.dll by the new one I've shared or accidentally used the previous one? The stresslog doesn't contain any of the new events I've added. I've looked the disassembly of one of the methods where I've added the logging and the call to the logging is not in it in the dump. But if I disassemble the coreclr.dll, the call is there.
The zip was called coreclr_2.zip.

@ishimko
Copy link
Author

ishimko commented Mar 3, 2020

@janvorli ahh sorry seems to be I've uploaded wrong dump! Could you please check this one: https://drive.google.com/file/d/1M5qCy1J8FYgKljdEX-zae6cr3qUb96Mi/view?usp=sharing

@janvorli
Copy link
Member

janvorli commented Mar 3, 2020

Great, that moved my understanding of the problem further. What happens is that the LoaderAllocator ends up being on the cleanup list twice. This is the relevant part of the stress log that shows that:

122c   8.707233700 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72670
122c   8.707233900 : `GC`                 DestroyHandle: *000002E07AD72670->0000000000000000
122c   8.707234100 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72660
122c   8.707234200 : `GC`                 DestroyHandle: *000002E07AD72660->0000000000000000
122c   8.707234300 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72648
122c   8.707234300 : `GC`                 DestroyHandle: *000002E07AD72648->0000000000000000
122c   8.707234500 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72640
122c   8.707234500 : `GC`                 DestroyHandle: *000002E07AD72640->0000000000000000
122c   8.707234700 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72638
122c   8.707234700 : `GC`                 DestroyHandle: *000002E07AD72638->0000000000000000
122c   8.707234800 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72630
122c   8.707234900 : `GC`                 DestroyHandle: *000002E07AD72630->0000000000000000
122c   8.707235000 : `CLASSLOADER`        CleanupHandles for allocator 000002E079EAA850 destroying handle 000002E07AD72630
122c   8.707235100 : `GC`                 DestroyHandle: *000002E07AD72630->0000000000000000
122c   8.707235300 : `CLASSLOADER`        Double-destroy of handle 000002E07AD72630 called from 00007FFF1EB9822B

I am going to add some more instrumentation to catch the spot when the double-adding happens.

@janvorli
Copy link
Member

janvorli commented Mar 4, 2020

@vanashimko here is a new version of the patched coreclr.dll: https://1drv.ms/u/s!AkLV4wRkyHYhyEot9KwZ3VS8N5UX?e=Jlx7Eh

I've added more stresslog entries and also added a failure right at the spot when a handle gets added into the cleanup list for the 2nd time. I hope we are really close to figuring out the culprit.

@ishimko
Copy link
Author

ishimko commented Mar 4, 2020

@janvorli Got the message "double register of a handle", seems to be the correct one: https://drive.google.com/file/d/142g0JYD8MgC91Vzx2AxA2MXojAIddsRI/view?usp=sharing

P.S. Just out of curiosity, can I look at the changes you're making somewhere?

@janvorli
Copy link
Member

janvorli commented Mar 4, 2020

Sure, you can take a look at it here. The changes are just adding more stress log events and a hashtable to keep track of allocated handles. They are meant only to debug the issue and are a bit hacky.
https://github.com/janvorli/coreclr/pull/new/loader-allocator-handle-debugging

@janvorli
Copy link
Member

janvorli commented Mar 4, 2020

So thanks to the last dump, I've found what causes the double registration of the handle. The RuntimeType.InitializeCache uses the Interlocked.CompareExchange pattern to ensure the cache handle gets created just once even if multiple threads are racing for it. It calls RuntimeTypeHandle.GetGCHandle (which registers the handle) and then tries to set the m_cache to that using Interlocked.CompareExchange. If the m_cache was already set, it destroys the just created handle again, as it means that some other thread has won the initialization race. And the destruction is the problem, as it only destroys the handle, but doesn't remove it from the cleanup list on the AssemblyLoaderAllocator. Some later call to the RuntimeTypeHandle.GetGCHandle gets the same handle and adds it again to the cleanup list. And when the AssemblyLoaderAllocator::CleanupHandles() is called at the loader allocator destruction time, we call DestroyHandle twice on it, as it is twice on that list.

Here is the piece of code I am talking about:
https://github.com/dotnet/coreclr/blob/v3.1.1/src/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs#L2413-L2421

Now I need to figure out how to fix that properly. It seems that we need to make sure that handles obtained by RuntimeTypeHandle.GetGCHandle are never freed by the GCHandle.InternalFree that doesn't know anything about the loader allocator. I am not sure yet what would the right fix for that. It seems that adding a RuntimeTypeHandle.FreeGCHandle and a method to AssemblyLoaderAllocator to remove an entry from the cleanup list would be one way.

@vanashimko to verify that it fixes the issue you are hitting completely, I'll create a patched version of System.Private.CoreLib.dll with removed call to GCHandle.InternalFree from the InitializeCache method. It would just waste one or few GC handles, but the problem should be gone as the handles on the cleanup list would never be freed at other place than the cleanup list.

@janvorli
Copy link
Member

janvorli commented Mar 4, 2020

Here is the updated System.Private.CoreLib.dll:
https://1drv.ms/u/s!AkLV4wRkyHYhyEsNZ_RpNAhe2Dgf?e=afOgD2

@ishimko
Copy link
Author

ishimko commented Mar 4, 2020

@janvorli can no longer observe the issue with patched System.Private.CoreLib.dll, tests work fine now! 🎉

janvorli added a commit to janvorli/runtime that referenced this issue Mar 5, 2020
When there is a race calling RuntimeType.InitializeCache, each of the
racing threads creates a new GC handle using
new RuntimeTypeHandle(this).GetGCHandle(GCHandleType.WeakTrackResurrection);
This ends up calling RuntimeTypeHandle::GetGCHandle native method that
adds the allocated handle into the handle cleanup list of the
AssemblyLoaderAllocator specific for the runtime type.
All but the winning thread then call GCHandle.InternalFree on the just
allocated handle. That frees the handle, but leaves it on the cleanup
list of the loader allocator. The same handle can be later allocated for some
other purpose. When the AssemblyLoaderAllocator is being destroyed, all
the handles on the cleanup list are destroyed too. So it destroys also
the handle that was left on the cleanup list incorrectly. That can cause
all kinds of hard to diagnose issues, like the dotnet#32171.

This change fixes it by adding a FreeGCHandle method on the
RuntimeTypeHandle that besides freeing the handle also removes it from
the cleanup list of the related AssemblyLoadContext.
janvorli added a commit that referenced this issue Mar 5, 2020
When there is a race calling RuntimeType.InitializeCache, each of the
racing threads creates a new GC handle using
new RuntimeTypeHandle(this).GetGCHandle(GCHandleType.WeakTrackResurrection);
This ends up calling RuntimeTypeHandle::GetGCHandle native method that
adds the allocated handle into the handle cleanup list of the
AssemblyLoaderAllocator specific for the runtime type.
All but the winning thread then call GCHandle.InternalFree on the just
allocated handle. That frees the handle, but leaves it on the cleanup
list of the loader allocator. The same handle can be later allocated for some
other purpose. When the AssemblyLoaderAllocator is being destroyed, all
the handles on the cleanup list are destroyed too. So it destroys also
the handle that was left on the cleanup list incorrectly. That can cause
all kinds of hard to diagnose issues, like the #32171.

This change fixes it by adding a FreeGCHandle method on the
RuntimeTypeHandle that besides freeing the handle also removes it from
the cleanup list of the related AssemblyLoadContext.
@ishimko
Copy link
Author

ishimko commented Mar 6, 2020

@janvorli thanks a lot for the work on this issue! Also could you please point me to any source of information about when there is a chance to have this fix released?

@janvorli
Copy link
Member

janvorli commented Mar 6, 2020

I also wanted to thank you a lot for your cooperation with investigating this nasty issue. I have merged the proper fix to the master branch yesterday (PR #33243) and I am going to create a PR to port it to 3.1 today and I hope to get it approved. I am not sure when is the next release slot for the 3.1 though, I'll let you know once I have more details.

janvorli added a commit to janvorli/coreclr that referenced this issue Mar 9, 2020
When there is a race calling RuntimeType.InitializeCache, each of the
racing threads creates a new GC handle using
new
RuntimeTypeHandle(this).GetGCHandle(GCHandleType.WeakTrackResurrection);
This ends up calling RuntimeTypeHandle::GetGCHandle native method that
adds the allocated handle into the handle cleanup list of the
AssemblyLoaderAllocator specific for the runtime type.
All but the winning thread then call GCHandle.InternalFree on the just
allocated handle. That frees the handle, but leaves it on the cleanup
list of the loader allocator. The same handle can be later allocated for
some
other purpose. When the AssemblyLoaderAllocator is being destroyed, all
the handles on the cleanup list are destroyed too. So it destroys also
the handle that was left on the cleanup list incorrectly. That can cause
all kinds of hard to diagnose issues, like the
dotnet/runtime#32171.

This change fixes it by adding a FreeGCHandle method on the
RuntimeTypeHandle that besides freeing the handle also removes it from
the cleanup list of the related AssemblyLoadContext.

 ## Customer impact
Hard to diagnose crashes in the runtime caused by closing of random
GC handles. The customer that has reported this issue was using
collectible assemblies and it was resulting in collecting
LoaderAllocator that was still in use and it lead to crashes at various
places.

 ## Regression?
Yes, it was introduced in 3.0. In 2.1 and 2.2, the thread that loses
the race destroys the handle only if the type was not in a collectible
assembly. Since the non-collectible assemblies LoaderAllocator is never
destroyed, the handles were never cleaned up and so no problem could occur.
It was introduced in dotnet#21737

 ##Testing
Customer affected by the issue heavily has tested a fixed version and
reported the issue doesn't occur anymore.

 ## Risk
Low, the new code is executed at single place once per process runtine
only when a thread races for allocating the GC handle with another one
and loses the race.
@janvorli
Copy link
Member

@vanashimko it got approved for the 3.1.4 release.

Anipik pushed a commit to dotnet/coreclr that referenced this issue Mar 25, 2020
* Port to 3.1: Fix allocation of RuntimeTypeCache GC handle

When there is a race calling RuntimeType.InitializeCache, each of the
racing threads creates a new GC handle using
new
RuntimeTypeHandle(this).GetGCHandle(GCHandleType.WeakTrackResurrection);
This ends up calling RuntimeTypeHandle::GetGCHandle native method that
adds the allocated handle into the handle cleanup list of the
AssemblyLoaderAllocator specific for the runtime type.
All but the winning thread then call GCHandle.InternalFree on the just
allocated handle. That frees the handle, but leaves it on the cleanup
list of the loader allocator. The same handle can be later allocated for
some
other purpose. When the AssemblyLoaderAllocator is being destroyed, all
the handles on the cleanup list are destroyed too. So it destroys also
the handle that was left on the cleanup list incorrectly. That can cause
all kinds of hard to diagnose issues, like the
dotnet/runtime#32171.

This change fixes it by adding a FreeGCHandle method on the
RuntimeTypeHandle that besides freeing the handle also removes it from
the cleanup list of the related AssemblyLoadContext.

 ## Customer impact
Hard to diagnose crashes in the runtime caused by closing of random
GC handles. The customer that has reported this issue was using
collectible assemblies and it was resulting in collecting
LoaderAllocator that was still in use and it lead to crashes at various
places.

 ## Regression?
Yes, it was introduced in 3.0. In 2.1 and 2.2, the thread that loses
the race destroys the handle only if the type was not in a collectible
assembly. Since the non-collectible assemblies LoaderAllocator is never
destroyed, the handles were never cleaned up and so no problem could occur.
It was introduced in #21737

 ##Testing
Customer affected by the issue heavily has tested a fixed version and
reported the issue doesn't occur anymore.

 ## Risk
Low, the new code is executed at single place once per process runtine
only when a thread races for allocating the GC handle with another one
and loses the race.

* Fix build break - subtle differences from runtime master
@GeorgeAlexandria
Copy link
Contributor

Presumably we have the same issue (AccessViolation) with LoaderAllocator on the coreclr!MethodTable::GetLoaderAllocatorObjectForGC, when app targets .net472, and would be helpful if we can check fix on our app under .netframework. So was this fix applied for .netframework or just only for .netcore, @janvorli?

@jkotas
Copy link
Member

jkotas commented Jul 29, 2020

This bug never existed in .NET Framework. It was introduced in CoreCLR by dotnet/coreclr#21737.

@GeorgeAlexandria
Copy link
Contributor

@jkotas, as I see GCHandle.InternalFree, which was a reason of issue, existed before was applied MR, that you mentioned: https://github.com/dotnet/coreclr/pull/21737/files#diff-6f4491bf559af3ad09c5a66fb49ae47cL2391. If you mean some other peace of code, that proves that bug never existed in .netframework or something else, please let me know, but currently I am still not sure for that.

@jkotas
Copy link
Member

jkotas commented Jul 29, 2020

!IsCollectible check before GCHandle.InternalFree prevented this bug from happening in .NET Framework and earlier versions of .NET Core.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants