Access violations around unmanaged LoaderAllocator when using dynamic (emitted with AssemblyBuilder) collectible assemblies #32171
cc @janvorli
I'll take a look
The interesting thing is that it is not a LoaderAllocator (named LoaderAllocatorObject on the native side), so what you've considered to be a LoaderAllocatorScout was something else too, and thus the same was true for the m_nativeLoaderAllocator:
But the MethodTable at frame 3 points to that as if it was a LoaderAllocator instance (you can find the MethodTable address from frame 4: this->m_asMT):
My guess is that the real LoaderAllocator was collected somehow and a new object of type System.RuntimeType+RuntimeTypeCache got allocated at the same address.
So the object of that type is still referenced by a register in one of the stack frames of the current thread (frame 0b). I'll keep looking; the other failure modes are also quite strange.
Thanks for looking into that @janvorli. That explains why ""
@vanashimko I wonder if you'd be able to create a simple repro of the problem that you would be able to share. That would make looking into it much easier.
@janvorli I tried to create an example multiple times, but unfortunately without success. There is a lot of code involved there. I even tried to create a separate test (based on our code) to have something reproducible internally as a starting point for an isolated example, but that didn't work. There are around 300+ tests that are running, and it can be the case that they run successfully 2-3 times in a row without any issues, and then another restart produces this crash during the first test. Because of "gc" in a stack trace I've also tried to insert GC.Collect here and there in the code, to try to force this issue somehow, but that didn't work either. As for GCHandle, the answer is no.
Ok, thank you for your answer, I'll keep digging in the dumps.
@vanashimko I got an idea that can help us diagnose it better. There is a feature called stress log that is able to log various low level details on the runtime into an in-memory buffer that is dumped as part of a crash dump. It can also log handle creation and destruction. I was wondering if you'd be able to enable that and once you hit the failures again, share the crash dumps again.
This enables the stress log buffer, logs events up to detail level 6, logs GC events (LogFacility=1), sets the stress log size per thread to 0x2000000 bytes (32MB) and the total stress log size to 0x40000000 bytes (1GB).
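For reference, the settings described above would look roughly like this as environment variables on Windows (a sketch from memory; the exact `COMPlus_*` names and hex value parsing should be treated as assumptions, not an authoritative list):

```
rem Sketch of the stress log settings described above (values are hex).
set COMPlus_StressLog=1
set COMPlus_LogLevel=6
set COMPlus_LogFacility=0x1
set COMPlus_StressLogSize=0x2000000
set COMPlus_TotalStressLogSize=0x40000000
rem A later comment in this thread suggests 401 for LogFacility to also capture LoaderAllocator events.
```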
Actually, it would be good to add LoaderAllocator events - so instead of …
@janvorli Thanks. Actually I've tried the stress log myself (enabled it using some registry changes found on the Internet), but I was not able to make use of the results because I'm not so familiar with all this stuff, and I just deleted those dumps. Now I've used the approach with env variables you described (including 401 for LogFacility) and collected the following: https://drive.google.com/file/d/1kk_RVVix-IVyIBih22Doo9_MdmgPchSf/view?usp=sharing. Hope I did not miss anything.
@vanashimko the dump and logs show something interesting. First, the reference in a handle that should be a …
As you can see, there is no DestroyHandle between the two CreateHandle logs and still the same slot was reused.
The only explanation I can see is a memory corruption or a bug in the handle table code (which seems extremely unlikely).
@vanashimko I'll be OOF next week, so I won't be able to look into it more before I get back. If you get more dumps with the stresslog enabled with stack traces similar to cases 2 and 3, please share them so that we can get more data.
@janvorli Thank you for your research. Some more dumps with stresslog enabled:
coreclr!LoaderAllocatorObject::GetHandleTable - probably similar to 3
coreclr!MethodTable::GetParentMethodTableOrIndirection - looks like a new one
I will post more if I find any (this GC trace from case 2 is pretty rare, BTW). UPD: coreclr!MethodTable::GetLoaderAllocatorObjectForGC - similar to 2
One possibility to look for is double-free or use-after-free of a GC handle.
@davidwrighton has also told me today that he has recently seen a similar bug caused by double free of a GC handle that later results in double allocation of the same handle. So I've looked at the stress log again and I can see exactly that. Extending the list of relevant items from the stress log that I've shown before:
As you can see at 19.530733500 and 20.132816500, the same handle was destroyed. The handle was pointing to a LoaderAllocator at the first call to destroy and to nothing at the second call to destroy, so it seems like we have an issue with double-freeing the LoaderAllocator handle in some cases in the runtime. I'll keep digging into it to pinpoint the case when it happens.
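For illustration, here is a minimal C# simulation of a free-list based handle table (an illustrative sketch only, not the CoreCLR implementation) showing how a double free can turn into a double allocation of the same slot, which is the failure mode hypothesized above:

```csharp
using System;
using System.Collections.Generic;

// Toy free-list "handle table" sketch.
class HandleTableSim
{
    private readonly List<object> _slots = new List<object>();
    private readonly Stack<int> _freeList = new Stack<int>();

    public int CreateHandle(object referent)
    {
        int slot = _freeList.Count > 0 ? _freeList.Pop() : NewSlot();
        _slots[slot] = referent;
        return slot;
    }

    public void DestroyHandle(int slot)
    {
        // No double-free detection, mirroring the failure mode discussed above.
        _slots[slot] = null;
        _freeList.Push(slot);
    }

    public object Get(int slot) => _slots[slot];

    private int NewSlot() { _slots.Add(null); return _slots.Count - 1; }

    static void Main()
    {
        var table = new HandleTableSim();
        int h = table.CreateHandle("LoaderAllocator");
        table.DestroyHandle(h);
        table.DestroyHandle(h);           // double free: the slot is now on the free list twice

        int h1 = table.CreateHandle("first owner");
        int h2 = table.CreateHandle("second owner");
        Console.WriteLine(h1 == h2);      // True: the same slot was handed out twice
        Console.WriteLine(table.Get(h1)); // "second owner" - the first owner's referent was silently replaced
    }
}
```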
@vanashimko would you be able to try to run your application with a modified coreclr.dll? I could add some additional checks that would trigger right at the spot when a handle is double-freed, and share a source patch that you can apply and build your own coreclr. I could also share the binary if you'd prefer that.
@janvorli Yes, sure, I can do testing with a modified coreclr.dll. I don't have experience with building it, so a ready-to-go binary is what I'd prefer.
Ok, great, thank you! I'll try to get you something on Monday.
@janvorli I've tried to switch one part of our code from emit using …
https://drive.google.com/file/d/1Q2ermVUhxF07JTZCwJuVOnXsgRVMA0mE/view?usp=sharing
managed trace:
unmanaged trace:
@vanashimko I have created a testing coreclr.dll that you can use as a drop-in replacement for the existing one. I have just added a check for the handle double-free that aborts the process and generates a crash dump when the condition is detected. This should enable us to see, in the crash dump, the call stack of the place where the second freeing happens. You can get it here: https://1drv.ms/u/s!AkLV4wRkyHYhyEcji1tMQVoUJXQ1?e=pO2rxZ
To use the modified coreclr.dll, you can either publish your app as self-contained, rid-specific (…). Please verify that the coreclr.dll you are replacing was built from commit 5558d18aafc84ecac3630ca6a31f52699cde0b66. You can check that by viewing the file properties of the coreclr.dll in Explorer and looking at the Product Version string on the Details tab.
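To make the two options above concrete, a hedged sketch (the RID, configuration, output paths and version folder are assumptions, not taken from the thread):

```
rem Option 1: publish self-contained for a specific RID, then overwrite coreclr.dll in the publish output
dotnet publish -c Release -r win-x64 --self-contained true
rem e.g. bin\Release\netcoreapp3.1\win-x64\publish\coreclr.dll

rem Option 2: replace coreclr.dll directly in the installed shared framework (affects every app on the machine)
rem e.g. C:\Program Files\dotnet\shared\Microsoft.NETCore.App\3.1.x\coreclr.dll
```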
@janvorli here is what I have: https://drive.google.com/file/d/13JmyLpup_0AYfx53-b3bG3den0R2D3fz/view?usp=sharing. The stack trace looks pretty strange (to me), because it contains "coreclr!coreclr_shutdown_2":
I've tried several times and always got the shutdown there. Is that what we're looking for, or do I need to try more?
Most of the method names you are seeing on the stack trace are not correct, as I've not shared the pdb file for the coreclr.dll. Let me open your dump locally and see.
This is the real stack trace:
Unfortunately it shows a path where the cleanup is expected to happen, so the problem seems to be that something else destroyed the handle incorrectly before. An additional possibility could be a race of multiple threads running the …
I'll need to figure out how to add some new stresslog events / instrumentation to the runtime to find out more details. I'll share another patched version of the coreclr with you once I have it.
Here is an updated coreclr.dll (and this time I've included the corresponding pdb too). @vanashimko can you please give it a try (with the stress log enabled)?
Wrong dump, see below.
@vanashimko it seems that you have accidentally not replaced the coreclr.dll with the new one I've shared, or accidentally used the previous one? The stresslog doesn't contain any of the new events I've added. I've looked at the disassembly of one of the methods where I've added the logging, and the call to the logging is not there in the dump. But if I disassemble the coreclr.dll I shared, the call is there.
@janvorli ah, sorry, it seems I've uploaded the wrong dump! Could you please check this one: https://drive.google.com/file/d/1M5qCy1J8FYgKljdEX-zae6cr3qUb96Mi/view?usp=sharing
Great, that moved my understanding of the problem further. What happens is that the LoaderAllocator ends up being on the cleanup list twice. This is the relevant part of the stress log that shows that:
I am going to add some more instrumentation to catch the spot when the double-adding happens.
@vanashimko here is a new version of the patched coreclr.dll: https://1drv.ms/u/s!AkLV4wRkyHYhyEot9KwZ3VS8N5UX?e=Jlx7Eh I've added more stresslog entries and also added a failure right at the spot when a handle gets added into the cleanup list for the 2nd time. I hope we are really close to figuring out the culprit.
@janvorli Got the message "double register of a handle", seems to be the correct one: https://drive.google.com/file/d/142g0JYD8MgC91Vzx2AxA2MXojAIddsRI/view?usp=sharing
P.S. Just out of curiosity, can I look at the changes you're making somewhere?
Sure, you can take a look at it here. The changes are just adding more stress log events and a hashtable to keep track of allocated handles. They are meant only to debug the issue and are a bit hacky.
So thanks to the last dump, I've found what causes the double registration of the handle. RuntimeType.InitializeCache uses the Interlocked.CompareExchange pattern to ensure the cache handle gets created just once even if multiple threads are racing for it. It calls RuntimeTypeHandle.GetGCHandle (which registers the handle) and then tries to set m_cache to that using Interlocked.CompareExchange. If m_cache was already set, it destroys the just-created handle again, as it means that some other thread has won the initialization race.
And the destruction is the problem, as it only destroys the handle, but doesn't remove it from the cleanup list on the AssemblyLoaderAllocator. Some later call to RuntimeTypeHandle.GetGCHandle gets the same handle and adds it again to the cleanup list. And when AssemblyLoaderAllocator::CleanupHandles() is called at loader allocator destruction time, we call DestroyHandle twice on it, as it is on that list twice. Here is the piece of code I am talking about:
Now I need to figure out how to fix that properly. It seems that we need to make sure that handles obtained by RuntimeTypeHandle.GetGCHandle are never freed by GCHandle.InternalFree, which doesn't know anything about the loader allocator. I am not sure yet what the right fix for that would be. It seems that adding a RuntimeTypeHandle.FreeGCHandle and a method on AssemblyLoaderAllocator to remove an entry from the cleanup list would be one way.
@vanashimko to verify that it fixes the issue you are hitting completely, I'll create a patched version of System.Private.CoreLib.dll with the call to GCHandle.InternalFree removed from the InitializeCache method. It would just waste one or a few GC handles, but the problem should be gone, as the handles on the cleanup list would never be freed anywhere other than from the cleanup list.
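To make the pattern concrete, here is a simplified, self-contained C# sketch of what is described above (this is not the actual CoreLib/runtime source; the cleanup list and the helper are stand-ins for the native AssemblyLoaderAllocator machinery):

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Threading;

class RuntimeTypeCacheSketch
{
    // Stand-in for the AssemblyLoaderAllocator's handle cleanup list.
    private static readonly List<IntPtr> s_cleanupList = new List<IntPtr>();

    private IntPtr m_cache;   // GC handle to the cache object, meant to be set exactly once

    // Stand-in for RuntimeTypeHandle.GetGCHandle: allocates a handle AND registers it
    // on the loader allocator's cleanup list.
    private static IntPtr GetGCHandleAndRegister(object target)
    {
        IntPtr h = GCHandle.ToIntPtr(GCHandle.Alloc(target, GCHandleType.WeakTrackResurrection));
        lock (s_cleanupList) s_cleanupList.Add(h);
        return h;
    }

    public object InitializeCache()
    {
        if (m_cache == IntPtr.Zero)
        {
            IntPtr newHandle = GetGCHandleAndRegister(new object());
            if (Interlocked.CompareExchange(ref m_cache, newHandle, IntPtr.Zero) != IntPtr.Zero)
            {
                // Lost the race: the 3.0/3.1-era code frees just the handle here
                // (via GCHandle.InternalFree) but leaves it on the cleanup list --
                // the stale entry later causes the double DestroyHandle seen in the dumps.
                GCHandle.FromIntPtr(newHandle).Free();
            }
        }
        return GCHandle.FromIntPtr(m_cache).Target;
    }
}
```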
Here is the updated System.Private.CoreLib.dll:
@janvorli I can no longer observe the issue with the patched System.Private.CoreLib.dll, tests work fine now! 🎉
When there is a race calling RuntimeType.InitializeCache, each of the racing threads creates a new GC handle using new RuntimeTypeHandle(this).GetGCHandle(GCHandleType.WeakTrackResurrection). This ends up calling the RuntimeTypeHandle::GetGCHandle native method, which adds the allocated handle into the handle cleanup list of the AssemblyLoaderAllocator specific to the runtime type.
All but the winning thread then call GCHandle.InternalFree on the just allocated handle. That frees the handle, but leaves it on the cleanup list of the loader allocator. The same handle can later be allocated for some other purpose. When the AssemblyLoaderAllocator is being destroyed, all the handles on the cleanup list are destroyed too, so it also destroys the handle that was left on the cleanup list incorrectly. That can cause all kinds of hard to diagnose issues, like dotnet#32171.
This change fixes it by adding a FreeGCHandle method on the RuntimeTypeHandle that, besides freeing the handle, also removes it from the cleanup list of the related AssemblyLoadContext.
@janvorli thanks a lot for the work on this issue! Also, could you please point me to any information about when this fix might be released?
I also wanted to thank you a lot for your cooperation in investigating this nasty issue. I have merged the proper fix to the master branch yesterday (PR #33243) and I am going to create a PR to port it to 3.1 today, and I hope to get it approved. I am not sure when the next release slot for 3.1 is, though; I'll let you know once I have more details.
Fix allocation of RuntimeTypeCache GC handle (port of the fix described above to 3.1).
## Customer impact
Hard to diagnose crashes in the runtime caused by closing of random GC handles. The customer that reported this issue was using collectible assemblies; it resulted in collecting a LoaderAllocator that was still in use and led to crashes at various places.
## Regression?
Yes, it was introduced in 3.0. In 2.1 and 2.2, the thread that loses the race destroys the handle only if the type was not in a collectible assembly. Since the non-collectible assemblies' LoaderAllocator is never destroyed, the handles were never cleaned up and so no problem could occur. It was introduced in dotnet#21737.
## Testing
The customer heavily affected by the issue has tested a fixed version and reported the issue doesn't occur anymore.
## Risk
Low, the new code is executed at a single place, only once per process runtime, when a thread races for allocating the GC handle with another one and loses the race.
@vanashimko it got approved for the 3.1.4 release.
* Port to 3.1: Fix allocation of RuntimeTypeCache GC handle
* Fix build break - subtle differences from runtime master
Presumably we have the same issue (AccessViolation) with …
This bug never existed in .NET Framework. It was introduced in CoreCLR by dotnet/coreclr#21737.
@jkotas, as I see …
After migration to .NET Core 3.1 from 2.2 we've started facing a dotnet runtime crash during our test runs:
The active test run was aborted. Reason: Test host process crashed : Fatal error. Internal CLR error. (0x80131506)
The problem also occurs with .NET Core 3.0, both on Windows and Linux. The crash is not 100% reproducible (tests need to be rerun multiple times to catch this situation) and unfortunately we were not able to locate the problem and build a small reproducible example. But there are several crash dumps showing different aspects of the problem.
We're using `AssemblyBuilder` to emit collectible (`AssemblyBuilderAccess.RunAndCollect`) assemblies dynamically. Under some circumstances an attempt to use a type/object from such an assembly produces an access violation (reading memory near zero). When using `AssemblyBuilderAccess.Run` instead of `RunAndCollect` the issue is not reproduced. So it looks like the issue is related to the introduction of collectible assemblies in .NET Core 3.0.
There are several crash dumps collected on Windows we can share. I will add some thoughts on the analysis performed so far.
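For context, here is a minimal C# sketch of the kind of dynamic, collectible-assembly usage described above (names are arbitrary; this snippet alone does not reproduce the crash):

```csharp
using System;
using System.Reflection;
using System.Reflection.Emit;

class CollectibleEmitSketch
{
    static void Main()
    {
        // Collectible dynamic assembly; with AssemblyBuilderAccess.Run the issue did not reproduce.
        AssemblyBuilder asm = AssemblyBuilder.DefineDynamicAssembly(
            new AssemblyName("DynamicProxies"),
            AssemblyBuilderAccess.RunAndCollect);
        ModuleBuilder module = asm.DefineDynamicModule("Main");
        TypeBuilder tb = module.DefineType("GeneratedType", TypeAttributes.Public);
        Type generated = tb.CreateType();

        // Using a type/object from the collectible assembly is where the access
        // violations were observed under .NET Core 3.0/3.1.
        object instance = Activator.CreateInstance(generated);
        Console.WriteLine(instance.GetType().FullName);
    }
}
```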
1. coreclr!LoaderAllocator::GetHandleValueFastPhase2
Dump: https://drive.google.com/file/d/1bZ80x7fQ_kkc3mus9KI1BT8Qi1fiNLpq/view?usp=sharing
Managed stack trace (copying only .NET part):
WinDbg output
From the unmanaged side this crash was produced by the following line (`handleTable` is `null`): runtime/src/coreclr/src/vm/loaderallocator.inl, line 135 at 6d39d5d
By redoing some address arithmetic starting from `mov rax, [rdi+18h]` it can be determined that the `LoaderAllocatorObject` is 000001d8`1b940568:
WinDbg output
Going further into `m_pLoaderAllocatorScout` and then to `m_nativeLoaderAllocator`, the memory looks a bit strange:
WinDbg output
From the CLR stack this dynamic assembly does not look like one that should have been unloaded. Here is the `AssemblyBuilder` which owns the type on which the crash occurred:
2. coreclr!MethodTable::GetLoaderAllocatorObjectForGC
Dump: https://drive.google.com/file/d/1RSjdjAJ0TTjpH5QmcmGIheoHCk-il2al/view?usp=sharing
The line that produced the crash (it looks like it tries to dereference an invalid handle returned from `GetLoaderAllocatorObjectHandle()`): runtime/src/coreclr/src/vm/methodtable.cpp, line 9889 at 6d39d5d
There is an interesting unmanaged stack trace here showing GC and JIT parts:
WinDbg output
The `LoaderAllocator` address seems to be 00007ff8`ba020a88 (and its boolean flags section looks suspicious too). `m_hLoaderAllocatorObjectHandle` has the value of 0x00000000`00000080 and it produces the crash:
WinDbg output
3. coreclr!LoaderAllocatorObject::GetHandleTable
Dump: https://drive.google.com/file/d/1n8p-JeOCrUS5qjBkOyiuQmP16kJYfAR4/view?usp=sharing
The crash is produced by the following line (`loaderAllocator` of type `LoaderAllocatorObject` is `null`): runtime/src/coreclr/src/vm/loaderallocator.cpp, line 936 at 6d39d5d
WinDbg output
The most strange thing here is that this allocator is referenced from a managed `LoaderAllocatorScout`. By dumping all `LoaderAllocatorScout` objects from the managed heap and looking inside their `m_nativeLoaderAllocator` fields for the address 000001ad`f3128470 (the unmanaged `LoaderAllocator`), it can be determined that the `LoaderAllocatorScout` is present in the managed heap and is not collected. However, the handle from unmanaged to managed (`m_hLoaderAllocatorObjectHandle`) is wrong.
) is wrong.WinDbg output