Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AV in WKS::gc_heap::relocate_address #34279

Closed
jkotas opened this issue Mar 30, 2020 · 35 comments · Fixed by #35026
Closed

AV in WKS::gc_heap::relocate_address #34279

jkotas opened this issue Mar 30, 2020 · 35 comments · Fixed by #35026

Comments

@jkotas
Copy link
Member

jkotas commented Mar 30, 2020

Hit in Common.Tests on Windows x86 Checked in #34261:

ntdll!KiUserExceptionDispatcher+0xf [d:\rs1\minkernel\ntos\rtl\i386\userdisp.asm @ 597] 
coreclr!WKS::tree_search+0x6 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 24978] 
coreclr!WKS::gc_heap::relocate_address+0x5e [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 25068] 
coreclr!WKS::GCHeap::Relocate+0x45 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 36864] 
coreclr!PromoteCarefully+0x83 [F:\workspace\_work\1\s\src\coreclr\src\vm\siginfo.cpp @ 4875] 
coreclr!GcEnumObject+0x4b [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.common.cpp @ 167] 
coreclr!EECodeManager::EnumGcRefs+0xc48 [F:\workspace\_work\1\s\src\coreclr\src\vm\eetwain.cpp @ 4702] 
coreclr!GcStackCrawlCallBack+0x152 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.common.cpp @ 290] 
coreclr!Thread::MakeStackwalkerCallback+0x48 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 841] 
coreclr!Thread::StackWalkFramesEx+0x193 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 917] 
coreclr!Thread::StackWalkFrames+0x159 [F:\workspace\_work\1\s\src\coreclr\src\vm\stackwalk.cpp @ 1001] 
coreclr!ScanStackRoots+0x1e9 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.cpp @ 146] 
coreclr!GCToEEInterface::GcScanRoots+0xe4 [F:\workspace\_work\1\s\src\coreclr\src\vm\gcenv.ee.cpp @ 175] 
coreclr!WKS::gc_heap::relocate_phase+0x35 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 25922] 
coreclr!WKS::gc_heap::plan_phase+0x1aa8 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 24246] 
coreclr!WKS::gc_heap::gc1+0x173 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 16782] 
coreclr!WKS::gc_heap::garbage_collect+0x3bd [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 18414] 
coreclr!WKS::GCHeap::GarbageCollectGeneration+0x239 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 38169] 
coreclr!WKS::gc_heap::trigger_gc_for_alloc+0x40 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14039] 
coreclr!WKS::gc_heap::try_allocate_more_space+0x1ba [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14157] 
coreclr!WKS::gc_heap::allocate_more_space+0x14 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14587] 
coreclr!WKS::gc_heap::allocate+0x31 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 14618] 
coreclr!WKS::GCHeap::Alloc+0x199 [F:\workspace\_work\1\s\src\coreclr\src\gc\gc.cpp @ 37242] 
coreclr!Alloc+0x13e [F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp @ 231] 
coreclr!AllocateSzArray+0x2ea [F:\workspace\_work\1\s\src\coreclr\src\vm\gchelpers.cpp @ 486] 
coreclr!JIT_NewArr1+0x185 [F:\workspace\_work\1\s\src\coreclr\src\vm\jithelpers.cpp @ 2718] 
System_Private_CoreLib!System.Span`1[[System.Byte, System.Private.CoreLib]].ToArray()+0x9a6b2fa1 [/_/src/libraries/System.Private.CoreLib/src/System/Span.cs @ 477] 
Common_Tests!System.Net.Test.Common.VirtualNetworkStream.Write(Byte[], Int32, Int32)+0x67
Common_Tests!System.Net.Test.Common.VirtualNetworkStreamTest.VirtualNetworkStream_SingleThreadIntegrityTest_Ok()+0x137
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-GC-coreclr untriaged New issue has not been triaged by the area owner labels Mar 30, 2020
@jkotas
Copy link
Member Author

jkotas commented Mar 30, 2020

Crash while stackwalking Interop+Kernel32.GetFullPathNameW(Char ByRef, UInt32, Char ByRef, IntPtr)

@jkotas
Copy link
Member Author

jkotas commented Mar 30, 2020

@fadimounir This looks like a duplicate of #31809

@fadimounir
Copy link
Contributor

Thanks @jkotas, I will take a look. Are there crash dumps in the failures since I wasn't able to repro this locally?

@jkotas
Copy link
Member Author

jkotas commented Mar 30, 2020

I have saved the dump and binaries at \jkotas9\drops\issue-34279 so that it won't get lost.

@fadimounir
Copy link
Contributor

I took a look at the dump, and it looks like we're enumerating the gs cookie of the InlinedCallFrame as a gcref object. Still not sure why

@fadimounir
Copy link
Contributor

@jkotas I've been able to trace the root cause of this issue. The problem is here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/vm/i386/PInvokeStubs.asm#L58

We're using the return address from the call to the JIT_PInvokeBegin helper, and not the return address of the actual PInvoke. The effect of this is that we report GC roots using incorrect GC data. I've been able to manually get a similar state under debugger, and saw that we were enumerating the gs cookie of the InlinedCallFrame. When I manually got this state under debugger however, the Promote() function was able to ignore this bad value, since it detected it wasn't a valid address in the GC heaps. In the crash dump though, things were not as lucky (maybe the gs cookie value we use in checked builds (0x12345678) fell between the low/high range of the gc heap?)

Anyways, I did test under debugger what the correct behavior would have been by manually patching the return address in the InlinedCallFrame to the correct one, and the GC refs enumeration seemed to be doing the correct thing (nothing was being enumerated from the pinvoke frame at that point).

I think the fix is to pass the correct return address value as a parameter to the JIT_PInvokeBegin helper here: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/lower.cpp#L3635

@dotnet/jit-contrib I need some help figuring out how to do that.

@fadimounir
Copy link
Contributor

fadimounir commented Apr 2, 2020

BTW, the bug exists on all architectures, all platforms, and not just on Windows-x86. It's just extremely rare that it would ever fail

@fadimounir fadimounir self-assigned this Apr 2, 2020
@fadimounir fadimounir removed the untriaged New issue has not been triaged by the area owner label Apr 2, 2020
@jkotas
Copy link
Member Author

jkotas commented Apr 2, 2020

I think the fix is to pass the correct return address value as a parameter to the JIT_PInvokeBegin helper

That is one option. Other option is to fix the GCInfo reporting so that the GCInfo at the JIT_PInvokeBegin IP is same as the GCInfo at the actual PInvoke.

@fadimounir fadimounir assigned AndyAyersMS and unassigned fadimounir Apr 3, 2020
@fadimounir
Copy link
Contributor

@AndyAyersMS could you please forward this issue to someone on your team for a fix? Thanks!

@AndyAyersMS
Copy link
Member

cc @BruceForstall

@BruceForstall BruceForstall added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-ReadyToRun-coreclr labels Apr 8, 2020
@BruceForstall BruceForstall added this to the 5.0 milestone Apr 8, 2020
@AndyAyersMS
Copy link
Member

@fadimounir do you still have the GC info reported for the method? Looking at the jit code I don't see offhand how the GC state could change between the PINVOKE_BEGIN and the actual pinvoke.

@jkotas
Copy link
Member Author

jkotas commented Apr 11, 2020

This is specific to R2R codegen. If you would like to debug it under JIT, you can set flags.Set(CORJIT_FLAGS::CORJIT_FLAG_USE_PINVOKE_HELPERS); flag. It will hit the same problem.

The GC Info looks like this:

GC info 142CFEB8
Method info block:
    method      size   = 0065
    prolog      size   = 20 
    epilog      size   = 10 
    epilog     count   =  1 
    epilog      end    = yes  
    callee-saved regs  = EDI ESI EBX EBP 
    ebp frame          = yes  
    fully interruptible= no  
    double align       = no  
    arguments size     =  2 DWORDs
    stack frame size   = 12 DWORDs
    untracked count    =  2 
    var ptr tab count  =  0 
    epilog        at   005B
65 AA DC AC F4 | 
BD BE 3F       | 

Pointer table:
0D             |             [EBP-10H] an untracked pinned byref local
04             |             [EBP-14H] an untracked pinned byref local
F9 26 C0 C0    | 0026        call [ ESI'EBX'] argMask=00
F9 1F 05 05    | 0045        call [ ] argMask=05 (iargs=05)
FF             | 

The problem is the F9 1F 05 05 | 0045 call [ ] argMask=05 (iargs=05). It corresponds to PINVOKE_BEGIN call. PINVOKE_BEGIN should have no arg mask, just like the actual call that does the PInvoke.

@jkotas
Copy link
Member Author

jkotas commented Apr 14, 2020

I am trying to understand why we are doing the GC reporting for CORINFO_HELP_JIT_PINVOKE_BEGIN call, but not the actual PInvoke call instruction. Is the problem that the actual PInvoke arguments are pushed on the stack when the helper is called? If it is the case, would it help to emit the CORINFO_HELP_JIT_PINVOKE_BEGIN helper call before the arguments are pushed to the stack?

Thinking about consequences of adding yet another argument for the CORINFO_HELP_JIT_PINVOKE_BEGIN helper: It will mean that the helper is going to take 3 arguments on x86 (PInvoke Frame address, return address, stack size to pop). One of the arguments is going to be passed via the stack. Is it going to cause more problems with making sure that the helper can accurately capture the bits of context that it needs?

@AndyAyersMS
Copy link
Member

In the case I'm looking at we currently don't emit anything for the pinvoke call because of this early out in emitRecordGCcall:

#ifdef JIT32_GCENCODER
// The JIT32 GCInfo encoder allows us to (as the comment previously here said):
// "Bail if this is a totally boring call", but the GCInfoEncoder/Decoder interface
// requires a definition for every call site, so we skip these "early outs" when we're
// using the general encoder.
if (regs == 0)
{
#if EMIT_TRACK_STACK_DEPTH
if (emitCurStackLvl == 0)
return;

@AndyAyersMS
Copy link
Member

If I disable the early bail out then we see reports for the pinvoke and the pivoke end calls:

Pointer table:
0D             |             [EBP-10H] an untracked pinned byref local
04             |             [EBP-14H] an untracked pinned byref local
F9 32 03 03    | 0032        call [ ] argMask=03 (iargs=03)
02             | 0034        call [ ] argMask=00
0B             | 003F        call [ ] argMask=00
FF 

I don't understand how the early bail out is safe. Seems like the logic should be tracking whether the call we're looking at has different GC reporting than the previous call, or something like that.

@briansull
Copy link
Contributor

briansull commented Apr 14, 2020

I thought that the way that partially interruptible GC reporting worked on x86 was that if nothing is recorded for a particular offset then nothing get reported as being alive. It should not be reporting what it saw alive at a earlier callsite. Fully interruptible does keep updating the state and reports what it has incrementally computed as being alive. But that isn't how partially interruptible works.
It just looks up the GC info at the exact address and reports whatever is there. If there isn't anything for the current offset then nothing is reported as being alive.

@AndyAyersMS
Copy link
Member

Is the problem that the actual PInvoke arguments are pushed on the stack when the helper is called?

This is why the pinvoke begin gc descriptor has a nonzero arg mask, yes.

Maybe the issue is that the pinvoke args are byrefs and not native ints?

IL_0010  07                ldloc.1     
IL_0011  08                ldloc.2     
IL_0012  28 01 00 00 06    call         0x6000001
IL_0017  20 28 00 00 00    ldc.i4       0x28
IL_001c  58                add         
IL_001d  4d                ldind.i     
IL_001e  4d                ldind.i     
IL_001f  29 ff ff ff 11    calli        0x11FFFFFF

;  V03 loc1            byref  pinned
;  V04 loc2            byref  pinned

               [000021] -ACXG-------              *  ASG       int   
               [000020] D------N----              +--*  LCL_VAR   int    V06 loc4         
               [000019] --CXG-------              \--*  CALL ind unman int   
               [000010] ------------ arg0            +--*  LCL_VAR   byref  V04 loc2         
               [000009] ------------ arg1            +--*  LCL_VAR   byref  V03 loc1         
               [000018] ------------ calli tgt       \--*  LCL_VAR   int    V08 tmp1 

Seems like the IL or the jit could cast these to non-gc types.

@briansull
Copy link
Contributor

What is the signature of the callsite? I believe that is what determines if it reports a byref or not.

@jkotas
Copy link
Member Author

jkotas commented Apr 14, 2020

Maybe the issue is that the pinvoke args are byrefs and not native ints?

Yes, good point. The ECMA specs says:

When a method is called, an unmanaged pointer (type native unsigned int or *) is
permitted to match a parameter that requires a managed pointer (type &). The reverse,
however, is not permitted since it would allow a managed pointer to be “lost” by the
memory manager.

I guess that we can start with declaring this to be invalid IL, and fix the IL stub generation accordingly.

@jkotas
Copy link
Member Author

jkotas commented Apr 14, 2020

But it would be also useful to add at least an assert to the JIT that fires when GC tracked value is passed into PInvoke.

@AndyAyersMS
Copy link
Member

Looks like the pinvoke call site sig has nativeint, but the jit doesn't do anything with this particular mismatch (see impPopCallArgs).

Seems like this mostly happens with pinvokes that take byrefs.

Assert is something like:

@@ -6737,9 +6737,16 @@ void Compiler::impPopArgsForUnmanagedCall(GenTree* call, CORINFO_SIG_INFO* sig)
         assert(thisPtr->TypeGet() == TYP_I_IMPL || thisPtr->TypeGet() == TYP_BYREF);
     }

-    for (GenTreeCall::Use& use : GenTreeCall::UseList(args))
+    for (GenTreeCall::Use& argUse : GenTreeCall::UseList(args))
     {
-        call->gtFlags |= use.GetNode()->gtFlags & GTF_GLOB_EFFECT;
+        // We should not be passing gc typed args to an unmanaged call.
+        GenTree* arg = argUse.GetNode();
+        if (varTypeIsGC(arg->TypeGet()))
+        {
+            assert(!"*** invalid IL: gc type passed to unmanaged call");
+        }
+
+        call->gtFlags |= arg->gtFlags & GTF_GLOB_EFFECT;

@jkotas jkotas added area-Interop-coreclr and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 14, 2020
@jkotas
Copy link
Member Author

jkotas commented Apr 14, 2020

@jkoritzinsky @AaronRobinsonMSFT The interop stub generators are generating invalid IL (passing byref values as pointer arguments) that is hitting subtle issues in the JIT.

Can you please take a look at fixing the stub generators and adding the assert that Andy suggested?

@AndyAyersMS
Copy link
Member

I see 93 cases when crossgenning SPC for x86 (changing the assert to a printf) -- some of these stubs have multiple byref args, so maybe 50 stubs in all?

*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(System.String,byref,int):int
*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(System.String,int,byref,int,int,int):Microsoft.Win32.SafeHandles.SafeFindHandle
*** invalid IL: gc type passed to unmanaged call, in ILStubClass:IL_STUB_PInvoke(byref,byref):int
... etc ...

@AaronRobinsonMSFT
Copy link
Member

@jkotas and @AndyAyersMS Will take over issue.

@AaronRobinsonMSFT AaronRobinsonMSFT self-assigned this Apr 14, 2020
@ghost ghost closed this as completed in #35026 Apr 18, 2020
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jul 10, 2020
Make the jit more robust in cases where the IL producer is passing a byref
to an unmanaged caller, by retyping the argument as native int.

Allows the jit to produce self-consistent GC info and avoid the issues
seen in dotnet#34279, at least for byrefs.

Closes dotnet#39040.
AndyAyersMS added a commit that referenced this issue Jul 11, 2020
Make the jit more robust in cases where the IL producer is passing a byref
to an unmanaged caller, by retyping the argument as native int.

Allows the jit to produce self-consistent GC info and avoid the issues
seen in #34279, at least for byrefs.

Closes #39040.
@mjsabby
Copy link
Contributor

mjsabby commented Jul 13, 2020

A couple of questions:

(1) Does this not impact x64 for some reason? I don't understand why not if that is the case.
(2) Should we add this to IL Verification also?

I'm trying to determine if I'm hitting this or not, and I don't have a good way to audit all my generated IL at runtime in a system where this is happening.

@AndyAyersMS
Copy link
Member

I would guess it is x86 specific. x86 uses a different gc encoding and may rely more heavily on the pinvoke signature to make sense of the stack.

I can get you a version of the jit that will (optionally) fail with "invalid program" errors instead of asserts or silent casts, if you want to look for this sort of IL in your apps by running tests.

@mjsabby
Copy link
Contributor

mjsabby commented Jul 15, 2020

@AndyAyersMS I see you've made a change that ignores this assert. I suppose if I use that as a baseline build for my testing I can assume that it is not the problem. And you've also said it doesn't seem likely on x64.

@AndyAyersMS
Copy link
Member

I see you've made a change that ignores this assert

Yes, more or less, at least for byrefs, motivated by IL from C++/CLI that passes byrefs to unmanaged callers.

We're hoping that the IL producer has somehow guaranteed that the byref won't change values across a GC, and want to make sure we don't cause the program to fail for other reasons.

We could just reject the IL, but since we've tolerated this for a long time it would likely cause more trouble than it's worth. But you might want to do that if you are able/willing to fix up your IL producers.

you've also said it doesn't seem likely on x64.

I might be wrong.... @jkotas ?

@jkotas
Copy link
Member Author

jkotas commented Jul 15, 2020

I agree with your statement: "It doesn't seem likely on x64". We have not seen an actual problem caused by it on x64.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
This issue was closed.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.