Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failed: JIT\Performance\CodeQuality\BenchmarksGame\pidigits\pidigits-3\pidigits-3.cmd #13720

Closed
VincentBu opened this issue Nov 4, 2019 · 7 comments · Fixed by #37432
Closed
Assignees
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows
Milestone

Comments

@VincentBu
Copy link
Contributor

Job:
coreclr-gcstress-extra #20191102.1

Detail:
https://helix.dot.net/api/2019-06-17/jobs/684cbdca-f557-48d4-8a7d-b9d9c506a4cf/workitems/JIT/console

OS & Arch:
windows x64

Mode:
COMPlus_TieredCompilation=0
COMPlus_GCStress=0xC
COMPlus_JitStress=2

Note:
This failure repros on our windows x64 machine.

Log:

JIT\Performance\CodeQuality\BenchmarksGame\pidigits\pidigits-3\pidigits-3.cmd [FAIL]
      
      Assert failure(PID 5464 [0x00001558], Thread: 2764 [0x0acc]): SanityCheck()
      
      CORECLR! MethodTable::Validate + 0x3F (0x00007ffa`13215013)
      CORECLR! Object::ValidateInner + 0x127 (0x00007ffa`13136f77)
      CORECLR! Object::Validate + 0x131 (0x00007ffa`13136e11)
      CORECLR! WKS::GCHeap::Promote + 0x9F (0x00007ffa`1356afcf)
      CORECLR! PromoteCarefully + 0xAB (0x00007ffa`1316225b)
      CORECLR! GcEnumObject + 0x61 (0x00007ffa`13476c71)
      CORECLR! GcInfoDecoder::ReportSlotToGC + 0xCD (0x00007ffa`1353e175)
      CORECLR! GcInfoDecoder::EnumerateLiveSlots + 0xCA1 (0x00007ffa`1353d399)
      CORECLR! EECodeManager::EnumGcRefs + 0x39B (0x00007ffa`131a9c6b)
      CORECLR! GcStackCrawlCallBack + 0x2CD (0x00007ffa`1347711d)
          File: F:\workspace\_work\1\s\src\vm\methodtable.cpp Line: 9907
          Image: C:\dotnetbuild\work\684cbdca-f557-48d4-8a7d-b9d9c506a4cf\Payload\CoreRun.exe
      
      
      Return code:      1
      Raw output file:      C:\dotnetbuild\work\684cbdca-f557-48d4-8a7d-b9d9c506a4cf\Work\f0773ebb-06b1-46fb-8104-1bbf4458304f\Exec\JIT\Performance\Reports\JIT.Performance\CodeQuality\BenchmarksGame\pidigits\pidigits-3\pidigits-3.output.txt
      Raw output:
      BEGIN EXECUTION
       "C:\dotnetbuild\work\684cbdca-f557-48d4-8a7d-b9d9c506a4cf\Payload\corerun.exe" pidigits-3.dll 
      Expected: 100
      Actual: -1073740286
      END EXECUTION - FAILED
      FAILED
      Test Harness Exitcode is : 1

category:testing
theme:gc-stress
skill-level:intermediate
cost:medium

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 5.0 milestone Jan 31, 2020
@v-haren
Copy link

v-haren commented Apr 27, 2020

failed again in job: runtime-coreclr gcstress0x3-gcstress0xc 20200426.1

failed test:
Interop\ArrayMarshalling\BoolArray\MarshalBoolArrayTest\MarshalBoolArrayTest.cmd

Error message

Assert failure(PID 5184 [0x00001440], Thread: 1224 [0x04c8]): SanityCheck()

CORECLR! MethodTable::Validate + 0x3F (0x00007ff950f89f67)
CORECLR! Object::ValidateInner + 0x19B (0x00007ff950ed064b)
CORECLR! Object::Validate + 0x10A (0x00007ff950ed046a)
CORECLR! GcInfoDecoder::ReportStackSlotToGC + 0x12A (0x00007ff95110a046)
CORECLR! GcInfoDecoder::ReportSlotToGC + 0x168 (0x00007ff951109f00)
CORECLR! GcInfoDecoder::ReportUntrackedSlots + 0x69 (0x00007ff95110a201)
CORECLR! GcInfoDecoder::EnumerateLiveSlots + 0xD5E (0x00007ff9511090ba)
CORECLR! EECodeManager::EnumGcRefs + 0x378 (0x00007ff95105fc38)
CORECLR! GcStackCrawlCallBack + 0x2CC (0x00007ff9511b803c)
CORECLR! Thread::MakeStackwalkerCallback + 0x6B (0x00007ff950cb9a1f)
 File: F:\workspace\_work\1\s\src\coreclr\src\vm\methodtable.cpp Line: 9822
 Image: C:\h\w\A1C6091D\p\CoreRun.exe


Return code: 1
Raw output file: C:\h\w\A1C6091D\w\C75D0A28\e\Interop\ArrayMarshalling\Reports\Interop.ArrayMarshalling\BoolArray\MarshalBoolArrayTest\MarshalBoolArrayTest.output.txt
Raw output:
BEGIN EXECUTION
 "C:\h\w\A1C6091D\p\corerun.exe" MarshalBoolArrayTest.dll 
Expected: 100
Actual: -1073740286
END EXECUTION - FAILED
FAILED
Test Harness Exitcode is : 1
To run the test:

> set CORE_ROOT=C:\h\w\A1C6091D\p
> C:\h\w\A1C6091D\w\C75D0A28\e\Interop\ArrayMarshalling\BoolArray\MarshalBoolArrayTest\MarshalBoolArrayTest.cmd
Expected: True
Actual: False


Stack trace
   at Interop_ArrayMarshalling._BoolArray_MarshalBoolArrayTest_MarshalBoolArrayTest_._BoolArray_MarshalBoolArrayTest_MarshalBoolArrayTest_cmd() in F:\workspace\_work\1\s\artifacts\tests\coreclr\Windows_NT.x64.Checked\TestWrappers\Interop.ArrayMarshalling\Interop.ArrayMarshalling.XUnitWrapper.cs:line 140

@BruceForstall
Copy link
Member

MarshalBoolArrayTest failure is Windows x64 GCStress=C, no JIT stress modes

@AndyAyersMS
Copy link
Member

Looks like pidigits-3 no longer fails.

The MarshalBoolArrayTest failure repros for me locally once in every 5 runs or so, so the failure is intermittent.

@AndyAyersMS
Copy link
Member

I can't repro the MarshalBoolArrayTest failure locally anymore. It failed again on May 9 and there is a helix dump file available but I get an error trying to download.

Will work a bit harder on getting a local repro, but looking like we'll have to close this and wait for it to happen again in CI.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jun 3, 2020

Got a local repro, though it is intermittent. Still figuring out what might have gone wrong.

Failure is while scanning the stack for a reverse pinvoke IL stub. An untracked GC ref on the stub frame fails to validate as a heap object. GC is invoked by a stress interrupt in that same frame.

The jit doesn't manipulate this slot other than to zero it out in the prolog; after that all the updates are done on the runtime side by MngdNativeArrayMarshaler methods. So it's a bit puzzling how the gc ref in the jit frame could go bad. Heap is not corrupted, so the only explanation I have now is that some stack walk doesn't reach the that jit frame and the object gets collected.

@AndyAyersMS
Copy link
Member

Have a runtime log that shows that in GCs 2859 and 2861, we walk thread 1 stack and see the reverse pinvoke stub, but for some reason, in GC 2860, we don't see this frame and so the object gets collected.

It looks like 2680 is not a stress mode triggered GC.

In the reverse pinvoke method we're past the call to CORINFO_HELP_JIT_REVERSE_PINVOKE_EXIT, so I wonder if the issue is that if we get a non-stress mode GC between that point and the method epilog we are ignoring the frame, but if we get a stress mode GC there we're not ignoring the frame, and so if we have both GCs happen then things blow up.

We have:

00007FFDE5975A9A 4C 8B 45 28          mov         r8,qword ptr [rbp+28h]                // stack slot in question
00007FFDE5975A9E FF D0                call        rax        // call managed method
00007FFDE5975AA0 85 C0                test        eax,eax  
00007FFDE5975AA2 40 0F 95 C6          setne       sil  
00007FFDE5975AA6 40 0F B6 F6          movzx       esi,sil  
00007FFDE5975AAA 48 8D 4D 18          lea         rcx,[rbp+18h]  
00007FFDE5975AAE E8 FD 20 46 5F       call        JIT_ReversePInvokeExit (07FFE44DD7BB0h)  
00007FFDE5975AB3 8B C6                mov         eax,esi  
00007FFDE5975AB5 48 B9 78 56 34 12 F0 DE BC 9A mov         rcx,9ABCDEF012345678h       // gc reporting was ok here
00007FFDE5975ABF 48 39 4D 10          cmp         qword ptr [rbp+10h],rcx              // but not here

Presumably we unlink the frame at the reverse pinvoke exit, and after that point, the reverse pinvoke stub's frame vanishes from normal stackwalks. We then do a stress mode GC at B3 and B5 and since GC is started from this thread we see the frame for some reason. We then get a non-stress mode GC also at B5 and does not see the frame, so GC collects the bool array pointed to by rbp+28 and zeros out the method table and data.We then hit a stress mode GC at BF and blow up.

Normally there isn't much (any?) code between the pinvoke exit and the epilog, but here we have a GS check so there is perhaps a bit bigger window in which we can see a non-stress GC followed by a stress GC.

And presumably this won't happen w/o gc stress as we would normally not initiate a GC in that code range from that thread.

If the above is accurate, we perhaps could fix this by one of:

  • marking the method as uninterruptible after the call to reverse pinvoke exit
  • (perhaps) avoid placing gc stress interrupts after such a call
  • extend the reach of the epilog somehow
  • tinker with stack walk for threads that initiate a GC via stress so they don't walk methods like this that are already unlinked?

@AndyAyersMS
Copy link
Member

Looks like reverse pinvoke exit just puts the thread into preemptive mode. So perhaps we could fix as follows: when we have a stress interrupt, if we're in managed code and we're also in preemptive mode, assert that we're in a reverse pinvoke stub (perhaps), and then don't induce a GC.

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jun 4, 2020
Reverse PInvoke stubs change the thread to preemptive mode before
they exit. Often the epilog immediately follows this helper call,
but in some cases there may be a region with a few instructions
before we reach the epilog.

If there is a normal GC triggered while the method is in this region
then the runtime will not report GC references from the stub frame.

If the stub method is fully interruptible we will put GC coverage
interrupts in this region when doing GC stress. When these trigger a GC
then the runtime will report GC references from the stub frame.

So if while executing in this region we happen to do a normal GC and then
a stress invoked GC, the stress mode GC can find invalid GC references
in the frame, as the objects referenced could have been collected/moved
by the normal GC that happened just before.

There are a couple of avenues for a fix, but since this is a stress mode
only problem, I've modified GC stress avoid invoking a GC when the thread
is already in preemptive mode and the method is a reverse PINvoke stub.

Fixes dotnet#13720.
AndyAyersMS added a commit that referenced this issue Jun 5, 2020
Reverse PInvoke stubs change the thread to preemptive mode before
they exit. Often the epilog immediately follows this helper call,
but in some cases there may be a region with a few instructions
before we reach the epilog.

If there is a normal GC triggered while the method is in this region
then the runtime will not report GC references from the stub frame.

If the stub method is fully interruptible we will put GC coverage
interrupts in this region when doing GC stress. When these trigger a GC
then the runtime will report GC references from the stub frame.

So if while executing in this region we happen to do a normal GC and then
a stress invoked GC, the stress mode GC can find invalid GC references
in the frame, as the objects referenced could have been collected/moved
by the normal GC that happened just before.

There are a couple of avenues for a fix, but since this is a stress mode
only problem, I've modified GC stress avoid invoking a GC when the thread
is already in preemptive mode and the method is a reverse PINvoke stub.

Fixes #13720.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants