-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test failed: JIT\Performance\CodeQuality\BenchmarksGame\pidigits\pidigits-3\pidigits-3.cmd #13720
Comments
failed again in job: runtime-coreclr gcstress0x3-gcstress0xc 20200426.1 failed test: Error message
|
MarshalBoolArrayTest failure is Windows x64 GCStress=C, no JIT stress modes |
Looks like pidigits-3 no longer fails. The MarshalBoolArrayTest failure repros for me locally once in every 5 runs or so, so the failure is intermittent. |
I can't repro the MarshalBoolArrayTest failure locally anymore. It failed again on May 9 and there is a helix dump file available but I get an error trying to download. Will work a bit harder on getting a local repro, but looking like we'll have to close this and wait for it to happen again in CI. |
Got a local repro, though it is intermittent. Still figuring out what might have gone wrong. Failure is while scanning the stack for a reverse pinvoke IL stub. An untracked GC ref on the stub frame fails to validate as a heap object. GC is invoked by a stress interrupt in that same frame. The jit doesn't manipulate this slot other than to zero it out in the prolog; after that all the updates are done on the runtime side by |
Have a runtime log that shows that in GCs 2859 and 2861, we walk thread 1 stack and see the reverse pinvoke stub, but for some reason, in GC 2860, we don't see this frame and so the object gets collected. It looks like 2680 is not a stress mode triggered GC. In the reverse pinvoke method we're past the call to CORINFO_HELP_JIT_REVERSE_PINVOKE_EXIT, so I wonder if the issue is that if we get a non-stress mode GC between that point and the method epilog we are ignoring the frame, but if we get a stress mode GC there we're not ignoring the frame, and so if we have both GCs happen then things blow up. We have: 00007FFDE5975A9A 4C 8B 45 28 mov r8,qword ptr [rbp+28h] // stack slot in question
00007FFDE5975A9E FF D0 call rax // call managed method
00007FFDE5975AA0 85 C0 test eax,eax
00007FFDE5975AA2 40 0F 95 C6 setne sil
00007FFDE5975AA6 40 0F B6 F6 movzx esi,sil
00007FFDE5975AAA 48 8D 4D 18 lea rcx,[rbp+18h]
00007FFDE5975AAE E8 FD 20 46 5F call JIT_ReversePInvokeExit (07FFE44DD7BB0h)
00007FFDE5975AB3 8B C6 mov eax,esi
00007FFDE5975AB5 48 B9 78 56 34 12 F0 DE BC 9A mov rcx,9ABCDEF012345678h // gc reporting was ok here
00007FFDE5975ABF 48 39 4D 10 cmp qword ptr [rbp+10h],rcx // but not here Presumably we unlink the frame at the reverse pinvoke exit, and after that point, the reverse pinvoke stub's frame vanishes from normal stackwalks. We then do a stress mode GC at B3 and B5 and since GC is started from this thread we see the frame for some reason. We then get a non-stress mode GC also at B5 and does not see the frame, so GC collects the bool array pointed to by rbp+28 and zeros out the method table and data.We then hit a stress mode GC at BF and blow up. Normally there isn't much (any?) code between the pinvoke exit and the epilog, but here we have a GS check so there is perhaps a bit bigger window in which we can see a non-stress GC followed by a stress GC. And presumably this won't happen w/o gc stress as we would normally not initiate a GC in that code range from that thread. If the above is accurate, we perhaps could fix this by one of:
|
Looks like reverse pinvoke exit just puts the thread into preemptive mode. So perhaps we could fix as follows: when we have a stress interrupt, if we're in managed code and we're also in preemptive mode, assert that we're in a reverse pinvoke stub (perhaps), and then don't induce a GC. |
Reverse PInvoke stubs change the thread to preemptive mode before they exit. Often the epilog immediately follows this helper call, but in some cases there may be a region with a few instructions before we reach the epilog. If there is a normal GC triggered while the method is in this region then the runtime will not report GC references from the stub frame. If the stub method is fully interruptible we will put GC coverage interrupts in this region when doing GC stress. When these trigger a GC then the runtime will report GC references from the stub frame. So if while executing in this region we happen to do a normal GC and then a stress invoked GC, the stress mode GC can find invalid GC references in the frame, as the objects referenced could have been collected/moved by the normal GC that happened just before. There are a couple of avenues for a fix, but since this is a stress mode only problem, I've modified GC stress avoid invoking a GC when the thread is already in preemptive mode and the method is a reverse PINvoke stub. Fixes dotnet#13720.
Reverse PInvoke stubs change the thread to preemptive mode before they exit. Often the epilog immediately follows this helper call, but in some cases there may be a region with a few instructions before we reach the epilog. If there is a normal GC triggered while the method is in this region then the runtime will not report GC references from the stub frame. If the stub method is fully interruptible we will put GC coverage interrupts in this region when doing GC stress. When these trigger a GC then the runtime will report GC references from the stub frame. So if while executing in this region we happen to do a normal GC and then a stress invoked GC, the stress mode GC can find invalid GC references in the frame, as the objects referenced could have been collected/moved by the normal GC that happened just before. There are a couple of avenues for a fix, but since this is a stress mode only problem, I've modified GC stress avoid invoking a GC when the thread is already in preemptive mode and the method is a reverse PINvoke stub. Fixes #13720.
Job:
coreclr-gcstress-extra #20191102.1
Detail:
https://helix.dot.net/api/2019-06-17/jobs/684cbdca-f557-48d4-8a7d-b9d9c506a4cf/workitems/JIT/console
OS & Arch:
windows x64
Mode:
COMPlus_TieredCompilation=0
COMPlus_GCStress=0xC
COMPlus_JitStress=2
Note:
This failure repros on our windows x64 machine.
Log:
category:testing
theme:gc-stress
skill-level:intermediate
cost:medium
The text was updated successfully, but these errors were encountered: