Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revert "Do not encode safe points with -1 offset. (#104336)" #105234

Merged
merged 1 commit into from
Jul 22, 2024

Conversation

jakobbotsch
Copy link
Member

@jakobbotsch jakobbotsch commented Jul 22, 2024

This reverts #104336, which appears to fix the gcstress-extra issues filed this morning.

Close #105210
Close #105212
Close #105213
Close #105214
Close #105215
Close #105216
Close #105218
Close #105219
Close #105220
Close #105221
Close #105222
Close #105223
Close #105224
Close #105225
Close #105226
Close #105227

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 22, 2024
@jakobbotsch
Copy link
Member Author

/azp run runtime-coreclr gcstress-extra

Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch jakobbotsch changed the title Try to track down gcstress-extra failures Revert "Do not encode safe points with -1 offset. (#104336)" Jul 22, 2024
@jakobbotsch
Copy link
Member Author

cc @dotnet/jit-contrib @VSadov @jkotas

It looks like reverting this commit fixes the gcstress-extra issues that were filed this morning.
This might be specific interaction with jitstress, so it might not be a product bug at all. Should we revert it for the snap, or do you want to look into it and backport a fix into preview 7 @VSadov?

@jakobbotsch jakobbotsch marked this pull request as ready for review July 22, 2024 15:16
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Jul 22, 2024
@VSadov
Copy link
Member

VSadov commented Jul 22, 2024

It is very possible that JIT stress does something illegal. I'd prefer reverting though, if that fixes up the stress runs.

There are quite a few GCStress failures. Is that after removing or happens regardless?

  • just to be sure that reverting does not make things broken, just in a different way.

@jakobbotsch
Copy link
Member Author

It is very possible that JIT stress does something illegal. I'd prefer reverting though, if that fixes up the stress runs.

There are quite a few GCStress failures. Is that after removing or happens regardless?

  • just to be sure that reverting does not make things broken, just in a different way.

Those are predominantly #105186

@VSadov
Copy link
Member

VSadov commented Jul 22, 2024

We had some green gcstress runs just recently (Ex. https://dev.azure.com/dnceng-public/public/_build/results?buildId=732442&view=results), so I guess these are new failures not related to safepoint encoding.

@VSadov
Copy link
Member

VSadov commented Jul 22, 2024

Also note that the change that we are reverting could make #103950 a bit easier, so we might want to figure what is wrong with jit stress and put the change back.

@BrennanConroy
Copy link
Member

BrennanConroy commented Jul 23, 2024

This seems to be causing https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_apis/build/builds/746357/logs/28 which I'd say is a product issue.

Verified by using the commit before the original change and not having any issues, then using the original change and having issues.

@janvorli
Copy link
Member

janvorli commented Jul 23, 2024

@VSadov, I was helping @BrennanConroy yesterday to debug the issue and I have a dump that he's shared with stress log covering the whole run, where their app crashed in 2.5 seconds after start, and only a single GC was executed. Sounds like this might be the easiest way for you to figure out what's wrong.
The crash is due to what I believe is a GC hole where the code tries to use a reference that isn't valid anymore (but it is within the GC heap range). The method keeps the reference in RBX and then calls the following method. Notice that the RBX is pushed as the first thing in the prolog, but the pushed value doesn't seem to be tracked by GC. And I can see that when the GC kicked in, the call chain went through that method. So, I think that the GC most likely moved the object and then in the epilog popped the old reference back to RBX:

0:014> !u -gcinfo 00007FF9CBE041B9
preJIT generated code
System.Threading.ManualResetEventSlim.Wait(System.TimeSpan)
ilAddr is 00007FF9CC2C47DC pImport is 000002140133F910
Begin 00007FF9CBE04140, size c2

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 399:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: <none>
Wants Report Only Leaf: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: c2
00007ff9`cbe04140 53              push    rbx
00007ff9`cbe04141 4883ec20        sub     rsp,20h
00007ff9`cbe04145 4889542438      mov     qword ptr [rsp+38h],rdx
00007ff9`cbe0414a 488bd9          mov     rbx,rcx
00007ff9`cbe0414d 488d4c2438      lea     rcx,[rsp+38h]
00007ff9`cbe04152 ff1528cea100    call    qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x6444e0 (00007ff9`cc820f80)] (System.TimeSpan.get_TotalMilliseconds())
00007ff9`cbe04158 f20f12c8        movddup xmm1,xmm0
00000019 is a safepoint: 
00000018 +rbx
00007ff9`cbe0415c f20f12d0        movddup xmm2,xmm0
00007ff9`cbe04160 f20f12c0        movddup xmm0,xmm0
00007ff9`cbe04164 660fc2ca00      cmpeqpd xmm1,xmm2
00007ff9`cbe04169 660f54c1        andpd   xmm0,xmm1
00007ff9`cbe0416d 0f100d0c588900  movups  xmm1,xmmword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x4bcee0 (00007ff9`cc699980)]
00007ff9`cbe04174 0f101515588900  movups  xmm2,xmmword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x4bcef0 (00007ff9`cc699990)]
00007ff9`cbe0417b 660fc2c802      cmplepd xmm1,xmm0
00007ff9`cbe04180 f2480f2cc8      cvttsd2si rcx,xmm0
00007ff9`cbe04185 66480f6ec1      movq    xmm0,rcx
00007ff9`cbe0418a 0f28d8          movaps  xmm3,xmm0
00007ff9`cbe0418d 660f6cd8        punpcklqdq xmm3,xmm0
00007ff9`cbe04191 0f28c1          movaps  xmm0,xmm1
00007ff9`cbe04194 660f3810da      pblendvb xmm3,xmm2,xmm0
00007ff9`cbe04199 66480f7eda      movq    rdx,xmm3

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 401:
00007ff9`cbe0419e 4883faff        cmp     rdx,0FFFFFFFFFFFFFFFFh
00007ff9`cbe041a2 7c27            jl      System_Private_CoreLib!System.Threading.ManualResetEventSlim.Wait+0x8b (00007ff9`cbe041cb)

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 402:
00007ff9`cbe041a4 4881faffffff7f  cmp     rdx,7FFFFFFFh
00007ff9`cbe041ab 7d13            jge     System_Private_CoreLib!System.Threading.ManualResetEventSlim.Wait+0x80 (00007ff9`cbe041c0)

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 404:
00007ff9`cbe041ad 488bcb          mov     rcx,rbx
00007ff9`cbe041b0 4533c0          xor     r8d,r8d
00007ff9`cbe041b3 ff151ffaa100    call    qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x647138 (00007ff9`cc823bd8)] (System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken))
>>> 00007ff9`cbe041b9 90              nop
0000007a is a safepoint: 
00007ff9`cbe041ba 4883c420        add     rsp,20h
00007ff9`cbe041be 5b              pop     rbx
00007ff9`cbe041bf c3              ret

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 402:
00007ff9`cbe041c0 4881faffffff7f  cmp     rdx,7FFFFFFFh
00007ff9`cbe041c7 7ee4            jle     System_Private_CoreLib!System.Threading.ManualResetEventSlim.Wait+0x6d (00007ff9`cbe041ad)
00007ff9`cbe041c9 eb1e            jmp     System_Private_CoreLib!System.Threading.ManualResetEventSlim.Wait+0xa9 (00007ff9`cbe041e9)

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 401:
00007ff9`cbe041cb 4c8b05ceb0a300  mov     r8,qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x662800 (00007ff9`cc83f2a0)]
00007ff9`cbe041d2 4d8b00          mov     r8,qword ptr [r8]
00007ff9`cbe041d5 488bca          mov     rcx,rdx
00007ff9`cbe041d8 48baffffffffffffffff mov rdx,0FFFFFFFFFFFFFFFFh
00007ff9`cbe041e2 ff1568eba200    call    qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x6562b0 (00007ff9`cc832d50)] (System_Private_CoreLib)
00007ff9`cbe041e8 cc              int     3

C:\github\runtime\src\libraries\System.Private.CoreLib\src\System\Threading\ManualResetEventSlim.cs @ 402:
000000a9 is a safepoint: 
00007ff9`cbe041e9 4c8b05b0b0a300  mov     r8,qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x662800 (00007ff9`cc83f2a0)]
00007ff9`cbe041f0 4d8b00          mov     r8,qword ptr [r8]
00007ff9`cbe041f3 488bca          mov     rcx,rdx
00007ff9`cbe041f6 baffffff7f      mov     edx,7FFFFFFFh
00007ff9`cbe041fb ff1547eba200    call    qword ptr [System_Private_CoreLib!System.Collections.Generic.CollectionExtensions.TryAdd+0x6562a8 (00007ff9`cc832d48)] (System_Private_CoreLib)
00007ff9`cbe04201 cc              int     3
000000c2 is a safepoint: 

@jakobbotsch
Copy link
Member Author

Notice that the RBX is pushed as the first thing in the prolog, but the pushed value doesn't seem to be tracked by GC.

This looks like it's saving a callee-save register because the method is going to be using it. If it has a GC reference I assume the caller is what needs to be marking that.

@janvorli
Copy link
Member

If it has a GC reference I assume the caller is what needs to be marking that.

Of course you are right and I was wrong.

@janvorli
Copy link
Member

janvorli commented Jul 23, 2024

This is the method that keeps the reference in RBX and calls the method above and also the method that crashes with the invalid reference:

preJIT generated code
Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.TimerLoop()
ilAddr is 00007FF9E7ACEC08 pImport is 000002140F50A640
Begin 00007FF9E7A3E6C0, size 43
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: <none>
Wants Report Only Leaf: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: 43
00007ff9`e7a3e6c0 57              push    rdi
00007ff9`e7a3e6c1 56              push    rsi
00007ff9`e7a3e6c2 53              push    rbx
00007ff9`e7a3e6c3 4883ec20        sub     rsp,20h
00007ff9`e7a3e6c7 488bd9          mov     rbx,rcx
00007ff9`e7a3e6ca 488b350fa21900  mov     rsi,qword ptr [Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.GetStream+0x1228b0 (00007ff9`e7bd88e0)]
00007ff9`e7a3e6d1 488b3df8a11900  mov     rdi,qword ptr [Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.GetStream+0x1228a0 (00007ff9`e7bd88d0)]
00007ff9`e7a3e6d8 eb09            jmp     Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.TimerLoop+0x23 (00007ff9`e7a3e6e3)

/_/src/Servers/Kestrel/Core/src/Internal/Infrastructure/Heartbeat.cs @ 76:
00007ff9`e7a3e6da 488bcb          mov     rcx,rbx
00007ff9`e7a3e6dd ff1535591900    call    qword ptr [Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.GetStream+0x11dfe8 (00007ff9`e7bd4018)] (Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.OnHeartbeat())
^^^ The crash happens in this call, an invalid reference is passed in RCX ^^^

/_/src/Servers/Kestrel/Core/src/Internal/Infrastructure/Heartbeat.cs @ 74:
00000023 is a safepoint: 
00000022 +rbx
>>> 00007ff9`e7a3e6e3 488b0c33        mov     rcx,qword ptr [rbx+rsi]
00007ff9`e7a3e6e7 488b143b        mov     rdx,qword ptr [rbx+rdi]
00007ff9`e7a3e6eb 4c8d1d2ef41800  lea     r11,[Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http2.Http2Connection.GetStream+0x117af0 (00007ff9`e7bcdb20)]
00007ff9`e7a3e6f2 3909            cmp     dword ptr [rcx],ecx
00007ff9`e7a3e6f4 41ff13          call    qword ptr [r11]; System.Threading.ManualResetEventSlim.Wait(System.TimeSpan)
^^^ It was in this call when GC was scanning the stack ^^^

00000037 is a safepoint: 
00000036 +rbx
00007ff9`e7a3e6f7 85c0            test    eax,eax
00007ff9`e7a3e6f9 74df            je      Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.TimerLoop+0x1a (00007ff9`e7a3e6da)

/_/src/Servers/Kestrel/Core/src/Internal/Infrastructure/Heartbeat.cs @ 78:
00007ff9`e7a3e6fb 4883c420        add     rsp,20h
00007ff9`e7a3e6ff 5b              pop     rbx
00007ff9`e7a3e700 5e              pop     rsi
00007ff9`e7a3e701 5f              pop     rdi
00007ff9`e7a3e702 c3              ret

@janvorli
Copy link
Member

And here is the call stack at the time of the crash:

 # Child-SP          RetAddr               Call Site
00 00000045`6fdff730 00007ff9`e7a3e6e3     Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.OnHeartbeat+0x3a [_/src/Servers/Kestrel/Core/src/Internal/Infrastructure/Heartbeat.cs @ 45] 
01 00000045`6fdff7c0 00007ff9`d73b2f83     Microsoft_AspNetCore_Server_Kestrel_Core!Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Infrastructure.Heartbeat.TimerLoop+0x23 [_/src/Servers/Kestrel/Core/src/Internal/Infrastructure/Heartbeat.cs @ 74] 
02 00000045`6fdff800 00007ff9`d725c284     coreclr!CallDescrWorkerInternal+0x83 [C:\github\runtime\src\coreclr\vm\amd64\CallDescrWorkerAMD64.asm @ 74] 
03 00000045`6fdff840 00007ff9`d73a0363     coreclr!DispatchCallSimple+0x60 [C:\github\runtime\src\coreclr\vm\callhelpers.cpp @ 217] 
04 00000045`6fdff8d0 00007ff9`d72fd020     coreclr!ThreadNative::KickOffThread_Worker+0x63 [C:\github\runtime\src\coreclr\vm\comsynchronizable.cpp @ 158] 
05 (Inline Function) --------`--------     coreclr!ManagedThreadBase_DispatchInner+0xd [C:\github\runtime\src\coreclr\vm\threads.cpp @ 7115] 
06 00000045`6fdff930 00007ff9`d72fcf2f     coreclr!ManagedThreadBase_DispatchMiddle+0x8c [C:\github\runtime\src\coreclr\vm\threads.cpp @ 7159] 
07 00000045`6fdff9e0 00007ff9`d72fd8de     coreclr!ManagedThreadBase_DispatchOuter+0xab [C:\github\runtime\src\coreclr\vm\threads.cpp @ 7318] 
08 (Inline Function) --------`--------     coreclr!ManagedThreadBase_FullTransition+0x28 [C:\github\runtime\src\coreclr\vm\threads.cpp @ 7363] 
09 (Inline Function) --------`--------     coreclr!ManagedThreadBase::KickOff+0x28 [C:\github\runtime\src\coreclr\vm\threads.cpp @ 7398] 
0a 00000045`6fdffa80 00007ffb`060a257d     coreclr!ThreadNative::KickOffThread+0x7e [C:\github\runtime\src\coreclr\vm\comsynchronizable.cpp @ 230] 
0b 00000045`6fdffae0 00007ffb`06e4af28     kernel32!BaseThreadInitThunk+0x1d [clientcore\base\win32\client\thread.c @ 75] 
0c 00000045`6fdffb10 00000000`00000000     ntdll!RtlUserThreadStart+0x28 [minkernel\ntdll\rtlstrt.c @ 1166] 

@VSadov
Copy link
Member

VSadov commented Jul 23, 2024

00007ff9`e7a3e6f4 41ff13          call    qword ptr [r11]; System.Threading.ManualResetEventSlim.Wait(System.TimeSpan)
^^^ It was in this call when GC was scanning the stack ^^^

00000037 is a safepoint: 
00000036 +rbx

does the above mean that rbx is not marked live going into the call, but marked live right after?
It would be normal to see return registers becoming live after a call, but RBX is not a return register.
(do I get right the notation of this print out?)

Discontinuity in nonvolatile register liveness is possible if the call throws, but then we'd have unreachable int3 after a call.

If this method was forced to be fully interruptible and was a leaf of a stackwalk, we could start from either of the locations (before call, after). Would both work correctly?

@janvorli
Copy link
Member

I am actually not sure how to read the GC info here. The fact that it has +rbx at two places, but no -rbx anywhere confuses me.

@janvorli
Copy link
Member

@VSadov this is the corresponding output of the !gcinfo if that helps you:

0:008> !gcinfo 00007ffa`5a2ce6c0
entry point 00007FFA5A2CE6C0
preJIT generated code
GC info 00007FFA5A270AB0
Pointer table:
Prolog size: 0
Security object: <none>
GS cookie: <none>
PSPSym: <none>
Generics inst context: <none>
PSP slot: <none>
GenericInst slot: <none>
Varargs: 0
Frame pointer: <none>
Wants Report Only Leaf: 0
Size of parameter area: 0
Return Kind: Scalar
Code size: 43
00000023 is a safepoint: 
00000022 +rbx
00000037 is a safepoint: 
00000036 +rbx

@VSadov
Copy link
Member

VSadov commented Jul 27, 2024

One interesting thing is that when I ran runtime-coreclr gcstress0x3-gcstress0xc for the original change it was green:

https://dev.azure.com/dnceng-public/public/_build/results?buildId=729833&view=results

@VSadov
Copy link
Member

VSadov commented Jul 27, 2024

The reason for @BrennanConroy ' s failures appear to be just cross-build breaking change nature of the change. The new codegen was incompatible with existing R2R images. We did update R2R min version for 9.0, but we do not version between builds.

Disabling R2R temporarily for these tests while old R2R images might still be used would very likely fix the problem.
It fixes the crash for me when I try to repro it. - With R2R disable stress tests run for as long as I care to wait, with R2R enabled crash within seconds.

@VSadov
Copy link
Member

VSadov commented Jul 27, 2024

I think we still have some real issue with JIT stress on arm/arm64 though.
Somehow change in encoding may break tests when run with JIT stress.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.