-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Test failure: NullReferenceException in Internal.JitInterface.CorInfoImpl._beginInlining #105441
Comments
@steveisok We have picked up intermittent crossgen crash on arm64 in |
I think it was likely introduced by the #104696 toolset update, so it is not easy to roll-back. |
Yep, I can have someone take a closer look. |
Failed in: runtime-coreclr outerloop 20240725.5 Failed tests:
Error message:
Stack trace:
|
…opy with write barrier calls When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC ness of the arguments of the tailcall may not match the GC ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not override that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame. This used to work fine, however, with dotnet#101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data. The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcat stll logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node. Fix dotnet#102370 Fix dotnet#104123 Fix dotnet#105441
…opy with write barrier calls When the JIT generates code for a tailcall it must generate code to write the arguments into the incoming parameter area. Since the GC ness of the arguments of the tailcall may not match the GC ness of the parameters, we have to disable GC before we start writing these. This is done by finding the earliest `GT_PUTARG_STK` node and placing the start of the NOGC region right before it. In addition, there is logic to take care of potential overlap between the arguments and parameters. For example, if the call has an operand that uses one of the parameters, then we must take care that we do not override that parameter with the tailcall argument before the use of it. To do so, we sometimes may need to introduce copies from the parameter locals to locals on the stack frame. This used to work fine, however, with #101761 we started transforming block copies into managed calls in certain scenarios. It was possible for the JIT to decide to introduce a copy to a local and for this transformation to then kick in. This would cause us to end up with the managed helper call after starting the nogc region. In checked builds this would hit an assert during GC scan; in release builds, it would end up with corrupted data. The fix here is to make sure we insert the `GT_START_NOGC` after all the potential temporary copies we may introduce as part of the tailcat stll logic. There was an additional assumption that the first `PUTARG_STK` operand was the earliest one in execution order. That is not guaranteed, so this change stops relying on that as well by introducing a new `LIR::FirstNode` and using that to determine the earliest `PUTARG_STK` node. Fix #102370 Fix #104123 Fix #105441
which fix are you referring to? if it's #105832 then it is in rc-1 |
oh yeah its that one, thought it had missed RC1, but maybe not. |
This (or something that looks like this) is still happening post RC1: https://dev.azure.com/dnceng-public/public/_build/results?buildId=816197&view=ms.vss-test-web.build-test-results-tab
Have not been able to make much headway with crash dumps yet, at least with LLDB. Also downloaded the above and am trying to get a local repro on WSL2/Volterra like I did with the earlier failure. |
I got a local repro after 95 runs of the full suite (~8 hours) but oddly the process is hung in
This is the same issue with the stack's internal array being null. And we are now using /_/src/coreclr/tools/aot/ILCompiler.ReadyToRun/JitInterface/CorInfoImpl.ReadyToRun.cs @ 3176:
0000ff981eb4bb58 80a20c91 add x0, x20, #0x328
0000ff981eb4bb5c b1d8bb97 bl 0xff981da41e20
0000ff981eb4bb60 f50300aa mov x21, x0
0000ff981eb4bb64 0e4e94d2 mov x14, #0xa270
0000ff981eb4bb68 ced8a3f2 movk x14, #0x1ec6, lsl #16
0000ff981eb4bb6c 0ef3dff2 movk x14, #0xff98, lsl #32
0000ff981eb4bb70 cec1bfb8 ldapr w14, [x14]
0000ff981eb4bb74 ee160036 tbz w14, #0x0, 0xff981eb4be50
(lldb) dumpobj 0000ff5818982850
Name: System.Collections.Generic.Stack`1[[System.Collections.Generic.HashSet`1[[Internal.TypeSystem.MethodDesc, ILCompiler.TypeSystem]], System.Private.CoreLib]]
MethodTable: 0000ff981ec3fc28
EEClass: 0000ff981e3d2678
Tracked Type: false
Size: 32(0x20) bytes
File: /home/andya/bugs/arm64-crossgen-crash2/Regression_6/correlation-payload/dotnet-cli/shared/Microsoft.NETCore.App/9.0.0-rc.1.24431.7/System.Collections.dll
Fields:
MT Field Offset Type VT Attr Value Name
0000ff981dbe1ee0 40000a0 8 System.__Canon[] 0 instance 0000000000000000 _array
0000ff981d9d7160 40000a1 10 System.Int32 1 instance 0 _size
0000ff981d9d7160 40000a2 14 System.Int32 1 instance 0 _version |
wonder if the issue doesn't repro under |
A stable minimal repro on macos-arm64 (Apple M2 Max): https://gist.github.com/EgorBo/c636520798d2c2d79969cf5299d53c79 dotnet build -c Release && while true; do bin/Release/net9.0/myapp; ; done on RC1 (fails on Main's corerun too). Reproduces with JitMinOpts too, so it means the issue is definitely not in the static cctor inlining in JIT |
@EgorBo 's repro also fails on the volterra under WSL2, pretty much every run. Seems like the failure is with statics for the shared generic case? |
I have confirmed locally that the regression appeared in eb8f54d, doesn't reproduce before this commit. |
@davidwrighton we have a simplified repro (see just above), seems related to the class statics change #99183. Still digging into exactly how it fails, |
Not sure if it's the issue, but the repro stops reproducing if I remove runtime/src/coreclr/vm/methodtable.h Lines 610 to 611 in 9d9c64d
|
Repros if the runtime dynamic LSE detection is disabled. So not LSE related evidently. Also repros with checked runtime. Also fails on windows arm64. |
Say two threads try to initialize a class at nearly the same time. Thread 1 sees the class is not initialized, kicks of initialization, allocates the static data, invokes the Thread 2 is lagging just a bit behind. It calls |
runtime/src/coreclr/vm/methodtable.h Line 598 in c20bdf6
I think we are missing barrier on the reader side as @EgorBo pointed in #105441 (comment) |
Ah, ok. What's (a little) surprising is that even in Tier0 the JIT is baking the static address into codegen. ;; Tier0
;; EmptyArray<MyClass998>.Value.Length == 0
movz x0, #104
movk x0, #0xCE30 LSL #16
movk x0, #0xFFAE LSL #32
bl CORINFO_HELP_GET_GCSTATIC_BASE
movz x0, #0x2A88 // data for EmptyArray`1[Program+MyClass998]:Value
movk x0, #0xB800 LSL #16
movk x0, #0xFF6E LSL #32
ldr x0, [x0] // .Value
ldr w0, [x0, #0x08] // .Length
cbnz w0, G_M9030_IG68 So the helper is just called for effect and the static field load afterwards is not data dependent on it. Seems like for Tier0 code would be more compact if we didn't do this. |
Ah, that's the problem. This really needs to either use the result of the helper, or have a barrier inserted. We could insert a barrier by adding a barrier to the code of the static base helper, or into the generated code stream any time after the call to GET_GCSTATIC_BASE. Any one of these 3 solutions would work. We could probably get the JIT to do the right thing with a simple change to the jit interface, where we simply refuse to provide a pre-computed address for a static without requiring that the class be initialized. I'll put together a proposed fix with that approach, so we can see what it looks like. |
@EgorBo I don't have a setup for actually running any arm64 code available tonight, but I think either of the two PR's I just linked to this issue should solve this problem. Could you try them out? The code to have the jit interface force the use of helpers is probably not a final change (we wouldn't need that logic on any platform with a TSO based memory model), but it should be enough to verify that it works. |
I can do some testing. |
#108311 works for me too, didn't try the other one. |
Well, I'm going to want to investigate why #108309 didn't work later, since I'm kinda surprised it didn't fix that test case, so I'm not understanding some part of this system, which isn't good, but the barriers are strictly a more complete solution, so let's go with that. |
I think it also needs fixes in Seems like giving the JIT static addresses before the class is initialized is asking for trouble. Also (as above) it is creating bigger code. So maybe we want both of these? |
It can expose bugs elsewhere in the system as we have seen here, but I do not think there are fundamental problems with it.
It is creating bigger code only when the JIT is not inlining the cctor check (e.g. optimizations are off). |
@jkotas Do you think this is likely to still reproduce in CI even if the underlying issue turns out to not have been fixed? IIUC the testing strategy for crossgen2 has changed significantly recently. |
In main branch, both crossgen2 and ilc run on runtime specified in https://github.com/dotnet/runtime/blob/main/global.json#L8 . This version is at 9.0.100-rc.2.24474.11 currently. 9.0.100-rc.2.24474.11 does not contain fix #108347. This failure should be gone once we update global.json to rtm. |
🎉 |
Failed in: runtime-coreclr outerloop 20240724.3
Failed tests:
Error message:
Stack trace:
Known Issue Error Message
Fill the error message using step by step known issues guidance.
Known issue validation
Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=753860
Error message validated:
[NullReferenceException _beginInlining
]Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 7/27/2024 1:53:56 AM UTC
Report
Summary
The text was updated successfully, but these errors were encountered: