Possible JIT optimization bug on .NET 5 preview #39023
Comments
Can you provide a more complete code example so we can attempt to reproduce the issue on our end? |
It is a rather complex algorithm that sometimes runs for several minutes before hitting these exceptions (and some times it doesn't). If I knew how to create a reasonably small code example, I would. Perhaps a Live Share session or with some help I could try to get more info with WinDbg. |
Could you try a current master branch build (see https://github.com/dotnet/installer)? Another thing to try: run with |
Additional info: I've tried running on .NET Core 3.1 with basically a single change (due to MemoryExtensions.Sort not being present in 3.1) and the exceptions did not occur. I've tested for about an hour on several instances, while on the .NET 5 preview I get these exceptions very frequently. @BruceForstall, yes, I can test the master branch build and get back here to let you know. I can also try COMPlus_GCStress=F.
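For readers wondering what such a "single change" might look like, here is a minimal sketch of a sort helper that compiles on both runtimes. It is not the reporter's code: the `SortCompat` class, the `ulong` element type, and the copy-based fallback are assumptions made for illustration. `NETCOREAPP3_1` is the symbol the SDK defines when targeting `netcoreapp3.1`.

```csharp
using System;

// Hypothetical helper, not taken from the reporter's project.
public static class SortCompat
{
    // Sorts the span in place on both target frameworks.
    public static void Sort(Span<ulong> values)
    {
#if NETCOREAPP3_1
        // .NET Core 3.1 has no MemoryExtensions.Sort, so sort a copy and write it back.
        ulong[] copy = values.ToArray();
        Array.Sort(copy);
        copy.AsSpan().CopyTo(values);
#else
        // .NET 5 and later: MemoryExtensions.Sort sorts the span directly.
        values.Sort();
#endif
    }
}
```

The fallback allocates a temporary array, so it is slower than the .NET 5 path, but the sorted result is the same.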
A crash dump might be useful. |
@BruceForstall: I've got an ExecutionEngineException on 5.0.0-preview.8.20358.9. @AndyAyersMS: I exported a crash dump from VS (Debug > Save Dump As...) and it's 201 MB. I'm not sure whether source code is present in this file. If there is a way to send the file to you privately, or to generate a smaller dump without (C#) code present, please let me know.
The dump will contain IL but not sources. You can share it securely by opening an issue on the VS Developer Portal and then attaching it there, or you can create your own share and email me the access info (andya@microsoft.com). |
Sent to your e-mail! |
Thanks. I should be able to look at it later today. |
Working with @dellamonica offline -- initial impression is bad GC reporting by the jit. Will add this to 5.0. |
Analysis of various dumps provided by @dellamonica shows potentially corrupted GC info, but if so, it's not clear how it got corrupted. Failures were always in the same method at the same offset, so it doesn't look like random corruption. I can repro the exact jit codegen in a mocked-up version of the method, and get normal-looking GC info. We are ready to do some more diagnosis to try and pin down what is going on, but apparently the failure doesn't repro like it once did. So we're kind of stuck waiting for this to start failing again... |
Failures are reproing once more, am going to look at a simplified repro provided by @dellamonica. |
Still working on tracking this down. @dellamonica has shared some non-crash examples showing the GC info is fine right after jitting. So either the GC info is produced incorrectly at times, or it gets corrupted after it's produced. Given how surgical and repeatable the corruption is, the former seems far more likely. In particular, if the IG flags or liveness state for an IG can be corrupted, that could lead to exactly the sort of malformed GC info we see. So I'm trying to figure out what could be happening in the jit that leads to occasional corruption of IG state. We're going to enable pageheap to see if we can catch an out-of-bounds write. I'm also going to look into a special jit build that keeps duplicate IG information and sanity checks that both copies agree.
Have some strong evidence now this is the jit misbehaving. Still not sure why. Here are two gc info traces, one that is correct, the other incorrect.
Here 0x16 is the start of the method body. In the bad case Checked jit always produces good info. As does release jit with various forms of DirectAlloc / Pageheap. |
Think I've finally figured this one out. Recall the release jit only sometimes generates bad GC info, and the checked jit never does. The release jit can sometimes end up in a situation where A checked jit will typically always set runtime/src/coreclr/src/jit/codegenlinear.cpp Lines 99 to 102 in 50a999d
It turns out In release builds, The above explains why checked jits always produce good GC info, and release jits mostly always do. This is a regression; the precipitating change is quite likely #1309. Before that change it wasn't possible (or at least wasn't common) for the jit to create a scratch BB after building pred lists. The order of these is significant; The simple minimal fix is to initialize I have sent an updated jit to @dellamonica to test; hopefully this is indeed the problem. |
@AndyAyersMS, I've run the crashing repro 8x with the new JIT and it did not crash once. Before we had a failure rate of about 60%, so I'm pretty confident that the problem is solved. Thank you for your efforts! |
This fixes an issue where release jits might sometimes generate bad GC info. Keeping it minimal for now so we can consider servicing preview 8. See dotnet#39023 for details.
@dellamonica thanks for reporting this, and for your help and patience in tracking this down. |
Port of dotnet#40038 to Preview 8. Fix dotnet#39023 Release jits might sometimes generate bad GC info. Mysterious intermittent crashes. Without this fix jit GC info generation for some methods is non-deterministically bad. Yes, problem does not occur in 3.1. Very low.
Closed via #40038. |
+1 thanks @dellamonica for helping track this down and fix it before final release. Much appreciated. |
Description
In a managed .NET 5 C# project, I get intermittent errors: sometimes an AccessViolationException, other times an ExecutionEngineException, and even a NullReferenceException (?) with no managed code in sight in the debugger.
The code does mathematical optimization and is fully deterministic. However, the exceptions are frequent but do not always occur at the same point.
In the following scenarios these exceptions have never appeared so far (I tried a few times):
Unfortunately, I don't know how to create a small program to reproduce the problem, but there seemed to be a particular point where the code breaks the most:
For context, _complement is declared as: private readonly ImmutableArray<ulong> _complement
An extra detail: in the constructor I initialize a regular ulong[] and use Unsafe.As<ulong[], ImmutableArray<ulong>> to set _complement.
Here is a mysterious NullReferenceException (on a value type?!)
data:image/s3,"s3://crabby-images/322c2/322c27104f941ecb6f9cb74c3795d55a8f511998" alt="image"
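The constructor detail mentioned above can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the `BitSetSketch` class, its `words` parameter, and the bitwise-complement loop are hypothetical, and only the `_complement` declaration and the `Unsafe.As<ulong[], ImmutableArray<ulong>>` call come from the report. The reinterpretation relies on `ImmutableArray<T>` being a struct that wraps a single `T[]` field.

```csharp
using System;
using System.Collections.Immutable;
using System.Runtime.CompilerServices;

// Hypothetical type; only the field declaration and the Unsafe.As call mirror the report.
public sealed class BitSetSketch
{
    private readonly ImmutableArray<ulong> _complement;

    public BitSetSketch(ulong[] words)
    {
        // Build the data in an ordinary mutable array first.
        // The contents (bitwise complement of the input) are placeholder logic.
        var buffer = new ulong[words.Length];
        for (int i = 0; i < words.Length; i++)
        {
            buffer[i] = ~words[i];
        }

        // Reinterpret the array reference as an ImmutableArray<ulong> without copying.
        // This depends on ImmutableArray<T> holding exactly one T[] field.
        _complement = Unsafe.As<ulong[], ImmutableArray<ulong>>(ref buffer);
    }

    public ImmutableArray<ulong> Complement => _complement;
}
```

When isolating a problem like this one, a copying alternative such as `ImmutableArray.Create(buffer)` avoids the reinterpretation entirely, at the cost of one extra allocation.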
JIT and compiler optimizations have a lot of impact on the performance in this project so I don't think disabling these optimizations is a long term solution.
Configuration
.NET 5.0.0-preview.6.20305.6
Also tried preview 4 before and was getting similar exceptions.
category:correctness
theme:testing
skill-level:expert
cost:large