Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible JIT optimization bug on .NET 5 preview #39023

Closed
dellamonica opened this issue Jul 9, 2020 · 19 comments
Closed

Possible JIT optimization bug on .NET 5 preview #39023

dellamonica opened this issue Jul 9, 2020 · 19 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Milestone

Comments

@dellamonica
Copy link

dellamonica commented Jul 9, 2020

Description

On a managed .NET 5 C# project, I get intermittent errors, sometimes an AccessViolationException, other times ExecutionEngineException, and even NullReferenceException (?) with no managed code in sight on the debugger.

The code does mathematical optimization and is fully deterministic, however the exceptions are not always consistent even though they are frequent.

On the following scenarios these exceptions have never appeared so far (I tried a few times):

  • Debug mode.
  • Release mode with Optimize=False,
  • Release mode with Optimize=True and COMPlus_JITMinOpts=1.

Unfortunately, I don't know how to create a small program to reproduce the problem, but there seemed to be a particular point where the code breaks the most:

image

For context, _complement is declared as:
private readonly ImmutableArray<ulong> _complement
An extra detail: in the constructor I initialize a regular ulong[] and use Unsafe.As<ulong[], ImmutableArray<ulong>> to set _complement.

Here is a misterious NullReferenceException (on a value type?!)
image

JIT and compiler optimizations have a lot of impact on the performance in this project so I don't think disabling these optimizations is a long term solution.

Configuration

.NET 5.0.0-preview.6.20305.6

Also tried preview 4 before and was getting similar exceptions.

category:correctness
theme:testing
skill-level:expert
cost:large

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Jul 9, 2020
@tannergooding
Copy link
Member

Can you provide a more complete code example so we can attempt to reproduce the issue on our end?

@jkotas jkotas added the tenet-reliability Reliability/stability related issue (stress, load problems, etc.) label Jul 9, 2020
@dellamonica
Copy link
Author

It is a rather complex algorithm that sometimes runs for several minutes before hitting these exceptions (and some times it doesn't). If I knew how to create a reasonably small code example, I would.

Perhaps a Live Share session or with some help I could try to get more info with WinDbg.

@BruceForstall
Copy link
Member

Could you try a current master branch build (see https://github.com/dotnet/installer)?

Another thing to try: run with COMPlus_GCStress=F. If there's a problem with GC information, this might make the failure more deterministically reproducible. (If it's not a GC related issue, it probably won't help. It will also make the app run very slow.)

@dellamonica
Copy link
Author

Additional info: I've tried running on .net core 3.1 with basically a single change (due to MemoryExtensions.Sort not being present in 3.1) and the exceptions did not occur. I've tested for about an 1h on several instances, while on .net 5 preview I usually get these exceptions very frequently.

@BruceForstall, yes I can test the master branch build and get back here to let you know.

I can also try COMPlus_GCStress=F.

@AndyAyersMS
Copy link
Member

A crash dump might be useful.

@dellamonica
Copy link
Author

dellamonica commented Jul 10, 2020

@BruceForstall: I've got an ExecutionEngineException on 5.0.0-preview.8.20358.9.
I've set COMPlus_GCStress=F and it took forever to get even Visual Studio/PSCore open, if it is that much slower, then we might need hours and for that I'd have to use another PC/VM, so I'll postpone that test.

@AndyAyersMS: from VS Debug > Save Dump as... I exported a crash dump and it's 201Mb and I'm not sure if source code is present in this file, if there is a way to send the file to you privately, or if there is a way to generate a smaller dump without (C#) code present, please let me know.

@AndyAyersMS
Copy link
Member

The dump will contain IL but not sources.

You can share it securely by opening an issue on the VS Developer Portal and then attaching it there, or you can create your own share and email me the access info (andya@microsoft.com).

@dellamonica
Copy link
Author

The dump will contain IL but not sources.

You can share it securely by opening an issue on the VS Developer Portal and then attaching it there, or you can create your own share and email me the access info (andya@microsoft.com).

Sent to your e-mail!

@AndyAyersMS
Copy link
Member

Thanks. I should be able to look at it later today.

@AndyAyersMS
Copy link
Member

Working with @dellamonica offline -- initial impression is bad GC reporting by the jit. Will add this to 5.0.

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Jul 11, 2020
@AndyAyersMS AndyAyersMS added this to the 5.0.0 milestone Jul 11, 2020
@AndyAyersMS AndyAyersMS self-assigned this Jul 11, 2020
@AndyAyersMS
Copy link
Member

Analysis of various dumps provided by @dellamonica shows potentially corrupted GC info, but if so, it's not clear how it got corrupted. Failures were always in the same method at the same offset, so it doesn't look like random corruption. I can repro the exact jit codegen in a mocked-up version of the method, and get normal-looking GC info.

We are ready to do some more diagnosis to try and pin down what is going on, but apparently the failure doesn't repro like it once did. So we're kind of stuck waiting for this to start failing again...

@AndyAyersMS AndyAyersMS added the blocked Issue/PR is blocked on something - see comments label Jul 15, 2020
@AndyAyersMS
Copy link
Member

Failures are reproing once more, am going to look at a simplified repro provided by @dellamonica.

@AndyAyersMS AndyAyersMS removed the blocked Issue/PR is blocked on something - see comments label Jul 17, 2020
@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jul 21, 2020

Still working on tracking this down.

@dellamonica has shared some non-crash examples showing the GC info is fine right after jitting. So either the GC info is produced incorrectly at times, or gets corrupted after it's produced. Given how surgical and repeatable the corruption is, the former seems far more likely.

In particular if the IG flags or liveness state for an IG can be corrupted that could lead exactly to the sort of malformed GC info we see.

So trying to figure out what could be happening in the jit that leads to occasional corruption of IG state -- we're going to enable pageheap to see if we can catch some out of bounds write. Also am also going to look into a special jit build that keeps duplicate IG information and sanity checks that both copies agree.

@AndyAyersMS
Copy link
Member

Have some strong evidence now this is the jit misbehaving. Still not sure why. Here are two gc info traces, one that is correct, the other incorrect.

Register slot id for reg rsi = 3.
Register slot id for reg rcx (byref) = 4.
Register slot id for reg rax (byref) = 5.
Register slot id for reg r9 (byref) = 6.

;; good info

...
Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 4 at instr offset 0x16 to Live.
Set state of slot 5 at instr offset 0x16 to Live.
Set state of slot 6 at instr offset 0x16 to Live.

;; bad info

Set state of slot 3 at instr offset 0x16 to Live.
Set state of slot 3 at instr offset 0x16 to Dead.
Set state of slot 4 at instr offset 0x19 to Live.
Set state of slot 5 at instr offset 0x20 to Live.
Set state of slot 6 at instr offset 0x27 to Live.
Set state of slot 3 at instr offset 0x2a to Live.

Here 0x16 is the start of the method body. In the bad case RSI and some byrefs are left unprotected on entry. Failure is that a GC happens here and RSI's referent gets relocated.

Checked jit always produces good info. As does release jit with various forms of DirectAlloc / Pageheap.

@AndyAyersMS
Copy link
Member

AndyAyersMS commented Jul 28, 2020

Think I've finally figured this one out.

Recall the release jit only sometimes generates bad GC info, and the checked jit never does.

The release jit can sometimes end up in a situation where fgFirstBB does not have the BBF_HAS_LABEL or BBF_JMP_TARGET flags set. If this happens and there are GC references live into the method body, they won't be reported properly as the jit won't create a label for the first BB.

A checked jit will typically always set BBF_HAS_LABEL flag on fgBirstBB because of this bit of code:

if (compiler->fgHasSwitch)
{
compiler->fgFirstBB->bbFlags |= BBF_JMP_TARGET;
}

It turns out fgHasSwitch is never explicitly initialized, so it tends to be set to true in checked builds, given the default fill pattern (the only fill pattern that would have exposed this error in a checked jit is 00 which currently can't be used).

In release builds, fgHasSwitch will have a somewhat random value in methods that don't have switches. Apparently a random 0 tends to be rare, so most of the time the release jit will also set the BBF_HAS_LABEL flag on fgBirstBB. But occasionally the value is 0 and the release jit fails to create a label for the first BB, and thus leaves the jit in position to generate bad GC info.

The above explains why checked jits always produce good GC info, and release jits mostly always do.

This is a regression; the precipitating change is quite likely #1309. Before that change it wasn't possible (or at least wasn't common) for the jit to create a scratch BB after building pred lists. The order of these is significant; fgComputePreds sets both BBF_HAS_LABEL and BBF_JMP_TARGET flags on the first block. Post #1309 the jit might create a scratch BB after fgComputePreds has run, and in that case the new first BB does not have the right flags set.

The simple minimal fix is to initialize fgHasSwitch to false in fgInit, and then update fgEnsureFirstBBisScratch to set the necessary flags on the new first BB.

I have sent an updated jit to @dellamonica to test; hopefully this is indeed the problem.

@dellamonica
Copy link
Author

dellamonica commented Jul 28, 2020

@AndyAyersMS, I've run the crashing repro 8x with the new JIT and it did not crash once. Before we had a failure rate of about 60%, so I'm pretty confident that the problem is solved.

Thank you for your efforts!

AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jul 28, 2020
This fixes an issue where release jits might sometimes generate bad GC info.
Keeping it minimal for now so we can consider servicing preview 8.

See dotnet#39023 for details.
@AndyAyersMS
Copy link
Member

@dellamonica thanks for reporting this, and for your help and patience in tracking this down.

AndyAyersMS added a commit that referenced this issue Jul 29, 2020
This fixes an issue where release jits might sometimes generate bad GC info.
Keeping it minimal for now so we can consider servicing preview 8.

See #39023 for details.
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this issue Jul 29, 2020
Port of dotnet#40038 to Preview 8.

Fix dotnet#39023

Release jits might sometimes generate bad GC info.

Mysterious intermittent crashes. Without this fix jit GC
info generation for some methods is non-deterministically bad.

Yes, problem does not occur in 3.1.

Very low.
@AndyAyersMS
Copy link
Member

Closed via #40038.

@danmoseley
Copy link
Member

+1 thanks @dellamonica for helping track this down and fix it before final release. Much appreciated.

danmoseley pushed a commit that referenced this issue Jul 30, 2020
…0086)

Port of #40038 to Preview 8.

Fix #39023

Release jits might sometimes generate bad GC info.

Mysterious intermittent crashes. Without this fix jit GC
info generation for some methods is non-deterministically bad.

Yes, problem does not occur in 3.1.

Very low.
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this issue Aug 10, 2020
This fixes an issue where release jits might sometimes generate bad GC info.
Keeping it minimal for now so we can consider servicing preview 8.

See dotnet#39023 for details.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-reliability Reliability/stability related issue (stress, load problems, etc.)
Projects
None yet
Development

No branches or pull requests

7 participants