IGP causes NVMe Kernel Panic CSTS=0xffffffff #1193

Closed
0xfeedface-turbo opened this issue Oct 2, 2020 · 16 comments

@0xfeedface-turbo

Let me start by saying that this is not a bug in NVMeFix or WhateverGreen, but this seems like the best place to document the issue.

I have an Intel 9600K/H370 system that experiences kernel panics in IONVMeController that manifest as a generic timeout:

void AppleNVMeRequestTimer::PrintPending()::243:QID=1 Deadline=4390442285091 DW0=00140001 DW10=00F04593 DW11=00000000 DW12=0000001F DW13=00000000 DW14=00000000 DW15=00000000
void AppleNVMeRequestTimer::PrintPending()::243:QID=1 Deadline=4390442285091 DW0=00140001 DW10=00F04593 DW11=00000000 DW12=0000001F DW13=00000000 DW14=00000000 DW15=00000000
Debugger called:
IOPlatformPanicAction -> IONVMeController
IOPlatformPanicAction -> AppleSMC
: panic(cpu 0 caller 0xffffff7f865edb30): nvme: "Fatal error occurred. CSTS=0xffffffff US[1]=0x0 US[0]=0x5a1 VID/DID=0x500215b7
. FW Revision=102000WD\n"@/BuildRoot/Library/Caches/com.apple.xbs/Sources/IONVMeFamily/IONVMeFamily-387.270.1/IONVMeController.cpp:5334
Backtrace (CPU 0), Frame : Return Address
0xffffff873a6f3a10 : 0xffffff8003fad58d mach_kernel : _handle_debugger_trap + 0x47d
0xffffff873a6f3a60 : 0xffffff80040e9145 mach_kernel : _kdp_i386_trap + 0x155
0xffffff873a6f3aa0 : 0xffffff80040da87a mach_kernel : _kernel_trap + 0x50a
0xffffff873a6f3b10 : 0xffffff8003f5a9d0 mach_kernel : _return_from_trap + 0xe0
0xffffff873a6f3b30 : 0xffffff8003facfa7 mach_kernel : _panic_trap_to_debugger + 0x197
0xffffff873a6f3c50 : 0xffffff8003facdf3 mach_kernel : _panic + 0x63
0xffffff873a6f3cc0 : 0xffffff7f865edb30 com.apple.iokit.IONVMeFamily : __ZN16IONVMeController13FatalHandlingEv + 0x10e
0xffffff873a6f3e20 : 0xffffff800465d407 mach_kernel : _ZN18IOTimerEventSource15timeoutSignaledEPvS0 + 0x87
0xffffff873a6f3e90 : 0xffffff800465d329 mach_kernel : _ZN18IOTimerEventSource17timeoutAndReleaseEPvS0 + 0x99
0xffffff873a6f3ec0 : 0xffffff8003fec7a5 mach_kernel : _thread_call_delayed_timer + 0xef5
0xffffff873a6f3f40 : 0xffffff8003fec345 mach_kernel : _thread_call_delayed_timer + 0xa95
0xffffff873a6f3fa0 : 0xffffff8003f5a0ce mach_kernel : _call_continuation + 0x2e
Kernel Extensions in backtrace:
com.apple.iokit.IONVMeFamily(2.1)[E109699D-6257-3176-B081-4CC8B1C181AB]@0xffffff7f865e0000->0xffffff7f8661ffff
dependency: com.apple.driver.AppleMobileFileIntegrity(1.0.5)[1AD7D9F4-24B5-354F-BD01-C301F58FAA52]@0xffffff7f84d8d000
dependency: com.apple.iokit.IOPCIFamily(2.9)[EF12A360-E92B-3407-8080-E4889F8AAC97]@0xffffff7f84895000
dependency: com.apple.driver.AppleEFINVRAM(2.1)[32B99D26-4CD1-3CE5-8856-D2659CCA4861]@0xffffff7f84f67000
dependency: com.apple.iokit.IOStorageFamily(2.1)[DFD9596C-E596-376A-8A00-3B74A06C2D02]@0xffffff7f84b83000
dependency: com.apple.iokit.IOReportFamily(47)[769D4408-2D1B-3B65-89D1-4C3C547099E3]@0xffffff7f85407000
BSD process name corresponding to current thread: kernel_task

I have tried to debug this timeout. It happens at random times, but there is one commonality: it only occurs when the IGP is in use and the display is sleeping.

The IGP going into a low-power mode seems to disrupt power to the NVMe drive, causing it to crash/reset and thus triggering the timeout. The drive keeps SMART statistics on power-offs, and I have recorded this anomaly:

Power Cycles: 3,814
Power On Hours: 202
Unsafe Shutdowns: 3,794

I have not been able to figure out exactly how the IGP causes the NVMe drive to lose power, but I suspect it may be related to this issue (RC6).

I modified the CFL FB kext with these changes, which seem to completely solve the KP issue:
<key>RenderStandby</key><integer>0</integer>
<key>SetRC6Voltage</key><integer>1</integer>
<key>SupportPSRwithExternalDisplay</key><integer>0</integer>
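
For context, these keys sit in the FeatureControl dictionary of the framebuffer personality in the kext's Info.plist. Roughly, the edited section looks like the sketch below; treat it as an illustration rather than an exact diff, since the surrounding keys are omitted and only the three values above are the actual change:

<!-- inside the framebuffer personality's FeatureControl dict -->
<key>FeatureControl</key>
<dict>
    <!-- ...existing keys left untouched... -->
    <key>RenderStandby</key>
    <integer>0</integer>   <!-- disable RC6 render standby -->
    <key>SetRC6Voltage</key>
    <integer>1</integer>
    <key>SupportPSRwithExternalDisplay</key>
    <integer>0</integer>
</dict>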

Have you guys seen issues relating to IGP power saving causing any similar problems? I'm thinking there might be a way to work around this in Whatevergreen or NVMeFix to avoid having to create a plist-only kext to change these settings.

@0xfeedface-turbo
Author

I forgot to mention that I spent a lot of time troubleshooting this before discovering the IGP connection.

I tried different NVMe cards, different motherboards, NVMe heatsinks, built-in M.2 slots vs. PCIe adapter cards, UEFI PCI power settings, enabling/disabling ASPM, etc., and the kernel panic always recurred. Sometimes the VID/DID would read as 0xffff.

The onboard PCH devices (IGE, AHCI, USB) never had an issue at all; only NVMe did. I'm guessing it's some kind of UEFI firmware bug?

@07151129

07151129 commented Oct 2, 2020

That's an extremely curious bug, thanks for suggesting a fix. I think force disabling RC6 by default in the FeatureControl dict of the framebuffer IORegistryEntry is a good immediate solution.

Were you able to isolate the issue just to a single key of this dictionary?

Worth mentioning that you can also disable render standby by passing the boot-arg forceRenderStandby=0.
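
With OpenCore, that would go into the usual boot-args entry in config.plist, something along these lines (a sketch only; merge with whatever boot-args value you already have):

<key>NVRAM</key>
<dict>
    <key>Add</key>
    <dict>
        <key>7C436110-AB2A-4BBB-A880-FE41995C9F82</key>
        <dict>
            <key>boot-args</key>
            <string>forceRenderStandby=0</string>
        </dict>
    </dict>
</dict>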

@0xfeedface-turbo
Author

Thanks for the tip on the boot-arg. I am pretty sure that's it.

It can take hours for the panic to happen, but when I set RenderStandby back to 1 I got a panic almost immediately. I have reverted the previous changes and am testing with just forceRenderStandby=0 right now, and it hasn't kernel panicked so far.

I am not sure what the power impact of this change is. This is a desktop system, but the same problem could be happening on laptops. One of the Linux posts mentions disabling coarse power gating as the better option. There is a CoarsePowerGatingSelect key, but I haven't deduced what the values mean yet.

@07151129

07151129 commented Oct 3, 2020

RenderStandby refers to RC6, the lowest-power idle render state. It has been notoriously buggy and required workarounds, both in Linux and Windows.

Coarse power gating is another mechanism used in GEN9 to transition Render and Media engines to sleep. The two appear to be independent in principle. The CoarsePowerGatingSelect bits 0 and 1 are used to enable Render and Media CPG, respectively. An older version of i915 used to disable Render CPG https://patchwork.kernel.org/patch/6193051/, but apparently it is now enabled along with RC6.
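
If that key is honoured the same way as RenderStandby, a FeatureControl override would presumably look like the fragment below. The value meanings are inferred from the bit layout just described and are not something I have verified on real hardware:

<!-- hypothetical override; bit 0 = Render CPG, bit 1 = Media CPG:
     0 = both off, 1 = Render only, 2 = Media only, 3 = both on -->
<key>CoarsePowerGatingSelect</key>
<integer>0</integer>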

@0xfeedface-turbo
Author

Thanks for the info, it has saved me a lot of time!

I did some testing with RenderStandby=1 and CoarsePowerGatingSelect=0 and I was actually able to get the same NVMe crash with the display ON for the first time. Do you know what bit 2 is used for? The default in the CFL FB kext is 4, and disabling that bit seems to make a difference.

Setting forceRenderStandby=0 in boot-args solves the crashes completely.

Intel Power Gadget reports that the IGP frequency never drops below 350 MHz, and total power consumption is approximately 1 W higher than with RenderStandby enabled.

I'm still at a loss as to why RC6 on the IGP would be affecting the NVMe at all, though.

@07151129

07151129 commented Oct 3, 2020

CoarsePowerGatingSelect=4 uses the value from the platform info struct at offset 0x58 (gPlatformInformationList, see IntelFramebuffer.bt) to configure CPG:

AppleIntelFramebufferController::getCPGControl
...
    // Read the CoarsePowerGatingSelect property injected into the personality.
    cpgsel = OSMetaClassBase::safeMetaCast(v3, OSNumber::metaClass);
    if ( cpgsel )
    {
      cpgsel = (cpgsel->vtbl->unsigned32BitValue)(cpgsel);
      if ( cpgsel != 4 )              // any value other than 4 is used verbatim
        goto LABEL_7;
      // Value 4: derive the setting from the platform info flags instead.
      this->CoarsePowerGatingSelect = 0;
      v4 = this->platformInfo->member22;
      cpgsel = (&dword_0 + 2);        // decompiler artifact, presumably the constant 2 (Media CPG only)
      if ( _bittest(&v4, 0x10u) )     // platform flag bit 16: Render CPG
      {
        this->CoarsePowerGatingSelect = 1;
        cpgsel = (&dword_0 + 3);      // presumably the constant 3 (Render + Media)
      }
      if ( _bittest(&v4, 0x11u) )     // platform flag bit 17: Media CPG
LABEL_7:
        this->CoarsePowerGatingSelect = cpgsel;
    }

It's a complete mystery why there is interference between GPU and PCI. If you can reproduce it on Linux with i915, then this could be reported to Intel.

@07151129

07151129 commented Oct 7, 2020

By the way, the value CSTS=0xffffffff also looks suspicious according to the spec; an all-ones read from a memory-mapped register usually means the controller has dropped off the bus entirely.

A similar bug in Linux: https://bugs.freedesktop.org/show_bug.cgi?id=108546. Apparently, it is a BIOS issue, although in that case intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0 did not help.

vit9696 added a commit to acidanthera/WhateverGreen that referenced this issue Oct 7, 2020
@vit9696
Contributor

vit9696 commented Oct 7, 2020

Thanks for your help! Added a comment to the WhateverGreen FAQ. The other FAQs will also need to be updated.

CC @Andrey1970AppleLife @khronokernel @PMheart

@Mateo1234454545

I added the forceRenderStandby=0 boot-arg as well, and the IGPU is now stuck at 0.3 GHz.

@malhal

malhal commented Nov 12, 2020

Maybe this is the state in which TRIM runs and crashes? Try sudo trimforce disable and reboot. If you re-enable it later, it is recommended to run Disk First Aid.

@blodt

blodt commented Sep 4, 2021

It's happening again on my machine after a month or so of no issues.

It's getting more consistent, too.

@malhal

malhal commented Sep 4, 2021

I haven't had this panic since I disabled TRIM

@blodt

blodt commented Sep 5, 2021

I haven't had this panic since I disabled TRIM

Will try that - thank you!

@Mateo1234454545

I haven't had this panic since I disabled TRIM

How did you disable TRIM?
I tried your command, but after a reboot NVMe TRIM is still enabled.
Maybe that command only works for SATA SSDs?

@1alessandro1

@Mateo1234454545

  • Ensure the ThirdPartyDrives kernel quirk is set to False
  • Try sudo trimforce disable
  • Set SetApfsTrimTimeout to 999, which is the minimal timeout (see the config.plist sketch below)
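
For reference, in an OpenCore config.plist the first and last of these would look roughly like this (a sketch only; keep the rest of your Kernel quirks as they are):

<key>Kernel</key>
<dict>
    <key>Quirks</key>
    <dict>
        <!-- ...other quirks unchanged... -->
        <key>SetApfsTrimTimeout</key>
        <integer>999</integer>
        <key>ThirdPartyDrives</key>
        <false/>
    </dict>
</dict>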

@blodt

blodt commented Sep 20, 2021

I ended up having to do a fresh Big Sur install and restore from Time Machine.

That all went great, and I'm back up and running with no freezes. I've applied @1alessandro1's tips/settings above in the hope that they cure it long term.

I don't think I will really know for a month or so, as that's how long the freezing issue took to reappear after the last time I did all this.

I'll report back in hopes of helping anyone else down the line.

Thank you all
