Kernel NULL pointer dereference with Kernel 6.8 #182
Comments
Is anyone else experiencing this problem?
Is it a reason not to move to 6.8 yet?
…-- -- --
John T Davis
***@***.***
On Jul 20, 2024, at 7:44 AM, makoONE ***@***.***> wrote:
I am delighted that kernel 6.8 is now supported.
Unfortunately, however, freezes occur after a short time under Proxmox with kernel 6.8.8-3-pve. The error entry in dmesg is:
[ 800.513559] BUG: kernel NULL pointer dereference, address: 000000000000057d
[ 800.513567] #PF: supervisor read access in kernel mode
[ 800.513569] #PF: error_code(0x0000) - not-present page
It would be great if the error could be fixed, thanks.
|
6.8.8-2-pve has been stable for me for over 24 hours (i9-13900H - Minisforum MS-01). I haven't tried 6.8.8-3-pve yet. |
I'm running the module with the 6.8.8-2-pve kernel without problems so far on Raptor Lake Refresh hardware (ASUS W680 + i9-14900K) and also on some Beelink Mini EQ 12 (Alder Lake N100). In the meantime, even all seven VFs were in use at the same time on the large machine for testing multiple Windows RDP sessions with 4K video decoding and 3D, plus Debian 12 and Ubuntu 22.04 with VA-API and 3D acceleration in parallel. But maybe I was just lucky. 🫤
This repo is currently based on lts-v6.1.26-linux-230504T201607Z from linux-intel-lts, which was merged by @zhtengw on May 8, 2023. Afterwards there seem to be mostly build fixes. Looking at the diff (
I've really just recently decided to give Intel SR-IOV a try (it was on the todo list for at least a year 😄) and attempted to fix the build for 6.8 kernels only because I feared that this repo might not get updated anymore (@strongtz seems to have stopped using it). As mentioned in PR #178, I'm also not a kernel/drm developer and lack any experience in debugging kernel modules efficiently, sorry. But even a veteran drm developer would need to reproduce and understand such issues in more detail (more logs, a stack trace, ...).
Just FYI (see below), there is already another issue on my hardware with an out-of-bounds (OOB) error on an Ubuntu 24.04 guest VM with kernel 6.8 if the module is initialized as minor 0 instead of minor 1. It is also worth mentioning that some memory leaks have been reported in GH-175.
Given that Intel is bringing SR-IOV support with its new xe driver, maybe in kernel 6.12 (or later), this repo might need to survive only for a few more months or ~1 year, I guess (Intel has delayed the release for quite some time now, but Arrow Lake will be released in Q4 2024). On the other hand, it'll take time until whatever xe-sriov kernel lands in new Linux distros. 😁 At least Proxmox seems to be quite fast with new kernel releases.
So is it really worth merging lts-v6.1.95-linux-240708T112901Z, and is there anyone who'd be able to do it?
OOB error on Ubuntu 24.04
|
I also get these freezes with a kernel NULL pointer dereference on the Proxmox kernels 6.8.8-2-pve and 6.8.4-3-pve when running a Win11 VM that uses the vGPU. |
I sadly won't be able to help you with debugging/updating the module. @JTR-Tech and I are using the same Raptor Lake-S UHD Graphics with device ID A780. The issue might be limited to your Alder Lake-P UHD Graphics with device ID 46A3, though my little N100 Alder Lake-N processor with UHD Graphics ID 46D1 does not seem to freeze. But it could also be caused by other factors such as memory limits or the VM driver. I'm using the latest driver 32.0.101.5762 for the Windows VMs with the following hardware settings: |
My VM configuration is largely the same and I am also using the latest Intel Graphics Driver 32.0.101.5762 in the Win11 VM. I just found out that the current i915-sriov-dkms state 2024.07.19 does not work with a 6.5 kernel (6.5.13-5-pve) anymore, same error kernel null pointer dereference. Probably there is no other choice but to go back to i915-sriov-dkms state before 2024.07.17, i.e. without kernel 6.8 support. |
Before 2024.07.17, kernel 6.5.13-5-pve was not working due to build issues. It was impossible to build and only 6.5.13-3-pve compiled. The build issues with 6.5.13-5-pve have been fixed with #178 / #179. I have tested and was running all kernels 6.5.13-3-pve, 6.5.13-5-pve and 6.8.4-3-pve and 6.8.8-2-pve without issues afterwards. Also please note that there was no functional change for Proxmox with #178 and #179 except for the firmware version. The only thing that changed was header includes for kernel 6.8.* and a fix for 6.5.13-5-pve that would have prevented building the module anyway. It is technically impossible that the module change from 6.1 to 2024.07.17 or 2024.07.19 has changed anything with the 6.5.13-3-pve kernel or the 6.5.13-5-pve kernel (which was broken anyway) unless it is related to the firmware version. What is your output of |
Output of [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.8.8-3-pve root=/dev/mapper/nab6--vg-root ro quiet mitigations=off intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 cpufreq.default_governor=powersave |
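When comparing setups across reports, it can help to pull the SR-IOV related options out of a kernel command line like the one above. The following is a minimal sketch, not part of the module or this repo: the function names are made up for illustration, and it assumes a simple whitespace-separated command line without quoted values.

```python
from __future__ import annotations

from pathlib import Path


def parse_cmdline(cmdline: str) -> dict[str, str]:
    """Split a kernel command line into key/value pairs.

    Flags without '=' (such as 'ro' or 'quiet') map to an empty string.
    Assumption: no quoted values containing spaces.
    """
    params: dict[str, str] = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        params[key] = value
    return params


def sriov_settings(cmdline: str) -> dict[str, str | None]:
    """Return the options discussed in this thread, or None if unset."""
    params = parse_cmdline(cmdline)
    wanted = ("intel_iommu", "iommu", "i915.enable_guc", "i915.max_vfs")
    return {name: params.get(name) for name in wanted}


def read_running_cmdline() -> str:
    """Read the command line of the running kernel (Linux only)."""
    return Path("/proc/cmdline").read_text()
```

On a Linux host, `sriov_settings(read_running_cmdline())` would show whether `i915.enable_guc` and `i915.max_vfs` are actually set for the booted kernel.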
If you want to test the previously allowed GuC firmware minor version (0 or 4) instead of the corrected minor version 9, you could just run the following in the root of your i915-sriov-dkms repo:
This will change After successful compilation and reboot the dmesg with errors due to firmware mismatch (click to expand)
|
I followed your advice with the modified minor version; dmesg says:
[ 5.863414] i915 0000:00:02.3: [drm] ERROR GT0: IOV: Unable to confirm version 1.9 (0000000000000000)
Unfortunately the machine continues to freeze.
[ 86.393495] BUG: kernel NULL pointer dereference, address: 000000000000057d |
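To make reports like these easier to compare, the NULL pointer dereference lines can be extracted from a saved kernel log. This is a rough sketch under the assumption that the log uses the timestamp format shown in this thread; the function name is hypothetical and the pattern only covers this one message.

```python
from __future__ import annotations

import re

# Matches lines such as:
# [  800.513559] BUG: kernel NULL pointer dereference, address: 000000000000057d
# The timestamp may or may not be padded after the opening bracket.
NULL_DEREF = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: kernel NULL pointer dereference, "
    r"address: (?P<addr>[0-9a-fA-F]+)"
)


def find_null_derefs(log_text: str) -> list[tuple[float, int]]:
    """Return (seconds since boot, faulting address) for each matching line."""
    return [
        (float(m.group("ts")), int(m.group("addr"), 16))
        for m in NULL_DEREF.finditer(log_text)
    ]
```

The input could be the text captured from `dmesg` or `journalctl -k`. An identical faulting address across crashes (here always 0x57d) suggests the same code path is failing each time, although only a full stack trace can confirm that.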
I'm running out of ideas but if even 6.5.13-3-pve is not working with
i915-sriov-dkms 6.1
  Available via:
  Minor GuC firmware check: 0 or 4, incorrect, hardcoded
  Kernels:
i915-sriov-dkms 2024.07.19
  Available via:
  Minor GuC firmware check: 9, correct, can be changed via
  Kernels:
So either your freezes are related to the 6.8 kernel or to the corrected firmware version in combination with your UHD Device ID 46A3 (i7-12650H). The UHD Device ID A780 (i9-13900H, i9-14900K) is reportedly working fine with all kernel variants. But the i915-sriov-dkms module version 2024.07.19 with If you still have freezes with this combination and it was working before with 6.1, it must be related to something else. |
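Since the discussion keeps coming back to which iGPU device ID is affected, here is a small sketch for reading the ID on a host and matching it against the IDs named in this thread. The lookup table only contains what participants reported above; the function names and the default PCI address `0000:00:02.0` (taken from the dmesg lines in this thread) are assumptions for illustration.

```python
from __future__ import annotations

from pathlib import Path

# Device IDs and names exactly as reported by participants in this thread.
REPORTED_IGPUS = {
    0xA780: "Raptor Lake-S UHD Graphics",
    0x46A3: "Alder Lake-P UHD Graphics",
    0x46D1: "Alder Lake-N UHD Graphics",
}


def parse_device_id(sysfs_value: str) -> int:
    """Convert the content of a sysfs 'device' file (hex, e.g. 0x46a3) to an int."""
    return int(sysfs_value.strip(), 16)


def describe_igpu(sysfs_value: str) -> str:
    """Return the device ID plus the name used in this thread, if any."""
    device_id = parse_device_id(sysfs_value)
    name = REPORTED_IGPUS.get(device_id, "not mentioned in this thread")
    return f"{device_id:04X}: {name}"


def read_igpu_device_id(pci_address: str = "0000:00:02.0") -> str:
    """Read the raw device ID from sysfs (Linux only, assumed PCI address)."""
    return Path(f"/sys/bus/pci/devices/{pci_address}/device").read_text()
```

On a Linux host, `describe_igpu(read_igpu_device_id())` gives a line that can be pasted into a report next to the kernel version.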
Update for the sake of completeness: Proxmox crashed around the 30-hour mark. It was running headless, so I'm not sure what the error was (nothing showing in the journal), which is roughly in line with the behavior I've seen on 8.5 PVE kernels. 7 VFs were applied, with only 1 assigned to a Windows 11 Pro guest using Intel Xe driver version 31.0.101.5590. The VM was idle with Windows Sleep mode disabled. |
I have successfully finished a 24-hour 3D + 4K video test running in 2 Windows 11 VMs in parallel without issues on my Proxmox VE 8.2 host with kernel 6.8.8-2-pve (I don't want to see the next electricity bill). And the host with an ASUS W680 + i9-14900K has been running absolutely fine for 6 days now. Really no problems at all. It needs a new BIOS thanks to Intel and must therefore be rebooted now.
My dmesg got filled with a lot of messages like i915 0000:00:02.0: VF{i} FLR in the meantime. That seems to be some Function Level Reset related to SR-IOV. Not sure if this is OK, but at least I could not see any effect.
The small N100 has also been running with the loaded module in kernel 6.8.8-2 for several days now, but it was just idling without any VF in use.
Not sure why this is different for my two machines. I'm running both machines with the kernel command line portion split_lock_detect=off i915.enable_fbc=1 i915.enable_guc=3 i915.max_vfs=7 ignore_msrs=1 report_ignored_msrs=0 - not sure if some options are related? At least the i9-13900H should be very similar to mine. Don't know, sorry.
I will post an update whenever I encounter any kind of issue with the 6.8 kernel.
|
Thanks for taking the time to test this and post an update. I’m now confident enough to try making the switch to 6.8 this weekend.
(I need to make sure the latest version of kernel 6.8.x is actually installed correctly on Proxmox now. Since I’ve had 6.5.13-5 pinned, I get scary warnings every time it does a system update about a dpkg-configure failure.)
“My dmesg got filled with a lot of messages like i915 0000:00:02.0: VF{i} FLR in the meantime. That seems to be some Function Level Reset related to SR-IOV. Not sure if this is OK <https://gist.github.com/scyto/e4e3de35ee23fdb4ae5d5a3b85c16ed3?permalink_comment_id=4714186#gistcomment-4714186> but at least I could not see any effect.”
I see these all the time on a working Proxmox 8.2.x (6.5.13-5)-based install on an HP Elite Mini 600 G9. It isn’t associated with any sort of performance glitches or issues on my system. I’ve just been ignoring it.
…-- -- --
John T Davis
***@***.***
On Jul 23, 2024, at 5:25 PM, pasbec ***@***.***> wrote:
I have successfully finished a 24-hour 3D + 4K video test running in 2 Windows 11 VMs in parallel without issues on my Proxmox VE 8.2 host with kernel 6.8.8-2-pve (I don't want to see the next electricity bill). And the host with an ASUS W680 + i9-14900K has been running absolutely fine for 6 days now. Really no problems at all. It needs a new BIOS thanks to Intel <https://www.youtube.com/watch?v=wkrOYfmXhIc> and must therefore be rebooted now.
My dmesg got filled with a lot of messages like i915 0000:00:02.0: VF{i} FLR in the meantime. That seems to be some Function Level Reset related to SR-IOV. Not sure if this is OK <https://gist.github.com/scyto/e4e3de35ee23fdb4ae5d5a3b85c16ed3?permalink_comment_id=4714186#gistcomment-4714186> but at least I could not see any effect.
The small N100 has also been running with the loaded module in kernel 6.8.8-2 for several days now, but it was just idling without any VF in use.
Not sure why this is different for my two machines. I'm running both machines with the kernel command line portion split_lock_detect=off i915.enable_fbc=1 i915.enable_guc=3 i915.max_vfs=7 ignore_msrs=1 report_ignored_msrs=0 - not sure if some options are related? At least the i9-13900H should be very similar to mine. Don't know, sorry.
I will post an update whenever I encounter any kind of issue with the 6.8 kernel.
|
I carried out further tests with my previous setup and came to the following conclusion:
[ 527.228951] ------------[ cut here ]------------
Is it possible to find out the cause of this or what is the best way to fix the errors? |
Oh what a pity, a few hours later another freeze with kernel 6.8.8-3-pve. |
Sorry to hear that. You mentioned initially that kernel 6.5 was also working before. Maybe it is worth checking whether kernel 6.5.13-3-pve is stable with the latest versions 2024.07.19 or 2024.07.24 (I haven't tried the latter) of the dkms module. If not, you should test kernel 6.5.13-3-pve with the old version |
I too am unable to get an i7-12700K to work with a 6.8.12 kernel. dmesg snippet
|
For what it's worth, I'm getting the same on 6.6.52 immediately after I run my window manager
edit - tested 6.11.0, same thing. FWIW, I'm using an i5-1240P, so 12th gen Xe Graphics. |
This patch worked for me:
diff --git a/drivers/gpu/drm/i915/display/intel_atomic_plane.c b/drivers/gpu/drm/i915/display/intel_atomic_plane.c
|
The same problem occurs, pve kernel 6.5.13-5, i915-sriov-dkms 2024.7.17. |
I use Proxmox on my home server with a minimal Xfce4 desktop environment and connect from it via Remmina/RDP to a Windows 11 VM that also uses the server's iGPU via SR-IOV. In this scenario I continued to experience freezes with the dkms module from here on all kernel versions after 6.2, and the system had to be reset each time.
I have been using the dkms module from the repo listed below for a few days now and have not experienced any more freezes since then. I am using the current official Proxmox kernel 6.8.12-4-pve, but there have been no problems so far with the new opt-in kernel 6.11.0-1-pve either.
https://github.com/bbaa-bbaa/i915-sriov-dkms
Only when starting the desktop environment on the server does a dmesg error entry appear in the form:
However, this does not seem to have any effect or I have not noticed it yet. |
Same as you, it works fine for me. So what is the reason? |
For the record, I have the same issue on plain Debian host with
In my case this fixed the immediate crash, but this seems to be just hiding the problem, not fixing it.
|
I decided to build a 6.5.10 kernel for this device and everything works flawlessly. I was able to create only one virtual device, but it works fine with a Windows 10 VM and Looking Glass. |
Hey, with latest changes pulled (I see support for 6.12) now I'm on 6.12.9 and it seems to work stable so far :) |
I am delighted that kernel 6.8 is now supported.
Unfortunately, however, freezes occur after a short time under Proxmox with kernel 6.8.8-3-pve. The error entry in dmesg is:
[ 800.513559] BUG: kernel NULL pointer dereference, address: 000000000000057d
[ 800.513567] #PF: supervisor read access in kernel mode
[ 800.513569] #PF: error_code(0x0000) - not-present page
My Proxmox host is a Minisforum NAB6 with an i7-12650H processor and its integrated Intel UHD Graphics for 12th Gen Intel Processors.
I had no problems of this kind with kernel versions 6.2 and 6.5.
It would be great if the error could be fixed, thanks.