WL Vulkan apps are broken with PRIME #72

Closed
TheComputerGuy96 opened this issue Nov 3, 2022 · 103 comments

Comments

@TheComputerGuy96

Hello,

This is sort of a continuation of #41 but for Vulkan apps/games

So Vulkan apps (like PPSSPP or vkcube) fail to work with Wayland on my PRIME setup:

$ prime-run vkcube-wayland 
Selected GPU 0: NVIDIA GeForce GTX 1650 Ti, type: DiscreteGpu 
[destroyed object]: error 7: failed to import supplied dmabufs: Arguments are inconsistent (for example, a valid context requires buffers not supplied by a

As you can see, it's identical to the OpenGL error (which has already been fixed). I also checked the Wayland logs, and what is probably an NVIDIA-specific modifier is present, so the linear modifier needs to be used somehow.

Running PPSSPP and vkcube through XWayland avoids the problem (by setting the SDL_VIDEODRIVER=x11 variable or using the X11 vkcube executable).

And now time for the all important system info 🐸 (although it's kinda redundant here):
Distro: Arch Linux
egl-wayland version: 1.1.11 (Git version also fails)
Mesa version: 22.2.1
Driver version: 515.76
Kernel version: 6.0.6
Compositor: mutter 43.0 (through an unofficial repo)
CPU: Ryzen 5 4600H
GPU: Renoir iGPU + GTX 1650 Ti Mobile (as I said a PRIME setup)

@erik-kz

erik-kz commented Nov 3, 2022

Thanks for the report. I suspect this has the same root-cause as the issue you reported earlier #69, namely that we're passing a buffer to the AMD GPU which isn't aligned to 256 bytes.

The relevant code-path in the driver is used by both EGL and Vulkan applications on Wayland, so the same bug would be present in both cases.
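
To make the alignment requirement concrete, here is a minimal sketch of rounding a value up to a 256-byte boundary. Only the 256-byte figure comes from the comment above; the helper names and the choice of row pitch as the aligned quantity are assumptions for illustration, since the thread does not say which quantity the driver aligns.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (value + alignment - 1) & ~(alignment - 1)


def is_aligned(value: int, alignment: int) -> bool:
    """Return True if value is already a multiple of alignment."""
    return value % alignment == 0


# Hypothetical example: 4 bytes per pixel, pitch = width * 4.
pitch_960 = 960 * 4    # 3840 bytes, already a multiple of 256
pitch_1000 = 1000 * 4  # 4000 bytes, would need padding
padded_1000 = align_up(pitch_1000, 256)
```

A buffer whose pitch already satisfies `is_aligned` needs no padding; anything else would have to be allocated with the padded pitch.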

@TheComputerGuy96
Author

TheComputerGuy96 commented Nov 3, 2022

@erik-kz The issue in this case is that NVIDIA-specific format modifiers are being used on PRIME setups (the same issue as #41), so it's different from #69.

That causes the Mesa driver to fail because it doesn't understand the NVIDIA-specific modifiers (so the linear modifier is the only reliable option).

You've already fixed the OpenGL path with 866a801#diff-8965d13061a6bcaea4358bcc9c757a91fbd9b3cc16fcb3bf1dd579c667fc5528R1269 but Vulkan is somehow different 🤔

Compare these two pairs of WAYLAND_DEBUG lines (the first pair is OpenGL on my PRIME setup and the second is Vulkan on the same setup):

[1084918.093]  -> zwp_linux_buffer_params_v1@56.add(fd 52, 0, 0, 3840, 0, 0)
[1084918.102]  -> zwp_linux_buffer_params_v1@56.create_immed(new id wl_buffer@57, 960, 544, 875713112, 0)
[1005520.777]  -> zwp_linux_buffer_params_v1@49.add(fd 73, 0, 0, 3840, 50331648, 6316052)
[1005520.793]  -> zwp_linux_buffer_params_v1@49.create_immed(new id wl_buffer@58, 960, 544, 875713089, 0)

Notice the non-zero values for layout modifiers in the Vulkan section?
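
The values can be decoded by hand. In the linux-dmabuf protocol, the last two arguments of `add` are the high and low 32 bits of the 64-bit DRM format modifier, and the fourth argument of `create_immed` is a DRM fourcc format code. The sketch below assumes the vendor IDs defined in the kernel's drm_fourcc.h; the function names are made up for illustration.

```python
DRM_FORMAT_MOD_LINEAR = 0

# Vendor IDs as defined in drm_fourcc.h (only a few are listed here).
VENDORS = {0x00: "NONE", 0x01: "INTEL", 0x02: "AMD", 0x03: "NVIDIA"}


def combine_modifier(hi: int, lo: int) -> int:
    """Join the two 32-bit protocol arguments into one 64-bit modifier."""
    return (hi << 32) | lo


def modifier_vendor(modifier: int) -> str:
    """The vendor is stored in the top 8 bits of the modifier."""
    vendor = modifier >> 56
    return VENDORS.get(vendor, hex(vendor))


def fourcc_to_str(code: int) -> str:
    """Decode a DRM fourcc code into its four ASCII characters."""
    return code.to_bytes(4, "little").decode("ascii")


opengl_modifier = combine_modifier(0, 0)
vulkan_modifier = combine_modifier(50331648, 6316052)
```

With the numbers from the log, the OpenGL buffer uses the linear modifier, while the Vulkan buffer carries a modifier whose vendor field is NVIDIA's, which matches the import failure on the Mesa side.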

@erik-kz

erik-kz commented Nov 3, 2022

Ah I see, thanks for the clarification. Yes, this does indeed appear to be a different issue. We use a kind of back-door interface into egl-wayland for Vulkan which has the effect of bypassing the linear buffer allocation. Like the other bug, though, the fix will need to be on the driver side.

@TheComputerGuy96
Author

Still present in 525.60.11 :(

@lpslucasps

I have the same issue with an Intel + NVIDIA setup. Vulkan apps with native Wayland support crash at startup when using the dGPU. Running vkcube-wayland on my dGPU gives me:

prime-run vkcube-wayland
Selected GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU, type: DiscreteGpu
[destroyed object]: error 7: importing the supplied dmabufs failed

RetroArch and Ryujinx gave me similar results, crashing at startup when I try to run them with prime-run. Running both apps through XWayland works like a charm, though, as does using the OpenGL API instead of Vulkan.

My system info:

Distro: Arch Linux
egl-wayland version: 1.1.11
Mesa version: 22.2.3-1
Driver version: 525.60.11
Kernel version: 6.0.10
Compositor: kwin 5.26.4
CPU: Intel Core i5-12500H
GPU: Mesa Intel(R) Graphics (ADL GT2) iGPU + RTX 3050 Mobile

@Fothsid

Fothsid commented Dec 20, 2022

> You've already fixed the OpenGL path with 866a801#diff-8965d13061a6bcaea4358bcc9c757a91fbd9b3cc16fcb3bf1dd579c667fc5528R1269 but Vulkan is somehow different 🤔

Wait, are OpenGL applications on Wayland with PRIME offload supposed to work? The windows for the applications never appear on my end, even if they are seemingly running. I'm on 2060 Mobile with nvidia 525.60.11, egl-wayland 1.1.11 and mesa 22.2.4 right now. Running them without PRIME offload works obviously, but it runs on the integrated Intel GPU.

@erik-kz

erik-kz commented Dec 20, 2022

OpenGL applications should work with PRIME offload, yes. Are you using an Intel or AMD integrated GPU? Note that the 525 driver has a bug which prevents it working with AMD (see #69). This will be fixed in 530.

@Fothsid

Fothsid commented Dec 20, 2022

> OpenGL applications should work with PRIME offload, yes. Are you using an Intel or AMD integrated GPU? Note that the 525 driver has a bug which prevents it working with AMD (see #69). This will be fixed in 530.

Thanks for the fast reply!
I'm using an Intel integrated GPU.

Running __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json prime-run eglgears_wayland

results in the application running without any window appearing (it does seem to 'exist' to GNOME, though: it appears in the list of applications when Alt+Tab'ing), and judging by nvidia-smi's output, it is indeed using the NVIDIA GPU:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   43C    P0    15W /  N/A |      5MiB /  6144MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14975      G   eglgears_wayland                    3MiB |
+-----------------------------------------------------------------------------+

Here's the console output with WAYLAND_DEBUG=1, if that can help: https://gist.github.com/Fothsid/c58260a7172c9418c3443e216162bc6b

@erik-kz

erik-kz commented Dec 20, 2022

Are you setting __NV_PRIME_RENDER_OFFLOAD=1? It shouldn't be necessary to set __EGL_VENDOR_LIBRARY_FILENAMES.

@Fothsid

Fothsid commented Dec 20, 2022

prime-run in my case sets __NV_PRIME_RENDER_OFFLOAD=1.
Running specifically __NV_PRIME_RENDER_OFFLOAD=1 eglgears_wayland results in the same thing.
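
For reference, setting the variable for a single launch can be sketched as follows. Only `__NV_PRIME_RENDER_OFFLOAD=1` comes from this thread; the helper name `offload_env` is hypothetical, and a real prime-run wrapper may set additional variables that are not shown here.

```python
import os


def offload_env(base=None):
    """Return a copy of the environment with PRIME render offload requested."""
    env = dict(os.environ if base is None else base)
    env["__NV_PRIME_RENDER_OFFLOAD"] = "1"
    return env


# Launching a program with it would look like this (not executed here):
#   import subprocess
#   subprocess.run(["eglgears_wayland"], env=offload_env())
```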

image

@erik-kz

erik-kz commented Dec 20, 2022

This appears to be a bug in eglgears_wayland. It calls poll() on the Wayland display fd without first calling wl_display_prepare_read https://gitlab.freedesktop.org/mesa/demos/-/blob/main/src/egl/eglut/eglut_wayland.c#L279

Have you tried any other OpenGL applications?

@Fothsid

Fothsid commented Dec 20, 2022

I tried PCSX2. It just hangs on OpenGL with __NV_PRIME_RENDER_OFFLOAD=1, with the main rendering area not even clearing.

UPDATE: just tried PPSSPP and it seems to work. Not sure what the problem with PCSX2 was.
UPDATE2: never mind, PPSSPP was running under XWayland there.
UPDATE3: running PPSSPP with SDL_VIDEODRIVER=wayland __NV_PRIME_RENDER_OFFLOAD=1 results in the same issue that eglgears_wayland had.

@Fothsid

Fothsid commented Dec 21, 2022

Sorry. Turns out I had GRUB configured incorrectly, so DRM mode setting wasn't enabled. Once I got proper nvidia_drm.modeset=1 in GRUB, it started to work.
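
A quick way to check whether the setting actually took effect is to read the module parameter from sysfs. The sketch below assumes the usual location for module parameters, /sys/module/nvidia_drm/parameters/modeset, where a boolean parameter reads back as Y or N; the function name is made up for illustration.

```python
from pathlib import Path

# Assumed sysfs location of the nvidia_drm "modeset" parameter.
MODESET_PARAM = Path("/sys/module/nvidia_drm/parameters/modeset")


def modeset_enabled(param_path=MODESET_PARAM):
    """Return True or False for the parameter, or None if it cannot be read
    (for example because the nvidia_drm module is not loaded)."""
    try:
        value = Path(param_path).read_text().strip()
    except OSError:
        return None
    return value in ("Y", "1")
```

A result of `None` means the file could not be read, which is worth distinguishing from an explicit `False`.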

@erik-kz

erik-kz commented Dec 21, 2022

Oh, cool. We do plan to make modeset=1 the default in the near future. It's just that right now it can cause problems for some workstation SLI configurations.

@kanashimia

Possibly related issue described in NVIDIA/open-gpu-kernel-modules#317 (comment)

I wonder why the offload env variable changes the behaviour, so that the program hangs instead of crashing.
My assumption would be that the only thing the NV Vulkan layer does is change GPU priority by reordering or filtering devices, but the env variable seems to somehow modify the behaviour of the swapchain or something. I assume that is not related to the layer?
Does it modify some EGL-related code outside of Vulkan and change the behaviour that way?
What does the offload env variable do exactly?

@erik-kz

erik-kz commented Dec 21, 2022

Our GPUs render using a hardware-specific pixel layout which Intel and AMD GPUs don't understand. When __NV_PRIME_RENDER_OFFLOAD=1 is set, after rendering each frame we will convert it to a linear layout so that the integrated GPU can display it. The code to do that is wired up for OpenGL and Vulkan X11 applications, and OpenGL Wayland applications, but not for Vulkan Wayland applications.
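
As a rough illustration of what "converting to a linear layout" means, here is a toy sketch. The actual hardware layout is not described in this thread, so the 2x2 block tiling below is purely an assumption chosen to keep the example small; it is not the layout the driver uses.

```python
def detile(tiled, width, height, tile_w=2, tile_h=2):
    """Convert a toy block-tiled pixel list to row-major (linear) order.

    Assumed toy layout: tiles are stored one after another, left to right
    and then top to bottom, and pixels inside each tile are row-major.
    """
    if width % tile_w or height % tile_h:
        raise ValueError("image size must be a multiple of the tile size")
    if len(tiled) != width * height:
        raise ValueError("pixel count does not match the image size")
    linear = [None] * (width * height)
    tiles_per_row = width // tile_w
    for i, pixel in enumerate(tiled):
        tile_index, within = divmod(i, tile_w * tile_h)
        tile_y, tile_x = divmod(tile_index, tiles_per_row)
        in_y, in_x = divmod(within, tile_w)
        x = tile_x * tile_w + in_x
        y = tile_y * tile_h + in_y
        linear[y * width + x] = pixel
    return linear
```

A consumer that only understands the linear layout would read the tiled buffer as scrambled pixels, which is why the conversion step has to run before the frame is handed to the other GPU.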

For Vulkan applications, __NV_PRIME_RENDER_OFFLOAD=1 will also enable the NV_optimus layer as you mentioned, which changes the order that GPUs are enumerated so that the NVIDIA GPU will appear first.
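
The effect of that reordering can be sketched as a stable sort that moves NVIDIA devices to the front. How the layer implements this internally is not stated here; the device dictionaries and the function name are hypothetical, and only the PCI vendor IDs are real.

```python
NVIDIA_VENDOR_ID = 0x10DE  # PCI vendor ID for NVIDIA
INTEL_VENDOR_ID = 0x8086   # PCI vendor ID for Intel


def nvidia_first(devices):
    """Return devices with NVIDIA GPUs first, otherwise keeping the order.

    sorted() is stable, so devices with the same key keep their
    original relative order.
    """
    return sorted(devices, key=lambda d: d["vendor_id"] != NVIDIA_VENDOR_ID)


enumerated = [
    {"name": "integrated", "vendor_id": INTEL_VENDOR_ID},
    {"name": "discrete", "vendor_id": NVIDIA_VENDOR_ID},
]
reordered = nvidia_first(enumerated)
```

An application that simply picks the first enumerated device then ends up on the discrete GPU.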

@Zingam

Zingam commented Dec 31, 2022

Is this related to vkvia and vulkaninfo from LunarG's SDK being unable to detect the discrete NVIDIA GPU on Ubuntu 22.04/Wayland? Only the Intel GPU (GPU0) and llvmpipe (GPU1) are detected.
I'd expect that the application is able to select the desired GPU like on Windows.

@dagbdagb

dagbdagb commented Feb 26, 2023

> Our GPUs render using a hardware-specific pixel layout which Intel and AMD GPUs don't understand. When __NV_PRIME_RENDER_OFFLOAD=1 is set, after rendering each frame we will convert it to a linear layout so that the integrated GPU can display it. The code to do that is wired up for OpenGL and Vulkan X11 applications, and OpenGL Wayland applications, but not for Vulkan Wayland applications.

Thank you for this ELI5-level answer. Are you permitted to let us know if this is in the works? Or if it is tagged as WONTFIX internally? @erik-kz

@TheComputerGuy96
Author

Still present in 530.30.02 :(

@erik-kz

erik-kz commented Mar 6, 2023

This is not a WONTFIX; we do intend to get it working. And while I can't provide an ETA right now, it will definitely be before we drop Pascal (10-series) support. That won't happen for quite a while: we haven't even dropped Maxwell (9-series) support yet.

@AnErrupTion

AnErrupTion commented Apr 8, 2023

I have the exact same issue on Void Linux with driver 525.105.17. I've encountered this issue on mpv and the native Linux build of BeamNG.drive. Running them via XWayland fixes it, but still, it would be nice to be able to run them on native Wayland...

@flukejones

This is becoming quite a serious issue for many people. Can this please be made a priority?

@VarLad

VarLad commented Jun 4, 2023

@TheComputerGuy96 Does this work in the latest beta driver?

@erik-kz

erik-kz commented Jun 8, 2023

This feature has been implemented by @dkorkmazturk. It will be available in the next major driver version, 545 (not the recently released 535 beta).

@Gigas002

After the 1.1.12 release, it's happening again for some OpenGL apps as well (e.g. mpv: mpv-player/mpv#11774)

@flukejones

> This feature has been implemented by @dkorkmazturk. It will be available in the next major driver version, 545 (not the recently released 535 beta).

Is there a known timeline for this version?

@erik-kz

erik-kz commented Nov 18, 2023

@Dirleye thanks for the information! If you could please upload the file generated by our nvidia-bug-report.sh script that would be helpful. We're still trying to figure out why only certain systems seem to be experiencing this bug, so the more data we have, the better.

@Dirleye

Dirleye commented Nov 18, 2023

@erik-kz of course, no problem.
Attached is the log generated with nvidia-bug-report.sh after turning on the system, using startx as root in a different tty to apply an overclock (though the bug isn't affected by this either way), running vkcube-wayland for about two minutes and then generating the report.

I occasionally checked nvidia-smi to see the clock speeds and power state which were all glued to their lowest throughout.
Vkcube-wayland's window was permanently marked as "not responding", though the cube was spinning at full speed. It would freeze for a few seconds each time focus was swapped between it and the terminal.

nvidia-bug-report.log.gz

@kanashimia

Here's mine: nvidia-bug-report.log.gz

@erik-kz

erik-kz commented Nov 20, 2023

Omg, I finally managed to reproduce the vkcube-wayland hang with a different GPU (Quadro P620). Not exactly sure what the cause is yet, but at least now it's possible to debug. What does seem immediately clear is that it's not a power management issue; it actually looks like it's related to a new synchronization mechanism that was introduced in 545. I shall update with further progress. Thanks so much to everyone who provided logs, etc. That definitely helped narrow down the problem.

@jrelvas-ipc

This appears to be fixed with the 545.29.06 driver release!
image

Here's Half-Life 2, running on Wayland with Vulkan!
image

@Dirleye

Dirleye commented Nov 26, 2023

Unfortunately still broken for me.

@erik-kz

erik-kz commented Dec 5, 2023

A quick update - we have figured out what is causing the issue. It did turn out to be a driver bug affecting pre-Turing GPUs. The fix is targeted for the next driver release, 550, early next year.

@oscarbg

oscarbg commented Dec 5, 2023

@erik-kz does it mean that post Turing GPUs are fixed now in 545.29.06 and don't need a future 550 driver?

@erik-kz

erik-kz commented Dec 5, 2023

> does it mean that post Turing GPUs are fixed now in 545.29.06 and don't need a future 550 driver?

Vulkan Wayland applications should be working correctly with 545.29.06 on Turing-or-later GPUs, including with PRIME render offload.

The issue I was referring to in my previous comment was the extremely low framerates (0.2FPS) that several users had reported. All of those users had Pascal GPUs.

@vasishath

vasishath commented Dec 6, 2023 via email

@erik-kz

erik-kz commented Dec 6, 2023

The 545 driver was the first version to include support for sync_files, https://www.kernel.org/doc/Documentation/sync_file.txt, a new synchronization mechanism. The bug was in our implementation of that feature. 545 also included a fairly extensive re-write of the Vulkan Wayland WSI code, and part of that made use of the new sync_file functionality. That's why Vulkan Wayland apps were affected by the bug.

A possible work-around would be to extract the driver installer and edit the file nvidia-drm-drv.c. In the nv_drm_get_dev_info_ioctl function delete the following block

#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)
            params->supports_sync_fd = true;
#endif /* defined(NV_SYNC_FILE_GET_FENCE_PRESENT) */

This will disable sync_file support

@kanashimia

> A possible work-around would be to extract the driver installer and edit the file nvidia-drm-drv.c. In the nv_drm_get_dev_info_ioctl function delete the following block

I can confirm that the workaround works, but why delete the whole block? It seems that deleting the code inside the macro is enough.

Here is a patch for NixOS users:

hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.stable.overrideAttrs (old: {
  postPatch = ''
    substituteInPlace ./kernel/nvidia-drm/nvidia-drm-drv.c --replace \
      '#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)' \
      '#if 0'
  '';
});
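
For anyone patching the extracted sources by hand rather than through Nix, the same text substitution can be sketched in Python. The function name is hypothetical, and the sketch replaces every occurrence of the guard in the string it is given, so it should be pointed only at the contents of nvidia-drm-drv.c.

```python
GUARD = "#if defined(NV_SYNC_FILE_GET_FENCE_PRESENT)"


def disable_sync_file_block(source: str) -> str:
    """Turn the sync_file guard into '#if 0' so the guarded code is compiled out."""
    if GUARD not in source:
        raise ValueError("guard not found; the source layout may have changed")
    return source.replace(GUARD, "#if 0")


# Usage on an extracted driver tree (path is relative to the extracted
# installer, as in the Nix patch above; not executed here):
#   from pathlib import Path
#   path = Path("kernel/nvidia-drm/nvidia-drm-drv.c")
#   path.write_text(disable_sync_file_block(path.read_text()))
```

Raising when the guard is missing avoids silently building an unpatched module if a later driver version changes the code.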

@erik-kz

erik-kz commented Dec 6, 2023

> It seems that deleting code inside the macro is enough.

Yeah, that's true.

Also, I must ask that anyone who uses this work-around please promise to revert it once 550 is released. In the future more things will depend on sync_file support and so having it disabled will almost certainly cause problems.
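
One way for packagers to honour that request is to gate the patch on the driver's major version. This is only a sketch of the idea: the function name is made up, and it assumes version strings of the form seen in this thread (for example 545.29.06), with the bug limited to the 545 series as described above.

```python
def workaround_applies(driver_version: str) -> bool:
    """Return True only for the 545 series, where the sync_file bug exists.

    Earlier series have no sync_file support to disable, and 550 and
    later are expected to contain the fix.
    """
    major = int(driver_version.split(".")[0])
    return major == 545
```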

polter-rnd added a commit to polter-rnd/nvidia-kmod that referenced this issue Dec 7, 2023
It should be removed for driver v550 and later!
For more details, see NVIDIA/egl-wayland#72 (comment)

Signed-off-by: Pavel Artsishevsky <polter.rnd@gmail.com>
@Vincent392

Vincent392 commented Dec 15, 2023

> This appears to be fixed with the 545.29.06 driver release!
> image
>
> Here's Half-Life 2, running on Wayland with Vulkan!
> image

Well, I'm going to have to test Portal.
When I can.

edit 13:13 GMT:
Just to be safe that other games work too, I'll check Skyrim SE via Proton and Half-Life: Blue Shift.

@kanashimia

I tested the NVIDIA beta driver 550.40.07, and vkcube-wayland now works correctly without any patches. I think this issue can now be closed.

@erik-kz

erik-kz commented Feb 11, 2024

Thanks for confirming. Closing the issue.

@ghost

ghost commented Mar 1, 2024

I don't know if the issue should be re-opened, but starting with the 550 beta driver, and even with the 550 release driver, WL Vulkan apps have started crashing again (vkcube-wayland segfaults and emulators crash too). Reverting to 545 makes them work fine. My laptop has a MUX switch, so I could theoretically disable the Intel GPU and run only the NVIDIA one, but if I do that I can't switch my screen to 240Hz: I get a black screen with rectangular lines that render a small part of the desktop, and it glitches out until I revert to 60Hz. Both issues occur only with the 550 series (beta and release); 545 is fine.

Specs : Intel Core i7-13620H 10C/16T 4.9GHz.
RAM : 64GB DDR5 4800MHz.
GPU : Intel HD Graphics for 13th Gen CPUs & Nvidia GeForce RTX 4050 Laptop GPU 6GB.

EDIT : After installing egl-wayland and enabling nvidia-drm.modeset=1 it works now... I thought nvidia-drm.modeset=1 was enabled by default now...

@erik-kz

erik-kz commented Mar 1, 2024

> I thought nvidia-drm.modeset=1 was enabled by default now...

Not yet.
