NVIDIA: Driver installation fails with error: Running %post for akmod-nvidia #286
Comments
Thanks for the detailed report. I'll re-read everything later but as a quick workaround, you can try setting the AllowedCPUs option in the rpm-ostreed.service systemd unit to limit the number of cores available during all rpm-ostree operations: https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#AllowedCPUs= |
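One way a setting like this could be applied is a systemd drop-in for the unit. The sketch below only prints the drop-in text so it can be inspected first; the CPU range `0-7`, the file name `allowed-cpus.conf`, and the helper name are assumptions for illustration, not values taken from this thread.

```shell
# Hypothetical helper: print a drop-in that restricts a unit to the given CPUs.
# Usage: make_allowed_cpus_dropin CPU_LIST   (e.g. "0-7" or "0,2,4")
make_allowed_cpus_dropin() {
    cpus="$1"
    printf '[Service]\nAllowedCPUs=%s\n' "$cpus"
}

# Example of applying it (needs root; the paths are the usual drop-in location):
#   sudo mkdir -p /etc/systemd/system/rpm-ostreed.service.d
#   make_allowed_cpus_dropin 0-7 | \
#       sudo tee /etc/systemd/system/rpm-ostreed.service.d/allowed-cpus.conf
#   sudo systemctl daemon-reload
#   sudo systemctl restart rpm-ostreed.service
```

Note that this limits which CPUs the service may run on; it does not by itself change how many parallel jobs a build decides to start.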
I have this issue as well. At some point in the past, the "open file limit" workaround seemed to work for me (https://bugzilla.redhat.com/show_bug.cgi?id=1901218#c1), but it doesn't seem to anymore. I'm unsure whether it was a red herring (I just got lucky and the kmod build happened to be sequenced in a way that didn't fail to compile), or whether recent changes make it fail even with this change. I tried to set some rpmbuild macros and vars (like edit: I'm pretty sure this is #106 too. |
Thanks @travier , I'd love to test out your workaround because it's much more convenient than what I was doing. On my Silverblue install, I find that unit file at How can I add the suggested setting, e.g. |
@jbirch-atlassian I think you are right about #106
I also tried the open file limit suggestion on that bugzilla, but it didn't work for me either. |
Use |
Took it for a spin — no dice. I can see that |
Then I would recommend that you report that against the NVIDIA package on RPM Fusion Bugzilla. |
It looks like it's been posted there a few times and marked as invalid. https://bugzilla.rpmfusion.org/show_bug.cgi?id=5851 On bugzilla 6031 the OP was using rawhide kernel so it was dismissed. I am not using rawhide but I do see the same error (I think) as described in bugzilla 6031 and perhaps 5851. It's suggested there that:
I read this bug report before posting here. Having seen the problem on a couple of separate bug reports there, each marked as invalid, I decided to post here. @travier @jbirch-atlassian What are your thoughts on moving the issue to rpm-ostree? Do you think it's possible or even likely that the problem/solution could be there? Or should I make the bug report again over on rpmfusion's bugzilla? The |
Truthfully, I only have opinions, and they're uninformed of how Silverblue-the-project, rpm-ostree-the-tool, kmod-the-pattern, and Fedora-and-RPMFusion-and-NVidia-the-entities interact and own their components. But here's my best guess based on my personal experience. Fedora 35 and 36 work with To be honest, I think every party involved here "cares" — I just can't hazard a guess as to where the root cause lies (or miscommunication between components, or incorrect assumptions, or whatever).
So I dunno — my best guess is that this issue should remain open for people to see; right now if you install Silverblue on a 5950x or something with an NVidia GPU, it's likely that you're going to have a bad time if that's also a relatively new or expensive GPU. But ultimately the thing that's throwing the issue is the There are mitigations that could be added to the out-of-the-box configuration, to only use RPMFusion in a safe way if there are large core counts. If those knobs exist, I'm of the opinion that they would be positive things to exercise in the default Silverblue configuration. |
I'd suggest you try to figure out what makes the build in the akmod-nvidia package use so many cores by default even when you restrict it via systemd. Then we can suggest a change to the RPM package to work around that. You can also open an issue for rpm-ostree and link this one there, but I don't think it's an rpm-ostree issue. |
Remember, it's not the number of cores, but the concurrency of the compile. My understanding is we are restricting the number of cores used by the build, but it's still doing "make -j<a billion>", with its cut-down number of cores. What isn't clear to me is why I can't get rpmbuild to do |
If you find where to set the maximum number of threads used for the compilation, then we can make a tweak in the RPM spec file to set that only for rpm-ostree systems (checking if |
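As a rough illustration of capping build concurrency only on rpm-ostree systems, a helper could pick the job count as sketched here. The cap value, the helper name, and the use of `/run/ostree-booted` as the marker file are assumptions; where the akmod build actually takes its job count from is not established in this thread.

```shell
# Hypothetical helper: choose a job count for `make -j`.
# Usage: pick_build_jobs DETECTED_CPUS CAP [MARKER_FILE]
# The cap is applied only when MARKER_FILE exists (assumed default:
# /run/ostree-booted, which is present on ostree-booted systems).
pick_build_jobs() {
    detected="$1"
    cap="$2"
    marker="${3:-/run/ostree-booted}"
    if [ -e "$marker" ] && [ "$detected" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$detected"
    fi
}

# Example: make -j"$(pick_build_jobs "$(nproc)" 16)"
```

The point of the sketch is the distinction made above: limiting available cores and limiting the number of parallel jobs are separate knobs.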
Hi @jbirch-atlassian @travier , Thank you for your thoughts on this so far. I will open the issue on rpm-ostree. It will go against the CoreOS repository for rpm-ostree because there's no issue tracking on the Silverblue one. I will also post the bug report to bugzilla under the nvidia-kmod package, and add the links back to these issues with some encouragement to view discussion at this issue as context. |
I've posted this Bug report over at rpmfusion bugzilla https://bugzilla.rpmfusion.org/show_bug.cgi?id=6317 |
A fix was released for the bug report of this problem on the rpmfusion bugzilla, but it concerns the 390 version of the driver. |
@ivanvorstanenko I had a quick look at that issue, and that doesn't look like the same problem. That looks like a legitimate compilation failure for old versions of the NVidia driver in newer kernels, whereas this issue seems like a race condition caused by trying to compile the drivers with too high a concurrency — the compilation succeeds if done at a lower level of concurrency. I see you've helpfully gone to all of the linked tickets and mentioned the same thing, and most people have closed the issues, believing them resolved — but I'm not convinced this is the same problem. It's not even the same package — even if that was the same problem and same fix, it's only for the 390.xx versions of the NVidia driver, not the current 510/515. I'm worried that all of these linked tickets have been or will be closed erroneously. Can you confirm if https://bugzilla.rpmfusion.org/show_bug.cgi?id=6337 showed up at the time only when being built with many cores, and that whatever changes happened there have been applied to all versions of the kmod? |
No, I can't confirm it. I have 8 cores and 8 threads (FX8300) and compilation succeeds.
I think not, because the problem of compiling with too high a concurrency and the problem with the 390 driver version are not the same. |
I'll see what concurrency is reliably required to trigger this bug and return with more details for others. Unfortunately, I'm running on a 5.17 kernel at the moment, so if the other report is related to 5.18 only, it is a very different problem. edit: 26 threads reliably compiles. 27 threads reliably fails. There are unfortunately many old bug reports of this long-standing problem that have been closed without the problem being fixed. I think the one you have linked there was closed for the wrong reason, and I commented to that effect a few days ago. I'm hoping to make sure maintainers don't keep closing legitimate issues. |
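A threshold like the 26/27 figure can be located by bisection once there is a command that attempts the build at a given concurrency. This is a minimal sketch, assuming a hypothetical `try_build N` command that exits 0 on success, and assuming failure is monotonic (every level above the threshold fails); nothing here comes from the akmods tooling.

```shell
# Find the smallest concurrency at which a command fails.
# Usage: find_first_failing KNOWN_GOOD KNOWN_BAD COMMAND [ARGS...]
# COMMAND is called with the candidate concurrency appended as its last argument.
find_first_failing() {
    lo="$1"
    hi="$2"
    shift 2
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$@" "$mid" >/dev/null 2>&1; then
            lo="$mid"   # still builds: the threshold is above mid
        else
            hi="$mid"   # fails: the threshold is at or below mid
        fi
    done
    echo "$hi"
}

# Example (try_build is hypothetical):
#   find_first_failing 1 64 try_build
```

Because each probe is a full build, bisection keeps the number of attempts logarithmic in the search range.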
@jbirch-atlassian Thank you for testing that. I've been meaning to do a similar check. @travier @jbirch-atlassian Thanks for keeping these issues open. I have checked the bug report at https://bugzilla.rpmfusion.org/show_bug.cgi?id=6337 again (I read it at some point after posting this issue, and the rpm-ostree duplicate). As you've pointed out it's not the same Nvidia driver version as this report. I agree that these issues (including rpm-ostree issue 3706) should remain open until we understand (a) what caused it and (b) that it is resolved to a degree that it won't happen again. As of today, it appears no one can confirm (though there are suspicions) in what package or combination of packages the cause lies. Nor can we currently assume it will not reoccur with recent, current or future drivers and hardware. I think it's worth noting here for context, for anyone arriving here experiencing this issue still, that (with apologies if I am misreading the communication) it appears bugzilla 6337 was closed before investigation due to an admitted lack of interest. Ergo, there was no intention to investigate and an assumption was made that the problem was elsewhere. Again, thanks to @jbirch-atlassian for offering to help out over there on the rpm-ostree / Silverblue side. I still have the same problem. I have some new hardware arriving soon which will include some more high core count parts, but different CPU and GPU models to the hardware I used for the original report. I will test just in case and post back with more information (if anything new) once they're up and running. The tl;dr for anyone experiencing this issue still (especially non-developer users like me just trying to read the room) is it probably isn't fixed yet. Though, I'd love to be wrong. |
I need to find some time to dig a little further, and I still don't know what the root cause is, but I've discovered that I can invoke Investigating this is particularly slow, because I don't have the faintest idea about how to sanely debug any of this. For example, I have no idea how to invoke Quick thanks to @nelsonaloysio for their comment about |
You should try reproducing the akmod commands on your system directly. Maybe directly building the module from source might trigger the issue. |
Sorry, I don't think I was clear — I can reproduce the I have a bit of free time next week, that I'm hoping to spend slowly chipping away at this. References on |
I have good news and bad news. The bad news is, I can no longer reproduce the original problem's root cause. I made the mistake of keeping my system up to date, and that has changed the state of the world enough that I can't reproduce it exactly. The good news is, the problem still occurs with slightly different internal symptoms — same visible outcome to users — but the fix this time is way more trivial. It's plausible that it's related to the original issue as well, which I'll explain later. Let me take you on a journey so you can check my work and see if it makes sense, because frankly I'm in too deep for my own good right now. Historically, I had seen this problem crop up as "oh, there's too many files open". The logs were very similar to those posted here, but the small amount of information I could find implicated open files (https://bugzilla.redhat.com/show_bug.cgi?id=1901218#c1) Indeed, the first time I worked around this issue was by setting some limit of open file descriptors to something ridiculous, and then installing the NVidia drivers. However, this was still busted on updates some time later, and I put it down to me not really knowing how to configure the number of allowed open file descriptors for things in systemd. More recently, I had seen this in the same way @samjcarter had — as a failure to compile. The open file descriptor fix wasn't working for me as well. Disabling some cores was though, and that's how we ended up here. As I dug into it this week, I noticed I was never getting to the compilation failure stage anymore —
I had previously set the default If the original issue was caused by any of:
then it's conceivable that it's all the same issue, just showing up in a slightly different way after an update. If it's the first of these, then there might be a lingering loose end to tie up with particularly large akmod packages, but I somehow doubt it's that, as I was able to manually invoke So in short, I never got to a true root cause. But at least we seem to have a persistent workaround that doesn't involve murdering CPU cores for a little while:
|
Additionally, it should be noted that Presumably the default is 1024, though I don't know where this would normally be set. Perhaps it's sufficient to set If we agree this is an adequate fix for Silverblue, I'm happy to put together the PR. However, I appreciate there might still be some open questions around truly root causing this. |
I have:
If raising the limit fixes the issue then we can safely raise that in rpm-ostree upstream. |
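To check what open-file limit a running process actually has before and after raising it, the soft limit can be read from its `/proc/PID/limits` file. This is a minimal sketch; the helper name is an assumption, and the drop-in value in the comment is a hypothetical example rather than the value used in this thread.

```shell
# Print the soft "Max open files" limit from /proc/PID/limits-style input on stdin.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Example for the rpm-ostree daemon (only meaningful while it is running,
# and reading another user's /proc entry may need root):
#   pid=$(systemctl show -p MainPID --value rpm-ostreed.service)
#   soft_nofile < "/proc/$pid/limits"
#
# A drop-in raising the limit would contain something like (value hypothetical):
#   [Service]
#   LimitNOFILE=65536
```

Comparing the number before and after a change confirms whether the new limit actually reached the daemon.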
Already reported in coreos/rpm-ostree#3706. Can you open a PR to raise that limit in |
Done and done! This will hopefully be resolved with coreos/rpm-ostree#3853. Thanks for your help, @travier! |
@jbirch-atlassian & @travier Thank you both very much for your hard work on this. It means I (and others I'm sure) can keep using Silverblue for workstations and render machines where otherwise I would have at the very least moved to regular Fedora version 34, due to the version constraints of my work software. Silverblue 36 lets me toolbox it. I have since realised that I could also install toolbox on Fedora Workstation 36 and set up a 34 toolbox. Always learning something & I'm glad I can keep using Silverblue. |
(edit): I was hitting coreos/rpm-ostree#1614 and in particular coreos/rpm-ostree#4201, due to a missing link for the linker in I'm not sure my issue is related to this one, but I was able to successfully install the drivers without issues ~1 month ago and that is not happening now after I had to temporarily reset a kernel override and remove the layered akmod-nvidia.
What I did, on a system that was working fine with nvidia-drivers till a few days ago (before updating), is:
|
I just got the same error on Silverblue 40. Raising the limit doesn't help.
|
Description
I have two workstations with Fedora 36 Silverblue. One of them has a 6 core, 12 thread Intel CPU, and the other a 16 core, 32 thread AMD CPU. Both use Nvidia graphics cards.
On the 16 core machine only, while attempting to `rpm-ostree install akmod-nvidia xorg-x11-drv-nvidia-cuda` drivers, and on all subsequent uses of `rpm-ostree install` with any other package, the install fails with an error. Before attempting to install the Nvidia drivers, other packages installed with `rpm-ostree install package` succeed without errors.
I can work around the error (and stop it from showing) by using a short bash script to disable some of the CPU cores on the 16 core computer. I must run the script before every use of `rpm-ostree install`. The error never occurs when carrying out the same steps on the 6 core computer. The only difference is that that computer has an older graphics card, and so must `rpm-ostree install akmod-nvidia-470xx xorg-x11-drv-nvidia-470xx-cuda` instead.
To Reproduce
Please describe the steps needed to reproduce the bug:
1. `rpm-ostree update`
2. `rpm-ostree install htop` (htop as an example)
3. `systemctl reboot`.
4. `sudo rpm-ostree install https://mirrors.rpmfusion.org/free/fedora/rpmfusion-free-release-$(rpm -E %fedora).noarch.rpm https://mirrors.rpmfusion.org/nonfree/fedora/rpmfusion-nonfree-release-$(rpm -E %fedora).noarch.rpm`. Detailed instructions at https://rpmfusion.org/Configuration
5. `sudo rpm-ostree install akmod-nvidia xorg-x11-drv-nvidia-cuda`
6. Make `disable-threads.sh` executable: `chmod +x disable-threads.sh`.
7. `sudo ./disable-threads.sh false 9 31`. (confirm in System Monitor Resources tab)
8. `rpm-ostree install package` and it should complete without the error.
9. `sudo rpm-ostree kargs --append=rd.driver.blacklist=nouveau --append=modprobe.blacklist=nouveau --append=nvidia-drm.modeset=1`
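The contents of `disable-threads.sh` are not shown in this report. A script with that calling convention (state, first CPU, last CPU) might look roughly like the sketch below, which toggles CPUs through the kernel's sysfs `online` files. The function name and the optional fourth argument (an alternative sysfs root, used only for dry runs) are assumptions.

```shell
# Hypothetical reconstruction: take a range of CPUs online or offline.
# Usage: set_cpu_range_online true|false FIRST LAST [SYSFS_CPU_ROOT]
set_cpu_range_online() {
    state="$1"
    first="$2"
    last="$3"
    root="${4:-/sys/devices/system/cpu}"
    case "$state" in
        true)  value=1 ;;
        false) value=0 ;;
        *) echo "usage: set_cpu_range_online true|false FIRST LAST" >&2; return 2 ;;
    esac
    i="$first"
    while [ "$i" -le "$last" ]; do
        f="$root/cpu$i/online"
        # Skip CPUs that do not exist or cannot be toggled (cpu0 often cannot).
        if [ -w "$f" ]; then
            echo "$value" > "$f"
        fi
        i=$((i + 1))
    done
}

# Example (needs root): set_cpu_range_online false 9 31
```

Offlined CPUs stay offline until they are re-enabled with `true` or the machine reboots.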
Expected behavior
No error should interrupt the nvidia driver install on the 16 core computer. All subsequent uses of `rpm-ostree install package` should also not fail with the same error. The behaviour of the 16 core computer should match the 6 core computer, where the error never appears.
Screenshots / Terminal Output
Fresh Fedora 36 Silverblue install with updates done
Install a layered package, eg; htop
Check rpm-ostree status
first;
systemctl reboot
Remove HTOP again
rpm-ostree uninstall htop
reboot and check it's gone:
Htop successfully removed.
Add rpmfusion repositories (successful)
Attempt Nvidia driver install (Fails with Error)
Output of
journalctl -t 'rpm-ostree(akmod-nvidia.post)'
Run our disable-threads.sh script
sudo ./disable-threads.sh false 9 31
Attempt Nvidia driver install second time (success)
Reboot and check status
systemctl reboot
Try to install HTOP again; without running the disable-threads.sh script (Fails with error)
Output from
journalctl -t 'rpm-ostree(akmod-nvidia.post)'
The journalctl log continues for approximately 8500 lines of similar output: `/lib/modules/...kernel...needs... "something": /lib/modules/...`
At this point, I have saved a copy of all the log items:
journalctl -t 'rpm-ostree(akmod-nvidia.post)' > journalctl.txt
Try to install HTOP again; AFTER running the disable-threads.sh script (succeeds)
sudo ./disable-threads.sh false 9 31
Check
rpm-ostree status
OS version:
Additional context
I've not yet tested this with Fedora 36 Workstation. I may be able to but it will take a little time - I have limited machines and drives to set up the install with and these machines have a lot asked of them ;)