
cudaPackages.cudatoolkit: switch to autoPatchelf #178440

Merged
2 commits, Apr 2, 2023

Conversation

@SomeoneSerge (Contributor) commented on Jun 21, 2022

Description of changes

Rewrites the cudatoolkit expression to use autoPatchelf instead of manually constructing and writing the rpath.
Using autoPatchelf ensures that we're at least not missing dependencies that upstream has marked as "needed".

This is a narrow-scoped part of #178439.
Specifically, this PR ensures "correctness" (it amends missing rpaths) but increases the actual closure size.
The next PR should split the outputs to reduce closure sizes while preserving "correctness".
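
For concreteness, a rough sketch of the new shape of the expression (illustrative only: the attribute values, e.g. zlib in buildInputs, are made up, and the real code lives in pkgs/development/compilers/cudatoolkit/common.nix):

  { stdenv, fetchurl, autoPatchelfHook, zlib }:

  stdenv.mkDerivation {
    pname = "cudatoolkit";
    version = "11.7.0";
    src = fetchurl { /* NVIDIA runfile; URL and hash elided */ };

    # autoPatchelfHook walks every ELF file in $out, resolves each DT_NEEDED
    # entry against the packages in buildInputs, and writes the resulting
    # runpath, replacing the rpath string the old expression assembled by hand.
    nativeBuildInputs = [ autoPatchelfHook ];
    buildInputs = [ stdenv.cc.cc.lib zlib ];

    # libcuda.so.1 is provided by the driver at runtime (/run/opengl-driver/lib)
    # and can never be resolved at build time, so it is ignored explicitly.
    autoPatchelfIgnoreMissingDeps = [ "libcuda.so.1" ];
  }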

Things done
  • Built on platform(s)
    • x86_64-linux

CC @NixOS/cuda-maintainers

@SomeoneSerge changed the title from cudaPackages.cudatoolkit: siwtch to autoPatchelf to cudaPackages.cudatoolkit: switch to autoPatchelf on Jun 21, 2022
@SomeoneSerge added the 6.topic: cuda (Parallel computing platform and API) label on Jun 21, 2022
@SomeoneSerge (Contributor, Author) commented

  dontPatchELF = true;  # skip the generic patchelf fixup that shrinks RPATHs
  dontStrip = true;     # keep symbols instead of stripping the binaries

These two are just hanging around. I'm almost certain I should remove dontPatchELF; I'm less sure about dontStrip.

@samuela (Member) left a comment

Looks like a great upgrade to cudaPackages.cudatoolkit! I'm running nixpkgs-review now...

(review thread on pkgs/development/compilers/cudatoolkit/common.nix, outdated and resolved)
@samuela (Member) commented on Jun 27, 2022

Result of nixpkgs-review pr 178440 run on x86_64-linux

2 packages marked as broken and skipped:
  • python310Packages.caffeWithCuda
  • truecrack-cuda
5 packages failed to build:
  • caffeWithCuda
  • ethminer (ethminer-cuda)
  • gpu-screen-recorder
  • gpu-screen-recorder-gtk
  • python39Packages.caffeWithCuda
33 packages built:
  • colmapWithCuda
  • cudaPackages.cuda-samples
  • cudatoolkit (cudaPackages.cudatoolkit, cudatoolkit_11)
  • cudaPackages.cutensor
  • cudaPackages.nccl
  • forge
  • gpu-burn
  • gromacsCudaMpi
  • gwe
  • katagoWithCuda
  • librealsenseWithCuda
  • magma
  • nvtop
  • nvtop-nvidia
  • python310Packages.TheanoWithCuda
  • python310Packages.cupy
  • python310Packages.jaxlibWithCuda
  • python310Packages.numbaWithCuda
  • python310Packages.pycuda
  • python310Packages.pynvml
  • python310Packages.pyrealsense2WithCuda
  • python310Packages.pytorchWithCuda
  • python39Packages.TheanoWithCuda
  • python39Packages.cupy
  • python39Packages.jaxlibWithCuda
  • python39Packages.numbaWithCuda
  • python39Packages.pycuda
  • python39Packages.pynvml
  • python39Packages.pyrealsense2WithCuda
  • python39Packages.pytorchWithCuda
  • python39Packages.tensorflowWithCuda
  • xgboostWithCuda
  • xpraWithNvenc

@samuela (Member) commented on Jun 27, 2022

Here are the errors:

error: builder for '/nix/store/iw6f89qja74akzqv9gl22vi26qpdrqlz-ethminer-0.19.0.drv' failed with exit code 2;
       last 10 log lines:
       > /nix/store/bv8qjsgd8ngjbazj3h5swfwb0sydy14n-cli11-2.2.0/include/CLI/App.hpp:594:35: note:   no known conversion for argument 2 from 'unsigned int' to 'CLI::callback_t' {aka 'std::function<bool(const std::vector<std::__cxx11::basic_string<char> >&)>'}
       >   594 |                        callback_t option_callback,
       >       |                        ~~~~~~~~~~~^~~~~~~~~~~~~~~
       > /nix/store/bv8qjsgd8ngjbazj3h5swfwb0sydy14n-cli11-2.2.0/include/CLI/App.hpp:701:13: note: candidate: 'CLI::Option* CLI::App::add_option(std::string)'
       >   701 |     Option *add_option(std::string option_name) {
       >       |             ^~~~~~~~~~
       > /nix/store/bv8qjsgd8ngjbazj3h5swfwb0sydy14n-cli11-2.2.0/include/CLI/App.hpp:701:13: note:   candidate expects 1 argument, 4 provided
       > make[2]: *** [ethminer/CMakeFiles/ethminer.dir/build.make:76: ethminer/CMakeFiles/ethminer.dir/main.cpp.o] Error 1
       > make[1]: *** [CMakeFiles/Makefile2:516: ethminer/CMakeFiles/ethminer.dir/all] Error 2
       > make: *** [Makefile:156: all] Error 2
       For full logs, run 'nix log /nix/store/iw6f89qja74akzqv9gl22vi26qpdrqlz-ethminer-0.19.0.drv'.
error: builder for '/nix/store/qli4pxxmhqbqic19qa0hwr3i3ixvc58a-cudatoolkit-10.1.243.drv' failed with exit code 1;
       last 10 log lines:
       > auto-patchelf: 7 dependencies could not be satisfied
       > warn: auto-patchelf ignoring missing libcuda.so.1 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/targets/x86_64-linux/lib/libcuinj64.so.10.1.243
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/oceanFFT
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/oceanFFT
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/randomFog
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/randomFog
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/nbody
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/242ijwn14sjvpsl3694jk5j8fbc8hbpv-cudatoolkit-10.1.243/extras/demo_suite/nbody
       > auto-patchelf failed to find all the required dependencies.
       > Add the missing dependencies to --libs or use `--ignore-missing="foo.so.1 bar.so etc.so"`.
       For full logs, run 'nix log /nix/store/qli4pxxmhqbqic19qa0hwr3i3ixvc58a-cudatoolkit-10.1.243.drv'.
error: 1 dependencies of derivation '/nix/store/d0dsn8qz3ym77lfg43ksi13wp9qxyb9c-cudatoolkit-10-cudnn-7.6.5.drv' failed to build
error: 2 dependencies of derivation '/nix/store/d4j3zxg2rgvp290bxs3swcfdqkda6k2s-caffe-1.0.drv' failed to build
error: 2 dependencies of derivation '/nix/store/db9d8zi0jvnm0kh90ml6m9b06qg5zsyb-caffe-1.0.drv' failed to build
error: builder for '/nix/store/61qxgja8gs96d78kw1bzlrr6fk3ygdpp-cudatoolkit-10.2.89.drv' failed with exit code 1;
       last 10 log lines:
       > error: auto-patchelf could not satisfy dependency libQt5WebEngineCore.so.5 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/nsight-compute-2019.5.0/host/linux-desktop-glibc_2_11_3-x64/libexec/QtWebEngineProcess
       > warn: auto-patchelf ignoring missing libcuda.so.1 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/targets/x86_64-linux/lib/libcuinj64.so.10.2.89
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/oceanFFT
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/oceanFFT
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/randomFog
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/randomFog
       > error: auto-patchelf could not satisfy dependency libGLU.so.1 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/nbody
       > error: auto-patchelf could not satisfy dependency libglut.so.3 wanted by /nix/store/n9xpr40wamx3iswvixvglixc8sl5d5pv-cudatoolkit-10.2.89/extras/demo_suite/nbody
       > auto-patchelf failed to find all the required dependencies.
       > Add the missing dependencies to --libs or use `--ignore-missing="foo.so.1 bar.so etc.so"`.
       For full logs, run 'nix log /nix/store/61qxgja8gs96d78kw1bzlrr6fk3ygdpp-cudatoolkit-10.2.89.drv'.
error: 1 dependencies of derivation '/nix/store/v7ymn5dd6m2z9lg96dn6vqr9r4hc162i-gpu-screen-recorder-1.0.0.drv' failed to build
error: 1 dependencies of derivation '/nix/store/c3gac5rhf14272mqnwy7xszglq8gg3ag-gpu-screen-recorder-gtk-0.1.0.drv' failed to build
error: 5 dependencies of derivation '/nix/store/jby3dh2knf5wrinpjb8zr3m3xwr24pmq-review-shell.drv' failed to build

Looks like cudatoolkit 10.1 and 10.2 are broken. Are we still trying to keep those working? I assume a number of packages still rely on them, though.

@samuela (Member) commented on Jun 27, 2022

OTOH the failures are only in the extras/demo_suite/* folder. We could just remove those binaries or skip them; see the sketch below.
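
Something along these lines, for instance (an untested sketch; whether autoPatchelfIgnoreMissingDeps accepts a list here is an assumption on my part):

  {
    # Option A: delete the demo binaries before the fixup phase runs
    postInstall = ''
      rm -rf $out/extras/demo_suite
    '';

    # Option B: keep them, but let autoPatchelf tolerate exactly the
    # libraries we know are missing
    autoPatchelfIgnoreMissingDeps = [ "libGLU.so.1" "libglut.so.3" ];
  }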

@stale stale bot added the 2.status: stale (https://github.com/NixOS/nixpkgs/blob/master/.github/STALE-BOT.md) label on Jan 7, 2023
@SomeoneSerge self-assigned this on Apr 1, 2023
@stale stale bot removed the 2.status: stale label on Apr 1, 2023
The two commits:
  • ...to ensure correctness (in the sense that all DT_NEEDED libraries are verified to be discoverable through the runpaths)
  • ...the same logic is handled by autoPatchelf
@SomeoneSerge (Contributor, Author) commented

I rebased on current master and ran:

❯ nix-build with-my-cuda.nix -A cudaPackages_10.cudatoolkit.out -A cudaPackages_10_1.cudatoolkit.lib -A cudaPackages.cudatoolkit
/nix/store/80q92g4mw49cifxfzhk3xhfmhcq7635p-cudatoolkit-10.2.89
/nix/store/vfkbm851wjfpw1pc2mdxq5v7d49plkps-cudatoolkit-10.1.243-lib
/nix/store/chzf3k3s07wd9i7xgzg6ha667bjhpc51-cudatoolkit-11.7.0

...nixpkgs-review would be nice, but I probably can't run it any time soon

@SomeoneSerge (Contributor, Author) commented

I think it would be pragmatic to just merge, relying on autoPatchelf having verified all of the declared dependencies. There may be hidden dlopen() errors, but those are more likely in tools than in the libraries used by our ML stack. We can address these errors as they appear.

@MrFoxPro commented on Apr 5, 2023

Good day, @SomeoneSerge. I recently updated my nixpkgs channel configuration, and the xmrig-cuda library, which depends on CUDA, just stopped working with this error: failed to open libnvrtc-builtins.so.11.7, even though the file is present in the nvidia_x11 output: /nix/store/9pp2hm8y83zi523shr6lli1jsaqd6krg-nvidia-x11-525.89.02-6.1.15/lib/libnvidia-ml.so

I fixed it by downgrading nixpkgs to 7018cf78c618e0a8ec4369c587319f51cb7b19b0
You can see my derivation here: https://github.com/MrFoxPro/nix/blob/cuda-bug/drv/xmrig-cuda.nix
It builds fine, but fails at runtime.

Any ideas how it could be related to these changes? How can I fix it?

@SomeoneSerge (Contributor, Author) commented

@MrFoxPro Hey-hey, and a good day to you too!

First off, I see that you're linking to nvidia_x11 directly, which is something we try to avoid in nixpkgs: we deploy libcuda.so at /run/opengl-driver/lib, because it's driver-dependent. You might want to replace that with autoAddOpenGLRunpathHook.

As for the libnvrtc error, and whether the library is not being found or is being rejected by the dynamic linker, we'll need to see more logs. I'd start by running xmrig with the LD_DEBUG=libs environment variable set. The error could indeed be related to this PR, because we're now setting runpaths more consistently. The only regression we have noticed ourselves so far is the one linked from the pytorch PR. Obviously, though, we can only really see how we affect packages that are in nixpkgs, and we can sometimes break things out-of-tree even if we're careful 🙃

I also see that there is an xmrig derivation in nixpkgs, only without cuda support yet. Maybe you could open a PR adding CUDA support to that derivation, and we could navigate from there?
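
Concretely, the change might look like this (a sketch only, not something I've built; pname, version and src are placeholders, and I'm assuming the hook is taken from cudaPackages):

  { stdenv, fetchFromGitHub, cmake, cudaPackages }:

  stdenv.mkDerivation {
    pname = "xmrig-cuda";
    version = "0.0.0";  # placeholder
    src = fetchFromGitHub { /* pin your fork here */ };

    nativeBuildInputs = [
      cmake
      # Appends /run/opengl-driver/lib to the runpath of every ELF file, so
      # libcuda.so is resolved against the system's driver at runtime
      # instead of against a pinned nvidia_x11 build.
      cudaPackages.autoAddOpenGLRunpathHook
    ];
    buildInputs = [ cudaPackages.cudatoolkit ];
  }

For the linker trace, something like LD_DEBUG=libs ./xmrig 2>&1 | grep -i nvrtc should show which paths the loader actually searched.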

@SomeoneSerge (Contributor, Author) commented on Apr 5, 2023

@MrFoxPro On a side note, though... I don't know what you mean by the "PMC Balloon", but to me it sounds a little bit provocative, and not in a good way. This is going off-topic, though.

@MrFoxPro commented on Apr 5, 2023

> First off, I see that you're linking to nvidia_x11 directly, which is something we try to avoid in nixpkgs: we deploy libcuda.so at /run/opengl-driver/lib, because it's driver-dependent. [...]

I'm not sure about /run/opengl-driver/lib. Does this exist only when hardware.opengl.enable is true? I'm running the miner on my machine headlessly and starting a fake X server only for overclocking via nvidia-settings, so I'm not sure why this option should be mandatory.

@MrFoxPro commented on Apr 5, 2023

> @MrFoxPro On a side note, though... I don't know what you mean by the "PMC Balloon", but to me it sounds a little bit provocative, and not in a good way. [...]

Just a meme :) You're welcome to join https://t.me/ru_nixos btw, so we can discuss it more closely if you want.

@SomeoneSerge (Contributor, Author) commented

> I'm not sure about /run/opengl-driver/lib. Does this exist only when hardware.opengl.enable is true? [...]

The name hardware.opengl.enable is a historical legacy and subject to change: #141803. One is expected to set hardware.opengl.enable (and videoDrivers = [ "nvidia" ], iirc) even in headless mode, so that programs from nixpkgs know to use a libcuda.so that is compatible with your system's driver. This does not imply enabling the X server; in fact, you don't necessarily need the X server even to use OpenGL (cf. EGL). For why we need to deploy libcuda.so impurely, cf. this comment: #224294 (comment)
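
For reference, the headless setup would look roughly like this (a sketch; option names as they are today, i.e. before the rename discussed in #141803):

  # Minimal headless NixOS sketch: no X session is started, but the driver's
  # libcuda.so still gets deployed to /run/opengl-driver/lib.
  {
    hardware.opengl.enable = true;
    services.xserver.videoDrivers = [ "nvidia" ];  # selects the NVIDIA driver
  }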

But let's draft a PR and move the conversation there!

> Just a meme :)

Alright, alright, I didn't mean to imply anything. Just that there are times when waving a white flag high above your head before approaching people is suddenly a very common-sense thing to do, lest you catch friendly fire.

@MrFoxPro commented on Apr 5, 2023

@SomeoneSerge let's discuss it in #224848
