
cudaPackages.setupCudaHook: propagate deps and the hook #271078

Merged: 8 commits merged into NixOS:master from SomeoneSerge:feat/torch-propagated-cuda on Dec 7, 2023

Conversation

SomeoneSerge (Contributor)

Description of changes

This is an attempt to fix the build failures in the CUDA variants of CMake consumers of torch/opencv4. Basically, this lets one avoid ad hoc changes like this one: https://github.com/NixOS/nixpkgs/pull/269639/files#diff-dcb82f7bc26e70a69ecb21e6801c8f2c32dbd91847e2b2db9b4fbb986f412abc.

The approach is to propagate the build dependencies (which we used to do), but to do so from a separate output, so as to avoid the CUDA fat leaking into python.withPackages environments.

In addition to propagating libraries like cuda_cudart, we also have to propagate the setupCudaHook, because somebody's got to set -DCUDAToolkit_ROOT &c.
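
A minimal sketch of the mechanism, assuming the extra output is called cxxdev (the name discussed for torch later in this thread); the package and its postFixup are illustrative, not the PR's actual diff. The one "real" piece is that nixpkgs records propagation in an output's nix-support/propagated-build-inputs file:

# Illustrative sketch only: propagate the CUDA libraries and the setup hook
# from a separate output, so the default output stays slim.
{ stdenv, cudaPackages }:

stdenv.mkDerivation {
  pname = "some-cuda-consumer";  # hypothetical package
  version = "0.0.0";
  dontUnpack = true;

  outputs = [ "out" "cxxdev" ];

  nativeBuildInputs = [ cudaPackages.cuda_nvcc ];
  buildInputs = [ cudaPackages.cuda_cudart ];

  installPhase = ''
    mkdir -p "$out" "$cxxdev"
  '';

  postFixup = ''
    # Whatever is listed in an output's nix-support/propagated-build-inputs
    # is pulled into the environment of anything that depends on that output,
    # so the CUDA closure only reaches consumers that reference cxxdev.
    mkdir -p "$cxxdev/nix-support"
    printWords "${cudaPackages.cuda_cudart}" "${cudaPackages.setupCudaHook}" \
      >> "$cxxdev/nix-support/propagated-build-inputs"
  '';
}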

I tested this on cctag, which had failed at find_package(opencv) in #269639.

I also realized we need to remove the hook's dependency on cudart and especially on the compiler: when we consume a cross-compiled opencv4 in a native build, the propagated compiler is going to be useless, because we couldn't have known about the native build when compiling opencv4...

I haven't yet figured out what happens if a package has both buildInputs = [ opencv4 ] and nativeBuildInputs = [ cuda_nvcc ]: both are going to propagate the hook, so the hook might run twice, and then it might add the env hooks, preConfigureHooks, and postFixupHooks twice. I need some sort of #pragma once.
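
One way to get that "#pragma once" behaviour would be a guard variable at the top of the hook. This is only a sketch of the idea; the variable name is made up and this is not the code that ended up in the PR:

# Hypothetical run-once guard for the setup hook. Setup hooks are sourced,
# so a plain shell variable is enough to notice the second sourcing within
# the same build.
if [[ -z ${cudaSetupHookInstalled-} ]]; then
    cudaSetupHookInstalled=1
    # ...register the env hooks, preConfigureHooks and postFixupHooks here,
    # exactly once...
fi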

@NixOS/cuda-maintainers

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.05 Release Notes (or backporting 23.05 and 23.11 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Priorities

Add a 👍 reaction to pull requests you find important.

@SomeoneSerge (Contributor, Author)

I maybe broke torch

@SomeoneSerge force-pushed the feat/torch-propagated-cuda branch from 2ab2ebb to 39f19c1 on November 30, 2023 02:40
@ConnorBaker added the 6.topic: cuda label on Nov 30, 2023
@ofborg bot added the 2.status: merge conflict label on Dec 1, 2023
@SomeoneSerge force-pushed the feat/torch-propagated-cuda branch from dc7de32 to 02aed8c on December 1, 2023 17:02
A review comment on these lines of the diff:

# https://gist.github.com/fd80ff142cd25e64603618a3700e7f82
depsTargetTargetPropagated = [

# One'd expect this should be depsHostHostPropagated, but that doesn't work
propagatedBuildInputs = [
@SomeoneSerge (Contributor, Author)

Here's evidence that Host→Target propagation (propagatedBuildInputs) works and Host→Host (depsHostHostPropagated) doesn't: https://gist.github.com/SomeoneSerge/7d0e633743175bee6470b03281fe74e1

The same goes for Build→Host instead of Build→Build when propagating the hook to nvcc's users. Whatever the reason, a transitive dependency is assigned the offsets (prevHost + relHost, prevHost + relTarget) rather than (prevHost + relHost, prevTarget + relTarget):

# From nixpkgs' setup.sh: maps a propagated dependency's relative offset onto
# the offsets of the dependency currently being processed. Note that a
# positive (target-side) offset is shifted by targetOffset, while zero and
# negative offsets are shifted by hostOffset.
function mapOffset() {
    local -r inputOffset="$1"
    local -n outputOffset="$2"
    if (( inputOffset <= 0 )); then
        outputOffset=$((inputOffset + hostOffset))
    else
        outputOffset=$((inputOffset - 1 + targetOffset))
    fi
}
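
Worked through for a dependency that enters via buildInputs (depsHostTarget, hostOffset = 0, targetOffset = 1): a dep it propagates with relative offsets (0, 1), i.e. propagatedBuildInputs, maps to (0 + 0, 1 - 1 + 1) = (0, 1) and lands in the consumer's buildInputs as expected, whereas a dep propagated with relative offsets (0, 0), i.e. depsHostHostPropagated, maps to (0 + 0, 0 + 0) = (0, 0): the target offset gets shifted by hostOffset rather than targetOffset, which is the behaviour described above.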

The relevant message in the cross-compilation matrix: https://matrix.to/#/%23cross-compiling%3Anixos.org/%248lsLV15zqjt5JdopUy-3aD3o8ym1NdITs18bZRZXMc8?via=someonex.net&via=matrix.org&via=catgirl.cloud&via=nixos.dev (yet to receive replies)

@SomeoneSerge (Contributor, Author)

I suspect that most of this PR wouldn't have been required with #102613

@ofborg bot removed the 2.status: merge conflict label on Dec 2, 2023
@SomeoneSerge force-pushed the feat/torch-propagated-cuda branch from 8a7abaf to 895067e on December 2, 2023 12:51
@SomeoneSerge (Contributor, Author)

The builds verified so far:

❯ nix build -f ./. --arg config '{ cudaSupport = true; cudaCapabilities = [ "8.6" ]; allowUnfree = true; }' -L python3Packages.{torch,torchaudio,torchvision,opencv4} cctag --print-out-paths  --keep-going
/nix/store/csa6zg49ih3zzfnh28ins5sy0xsa5iiz-python3.11-torch-2.1.1
/nix/store/4sdd2281hbz4cp0f24zinsyq6ma2n83i-python3.11-torchaudio-2.1.1
/nix/store/jr9z2jn008xl6vd17n1y5papl8m2spaj-python3.11-torchvision-0.16.1
/nix/store/v1p716yhdx99jldx7s4vmmrsrl0s6xpj-opencv-4.7.0
/nix/store/i4c5j077qqi25kz8y946ggj7axifz715-cctag-1.0.3

IIRC openvino is broken on master; the error is something like:

In file included from /build/source/src/inference/include/openvino/runtime/intel_gpu/ocl/ocl.hpp:18,
                 from /build/source/docs/snippets/gpu/context_sharing.cpp:2:
/build/source/src/inference/include/openvino/runtime/intel_gpu/ocl/ocl_wrapper.hpp:47:14: fatal error: CL/cl2.hpp: No such file or directory
   47 | #    include <CL/cl2.hpp>

@SomeoneSerge (Contributor, Author)

SomeoneSerge commented Dec 2, 2023

The closure for python3.withPackages (ps: with ps; [ torch torchaudio torchvision ]) is now 8G for the cudaCapabilities = [ "8.6" ] variant: https://gist.github.com/SomeoneSerge/4672d16693d03ae5808fa1e81b4544cc

One odd thing I observe is that nvcc is part of the closure, and that means that setupCudaHook is, and so is gcc11 because the hook links it directly...

At first it seemed like the cxxdev trick may not have worked as intended, because adding torch.cxxdev to that list still gives 8G; it did work, though: the cxxdev output isn't present in the gisted list.

  • NVCC is part of the closure because of openai-triton
  • libcublas-static is part of the closure because of opencv4, which is part of the closure because of ffmpeg, which is part of the closure because of torchaudio
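
For reference, dependency chains like these can be traced with nix why-depends; a hypothetical invocation (the store paths are placeholders, not the real ones from this build):

❯ nix why-depends \
    /nix/store/...-python3-3.11.6-env \
    /nix/store/...-cuda_nvcc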

@SomeoneSerge (Contributor, Author)

Back to ~6G:

❯ nix build --impure --expr 'with (import ./. { config = { cudaSupport = true; cudaCapabilities = [ "8.6" ]; allowUnfree = true; }; }); python3.withPackages (ps: with ps; [ torch torchvision torchaudio torch.cxxdev ])' --print-out-paths | xargs nix path-info --closure-size --human-readable
/nix/store/mf3cy3la3d8ywa0yjpgf4bf20g8hp7cd-python3-3.11.6-env     6.1G

@SomeoneSerge force-pushed the feat/torch-propagated-cuda branch from c261530 to 44c7292 on December 5, 2023 19:44
SomeoneSerge added a commit to SomeoneSerge/nixpkgs that referenced this pull request on Dec 5, 2023
@ofborg bot removed the 2.status: merge conflict label on Dec 5, 2023
This is useful for the cuda variants of packages like opencv and pytorch, whose xxxxConfig.cmake files do find_package(CUDAToolkit REQUIRED) regardless of whether they actually use it. With the propagated hook, we no longer have to manually add CUDA dependencies to torch's/opencv's reverse dependencies.

cudaPackages.cuda_nvcc: fix setupCudaHook propagation
@SomeoneSerge force-pushed the feat/torch-propagated-cuda branch from 494b0bb to 2df7ccf on December 6, 2023 02:01
@ConnorBaker ConnorBaker merged commit c94fdf8 into NixOS:master Dec 7, 2023
22 checks passed
github-actions bot (Contributor) commented Dec 7, 2023

Backport failed for staging-23.11, because it was unable to cherry-pick the commit(s).

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin staging-23.11
git worktree add -d .worktree/backport-271078-to-staging-23.11 origin/staging-23.11
cd .worktree/backport-271078-to-staging-23.11
git switch --create backport-271078-to-staging-23.11
git cherry-pick -x be9c779deba0e898802dd341a1ba9c04c4e9abe8 ada3991349beb5880e3994f25c65a0cf68941b83 45698380295187b35f3872542b71efc2223f8201 55af9329429a30ce81f7ad01da95406e1d62f785 71c248ec1309381136bf74339d453a58b400b2a9 3ececb9efafd80058525571d77d881767de6f5b8 44611c4a6d16b0eeb1488e9557b6a11e45193a46 2df7ccfa1498f5038b15acd50bc9277ad768dcbf

SomeoneSerge added a commit to SomeoneSerge/nixpkgs that referenced this pull request Dec 11, 2023
al3xtjames added a commit to al3xtjames/nixpkgs that referenced this pull request Dec 18, 2023
NixOS#271078 caused the configurePhase of pcl to fail when
withCuda is set to true. Fix NixOS#275090 by replacing
cudatoolkit with cudaPackages.cuda_nvcc.
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
(same commit message as above; cherry picked from commit 8db5a1a)
Lainera pushed a commit to Lainera/nixpkgs that referenced this pull request Dec 20, 2023