-
-
Notifications
You must be signed in to change notification settings - Fork 14.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cudaPackages: generalize and refactor setup hooks #281576
cudaPackages: generalize and refactor setup hooks #281576
Conversation
81c1478
to
c0a86a2
Compare
c0a86a2
to
2580c41
Compare
Alrighty, had a chance to take a look @yannham! First, the segfaults you were seeing: the root of the problem is that bash doesn't have higher-order functions, so when you wrote stuff like postFixupHooks+=(autoFixElfFiles addCudaCompatRunpath) it was adding autoFixElfFiles and addCudaCompatRunpath as separate fixup hooks. As a result, the autoFixElfFiles wasn't receiving any arguments -- when you tried to run $1 "$f" you were getting segfaults because $1 wasn't set. To work around the lack of higher-order functions and the fact that bash will eagerly evaluate things, I made wrapper functions for each of the two setup hooks this refactors -- they basically thunks because they just postpone the evaluation of the the thing we care about. To make debugging these easier in the future, I added some conditionals to catch improper usage of the setup hooks. Additionally, I changed indentation to be consistent. I've updated and force-pushed your commit with these changes; feel free to take what you want from them. |
For what it's worth, I was able to build and run $ cat ~/.config/nixpkgs/config.nix
{
allowUnfree = true;
allowBroken = false;
cudaSupport = true;
cudaCapabilities = ["8.9"];
cudaForwardCompat = false;
}
$ nom build --impure github:connorbaker/nix-cuda-test#nix-cuda-test --override-input nixpkgs github:nixos/nixpkgs/2580c4182e1af070b7fa7720e1b4f6ed407ba286
... ignoring build logs
$ ./result/bin/nix-cuda-test
Seed set to 42
Using bfloat16 Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: /home/connorbaker/lightning_logs
Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to data/cifar-10-python.tar.gz
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 170498071/170498071 [00:01<00:00, 89975099.50it/s]
Extracting data/cifar-10-python.tar.gz to data
Files already downloaded and verified
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-----------------------------------------------
0 | criterion | CrossEntropyLoss | 0
1 | model | ViT | 86.3 M
-----------------------------------------------
86.3 M Trainable params
0 Non-trainable params
86.3 M Total params
345.317 Total estimated model params size (MB)
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:28<00:00, 8.80it/s, v_num=0, train_loss=2.350, val_loss=2.330]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 781/781 [01:30<00:00, 8.64it/s, v_num=0, train_loss=2.350, val_loss=2.330] |
@ConnorBaker thanks! I actually arrived at the same conclusion yesterday, in my bed 😅 because I had the issue before with a script unrelated to Nix but with I'll do a last pass and set as ready for review |
b8063e2
to
21da3cf
Compare
Quoting seems to work, and it's a little less verbose, so I changed that. I'm happy to revert to using a proper lambda if needed. Otherwise it's good to go on my hand, tested to build and run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the stale TODO, looks good! Good find with the quoting as well!
pkgs/development/cuda-modules/setup-hooks/auto-add-cuda-compat-runpath.sh
Outdated
Show resolved
Hide resolved
pkgs/development/cuda-modules/setup-hooks/auto-fix-elf-files.sh
Outdated
Show resolved
Hide resolved
pkgs/development/cuda-modules/setup-hooks/auto-fix-elf-files.sh
Outdated
Show resolved
Hide resolved
21da3cf
to
73ccd08
Compare
73ccd08
to
f401692
Compare
Result of 283 packages marked as broken and skipped:
47 packages failed to build:
174 packages built:
|
@SomeoneSerge thanks a lot for running nipxkgs review. I'm not sure how to interpret the results; are the failing packages related to this PR? |
f401692
to
f1f173c
Compare
@ConnorBaker @SomeoneSerge sorry to annoy you, I know you have a lot on your plate, but although it's a small PR I would like to not let it bitrot. Is there any other action required on my side to get this merged? I just rebase on the latest master. |
@yannham sorry for silence; let's fix eval errors and merge. How I see the subsequent maintenance is that I'm skeptical about how composable the call-a-function style of |
No worries! Thanks for following up on this one. I'll take a look at what's failing ofborg.
Are you referring to the fact that we generate a lot of file accesses? That is, the comment you wrote in #277213 (comment) ? If it is, it was more or less the next thing on my radar to take a look at after this PR 🙂 (well, when I find time) |
f1f173c
to
4771791
Compare
@SomeoneSerge aaaaand the eval is passing now 🙂 |
@yannham aaaand here are the build-logs (I pushed your commit to the cuda-updates branch): https://hercules-ci.com/github/SomeoneSerge/nixpkgs-cuda-ci/jobs/9871 I haven't read those myself yet |
Here is a summary of build failures (
All the rest is passing. Is there anything in here that wasn't failing already? It doesn't look like so, but I can't be 100% percent sure. Let me also retry the current version of this PR on the jetson. |
4771791
to
5e3b5c3
Compare
This PR refactor CUDA setup hooks, and in particular autoAddOpenGLRunpath and autoAddCudaCompatRunpathHook, that were using a lot of code in common (in fact, I introduced the latter by copy pasting most of the bash script of the former). This is not satisfying for maintenance, as a recent patch showed, because we need to duplicate changes to both hooks. This commit abstract the common part in a single shell script that applies a generic patch action to every elf file in the output. For autoAddOpenGLRunpath the action is just addOpenGLRunpath (now addDriverRunpath), and is few line function for autoAddCudaCompatRunpathHook. Doing so, we also takes the occasion to use the newer addDriverRunpath instead of the previous addOpenGLRunpath, and rename the CUDA hook to reflect that as well. Co-Authored-By: Connor Baker <connor.baker@tweag.io>
5e3b5c3
to
63746ca
Compare
@@ -32,31 +47,36 @@ final: _: { | |||
{} | |||
); | |||
|
|||
autoAddOpenGLRunpathHook = | |||
autoAddDriverRunpath = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The very last question, I promise: why don't we just move autoFixElfFiles
and autoAddDriverRunpath
to pkgs/by-name
right away? We'll have to anyway, eventually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no good answer, I just copied what was existing already. So let's do that as well, sure! So those would just become stand-alone packages somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it's enough to just create the pkgs/by-name/{autoFixElfFiles,autoAddDriverRunpath}/package.nix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let's do this in subsequent PRs
@@ -32,31 +47,36 @@ final: _: { | |||
{} | |||
); | |||
|
|||
autoAddOpenGLRunpathHook = | |||
autoAddDriverRunpath = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just learned about config.allowAliases
: https://github.com/NixOS/nixpkgs/blob/da7c5f11e0e17bb2e75d376c3d46b9c99f6c318a/pkgs/build-support/trivial-builders/default.nix#L629-L630
# Deprecated: an alias kept for compatibility. Consider removing after 24.11 | ||
autoAddOpenGLRunpathHook = final.autoAddDriverRunpath; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there have been more instances added since this PR was last updated.
Actually... I maybe just did something stupid:
|
# Run addDriverRunpath on all dynamically linked ELF files | ||
echo "Sourcing auto-add-driver-runpath-hook" | ||
|
||
if [ -z "${dontUseAutoAddDriverRunpath-}" ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with this |
This PR refactors CUDA setup hooks, and in particular
autoAddOpenGLRunpath
andautoAddCudaCompatRunpathHook
, that have a lot of code in common (in fact, I introduced the latter by copy pasting most of the bash script of the former and changing a few line). This is not satisfying for maintenance, as demonstrated by #277213, because we need to duplicate changes to both hooks.This commit abstracts the common part in a single shell function that applies a generic patch action to every elf file in the output. It is called
autoFixElfFiles
because I believeautoPatchElfFiles
would have been confused with somethingautoPatchelf
-related.For
autoAddOpenGLRunpath
the action is justaddOpenGLRunpath
(nowaddDriverRunpath
), and is few line function forautoAddCudaCompatRunpathHook
.Doing so, we also takes the occasion to use the newer
addDriverRunpath
instead of the previousaddOpenGLRunpath
, and rename the CUDA hook to reflect that as well.Description of changes
Things done
nix.conf
? (See Nix manual)sandbox = relaxed
sandbox = true
nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD"
. Note: all changes have to be committed, also see nixpkgs-review usage./result/bin/
)Add a 👍 reaction to pull requests you find important.