Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glibc: NVCC aarch64 intrinsics are unsupported #264599

Conversation

ConnorBaker
Copy link
Contributor

@ConnorBaker ConnorBaker commented Oct 31, 2023

Context

The GLIBC 2.38 update introduces intrinsics for aarch64-linux in math.h. NVCC (NVIDIA's CUDA Compiler) is now unable to compile any CUDA file for aarch64-linux because it does not support these intrinsics: https://forums.developer.nvidia.com/t/nvcc-fails-to-build-with-arm-neon-instructions-cpp-vs-cu/248355/2. As a result, PyTorch, Magma, Jax, and any other CUDA-enabled package now fail to build on aarch64-linux.

An alternative approach might be to patch the GLIBC used by NVCC. This approach is not taken for several reasons:

  • Users with config.cudaSupport = true would need to bootstrap a second copy of Nixpkgs with a glibc containing these changes.
  • Manually patching NVCC, CMake, and the include paths of a package like Magma was not enough: CMake's compiler identification routine fails to build its trivial CUDA file because it picks the unpatched GLIBC.
  • Assuming a manual patch against NVCC and or the build systems does work, every CUDA-enabled package must incorporate the relevant changes to ensure the unpatched GLIBC does not make its way in.
  • This patch follows the precedent set by upstream by guarding against inclusion of these intrinsics when an unsupported compiler is detected. We ensure that only NVCC is affected by this patch.

Description of changes

GLIBC 2.37 -> 2.38 broke CUDA compilation on ARM because it introduced ARM intrinsics in math.h. Here are the ones which are causing the errors: https://github.com/bminor/glibc/blob/36f2487f13e3540be9ee0fb51876b1da72176d3f/sysdeps/aarch64/fpu/bits/math-vector.h#L28-L36.

NVCC declares itself to be the same compiler as its host compiler. This causes inclusion of unsupported aarch64-linux intrinsics.

This patch follows the precedent set by upstream by guarding against inclusion of such intrinsics when an unsupported compiler (like NVCC) is detected.

Note: upstream does not intend to fix this. Instead, they place the onus (perhaps rightly) on downstream compilers/pre-processors (like NVCC), with the reasoning that they should not claim to implement a version of a compiler that they do not.

The patch is applied only for aarch64-linux.

References

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 23.11 Release Notes (or backporting 23.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

@ConnorBaker ConnorBaker added the 6.topic: cuda Parallel computing platform and API label Oct 31, 2023
@ConnorBaker ConnorBaker self-assigned this Oct 31, 2023
@github-actions github-actions bot added 6.topic: python 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: documentation This PR adds or changes documentation 8.has: changelog 8.has: module (update) This PR changes an existing module in `nixos/` 6.topic: ocaml 6.topic: vscode labels Oct 31, 2023
@ConnorBaker ConnorBaker force-pushed the fix/glibc-nvcc-arm-intrinsics-unsupported branch from 7601d28 to 07155f9 Compare October 31, 2023 14:51
@ConnorBaker ConnorBaker requested a review from Ma27 October 31, 2023 14:51
@github-actions github-actions bot removed 6.topic: python 6.topic: nixos Issues or PRs affecting NixOS modules, or package usability issues specific to NixOS 8.has: documentation This PR adds or changes documentation 8.has: changelog 8.has: module (update) This PR changes an existing module in `nixos/` 6.topic: ocaml 6.topic: vscode labels Oct 31, 2023
Copy link
Contributor

@SomeoneSerge SomeoneSerge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from the cudaPackages perspective, and I understand you've already verified the patch is sufficient to build cudaPackages.saxpy

Waiting for someone familiar with nixpkgs' glibc to sign off

@ConnorBaker ConnorBaker force-pushed the fix/glibc-nvcc-arm-intrinsics-unsupported branch 2 times, most recently from e96bcac to b0ed6a9 Compare October 31, 2023 15:46
@ConnorBaker ConnorBaker changed the title Fix/glibc nvcc arm intrinsics unsupported glibc: NVCC aarch64 intrinsics are unsupported Oct 31, 2023
@jonringer
Copy link
Contributor

@ofborg build cudaPackages.saxpy

@ofborg ofborg bot requested a review from edolstra October 31, 2023 17:35
@ofborg ofborg bot added 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin 10.rebuild-linux: 1-10 10.rebuild-linux: 1 labels Oct 31, 2023
@Ma27
Copy link
Member

Ma27 commented Oct 31, 2023

it is important that this patch is applied within Nixpkgs, rather than waiting for upstream to accept and apply such a fix.

This is partially correct only: for each release there's an active release branch where patches from their main branch are backported to, we mainly need to update the patchset here.

to accept and apply such a fix.

Please file it upstream anyways. I'd very much prefer to only update our patchlevel rather than maintaining our own patches that nobody else uses.

This patch follows the precedent set by upstream by guarding against inclusion of these intrinsics when an unsupported compiler is detected. We ensure that only NVCC is affected by this patch.

Agreed 👍
The approach seems OK to me.

Is there any particular reason though that this is a draft?

@jonringer
Copy link
Contributor

@ofborg build python3Minimal

Copy link
Contributor

@jonringer jonringer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems to be a noop in non-aarch64, non-linux, non-cuda builds.

I'll defer to @Ma27 for final merge, but I'm fine with this.

@ConnorBaker
Copy link
Contributor Author

Is there any particular reason though that this is a draft?

I had some trouble rebasing this change from master to staging and wanted to avoid mass-pinging a bunch of people until I got it right. Moving it out of draft now!

@ConnorBaker ConnorBaker marked this pull request as ready for review October 31, 2023 17:55
@delroth delroth added the 12.approvals: 2 This PR was reviewed and approved by two reputable people label Oct 31, 2023
Ma27
Ma27 previously requested changes Oct 31, 2023
Copy link
Member

@Ma27 Ma27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change itself looks good and I agree with the approach. However, I'd like to see this patch being filed against upstream glibc before this gets merged here.

@ConnorBaker
Copy link
Contributor Author

ConnorBaker commented Nov 1, 2023

@Ma27 this is a blocker for a number of tasks I'm working on -- is it okay if we do them in parallel?

Here's the list of tasks to get them to consider the patch:

  • Email the source ware administrators to have them create an account on Bugzilla for me.
  • Set up account and post bug report of observed breakage with NVCC.
  • Email the patch which references the bug report.
    • Upstream doesn't intend to fix this.

EDIT: Updated with status of tasks and note that upstream doesn't intend to fix this.

@ConnorBaker
Copy link
Contributor Author

@Ma27 @SomeoneSerge @jonringer after some more research:

it's clear that upstream believes compilers should not pretend to implement recent GCC versions if they do not, and does not intend to fix this issue. That's principled and I applaud them for it; however, the fact remains that there's no easy way to patch NVCC because it is a closed source project and relied on heavily in the High Performance Computing (HPC) space.

With your blessing, I'd like to merge this to unblock a number of different issues and restore functionality to the CUDA packages Nixpkgs provides.

I don't share the same concern as the GLIBC maintainers do about bitrot or forgotten fixes because we maintain our patches alongside the project, rather than embedding them in the source code. If and when support is added and new versions of NVCC are released for all prior versions of CUDA with a fix, we could revisit this patch.

Is this agreeable?

@ConnorBaker ConnorBaker requested a review from Ma27 November 2, 2023 14:15
@ConnorBaker ConnorBaker removed the 10.rebuild-darwin: 0 This PR does not cause any packages to rebuild on Darwin label Nov 2, 2023
@ConnorBaker ConnorBaker dismissed Ma27’s stale review November 5, 2023 16:36

Upstream will not issue such a patch and will leave all released versions of NVCC broken

@Ma27
Copy link
Member

Ma27 commented Nov 5, 2023

Just read the ML thraed about the other patch suggested and I agree with your assesment.. Because of that I'm inclined to give an OK to this. There's one thing we should talk about before however given that this patch will probably live a little longer in nixpkgs: are you open to reviewing changes that possibly affect this patch? This includes at least glibc version upgrades and changes to the patch itself.

AFAIU I won't be able to do that without nvidia hardware, so I can't do that, correct? If so, please add a note about you maintaining the patch above the inclusion in common.nix.

@SomeoneSerge
Copy link
Contributor

SomeoneSerge commented Nov 5, 2023

Is this agreeable?

I think this is right. We still probably have to patch some downstream apps (torch, opencv?) on master, because otherwise your modules+jetson support PR is blocked for a while

are you open to reviewing changes that possibly affect this patch

I'm up for reviewing such changes as well

AFAIU I won't be able to do that without nvidia hardware, so I can't do that, correct

I'll just note that one doesn't need nvidia hardware to run the builds

@ConnorBaker
Copy link
Contributor Author

... are you open to reviewing changes that possibly affect this patch? This includes at least glibc version upgrades and changes to the patch itself.

I am!

AFAIU I won't be able to do that without nvidia hardware, so I can't do that, correct?

Nah, it's a build failure and the build only requires the CPU. You'd run into this so long as config.cudaSupport = true and you're building for (or on) aarch64-linux.

If so, please add a note about you maintaining the patch above the inclusion in common.nix.

Will do!

@ConnorBaker ConnorBaker force-pushed the fix/glibc-nvcc-arm-intrinsics-unsupported branch from b0ed6a9 to 2b47345 Compare November 6, 2023 00:53
@ConnorBaker
Copy link
Contributor Author

Added myself as a maintainer, merging. Thank you for your help and guidance, @Ma27!

@ConnorBaker ConnorBaker merged commit 8bd2ce7 into NixOS:staging Nov 6, 2023
@ConnorBaker ConnorBaker deleted the fix/glibc-nvcc-arm-intrinsics-unsupported branch November 6, 2023 00:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
6.topic: cuda Parallel computing platform and API 10.rebuild-linux: 1-10 10.rebuild-linux: 1 12.approvals: 2 This PR was reviewed and approved by two reputable people
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

5 participants