No static version for linking NCCL (libnccl) #11604

Open
janpfeifer opened this issue Apr 17, 2024 · 2 comments
janpfeifer commented Apr 17, 2024

I'm an author of an ML Framework using XLA.

As described in issue #11596, after a recent refresh of my build, the XLA build fails unless I include NCCL. The easy fix would be to include NCCL in my build, which is also good for other reasons, but my default (and only) distribution of my ML framework works for both CPU and GPU. I achieve this by linking things statically, which is also simpler for the end user.

The problem is that NCCL, unlike the other CUDA libraries, has no static-linking rule, even though NVIDIA distributes the libnccl_static.a file. Relevant Bazel code.

I assume this is a simple fix for someone with the right "bazel-fu skillz", but I'm not sure. There may also be other considerations I'm not aware of. Any help would be most appreciated!

janpfeifer commented Apr 17, 2024

So, a manual hack to get it to link statically is to change the "nccl" rule to:

cc_library(
    name = "nccl",
    srcs = [],
    # srcs = ["libnccl.so.%{nccl_version}"],
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    linkopts = cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
)

(linkopts was added and the srcs was set to empty)

But ideally this would be controlled by the user's choice of whether to link it statically.
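One way such a user-controlled choice might look is a `config_setting` combined with `select()`. This is only an untested sketch: the `nccl_link_static` define is a hypothetical name that does not exist in XLA today, and it assumes `cuda_rpath_flags` returns a plain list, as the hack above implies.

```starlark
# Hypothetical switch, enabled with:
#   bazel build --define=nccl_link_static=true ...
config_setting(
    name = "nccl_link_static",
    define_values = {"nccl_link_static": "true"},
)

cc_library(
    name = "nccl",
    # Static: no shared object in srcs. Default: the existing dynamic library.
    srcs = select({
        ":nccl_link_static": [],
        "//conditions:default": ["libnccl.so.%{nccl_version}"],
    }),
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    # Static: same linkopts as the manual hack. Default: none, as in the original rule.
    linkopts = select({
        ":nccl_link_static": cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
        "//conditions:default": [],
    }),
)
```

With this shape the default build keeps the current dynamic behavior, and only builds that pass the define pick up `-lnccl_static`.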

While looking at the code I also found the TF_NCCL_USE_STUB variable, which, if set to "1" (or anything other than "0" or unset), triggers the rule I mentioned. Otherwise, it takes another path and builds NCCL(?).

I tested adding build --action_env TF_USE_STUB=0 to xla_configure.bazelrc, but it didn't help: NCCL was still linked dynamically. I'm not sure whether my value was overwritten, though...

@cheshire
Member

@ddunl @PatriosTheGreat WDYT?
