No static version for linking NCCL (libnccl) #11604

janpfeifer · 2024-04-17T08:23:51Z

I'm an author of an ML Framework using XLA.

Per issue #11596, after a recent refresh of my build, the XLA build fails if I don't include NCCL. The easy fix would be to include NCCL in my build, which is also good for other reasons. But my default (and only) distribution of my ML framework works for both CPU and GPU. I achieve this by linking everything statically, which is also simpler for the end user.

The issue is that NCCL, unlike the other CUDA libraries, doesn't have a static linking rule, even though NVIDIA distributes the libnccl_static.a file. Relevant bazel code.

I assume this is a simple fix for someone with the right "bazel-fu skillz" ... but I'm not sure. Also, there may be other considerations I'm not aware of. Any help would be most appreciated!


janpfeifer · 2024-04-17T16:31:19Z

So, a manual hack to get it to link statically is to change the "nccl" rule to:

cc_library(
    name = "nccl",
    srcs = [],
    # srcs = ["libnccl.so.%{nccl_version}"],
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    linkopts = cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
)

(linkopts was added and the srcs was set to empty)

But ideally this would be controlled by the user's choice of whether to link NCCL statically.
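A minimal sketch of how that choice could be exposed, assuming a hypothetical `--define` key named `nccl_static` (this key is made up for illustration and is not an existing XLA option). It wraps the same two variants of the rule shown above in `select()`, and assumes `cuda_rpath_flags` is already loaded in this BUILD template, as the hack implies. Untested against the real template:

```starlark
# Hypothetical switch: matches when the build is invoked with
# --define=nccl_static=true. The key name is an assumption.
config_setting(
    name = "nccl_static",
    define_values = {"nccl_static": "true"},
)

cc_library(
    name = "nccl",
    # Static variant: no shared library in srcs (as in the manual hack).
    # Default variant: the original shared-library srcs line.
    srcs = select({
        ":nccl_static": [],
        "//conditions:default": ["libnccl.so.%{nccl_version}"],
    }),
    hdrs = ["nccl.h"],
    include_prefix = "third_party/nccl",
    visibility = ["//visibility:public"],
    deps = [
        "@local_config_cuda//cuda:cuda_headers",
    ],
    # Static variant: same linkopts as the manual hack.
    # Default variant: no extra linker options.
    linkopts = select({
        ":nccl_static": cuda_rpath_flags("nvidia/nccl/lib") + ["-lnccl_static"],
        "//conditions:default": [],
    }),
)
```

With something like this in place, a user would opt in with `bazel build --define=nccl_static=true ...` and get the current dynamic behavior otherwise. Whether a `--define`, a build setting, or an environment variable read by the repository rule is the right mechanism is a design decision for the maintainers.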

While looking at the code I also found the TF_NCCL_USE_STUB variable: if it is set to "1" (or to any value other than "0"; leaving it unset does not count), it triggers the rule I mentioned. Otherwise, the build takes another path that builds NCCL(?).

I tested adding build --action_env TF_USE_STUB=0 to xla_configure.bazelrc, but it didn't help: NCCL was still linked dynamically. I'm not sure whether my value was overwritten, though...

cheshire · 2024-04-19T15:02:42Z

@ddunl @PatriosTheGreat WDYT?

cheshire assigned ddunl Apr 19, 2024

juuso-oskari mentioned this issue Sep 2, 2024

Build for GPU fails due to nccl error #16711

Open
