[XLA:GPU] Add NVSHMEM library and initialization test #20395

trevor-m · 2024-12-10T20:17:49Z

This PR is the first step in integrating the NVSHMEM library into XLA. We are targeting two use cases in XLA with NVSHMEM: 1. Host-side collectives, which are much faster than NCCL at small message sizes, and 2. Generating fused communication and compute kernels with Pallas and Mosaic GPU.

This PR adds the library dependency, which follows the stub loading mechanism used by other NVIDIA libraries in XLA. We also add a singleton wrapper API around it and a unit test that checks that NVSHMEM can be initialized successfully. The build rules download the NVSHMEM headers during the XLA build and use the system-installed shared library at runtime.
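As a rough illustration of the "singleton wrapper plus initialization check" idea described above, here is a minimal sketch. It is not the code in this PR: `ShmemApiSketch`, `FakeLibraryInit`, and `FakeInitCallCount` are hypothetical names, and the fake init function stands in for the real call that would go through the NVSHMEM stub.

```cpp
#include <mutex>

// Hypothetical counter so a test can observe how often init really ran.
inline int& FakeInitCallCount() {
  static int count = 0;
  return count;
}

// Hypothetical stand-in for the real library initialization call.
// It only records that it was invoked and reports success.
inline bool FakeLibraryInit() {
  ++FakeInitCallCount();
  return true;
}

// Illustrative singleton wrapper (the class name is an assumption).
class ShmemApiSketch {
 public:
  // Function-local static: constructed on first use, thread-safe in C++11+.
  static ShmemApiSketch& Default() {
    static ShmemApiSketch instance;
    return instance;
  }

  // Runs the underlying init at most once; later calls return the cached
  // result of the first attempt.
  bool InitializeOnce() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!attempted_) {
      attempted_ = true;
      initialized_ = FakeLibraryInit();
    }
    return initialized_;
  }

  bool IsInitialized() const {
    std::lock_guard<std::mutex> lock(mu_);
    return initialized_;
  }

 private:
  ShmemApiSketch() = default;
  ShmemApiSketch(const ShmemApiSketch&) = delete;
  ShmemApiSketch& operator=(const ShmemApiSketch&) = delete;

  mutable std::mutex mu_;
  bool attempted_ = false;
  bool initialized_ = false;
};
```

A unit test in this style would call `ShmemApiSketch::Default().InitializeOnce()` and assert that it returns `true`, which mirrors the kind of "can it be initialized" check the PR describes.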

cc @nvcastet @Tixxx

nvcastet · 2024-12-10T20:29:37Z

CC @ezhulenev

xla/service/gpu/runtime/nvshmem_api.h

trevor-m · 2025-01-14T01:17:10Z

@frgossen Would you mind reviewing this when you have a chance?

frgossen · 2025-01-14T15:08:11Z

Generally looks good to me. I would like to hold off on merging this until we can test NVSHMEM internally, which depends on integrating the library.

xla/backends/gpu/collectives/nvshmem_collectives.cc

PatriosTheGreat · 2025-02-21T10:37:11Z


+  nvshmemx_uniqueid_t nvshmem_id = NVSHMEMX_UNIQUEID_INITIALIZER;
+
+  // Initialize NVSHMEM
+  if (process_id_ == 0) {


Would this work when the first collective doesn't use device 0?

Initialization has to be global, so all devices must participate. This is a bit different from NCCL, where communicators can be initialized independently.
After initialization, we could split the devices into subgroups.
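To make the global-initialization point concrete, here is a small simulation of a unique-id bootstrap. The body of the `process_id_ == 0` branch is not shown in the hunk above, so this sketch assumes the usual pattern: process 0 creates the id and publishes it, and every other process reads it before initializing. `FakeUniqueId`, `FakeKvStore`, `GenerateFakeUniqueId`, `ObtainUniqueId`, and the key name are all hypothetical; no real NVSHMEM calls are made.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Stand-in for nvshmemx_uniqueid_t, which is an opaque type in the library.
using FakeUniqueId = std::array<std::uint8_t, 16>;

// Hypothetical in-memory key-value store playing the role of the
// out-of-band channel that carries the id between processes.
class FakeKvStore {
 public:
  void Set(const std::string& key, const FakeUniqueId& id) { data_[key] = id; }

  std::optional<FakeUniqueId> Get(const std::string& key) const {
    auto it = data_.find(key);
    if (it == data_.end()) return std::nullopt;
    return it->second;
  }

 private:
  std::map<std::string, FakeUniqueId> data_;
};

// Hypothetical id generator; only process 0 calls it.
inline FakeUniqueId GenerateFakeUniqueId() {
  FakeUniqueId id{};
  for (std::size_t i = 0; i < id.size(); ++i) {
    id[i] = static_cast<std::uint8_t>(i + 1);
  }
  return id;
}

// Returns the id this process should initialize with, or nullopt if
// process 0 has not published it yet.
inline std::optional<FakeUniqueId> ObtainUniqueId(int process_id,
                                                  FakeKvStore& kv) {
  static const char kKey[] = "nvshmem_unique_id";  // hypothetical key name
  if (process_id == 0) {
    FakeUniqueId id = GenerateFakeUniqueId();
    kv.Set(kKey, id);  // publish so every other process can read it
    return id;
  }
  return kv.Get(kKey);  // non-zero processes depend on process 0
}
```

In this sketch a non-zero process cannot obtain an id until process 0 has run, which is why every process, including process 0, has to take part in initialization even if the first collective only involves a subset of devices.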

third_party/tsl/third_party/nvshmem/workspace.bzl

Imported from GitHub PR #21683

Requires #20395 which adds the NVSHMEM library dependency.

This PR adds the following:
1. Nvshmem flag to enable nvshmem
2. Set nvshmem initialization info when the GPU PJRT client is created. The first time NVSHMEM is used, it will be initialized.
3. Uses the user buffer memory pool for nvshmem. If nvshmem is enabled, it will be allocated using `nvshmem_malloc`. This same memory can be used by user buffers if nccl user buffers are also enabled.
4. Update the `CollectiveColorer` so that mosaic_gpu custom calls use the nvshmem memory space.

Copybara import of the project:

-- aee3379 by Trevor Morris <tmorris@nvidia.com>: Add nvshmem flag, memory allocation, and memory space assignment Set Nvshmem env info during client creation Rename flag and use absl::string_view
-- f8fca39 by Trevor Morris <tmorris@nvidia.com>: Use explicit types in test
-- e41faa3 by Trevor Morris <tmorris@nvidia.com>: Add user buffer allgather and allreduce tests with and without nvshmem alloc Set nvshmem in XLA_FLAGS test fixes formatting
-- cf0c368 by Trevor Morris <tmorris@nvidia.com>: Fixes
-- 3b4d111 by Trevor Morris <tmorris@nvidia.com>: Remove early dso check
-- 359f2b2 by Trevor Morris <tmorris@nvidia.com>: Add flag comment
-- fd15a7c by Trevor Morris <tmorris@nvidia.com>: Also assign memory space for mosaic_gpu_v2

Merging this change closes #21683

FUTURE_COPYBARA_INTEGRATE_REVIEW=#21683 from trevor-m:nvshmem-upstream-2 fd15a7c
PiperOrigin-RevId: 740701134

Imported from GitHub PR openxla/xla#21683

Requires openxla/xla#20395 which adds the NVSHMEM library dependency.

This PR adds the following:
1. Nvshmem flag to enable nvshmem
2. Set nvshmem initialization info when the GPU PJRT client is created. The first time NVSHMEM is used, it will be initialized.
3. Uses the user buffer memory pool for nvshmem. If nvshmem is enabled, it will be allocated using `nvshmem_malloc`. This same memory can be used by user buffers if nccl user buffers are also enabled.
4. Update the `CollectiveColorer` so that mosaic_gpu custom calls use the nvshmem memory space.

Copybara import of the project:

-- aee33791e16ab2149118de728dbb9e62f5e7cc31 by Trevor Morris <tmorris@nvidia.com>: Add nvshmem flag, memory allocation, and memory space assignment Set Nvshmem env info during client creation Rename flag and use absl::string_view
-- f8fca39300b3915eb6320142f58fa9c0ec7a1eaa by Trevor Morris <tmorris@nvidia.com>: Use explicit types in test
-- e41faa3f72b778fcf8ea8111d3cde59548b8f9f5 by Trevor Morris <tmorris@nvidia.com>: Add user buffer allgather and allreduce tests with and without nvshmem alloc Set nvshmem in XLA_FLAGS test fixes formatting
-- cf0c36865de8b8a010caaf62c3a36b64e36037bd by Trevor Morris <tmorris@nvidia.com>: Fixes
-- 3b4d11123cdb794d0a60e65b94d22ded04b7b2b4 by Trevor Morris <tmorris@nvidia.com>: Remove early dso check
-- 359f2b243ec97b1f8003c27f0b07dde82407ff6c by Trevor Morris <tmorris@nvidia.com>: Add flag comment
-- fd15a7cac745adc1971bec63e148047b9b811729 by Trevor Morris <tmorris@nvidia.com>: Also assign memory space for mosaic_gpu_v2

Merging this change closes #21683

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#21683 from trevor-m:nvshmem-upstream-2 fd15a7cac745adc1971bec63e148047b9b811729
PiperOrigin-RevId: 740701134