Conversation

@tsg- (Contributor) commented Oct 8, 2025

What?

This fixes a bug in multi-descriptor transfers where descriptors point to different offsets within the same registered memory region. The bug caused all descriptors to incorrectly use the base address of the registration (remote_md->remote_buf_addr_) instead of each descriptor's specific offset address (remote[desc_idx].addr).

Impact: block-based transfers (iteration N would re-read blocks from iteration 0), scatter-gather operations, and partial buffer updates.

Why?

Without this fix, RDMA reads always target offset 0 of the registered region. The fix extracts each descriptor's specific target address instead.

Example test case:

  buffer = allocate_memory(1GB)
  register_memory(buffer)

  # Pass 0: Transfer blocks 0-127
  descriptors = [
      {addr: buffer + 0*block_size, len: block_size},      # Block 0
      {addr: buffer + 1*block_size, len: block_size},      # Block 1
      ...
      {addr: buffer + 127*block_size, len: block_size}     # Block 127
  ]

  # Pass 1: Transfer blocks 128-255
  descriptors = [
      {addr: buffer + 128*block_size, len: block_size},    # Block 128 (bug: reads block 0)
      {addr: buffer + 129*block_size, len: block_size},    # Block 129 (bug: reads block 1)
      ...
  ]

How?

After the fix, each descriptor uses remote[desc_idx].addr (its specific target offset) rather than the registration's base address.
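To make the failure mode concrete, here is a minimal, self-contained Python sketch. The `rdma_read` and `transfer` helpers are hypothetical stand-ins, not the actual nixl/libfabric code; they only simulate the difference between always reading from the registration base and reading from each descriptor's own address:

```python
def rdma_read(memory, addr, length):
    """Stand-in for an RDMA read from a remote registered region."""
    return memory[addr:addr + length]

def transfer(memory, descriptors, base_addr, use_fix):
    results = []
    for desc in descriptors:
        # Buggy behavior: every descriptor reads from the registration base.
        # Fixed behavior: each descriptor reads from its own target address.
        target = desc["addr"] if use_fix else base_addr
        results.append(rdma_read(memory, target, desc["len"]))
    return results

block_size = 4
memory = bytes(range(16))  # 4 blocks at offsets 0, 4, 8, 12

# A later pass requests blocks 2 and 3 (offsets 8 and 12).
descs = [{"addr": 8, "len": block_size}, {"addr": 12, "len": block_size}]

buggy = transfer(memory, descs, base_addr=0, use_fix=False)
fixed = transfer(memory, descs, base_addr=0, use_fix=True)

print(buggy)  # both reads return block 0 (the reported bug)
print(fixed)  # reads return blocks 2 and 3 as requested
```

In the sketch, `base_addr` plays the role of `remote_md->remote_buf_addr_` and `desc["addr"]` plays the role of `remote[desc_idx].addr` from the PR description.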

copy-pr-bot bot commented Oct 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.



@tsg- tsg- force-pushed the per_desc_target_offset branch 2 times, most recently from 2d45ba0 to b948f0f on October 8, 2025 18:00
This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.

Without this fix, RDMA reads always target offset 0. The fix extracts
each descriptor's specific target address instead.

Also impacted: block-based transfers (iteration N would read blocks
from iteration 0), partial buffer updates, and similar patterns.

Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>
@tsg- tsg- force-pushed the per_desc_target_offset branch from b948f0f to 005d3ec on October 8, 2025 18:13
@ovidiusm ovidiusm requested a review from mkhazraee October 9, 2025 08:48
tsg- commented Oct 12, 2025

@akkart-aws @yexiang-aws we'll push some focused tests for the failing scenarios if it helps with this review. Thank you!

@ovidiusm (Contributor):

/build

@ovidiusm (Contributor):

Approving from my side, but it will still need code owner approval from the AWS team.

tsg- commented Oct 14, 2025

@akkart-aws @yexiang-aws any comments on this change?

@ovidiusm (Contributor):

/ok to test 005d3ec

tsg- commented Oct 21, 2025

@akkart-aws @yexiang-aws any comments?

@fengjica (Contributor):

Thanks! This PR did fix the bug, which I reproduced in nixlbench.

@yexiang-aws (Collaborator) left a comment:

Great catch and thanks a lot, @tsg- ! It would be great if you can push some focused tests to cover this scenario.

@yexiang-aws yexiang-aws merged commit f62d1d7 into ai-dynamo:main Oct 23, 2025
21 checks passed
Alexey-Rivkin pushed a commit to Alexey-Rivkin/nixl that referenced this pull request Oct 23, 2025
dpressle pushed a commit that referenced this pull request Oct 30, 2025
* CI: Switch from PyTorch to cuda-dl-base for unification

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Handle Meson update in build.sh

Meson update requires Python, which is installed in build.sh.
The previous base image had Python pre-installed, but cuda-dl-base does not.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Limit ninja parallelism to fix OOM in Ubuntu22 build

Added -j${NPROC} to ninja commands to prevent out-of-memory compiler kills.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Align Python version with other install procedures

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Switch to cuda-dl-base images with pip upgrade for Ubuntu 22.04

cuda-dl-base Ubuntu 22.04 ships pip 22.0.2 without --break-system-packages
support. Upgrade pip to 24.x to match PyTorch image behavior.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Add ~/.local/bin to PATH for user pip installs

Fixes "pytest: command not found" when pip defaults to user installation.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Update to CUDA12.9

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Use latest cuda-dl-base image for CUDA12.8

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Set CUDA_HOME in the build script

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Fix the Permission denied err on DOCA download

Use /tmp to avoid Permission denied in non-writable directories
Also add cleanup for the DOCA install package

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Make /workspace writable to resolve fs access failures

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Use cuda-dl-base 25.06 to match rock32 node driver version

The image comes with CUDA 12.9; verified with Ovidiu that it is supported.
Resolves error 803 (cudaErrorSystemDriverMismatch) by using cuda-dl-base:25.06
which includes compat driver 575.57.08, matching the H100 nodes' driver version.
Previous 25.03 image had driver 570.124.06 causing version mismatch.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Control ninja parallelism in test_python and increase timeout

cuda-dl-base is missing large Python packages that come
pre-installed with PyTorch images. Installing them
caused frequent OOM and/or timeouts on Ubuntu22.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* UCX/BACKEND: Add worker_id selection support (#938)

Signed-off-by: Michal Shalev <mshalev@nvidia.com>

* libfabric: Use desc-specific target offset (#883)

This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.

Without this fix, RDMA reads always target offset 0. The fix extracts
each descriptor's specific target address instead.

Also impacted: block-based transfers (iteration N would read blocks
from iteration 0), partial buffer updates, and similar patterns.

Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>

* Parallelism Control for pip install

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Reorder Python and CPP test stages

The Python stage has a higher failure probability,
so it is better to fail fast.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Fix log message when env var not defined (#914)

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>

* Minor cleanup

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Reorder Python and CPP test stages

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Unify to the latest Docker tag

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Revert the timeout extension

The expectation was for longer build times due to
switching to a base image with no Python.
In practice, no test runs for more than 10 minutes,
so the old 30-minute timeout is still valid.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Move /workspace chmod to the Dockerfile

That chmod is only needed for CI use cases.
Moving it to the CI-specific Dockerfiles so it would
not affect other cases.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Set NPROC in common.sh and reuse

Reduce the number of places NPROC is set, with a default fallback

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Improve NPROC and CUDA_HOME handling in common.sh

- Move CUDA_HOME setup to common.sh before UCX build check
- Calculate NPROC based on container memory limits (1 proc/GB, max 16)
- Detect containers via /.dockerenv, /run/.containerenv, or KUBERNETES_SERVICE_HOST

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
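As a rough illustration of the heuristic this commit message describes (1 proc per GB of the container memory limit, capped at 16, with container detection), here is a Python sketch. The `in_container` and `calc_nproc` names are hypothetical; the real logic lives in common.sh as shell code:

```python
import os

GIB = 1024 ** 3

def in_container():
    # Detection heuristics mirroring the commit message:
    # Docker, Podman, or Kubernetes markers.
    return (os.path.exists("/.dockerenv")
            or os.path.exists("/run/.containerenv")
            or "KUBERNETES_SERVICE_HOST" in os.environ)

def calc_nproc(mem_limit_bytes, cap=16):
    # 1 process per GB of the memory limit, at least 1, at most `cap`.
    return max(1, min(cap, mem_limit_bytes // GIB))

print(calc_nproc(8 * GIB))            # 8
print(calc_nproc(64 * GIB))           # 16 (capped)
print(calc_nproc(512 * 1024 * 1024))  # 1 (floor)
```

The cap guards against over-subscribing compilers on large nodes, while the 1-proc/GB rule keeps parallel builds inside the container's memory limit.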

* Remove hardcoded NPROC from pipelines

NPROC is now set dynamically by common.sh instead

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Limit CPU parallelism on bare metal nodes

Docker containers see all host CPUs, so parallelism needs to be limited on bare-metal nodes

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

---------

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Michal Shalev <mshalev@nvidia.com>
Co-authored-by: Tushar Gohad <tusharsg@gmail.com>
Co-authored-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>
mcuiaws pushed a commit to mcuiaws/nixl that referenced this pull request Oct 31, 2025
(cherry picked from commit f62d1d7)
e-ago pushed a commit to e-ago/nixl-doca-31 that referenced this pull request Nov 12, 2025
…namo#924)
