Conversation

@michal-shalev
Contributor

What?

Add explicit worker selection for UCX backend via customParam="worker_id=N".

Why?

Currently, to get N separate QPs, you must create N separate agents. This is inefficient. UCX supports multiple workers per agent (each creating its own QP), but the only way to use different workers was from different threads. This PR enables single-threaded code to explicitly select which worker/QP to use per request.

How?

  • Parse worker_id=N from opt_args.customParam in prepXfer() and prepGpuSignal()
  • If specified, use that worker; otherwise fall back to thread-local round-robin assignment
  • Add helper getWorkerIdFromOptArgs() to centralize parsing logic
  • Test validates 32 workers with separate endpoints, using warmup transfers to complete UCX endpoint wireup

Usage:

params["num_workers"] = "32";  // 1 agent, 32 workers
opt_args.customParam = "worker_id=5";  // Explicit worker selection
agent->createXferReq(..., &opt_args);  // Uses worker 5's QP

Signed-off-by: Michal Shalev <mshalev@nvidia.com>
@github-actions

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

brminich previously approved these changes Oct 23, 2025
@michal-shalev michal-shalev enabled auto-merge (squash) October 23, 2025 14:10
@michal-shalev
Contributor Author

/build

@michal-shalev michal-shalev merged commit bb0b873 into ai-dynamo:main Oct 23, 2025
20 of 21 checks passed
@michal-shalev michal-shalev deleted the worker-id branch October 23, 2025 14:53
Alexey-Rivkin pushed a commit to Alexey-Rivkin/nixl that referenced this pull request Oct 23, 2025
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
dpressle pushed a commit that referenced this pull request Oct 30, 2025
* CI: Switch from PyTorch to cuda-dl-base for unification

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Handle Meson update in build.sh

The Meson update requires Python, which is installed in build.sh.
The previous base image had Python pre-installed, but cuda-dl-base does not.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Limit ninja parallelism to fix OOM in Ubuntu22 build

Added -j${NPROC} to ninja commands to prevent out-of-memory compiler kills.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Align Python version with other install procedures

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Switch to cuda-dl-base images with pip upgrade for Ubuntu 22.04

cuda-dl-base Ubuntu 22.04 ships pip 22.0.2 without --break-system-packages
support. Upgrade pip to 24.x to match PyTorch image behavior.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Add ~/.local/bin to PATH for user pip installs

Fixes "pytest: command not found" when pip defaults to user installation.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Update to CUDA12.9

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Use latest cuda-dl-base image for CUDA12.8

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Set CUDA_HOME in the build script

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Fix the Permission denied err on DOCA download

Use /tmp to avoid Permission denied in non-writable directories
Also add cleanup for the DOCA install package

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Make /workspace writable to resolve fs access failures

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Use cuda-dl-base 25.06 to match rock32 node driver version

The image comes with CUDA 12.9; verified with Ovidiu that it is supported.
Resolves error 803 (cudaErrorSystemDriverMismatch) by using cuda-dl-base:25.06,
which includes compat driver 575.57.08, matching the H100 nodes' driver version.
The previous 25.03 image had driver 570.124.06, causing a version mismatch.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Control ninja parallelism in test_python and increase timeout

cuda-dl-base is missing the large Python packages that
come pre-installed with PyTorch images. Installing them
caused frequent OOM and/or timeouts on Ubuntu22.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* UCX/BACKEND: Add worker_id selection support (#938)

Signed-off-by: Michal Shalev <mshalev@nvidia.com>

* libfabric: Use desc-specific target offset (#883)

This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.

Without this fix, RDMA reads always target offset 0 instead of
extracting each descriptor's specific target address.

Also impacted: block-based transfers (iteration N would read blocks
from iteration 0), partial buffer updates, and similar patterns.

Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>

* Parallelism Control for pip install

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Reorder Python and CPP test stages

The Python stage has a higher failure probability,
so it is better to fail fast.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Fix log message when env var not defined (#914)

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>

* Minor cleanup

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Reorder Python and CPP test stages

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Unify to the latest Docker tag

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Revert the timeout extension

The expectation was of longer build times due to
switching to a base image with no Python.
In practice, no test runs for more than 10 minutes,
so the old 30-minute timeout is still valid.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Move /workspace chmod to the Dockerfile

That chmod is only needed for CI use cases.
Moving it to the CI-specific Dockerfiles so it does
not affect other cases.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Set NPROC in common.sh and reuse

Reduce the number of places NPROC is set by providing a default fallback.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Improve NPROC and CUDA_HOME handling in common.sh

- Move CUDA_HOME setup to common.sh before UCX build check
- Calculate NPROC based on container memory limits (1 proc/GB, max 16)
- Detect containers via /.dockerenv, /run/.containerenv, or KUBERNETES_SERVICE_HOST

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Remove hardcoded NPROC from pipelines

NPROC is now set dynamically by common.sh instead

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

* Limit CPU parallelism on bare metal nodes

Docker containers see all host CPUs, so parallelism must be limited on bare metal.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>

---------

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Michal Shalev <mshalev@nvidia.com>
Co-authored-by: Tushar Gohad <tusharsg@gmail.com>
Co-authored-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>
e-ago pushed a commit to e-ago/nixl-doca-31 that referenced this pull request Nov 12, 2025
…namo#924)
