UCX/BACKEND: Add worker_id selection support #938
Merged
Conversation
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀
brminich requested changes on Oct 22, 2025
brminich previously approved these changes on Oct 23, 2025
brminich approved these changes on Oct 23, 2025
rakhmets approved these changes on Oct 23, 2025
michal-shalev (Contributor, Author) commented:
/build
Alexey-Rivkin pushed a commit to Alexey-Rivkin/nixl that referenced this pull request on Oct 23, 2025
dpressle pushed a commit that referenced this pull request on Oct 30, 2025
* CI: Switch from PyTorch to cuda-dl-base for unification
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Handle Meson update in build.sh
The Meson update requires Python, which is installed in build.sh.
The previous base image had Python pre-installed, but cuda-dl-base does not.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Limit ninja parallelism to fix OOM in Ubuntu22 build
Added -j${NPROC} to ninja commands to prevent out-of-memory compiler kills.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Align Python version with other install procedures
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Switch to cuda-dl-base images with pip upgrade for Ubuntu 22.04
cuda-dl-base Ubuntu 22.04 ships pip 22.0.2 without --break-system-packages
support. Upgrade pip to 24.x to match PyTorch image behavior.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Add ~/.local/bin to PATH for user pip installs
Fixes "pytest: command not found" when pip defaults to user installation.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Update to CUDA12.9
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Use latest cuda-dl-base image for CUDA12.8
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Set CUDA_HOME in the build script
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Fix the Permission denied error on DOCA download
Use /tmp to avoid Permission denied failures in non-writable directories.
Also add cleanup for the DOCA install package.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Make /workspace writable to resolve fs access failures
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Use cuda-dl-base 25.06 to match rock32 node driver version
The image comes with CUDA 12.9; verified with Ovidiu that it is supported.
Resolves error 803 (cudaErrorSystemDriverMismatch) by using cuda-dl-base:25.06,
which includes compat driver 575.57.08, matching the H100 nodes' driver version.
The previous 25.03 image had driver 570.124.06, causing a version mismatch.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Control ninja parallelism in test_python and increase timeout
cuda-dl-base is missing large Python packages that come
pre-installed with PyTorch images. Installing them caused
frequent OOM and/or timeouts on Ubuntu 22.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* UCX/BACKEND: Add worker_id selection support (#938)
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
* libfabric: Use desc-specific target offset (#883)
This fixes a bug in multi-descriptor transfers where descriptors
point to different offsets within the same registered memory region.
Without this fix, RDMA reads always target offset 0 instead of
extracting each descriptor's specific target address.
Also impacted: block-based transfers (iteration N would read blocks
from iteration 0), partial buffer updates, etc. (A brief illustration
follows this commit list.)
Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>
* Parallelism Control for pip install
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Reorder Python and CPP test stages
The Python stage has a higher failure probability,
so it is better to fail fast.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Fix log message when env var not defined (#914)
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>
* Minor cleanup
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Reorder Python and CPP test stages
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Unify to the latest Docker tag
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Revert the timeout extension
The expectation was longer build times due to
switching to a base image with no Python.
In practice, no test runs for more than 10 minutes,
so the old 30-minute timeout is still valid.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Move /workspace chmod to the Dockerfile
That chmod is only needed for CI use cases.
Move it to the CI-specific Dockerfiles so it does
not affect other cases.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Set NPROC in common.sh and reuse
Reduce duplicate NPROC assignments by relying on the default fallback.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Improve NPROC and CUDA_HOME handling in common.sh
- Move CUDA_HOME setup to common.sh before UCX build check
- Calculate NPROC based on container memory limits (1 proc/GB, max 16)
- Detect containers via /.dockerenv, /run/.containerenv, or KUBERNETES_SERVICE_HOST
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Remove hardcoded NPROC from pipelines
NPROC is now set dynamically by common.sh instead
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
* Limit CPU parallelism on bare metal nodes
Docker containers see all host CPUs, so parallelism must be limited on bare metal.
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
---------
Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Signed-off-by: Michal Shalev <mshalev@nvidia.com>
Signed-off-by: Tushar Gohad <tushar.gohad@intel.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Michal Shalev <mshalev@nvidia.com>
Co-authored-by: Tushar Gohad <tusharsg@gmail.com>
Co-authored-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: Mikhail Brinskiy <brminich@users.noreply.github.com>
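
As referenced in the libfabric entry above, here is a brief, hypothetical illustration of the bug class fixed by #883. The MemRegion and Desc types and the function names are invented for this sketch and are not the actual NIXL or libfabric structures; the point is only that each descriptor's own address, not the registered region's base, must determine the RDMA target offset.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical stand-ins for a registered memory region and a transfer
// descriptor (not the real NIXL/libfabric types).
struct MemRegion { uint64_t base; };             // start of the registered region
struct Desc      { uint64_t addr; size_t len; }; // target address within that region

// Buggy behavior: always reading from the region base is equivalent to
// using offset 0 for every descriptor.
inline uint64_t buggyOffset(const MemRegion&, const Desc&) {
    return 0;
}

// Fixed behavior: derive each descriptor's offset from its own address,
// so descriptor N reads its own block rather than block 0.
inline uint64_t descOffset(const MemRegion& r, const Desc& d) {
    return d.addr - r.base; // nonzero whenever the desc does not start at the base
}
```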
e-ago pushed a commit to e-ago/nixl-doca-31 that referenced this pull request on Nov 12, 2025
What?
Add explicit worker selection for the UCX backend via customParam="worker_id=N".
Why?
Currently, to get N separate QPs, you must create N separate agents. This is inefficient. UCX supports multiple workers per agent (each creating its own QP), but the only way to use different workers was from different threads. This PR enables single-threaded code to explicitly select which worker/QP to use per request.
How?
- Parse worker_id=N from opt_args.customParam in prepXfer() and prepGpuSignal()
- Add getWorkerIdFromOptArgs() to centralize the parsing logic

Usage: see the sketch below.
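
The PR's own usage snippet did not survive extraction, so the following is a minimal hypothetical sketch rather than the author's example. The customParam key worker_id=N comes from the PR description itself; the nixlAgent calls (createXferReq, postXferReq), their exact signatures, and the helper name postOnWorker are assumptions based on the public NIXL C++ API, and descriptor-list setup is elided.

```cpp
#include <string>
#include <nixl.h>  // NIXL C++ API (assumed header name)

// Hypothetical helper: post a write that is pinned to a specific UCX
// worker (and therefore a specific QP) via the customParam key this PR adds.
nixl_status_t postOnWorker(nixlAgent &agent,
                           const nixl_xfer_dlist_t &local_descs,
                           const nixl_xfer_dlist_t &remote_descs,
                           const std::string &remote_agent,
                           uint32_t worker_id) {
    nixl_opt_args_t extra_params;
    // "worker_id=N" selects which UCX worker services this request.
    extra_params.customParam = "worker_id=" + std::to_string(worker_id);

    nixlXferReqH *req = nullptr;
    nixl_status_t status = agent.createXferReq(NIXL_WRITE, local_descs,
                                               remote_descs, remote_agent,
                                               req, &extra_params);
    if (status != NIXL_SUCCESS)
        return status;
    return agent.postXferReq(req);
}
```

With this in place, a single thread can spread requests across worker_id values 0..N-1 to fan out over N QPs, without creating N agents or using N threads.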