
[Bug][NIXL][PD]: Revisiting NIXL slowness on CUDA_IPC -- installation problem #26382


Description

@kouroshHakha

Hi all, I wanted to follow up on an issue that I have sporadically fixed with help from @robertgshaw2-redhat, but I'm not sure what the long-term fix is. Happy to make PRs to get this issue out of the way; I just need agreement on what needs to be done.

In the 2025-07-03 NIXL Perf OH Runbook, @robertgshaw2-redhat points out a performance issue with NIXL's transfer launch overhead and analyzes the implications of that overhead when using CUDA_IPC vs. IB. There is also a proposed fix for that issue, which involves changes to both the NIXL C++ layer and the vLLM NIXL connector.

I have separately validated that particular fix against a performance issue I was seeing while serving gpt-oss-120B with P4TP1-D2TP2 on an 8xH100. My fix follows Robert's proposal by stitching together a NIXL patch and a vLLM patch as follows (a quick sanity check is sketched after these two steps):

  1. Install a custom branch of NIXL (from Rob's fork):
mkdir -p ~/default/patches && cd ~/default/patches
git clone https://github.com/Pyngon/nixl.git
cd nixl
git remote add robertgshaw2 https://github.com/robertgshaw2-redhat/nixl.git
git fetch robertgshaw2 batched-workers
git checkout robertgshaw2/batched-workers

cd ~/default/patches/nixl

# without this replacement the NIXL build fails: the UCX plugin unit tests also need ucx_utils_lib linked in, which is what this sed adds
sed -i "s|ucx_backend_dep = declare_dependency(link_with: ucx_backend_lib, include_directories: \[nixl_inc_dirs, '../../../../src/plugins/ucx'\])|ucx_backend_dep = declare_dependency(link_with: [ucx_backend_lib, ucx_utils_lib], include_directories: [nixl_inc_dirs, ucx_utils_inc_dirs, '../../../../src/plugins/ucx'])|" test/unit/plugins/ucx/meson.build test/unit/plugins/ucx_mo/meson.build
meson setup build --prefix=/usr/local/nixl/ -Ducx_path=/usr/local/ucx
cd build
ninja
sudo ninja install   # installs the NIXL C++ library and plugins under /usr/local/nixl
cd ..
pip install .        # installs the Python bindings for this patched build
  2. Install a custom branch of vLLM that has Rob's fixes (+ some other fixes):
cd ~/default/patches
pip uninstall vllm -y
git clone -b kh/fix-nixl-with-batch-xfer https://github.com/kouroshHakha/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install .  --extra-index-url https://download.pytorch.org/whl/cu128
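
To make sure the patched builds are actually the ones being picked up (rather than a stock wheel or a stale system install), a quick sanity check along these lines helps. This is just a sketch, and the install paths depend on the --prefix used above:

# Sanity check (sketch): confirm the patched builds are the ones in use.
python -c "import nixl; print(nixl.__file__)"                    # should resolve to the pip install from the patched checkout
python -c "import vllm; print(vllm.__version__, vllm.__file__)"  # should resolve to the kh/fix-nixl-with-batch-xfer checkout
ls /usr/local/nixl/lib*/                                         # C++ library and plugins from `ninja install` (path follows the meson --prefix)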

Now I want to figure out a reliable packaging of this solution, ideally upstreaming the real fix and avoiding custom patches. My understanding was that the NIXL team has shipped this fix upstream with the intention of addressing this issue, and that for it to be effective, Rob made the corresponding vLLM change to the NIXL connector as well.

However, after upgrading to vLLM 0.11.0 and NIXL 0.6.0, I see that transmission times are in the range of seconds, resulting in massive TTFT hits when we do PD on a single node. (I don't have a multi-node setup to test IB, so we are limited to CUDA_IPC for now.) When I use the patches above, my TTFT goes back to a reasonable range.

Settings:

  • gpt-oss-120B
  • P: 4xTP1, D: 2xTP2
  • 8xH100
  • Input length (ITL): 10000 tokens, output length (OTL): 500 tokens
  • Block size: 128
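
For reference, a single-node PD deployment with these settings would be launched and benchmarked roughly as follows. This is a hedged sketch rather than the exact commands I used: it shows one prefill and one decode worker instead of the full 4P/2D set, the connector config string and the VLLM_NIXL_SIDE_CHANNEL_PORT variable follow the upstream NixlConnector examples, the proxy that routes requests between prefill and decode instances is omitted, and ports are placeholders.

# One prefill (P) worker on GPU 0 and one decode (D) worker on GPUs 4-5 (TP2);
# the real setup repeats this for 4 P and 2 D workers behind a PD-routing proxy.
CUDA_VISIBLE_DEVICES=0 VLLM_NIXL_SIDE_CHANNEL_PORT=5600 \
  vllm serve openai/gpt-oss-120b --port 8100 --block-size 128 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
CUDA_VISIBLE_DEVICES=4,5 VLLM_NIXL_SIDE_CHANNEL_PORT=5601 \
  vllm serve openai/gpt-oss-120b --port 8200 --tensor-parallel-size 2 --block-size 128 \
  --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'

# Benchmark roughly matching the settings above (10000 input / 500 output tokens,
# 512 prompts, request rate 8, max concurrency 256), pointed at the proxy endpoint:
vllm bench serve --model openai/gpt-oss-120b \
  --dataset-name random --random-input-len 10000 --random-output-len 500 \
  --num-prompts 512 --request-rate 8 --max-concurrency 256 --port 8000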

Stock vLLM 0.11.0, NIXL 0.6.0

{"date": "20251007-131635", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 173.1752845250012, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 2.9565419881038677, "request_goodput": null, "output_throughput": 1478.270994051934, "total_token_throughput": 31043.69087509061, "max_output_tokens_per_s": 2283.0, "max_concurrent_requests": 271, "mean_ttft_ms": 49437.55960683285, "median_ttft_ms": 54130.73057449947, "std_ttft_ms": 23736.198817249224, "p99_ttft_ms": 82495.83750360207, "mean_tpot_ms": 9.974323081549954, "median_tpot_ms": 10.19100517832757, "std_tpot_ms": 1.388854618770527, "p99_tpot_ms": 12.426047975852768, "mean_itl_ms": 9.977955695804779, "median_itl_ms": 10.170783993089572, "std_itl_ms": 1.9778775679690934, "p99_itl_ms": 14.450422546360633}

Patched vLLM and patched NIXL

{"date": "20251006-235458", "endpoint_type": "openai", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 72.13274197399733, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 7.098024918899764, "request_goodput": null, "output_throughput": 3549.012459449882, "total_token_throughput": 74529.26164844753, "max_output_tokens_per_s": 4445.0, "max_concurrent_requests": 103, "mean_ttft_ms": 1771.6487848570637, "median_ttft_ms": 1708.5792995058, "std_ttft_ms": 766.5458062734209, "p99_ttft_ms": 3491.9522748235613, "mean_tpot_ms": 16.349380812230752, "median_tpot_ms": 16.794258085175027, "std_tpot_ms": 1.7807793309110627, "p99_tpot_ms": 19.426186372438096, "mean_itl_ms": 16.35386197637269, "median_itl_ms": 15.602655497787055, "std_itl_ms": 8.4624046859869, "p99_itl_ms": 37.73036762679111}

The new issue on vLLM 0.11.0 + NIXL 0.6.0 manifests differently than the original issue the above patches were working around. In particular, those patches addressed the launch-time overhead (the time it takes to call self.nixl_wrapper.transfer(handle)) by introducing concurrency and batching. But the high TTFT on vLLM 0.11.0 is root-caused by high transmission times, not high transfer-init times (mean TTFT is ~49.4 s stock vs. ~1.8 s patched in the runs above), so it might be a new issue. Does anyone have any prior on this problem? Is my understanding of the intended fixes right, or are there more fixes missing from the stack to reach the high-performance regime?
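
One thing that may help narrow this down: on a single node, transmission times in the range of seconds often suggest that UCX is not actually taking the cuda_ipc path for the KV blocks, rather than connector-side launch overhead. Below is a hedged diagnostic sketch using standard UCX tooling and environment variables (nothing vLLM- or NIXL-specific; the `vllm serve ...` placeholders stand for the same launch commands as the deployment):

# Check that UCX on this box exposes cuda_ipc at all:
ucx_info -d | grep -i -B1 -A3 cuda_ipc

# Re-run the workers with UCX protocol-selection logging to see which transport
# is actually chosen for the KV-cache transfers:
UCX_PROTO_INFO=y UCX_LOG_LEVEL=info vllm serve ...

# Optionally pin the transports to rule out a silent fallback to a slow path:
UCX_TLS=cuda_ipc,cuda_copy,tcp vllm serve ...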

🐛 Describe the bug

N/A

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
