Description
Hi all, I wanted to follow up on an issue that I have sporadically worked around with help from @robertgshaw2-redhat, but I'm not sure what the long-term fix is. Happy to make PRs to get this issue out of the way; I just need some agreement on what needs to be done.
In the 2025-07-03 NIXL Perf OH Runbook, @robertgshaw2-redhat points out a performance issue with NIXL having a transfer-launch overhead and analyzes the implications of that overhead when using CUDA_IPC vs. IB. There is also a proposed fix for that issue, which requires changes to both the NIXL C++ layer and the vLLM NIXL connector.
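To make the launch-overhead argument concrete, here is a back-of-envelope sketch (every number is a placeholder I picked for illustration, not a measurement from the runbook): if each blocking `transfer()` launch costs a few milliseconds, a burst of in-flight requests pays that cost serially, and over CUDA_IPC the launch overhead dominates the (tiny) copy time, whereas over IB the wire time hides it.

```python
# Back-of-envelope sketch of how a synchronous transfer-launch overhead adds up.
# All numbers below are placeholder assumptions, not measurements from the runbook.
launch_overhead_ms = 5.0     # assumed cost of one blocking nixl_wrapper.transfer() launch
inflight_requests = 100      # assumed number of requests needing a KV transfer in a burst
copy_time_cuda_ipc_ms = 1.0  # assumed actual copy time over CUDA_IPC (intra-node)
copy_time_ib_ms = 20.0       # assumed actual copy time over IB (inter-node)

serial_launch_stall = inflight_requests * launch_overhead_ms
print(f"serial launch stall for the burst: ~{serial_launch_stall:.0f} ms")

# With CUDA_IPC the copy itself is cheap, so the launch overhead dominates;
# with IB the wire time dominates and the same overhead is mostly hidden.
print(f"CUDA_IPC: launch {launch_overhead_ms} ms vs copy {copy_time_cuda_ipc_ms} ms per request")
print(f"IB:       launch {launch_overhead_ms} ms vs copy {copy_time_ib_ms} ms per request")
```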
I have separately validated that particular fix against a performance issue I was seeing while serving GPT-OSS-120B with P4TP1-D2TP2 on an 8xH100 node. My setup follows Robert's proposed fixes by stitching together a NIXL patch and a vLLM patch as follows:
- Install a custom branch of NIXL (from Rob's fork):

```bash
mkdir -p ~/default/patches && cd ~/default/patches
git clone https://github.com/Pyngon/nixl.git
cd nixl
git remote add robertgshaw2 https://github.com/robertgshaw2-redhat/nixl.git
git fetch robertgshaw2 batched-workers
git checkout robertgshaw2/batched-workers
cd ~/default/patches/nixl
# if I don't do these replacements the installation of nixl will barf
sed -i "s|ucx_backend_dep = declare_dependency(link_with: ucx_backend_lib, include_directories: \[nixl_inc_dirs, '../../../../src/plugins/ucx'\])|ucx_backend_dep = declare_dependency(link_with: [ucx_backend_lib, ucx_utils_lib], include_directories: [nixl_inc_dirs, ucx_utils_inc_dirs, '../../../../src/plugins/ucx'])|" test/unit/plugins/ucx/meson.build test/unit/plugins/ucx_mo/meson.build
meson setup build --prefix=/usr/local/nixl/ -Ducx_path=/usr/local/ucx
cd build
ninja
sudo ninja install
cd ..
pip install .
```
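After the install, a quick smoke test can confirm the Python bindings resolve to the patched build. This is just a sketch; the import path is the one the vLLM NIXL connector uses, and the single-argument constructor call is my assumption:

```python
# Minimal smoke test for the patched NIXL install (sketch, not from the runbook).
# nixl._api.nixl_agent is the wrapper class that vLLM's NIXL connector imports.
from nixl._api import nixl_agent

agent = nixl_agent("smoke-test")  # throwaway agent just to exercise the bindings
print("NIXL agent created:", agent)
```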
- Install a custom branch of vLLM that has Rob's fixes (+ some other fixes):

```bash
cd ~/default/patches
pip uninstall vllm -y
git clone -b kh/fix-nixl-with-batch-xfer https://github.com/kouroshHakha/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install . --extra-index-url https://download.pytorch.org/whl/cu128
```
Now I want to figure out a reliable packaging of this solution, ideally by upstreaming the real fix and avoiding custom patches. My understanding was that the NIXL team has shipped this fix with the intention of addressing this issue, and that for it to be effective, Rob made this vLLM change to the NIXL connector as well.
However, after upgrading to vLLM 0.11.0 and NIXL 0.6.0, I see that the transmission times are in the range of seconds, resulting in massive TTFT hits when we do P/D on a single node. (I don't have a multi-node setup to test IB, so we are limited to CUDA_IPC for now.) When I use the patches above, my TTFT goes back to a reasonable range.
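For context on why seconds-range transmission looks anomalous intra-node, here is a rough estimate using the standard KV-cache size formula. The model dimensions and bandwidth below are placeholders I have not checked against the real GPT-OSS-120B config or the actual link:

```python
# Rough estimate of the KV bytes one prefill hands off to the decode instance.
# The model dimensions below are placeholders, NOT the real GPT-OSS-120B config.
num_layers = 36      # assumed
num_kv_heads = 8     # assumed (GQA)
head_dim = 64        # assumed
dtype_bytes = 2      # bf16
tokens = 10_000      # input length from the benchmark settings below

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens  # K and V
kv_gib = kv_bytes / 2**30
print(f"~{kv_gib:.2f} GiB of KV per request")

# Intra-node CUDA_IPC/NVLink bandwidth is on the order of hundreds of GiB/s, so
# moving this much data should take milliseconds, not seconds.
assumed_bw_gib_s = 200
print(f"expected copy time at ~{assumed_bw_gib_s} GiB/s: "
      f"{kv_gib / assumed_bw_gib_s * 1e3:.1f} ms")
```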
Settings:
- GPT-OSS-120B
- P: 4xTP1, D: 2xTP2
- 8xH100
- ITL (input len): 10000, OTL (output len): 500
- Block size: 128
vLLM 0.11.0, NIXL 0.6.0:

```json
{"date": "20251007-131635", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 173.1752845250012, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 2.9565419881038677, "request_goodput": null, "output_throughput": 1478.270994051934, "total_token_throughput": 31043.69087509061, "max_output_tokens_per_s": 2283.0, "max_concurrent_requests": 271, "mean_ttft_ms": 49437.55960683285, "median_ttft_ms": 54130.73057449947, "std_ttft_ms": 23736.198817249224, "p99_ttft_ms": 82495.83750360207, "mean_tpot_ms": 9.974323081549954, "median_tpot_ms": 10.19100517832757, "std_tpot_ms": 1.388854618770527, "p99_tpot_ms": 12.426047975852768, "mean_itl_ms": 9.977955695804779, "median_itl_ms": 10.170783993089572, "std_itl_ms": 1.9778775679690934, "p99_itl_ms": 14.450422546360633}
```
Patched vLLM and patched NIXL:

```json
{"date": "20251006-235458", "endpoint_type": "openai", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 72.13274197399733, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 7.098024918899764, "request_goodput": null, "output_throughput": 3549.012459449882, "total_token_throughput": 74529.26164844753, "max_output_tokens_per_s": 4445.0, "max_concurrent_requests": 103, "mean_ttft_ms": 1771.6487848570637, "median_ttft_ms": 1708.5792995058, "std_ttft_ms": 766.5458062734209, "p99_ttft_ms": 3491.9522748235613, "mean_tpot_ms": 16.349380812230752, "median_tpot_ms": 16.794258085175027, "std_tpot_ms": 1.7807793309110627, "p99_tpot_ms": 19.426186372438096, "mean_itl_ms": 16.35386197637269, "median_itl_ms": 15.602655497787055, "std_itl_ms": 8.4624046859869, "p99_itl_ms": 37.73036762679111}
```
The new issue on vLLM 0.11.0 + NIXL 0.6.0 manifests differently from the original issue that the above patches were working around. In particular, the above patches address the launch-time overhead (the time it takes to call `self.nixl_wrapper.transfer(handle)`) by introducing concurrency and batching. But the high TTFT on vLLM now is root-caused by high transmission times (not high transfer-init times), so it might be a new issue. Does anyone have a prior on this problem? Is my understanding of the intended fixes right, or are there more fixes missing from the stack to reach the high-performance regime?
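One way to separate the two failure modes is to time the blocking launch separately from the time until the transfer is reported done. A rough sketch (the `transfer()`/`check_xfer_state()` names are the ones the vLLM NIXL connector calls; the polling loop and cadence are my own assumption):

```python
import time

# Sketch: distinguish transfer-launch overhead from actual transmission time.
# `nixl_wrapper` is the connector's NIXL agent; `handle` is a prepared xfer handle.
def timed_transfer(nixl_wrapper, handle):
    t0 = time.perf_counter()
    nixl_wrapper.transfer(handle)          # launch: the call the patches batch/offload
    t_launch = time.perf_counter() - t0

    # Poll until the transfer is reported complete (busy-wait only for measurement).
    while nixl_wrapper.check_xfer_state(handle) != "DONE":
        time.sleep(0.001)
    t_total = time.perf_counter() - t0

    print(f"launch: {t_launch * 1e3:.2f} ms, "
          f"launch + transmission: {t_total * 1e3:.2f} ms")
```

If the launch time is small but the total time is in the seconds range, that would support this being a transmission-time problem rather than the launch-overhead problem the patches target.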
🐛 Describe the bug
N/A