Description
Hi all, I wanted to follow up on an issue that I have sporadically worked around with help from @robertgshaw2-redhat, but I'm not sure what the long-term fix is. Happy to make PRs to get this issue out of the way; I just need some agreement on what needs to be done.
In the 2025-07-03 NIXL Perf OH Runbook, @robertgshaw2-redhat points out a performance issue with NIXL having a transfer-launch overhead and analyzes the implications of that overhead when using CUDA_IPC vs. IB. There is also a proposed fix for that issue, which requires changes to both the NIXL C++ layer and the vLLM NIXL connector.
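To make the launch-overhead argument concrete, here is a back-of-envelope sketch (every number is a placeholder I picked for illustration, not a measurement from the runbook): if each blocking `transfer()` launch costs a few milliseconds, a burst of in-flight requests pays that cost serially, and over CUDA_IPC the launch overhead dominates the (tiny) copy time, whereas over IB the wire time hides it.

```python
# Back-of-envelope sketch of how a synchronous transfer-launch overhead adds up.
# All numbers below are placeholder assumptions, not measurements from the runbook.
launch_overhead_ms = 5.0     # assumed cost of one blocking nixl_wrapper.transfer() launch
inflight_requests = 100      # assumed number of requests needing a KV transfer in a burst
copy_time_cuda_ipc_ms = 1.0  # assumed actual copy time over CUDA_IPC (intra-node)
copy_time_ib_ms = 20.0       # assumed actual copy time over IB (inter-node)

serial_launch_stall = inflight_requests * launch_overhead_ms
print(f"serial launch stall for the burst: ~{serial_launch_stall:.0f} ms")

# With CUDA_IPC the copy itself is cheap, so the launch overhead dominates;
# with IB the wire time dominates and the same overhead is mostly hidden.
print(f"CUDA_IPC: launch {launch_overhead_ms} ms vs copy {copy_time_cuda_ipc_ms} ms per request")
print(f"IB:       launch {launch_overhead_ms} ms vs copy {copy_time_ib_ms} ms per request")
```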
I have separately validated that particular fix against a performance issue I was seeing while serving GPT-OSS-120B with P4TP1-D2TP2 on an 8xH100 node. My setup follows Robert's proposed fixes by stitching together a NIXL patch and a vLLM patch as follows:
- Install a custom branch of NIXL (from Rob's fork):

```bash
mkdir -p ~/default/patches && cd ~/default/patches
git clone https://github.com/Pyngon/nixl.git
cd nixl
git remote add robertgshaw2 https://github.com/robertgshaw2-redhat/nixl.git
git fetch robertgshaw2 batched-workers
git checkout robertgshaw2/batched-workers
cd ~/default/patches/nixl
# if I don't do these replacements the installation of nixl will barf
sed -i "s|ucx_backend_dep = declare_dependency(link_with: ucx_backend_lib, include_directories: \[nixl_inc_dirs, '../../../../src/plugins/ucx'\])|ucx_backend_dep = declare_dependency(link_with: [ucx_backend_lib, ucx_utils_lib], include_directories: [nixl_inc_dirs, ucx_utils_inc_dirs, '../../../../src/plugins/ucx'])|" test/unit/plugins/ucx/meson.build test/unit/plugins/ucx_mo/meson.build
meson setup build --prefix=/usr/local/nixl/ -Ducx_path=/usr/local/ucx
cd build
ninja
sudo ninja install
cd ..
pip install .
```
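After the install, a quick smoke test can confirm the Python bindings resolve to the patched build. This is just a sketch; the import path is the one the vLLM NIXL connector uses, and the single-argument constructor call is my assumption:

```python
# Minimal smoke test for the patched NIXL install (sketch, not from the runbook).
# nixl._api.nixl_agent is the wrapper class that vLLM's NIXL connector imports.
from nixl._api import nixl_agent

agent = nixl_agent("smoke-test")  # throwaway agent just to exercise the bindings
print("NIXL agent created:", agent)
```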
- Install a custom branch of vLLM that has Rob's fixes (+ some other fixes):

```bash
cd ~/default/patches
pip uninstall vllm -y
git clone -b kh/fix-nixl-with-batch-xfer https://github.com/kouroshHakha/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 pip install . --extra-index-url https://download.pytorch.org/whl/cu128
```
Now I want to figure out a reliable packaging of this solution, ideally by upstreaming the real fix and avoiding custom patches. My understanding was that the NIXL team has shipped this fix with the intention of addressing this issue, and that for it to be effective, Rob made this vLLM change to the NIXL connector as well.
However, after upgrading to vLLM 0.11.0 and NIXL 0.6.0, I see that the transmission times are in the range of seconds, resulting in massive TTFT hits when we do P/D on a single node. (I don't have a multi-node setup to test IB, so we are limited to CUDA_IPC for now.) When I use the patches above, my TTFT goes back to a reasonable range.
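For context on why seconds-range transmission looks anomalous intra-node, here is a rough estimate using the standard KV-cache size formula. The model dimensions and bandwidth below are placeholders I have not checked against the real GPT-OSS-120B config or the actual link:

```python
# Rough estimate of the KV bytes one prefill hands off to the decode instance.
# The model dimensions below are placeholders, NOT the real GPT-OSS-120B config.
num_layers = 36      # assumed
num_kv_heads = 8     # assumed (GQA)
head_dim = 64        # assumed
dtype_bytes = 2      # bf16
tokens = 10_000      # input length from the benchmark settings below

kv_bytes = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * tokens  # K and V
kv_gib = kv_bytes / 2**30
print(f"~{kv_gib:.2f} GiB of KV per request")

# Intra-node CUDA_IPC/NVLink bandwidth is on the order of hundreds of GiB/s, so
# moving this much data should take milliseconds, not seconds.
assumed_bw_gib_s = 200
print(f"expected copy time at ~{assumed_bw_gib_s} GiB/s: "
      f"{kv_gib / assumed_bw_gib_s * 1e3:.1f} ms")
```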
Settings:
- GPT-OSS-120B
- P: 4xTP1, D: 2xTP2
- 8xH100
- ITL (input len): 10000, OTL (output len): 500
- Block size: 128
vLLM 0.11.0, NIXL 0.6.0:

```json
{"date": "20251007-131635", "endpoint_type": "vllm", "backend": "vllm", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 173.1752845250012, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 2.9565419881038677, "request_goodput": null, "output_throughput": 1478.270994051934, "total_token_throughput": 31043.69087509061, "max_output_tokens_per_s": 2283.0, "max_concurrent_requests": 271, "mean_ttft_ms": 49437.55960683285, "median_ttft_ms": 54130.73057449947, "std_ttft_ms": 23736.198817249224, "p99_ttft_ms": 82495.83750360207, "mean_tpot_ms": 9.974323081549954, "median_tpot_ms": 10.19100517832757, "std_tpot_ms": 1.388854618770527, "p99_tpot_ms": 12.426047975852768, "mean_itl_ms": 9.977955695804779, "median_itl_ms": 10.170783993089572, "std_itl_ms": 1.9778775679690934, "p99_itl_ms": 14.450422546360633}
```
Patched vLLM and patched NIXL:

```json
{"date": "20251006-235458", "endpoint_type": "openai", "label": null, "model_id": "openai/gpt-oss-120b", "tokenizer_id": "openai/gpt-oss-120b", "num_prompts": 512, "request_rate": 8.0, "burstiness": 1.0, "max_concurrency": 256, "duration": 72.13274197399733, "completed": 512, "total_input_tokens": 5120000, "total_output_tokens": 256000, "request_throughput": 7.098024918899764, "request_goodput": null, "output_throughput": 3549.012459449882, "total_token_throughput": 74529.26164844753, "max_output_tokens_per_s": 4445.0, "max_concurrent_requests": 103, "mean_ttft_ms": 1771.6487848570637, "median_ttft_ms": 1708.5792995058, "std_ttft_ms": 766.5458062734209, "p99_ttft_ms": 3491.9522748235613, "mean_tpot_ms": 16.349380812230752, "median_tpot_ms": 16.794258085175027, "std_tpot_ms": 1.7807793309110627, "p99_tpot_ms": 19.426186372438096, "mean_itl_ms": 16.35386197637269, "median_itl_ms": 15.602655497787055, "std_itl_ms": 8.4624046859869, "p99_itl_ms": 37.73036762679111}
```
The new issue on vLLM 0.11.0 + NIXL 0.6.0 manifests differently from the original issue that the above patches were working around. In particular, the above patches address the launch-time overhead (the time it takes to call `self.nixl_wrapper.transfer(handle)`) by introducing concurrency and batching. But the high TTFT on vLLM now is root-caused by high transmission times (not high transfer-init times), so it might be a new issue. Does anyone have a prior on this problem? Is my understanding of the intended fixes right, or are there more fixes missing from the stack to reach the high-performance regime?
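One way to separate the two failure modes is to time the blocking launch separately from the time until the transfer is reported done. A rough sketch (the `transfer()`/`check_xfer_state()` names are the ones the vLLM NIXL connector calls; the polling loop and cadence are my own assumption):

```python
import time

# Sketch: distinguish transfer-launch overhead from actual transmission time.
# `nixl_wrapper` is the connector's NIXL agent; `handle` is a prepared xfer handle.
def timed_transfer(nixl_wrapper, handle):
    t0 = time.perf_counter()
    nixl_wrapper.transfer(handle)          # launch: the call the patches batch/offload
    t_launch = time.perf_counter() - t0

    # Poll until the transfer is reported complete (busy-wait only for measurement).
    while nixl_wrapper.check_xfer_state(handle) != "DONE":
        time.sleep(0.001)
    t_total = time.perf_counter() - t0

    print(f"launch: {t_launch * 1e3:.2f} ms, "
          f"launch + transmission: {t_total * 1e3:.2f} ms")
```

If the launch time is small but the total time is in the seconds range, that would support this being a transmission-time problem rather than the launch-overhead problem the patches target.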
🐛 Describe the bug
N/A