Merged

96 commits
37f5cc2
draft of a3 llmdatadist connector
ganyi1996ppo May 19, 2025
20e6fea
add some fix for llmdatadist
ganyi1996ppo Jun 7, 2025
fba3080
fix dp issues in new vllm
ganyi1996ppo Jun 9, 2025
f43e3ef
fix v1 disaggregate prefill example ranktable generation issue
machenglong2025 Jun 11, 2025
a171c38
update gen_ranktable.sh
liziyu179 Jun 16, 2025
0bbefd4
some bugfix
liziyu179 Jun 16, 2025
7d9b419
pd e2e & llmdatadist connector ut
liziyu179 Jun 16, 2025
4311523
update connector ut
liziyu179 Jun 16, 2025
62e3e01
pd lifecycle ut update
underfituu Jun 16, 2025
6c51bf3
update PD README
liziyu179 Jun 16, 2025
36d81d7
update ut
zouyida2002 Jun 16, 2025
d69338b
update ut
zouyida2002 Jun 16, 2025
0378fae
lint fix
ganyi1996ppo Jun 17, 2025
76fe39a
refactor the connector's name to LLMDataDistCMgrConnector
ganyi1996ppo Jun 17, 2025
aaff668
refactor the connector's name to LLMDataDistCMgrConnector
ganyi1996ppo Jun 17, 2025
45b684b
spell issue fix
ganyi1996ppo Jun 17, 2025
70f2908
fix transfer_param issue
ganyi1996ppo Jun 17, 2025
4571362
remove all additional config usage
ganyi1996ppo Jun 17, 2025
81abe4e
update e2e edge test
liziyu179 Jun 18, 2025
29cbeec
update kv_connector ut
underfituu Jun 18, 2025
c526c4f
fix yapf
ganyi1996ppo Jun 18, 2025
e27756d
fix the refactor mistake
ganyi1996ppo Jun 18, 2025
e4ccf4b
fix lint issue
ganyi1996ppo Jun 18, 2025
4401639
update edge e2e test
liziyu179 Jun 18, 2025
8f541a4
fix mypy
ganyi1996ppo Jun 18, 2025
e707fbf
fix lint
ganyi1996ppo Jun 18, 2025
12c4680
fix lint
ganyi1996ppo Jun 18, 2025
52d7fa0
fix kv_connector ut
underfituu Jun 18, 2025
1810305
fix isort
ganyi1996ppo Jun 18, 2025
db27889
fix isort
ganyi1996ppo Jun 18, 2025
51a71cd
fix kv_connector ut
liziyu179 Jun 18, 2025
10fd85f
patch scheduler for pd benchmark
underfituu Jun 18, 2025
b0098b6
fix oom issue in ci and support heterogeneous tp in deepseek
ganyi1996ppo Jun 18, 2025
3045712
fix lint
ganyi1996ppo Jun 18, 2025
49384fc
update gen_ranktable
liziyu179 Jun 18, 2025
9dd4272
fix lint
ganyi1996ppo Jun 18, 2025
b51f679
fix lint
ganyi1996ppo Jun 18, 2025
ea1f1e7
update kv_connector ut
underfituu Jun 18, 2025
613639d
fix lint
underfituu Jun 18, 2025
431980f
fix lint
ganyi1996ppo Jun 18, 2025
9c5dfc6
fix lint
ganyi1996ppo Jun 18, 2025
3a19d15
fix patch_scheduler
underfituu Jun 19, 2025
458cd28
fix local_rank after torch_npu updated
ganyi1996ppo Jun 19, 2025
c9de70f
fix oom issue
ganyi1996ppo Jun 19, 2025
21cf831
fix ci crash caused by vllm update
ganyi1996ppo Jun 19, 2025
123acd2
fix ci crash caused by vllm update
ganyi1996ppo Jun 19, 2025
1f35219
ut bugfix
liziyu179 Jun 20, 2025
94cb7b1
fix lint
ganyi1996ppo Jun 21, 2025
02dde61
[0.9.1][Bugfix] fix oom issue in mla and enable mla_pa for deepseek m…
ganyi1996ppo Jun 22, 2025
d548890
fix lint
ganyi1996ppo Jun 23, 2025
792fe3c
remove patch scheduler
ganyi1996ppo Jun 23, 2025
f1b8633
update the doc
ganyi1996ppo Jun 23, 2025
a4d80e8
update PD README
liziyu179 Jun 23, 2025
0942a41
fix ascend scheduler for disaggregated pd
ganyi1996ppo Jun 23, 2025
bbcf6b3
fix master kv_connector ut
liziyu179 Jun 23, 2025
bb8d497
fix kv cache allocation issue
ganyi1996ppo Jun 23, 2025
df4293f
skip the float deepseek dbo test for which we do not support yet
ganyi1996ppo Jun 24, 2025
64019de
skip the deepseek dbo w8a8 test for oom in ci, will re-enable this af…
ganyi1996ppo Jun 24, 2025
f57683d
fix lint
ganyi1996ppo Jun 24, 2025
b0e9c57
fix lint
ganyi1996ppo Jun 26, 2025
e2acb38
re-enable disaggregate pd test in main
ganyi1996ppo Jun 26, 2025
13f418b
remove the older pd test
ganyi1996ppo Jun 26, 2025
c68c59a
fix rebase mistake
ganyi1996ppo Jul 1, 2025
5704788
fix lint
ganyi1996ppo Jul 1, 2025
6f35843
fix ci
ganyi1996ppo Jul 2, 2025
76f109c
fix lint
ganyi1996ppo Jul 2, 2025
51c774e
fix ci issue
ganyi1996ppo Jul 2, 2025
d23e672
fix ci issue
ganyi1996ppo Jul 2, 2025
aaede60
fix ci
ganyi1996ppo Jul 3, 2025
495bdec
fix mypy
ganyi1996ppo Jul 3, 2025
abde98a
fix ci
ganyi1996ppo Jul 3, 2025
9680832
return query if no kv_cache
ganyi1996ppo Jul 4, 2025
fd11b20
fix aclgraph accuracy issue
ganyi1996ppo Jul 8, 2025
d17a0eb
fix lint
ganyi1996ppo Jul 8, 2025
14bd9a7
allocate memory in normal way when disable disaggregate prefill
ganyi1996ppo Jul 8, 2025
27b9dab
fix typo
ganyi1996ppo Jul 8, 2025
be27edf
fix cache allocate bug
ganyi1996ppo Jul 8, 2025
5c34198
fix ut issue of multicard
ganyi1996ppo Jul 8, 2025
3f1d377
fix ut issue of multicard
ganyi1996ppo Jul 8, 2025
ae3c5b5
fix pangu's accuracy issue
ganyi1996ppo Jul 9, 2025
4c5b0d9
fix pangu's accuracy issue
ganyi1996ppo Jul 9, 2025
935a310
lint fix
ganyi1996ppo Jul 9, 2025
9e27397
fix rebase error
ganyi1996ppo Jul 15, 2025
7868900
remove redundant code
ganyi1996ppo Jul 15, 2025
d2628a5
force prefill node dummy run in prefill stage
ganyi1996ppo Jul 16, 2025
448809d
async pullkv && delay-free blocks in prefill nodes
underfituu Jul 17, 2025
03c99cd
increase toy proxy httpx limits & support chat/completions
underfituu Jul 17, 2025
b9b7f6b
update Disaggregate prefill README
underfituu Jul 17, 2025
d804377
fix lint
underfituu Jul 18, 2025
7e6bded
fix isort
ganyi1996ppo Jul 18, 2025
06a98a6
fix format
ganyi1996ppo Jul 18, 2025
5cd0d1b
fix some of the ut
ganyi1996ppo Jul 21, 2025
2594ecb
fix the format
ganyi1996ppo Jul 21, 2025
f2603e3
fix the ut
ganyi1996ppo Jul 21, 2025
b79c144
fix the ut
ganyi1996ppo Jul 21, 2025
9c20fdd
fix the ut
ganyi1996ppo Jul 21, 2025
7 changes: 3 additions & 4 deletions .github/workflows/vllm_ascend_test_pd.yaml
@@ -42,8 +42,7 @@ jobs:
strategy:
matrix:
vllm_verison: [
- # revert me when V1 disaggregation prefill is merged in main
- # main,
+ main,
v0.9.1
]
name: vLLM Ascend prefilling decoding disaggregation test
@@ -107,6 +106,6 @@ jobs:
pip install -r requirements-dev.txt
pip install -v -e .

- - name: Run vllm-project/vllm-ascend PD Disaggregation test
+ - name: Run vllm-project/vllm-ascend PD Disaggregation edge test
run: |
- pytest -sv tests/e2e/pd_disaggreate/test_pd_e2e.py
+ bash tests/e2e/pd_disaggreate/run_edge_case_test.sh
230 changes: 230 additions & 0 deletions examples/disaggregated_prefill_v1/README.md
@@ -0,0 +1,230 @@
# Disaggregated Prefill-Decode Deployment Guide

## Overview
This document provides instructions for running a disaggregated vLLM-ascend service with separate prefill and decode stages across 4 nodes, using 16 Ascend NPUs for two prefill nodes (P1/P2) and 16 Ascend NPUs for two decode nodes (D1/D2).

## Prerequisites
- Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (e.g. eth0)
- Model weights located at `/data01/deepseek_r1_w8a8_zhw`

## Rank table generation
The rank table is a JSON file that maps Ascend NPU devices to nodes and assigns them to the prefill and decode groups.

Run the following command on every node to generate a rank table with 16 prefill devices and 16 decode devices:
```shell
cd vllm-ascend/examples/disaggregated_prefill_v1/
bash gen_ranktable.sh --ips 172.19.32.175 172.19.241.49 172.19.123.51 172.19.190.36 \
--npus-per-node 8 --network-card-name enp189s0f0 --prefill-device-cnt 16 --decode-device-cnt 16
```
The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json`.
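
For reference, the generated file follows the structure produced by `gen_ranktable.py` in this directory. The sketch below uses illustrative values only (server counts, device IPs and cluster IDs depend on your cluster); on A3 SoCs each device entry additionally carries `super_pod_id` and `super_device_id` fields:
```json
{
    "version": "1.2",
    "server_count": "4",
    "prefill_device_list": [
        {"server_id": "172.19.32.175", "device_id": "0", "device_ip": "<device NIC IP>", "cluster_id": "1"}
    ],
    "decode_device_list": [
        {"server_id": "172.19.123.51", "device_id": "0", "device_ip": "<device NIC IP>", "cluster_id": "17"}
    ],
    "status": "completed"
}
```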

## Start the disaggregated vLLM-ascend service
Execution sequence:
- The 4 configured node IPs are: 172.19.32.175, 172.19.241.49, 172.19.123.51, 172.19.190.36
- Start prefill server P1 on node 1
- Start prefill server P2 on node 2
- Start decode server D1 on node 3
- Start decode server D2 on node 4
- Start the proxy server on node 1
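
Each `vllm serve` command below passes the same `--kv-transfer-config` JSON; the only field that differs between roles is `kv_role`, which is `kv_producer` on the prefill nodes (P1/P2) and `kv_consumer` on the decode nodes (D1/D2). The common shape, condensed from the full commands below:
```json
{
    "kv_connector": "LLMDataDistCMgrConnector",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer on P1/P2, kv_consumer on D1/D2",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}
```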

* Run prefill server P1 on the first node
```shell
export HCCL_IF_IP=172.19.32.175 # node ip
export GLOO_SOCKET_IFNAME="eth0" # network card name
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--api-server-count 2 \
--data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
```

* Run prefill server P2 on the second node
```shell
export HCCL_IF_IP=172.19.241.49
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--headless \
--data-parallel-size 2 \
--data-parallel-start-rank 1 \
--data-parallel-size-local 1 \
--data-parallel-address 172.19.32.175 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_producer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
```

* Run decode server D1 on the third node
```shell
export HCCL_IF_IP=172.19.123.51
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--data-parallel-size 2 \
--data-parallel-size-local 1 \
--api-server-count 2 \
--data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
```

* Run decode server D2 on the fourth node
```shell
export HCCL_IF_IP=172.19.190.36
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
vllm serve /data01/deepseek_r1_w8a8_zhw \
--host 0.0.0.0 \
--port 20002 \
--headless \
--data-parallel-size 2 \
--data-parallel-start-rank 1 \
--data-parallel-size-local 1 \
--data-parallel-address 172.19.123.51 \
--data-parallel-rpc-port 13356 \
--tensor-parallel-size 8 \
--no-enable-prefix-caching \
--seed 1024 \
--served-model-name deepseek \
--max-model-len 6144 \
--max-num-batched-tokens 6144 \
--trust-remote-code \
--enforce-eager \
--gpu-memory-utilization 0.9 \
--kv-transfer-config \
'{"kv_connector": "LLMDataDistCMgrConnector",
"kv_buffer_device": "npu",
"kv_role": "kv_consumer",
"kv_parallel_size": 1,
"kv_port": "20001",
"engine_id": "0",
"kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_c_mgr_connector"
}' \
--additional-config \
'{"torchair_graph_config": {"enabled": false, "enable_multistream_shared_expert": false}, "ascend_scheduler_config":{"enabled":false}}'
```

* Run the proxy server on the first node
```shell
cd /vllm-workspace/vllm-ascend/examples/disaggregated_prefill_v1
python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
```

* Verification
Check service health using the proxy server endpoint:
```shell
curl http://localhost:1025/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek",
"prompt": "Who are you?",
"max_tokens": 100,
"temperature": 0
}'
```
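
The toy proxy in this example also forwards `/v1/chat/completions` (support added in this PR), so a chat-style request can be used as well; a minimal sketch assuming the same served model name:
```shell
curl http://localhost:1025/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "deepseek",
        "messages": [{"role": "user", "content": "Who are you?"}],
        "max_tokens": 100,
        "temperature": 0
    }'
```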

* Performance
Test performance with the vLLM benchmark script:
```shell
cd /vllm-workspace/vllm/benchmarks
python3 benchmark_serving.py \
--backend vllm \
--dataset-name random \
--random-input-len 4096 \
--random-output-len 1536 \
--num-prompts 256 \
--ignore-eos \
--model deepseek \
--tokenizer /data01/deepseek_r1_w8a8_zhw \
--host localhost \
--port 8000 \
--endpoint /v1/completions \
--max-concurrency 4 \
--request-rate 4
```
120 changes: 120 additions & 0 deletions examples/disaggregated_prefill_v1/gen_ranktable.py
@@ -0,0 +1,120 @@
import argparse
import json
import os

import torch.distributed as dist

from vllm_ascend.soc_info import NPUSocInfo
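
# NOTE: this script is expected to be launched on every node (typically through
# gen_ranktable.sh, which wraps it with torchrun), so that MASTER_ADDR, MASTER_PORT,
# RANK, LOCAL_RANK and WORLD_SIZE are populated in the environment read below.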

parser = argparse.ArgumentParser(
description="Arguments of rank table generator", )
parser.add_argument("--local-host", type=str, required=True, help="local ip")
parser.add_argument("--prefill-device-cnt",
type=int,
required=True,
help="number of prefill devices")
parser.add_argument("--decode-device-cnt",
type=int,
required=True,
help="number of decode devices")
args = parser.parse_args()
local_host = args.local_host
prefill_device_cnt = args.prefill_device_cnt
decode_device_cnt = args.decode_device_cnt

print("enter py")

hccn_tool_path = os.environ.get("HCCN_TOOL_PATH",
"/usr/local/Ascend/driver/tools/hccn_tool")
master_addr = os.environ.get("MASTER_ADDR")
master_port = os.environ.get("MASTER_PORT")
rank = os.environ.get("RANK")
local_rank = os.environ.get("LOCAL_RANK")
# This variable is set by torchrun,
# and is different from WORLD_SIZE in gen_rank_table.sh.
world_size = os.environ.get("WORLD_SIZE")
soc_info = NPUSocInfo()


def get_cmd_stdout(cmd):
import subprocess
return subprocess.run(cmd, capture_output=True,
shell=True).stdout.decode("utf-8").strip()


print(f"local_host: {local_host}")
print("gen ranktable.json")

num_cards = get_cmd_stdout("npu-smi info -l | grep \"Total Count\"").split(
":")[1].strip()
num_cards = int(num_cards)
chips_per_card = get_cmd_stdout("npu-smi info -l | grep \"Chip Count\"").split(
"\n")[0].split(":")[1].strip()
chips_per_card = int(chips_per_card)

# generate local device list for local rank 0, and gather it to all ranks
local_device_list: list[dict[str, str]] = list()
if local_rank == "0":
super_pod_id = "0"
for card_id in range(num_cards):
for chip_id in range(chips_per_card):
device_id = card_id * chips_per_card + chip_id
if soc_info.is_a3:
device_ip = get_cmd_stdout(
f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr"
).split(":")[1].strip()
super_device_id = get_cmd_stdout(
f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID"
).split(":")[1].strip()
super_pod_id = get_cmd_stdout(
f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\""
).split(":")[1].strip()
else:
device_ip = get_cmd_stdout(
f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr"
).split(":")[1].strip()

device_info = {
"server_id": local_host,
"device_id": str(device_id),
"device_ip": str(device_ip),
}
if soc_info.is_a3:
device_info.update({
"super_pod_id": str(super_pod_id),
"super_device_id": str(super_device_id)
})
local_device_list.append(device_info)

dist.init_process_group(backend=dist.Backend.GLOO)
global_device_list = [None] * dist.get_world_size()
dist.all_gather_object(global_device_list, local_device_list)
global_device_list = [
device_info for device_list in global_device_list
for device_info in device_list # type: ignore[attr-defined]
]
cnt = 1
for device_info in global_device_list: # type: ignore[assignment]
device_info["cluster_id"] = str(cnt)
cnt += 1
assert (prefill_device_cnt + decode_device_cnt) <= len(global_device_list), \
"prefill_device_cnt + decode_device_cnt must be less than or equal to number of all devices in cluster"
ranktable = {
"version":
"1.2",
"server_count":
str(world_size),
"prefill_device_list":
global_device_list[:prefill_device_cnt],
"decode_device_list":
global_device_list[prefill_device_cnt:prefill_device_cnt +
decode_device_cnt],
"status":
"completed"
}

if local_rank == '0':
with open("ranktable.json", "w") as f:
json.dump(ranktable, f, indent=4)

print("gen ranktable.json done")