
Commit 95082f3

fix v1 disaggregate prefill example ranktable generation issue
Signed-off-by: machenglong <machenglong_yewu@cmss.chinamobile.com>
1 parent 1bd2222 commit 95082f3

File tree: 3 files changed (+244, -23 lines)

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
# Disaggregated Prefill-Decode Deployment Guide

## Overview
This demo document provides instructions for running a disaggregated vLLM-Ascend service with separate prefill and decode stages across 4 nodes, using 16 Ascend NPUs for two prefill nodes (P1/P2) and 16 Ascend NPUs for two decode nodes (D1/D2).

## Prerequisites
- Ascend NPU environment with vLLM 0.9.1 installed
- Network interfaces configured for distributed communication (e.g. eth0)
- Model weights located at `/data01/deepseek_r1_w8a8_zhw`
## Rank table generation
The rank table is a JSON file that specifies the mapping of Ascend NPU ranks to nodes. The following command generates a rank table for all nodes, with 16 cards for prefill and 16 cards for decode. Run it on every node:

```bash
cd vllm-ascend/examples/disaggregate_prefill_v1/
bash gen_ranktable.sh 16 16
```

The rank table will be generated at `/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json`.
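For reference, each device entry in the generated rank table has roughly the shape sketched below. This is inferred from the fields `gen_ranktable.py` assembles; all values here are placeholders, not real device data:

```python
# Sketch of one device entry as assembled by gen_ranktable.py.
# Every value below is a placeholder.
device_info = {
    "server_id": "172.19.32.175",  # IP of the node hosting the NPU
    "device_id": "0",              # NPU index on that node
    "device_ip": "10.0.0.1",       # NPU NIC IP as reported by hccn_tool
}
# On A3 SoCs the script adds two extra fields:
device_info.update({"super_pod_id": "0", "super_device_id": "0"})
print(sorted(device_info))
```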
## Start disaggregated vLLM-ascend service
Execution sequence:
- The 4 configured node IPs are: 172.19.32.175, 172.19.241.49, 172.19.123.51, 172.19.190.36
- Start prefill on node 1 (P1)
- Start prefill on node 2 (P2)
- Start decode on node 3 (D1)
- Start decode on node 4 (D2)
- Start the proxy server on node 1
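Before launching the servers, it can help to check that the generated rank table actually covers every configured node. The helper below is hypothetical (the top-level layout of `ranktable.json` may differ); it only assumes that device entries carry a `server_id` field, as `gen_ranktable.py` emits:

```python
# IPs from the execution sequence above.
NODE_IPS = ["172.19.32.175", "172.19.241.49", "172.19.123.51", "172.19.190.36"]

def missing_nodes(ranktable: dict, node_ips: list) -> list:
    """Return configured node IPs that have no device entry in the rank table.

    Hypothetical helper: scans every list-valued section of the rank table
    for "server_id" fields; the real JSON layout may differ.
    """
    seen = set()
    for section in ranktable.values():
        if isinstance(section, list):
            for dev in section:
                if isinstance(dev, dict):
                    seen.add(dev.get("server_id"))
    return [ip for ip in node_ips if ip not in seen]

# Toy rank table with only the first node present:
table = {"device_list": [{"server_id": "172.19.32.175", "device_id": "0"}]}
print(missing_nodes(table, NODE_IPS))  # the three nodes without entries
```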
* Run prefill server P1 on the first node

```bash
export HCCL_IF_IP=`hostname -I|awk -F " " '{print$1}'`
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export VLLM_VERSION=0.9.1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --api-server-count 2 \
  --data-parallel-address 172.19.32.175 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistConnectorA3",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enable": false, "enable_multistream_shared_expert": false}, "expert_tensor_parallel_size": 1}'
```
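The `--kv-transfer-config` argument is a JSON string: prefill nodes use `"kv_role": "kv_producer"` and decode nodes use `"kv_consumer"`. Since a stray quote or comma makes the server fail at startup, a quick sanity check before pasting it into the command line is to round-trip it through a JSON parser:

```python
import json

# The same JSON string passed to --kv-transfer-config above.
kv_transfer_config = '''{"kv_connector": "LLMDataDistConnectorA3",
  "kv_buffer_device": "npu",
  "kv_role": "kv_producer",
  "kv_parallel_size": 1,
  "kv_port": "20001",
  "engine_id": "0",
  "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
}'''

cfg = json.loads(kv_transfer_config)  # raises ValueError if the JSON is malformed
assert cfg["kv_role"] in ("kv_producer", "kv_consumer")
print(cfg["kv_connector"], cfg["kv_role"])
```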

* Run prefill server P2 on the second node

```bash
export HCCL_IF_IP=`hostname -I|awk -F " " '{print$1}'`
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export VLLM_VERSION=0.9.1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-start-rank 1 \
  --data-parallel-size-local 1 \
  --data-parallel-address 172.19.32.175 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistConnectorA3",
    "kv_buffer_device": "npu",
    "kv_role": "kv_producer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enable": false, "enable_multistream_shared_expert": false}, "expert_tensor_parallel_size": 1}'
```

* Run decode server D1 on the third node

```bash
export HCCL_IF_IP=`hostname -I|awk -F " " '{print$1}'`
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export VLLM_VERSION=0.9.1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --data-parallel-size 2 \
  --data-parallel-size-local 1 \
  --api-server-count 2 \
  --data-parallel-address 172.19.123.51 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistConnectorA3",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enable": false, "enable_multistream_shared_expert": false}, "expert_tensor_parallel_size": 1}'
```

* Run decode server D2 on the fourth node

```bash
export HCCL_IF_IP=`hostname -I|awk -F " " '{print$1}'`
export GLOO_SOCKET_IFNAME="eth0"
export TP_SOCKET_IFNAME="eth0"
export HCCL_SOCKET_IFNAME="eth0"
export DISAGGREGATED_PREFILL_RANK_TABLE_PATH=/vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1/ranktable.json
export OMP_PROC_BIND=false
export OMP_NUM_THREADS=100
export VLLM_USE_V1=1
export VLLM_VERSION=0.9.1
vllm serve /data01/deepseek_r1_w8a8_zhw \
  --host 0.0.0.0 \
  --port 20002 \
  --headless \
  --data-parallel-size 2 \
  --data-parallel-start-rank 1 \
  --data-parallel-size-local 1 \
  --data-parallel-address 172.19.123.51 \
  --data-parallel-rpc-port 13356 \
  --tensor-parallel-size 8 \
  --no-enable-prefix-caching \
  --seed 1024 \
  --served-model-name deepseek \
  --max-model-len 6144 \
  --max-num-batched-tokens 6144 \
  --trust-remote-code \
  --enforce-eager \
  --gpu-memory-utilization 0.9 \
  --kv-transfer-config \
  '{"kv_connector": "LLMDataDistConnectorA3",
    "kv_buffer_device": "npu",
    "kv_role": "kv_consumer",
    "kv_parallel_size": 1,
    "kv_port": "20001",
    "engine_id": "0",
    "kv_connector_module_path": "vllm_ascend.distributed.llmdatadist_connector_v1_a3"
  }' \
  --additional-config \
  '{"torchair_graph_config": {"enable": false, "enable_multistream_shared_expert": false}, "expert_tensor_parallel_size": 1}'
```

* Run the proxy server on the first node

```bash
cd /vllm-workspace/vllm-ascend/examples/disaggregate_prefill_v1
python toy_proxy_server.py --host 172.19.32.175 --port 1025 --prefiller-hosts 172.19.241.49 --prefiller-port 20002 --decoder-hosts 172.19.123.51 --decoder-ports 20002
```

* Verification

Check service health using the proxy server endpoint:

```bash
curl http://localhost:1025/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek",
    "prompt": "你是谁?",
    "max_tokens": 100,
    "temperature": 0
  }'
```
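The same check can be issued from Python using only the standard library. This is a minimal sketch mirroring the curl command above; the commented-out lines perform the actual request once the proxy is up:

```python
import json
import urllib.request

# Same payload as the curl health check above.
payload = {
    "model": "deepseek",
    "prompt": "你是谁?",  # "Who are you?"
    "max_tokens": 100,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:1025/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the proxy server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
print(req.get_method(), req.full_url)
```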

examples/disaggregate_prefill_v1/gen_ranktable.py

Lines changed: 28 additions & 21 deletions
```diff
@@ -19,9 +19,13 @@

 print("enter py")

+hccn_tool_path = os.environ.get(
+    "HCCN_TOOL_PATH", "/usr/local/Ascend/driver/tools/hccn_tool"
+)
 master_addr = os.environ.get("MASTER_ADDR")
 master_port = os.environ.get("MASTER_PORT")
 rank = os.environ.get("RANK")
+local_rank = os.environ.get("LOCAL_RANK")
 # This variable is set by torchrun,
 # and is different from WORLD_SIZE in gen_rank_table.sh.
 world_size = os.environ.get("WORLD_SIZE")
@@ -44,26 +48,28 @@ def get_cmd_stdout(cmd):
 chips_per_card = get_cmd_stdout("npu-smi info -l | grep \"Chip Count\"").split("\n")[0].split(":")[1].strip()
 chips_per_card = int(chips_per_card)

+# generate local device list for local rank 0, and gather it to all ranks
 local_device_list: list[dict[str, str]] = list()
-super_pod_id = "0"
-for card_id in range(num_cards):
-    for chip_id in range(chips_per_card):
-        device_id = card_id * chips_per_card + chip_id
-        if soc_info.is_a3:
-            device_ip = get_cmd_stdout(f"/usr/local/Ascend/driver/tools/hccn_tool -i {device_id} -vnic -g | grep ipaddr").split(":")[1].strip()
-            super_device_id = get_cmd_stdout(f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID").split(":")[1].strip()
-            super_pod_id = get_cmd_stdout(f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\"").split(":")[1].strip()
-        else:
-            device_ip = get_cmd_stdout(f"/usr/local/Ascend/driver/tools/hccn_tool -i {device_id} -ip -g | grep ipaddr").split(":")[1].strip()
+if local_rank == "0":
+    super_pod_id = "0"
+    for card_id in range(num_cards):
+        for chip_id in range(chips_per_card):
+            device_id = card_id * chips_per_card + chip_id
+            if soc_info.is_a3:
+                device_ip = get_cmd_stdout(f"{hccn_tool_path} -i {device_id} -vnic -g | grep ipaddr").split(":")[1].strip()
+                super_device_id = get_cmd_stdout(f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep SDID").split(":")[1].strip()
+                super_pod_id = get_cmd_stdout(f"npu-smi info -t spod-info -i {card_id} -c {chip_id} | grep \"Super Pod ID\"").split(":")[1].strip()
+            else:
+                device_ip = get_cmd_stdout(f"{hccn_tool_path} -i {device_id} -ip -g | grep ipaddr").split(":")[1].strip()

-        device_info = {
-            "server_id": local_host,
-            "device_id": str(device_id),
-            "device_ip": str(device_ip),
-        }
-        if soc_info.is_a3:
-            device_info.update({"super_pod_id": str(super_pod_id), "super_device_id": str(super_device_id)})
-        local_device_list.append(device_info)
+            device_info = {
+                "server_id": local_host,
+                "device_id": str(device_id),
+                "device_ip": str(device_ip),
+            }
+            if soc_info.is_a3:
+                device_info.update({"super_pod_id": str(super_pod_id), "super_device_id": str(super_device_id)})
+            local_device_list.append(device_info)

 dist.init_process_group(backend=dist.Backend.GLOO)
 global_device_list = [None] * dist.get_world_size()
@@ -84,7 +90,8 @@ def get_cmd_stdout(cmd):
 }


-with open("ranktable.json", "w") as f:
-    json.dump(ranktable, f, indent=4)
+if local_rank == '0':
+    with open("ranktable.json", "w") as f:
+        json.dump(ranktable, f, indent=4)

-print("gen ranktable.json done")
+print("gen ranktable.json done")
```
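The patch gates both device discovery and the `ranktable.json` write on `LOCAL_RANK == "0"`, so that when several processes run per node, only one probes the NPUs and only one writes the file. A minimal stdlib sketch of the write gating (the helper name and the `"version"` key are illustrative, not taken from the script):

```python
import json
import os
import tempfile

def write_ranktable(ranktable: dict, local_rank: str, path: str) -> bool:
    """Mirror the patched logic: only local rank 0 writes ranktable.json."""
    if local_rank == "0":
        with open(path, "w") as f:
            json.dump(ranktable, f, indent=4)
        return True
    return False

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "ranktable.json")
    # Non-zero local ranks skip the write entirely.
    assert write_ranktable({"version": "1.0"}, local_rank="1", path=path) is False
    assert not os.path.exists(path)
    # Local rank 0 performs the write.
    assert write_ranktable({"version": "1.0"}, local_rank="0", path=path) is True
    with open(path) as f:
        print(json.load(f))
```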

examples/disaggregate_prefill_v1/gen_ranktable.sh

Lines changed: 3 additions & 2 deletions
```diff
@@ -2,8 +2,9 @@

 source /usr/local/Ascend/ascend-toolkit/set_env.sh
 export LD_LIBRARY_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp/vendors/customize/op_api/lib/:${LD_LIBRARY_PATH}
-
+# Please modify the IPs and IFRAME according to your environment
 IPs=('1.0.0.0' '1.0.0.1')
+IFRAME=enp189s0f0
 LOCAL_HOST=`hostname -I|awk -F " " '{print$1}'`
 GPUS_PER_NODE=8
 MASTER_ADDR=${IPs[0]}
@@ -35,7 +36,7 @@ echo "NODE_RANK": $NODE_RANK
 echo "==============="

 if [[ -n "${GEN_RANKTABLE}" || ! -e ${PWD}/ranktable.json ]]; then
-    GLOO_SOCKET_IFNAME=enp189s0f0 torchrun \
+    GLOO_SOCKET_IFNAME=${IFRAME} torchrun \
     --nproc_per_node 1 \
     --nnodes ${NNODES} \
     --node_rank ${NODE_RANK} \
```
