Commit be7488d

cherry-pick developer_guide from main
Signed-off-by: wangli <wangli858794774@gmail.com>
1 parent aa53229 commit be7488d

3 files changed, +192 -3 lines changed
benchmarks/scripts/run-performance-benchmarks.sh

Lines changed: 6 additions & 3 deletions
@@ -243,9 +243,12 @@ cleanup() {
     rm -rf ./vllm_benchmarks
 }
 get_benchmarks_scripts() {
-    git clone -b main --depth=1 https://ghfast.top/https://github.com/vllm-project/vllm && \
-    mv vllm/benchmarks vllm_benchmarks
-    rm -rf ./vllm
+    git clone --depth=1 --filter=blob:none --sparse https://github.com/vllm-project/vllm
+    cd vllm
+    git sparse-checkout set benchmarks
+    mv benchmarks ../vllm_benchmarks
+    cd ..
+    rm -rf vllm
 }
 
 main() {
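
If `run-performance-benchmarks.sh` does not run under `set -e` (this hunk does not show whether it does), a failed `git clone` in the new version would make `cd vllm` fail, and the remaining commands would then run in the current directory. A more defensive variant might look like the sketch below. The `repo_url` and `dest` parameters and the temporary working directory are illustrative assumptions, not part of the commit.

```bash
# Sketch only: repo_url, dest and the temporary directory are illustrative
# choices, not part of the committed script.
get_benchmarks_scripts() {
    local repo_url="${1:-https://github.com/vllm-project/vllm}"
    local dest="${2:-vllm_benchmarks}"
    local workdir status

    workdir="$(mktemp -d)" || return 1

    # Blobless sparse clone: file contents are fetched only for paths that
    # get checked out (top-level files plus benchmarks/).
    # Each step runs only if the previous one succeeded.
    git clone --depth=1 --filter=blob:none --sparse "$repo_url" "$workdir/vllm" &&
        git -C "$workdir/vllm" sparse-checkout set benchmarks &&
        mv "$workdir/vllm/benchmarks" "$dest"
    status=$?

    # Always remove the temporary clone, then report any failure.
    rm -rf "$workdir"
    return "$status"
}
```

Called without arguments, the sketch produces the same `vllm_benchmarks` directory as the committed function; on failure it returns a non-zero status and leaves the current directory untouched.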

docs/source/developer_guide/evaluation/index.md

Lines changed: 6 additions & 0 deletions
@@ -7,3 +7,9 @@ using_opencompass
 using_lm_eval
 accuracy_report/index
 :::
+
+:::{toctree}
+:caption: Performance
+:maxdepth: 1
+performance_benchmark
+:::
Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
# Performance Benchmark
This document describes the benchmark methodology for vllm-ascend, which evaluates performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.

**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as online serving at fixed QPS. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks).

## 1. Run docker container
```{code-block} bash
   :substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
    --name vllm-ascend \
    --device $DEVICE \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
    -v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /root/.cache:/root/.cache \
    -e VLLM_USE_MODELSCOPE=True \
    -e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
    -it $IMAGE \
    /bin/bash
```
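
`docker run` fails if a path passed to `--device` or `-v` is missing on the host, so a quick pre-flight check can save a round trip. The helper below is a minimal sketch; `check_host_paths` is an illustrative name, not something vllm-ascend provides.

```bash
# Sketch only: check_host_paths is an illustrative helper, not part of vllm-ascend.
# Prints every argument that does not exist on the host and returns non-zero
# if at least one is missing.
check_host_paths() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call, using the paths from the docker command above:
# check_host_paths "${DEVICE:-/dev/davinci7}" /dev/davinci_manager /dev/devmm_svm \
#     /dev/hisi_hdc /usr/local/dcmi /usr/local/bin/npu-smi \
#     /usr/local/Ascend/driver/lib64/ /usr/local/Ascend/driver/version.info \
#     /etc/ascend_install.info
```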

## 2. Install dependencies
```bash
cd /workspace/vllm-ascend
pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
pip install -r benchmarks/requirements-bench.txt
```

## 3. (Optional) Prepare model weights
To reduce run time, we recommend downloading the model in advance:
```bash
modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
```
For faster, lighter-weight testing, we recommend setting the `load-format` parameter (`load_format` in the JSON test files) to `dummy`. Random weight values are then constructed from the model structure, which avoids the time spent downloading the model from the Internet.

You can also replace the model paths in the [JSON](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks/tests) test files with your local paths and adjust the other parameters as needed:
```json
[
  {
    "test_name": "latency_llama8B_tp1",
    "parameters": {
      "model": "/path/to/model",
      "tensor_parallel_size": 1,
      "load_format": "dummy",
      "num_iters_warmup": 5,
      "num_iters": 15
    }
  }
]
```
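
Editing every test file by hand is tedious, so the replacement can be scripted. The sketch below rewrites the value of each `"model"` key with `sed`. `set_model_path` is an illustrative helper, and it assumes that each `"model"` entry sits on one line, as in the example above, and that the path contains none of the characters `|`, `&` or `\`.

```bash
# Sketch only: set_model_path is an illustrative helper, not part of vllm-ascend.
# Usage: set_model_path <json-file> <local-model-dir>
# Rewrites the value of every "model" key in place and keeps a .bak copy.
# Assumes the path contains none of the characters |, & or \ (special to sed).
set_model_path() {
    local file="$1" model_dir="$2"
    sed -i.bak -E \
        "s|(\"model\"[[:space:]]*:[[:space:]]*)\"[^\"]*\"|\\1\"${model_dir}\"|g" \
        "$file"
}
```

For example, `for f in benchmarks/tests/*.json; do set_model_path "$f" /path/to/model; done` would update every test file under the directory linked above. Other keys such as `load_format` are left unchanged.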

## 4. Run benchmark script
Run the benchmark script:
```bash
bash benchmarks/scripts/run-performance-benchmarks.sh
```

After about 10 minutes, the script prints output similar to the following:
```text
online serving:
qps 1:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 212.77
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 0.94
Output token throughput (tok/s): 204.66
Total Token throughput (tok/s): 405.16
---------------Time to First Token----------------
Mean TTFT (ms): 104.14
Median TTFT (ms): 102.22
P99 TTFT (ms): 153.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 38.78
Median TPOT (ms): 38.70
P99 TPOT (ms): 48.03
---------------Inter-token Latency----------------
Mean ITL (ms): 38.46
Median ITL (ms): 36.96
P99 ITL (ms): 75.03
==================================================

qps 4:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 72.55
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 2.76
Output token throughput (tok/s): 600.24
Total Token throughput (tok/s): 1188.27
---------------Time to First Token----------------
Mean TTFT (ms): 115.62
Median TTFT (ms): 109.39
P99 TTFT (ms): 169.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 51.48
Median TPOT (ms): 52.40
P99 TPOT (ms): 69.41
---------------Inter-token Latency----------------
Mean ITL (ms): 50.47
Median ITL (ms): 43.95
P99 ITL (ms): 130.29
==================================================

qps 16:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 47.82
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.18
Output token throughput (tok/s): 910.62
Total Token throughput (tok/s): 1802.70
---------------Time to First Token----------------
Mean TTFT (ms): 128.50
Median TTFT (ms): 128.36
P99 TTFT (ms): 187.87
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 83.60
Median TPOT (ms): 77.85
P99 TPOT (ms): 165.90
---------------Inter-token Latency----------------
Mean ITL (ms): 65.72
Median ITL (ms): 54.84
P99 ITL (ms): 289.63
==================================================

qps inf:
============ Serving Benchmark Result ============
Successful requests: 200
Benchmark duration (s): 41.26
Total input tokens: 42659
Total generated tokens: 43545
Request throughput (req/s): 4.85
Output token throughput (tok/s): 1055.44
Total Token throughput (tok/s): 2089.40
---------------Time to First Token----------------
Mean TTFT (ms): 3394.37
Median TTFT (ms): 3359.93
P99 TTFT (ms): 3540.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.28
Median TPOT (ms): 64.19
P99 TPOT (ms): 97.66
---------------Inter-token Latency----------------
Mean ITL (ms): 56.62
Median ITL (ms): 55.69
P99 ITL (ms): 82.90
==================================================

offline:
latency:
Avg latency: 4.944929537673791 seconds
10% percentile latency: 4.894104263186454 seconds
25% percentile latency: 4.909652255475521 seconds
50% percentile latency: 4.932477846741676 seconds
75% percentile latency: 4.9608619548380375 seconds
90% percentile latency: 5.035418218374252 seconds
99% percentile latency: 5.052476694583893 seconds

throughput:
Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
Total num prompt tokens: 42659
Total num output tokens: 43545
```
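
The serving throughput figures are consistent with a count divided by the benchmark duration. A quick check against the qps 1 numbers above, using an illustrative helper named `rate`:

```bash
# Sketch only: rate is an illustrative helper that divides a count by a duration.
rate() {
    awk -v count="$1" -v seconds="$2" 'BEGIN { printf "%.2f\n", count / seconds }'
}

rate 200 212.77      # request throughput (req/s)      -> 0.94
rate 43545 212.77    # output token throughput (tok/s) -> 204.66
```

The total token throughput corresponds to the sum of input and generated tokens over the same duration. Because the printed duration is itself rounded, a recomputed value can differ from the reported one in the last digit.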
The result JSON files are written to the default path `benchmark/results`.
These files contain detailed benchmarking results for further analysis.
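
The schema of the JSON files is not described here, so for a first look it can be easier to condense the console output. The filter below is a minimal sketch that prints one line per QPS setting with its mean TTFT. It assumes the output was saved to a file and keeps the `qps N:` and `Mean TTFT (ms):` line format shown above; `summarize_ttft` is an illustrative name.

```bash
# Sketch only: summarize_ttft is an illustrative helper, not part of vllm-ascend.
# Reads benchmark console output on stdin and prints "<qps><TAB><mean TTFT in ms>".
summarize_ttft() {
    awk '
        # Remember the current QPS label, dropping the trailing colon.
        $1 == "qps" { qps = $2; sub(/:$/, "", qps) }
        # The value is the last field of the "Mean TTFT (ms): ..." line.
        $1 == "Mean" && $2 == "TTFT" { printf "%s\t%s\n", qps, $NF }
    '
}
```

With the run shown above saved to a hypothetical `bench.log`, `summarize_ttft < bench.log` would print four lines: `1` with `104.14`, `4` with `115.62`, `16` with `128.50`, and `inf` with `3394.37`.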