cherry-pick developer_guide from main

Potabk · Potabk · commit be7488dc053e · 2025-05-19T10:17:19.000+08:00
Signed-off-by: wangli <wangli858794774@gmail.com>
diff --git a/benchmarks/scripts/run-performance-benchmarks.sh b/benchmarks/scripts/run-performance-benchmarks.sh
@@ -243,9 +243,12 @@ cleanup() {
   rm -rf ./vllm_benchmarks
 }
 get_benchmarks_scripts() {
-  git clone -b main --depth=1 https://ghfast.top/https://github.com/vllm-project/vllm && \
-  mv vllm/benchmarks vllm_benchmarks
-  rm -rf ./vllm
+  git clone --depth=1 --filter=blob:none --sparse https://github.com/vllm-project/vllm
+  cd vllm
+  git sparse-checkout set benchmarks
+  mv benchmarks ../vllm_benchmarks
+  cd ..
+  rm -rf vllm
 }
 
 main() {
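
The new `get_benchmarks_scripts` body runs `cd vllm` unconditionally. If the clone fails and the script is not running under `set -e` (this hunk does not show whether it is), the following `git sparse-checkout set` and `mv benchmarks` would act on whatever directory the script was started from. Below is a minimal sketch of a more defensive variant, not the committed code: the two positional parameters and the temporary directory are hypothetical additions for illustration, while the git flags are the ones used in the hunk above.

```bash
#!/usr/bin/env bash
# Sketch only: defensive variant of get_benchmarks_scripts.
# Hypothetical parameters (not in the original script):
#   $1  repository URL      (defaults to the URL used in the diff)
#   $2  destination folder  (defaults to vllm_benchmarks)
get_benchmarks_scripts() {
  local repo_url="${1:-https://github.com/vllm-project/vllm}"
  local dest="${2:-vllm_benchmarks}"
  local tmp
  tmp="$(mktemp -d)" || return 1

  # Chain with && so that a failed clone stops the sequence;
  # git -C avoids changing the caller's working directory.
  git clone --depth=1 --filter=blob:none --sparse "$repo_url" "$tmp/vllm" &&
    git -C "$tmp/vllm" sparse-checkout set benchmarks &&
    mv "$tmp/vllm/benchmarks" "$dest"
  local status=$?

  # Always remove the temporary clone, then report success or failure.
  rm -rf "$tmp"
  return "$status"
}
```

Calling `get_benchmarks_scripts` with no arguments is intended to mirror the committed behaviour; the function returns non-zero and leaves the working tree untouched if any git step fails.
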
diff --git a/docs/source/developer_guide/evaluation/index.md b/docs/source/developer_guide/evaluation/index.md
@@ -7,3 +7,9 @@ using_opencompass
 using_lm_eval
 accuracy_report/index
 :::
+
+:::{toctree}
+:caption: Performance
+:maxdepth: 1
+performance_benchmark
+:::
diff --git a/docs/source/developer_guide/performance_benchmark.md b/docs/source/developer_guide/performance_benchmark.md
@@ -0,0 +1,180 @@
+# Performance Benchmark
+This document describes the benchmark methodology for vllm-ascend, which evaluates performance under a variety of workloads. To stay aligned with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vLLM project.
+
+**Benchmark Coverage**: We measure offline end-to-end latency and throughput, as well as online serving performance at fixed QPS. For more details, see the [vllm-ascend benchmark scripts](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks).
+
+## 1. Run docker container
+```{code-block} bash
+   :substitutions:
+# Update DEVICE according to your device (/dev/davinci[0-7])
+export DEVICE=/dev/davinci7
+export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:|vllm_ascend_version|
+docker run --rm \
+--name vllm-ascend \
+--device $DEVICE \
+--device /dev/davinci_manager \
+--device /dev/devmm_svm \
+--device /dev/hisi_hdc \
+-v /usr/local/dcmi:/usr/local/dcmi \
+-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
+-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
+-v /etc/ascend_install.info:/etc/ascend_install.info \
+-v /root/.cache:/root/.cache \
+-e VLLM_USE_MODELSCOPE=True \
+-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
+-it $IMAGE \
+/bin/bash
+```
+
+## 2. Install dependencies
+```bash
+cd /workspace/vllm-ascend
+pip config set global.index-url https://mirrors.tuna.tsinghua.edu.cn/pypi/web/simple
+pip install -r benchmarks/requirements-bench.txt
+```
+
+## 3. (Optional) Prepare model weights
+For faster runs, we recommend downloading the model in advance:
+```bash
+modelscope download --model LLM-Research/Meta-Llama-3.1-8B-Instruct
+```
+For faster, lighter testing, we recommend setting the `load-format` parameter to `dummy`.
+Random weight values are then constructed based on the given model structure, which avoids
+the time spent downloading the model from the Internet.
+
+You can also replace the model paths in the [json](https://github.com/vllm-project/vllm-ascend/tree/v0.7.3-dev/benchmarks/tests) files with your local paths and adjust the other parameters, for example:
+```json
+[
+  {
+    "test_name": "latency_llama8B_tp1",
+    "parameters": {
+      "model": "/path/to/model",
+      "tensor_parallel_size": 1,
+      "load_format": "dummy",
+      "num_iters_warmup": 5,
+      "num_iters": 15
+    }
+  }
+]
+```
+
+## 4. Run benchmark script
+From the `/workspace/vllm-ascend` directory, run:
+```bash
+bash benchmarks/scripts/run-performance-benchmarks.sh
+```
+
+After about 10 minutes, the output looks like this:
+```text
+online serving:
+qps 1:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  212.77    
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              0.94      
+Output token throughput (tok/s):         204.66    
+Total Token throughput (tok/s):          405.16    
+---------------Time to First Token----------------
+Mean TTFT (ms):                          104.14    
+Median TTFT (ms):                        102.22    
+P99 TTFT (ms):                           153.82    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          38.78     
+Median TPOT (ms):                        38.70     
+P99 TPOT (ms):                           48.03     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           38.46     
+Median ITL (ms):                         36.96     
+P99 ITL (ms):                            75.03     
+==================================================
+
+qps 4:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  72.55     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              2.76      
+Output token throughput (tok/s):         600.24    
+Total Token throughput (tok/s):          1188.27   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          115.62    
+Median TTFT (ms):                        109.39    
+P99 TTFT (ms):                           169.03    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          51.48     
+Median TPOT (ms):                        52.40     
+P99 TPOT (ms):                           69.41     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           50.47     
+Median ITL (ms):                         43.95     
+P99 ITL (ms):                            130.29    
+==================================================
+
+qps 16:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  47.82     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              4.18      
+Output token throughput (tok/s):         910.62    
+Total Token throughput (tok/s):          1802.70   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          128.50    
+Median TTFT (ms):                        128.36    
+P99 TTFT (ms):                           187.87    
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          83.60     
+Median TPOT (ms):                        77.85     
+P99 TPOT (ms):                           165.90    
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           65.72     
+Median ITL (ms):                         54.84     
+P99 ITL (ms):                            289.63    
+==================================================
+
+qps inf:
+============ Serving Benchmark Result ============
+Successful requests:                     200       
+Benchmark duration (s):                  41.26     
+Total input tokens:                      42659     
+Total generated tokens:                  43545     
+Request throughput (req/s):              4.85      
+Output token throughput (tok/s):         1055.44   
+Total Token throughput (tok/s):          2089.40   
+---------------Time to First Token----------------
+Mean TTFT (ms):                          3394.37   
+Median TTFT (ms):                        3359.93   
+P99 TTFT (ms):                           3540.93   
+-----Time per Output Token (excl. 1st token)------
+Mean TPOT (ms):                          66.28     
+Median TPOT (ms):                        64.19     
+P99 TPOT (ms):                           97.66     
+---------------Inter-token Latency----------------
+Mean ITL (ms):                           56.62     
+Median ITL (ms):                         55.69     
+P99 ITL (ms):                            82.90     
+==================================================
+
+offline:
+latency:
+Avg latency: 4.944929537673791 seconds
+10% percentile latency: 4.894104263186454 seconds
+25% percentile latency: 4.909652255475521 seconds
+50% percentile latency: 4.932477846741676 seconds
+75% percentile latency: 4.9608619548380375 seconds
+90% percentile latency: 5.035418218374252 seconds
+99% percentile latency: 5.052476694583893 seconds
+
+throughput:
+Throughput: 4.64 requests/s, 2000.51 total tokens/s, 1010.54 output tokens/s
+Total num prompt tokens:  42659
+Total num output tokens:  43545
+```
+The result JSON files are written to the default path `benchmark/results`.
+These files contain detailed benchmarking results for further analysis.
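
The throughput lines in the serving output documented above follow from the counters printed just before them: requests, output tokens, and input plus output tokens, each divided by the benchmark duration. As a rough cross-check, the sketch below recomputes them with `awk` from a saved copy of the console output. The function name `summarize_serving` and the idea of piping a saved log into it are assumptions for illustration; only the line labels come from the sample output.

```bash
#!/usr/bin/env bash
# Sketch only: recompute throughput from "Serving Benchmark Result" blocks on stdin.
# summarize_serving is a hypothetical helper, not part of the benchmark scripts.
summarize_serving() {
  awk -F': *' '
    /^qps /                   { label = $1 }      # e.g. "qps 1"
    /^Successful requests/    { reqs = $2 + 0 }
    /^Benchmark duration/     { secs = $2 + 0 }
    /^Total input tokens/     { tin  = $2 + 0 }
    /^Total generated tokens/ {
      tout = $2 + 0
      # All four counters for this block have been read at this point.
      if (secs > 0)
        printf "%s: %.2f req/s, %.2f output tok/s, %.2f total tok/s\n",
               label, reqs / secs, tout / secs, (tin + tout) / secs
    }
  '
}
```

For example, `summarize_serving < benchmark.log`, where `benchmark.log` is a hypothetical file holding the console output. Because the printed duration is rounded to two decimals, the recomputed values may differ from the reported ones in the last digit.
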
+