Commit 166eb47

[Doc] Add AISBench accuracy and performance evaluation doc.

Signed-off-by: menogrey <1299267905@qq.com>

1 parent 30dbcfa commit 166eb47

3 files changed: +282 -3 lines changed

docs/source/developer_guide/evaluation/index.md

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@
:maxdepth: 1
using_evalscope
using_lm_eval
+ using_ais_bench
using_opencompass
accuracy_report/index
:::

docs/source/developer_guide/evaluation/using_ais_bench.md

Lines changed: 181 additions & 0 deletions

@@ -0,0 +1,181 @@
# Using AISBench

This document guides you through conducting accuracy and performance testing with [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

You can run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```

The vLLM server has started successfully if you see logs like the following:

```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
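
As an optional sanity check (not part of the original guide), you can query the server's OpenAI-compatible endpoints before running any evaluation. The sketch below assumes the server started above is reachable at `localhost:8000` and that the `requests` package is available:

```python
# Illustrative sanity check (assumes the vLLM server above listens on localhost:8000).
import requests

# List the served models; the id should be "Qwen/Qwen2.5-0.5B-Instruct" for the command above.
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])

# Send a tiny completion request to confirm end-to-end inference works.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Hello", "max_tokens": 8},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```
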
### 2. Run the C-Eval dataset using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details.
Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to check the installation.

#### Download Dataset

Take the `C-Eval` dataset as an example. Refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets; each dataset has a `README.md` describing the detailed download and installation process.

Download the dataset and extract it to the expected path:

```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
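
As a quick, illustrative check (not from the AISBench docs), you can confirm that the archive was extracted where AISBench expects it; the path below mirrors the shell steps above and assumes you run it from the AISBench repository root:

```python
# Illustrative check: list what was extracted from ceval-exam.zip.
# Run from the AISBench repository root; adjust the path if your layout differs.
from pathlib import Path

root = Path("ais_bench/datasets/ceval/formal_ceval")
for entry in sorted(root.iterdir()):
    print(entry.name)
```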

#### Update the Model Config Python File

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
There are several arguments that you should update according to your environment:

- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server IP and port.
- `max_out_len`: Note that `max_out_len` plus the LLM input length must stay below `max-model-len` (configured in your vLLM server). For example, with `--max_model_len 4096` and prompts of up to roughly 1,000 tokens, keep `max_out_len` at about 3,000 or less.
- `batch_size`: Update according to your dataset.
- `temperature`: Update the inference sampling argument as needed.

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = xxx,
        batch_size = xxx,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```

#### Execute Accuracy Evaluation

Run the following command to execute the accuracy evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After execution, you can find the results in the saved output directory, such as `outputs/default/20250628_151326`. An example layout follows:

```
20250628_151326/
├── configs                             # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs                                # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out          # Logs of the accuracy evaluation process based on inference results in the predictions/ folder
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out          # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json             # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json             # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv     # Final accuracy scores (in table format)
    ├── summary_20250628_151326.md      # Final accuracy scores (in Markdown format)
    └── summary_20250628_151326.txt     # Final accuracy scores (in text format)
```
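
If you want to post-process the scores programmatically, here is a minimal, illustrative sketch; it assumes the example run directory shown above and that `pandas` is installed:

```python
# Illustrative sketch: load the final accuracy summary produced by the run above.
# Adjust the timestamped directory name to match your own output folder.
import pandas as pd

summary = pd.read_csv("outputs/default/20250628_151326/summary/summary_20250628_151326.csv")
print(summary)
```
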
#### Execute Performance Evaluation

Run the following command to execute the performance evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, you can find the results in the saved output directory. An example layout follows:

```
20251031_070226/
|-- configs                                                          # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out                                     # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv                                         # Final performance results (in table format)
        |-- cevaldataset.json                                        # Final performance results (in JSON format)
        |-- cevaldataset_details.h5                                  # Detailed performance results
        |-- cevaldataset_details.json                                # Detailed performance results
        |-- cevaldataset_plot.html                                   # Final performance results (in HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html  # Final performance results (in HTML format)
```
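
To inspect these metrics programmatically, a minimal, illustrative sketch is shown below; it assumes the run is saved under `outputs/default/` like the accuracy example above:

```python
# Illustrative sketch: pretty-print the performance results JSON from the example run above.
# Adjust the timestamped directory name to match your own output folder.
import json

path = "outputs/default/20251031_070226/performances/vllm-api-general-chat/cevaldataset.json"
with open(path) as f:
    perf = json.load(f)
print(json.dumps(perf, indent=2)[:500])
```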

docs/source/tutorials/DeepSeek-V3.2-Exp.md

Lines changed: 100 additions & 3 deletions
@@ -31,7 +31,7 @@ Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0r

Refer to [installation](../installation.md#set-up-using-docker) to set up environment using Docker.

- If you want to deploy multi-node environment, you need to set up envrionment on each node.
+ If you want to deploy a multi-node environment, you need to set up the environment on each node.

## Deployment

@@ -277,8 +277,105 @@ curl http://<node0_ip>:<port>/v1/completions \

## Accuracy Evaluation

- TODO

### AISBench Accuracy Evaluation

Refer to [AISBench Installation](../developer_guide/evaluation/using_ais_bench.md#install-aisbench) for installation.
Refer to [Download Dataset](../developer_guide/evaluation/using_ais_bench.md#download-dataset) for the dataset.

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8",
        model="deepseek_v3.2",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = 4096,
        batch_size=128,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```
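
Before launching the evaluation, you may want to confirm that the `model` field above matches the name the endpoint actually serves. This is an illustrative addition (not part of the tutorial), assuming the server is reachable at `localhost:8000` as configured:

```python
# Illustrative check: the id reported by the endpoint must match the `model`
# field ("deepseek_v3.2") configured in vllm_api_general_chat.py.
import requests

served = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in served["data"]])
```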

Then, run the following command to execute the accuracy evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After execution, you will get a result like the following:

| dataset | version | metric | mode | vllm-api-general-chat |
| ----- | ----- | ----- | ----- | ----- |
| cevaldataset | - | accuracy | gen | 92.20 |

## Performance

- TODO

### AISBench Performance Evaluation

Refer to [AISBench Installation](../developer_guide/evaluation/using_ais_bench.md#install-aisbench) for installation.
Refer to [Download Dataset](../developer_guide/evaluation/using_ais_bench.md#download-dataset) for the dataset.

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8",
        model="deepseek_v3.2",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = 4096,
        batch_size=128,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```

Then, run the following command to execute the performance evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, you will get results like the following:

| Performance Parameters | Stage | Average | Min | Max | Median | P75 | P90 | P99 | N |
| - | - | - | - | - | - | - | - | - | - |
| E2EL | total | 293508.5923 ms | 15623.5345 ms | 888088.5333 ms | 266600.0363 ms | 302340.1144 ms | 459604.5972 ms | 589600.1589 ms | 1346 |
| InputTokens | total | 119.5996 | 73.0 | 355.0 | 108.0 | 136.0 | 171.0 | 250.65 | 1346 |
| OutputTokens | total | 325.9926 | 67.0 | 3623.0 | 242.0 | 343.0 | 533.0 | 1696.2 | 1346 |
| OutputTokenThroughput | total | 1.2036 token/s | 0.2206 token/s | 9.3449 token/s | 0.9022 token/s | 1.2678 token/s | 2.0254 token/s | 8.6098 token/s | 1346 |
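
As a rough, illustrative cross-check (it assumes `OutputTokenThroughput` is computed per request as output tokens divided by end-to-end latency, which is an assumption rather than a documented AISBench definition), the medians in the table are consistent with each other:

```python
# Illustrative cross-check using the median row of the table above.
median_output_tokens = 242.0                 # OutputTokens, median
median_e2el_seconds = 266600.0363 / 1000.0   # E2EL, median, converted from ms to s
print(median_output_tokens / median_e2el_seconds)  # ~0.91 token/s, close to the reported 0.9022 token/s median
```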
