Commit 166eb47

[Doc] Add AISBench accuracy and performance evaluation doc.

Signed-off-by: menogrey <1299267905@qq.com>

1 parent 30dbcfa commit 166eb47

3 files changed: +282 -3 lines changed

docs/source/developer_guide/evaluation/index.md

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@
:maxdepth: 1
using_evalscope
using_lm_eval
+ using_ais_bench
using_opencompass
accuracy_report/index
:::

docs/source/developer_guide/evaluation/using_ais_bench.md

Lines changed: 181 additions & 0 deletions

@@ -0,0 +1,181 @@
# Using AISBench

This document guides you through conducting accuracy and performance testing with [AISBench](https://gitee.com/aisbench/benchmark/tree/master). AISBench provides accuracy and performance evaluation for many datasets.

## Online Server

### 1. Start the vLLM server

You can run a Docker container to start the vLLM server on a single NPU:

```{code-block} bash
:substitutions:
# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci7
# Update the vllm-ascend image
export IMAGE=quay.io/ascend/vllm-ascend:|vllm_ascend_version|
docker run --rm \
--name vllm-ascend \
--shm-size=1g \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-p 8000:8000 \
-e VLLM_USE_MODELSCOPE=True \
-e PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256 \
-it $IMAGE \
/bin/bash

vllm serve Qwen/Qwen2.5-0.5B-Instruct --max_model_len 4096 &
```

The vLLM server has started successfully if you see logs like the following:

```
INFO: Started server process [9446]
INFO: Waiting for application startup.
INFO: Application startup complete.
```
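
As an optional sanity check (not part of the original guide), you can query the server's OpenAI-compatible endpoints before running any evaluation. The sketch below assumes the server started above is reachable at `localhost:8000` and that the `requests` package is available:

```python
# Illustrative sanity check (assumes the vLLM server above listens on localhost:8000).
import requests

# List the served models; the id should be "Qwen/Qwen2.5-0.5B-Instruct" for the command above.
models = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in models["data"]])

# Send a tiny completion request to confirm end-to-end inference works.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen2.5-0.5B-Instruct", "prompt": "Hello", "max_tokens": 8},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```
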
### 2. Run the C-Eval dataset using AISBench

#### Install AISBench

Refer to [AISBench](https://gitee.com/aisbench/benchmark/tree/master) for details.
Install AISBench from source:

```shell
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark/
pip3 install -e ./ --use-pep517
```

Install the extra AISBench dependencies:

```shell
pip3 install -r requirements/api.txt
pip3 install -r requirements/extra.txt
```

Run `ais_bench -h` to check the installation.

#### Download Dataset

Take the `C-Eval` dataset as an example. Refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets; each dataset has a `README.md` describing the detailed download and installation process.

Download the dataset and extract it to the expected path:

```shell
cd ais_bench/datasets
mkdir ceval/
mkdir ceval/formal_ceval
cd ceval/formal_ceval
wget https://www.modelscope.cn/datasets/opencompass/ceval-exam/resolve/master/ceval-exam.zip
unzip ceval-exam.zip
rm ceval-exam.zip
```
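
As a quick, illustrative check (not from the AISBench docs), you can confirm that the archive was extracted where AISBench expects it; the path below mirrors the shell steps above and assumes you run it from the AISBench repository root:

```python
# Illustrative check: list what was extracted from ceval-exam.zip.
# Run from the AISBench repository root; adjust the path if your layout differs.
from pathlib import Path

root = Path("ais_bench/datasets/ceval/formal_ceval")
for entry in sorted(root.iterdir()):
    print(entry.name)
```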

#### Update the Model Config Python File

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
There are several arguments that you should update according to your environment:

- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server IP and port.
- `max_out_len`: Note that `max_out_len` plus the LLM input length must stay below `max-model-len` (configured in your vLLM server). For example, with `--max_model_len 4096` and prompts of up to roughly 1,000 tokens, keep `max_out_len` at about 3,000 or less.
- `batch_size`: Update according to your dataset.
- `temperature`: Update the inference sampling argument as needed.

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="xxxx",
        model="xxxx",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = xxx,
        batch_size = xxx,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```

#### Execute Accuracy Evaluation

Run the following command to execute the accuracy evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After execution, you can find the results in the saved output directory, such as `outputs/default/20250628_151326`. An example layout follows:

```
20250628_151326/
├── configs                             # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
│   └── 20250628_151326_29317.py
├── logs                                # Execution logs; if --debug is added to the command, no intermediate logs are saved to disk (all are printed directly to the screen)
│   ├── eval
│   │   └── vllm-api-general-chat
│   │       └── demo_gsm8k.out          # Logs of the accuracy evaluation process based on inference results in the predictions/ folder
│   └── infer
│       └── vllm-api-general-chat
│           └── demo_gsm8k.out          # Logs of the inference process
├── predictions
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json             # Inference results (all outputs returned by the inference service)
├── results
│   └── vllm-api-general-chat
│       └── demo_gsm8k.json             # Raw scores calculated from the accuracy evaluation
└── summary
    ├── summary_20250628_151326.csv     # Final accuracy scores (in table format)
    ├── summary_20250628_151326.md      # Final accuracy scores (in Markdown format)
    └── summary_20250628_151326.txt     # Final accuracy scores (in text format)
```
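
If you want to post-process the scores programmatically, here is a minimal, illustrative sketch; it assumes the example run directory shown above and that `pandas` is installed:

```python
# Illustrative sketch: load the final accuracy summary produced by the run above.
# Adjust the timestamped directory name to match your own output folder.
import pandas as pd

summary = pd.read_csv("outputs/default/20250628_151326/summary/summary_20250628_151326.csv")
print(summary)
```
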
#### Execute Performance Evaluation

Run the following command to execute the performance evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, you can find the results in the saved output directory. An example layout follows:

```
20251031_070226/
|-- configs                                                          # Combined configuration file for model tasks, dataset tasks, and result presentation tasks
|   `-- 20251031_070226_122485.py
|-- logs
|   `-- performances
|       `-- vllm-api-general-chat
|           `-- cevaldataset.out                                     # Logs of the performance evaluation process
`-- performances
    `-- vllm-api-general-chat
        |-- cevaldataset.csv                                         # Final performance results (in table format)
        |-- cevaldataset.json                                        # Final performance results (in JSON format)
        |-- cevaldataset_details.h5                                  # Detailed performance results
        |-- cevaldataset_details.json                                # Detailed performance results
        |-- cevaldataset_plot.html                                   # Final performance results (in HTML format)
        `-- cevaldataset_rps_distribution_plot_with_actual_rps.html  # Final performance results (in HTML format)
```
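
To inspect these metrics programmatically, a minimal, illustrative sketch is shown below; it assumes the run is saved under `outputs/default/` like the accuracy example above:

```python
# Illustrative sketch: pretty-print the performance results JSON from the example run above.
# Adjust the timestamped directory name to match your own output folder.
import json

path = "outputs/default/20251031_070226/performances/vllm-api-general-chat/cevaldataset.json"
with open(path) as f:
    perf = json.load(f)
print(json.dumps(perf, indent=2)[:500])
```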

docs/source/tutorials/DeepSeek-V3.2-Exp.md

Lines changed: 100 additions & 3 deletions
@@ -31,7 +31,7 @@ Currently, we provide the all-in-one images `quay.io/ascend/vllm-ascend:v0.11.0r

Refer to [installation](../installation.md#set-up-using-docker) to set up environment using Docker.

- If you want to deploy multi-node environment, you need to set up envrionment on each node.
+ If you want to deploy a multi-node environment, you need to set up the environment on each node.

## Deployment

@@ -277,8 +277,105 @@ curl http://<node0_ip>:<port>/v1/completions \

## Accuracy Evaluation

- TODO

### AISBench Accuracy Evaluation

Refer to [AISBench Installation](../developer_guide/evaluation/using_ais_bench.md#install-aisbench) for installation.
Refer to [Download Dataset](../developer_guide/evaluation/using_ais_bench.md#download-dataset) for the dataset.

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8",
        model="deepseek_v3.2",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = 4096,
        batch_size=128,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```
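
Before launching the evaluation, you may want to confirm that the `model` field above matches the name the endpoint actually serves. This is an illustrative addition (not part of the tutorial), assuming the server is reachable at `localhost:8000` as configured:

```python
# Illustrative check: the id reported by the endpoint must match the `model`
# field ("deepseek_v3.2") configured in vllm_api_general_chat.py.
import requests

served = requests.get("http://localhost:8000/v1/models", timeout=10).json()
print([m["id"] for m in served["data"]])
```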

Then, run the following command to execute the accuracy evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```

After execution, you will get a result like the following:

| dataset | version | metric | mode | vllm-api-general-chat |
| ----- | ----- | ----- | ----- | ----- |
| cevaldataset | - | accuracy | gen | 92.20 |

## Performance

- TODO

### AISBench Performance Evaluation

Refer to [AISBench Installation](../developer_guide/evaluation/using_ais_bench.md#install-aisbench) for installation.
Refer to [Download Dataset](../developer_guide/evaluation/using_ais_bench.md#download-dataset) for the dataset.

Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`:

```python
from ais_bench.benchmark.models import VLLMCustomAPIChat
from ais_bench.benchmark.utils.model_postprocessors import extract_non_reasoning_content

models = [
    dict(
        attr="service",
        type=VLLMCustomAPIChat,
        abbr='vllm-api-general-chat',
        path="/root/.cache/modelscope/hub/models/vllm-ascend/DeepSeek-V3.2-Exp-W8A8",
        model="deepseek_v3.2",
        request_rate = 0,
        retry = 2,
        host_ip = "localhost",
        host_port = 8000,
        max_out_len = 4096,
        batch_size=128,
        trust_remote_code=False,
        generation_kwargs = dict(
            temperature = 0.6,
            top_k = 10,
            top_p = 0.95,
            seed = None,
            repetition_penalty = 1.03,
        ),
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    )
]
```

Then, run the following command to execute the performance evaluation:

```shell
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --summarizer default_perf --mode perf
```

After execution, you will get results like the following:

| Performance Parameters | Stage | Average | Min | Max | Median | P75 | P90 | P99 | N |
| - | - | - | - | - | - | - | - | - | - |
| E2EL | total | 293508.5923 ms | 15623.5345 ms | 888088.5333 ms | 266600.0363 ms | 302340.1144 ms | 459604.5972 ms | 589600.1589 ms | 1346 |
| InputTokens | total | 119.5996 | 73.0 | 355.0 | 108.0 | 136.0 | 171.0 | 250.65 | 1346 |
| OutputTokens | total | 325.9926 | 67.0 | 3623.0 | 242.0 | 343.0 | 533.0 | 1696.2 | 1346 |
| OutputTokenThroughput | total | 1.2036 token/s | 0.2206 token/s | 9.3449 token/s | 0.9022 token/s | 1.2678 token/s | 2.0254 token/s | 8.6098 token/s | 1346 |
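
As a rough, illustrative cross-check (it assumes `OutputTokenThroughput` is computed per request as output tokens divided by end-to-end latency, which is an assumption rather than a documented AISBench definition), the medians in the table are consistent with each other:

```python
# Illustrative cross-check using the median row of the table above.
median_output_tokens = 242.0                 # OutputTokens, median
median_e2el_seconds = 266600.0363 / 1000.0   # E2EL, median, converted from ms to s
print(median_output_tokens / median_e2el_seconds)  # ~0.91 token/s, close to the reported 0.9022 token/s median
```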
