`--max_model_len` should be greater than `35000`, which is suitable for most datasets; otherwise the accuracy evaluation may be affected.
:::
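
For example, a server launch that satisfies this constraint could look like the sketch below; the model path, served model name, and port are placeholders for your own values:

```shell
# Illustrative values only: replace the model path, served name, and port with your own.
vllm serve /path/to/your/model \
    --served-model-name my-model \
    --max-model-len 36864 \
    --port 8000
```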
The vLLM server has started successfully if you see logs like the following:
```
INFO: Waiting for application startup.
INFO: Application startup complete.
```
### 2. Run different datasets using AISBench
#### Install AISBench
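A typical from-source install follows the pattern below; the editable-install step is an assumption, so defer to the AISBench repository's README if your version differs:

```shell
# Hedged sketch of a from-source install; `pip install -e .` is assumed,
# and the AISBench README is authoritative.
git clone https://gitee.com/aisbench/benchmark.git
cd benchmark
pip install -e .
```
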
Run `ais_bench -h` to check the installation.
#### Download Datasets
You can choose one or more datasets for accuracy evaluation.
1. `C-Eval` dataset.
Take the `C-Eval` dataset as an example; refer to [Datasets](https://gitee.com/aisbench/benchmark/tree/master/ais_bench/benchmark/configs/datasets) for more datasets. Every dataset has a `README.md` with the detailed download and installation process.
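
As an illustration, `C-Eval` can be fetched from its Hugging Face mirror as sketched below; the download URL and target directory are assumptions, so follow the dataset's `README.md` for the exact layout AISBench expects:

```shell
# Assumed URL and target directory; the dataset's README documents the
# authoritative download and placement steps.
wget https://huggingface.co/datasets/ceval/ceval-exam/resolve/main/ceval-exam.zip
mkdir -p data/ceval
unzip ceval-exam.zip -d data/ceval
```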
Update the file `benchmark/ais_bench/benchmark/configs/models/vllm_api/vllm_api_general_chat.py`.
There are several arguments that you should update according to your environment.
- `path`: Update to your model weight path.
- `model`: Update to your model name in vLLM.
- `host_ip` and `host_port`: Update to your vLLM server IP and port.
- `max_out_len`: Note that `max_out_len` plus the LLM input length should be less than `max-model-len` (configured in your vLLM server); `32768` is suitable for most datasets.
- `batch_size`: Update according to your dataset.
- `temperature`: Update this inference argument as needed.
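Putting these together, the `models` entry in `vllm_api_general_chat.py` might look like the sketch below. Only the listed field names are taken from this guide; the surrounding structure, including the `generation_kwargs` wrapper for `temperature`, is an assumption, so keep the rest of the shipped template intact:

```python
# Hedged sketch of the models entry in vllm_api_general_chat.py.
# Only the fields discussed above are grounded; the wrapper structure is assumed.
models = [
    dict(
        path="/path/to/your/model/weights",       # model weight path
        model="my-model",                         # model name served by vLLM
        host_ip="127.0.0.1",                      # vLLM server IP
        host_port=8000,                           # vLLM server port
        max_out_len=32768,                        # max_out_len + input length < max-model-len
        batch_size=32,                            # tune per dataset
        generation_kwargs=dict(temperature=0.6),  # inference arguments
    ),
]
```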
#### Execute Accuracy Evaluation
Run the following commands to execute the accuracy evaluation for each dataset.
```shell
# run C-Eval dataset
ais_bench --models vllm_api_general_chat --datasets ceval_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run MMLU dataset
ais_bench --models vllm_api_general_chat --datasets mmlu_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run GPQA dataset
ais_bench --models vllm_api_general_chat --datasets gpqa_gen_0_shot_str.py --mode all --dump-eval-details --merge-ds
# run MATH-500 dataset
ais_bench --models vllm_api_general_chat --datasets math500_gen_0_shot_cot_chat_prompt.py --mode all --dump-eval-details --merge-ds
# run LiveCodeBench dataset
ais_bench --models vllm_api_general_chat --datasets livecodebench_code_generate_lite_gen_0_shot_chat.py --mode all --dump-eval-details --merge-ds
# run AIME 2024 dataset
ais_bench --models vllm_api_general_chat --datasets aime2024_gen_0_shot_chat_prompt.py --mode all --dump-eval-details --merge-ds
```
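
If you prefer a single combined run, OpenCompass-style CLIs usually accept several dataset configs in one `--datasets` list, and the `--merge-ds` flag suggests AISBench merges their results; this usage is an assumption, so confirm it with `ais_bench -h` first:

```shell
# Assumed usage: several dataset configs in one invocation; verify with `ais_bench -h`.
ais_bench --models vllm_api_general_chat \
    --datasets ceval_gen_0_shot_cot_chat_prompt.py mmlu_gen_0_shot_cot_chat_prompt.py \
    --mode all --dump-eval-details --merge-ds
```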
After each dataset execution, you can get the results from the saved output directory, such as `outputs/default/20250628_151326`. An example follows:
```
20250628_151326/
```