Commit b3aba04

[Benchmark] Convenience script for multiple parameter combinations (#27085)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
1 parent 8a29711 commit b3aba04

3 files changed, +1312 -3 lines changed

docs/contributing/benchmarks.md

Lines changed: 145 additions & 3 deletions
@@ -6,7 +6,8 @@ toc_depth: 4

vLLM provides comprehensive benchmarking tools for performance testing and evaluation:

- **[Benchmark CLI](#benchmark-cli)**: `vllm bench` CLI tools and specialized benchmark scripts for interactive performance testing
- **[Batch Scripts](#batch-scripts)**: Run `vllm bench` against multiple configurations conveniently
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development
- **[Nightly benchmarks](#nightly-benchmarks)**: Comparative benchmarks against alternatives

@@ -29,7 +30,7 @@ th {
| Dataset | Online | Offline | Data Path |
|---------|--------|---------|-----------|
| ShareGPT ||| `wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json` |
| ShareGPT4V (Image) ||| `wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip` |
| ShareGPT4Video (Video) ||| `git clone https://huggingface.co/datasets/ShareGPT4Video/ShareGPT4Video` |
| BurstGPT ||| `wget https://github.com/HPMLL/BurstGPT/releases/download/v1.1/BurstGPT_without_fails_2.csv` |
| Sonnet (deprecated) ||| Local file: `benchmarks/sonnet.txt` |
@@ -714,7 +715,7 @@ Generate synthetic image inputs alongside random text prompts to stress-test vis

Notes:

- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.

Start the server (example):

@@ -924,6 +925,147 @@ throughput numbers correctly is also adjusted.

</details>

## Batch Scripts

### Batch Serving Script

[`vllm/benchmarks/serve_multi.py`](../../vllm/benchmarks/serve_multi.py) automatically starts `vllm serve` and runs `vllm bench serve` over multiple configurations.

#### Batch Mode

The basic purpose of this script is to evaluate vLLM under different settings. Follow these steps to run the script:

1. Construct the base command for `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command for `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.

    - Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:

        ```json
        [
            {
                "max_num_seqs": 32,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 4096
            },
            {
                "max_num_seqs": 256,
                "max_num_batched_tokens": 4096
            }
        ]
        ```

4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.

    - Example: Using different input/output lengths for the random dataset:

        ```json
        [
            {
                "random_input_len": 128,
                "random_output_len": 32
            },
            {
                "random_input_len": 256,
                "random_output_len": 64
            },
            {
                "random_input_len": 512,
                "random_output_len": 128
            }
        ]
        ```

5. Determine where you want to save the results, and pass that to `--output-dir`.

Example command:

```bash
python vllm/benchmarks/serve_multi.py \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    -o benchmarks/results
```

!!! important
    If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them (see the sketch below).
    You can use `--dry-run` to preview the commands to be run.

    We only start the server once for each `--serve-params` combination, and keep it running across multiple `--bench-params` runs.
    Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to give the next run a clean slate.
    If you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.
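
As a rough illustration of the sweep described above, the following Python sketch enumerates the Cartesian product of the two parameter files and shows how one server configuration can serve several benchmark configurations. It is not taken from `serve_multi.py` itself; the file names and flag formatting are assumptions for illustration only.

```python
# Hypothetical sketch of the batch sweep; serve_multi.py's real logic may differ.
import json

with open("benchmarks/serve_hparams.json") as f:
    serve_params = json.load(f)  # e.g. [{"max_num_seqs": 32, ...}, ...]
with open("benchmarks/bench_hparams.json") as f:
    bench_params = json.load(f)  # e.g. [{"random_input_len": 128, ...}, ...]


def to_flags(cfg: dict) -> str:
    """Render {"max_num_seqs": 32} as "--max-num-seqs 32"."""
    return " ".join(f"--{k.replace('_', '-')} {v}" for k, v in cfg.items())


# One server is launched per serve_params entry; every bench_params entry is
# then benchmarked against that same running server (Cartesian product).
for serve_cfg in serve_params:
    print(f"vllm serve <model> {to_flags(serve_cfg)}")
    for bench_cfg in bench_params:
        print(f"    vllm bench serve <base args> {to_flags(bench_cfg)}")
```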

!!! note
    By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.

!!! tip
    You can use the `--resume` option to continue the parameter sweep if one of the runs failed.

#### SLA Mode

By passing SLA constraints via `--sla-params`, you can run this script in SLA mode, causing it to adjust either the request rate or concurrency (chosen via `--sla-variable`) in order to satisfy the SLA constraints.

For example, to ensure that E2E latency stays within different target values for 99% of requests:

```json
[
    {
        "p99_e2el_ms": "<=200"
    },
    {
        "p99_e2el_ms": "<=500"
    },
    {
        "p99_e2el_ms": "<=1000"
    },
    {
        "p99_e2el_ms": "<=2000"
    }
]
```
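
Each entry pairs a metric name with a comparison operator and a target value. As a rough sketch of how such a constraint string could be interpreted (a hypothetical helper for illustration, not necessarily how `serve_multi.py` parses it):

```python
# Hypothetical parser for SLA constraint strings such as "<=200".
import operator
import re

_OPS = {"<=": operator.le, "<": operator.lt, ">=": operator.ge, ">": operator.gt}


def satisfies(metric_value: float, constraint: str) -> bool:
    """Return True if e.g. metric_value=180.0 satisfies constraint="<=200"."""
    match = re.fullmatch(r"(<=|>=|<|>)\s*([0-9.]+)", constraint.strip())
    if match is None:
        raise ValueError(f"Unrecognized SLA constraint: {constraint!r}")
    op, target = _OPS[match.group(1)], float(match.group(2))
    return op(metric_value, target)
```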

Example command:

```bash
python vllm/benchmarks/serve_multi.py \
    --serve-cmd 'vllm serve meta-llama/Llama-2-7b-chat-hf' \
    --bench-cmd 'vllm bench serve --model meta-llama/Llama-2-7b-chat-hf --backend vllm --endpoint /v1/completions --dataset-name sharegpt --dataset-path benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json' \
    --serve-params benchmarks/serve_hparams.json \
    --bench-params benchmarks/bench_hparams.json \
    --sla-params benchmarks/sla_hparams.json \
    --sla-variable max_concurrency \
    -o benchmarks/results
```

The algorithm for adjusting the SLA variable is as follows (see the sketch after this list):

1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
    - For example, the initial request rate is set to the concurrency measured under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
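
The following is a simplified sketch of that search strategy. It is illustrative only; details such as metric plumbing, rounding, and stopping criteria in `serve_multi.py` may differ, and the helper names are hypothetical.

```python
# Illustrative sketch of the SLA search: exponential growth to bracket the
# crossover point, then binary search within that window.
from typing import Callable


def find_max_sla_value(
    run_benchmark: Callable[[int], dict],  # runs a benchmark at a given value
    meets_sla: Callable[[dict], bool],     # checks the metrics against the SLA
    initial: int,                          # derived from the infinite-QPS run
) -> int:
    lo = hi = initial
    # Phase 1: keep doubling while the SLA still holds; afterwards `lo` passes
    # and `hi` fails, so the answer lies in the window [lo, hi).
    while meets_sla(run_benchmark(hi)):
        lo, hi = hi, hi * 2
    # Phase 2: binary search the window for the largest value that passes.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if meets_sla(run_benchmark(mid)):
            lo = mid
        else:
            hi = mid
    return lo  # assumes `initial` itself satisfies the SLA
```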

!!! important
    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.

    For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.

## Performance Benchmarks

The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.
