| ShareGPT4V (Image) | ✅ | ✅ |`wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json`<br>Note that the images need to be downloaded separately. For example, to download COCO's 2017 Train images:<br>`wget http://images.cocodataset.org/zips/train2017.zip`|
Generate synthetic image inputs alongside random text prompts to stress-test vision models.

Notes:

- Works only with the online benchmark via the OpenAI backend (`--backend openai-chat`) and the `/v1/chat/completions` endpoint.
- Video sampling is not yet implemented.
Start the server (example):
</details>

## Batch Scripts

### Batch Serving Script

[`vllm/benchmarks/serve_multi.py`](../../vllm/benchmarks/serve_multi.py) automatically starts `vllm serve` and runs `vllm bench serve` over multiple configurations.
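Putting the pieces together, an invocation might be assembled as follows. This is only a sketch: the model name, JSON file names, and output directory are placeholders, and invoking the script via `python` is an assumption — adapt it to your setup.

```python
import shlex

# Hypothetical serve_multi.py invocation assembled as an argv list.
# Model name and file paths below are placeholders, not defaults.
cmd = [
    "python", "vllm/benchmarks/serve_multi.py",
    "--serve-cmd", "vllm serve meta-llama/Llama-3.1-8B-Instruct",
    "--bench-cmd", "vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct --dataset-name random",
    "--serve-params", "serve_params.json",
    "--bench-params", "bench_params.json",
    "--output-dir", "./results",
    "--dry-run",
]

# Print the shell-quoted command for inspection; to actually run it,
# use subprocess.run(cmd, check=True) instead.
print(shlex.join(cmd))
```

Keeping `--dry-run` at first lets you verify the generated commands before committing to a long sweep.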
#### Batch Mode

The basic purpose of this script is to evaluate vLLM under different settings. Follow these steps to run the script:

1. Construct the base command for `vllm serve`, and pass it to the `--serve-cmd` option.
2. Construct the base command for `vllm bench serve`, and pass it to the `--bench-cmd` option.
3. (Optional) If you would like to vary the settings of `vllm serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--serve-params`.

    - Example: Tuning `--max-num-seqs` and `--max-num-batched-tokens`:

        ```json
        [
            {
                "max_num_seqs": 32,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 1024
            },
            {
                "max_num_seqs": 64,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 2048
            },
            {
                "max_num_seqs": 128,
                "max_num_batched_tokens": 4096
            },
            {
                "max_num_seqs": 256,
                "max_num_batched_tokens": 4096
            }
        ]
        ```

4. (Optional) If you would like to vary the settings of `vllm bench serve`, create a new JSON file and populate it with the parameter combinations you want to test. Pass the file path to `--bench-params`.

    - Example: Using different input/output lengths for the random dataset:

        ```json
        [
            {
                "random_input_len": 128,
                "random_output_len": 32
            },
            {
                "random_input_len": 256,
                "random_output_len": 64
            },
            {
                "random_input_len": 512,
                "random_output_len": 128
            }
        ]
        ```

5. Determine where you want to save the results, and pass that to `--output-dir`.

If both `--serve-params` and `--bench-params` are passed, the script will iterate over the Cartesian product between them.
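As a rough illustration of that Cartesian product, here is a minimal Python sketch of the pairing logic; the parameter values mirror the example files above (trimmed to two entries each):

```python
import itertools

# Example parameter combinations, mirroring the JSON files above.
serve_params = [
    {"max_num_seqs": 32, "max_num_batched_tokens": 1024},
    {"max_num_seqs": 64, "max_num_batched_tokens": 2048},
]
bench_params = [
    {"random_input_len": 128, "random_output_len": 32},
    {"random_input_len": 256, "random_output_len": 64},
]

# Every serve configuration is paired with every bench configuration,
# so 2 serve combos x 2 bench combos = 4 benchmark runs.
combos = [{**s, **b} for s, b in itertools.product(serve_params, bench_params)]
print(len(combos))  # 4
```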
You can use `--dry-run` to preview the commands to be run.

We only start the server once for each `--serve-params` combination, and keep it running across multiple `--bench-params` runs.
Between each benchmark run, we call the `/reset_prefix_cache` and `/reset_mm_cache` endpoints to get a clean slate for the next run.
In case you are using a custom `--serve-cmd`, you can override the commands used for resetting the state by setting `--after-bench-cmd`.

!!! note
    By default, each parameter combination is run 3 times to make the results more reliable. You can adjust the number of runs by setting `--num-runs`.

!!! tip
    You can use the `--resume` option to continue the parameter sweep if one of the runs failed.

#### SLA Mode

By passing SLA constraints via `--sla-params`, you can run this script in SLA mode, causing it to adjust either the request rate or the concurrency (chosen via `--sla-variable`) in order to satisfy the SLA constraints.

For example, you can ensure that the E2E latency stays within different target values for 99% of requests.
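A sketch of what such an `--sla-params` file might contain; the key name `p99_e2e_latency_ms` here is a hypothetical placeholder, not the script's actual schema:

```json
[
    { "p99_e2e_latency_ms": 500 },
    { "p99_e2e_latency_ms": 1000 },
    { "p99_e2e_latency_ms": 2000 }
]
```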
The algorithm for adjusting the SLA variable is as follows:

1. Run the benchmark with infinite QPS, and use the corresponding metrics to determine the initial value of the variable.
    - For example, the initial request rate is set to the concurrency under infinite QPS.
2. If the SLA is still satisfied, keep doubling the value until the SLA is no longer satisfied. This gives a relatively narrow window that contains the point where the SLA is barely satisfied.
3. Apply binary search over the window to find the maximum value that still satisfies the SLA.
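The doubling-plus-binary-search procedure above can be sketched as follows. Here `meets_sla` stands in for an actual benchmark run, so this is only an illustration of the search logic under the stated assumptions (monotonicity, integer values), not the script's implementation:

```python
def max_value_meeting_sla(meets_sla, initial, tol=1):
    """Find the largest integer value (e.g. request rate or concurrency)
    for which meets_sla(value) is True, assuming meets_sla is monotonically
    decreasing in value and eventually becomes False."""
    if not meets_sla(initial):
        return None  # SLA unattainable even at the initial value
    lo, hi = initial, initial * 2
    # Step 2: keep doubling until the SLA breaks, giving a window (lo, hi).
    while meets_sla(hi):
        lo, hi = hi, hi * 2
    # Step 3: binary search inside the window for the boundary value.
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if meets_sla(mid):
            lo = mid
        else:
            hi = mid
    return lo

# Toy check: pretend the SLA holds up to a concurrency of 100.
print(max_value_meeting_sla(lambda v: v <= 100, initial=8))  # 100
```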
!!! important
    SLA tuning is applied over each combination of `--serve-params`, `--bench-params`, and `--sla-params`.

    For a given combination of `--serve-params` and `--bench-params`, we share the benchmark results across `--sla-params` to avoid rerunning benchmarks with the same SLA variable value.
## Performance Benchmarks
The performance benchmarks are used for development to confirm whether new changes improve performance under various workloads. They are triggered on every commit with both the `perf-benchmarks` and `ready` labels, and when a PR is merged into vLLM.