
Commit a622aff

Merge pull request vllm-project#135 from vaibhavjainwiz/sync_main
Sync Release to Main for 2.13
2 parents 3c9b8f7 + 8cbe4b2 commit a622aff

681 files changed: +57459 −14774 lines


.buildkite/check-wheel-size.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 import os
 import zipfile
 
-MAX_SIZE_MB = 200
+MAX_SIZE_MB = 250
 
 
 def print_top_10_largest_files(zip_file):
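For orientation, here is a minimal sketch of the kind of gate this constant controls (hypothetical code, not the actual script, which also reports the largest files inside the wheel via `print_top_10_largest_files`): the wheel's size on disk is compared against `MAX_SIZE_MB`, so raising the limit from 200 to 250 simply loosens that comparison.

```python
# Hypothetical sketch of a wheel-size gate; not the actual check-wheel-size.py.
import os
import sys

MAX_SIZE_MB = 250  # the limit raised by this commit


def check_wheel_size(wheel_path: str) -> int:
    """Return a non-zero exit code if the wheel exceeds the size limit."""
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb > MAX_SIZE_MB:
        print(f"{wheel_path} is {size_mb:.1f} MB, over the {MAX_SIZE_MB} MB limit")
        return 1
    print(f"{wheel_path} is {size_mb:.1f} MB, within the {MAX_SIZE_MB} MB limit")
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1]))
```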

.buildkite/lm-eval-harness/configs/DeepSeek-V2-Lite-Chat.yaml

Lines changed: 1 addition & 0 deletions
@@ -9,3 +9,4 @@ tasks:
     value: 0.664
 limit: 1000
 num_fewshot: 5
+trust_remote_code: True

.buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
+model_name: "HandH1998/QQQ-Llama-3-8b-g128"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.419
+  - name: "exact_match,flexible-extract"
+    value: 0.416
+limit: 1000
+num_fewshot: 5

.buildkite/lm-eval-harness/configs/Minitron-4B-Base-FP8.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m mgoin/Minitron-4B-Base-FP8 -b auto -l 1000 -f 5 -t 1
+model_name: "mgoin/Minitron-4B-Base-FP8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.233
+  - name: "exact_match,flexible-extract"
+    value: 0.236
+limit: 1000
+num_fewshot: 5

.buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
+model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
+tasks:
+- name: "gsm8k"
+  metrics:
+  - name: "exact_match,strict-match"
+    value: 0.578
+  - name: "exact_match,flexible-extract"
+    value: 0.585
+limit: 1000
+num_fewshot: 5

.buildkite/lm-eval-harness/configs/models-small.txt

Lines changed: 3 additions & 0 deletions
@@ -4,4 +4,7 @@ Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
 Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
+Minitron-4B-Base-FP8.yaml
 Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
+Qwen2-1.5B-Instruct-FP8W8.yaml
+Meta-Llama-3-8B-QQQ.yaml

.buildkite/lm-eval-harness/test_lm_eval_correctness.py

Lines changed: 5 additions & 2 deletions
@@ -14,7 +14,7 @@
 import numpy
 import yaml
 
-RTOL = 0.02
+RTOL = 0.05
 TEST_DATA_FILE = os.environ.get(
     "LM_EVAL_TEST_DATA_FILE",
     ".buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-Instruct.yaml")
@@ -23,9 +23,12 @@
 
 
 def launch_lm_eval(eval_config):
+    trust_remote_code = eval_config.get('trust_remote_code', False)
+
     model_args = f"pretrained={eval_config['model_name']}," \
                  f"tensor_parallel_size={TP_SIZE}," \
-                 f"add_bos_token=true"
+                 f"add_bos_token=true," \
+                 f"trust_remote_code={trust_remote_code}"
 
     results = lm_eval.simple_evaluate(
         model="vllm",

.buildkite/nightly-benchmarks/README.md

Lines changed: 67 additions & 18 deletions
@@ -3,30 +3,52 @@
 
 ## Introduction
 
-This directory contains the performance benchmarking CI for vllm.
-The goal is to help developers know the impact of their PRs on the performance of vllm.
+This directory contains two sets of benchmark for vllm.
+- Performance benchmark: benchmark vllm's performance under various workload, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
+- Nightly benchmark: compare vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.
 
-This benchmark will be *triggered* upon:
-- A PR being merged into vllm.
-- Every commit for those PRs with `perf-benchmarks` label.
 
-**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for more GPUs is comming later), with different models.
+See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.
+
+
+## Performance benchmark quick overview
+
+**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!), with different models.
 
 **Benchmarking Duration**: about 1hr.
 
-**For benchmarking developers**: please try your best to constraint the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.
+**For benchmarking developers**: please try your best to constraint the duration of benchmarking to about 1 hr so that it won't take forever to run.
+
+
+## Nightly benchmark quick overview
+
+**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.
+
+**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.
+
+**Benchmarking Duration**: about 3.5hrs.
+
+
+
+## Trigger the benchmark
+
+Performance benchmark will be triggered when:
+- A PR being merged into vllm.
+- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.
+
+Nightly benchmark will be triggered when:
+- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.
+
 
 
-## Configuring the workload
 
-The benchmarking workload contains three parts:
-- Latency tests in `latency-tests.json`.
-- Throughput tests in `throughput-tests.json`.
-- Serving tests in `serving-tests.json`.
+## Performance benchmark details
 
-See [descriptions.md](tests/descriptions.md) for detailed descriptions.
 
-### Latency test
+See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
+
+
+#### Latency test
 
 Here is an example of one test inside `latency-tests.json`:
 
@@ -47,19 +69,19 @@ Here is an example of one test inside `latency-tests.json`:
 
 In this example:
 - The `test_name` attributes is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
-- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-benchmarks-suite.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
+- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
 
 Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.
 
 
-### Throughput test
+#### Throughput test
 The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
 
 The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.
 
-### Serving test
+#### Serving test
 We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
 
 ```
@@ -96,9 +118,36 @@ The number of this test is less stable compared to the delay and latency benchma
 
 WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.
 
-## Visualizing the results
+#### Visualizing the results
 The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
 You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
 If you do not see the table, please wait till the benchmark finish running.
 The json version of the table (together with the json version of the benchmark) will be also attached to the markdown file.
 The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
+
+
+
+## Nightly test details
+
+See [nightly-descriptions.md](nightly-descriptions.md) for the detailed description on test workload, models and docker containers of benchmarking other llm engines.
+
+
+#### Workflow
+
+- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
+- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
+- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
+- At last, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and update the results to buildkite.
+
+#### Nightly tests
+
+In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for benchmarking commands, together with the benchmarking test cases. The format is highly similar to performance benchmark.
+
+#### Docker containers
+
+The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.
+
+WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.
+
+WARNING: populating `trt-llm` to latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).
+
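
Returning to the latency-test configuration described in the README above: the underscore-to-dash conversion can be illustrated with a short, hypothetical sketch. The entry below is assembled from the example command-line arguments quoted in the README text; the test name and the Python conversion loop are assumptions for illustration, since the real conversion lives in `run-performance-benchmarks.sh`.

```python
# Hypothetical illustration of how a latency-tests.json entry maps to
# benchmark_latency.py flags (underscores in parameter names become dashes).
entry = {
    "test_name": "latency_llama8B_tp1",  # must start with "latency_"; name is illustrative
    "parameters": {
        "model": "meta-llama/Meta-Llama-3-8B",
        "tensor_parallel_size": 1,
        "load_format": "dummy",
        "num_iters_warmup": 5,
        "num_iters": 15,
    },
}

args = []
for key, value in entry["parameters"].items():
    args.append(f"--{key.replace('_', '-')}")
    args.append(str(value))

print("benchmark_latency.py " + " ".join(args))
# benchmark_latency.py --model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1
#   --load-format dummy --num-iters-warmup 5 --num-iters 15
```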

.buildkite/nightly-benchmarks/benchmark-pipeline.yaml

Lines changed: 17 additions & 17 deletions
@@ -21,7 +21,7 @@ steps:
       containers:
       - image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
         command:
-        - bash .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+        - bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
         resources:
           limits:
             nvidia.com/gpu: 8
@@ -42,20 +42,20 @@
       - name: devshm
         emptyDir:
           medium: Memory
-  - label: "H100"
-    agents:
-      queue: H100
-    plugins:
-    - docker#v5.11.0:
-        image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
-        command:
-        - bash
-        - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
-        mount-buildkite-agent: true
-        propagate-environment: true
-        ipc: host
-        gpus: all
-        environment:
-        - VLLM_USAGE_SOURCE
-        - HF_TOKEN
+  # - label: "H100"
+  #   agents:
+  #     queue: H100
+  #   plugins:
+  #   - docker#v5.11.0:
+  #       image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
+  #       command:
+  #       - bash
+  #       - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
+  #       mount-buildkite-agent: true
+  #       propagate-environment: true
+  #       ipc: host
+  #       gpus: all
+  #       environment:
+  #       - VLLM_USAGE_SOURCE
+  #       - HF_TOKEN

.buildkite/nightly-benchmarks/tests/descriptions.md renamed to .buildkite/nightly-benchmarks/performance-benchmarks-descriptions.md

Lines changed: 7 additions & 12 deletions
@@ -1,47 +1,42 @@
 
 ## Latency tests
 
-This test suite aims to test vllm's end-to-end latency under a controlled setup.
-
 - Input length: 32 tokens.
 - Output length: 128 tokens.
 - Batch size: fixed (8).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: end-to-end latency (mean, median, p99).
 
-### Latency benchmarking results
 
 {latency_tests_markdown_table}
 
-## Throughput tests
 
-This test suite aims to test vllm's throughput.
+## Throughput tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm to achieve maximum throughput.
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
 - Evaluation metrics: throughput.
 
-### Throughput benchmarking results
 
 {throughput_tests_markdown_table}
 
-## Serving tests
 
-This test suite aims to test vllm's real serving metrics.
+## Serving tests
 
 - Input length: randomly sample 200 prompts from ShareGPT dataset (with fixed random seed).
 - Output length: the corresponding output length of these 200 prompts.
 - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
 - **Average QPS (query per second)**: 1, 4, 16 and inf. QPS = inf means all requests come at once. For other QPS values, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
-- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
+- Models: llama-3.1 8B, llama-3 70B, mixtral 8x7B.
+- We also added a speculative decoding test for llama-3 70B, under QPS 2
 - Evaluation metrics: throughput, TTFT (time to the first token, with mean, median and p99), ITL (inter-token latency, with mean, median and p99).
 
-### Serving benchmarking results
 
 {serving_tests_markdown_table}
 
+
 ## json version of the benchmarking tables
 
 This section contains the data of the markdown tables above in JSON format.
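
As a side note on the serving-test description above: the Poisson arrival pattern for a given QPS can be sketched as exponential inter-arrival gaps with mean 1/QPS, accumulated into absolute request times. The snippet below is a hypothetical illustration, not code taken from `benchmark_serving.py`.

```python
# Hypothetical sketch of Poisson-process request arrival times for a given QPS.
import numpy as np


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)  # fixed seed -> reproducible arrival schedule
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)  # mean gap = 1/QPS seconds
    return np.cumsum(gaps)  # absolute arrival times in seconds


print(poisson_arrival_times(num_requests=5, qps=4.0))
```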
