[CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang #7412

Merged
merged 182 commits into from
Oct 4, 2024
Changes from 176 commits
91b49b3
raise trt-llm to version 24.07
KuntaiDu Aug 12, 2024
e61fa29
adjust the way of processing protobuf files
KuntaiDu Aug 12, 2024
00aaefa
bump up trt-llm to r24.07
KuntaiDu Aug 12, 2024
7d4b1f0
avoid pip upgrade transformers -- no longer needed
KuntaiDu Aug 12, 2024
c67deaa
change tokenizer_dir
KuntaiDu Aug 12, 2024
f7901f1
fall back to using protobuf files from tensorrt-demo
KuntaiDu Aug 13, 2024
2e9c063
replace python to python3
KuntaiDu Aug 13, 2024
4e15409
include sglang
KuntaiDu Aug 13, 2024
5f880ae
add sglang to the gateway script
KuntaiDu Aug 13, 2024
b74d95a
add sglang benchmarking script
KuntaiDu Aug 13, 2024
fc2f850
add vllm host ip
KuntaiDu Aug 13, 2024
0a3bfae
only enable sglang for testing
KuntaiDu Aug 13, 2024
89c7fe8
add sglang server parameters
KuntaiDu Aug 13, 2024
cbbcbd0
use Llama 3.1 instead of llama3
KuntaiDu Aug 13, 2024
0648d9e
bring back vllm for testing
KuntaiDu Aug 13, 2024
7f16b64
add sglang into backend request func
KuntaiDu Aug 13, 2024
5155b77
adjust trt-llm launch script
KuntaiDu Aug 14, 2024
85d39f9
enable trt-llm and sglang
KuntaiDu Aug 14, 2024
4516373
bug fix: cd into triton_model_repo before calling fill_template.py
KuntaiDu Aug 14, 2024
dbf6607
upload zipped results to artifact -- for easy debugging
KuntaiDu Aug 14, 2024
a8cac72
upload
KuntaiDu Aug 14, 2024
e461c64
use Llama 3 8B instead --- trt-llm crashes with Llama 3 8.1B
KuntaiDu Aug 14, 2024
c935bda
update the documentation
KuntaiDu Aug 14, 2024
a7e12e7
update trt --- no need to update transformers
KuntaiDu Aug 14, 2024
9605cf3
change to llama 2 7b --- I don't have access to llama3 8B for dev
KuntaiDu Aug 14, 2024
fcc3f52
adjust trt-llm backend version -- should be v0.11.0
KuntaiDu Aug 14, 2024
760d70f
replace tokenizer_dir to the directory that has been downloaded (/tok…
KuntaiDu Aug 14, 2024
8230734
no it should be
KuntaiDu Aug 14, 2024
1ba468b
update llama model
KuntaiDu Aug 14, 2024
c7bfc22
update model_path
KuntaiDu Aug 14, 2024
4c1500c
add model path
KuntaiDu Aug 14, 2024
711b65f
log error
KuntaiDu Aug 14, 2024
483e1b1
add logging when launching triton server
KuntaiDu Aug 14, 2024
1c5f677
add debugging symbol
KuntaiDu Aug 14, 2024
a65763b
engine_path needs to be the path of compiled engine...
KuntaiDu Aug 15, 2024
aa278eb
adjust the way of killing vllm instance
KuntaiDu Aug 15, 2024
5da3db1
add more QPS
KuntaiDu Aug 15, 2024
45e94a2
disable radix cache and enable torch compile for SGLang
KuntaiDu Aug 15, 2024
a4cb503
move nightly benchmark script into scripts folder
KuntaiDu Aug 17, 2024
03f7830
centralize run scripts alltogether
KuntaiDu Aug 18, 2024
84fd15e
merge all running scripts into one.
KuntaiDu Aug 18, 2024
797c4d6
merge server launching script into one place
KuntaiDu Aug 18, 2024
6fd153a
adjust the testing cases
KuntaiDu Aug 18, 2024
61a45b5
bug fix on finding nightly-tests.json
KuntaiDu Aug 19, 2024
eb592aa
wait for server before running benchmark_serving.py
KuntaiDu Aug 19, 2024
c7aafa0
adjust sonnet parameters
KuntaiDu Aug 19, 2024
23b886a
add get_chat_template attribute
KuntaiDu Aug 19, 2024
ac7ecc5
check the existence of chat template via apply_chat_template
KuntaiDu Aug 19, 2024
aefae68
fall back to default way -- it is correct, I guess I need to use inst…
KuntaiDu Aug 19, 2024
ac95463
use instruct model
KuntaiDu Aug 19, 2024
2337b39
trt does not work with llama 3.1 in r24.07
KuntaiDu Aug 19, 2024
fb834ff
add full pipeline for testing
KuntaiDu Aug 19, 2024
2f8db9d
add upload_to_buildkite utility
KuntaiDu Aug 20, 2024
7435538
add long decode workload
KuntaiDu Aug 20, 2024
6fd1ac5
update the engine name of trt to tensorrt-llm, to match with benchmar…
KuntaiDu Aug 20, 2024
2ff5429
bug fix: annotate the engine correctly in the benchmarking result
KuntaiDu Aug 20, 2024
a53fbc7
update test suite
KuntaiDu Aug 20, 2024
02f22f9
update plotting script
KuntaiDu Aug 20, 2024
b17e76f
update how to annotate the results
KuntaiDu Aug 20, 2024
594f35b
update nightly descriptions doc correspondingly
KuntaiDu Aug 20, 2024
b8a3f76
update plotting script
KuntaiDu Aug 20, 2024
a64aeab
rename to tensorrt-llm
KuntaiDu Aug 20, 2024
12b1ec4
update transformers library
KuntaiDu Aug 20, 2024
f513995
Merge branch 'vllm-project:main' into kuntai-update-nightlybench
KuntaiDu Aug 20, 2024
c52c45e
Merge branch 'kuntai-update-nightlybench' of https://github.com/Kunta…
KuntaiDu Aug 20, 2024
582b5b2
add support for ignore_eos flag, for benchmarking
KuntaiDu Aug 20, 2024
7802c75
annotate that tgi and deepspeed_mii is not supported ignore_eos
KuntaiDu Aug 21, 2024
8e3e269
set ignore_eos flag for benchmark
KuntaiDu Aug 21, 2024
8409687
support total_input_tokens and total_output_tokens
KuntaiDu Aug 21, 2024
e138cca
generate markdown file in a separate file
KuntaiDu Aug 21, 2024
3951a96
need to fallback to llama 3.0
KuntaiDu Aug 21, 2024
c23ccc6
no i need to update docker version instead
KuntaiDu Aug 21, 2024
058e1aa
bug fix: no need to specify type when store_true
KuntaiDu Aug 21, 2024
e320941
update docker container
KuntaiDu Aug 21, 2024
1b4946c
adjust the nameing of tensorrt
KuntaiDu Aug 21, 2024
40ffa0a
set QPS
KuntaiDu Aug 21, 2024
9efae6c
remove tgi: there is no way to constraint its output length
KuntaiDu Aug 21, 2024
d2072af
adjust the name in benchmarking script
KuntaiDu Aug 21, 2024
a1596ed
switch to 8B, change lmdeploy version name
KuntaiDu Aug 21, 2024
4c7d73c
update to latest aws docker -- for multi-step scheduling
KuntaiDu Aug 21, 2024
f089fae
comment out sglang
KuntaiDu Aug 21, 2024
35c2025
move to Llama 3.1 for local testing
KuntaiDu Aug 21, 2024
8f411ec
make sure server args exist
KuntaiDu Aug 21, 2024
3387919
raise max_model_len
KuntaiDu Aug 21, 2024
8700543
export VLLM host ip
KuntaiDu Aug 21, 2024
0b531f0
enable sglang
KuntaiDu Aug 21, 2024
9a6a18a
disable torch compile --- it raises bug for 8b instruct model
KuntaiDu Aug 21, 2024
26ad283
fix json syntax bug
KuntaiDu Aug 21, 2024
16ce24a
bug fix
KuntaiDu Aug 21, 2024
b5f90fd
benchmark vllm again with num-schedule-step=1
KuntaiDu Aug 22, 2024
3c40c2f
allow downloading results and scripts from buildkite annotation. Down…
KuntaiDu Aug 23, 2024
ca6a9fd
add vllm version-specific benchmarking
KuntaiDu Aug 23, 2024
c3696b0
distinguish between different versions of vllm
KuntaiDu Aug 23, 2024
1c95549
use different set of parameters between different versions of vllm
KuntaiDu Aug 23, 2024
0ee0688
fix vllm import issue
KuntaiDu Aug 24, 2024
4befb1b
launch server from script
KuntaiDu Aug 24, 2024
53b13a2
redirect backend to vllm if the engine is vllm055 or vllm 053post1
KuntaiDu Aug 24, 2024
1410ce3
refer to instead of when benchmark_serving
KuntaiDu Aug 24, 2024
c75dbcd
udpate test cases
KuntaiDu Aug 24, 2024
427e013
use tpot instead of ITL --- ITL is wrongfully too large for multi-step
KuntaiDu Aug 24, 2024
9228035
update plotting script and benchmarking results
KuntaiDu Aug 24, 2024
039f391
update sharegpt image
KuntaiDu Aug 24, 2024
a0a944d
also put raw benchmark results inside, in case people wants to reproduce
KuntaiDu Aug 24, 2024
6fb655a
remove the results --- without causing footprint when merging into main
KuntaiDu Aug 24, 2024
ca36b0e
sanity check on the full set of benchmark
KuntaiDu Aug 24, 2024
415cc0f
update--remove vllm 0.5.3.post1
KuntaiDu Aug 24, 2024
0a8d641
vary different values of scheduler steps
KuntaiDu Aug 24, 2024
f72eeca
adjust parameters --- name vllm as vllm 0.5.5, so that we can vary it…
KuntaiDu Aug 24, 2024
6e2a9d0
change launch_server
KuntaiDu Aug 24, 2024
f364a54
log NUM_SCHEDULER_STEPS
KuntaiDu Aug 24, 2024
c4a6dfd
adjust the way of injecting env var
KuntaiDu Aug 24, 2024
9a2acda
fix typo: should be NUM_SCHEDULER_STEPS instead of NUM_SCHEDULER_STEP
KuntaiDu Aug 24, 2024
208a111
add step 2 and 3
KuntaiDu Aug 24, 2024
99153de
temporarily cache the results
KuntaiDu Aug 25, 2024
52eabfe
cache plotting script
KuntaiDu Aug 25, 2024
8ad3184
update nightly pipeline to benchmark async output processing
KuntaiDu Aug 26, 2024
093f410
add nightly benchmark results for sharing
KuntaiDu Aug 26, 2024
76ce5b7
test async output processing
KuntaiDu Aug 28, 2024
2dfecb9
update the docker image and try again
KuntaiDu Aug 29, 2024
6c1f754
fix zmq backend issue
KuntaiDu Aug 29, 2024
9a8e8fa
use the image that is post-merge so that we have both zmq and async o…
KuntaiDu Aug 29, 2024
f9cd4bb
check step=1 performance
KuntaiDu Aug 29, 2024
e3ba754
update plotting script
KuntaiDu Aug 29, 2024
decf67b
test if successful or no
KuntaiDu Aug 29, 2024
257087f
support latency
KuntaiDu Aug 29, 2024
9882b17
initial test run
KuntaiDu Aug 29, 2024
7d0e3c6
add vllm -- test if it is OK to raise to 0.95
KuntaiDu Aug 29, 2024
8bf3308
sglang does not support torch.compile on Llama 3
KuntaiDu Aug 29, 2024
ceaaff7
add latency key
KuntaiDu Aug 29, 2024
945a09b
remove max-model-len --- llama 3b is short context model
KuntaiDu Aug 29, 2024
2f4a1cb
comment out sglang and lmdeploy
KuntaiDu Aug 29, 2024
bf61370
large-scale benchmark start
KuntaiDu Aug 29, 2024
7e7435c
remove results
KuntaiDu Aug 29, 2024
c040060
bugfix: should be instead of
KuntaiDu Aug 29, 2024
4a815f9
update benchmark
KuntaiDu Aug 29, 2024
49190d7
skip mixtral
KuntaiDu Aug 29, 2024
b33055b
avoid injecting scheduler steps via envvar
KuntaiDu Aug 29, 2024
811fdbf
small-scale test on vllm
KuntaiDu Aug 29, 2024
57b4b7c
comment out other engines
KuntaiDu Aug 29, 2024
1d2f0e2
do not reuse server
KuntaiDu Aug 30, 2024
3970bfe
switch to latest docker
KuntaiDu Aug 30, 2024
ec61fb6
bring in the full test suite
KuntaiDu Aug 30, 2024
93156a0
bring in the docker of all benchmarking engines
KuntaiDu Aug 30, 2024
408fab0
update plotting script
KuntaiDu Aug 30, 2024
e43b66c
need to separate TRT benchmark to two steps.... Test variable
KuntaiDu Aug 30, 2024
575d89a
Add to separate one trt-llm runs to two steps, so that the 1:30hr lo…
KuntaiDu Aug 30, 2024
661633f
also update the comparison script
KuntaiDu Aug 30, 2024
4276c99
adjust nightly test --- i guess the # of output tokens cannot be long…
KuntaiDu Aug 31, 2024
e6c94e5
cut down the test scale so that it fits within 1:30 minutes
KuntaiDu Sep 2, 2024
42650a1
update test cases
KuntaiDu Sep 2, 2024
fbd27dc
use Alex's PR to rerun the benchmark
KuntaiDu Sep 2, 2024
e2373e8
adjust the test case: maximum length when generating output should no…
KuntaiDu Sep 2, 2024
4ffa6f9
make sure that vllm benchmark runs first
KuntaiDu Sep 2, 2024
501fea6
update to include more benchmarking metrics
KuntaiDu Sep 3, 2024
555db07
test Alex and Rober'ts PR
KuntaiDu Sep 3, 2024
79102e7
significantly reduce the test case --- please don't crash when killin…
KuntaiDu Sep 3, 2024
094339c
update plotting script
KuntaiDu Sep 4, 2024
5d054f2
bring back the full benchmarking suite
KuntaiDu Sep 4, 2024
7f74875
bump up cuda version to 12.4, also update sglang version
KuntaiDu Sep 4, 2024
e7e6c57
udpate plotting script
KuntaiDu Sep 4, 2024
ba1c9ee
add ignore-eos flag
KuntaiDu Sep 4, 2024
9163d52
fix: cannot reuse server if there is only one test
KuntaiDu Sep 4, 2024
950219c
set sonnet output len to 400
KuntaiDu Sep 4, 2024
8f8ed06
rerun trt , with --ignore-eos set off
KuntaiDu Sep 4, 2024
6e3e6e1
switch to 8B
KuntaiDu Sep 4, 2024
ede9688
update nightly benchmarks -- add ignore_eos
KuntaiDu Sep 22, 2024
3b994c5
move plotting scripts to a separate folder
KuntaiDu Sep 22, 2024
6bc6777
avoid embedding vLLM version to current serving engine, to make upgra…
KuntaiDu Sep 22, 2024
9d63edb
rename vllm 055 to vllm
KuntaiDu Sep 22, 2024
faf4083
reduce GPU util to 0.9
KuntaiDu Sep 22, 2024
1b43f1c
enable torch compile for SGLang
KuntaiDu Sep 22, 2024
3a5fa29
Merge branch 'main' into kuntai-update-nightlybench
KuntaiDu Sep 22, 2024
7e12e84
make syntax checker happy
KuntaiDu Sep 22, 2024
60892b6
Merge branch 'vllm-project:main' into kuntai-update-nightlybench
KuntaiDu Sep 27, 2024
66ced32
remove plotting scripts -- visualization is sensitive and people need…
KuntaiDu Sep 27, 2024
5bc0198
Merge branch 'kuntai-update-nightlybench' of https://github.com/Kunta…
KuntaiDu Sep 27, 2024
efc0bc4
fix comments & bump to bf16
KuntaiDu Oct 2, 2024
9c20da0
update metric collection metric to incorporate latest benchmark_servi…
KuntaiDu Oct 2, 2024
e20d6b9
bug fix: latency has been removed.
KuntaiDu Oct 3, 2024
3a76dc7
Remove QPS 2 to accelerate the benchmark (SGLang's benchmark is hitti…
KuntaiDu Oct 3, 2024
141ce12
Merge branch 'kuntai-update-nightlybench' of https://github.com/Kunta…
KuntaiDu Oct 3, 2024
4f1a72a
Bump up version of all containers
KuntaiDu Oct 4, 2024
28 changes: 28 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@

## Description

This file contains the download links for the benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts from the post.


## Results reproduction

- Find the Docker image we use in the `benchmarking pipeline` artifact.
- Start a container from that image and, inside the container:
  - Download `nightly-benchmarks.zip`.
  - In the same folder, run the following commands:
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

The results will be written to `./benchmarks/results`.
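
Once the run finishes, the result files can be inspected programmatically. The following is a minimal Python sketch, assuming the results directory contains JSON files; the file layout and the keys inside each file are assumptions, not something this document specifies.

```python
import json
from pathlib import Path


def load_results(results_dir: str) -> dict[str, object]:
    """Load every *.json file directly under results_dir, keyed by file name.

    Assumption: results are stored as JSON files; other files are ignored.
    """
    root = Path(results_dir)
    if not root.is_dir():
        return {}
    results = {}
    for path in sorted(root.glob("*.json")):
        with path.open() as f:
            results[path.name] = json.load(f)
    return results


if __name__ == "__main__":
    # Print the top-level keys of each result file as a quick sanity check.
    for name, payload in load_results("./benchmarks/results").items():
        keys = sorted(payload) if isinstance(payload, dict) else []
        print(name, keys)
```

Returning an empty dict for a missing directory keeps the helper safe to call before the benchmark has produced any output.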

63 changes: 31 additions & 32 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,44 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance on which workload.
- Reproducible: anyone can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.
This benchmark aims to:
- Provide performance clarity: show which engine (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance on which workload.
- Be reproducible: anyone can run the exact same set of benchmarking commands inside the exact same Docker image by following the reproduction instructions in [reproduce.md]().


## Setup

- Docker images
  - vllm/vllm-openai:v0.5.0.post1
  - nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
  - openmmlab/lmdeploy:v0.5.0
  - ghcr.io/huggingface/text-generation-inference:2.1
- Hardware
  - 8x Nvidia A100 GPUs
- Workload
  - Input length: determined by 500 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
  - Output length: the output lengths corresponding to these 500 prompts.
  - Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
    - We do not use llama 3.1 as it is incompatible with trt-llm r24.07 ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
  - Average QPS (queries per second): 2, 4, 8 and inf.
    - Queries are randomly sampled, and arrival patterns are determined via a Poisson process, all with a fixed random seed.
  - Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
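
The arrival pattern described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual implementation: the function name `arrival_times` and the default seed are assumptions. It relies on the fact that the gaps between events of a Poisson process with rate `qps` are exponentially distributed with that same rate.

```python
import random


def arrival_times(num_requests: int, qps: float, seed: int = 0) -> list[float]:
    """Return send times (in seconds) for a Poisson arrival process.

    qps == float("inf") means every request is sent at time 0.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed gives the same schedule on every run
    times = []
    now = 0.0
    for _ in range(num_requests):
        times.append(now)
        if qps != float("inf"):
            # Inter-arrival gaps of a Poisson process are exponential with rate qps.
            now += rng.expovariate(qps)
    return times


if __name__ == "__main__":
    for qps in (2.0, 4.0, 8.0, float("inf")):
        schedule = arrival_times(500, qps)
        print(qps, round(schedule[-1], 1))
```

Using a private `random.Random` instance rather than the module-level generator keeps the schedule reproducible even if other code draws random numbers in between.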

## Plots

## Workload description
In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
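
For reference, the mean and the standard error of the mean can be computed as in the sketch below. It assumes the sample standard deviation (n - 1 denominator) divided by the square root of the sample count; the helper name `mean_and_sem` is hypothetical and the plotting scripts may compute it differently.

```python
import math
import statistics


def mean_and_sem(samples: list[float]) -> tuple[float, float]:
    """Return (mean, standard error of the mean) for a non-empty sample list."""
    mean = statistics.fmean(samples)
    if len(samples) < 2:
        # A single sample has no measurable spread, so report a zero error bar.
        return mean, 0.0
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, sem


if __name__ == "__main__":
    print(mean_and_sem([1.0, 2.0, 3.0]))
```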

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:
<img src="artifact://nightly_results_sharegpt.png" alt="Benchmarking results" height=250 >

- Input length: determined by 500 prompts randomly sampled from the ShareGPT dataset (with a fixed random seed).
- Output length: the output lengths corresponding to these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (queries per second): 4 for the small model (llama-3 8B) and 2 for the other two models. For each QPS, the arrival time of each query is determined by a Poisson process (with a fixed random seed).
- Evaluation metrics: throughput (higher is better), TTFT (time to first token, lower is better), ITL (inter-token latency, lower is better).
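
The evaluation metrics listed above can be derived from per-request timestamps, as in this minimal sketch. The `RequestTrace` type and the choice of requests per second as the throughput unit are assumptions made for illustration; the benchmark may define throughput differently (for example in tokens per second).

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Hypothetical per-request record used only for this illustration."""
    send_time: float          # when the request was sent (seconds)
    token_times: list[float]  # arrival time of each output token (seconds)


def ttft(trace: RequestTrace) -> float:
    """Time to first token: first token arrival minus send time."""
    return trace.token_times[0] - trace.send_time


def itls(trace: RequestTrace) -> list[float]:
    """Inter-token latencies: gaps between consecutive output tokens."""
    return [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]


def request_throughput(traces: list[RequestTrace]) -> float:
    """Completed requests per second over the whole run (assumed definition)."""
    start = min(t.send_time for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return len(traces) / (end - start)
```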
<img src="artifact://nightly_results_sonnet_2048_128.png" alt="Benchmarking results" height=250 >

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->
<img src="artifact://nightly_results_sonnet_128_2048.png" alt="Benchmarking results" height=250 >

## Plots
## Results

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.
{nightly_results_benchmarking_table}

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results
## Known issues

{nightly_results_benchmarking_table}
- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
Member

I think this is fixed in their 0.12 release.

Collaborator Author

I am currently using r24.07, which is paired with the trtllm 0.11 release (I am having trouble upgrading to r24.08; see the reasons in the next conversation).

96 changes: 85 additions & 11 deletions .buildkite/nightly-benchmarks/nightly-pipeline.yaml
@@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec

common_container_settings: &common_container_settings
  command:
    - bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
    - bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
  resources:
    limits:
      nvidia.com/gpu: 8
@@ -37,7 +37,24 @@ common_container_settings: &common_container_settings

steps:
  - block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
  - label: "A100 trt benchmark"



  - label: "A100 vllm step 10"
    priority: 100
    agents:
      queue: A100
    plugins:
      - kubernetes:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: vllm/vllm-openai:v0.6.1.post2
                <<: *common_container_settings



  - label: "A100 sglang benchmark"
    priority: 100
    agents:
      queue: A100
@@ -46,7 +63,7 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
              - image: lmsysorg/sglang:v0.2.14.post2-cu118
                <<: *common_container_settings

  - label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: openmmlab/lmdeploy:v0.5.0
              - image: openmmlab/lmdeploy:v0.6.0a0-cu11
                <<: *common_container_settings


  - label: "A100 vllm benchmark"



  - label: "A100 trt llama-8B"
    priority: 100
    agents:
      queue: A100
@@ -71,10 +90,25 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: vllm/vllm-openai:latest
              - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Member

Maybe use 24.08?

Collaborator Author

I got the r24.07 protobuf template-filling scripts from NVIDIA, and these scripts don't work for r24.08 right now. I confirmed with NVIDIA that a test docker usable for benchmarking will be released in the future, so I plan to use r24.07 for now and switch to the test docker after its release.

                <<: *common_container_settings
                env:
                  - name: VLLM_USAGE_SOURCE
                    value: ci-test
                  - name: HF_HOME
                    value: /root/.cache/huggingface
                  - name: VLLM_SOURCE_CODE_LOC
                    value: /workspace/build/buildkite/vllm/performance-benchmark
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token-secret
                        key: token
                  - name: TEST_SELECTOR
                    value: "llama8B"


  - label: "A100 tgi benchmark"
  - label: "A100 trt llama-70B"
    priority: 100
    agents:
      queue: A100
@@ -83,12 +117,52 @@ steps:
          podSpec:
            <<: *common_pod_spec
            containers:
              - image: ghcr.io/huggingface/text-generation-inference:2.1
              - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
Member

Ditto

                <<: *common_container_settings
                env:
                  - name: VLLM_USAGE_SOURCE
                    value: ci-test
                  - name: HF_HOME
                    value: /root/.cache/huggingface
                  - name: VLLM_SOURCE_CODE_LOC
                    value: /workspace/build/buildkite/vllm/performance-benchmark
                  - name: HF_TOKEN
                    valueFrom:
                      secretKeyRef:
                        name: hf-token-secret
                        key: token
                  - name: TEST_SELECTOR
                    value: "llama70B"


  # - label: "A100 trt benchmark"
  #   priority: 100
  #   agents:
  #     queue: A100
  #   plugins:
  #     - kubernetes:
  #         podSpec:
  #           <<: *common_pod_spec
  #           containers:
  #             - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
  #               <<: *common_container_settings


  # - label: "A100 tgi benchmark"
  #   priority: 100
  #   agents:
  #     queue: A100
  #   plugins:
  #     - kubernetes:
  #         podSpec:
  #           <<: *common_pod_spec
  #           containers:
  #             - image: ghcr.io/huggingface/text-generation-inference:2.2.0
  #               <<: *common_container_settings
Member

cleanup unless we need to keep them

Suggested change
  # - label: "A100 trt benchmark"
  #   priority: 100
  #   agents:
  #     queue: A100
  #   plugins:
  #     - kubernetes:
  #         podSpec:
  #           <<: *common_pod_spec
  #           containers:
  #             - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
  #               <<: *common_container_settings
  # - label: "A100 tgi benchmark"
  #   priority: 100
  #   agents:
  #     queue: A100
  #   plugins:
  #     - kubernetes:
  #         podSpec:
  #           <<: *common_pod_spec
  #           containers:
  #             - image: ghcr.io/huggingface/text-generation-inference:2.2.0
  #               <<: *common_container_settings

Collaborator Author

TGI benchmark: I prefer keeping it so that we can restore it once TGI supports the --ignore-eos flag.

TRT benchmark: I also prefer keeping it. I currently split the trt-llm test into separate llama8B and llama70B tests and commented out this one purely because TRT needs to compile the model, which exceeds the 1 hour 30 minute CI time limit if I run the commented-out test directly. I expect that the new TRT-LLM docker will (hopefully) ship with all models in the test suite pre-compiled, so that this test no longer exceeds the limit and I can uncomment it.

Collaborator Author

Just added comments in the file to explain why I am keeping those commented-out steps.
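
The split discussed in this thread relies on a `TEST_SELECTOR` environment variable that is set per pipeline step (`"llama8B"` and `"llama70B"` in the diff). A minimal Python sketch of how such a selector could filter test cases is shown below. The `test_name` field, the sample test names, and the regex matching are assumptions for illustration; this is not the actual logic of `run-nightly-benchmarks.sh`.

```python
import os
import re
from typing import Optional


def select_tests(tests: list[dict], selector: Optional[str]) -> list[dict]:
    """Keep the tests whose name matches selector; no selector keeps all tests.

    Assumption: each test case is a dict with a "test_name" key.
    """
    if not selector:
        return list(tests)
    pattern = re.compile(selector)
    return [t for t in tests if pattern.search(t["test_name"])]


if __name__ == "__main__":
    # Hypothetical test names; the real suite lives in nightly-tests.json.
    all_tests = [
        {"test_name": "llama8B_tp1_sharegpt"},
        {"test_name": "llama70B_tp4_sharegpt"},
    ]
    chosen = select_tests(all_tests, os.environ.get("TEST_SELECTOR"))
    print([t["test_name"] for t in chosen])
```

Running each selector in its own pipeline step gives every model its own CI time budget, which is the motivation stated in the comment above.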


  - wait

  - label: "Plot"
  - label: "Collect the results"
    priority: 100
    agents:
      queue: A100
@@ -117,4 +191,4 @@ steps:
                        name: hf-token-secret
                        key: token

  - wait
  - block: ":rocket: check the results!"
76 changes: 0 additions & 76 deletions .buildkite/nightly-benchmarks/run-nightly-suite.sh

This file was deleted.
