Commit f264100: Merge branch 'main' into upstream_merge_2025_03_31

2 parents: e294861 + 25070a1
3 files changed (+64, -54 lines)

Dockerfile.rocm_base (1 addition, 1 deletion)

````diff
@@ -42,7 +42,7 @@ RUN apt-get update -y \
     && curl -sS https://bootstrap.pypa.io/get-pip.py | python${PYTHON_VERSION} \
     && python3 --version && python3 -m pip --version
 
-RUN pip install -U packaging cmake ninja wheel setuptools pybind11 Cython
+RUN pip install -U packaging 'cmake<4' ninja wheel setuptools pybind11 Cython
 
 FROM base AS build_hipblaslt
 ARG HIPBLASLT_BRANCH
````
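The `'cmake<4'` pin keeps pip from resolving to the CMake 4.x line, which this build is not set up for. A minimal sanity check, assuming an image built from this Dockerfile; `<your_tag>` is a placeholder, as elsewhere in this repo's docs:

```bash
# Build the base image, then confirm the pinned CMake is a 3.x release.
docker build -f Dockerfile.rocm_base -t <your_tag> .
docker run --rm <your_tag> cmake --version
# Expected first line: "cmake version 3.x.y", never a 4.x build.
```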

docs/dev-docker/README.md (62 additions, 52 deletions)
````diff
@@ -21,30 +21,30 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
 
 ## What is New
 
-- [Experimental AITER support](#aiter-use-cases)
-- [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
-- Performance improvement for custom paged attention
-- Support for FP8 skinny GEMM
-- Bug fixes
+- [Improved DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
+- Initial Gemma-3 enablement
+- Detokenizer disablement
+- Torch.compile support
 
 ## Performance Results
 
 The data in the following tables is a reference point to help users validate observed performance. It should not be considered as the peak performance that can be delivered by AMD Instinct™ MI300X accelerator with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. The performance numbers above were collected using the steps below.
+*Note Benchmarks were run with benchmark scripts from [v0.6.5](https://github.com/vllm-project/vllm/tree/v0.6.5/benchmarks)*
 
 ### Throughput Measurements
 
 The table below shows performance data where a local inference client is fed requests at an infinite rate and shows the throughput client-server scenario under maximum load.
 
 | Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |
 |-------|-----------|---------|-------|--------|-------------|--------------|-----------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15919.0 |
-| | | | 128 | 4096 | 1500 | 1500 | 12053.3 |
-| | | | 500 | 2000 | 2000 | 2000 | 13089.0 |
-| | | | 2048 | 2048 | 1500 | 1500 | 8352.4 |
-| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4219.7 |
-| | | | 128 | 4096 | 1500 | 1500 | 3328.7 |
-| | | | 500 | 2000 | 2000 | 2000 | 3109.3 |
-| | | | 2048 | 2048 | 500 | 500 | 2121.7 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 3200 | 3200 | 15684.7 |
+| | | | 128 | 4096 | 1500 | 1500 | 11761.5 |
+| | | | 500 | 2000 | 2000 | 2000 | 12895.9 |
+| | | | 2048 | 2048 | 1500 | 1500 | 8380.7 |
+| Llama 3.1 405B (amd/Llama-3.1-405B-Instruct-FP8-KV) | FP8 | 8 | 128 | 2048 | 1500 | 1500 | 4218.6 |
+| | | | 128 | 4096 | 1500 | 1500 | 3326.2 |
+| | | | 500 | 2000 | 2000 | 2000 | 3113.4 |
+| | | | 2048 | 2048 | 500 | 500 | 2112.1 |
 
 *TP stands for Tensor Parallelism.*
````
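As a hedged illustration only, one way a single throughput row could be reproduced with the v0.6.5 benchmark scripts the note above references; the exact flags behind the published numbers are not stated in this diff:

```bash
# Hypothetical reproduction of the first throughput row
# (Llama 3.1 70B, input 128 / output 2048, 3200 prompts):
python3 /app/vllm/benchmarks/benchmark_throughput.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 --output-len 2048 \
    --num-prompts 3200 --max-num-seqs 3200 \
    --kv-cache-dtype fp8
```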

````diff
@@ -54,38 +54,38 @@ The table below shows latency measurement, which typically involves assessing th
 
 | Model | Precision | TP Size | Batch Size | Input | Output | MI300X Latency (sec) |
 |-------|-----------|----------|------------|--------|---------|-------------------|
-| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.654 |
-| | | | 2 | 128 | 2048 | 18.269 |
-| | | | 4 | 128 | 2048 | 18.561 |
-| | | | 8 | 128 | 2048 | 20.180 |
-| | | | 16 | 128 | 2048 | 22.541 |
-| | | | 32 | 128 | 2048 | 25.454 |
-| | | | 64 | 128 | 2048 | 33.666 |
-| | | | 128 | 128 | 2048 | 48.466 |
-| | | | 1 | 2048 | 2048 | 17.771 |
-| | | | 2 | 2048 | 2048 | 18.304 |
-| | | | 4 | 2048 | 2048 | 19.173 |
-| | | | 8 | 2048 | 2048 | 21.326 |
-| | | | 16 | 2048 | 2048 | 24.375 |
-| | | | 32 | 2048 | 2048 | 29.284 |
-| | | | 64 | 2048 | 2048 | 40.200 |
-| | | | 128 | 2048 | 2048 | 62.420 |
-| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.632 |
-| | | | 2 | 128 | 2048 | 47.370 |
-| | | | 4 | 128 | 2048 | 49.945 |
-| | | | 8 | 128 | 2048 | 53.010 |
-| | | | 16 | 128 | 2048 | 56.348 |
-| | | | 32 | 128 | 2048 | 65.222 |
-| | | | 64 | 128 | 2048 | 82.688 |
-| | | | 128 | 128 | 2048 | 115.980 |
-| | | | 1 | 2048 | 2048 | 46.918 |
-| | | | 2 | 2048 | 2048 | 48.132 |
-| | | | 4 | 2048 | 2048 | 52.281 |
-| | | | 8 | 2048 | 2048 | 55.874 |
-| | | | 16 | 2048 | 2048 | 61.822 |
-| | | | 32 | 2048 | 2048 | 76.925 |
-| | | | 64 | 2048 | 2048 | 105.400 |
-| | | | 128 | 2048 | 2048 | 162.503 |
+| Llama 3.1 70B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 17.662 |
+| | | | 2 | 128 | 2048 | 18.768 |
+| | | | 4 | 128 | 2048 | 19.282 |
+| | | | 8 | 128 | 2048 | 20.943 |
+| | | | 16 | 128 | 2048 | 23.388 |
+| | | | 32 | 128 | 2048 | 26.272 |
+| | | | 64 | 128 | 2048 | 34.514 |
+| | | | 128 | 128 | 2048 | 50.134 |
+| | | | 1 | 2048 | 2048 | 17.891 |
+| | | | 2 | 2048 | 2048 | 19.064 |
+| | | | 4 | 2048 | 2048 | 19.819 |
+| | | | 8 | 2048 | 2048 | 21.925 |
+| | | | 16 | 2048 | 2048 | 25.118 |
+| | | | 32 | 2048 | 2048 | 29.640 |
+| | | | 64 | 2048 | 2048 | 41.029 |
+| | | | 128 | 2048 | 2048 | 63.717 |
+| Llama 3.1 405B (amd/Llama-3.1-70B-Instruct-FP8-KV) | FP8 | 8 | 1 | 128 | 2048 | 46.779 |
+| | | | 2 | 128 | 2048 | 47.136 |
+| | | | 4 | 128 | 2048 | 49.045 |
+| | | | 8 | 128 | 2048 | 53.145 |
+| | | | 16 | 128 | 2048 | 55.720 |
+| | | | 32 | 128 | 2048 | 64.996 |
+| | | | 64 | 128 | 2048 | 81.950 |
+| | | | 128 | 128 | 2048 | 114.799 |
+| | | | 1 | 2048 | 2048 | 47.448 |
+| | | | 2 | 2048 | 2048 | 47.764 |
+| | | | 4 | 2048 | 2048 | 51.338 |
+| | | | 8 | 2048 | 2048 | 56.915 |
+| | | | 16 | 2048 | 2048 | 61.934 |
+| | | | 32 | 2048 | 2048 | 76.136 |
+| | | | 64 | 2048 | 2048 | 104.868 |
+| | | | 128 | 2048 | 2048 | 159.555 |
 
 *TP stands for Tensor Parallelism.*
````
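Similarly, a hypothetical single latency measurement matching the first row of this table, using the latency script shipped with the image; this is a sketch, not the exact command behind the published data:

```bash
# Hypothetical reproduction of the first latency row
# (Llama 3.1 70B, batch 1, input 128 / output 2048):
python3 /app/vllm/benchmarks/benchmark_latency.py \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --batch-size 1 --input-len 128 --output-len 2048
```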

````diff
@@ -352,15 +352,18 @@ docker run -it --rm --ipc=host --network=host --group-add render \
     --privileged --security-opt seccomp=unconfined \
     --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
     --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
-    -e VLLM_USE_TRITON_FLASH_ATTN=0 \
-    -e VLLM_MLA_DISABLE=1 \
+    -e VLLM_USE_TRITON_FLASH_ATTN=1 \
+    -e VLLM_USE_AITER=1 \
+    -e VLLM_MLA_DISABLE=0 \
     rocm/vllm-dev:main
+
 # Online serving
 vllm serve deepseek-ai/DeepSeek-V3 \
     --disable-log-requests \
     --tensor-parallel-size 8 \
     --trust-remote-code \
-    --max-model-len 32768
+    --max-model-len 131072 \
+    --block-size=1
 
 python3 /app/vllm/benchmarks/benchmark_serving.py \
     --backend vllm \
````
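A quick way to confirm the flipped toggles took effect, from a shell inside the container started above; variable names exactly as set in the diff:

```bash
# Print only the vLLM toggles this commit changes.
env | grep -E 'VLLM_USE_TRITON_FLASH_ATTN|VLLM_USE_AITER|VLLM_MLA_DISABLE'
# Expected:
#   VLLM_USE_TRITON_FLASH_ATTN=1
#   VLLM_USE_AITER=1
#   VLLM_MLA_DISABLE=0
```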
````diff
@@ -375,10 +378,11 @@ python3 /app/vllm/benchmarks/benchmark_serving.py \
 python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
     --input-len <> --output-len <> --tensor-parallel-size 8 \
     --quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
-    --max-model-len 32768 --trust-remote-code
+    --max-model-len 32768 --block-size=1 --trust-remote-code
+
 # Offline Latency
-python benchmarks/benchmark_latency.py --model deepseek-ai/DeepSeek-V3 \
-    --tensor-parallel-size 8 --trust-remote-code --max-model-len 32768 \
+python /app/vllm/benchmarks/benchmark_latency.py --model deepseek-ai/DeepSeek-V3 \
+    --tensor-parallel-size 8 --trust-remote-code --max-model-len 32768 --block-size=1 \
     --batch-size <> --input-len <> --output-len <>
 ```
 
````
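The `<>` placeholders are deliberately left unspecified in the README. Purely for illustration, one arbitrary way to fill them; the batch/input/output values below are not from the source:

```bash
# Example only: batch, input, and output lengths here are arbitrary.
python /app/vllm/benchmarks/benchmark_latency.py --model deepseek-ai/DeepSeek-V3 \
    --tensor-parallel-size 8 --trust-remote-code --max-model-len 32768 --block-size=1 \
    --batch-size 32 --input-len 128 --output-len 128
```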

````diff
@@ -483,7 +487,7 @@ To reproduce the release docker:
 ```bash
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout c0dd5adf68dd997d7d2c3f04da785d7ef9415e36
+git checkout 51641aaa70d4dfb0ea1f3674b47a7d85f718847c
 docker build -f Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
 ```
 
````

````diff
@@ -500,6 +504,12 @@ Use AITER release candidate branch instead:
 
 ## Changelog
 
+20250325_aiter:
+- Improved DeepSeek-V3/R1 performance
+- Initial Gemma-3 enablement
+- Detokenizer disablement
+- Torch.compile support
+
 20250305_aiter:
 - AITER improvements
 - Support for FP8 skinny GEMM
````

requirements/rocm-build.txt (1 addition, 1 deletion)

````diff
@@ -6,7 +6,7 @@ torch==2.6.0
 torchvision==0.21.0
 torchaudio==2.6.0
 
-cmake>=3.26
+cmake>=3.26,<4
 packaging
 setuptools>=61
 setuptools-scm>=8
````
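The combined specifier admits any CMake from 3.26 up to, but not including, 4.0, matching the pin added in Dockerfile.rocm_base. A quick way to see what pip would select, assuming a pip recent enough to support `--dry-run` (22.2+):

```bash
# Resolve, but do not install, the pinned range; pip should report
# the newest 3.x release and never a 4.x one.
python3 -m pip install --dry-run "cmake>=3.26,<4"
```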
