189 changes: 26 additions & 163 deletions docs/dev-docker/README.md
@@ -1,6 +1,6 @@
# FP8 Latency and Throughput benchmarks with vLLM on the AMD Instinct™ MI300X accelerator

Documentation for Inferencing with vLLM on AMD Instinct™ MI300X platforms.
Documentation for vLLM inference on AMD Instinct™ MI300X platforms.

## Overview

@@ -17,7 +17,7 @@ The pre-built image includes:

## Pull latest Docker Image

Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`
Pull the most recent validated docker image with `docker pull rocm/vllm:latest`

## What is New

@@ -32,7 +32,6 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main
## Performance Results

The data in the following tables is a reference point to help users validate observed performance. It should not be considered the peak performance that can be delivered by the AMD Instinct™ MI300X accelerator with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. These performance numbers were collected using the steps below.
*Note: Benchmarks were run with the benchmark scripts from [v0.8.5](https://github.com/vllm-project/vllm/tree/v0.8.5/benchmarks).*

### Throughput Measurements

@@ -100,7 +99,7 @@ Supermicro AS-8125GS-TNMR2 with 2x AMD EPYC 9554 Processors, 2.25 TiB RAM, 8x AM

### Preparation - Obtaining access to models

The vllm-dev docker image should work with any model supported by vLLM. When running with FP8, AMD has quantized models available for a variety of popular models, or you can quantize models yourself using Quark. If needed, the vLLM benchmark scripts will automatically download models and then store them in a Hugging Face cache directory for reuse in future tests. Alternatively, you can choose to download the model to the cache (or to another directory on the system) in advance.
The vllm docker image should work with any model supported by vLLM. When running with FP8, AMD has quantized models available for a variety of popular models, or you can quantize models yourself using Quark. If needed, the vLLM benchmark scripts will automatically download models and then store them in a Hugging Face cache directory for reuse in future tests. Alternatively, you can choose to download the model to the cache (or to another directory on the system) in advance.

Many HuggingFace models, including Llama-3.1, have gated access. You will need to set up an account at <https://huggingface.co>, search for the model of interest, and request access if necessary. You will also need to create a token for accessing these models from vLLM: open your user profile (<https://huggingface.co/settings/profile>), select "Access Tokens", press "+ Create New Token", and create a new Read token.
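One common pattern is to export the token on the host so it can be passed into the container in the `docker run` command shown below; this is a minimal sketch, and the placeholder value as well as the use of `huggingface-cli` on the host are assumptions:

```bash
# Placeholder value -- substitute the Read token created in the Hugging Face settings page.
export HF_TOKEN=<your_hf_read_token>
# Optional: if huggingface-cli is installed on the host, verify the token by logging in.
huggingface-cli login --token "$HF_TOKEN"
```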

@@ -122,7 +121,7 @@ docker run -it --rm --ipc=host --network=host --group-add render \
-e HF_HOME=/data \
-e HF_TOKEN=<token> \
-v /data:/data \
rocm/vllm-dev:main
rocm/vllm:latest
```

Note: The instructions in this document use `/data` to store the models. If you choose a different directory, you will also need to make that change to the host volume mount when launching the docker container. For example, `-v /home/username/models:/data` in place of `-v /data:/data` would store the models in /home/username/models on the host. Some models can be quite large; please ensure that you have sufficient disk space prior to downloading the model. Since the model download may take a long time, you can use `tmux` or `screen` to avoid getting disconnected.
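For example, a model can be pre-downloaded into the cache from inside the running container; this is a sketch, assuming `huggingface-cli` is available in the image, and the model name is just one of the examples listed in this document:

```bash
# Run inside the container: HF_HOME=/data directs the download into the mounted cache.
huggingface-cli download amd/Llama-3.1-70B-Instruct-FP8-KV
```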
@@ -155,7 +154,7 @@ In the benchmark commands provided later in this document, replace the model nam

### Use pre-quantized models

AMD has provided [FP8-quantized versions](https://huggingface.co/collections/amd/quark-quantized-ocp-fp8-models-66db7936d18fcbaf95d4405c) of several models in order to make them easier to run on MI300X / MI325X, including:
AMD has provided [FP8-quantized versions](https://huggingface.co/collections/amd/quark-quantized-ocp-fp8-models) of several models in order to make them easier to run, including:

- <https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV>
- <https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV>
@@ -197,21 +196,21 @@ Note: the `--multi_gpu` parameter can be omitted for small models that fit on a

### Performance environment variables

Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the AMD Instinct MI300X workload optimization guide for more information.
Some environment variables enhance the performance of the vLLM kernels on the MI300X / MI325X accelerator. See the [AMD Instinct MI300X workload optimization guide](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html) for more information.
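As an illustration only, such variables are typically exported in the shell before launching vLLM; the variables below appear elsewhere in this document, but the values shown are assumptions rather than tuned recommendations (consult the optimization guide):

```bash
# Illustrative settings only -- see the workload optimization guide for recommended values.
export VLLM_USE_TRITON_FLASH_ATTN=1   # use the Triton flash-attention path
export VLLM_ROCM_USE_AITER=1          # enable AITER kernels (see the AITER section below)
```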

### vLLM engine performance settings

vLLM provides a number of engine options which can be changed to improve performance. Refer to the [vLLM Engine Args](https://docs.vllm.ai/en/stable/usage/engine_args.html) documentation for the complete list of vLLM engine options.

Below is a list of a few key vLLM engine arguments that affect performance; these can be passed to the vLLM benchmark scripts (a combined example follows the list):
- **--max-model-len** : Maximum context length supported by the model instance. Can be set to a value lower than the model's configured maximum to improve performance and GPU memory utilization.
- **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher gpu memory utilization. 65536 works well for LLama models.
- **--max-num-seqs** : The maximum decode batch size (default 256). Using larger values will allow more prompts to be processed concurrently, resulting in increased throughput (possibly at the expense of higher latency). If the value is too large, there may not be enough GPU memory for the KV cache, resulting in requests getting preempted. The optimal value will depend on the GPU memory, model size, and maximum context length.
- **--max-num-batched-tokens** : The maximum prefill size, i.e., how many prompt tokens can be packed together in a single prefill. Set to a higher value to improve prefill performance at the cost of higher GPU memory utilization. 131072 works well for Llama models.
- **--max-num-seqs** : The maximum decode batch size (default 1024). Using larger values will allow more prompts to be processed concurrently, resulting in increased throughput (possibly at the expense of higher latency). If the value is too large, there may not be enough GPU memory for the KV cache, resulting in requests getting preempted. The optimal value will depend on the GPU memory, model size, and maximum context length.
- **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. Increasing the value (potentially as high as 0.99) will increase the amount of memory available for KV cache. When running in graph mode (i.e. not using `--enforce-eager`), it may be necessary to use a slightly smaller value of 0.92 - 0.95 to ensure adequate memory is available for the HIP graph.
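A hedged sketch of how these arguments might be combined when launching a model server (the model name is taken from this document; the values are illustrative, not tuned recommendations):

```bash
# Illustrative values only; tune per model, context length, and available GPU memory.
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --max-model-len 8192 \
    --max-num-batched-tokens 131072 \
    --max-num-seqs 1024 \
    --gpu-memory-utilization 0.95
```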

### Latency Benchmark

vLLM's benchmark_latency.py script measures end-to-end latency for a specified model, input/output length, and batch size.
vLLM's `vllm bench latency` tool measures end-to-end latency for a specified model, input/output length, and batch size.

You can run latency tests for FP8 models with:

@@ -222,7 +221,7 @@ IN=128
OUT=2048
TP=8

python3 /app/vllm/benchmarks/benchmark_latency.py \
vllm bench latency \
--distributed-executor-backend mp \
--dtype float16 \
--gpu-memory-utilization 0.9 \
@@ -239,17 +238,17 @@ python3 /app/vllm/benchmarks/benchmark_latency.py \

When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value. It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts.

To estimate Time To First Token (TTFT) with the benchmark_latency.py script, set the OUT to 1 token. It is also recommended to use `--enforce-eager` to get a more accurate measurement of the time that it actually takes to generate the first token. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.)
To estimate Time To First Token (TTFT) with the `vllm bench latency` tool, set the OUT to 1 token. It is also recommended to use `--enforce-eager` to get a more accurate measurement of the time that it actually takes to generate the first token. (For a more comprehensive measurement of TTFT, use the Online Serving Benchmark.)
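A minimal sketch of such a TTFT estimate (the model name is taken from this document; the input length is illustrative):

```bash
# Approximate TTFT: generate a single output token; eager mode gives a more accurate first-token time.
vllm bench latency \
    --model amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 1024 \
    --output-len 1 \
    --enforce-eager
```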

For additional information about the available parameters run:

```bash
/app/vllm/benchmarks/benchmark_latency.py -h
vllm bench latency -h
```

### Throughput Benchmark

vLLM's benchmark_throughput.py script measures offline throughput. It can either use an input dataset or random prompts with fixed input/output lengths.
vLLM's `vllm bench throughput` tool measures offline throughput. It can either use an input dataset or random prompts with fixed input/output lengths.

You can run throughput tests for FP8 models with:

@@ -261,7 +260,7 @@ TP=8
PROMPTS=1500
MAX_NUM_SEQS=1500

python3 /app/vllm/benchmarks/benchmark_throughput.py \
vllm bench throughput \
--distributed-executor-backend mp \
--kv-cache-dtype fp8 \
--dtype float16 \
@@ -278,20 +277,20 @@ python3 /app/vllm/benchmarks/benchmark_throughput.py \
--max-num-seqs $MAX_NUM_SEQS
```

For FP16 models, remove `--kv-cache-dtype fp8`.
For FP16/BF16 models, remove `--kv-cache-dtype fp8`.

When measuring models with long context lengths, performance may improve by setting `--max-model-len` to a smaller value (8192 in this example). It is important, however, to ensure that the `--max-model-len` is at least as large as the IN + OUT token counts.

It is important to tune vLLM's `--max-num-seqs` to an appropriate value for the model and input/output lengths. Larger values allow vLLM to use more of the GPU memory for KV cache and process more prompts concurrently. But if the value is too large, the KV cache will reach its capacity and vLLM will have to cancel and re-process some prompts. Suggested values for various models and configurations are listed below.

For models that fit on a single GPU, it is usually best to run with `--tensor-parallel-size 1`. Requests can be distributed across multiple copies of vLLM running on different GPUs. This will be more efficient than running a single copy of the model with `--tensor-parallel-size 8`. (Note: the benchmark_throughput.py script does not include direct support for using multiple copies of vLLM)
For models that fit on a single GPU, it is usually best to run with `--tensor-parallel-size 1`. Requests can be distributed across multiple copies of vLLM running on different GPUs. This will be more efficient than running a single copy of the model with `--tensor-parallel-size 8`.
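A sketch of this pattern, assuming GPU selection via `HIP_VISIBLE_DEVICES` and reusing a model and prompt counts from this document; it launches two independent single-GPU benchmark copies in parallel:

```bash
# Two independent single-GPU copies of the throughput benchmark, one per GPU.
HIP_VISIBLE_DEVICES=0 vllm bench throughput \
    --model amd/Llama-3.1-8B-Instruct-FP8-KV \
    --tensor-parallel-size 1 \
    --input-len 2048 --output-len 512 --num-prompts 1500 &
HIP_VISIBLE_DEVICES=1 vllm bench throughput \
    --model amd/Llama-3.1-8B-Instruct-FP8-KV \
    --tensor-parallel-size 1 \
    --input-len 2048 --output-len 512 --num-prompts 1500 &
wait
```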

For optimal performance, the PROMPTS value should be a multiple of the MAX_NUM_SEQS value -- for example, if MAX_NUM_SEQS=1500 then the PROMPTS value could be 1500, 3000, etc. If PROMPTS is smaller than MAX_NUM_SEQS then there won’t be enough prompts for vLLM to maximize concurrency.

For additional information about the available parameters run:

```bash
python3 /app/vllm/benchmarks/benchmark_throughput.py -h
vllm bench throughput -h
```

### Online Serving Benchmark
@@ -317,7 +316,7 @@ For FP16 models, remove `--kv-cache-dtype fp8`. Change port (for example --port
Run the client in a separate terminal. Use the port from the previous step; otherwise the default is port 8000.

```bash
python /app/vllm/benchmarks/benchmark_serving.py \
vllm bench serve \
--port 8000 \
--model amd/Llama-3.1-70B-Instruct-FP8-KV \
--dataset-name random \
@@ -331,172 +330,36 @@ python /app/vllm/benchmarks/benchmark_serving.py \

Once all prompts are processed, terminate the server gracefully (ctrl+c).

### Running DeepSeek-V3 and DeepSeek-R1
### AITER

We have experimental support for running both DeepSeek-V3 and DeepSeek-R1 models.
*Note: there are currently limitations, and `--max-model-len` cannot be greater than 32768.*

```bash
docker run -it --rm --ipc=host --network=host --group-add render \
--privileged --security-opt seccomp=unconfined \
--cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
--device=/dev/kfd --device=/dev/dri --device=/dev/mem \
-e VLLM_USE_TRITON_FLASH_ATTN=1 \
-e VLLM_ROCM_USE_AITER=1 \
-e VLLM_MLA_DISABLE=0 \
rocm/vllm-dev:main

# Online serving
vllm serve deepseek-ai/DeepSeek-V3 \
--disable-log-requests \
--tensor-parallel-size 8 \
--trust-remote-code \
--max-model-len 131072 \
--block-size=1

python3 /app/vllm/benchmarks/benchmark_serving.py \
--backend vllm \
--model deepseek-ai/DeepSeek-V3 \
--max-concurrency 256 \
--dataset-name random \
--random-input-len 128 \
--random-output-len 128 \
--num-prompts 1000

# Offline throughput
python3 /app/vllm/benchmarks/benchmark_throughput.py --model deepseek-ai/DeepSeek-V3 \
--input-len <> --output-len <> --tensor-parallel-size 8 \
--quantization fp8 --kv-cache-dtype fp8 --dtype float16 \
--max-model-len 32768 --block-size=1 --trust-remote-code

# Offline Latency
python /app/vllm/benchmarks/benchmark_latency.py --model deepseek-ai/DeepSeek-V3 \
--tensor-parallel-size 8 --trust-remote-code --max-model-len 32768 --block-size=1 \
--batch-size <> --input-len <> --output-len <>
```

### CPX mode

Currently, only CPX-NPS1 mode is supported, so only `tp=1` is supported in CPX mode.
However, multiple instances can be started simultaneously (if needed) in CPX-NPS1 mode.

Set GPUs in CPX mode with:

```bash
rocm-smi --setcomputepartition cpx
```

Example of running Llama3.1-8B on 1 CPX-NPS1 GPU with input 4096 and output 512. As mentioned above, tp=1.

```bash
HIP_VISIBLE_DEVICES=0 \
python3 /app/vllm/benchmarks/benchmark_throughput.py \
--max-model-len 4608 \
--num-scheduler-steps 10 \
--num-prompts 100 \
--model amd/Llama-3.1-8B-Instruct-FP8-KV \
--input-len 4096 \
--output-len 512 \
--dtype float16 \
--tensor-parallel-size 1 \
--output-json <path/to/output.json> \
--quantization fp8 \
--gpu-memory-utilization 0.99
```

Set GPU to SPX mode.

```bash
rocm-smi --setcomputepartition spx
```

### Speculative Decoding

Speculative decoding is one of the key features in vLLM and is supported on MI300. Below is an example of the performance benchmark with and without speculative decoding for Llama 3.1 405B, using Llama 3.1 8B as the draft model.

Without Speculative Decoding -

```bash
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128
```

With Speculative Decoding -

```bash
python /app/vllm/benchmarks/benchmark_latency.py --model amd/Llama-3.1-405B-Instruct-FP8-KV --max-model-len 26720 -tp 8 --batch-size 1 --input-len 1024 --output-len 128 --speculative-model amd/Llama-3.1-8B-Instruct-FP8-KV --num-speculative-tokens 5
```

You should see some improvement in end-to-end latency.

### AITER use cases

The `rocm/vllm-dev:main` image has experimental [AITER](https://github.com/ROCm/aiter) support, which can yield significant performance increases for some model/input/output/batch-size configurations. To enable the feature, make sure the following environment variable is set: `VLLM_ROCM_USE_AITER=1` (the default value is `0`). When building your own image, follow the [Docker build steps](#Docker-manifest) using the [aiter_integration_final](https://github.com/ROCm/vllm/tree/aiter_integration_final) branch.

Some use cases include:
- amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV
- amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV
The `rocm/vllm:latest` image comes with [AITER](https://github.com/ROCm/aiter) preinstalled, which can yield significant performance increases for some model/input/output/batch-size configurations. To disable this feature and fall back to vLLM's Triton attention, set `VLLM_ROCM_USE_AITER=0`; the default value is currently `1`. See https://docs.vllm.ai/en/latest/getting_started/quickstart.html#on-attention-backends for more information.

```bash
export VLLM_ROCM_USE_AITER=1      # explicitly enable AITER kernels (currently the image default)
export VLLM_ROCM_USE_AITER_MHA=0  # disable the AITER multi-head attention path while keeping other AITER kernels
python3 /app/vllm/benchmarks/benchmark_latency.py --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048
vllm bench latency --model amd/Mixtral-8x22B-Instruct-v0.1-FP8-KV -tp 8 --batch-size 256 --input-len 128 --output-len 2048
```

## MMLU_PRO_Biology Accuracy Evaluation

### FP16

vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 64
## Building vLLM docker image for ROCm

| Tasks |Version| Filter |n-shot| Metric | |Value | |Stderr|
|-------|------:|--------------|-----:|-----------|---|-----:|---|-----:|
|biology| 0|custom-extract| 5|exact_match|↑ |0.8466|± |0.0135|

### FP8

vllm (pretrained=models--meta-llama--Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26,dtype=float16,quantization=fp8,quantized_weights_path=/llama.safetensors,tensor_parallel_size=8), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 32

| Tasks |Version| Filter |n-shot| Metric | |Value| |Stderr|
|-------|------:|--------------|-----:|-----------|---|----:|---|-----:|
|biology| 0|custom-extract| 5|exact_match|↑ |0.848|± |0.0134|

## Performance

### MLPerf Performance Results

#### Llama-2-70B

Please refer to the [Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission — ROCm Blogs](https://rocm.blogs.amd.com/artificial-intelligence/mlperf-inf-4-1/README.html) for information on reproducing MLPerf 4.1 Inference results. Note that due to recent changes in vLLM, these instructions do not apply to the current rocm/vllm-dev docker image.

## Docker Manifest

Clone the vLLM repository:
To build a vLLM image corresponding to the current rocm/vllm:latest, clone the vLLM repository:

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
```

Use the following command to build the image directly from the specified commit.
Then use the following command to build the image directly from the specified commit.

```bash
docker build -f docker/Dockerfile.rocm \
--build-arg REMOTE_VLLM=1 \
--build-arg VLLM_REPO=https://github.com/ROCm/vllm \
--build-arg VLLM_BRANCH="790d22168820507f3105fef29596549378cfe399" \
--build-arg VLLM_BRANCH="38f225c2abeadc04c2cc398814c2f53ea02c3c72" \
-t vllm-rocm .
```
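Once the build completes, the locally built image can be launched in the same way as the prebuilt one by substituting the local tag; a sketch reusing the `docker run` flags shown earlier in this document:

```bash
docker run -it --rm --ipc=host --network=host --group-add render \
    --privileged --security-opt seccomp=unconfined \
    --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
    --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
    -e HF_HOME=/data \
    -e HF_TOKEN=<token> \
    -v /data:/data \
    vllm-rocm
```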

### Building AITER Image

Use AITER release candidate branch instead:

```bash
git clone https://github.com/ROCm/vllm.git
cd vllm
git checkout aiter_integration_final
docker build -f docker/Dockerfile.rocm -t <your_tag> --build-arg USE_CYTHON=1 .
```
For further instructions on how to build an upstream vLLM docker image, see https://docs.vllm.ai/en/latest/getting_started/installation/gpu.html#build-image-from-source

## Changelog
