docs/dev-docker/README.md: 62 additions & 52 deletions
@@ -21,30 +21,30 @@ Pull the most recent validated docker image with `docker pull rocm/vllm-dev:main`

## What is New

-- [Experimental AITER support](#aiter-use-cases)
-- [Experimental DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
-- Performance improvement for custom paged attention
-- Support for FP8 skinny GEMM
-- Bug fixes
+- [Improved DeepSeek-V3 and DeepSeek-R1 support](#running-deepseek-v3-and-deepseek-r1)
+- Initial Gemma-3 enablement
+- Detokenizer disablement
+- Torch.compile support
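The "Detokenizer disablement" item in the new list refers to skipping text detokenization during generation. As a rough illustration only (not part of this PR's diff), the sketch below uses vLLM's offline API and assumes a build in which `SamplingParams` exposes a `detokenize` flag; the model name is a placeholder:

```python
# Minimal sketch: skipping detokenization in offline inference.
# Assumes a vLLM build where SamplingParams exposes `detokenize`;
# the model name below is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

# detokenize=False returns token IDs only, avoiding detokenization overhead
# when the caller does not need decoded text (e.g. throughput benchmarking).
params = SamplingParams(max_tokens=128, temperature=0.0, detokenize=False)

outputs = llm.generate(["Explain ROCm in one sentence."], params)
for out in outputs:
    print(out.outputs[0].token_ids)  # decoded `.text` is not populated here
```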
## Performance Results
The data in the following tables is a reference point to help users validate observed performance. It should not be considered the peak performance that the AMD Instinct™ MI300X accelerator can deliver with vLLM. See the MLPerf section in this document for information about MLPerf 4.1 inference results. These performance numbers were collected using the steps below.
+*Note: Benchmarks were run with benchmark scripts from [v0.6.5](https://github.com/vllm-project/vllm/tree/v0.6.5/benchmarks)*
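As a rough illustration of what those throughput scripts measure (offline generation over a batch of prompts, reported as generated tokens per second), the sketch below uses the plain vLLM Python API rather than the actual `benchmark_throughput.py`; the model name, prompt set, and sequence lengths are placeholder assumptions:

```python
# Simplified illustration of an offline throughput measurement
# (the published numbers come from benchmarks/benchmark_throughput.py in v0.6.5).
# Model name, prompt contents, and lengths below are placeholders.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", tensor_parallel_size=1)

num_prompts = 32
prompts = ["Summarize the ROCm software stack."] * num_prompts
params = SamplingParams(temperature=0.0, max_tokens=128, ignore_eos=True)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s over {num_prompts} prompts")
```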
### Throughput Measurements
The table below shows performance data collected when a local inference client is fed requests at an infinite rate, i.e., client-server throughput under maximum load.
| Model | Precision | TP Size | Input | Output | Num Prompts | Max Num Seqs | Throughput (tokens/s) |