Changes from all commits
48 commits
9191d57
add an env var for path to pre-downloaded flashinfer cubin files (#22…
842974287 Aug 22, 2025
31705bf
[CI/Build] add EP dependencies to docker (#21976)
zhewenl Aug 22, 2025
73911f7
fix spec_decode.py
ekagra-ranjan Aug 23, 2025
7847894
[PERF] PyTorch Symmetric Memory All-Reduce (#20759)
ilmarkov Aug 22, 2025
78e74a1
[BugFix][AMD][Quantization] Fix torch.compile issue where wvSplitKQ n…
rasmith Aug 22, 2025
edd48f5
[NVIDIA][torch.compile] Support Flashinfer TRTLLM FP8-q/kv NVFP4-out …
elvischenv Aug 22, 2025
03d6f9d
[BugFix] Fix batch updates for pooling models (#23398)
njhill Aug 23, 2025
82f511d
[BugFix] Fix `MinPLogitsProcessor.update_states()` (#23401)
njhill Aug 23, 2025
9ffa26d
[Model] Support DP for ViT on MiniCPM-V-4 (#23327)
david6666666 Aug 23, 2025
466ed1b
[UX] Move Dockerfile DeepGEMM install to tools/install_deepgemm.sh (#…
mgoin Aug 23, 2025
bc78210
Quantization: support FP4 quantized models on AMD CDNA2/CDNA3 GPUs (#…
fengli1702 Aug 23, 2025
58d6fa6
Add glm4.5v tp2,4 fp8 config on H100_80GB (#23443)
chenxi-yang Aug 23, 2025
4f93bc2
Revert "[PERF] Use faster way of decode in tokenizer: avoid useless l…
DarkLight1337 Aug 23, 2025
5546b68
fix(tests): Correct unreachable assertion in truncation test (#23425)
AzizCode92 Aug 23, 2025
a478b92
Support DeepSeek-V3.1 tool call (#23454)
Xu-Wenqing Aug 23, 2025
addd398
[Misc] Modify CacheConfig import (#23459)
jeejeelee Aug 23, 2025
5775c1b
[gpt-oss] Streaming Output for Python Tool (#23409)
ZJY0516 Aug 24, 2025
315a40d
Migrate Pixtral inputs to TensorSchema (#23472)
bbeckca Aug 24, 2025
7a28e0b
[Bugfix] Add strong reference to CUDA pluggable allocator callbacks (…
22quinn Aug 24, 2025
461b13e
Migrate Paligemma inputs to TensorSchema (#23470)
bbeckca Aug 24, 2025
945b76a
[kernel] Support W4A8 on Hopper (#23198)
czhu-cohere Aug 24, 2025
8a7434c
[Misc] update dict parse to EPLBConfig from json dumps to dict unpack…
lengrongfu Aug 24, 2025
40486b4
(Misc): add missing test for zero truncation size. (#23457)
teekenl Aug 24, 2025
9154599
[New Model]Donut model (#23229)
princepride Aug 24, 2025
e06db36
[Model] Enable BLOOM on V1 (#23488)
DarkLight1337 Aug 24, 2025
ec59546
[Misc] Remove unused slot_mapping buffer (#23502)
WoosukKwon Aug 24, 2025
19de613
fix incompatibililty with non cuda platform for nvfp4 (#23478)
luccafong Aug 24, 2025
7d6c653
[Doc: ]fix various typos in multiple files (#23487)
didier-durand Aug 25, 2025
0d51374
[Perf] Add Triton config for DeepSeek V3 FP8 EP32 H200 (#23504)
minosfuture Aug 25, 2025
c8cb0f4
Frontend: Adding LM Format Enforcer support to V1 engine (#22564)
noamgat Aug 25, 2025
2362383
[Bugfix] Fix Qwen2.5-VL quantized model weights loading (#23512)
zifeitong Aug 25, 2025
edd6b4d
[Misc] Unified linear print info (#23516)
jeejeelee Aug 25, 2025
4dad317
Migrate tarsier inputs to TensorSchema (#23500)
bbeckca Aug 25, 2025
71abeb9
Migrate skyworkr1v inputs to TensorSchema (#23499)
bbeckca Aug 25, 2025
eea78b5
Migrate DonutImagePixelInputs to TensorSchema (#23509)
bbeckca Aug 25, 2025
413686c
[Bugfix] Fix Dense module loading for sentence-transformers embedding…
FFFfff1FFFfff Aug 25, 2025
715d38c
[gpt-oss] use reasoning channel for reasoning text in serving_chat (#…
yuguo68 Aug 25, 2025
57d5528
[Refactor] Dynamic `target` and `content` for prompt updates (#23411)
DarkLight1337 Aug 25, 2025
89faeec
[Core][Multimodal] Track encode cache entries by mm_hash and enable e…
fake0fan Aug 25, 2025
b1bea67
[Fix] DeepSeek V3.1 tool parser error message (#23492)
skyloevil Aug 25, 2025
7f639b0
Feature/benchmark/random mm data/images (#23119)
h-brenoskuk Aug 25, 2025
bd92688
[Bugfix] Allow dynamic number of patches for llava_onevision (#23525)
DarkLight1337 Aug 25, 2025
eb89dd2
[misc] add shanghai meetup (#23535)
youkaichao Aug 25, 2025
f4e4c6f
[Attention] Unify mamba and attention backend selection (#23171)
ayushsatyam146 Aug 25, 2025
30f7baa
[Doc] Add caution for API server scale-out (#23550)
DarkLight1337 Aug 25, 2025
64530a3
[Refactor] Pass `tokenizer` explicitly instead of binding to prompt u…
DarkLight1337 Aug 25, 2025
258f16b
Updates to Flex + VLLm integration (#21416)
drisspg Aug 25, 2025
d3ff0c1
[Bugfix] Fix Qwen3 MoE GPTQ inference (#23490)
Isotr0py Aug 25, 2025
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -17,7 +17,7 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
7 changes: 7 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -843,3 +843,10 @@ steps:
commands:
- export VLLM_WORKER_MULTIPROC_METHOD=spawn
- pytest -s -v test_lm_eval_correctness.py --config-list-file=configs/models-large.txt --tp-size=4

- label: Qwen MoE EP Test # optional
gpu: h200
optional: true
num_gpus: 2
commands:
- CUDA_VISIBLE_DEVICES=1,2 VLLM_ALL2ALL_BACKEND=deepep_high_throughput VLLM_USE_DEEP_GEMM=1 VLLM_LOGGING_LEVEL=DEBUG python3 /vllm-workspace/examples/offline_inference/data_parallel.py --model Qwen/Qwen1.5-MoE-A2.7B --tp-size=1 --dp-size=2 --max-model-len 2048
27 changes: 27 additions & 0 deletions CMakeLists.txt
@@ -750,6 +750,33 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
"found in CUDA target architectures")
endif()
endif()

# Only build W4A8 kernels if we are building for something compatible with sm90a
cuda_archs_loose_intersection(W4A8_ARCHS "9.0a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.0 AND W4A8_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w4a8/w4a8_mm_entry.cu")

set_gencode_flags_for_srcs(
SRCS "${SRCS}"
CUDA_ARCHS "${W4A8_ARCHS}")

list(APPEND VLLM_EXT_SRC "${SRCS}")

message(STATUS "Building W4A8 kernels for archs: ${W4A8_ARCHS}")
else()
if (NOT ${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER_EQUAL 12.0
AND W4A8_ARCHS)
message(STATUS "Not building W4A8 kernels as CUDA Compiler version is "
"not >= 12.0, we recommend upgrading to CUDA 12.0 or "
"later if you intend on running w4a16 quantized models on "
"Hopper.")
else()
message(STATUS "Not building W4A8 kernels as no compatible archs "
"found in CUDA target architectures")
endif()
endif()

# if CUDA endif
endif()

3 changes: 2 additions & 1 deletion README.md
@@ -18,14 +18,15 @@ Easy, fast, and cheap LLM serving for everyone

*Latest News* 🔥

- [2025/08] We hosted [vLLM Shanghai Meetup](https://mp.weixin.qq.com/s/pDmAXHcN7Iqc8sUKgJgGtg) focusing on building, developing, and integrating with vLLM! Please find the meetup slides [here](https://drive.google.com/drive/folders/1OvLx39wnCGy_WKq8SiVKf7YcxxYI3WCH).
- [2025/08] We hosted [vLLM Beijing Meetup](https://mp.weixin.qq.com/s/dgkWg1WFpWGO2jCdTqQHxA) focusing on large-scale LLM deployment! Please find the meetup slides [here](https://drive.google.com/drive/folders/1Pid6NSFLU43DZRi0EaTcPgXsAzDvbBqF) and the recording [here](https://www.chaspark.com/#/live/1166916873711665152).
- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
- [2025/05] vLLM is now a hosted project under PyTorch Foundation! Please find the announcement [here](https://pytorch.org/blog/pytorch-foundation-welcomes-vllm/).
- [2025/01] We are excited to announce the alpha release of vLLM V1: A major architectural upgrade with 1.7x speedup! Clean code, optimized execution loop, zero-overhead prefix caching, enhanced multimodal support, and more. Please check out our blog post [here](https://blog.vllm.ai/2025/01/27/v1-alpha-release.html).

<details>
<summary>Previous News</summary>

- [2025/05] We hosted [NYC vLLM Meetup](https://lu.ma/c1rqyf1f)! Please find the meetup slides [here](https://docs.google.com/presentation/d/1_q_aW_ioMJWUImf1s1YM-ZhjXz8cUeL0IJvaquOYBeA/edit?usp=sharing).
- [2025/04] We hosted [Asia Developer Day](https://www.sginnovate.com/event/limited-availability-morning-evening-slots-remaining-inaugural-vllm-asia-developer-day)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/19cp6Qu8u48ihB91A064XfaXruNYiBOUKrBxAmDOllOo/edit?usp=sharing).
- [2025/03] We hosted [vLLM x Ollama Inference Night](https://lu.ma/vllm-ollama)! Please find the meetup slides from the vLLM team [here](https://docs.google.com/presentation/d/16T2PDD1YwRnZ4Tu8Q5r6n53c5Lr5c73UV9Vd2_eBo4U/edit?usp=sharing).
- [2025/03] We hosted [the first vLLM China Meetup](https://mp.weixin.qq.com/s/n77GibL2corAtQHtVEAzfg)! Please find the meetup slides from vLLM team [here](https://docs.google.com/presentation/d/1REHvfQMKGnvz6p3Fd23HhSO4c8j5WPGZV0bKYLwnHyQ/edit?usp=sharing).
77 changes: 77 additions & 0 deletions benchmarks/README.md
@@ -59,6 +59,12 @@ become available.
<td style="text-align: center;">✅</td>
<td><code>synthetic</code></td>
</tr>
<tr>
<td><strong>RandomMultiModal (Image/Video)</strong></td>
<td style="text-align: center;">🟡</td>
<td style="text-align: center;">🚧</td>
<td><code>synthetic</code> </td>
</tr>
<tr>
<td><strong>Prefix Repetition</strong></td>
<td style="text-align: center;">✅</td>
@@ -722,4 +728,75 @@ python benchmarks/benchmark_serving.py \
--endpoint /v1/chat/completion
```

### Synthetic Random Images (random-mm)

Generate synthetic image inputs alongside random text prompts to stress-test vision models without external datasets.

Notes:

- Works only with online benchmark via the OpenAI backend (`--backend openai-chat`) and endpoint `/v1/chat/completions`.
- Video sampling is not yet implemented.

Start the server (example):

```bash
vllm serve Qwen/Qwen2.5-VL-3B-Instruct \
--dtype bfloat16 \
--max-model-len 16384 \
--limit-mm-per-prompt '{"image": 3, "video": 0}' \
--mm-processor-kwargs max_pixels=1003520
```

Run the benchmark. It is recommended to use the `--ignore-eos` flag to simulate real responses; the output size can be set via `--random-output-len`.

Example 1: a fixed number of items and a single image resolution, enforcing generation of approximately 40 tokens:

```bash
vllm bench serve \
--backend openai-chat \
--model Qwen/Qwen2.5-VL-3B-Instruct \
--endpoint /v1/chat/completions \
--dataset-name random-mm \
--num-prompts 100 \
--max-concurrency 10 \
--random-prefix-len 25 \
--random-input-len 300 \
--random-output-len 40 \
--random-range-ratio 0.2 \
--random-mm-base-items-per-request 2 \
--random-mm-limit-mm-per-prompt '{"image": 3, "video": 0}' \
--random-mm-bucket-config '{(224, 224, 1): 1.0}' \
--request-rate inf \
--ignore-eos \
--seed 42
```

The number of items per request can be varied via the range ratio, and image resolutions can be diversified by passing multiple buckets:

```bash
--random-mm-base-items-per-request 2 \
--random-mm-num-mm-items-range-ratio 0.5 \
--random-mm-limit-mm-per-prompt '{"image": 4, "video": 0}' \
--random-mm-bucket-config '{(256, 256, 1): 0.7, (720, 1280, 1): 0.3}' \
```

Flags specific to `random-mm`:

- `--random-mm-base-items-per-request`: base number of multimodal items per request.
- `--random-mm-num-mm-items-range-ratio`: vary item count uniformly in the closed integer range [floor(n·(1−r)), ceil(n·(1+r))]. Set r=0 to keep it fixed; r=1 allows 0 items.
- `--random-mm-limit-mm-per-prompt`: per-modality hard caps, e.g. '{"image": 3, "video": 0}'.
- `--random-mm-bucket-config`: dict mapping (H, W, T) → probability. Entries with probability 0 are removed; remaining probabilities are renormalized to sum to 1. Use T=1 for images. Set any T>1 for videos (video sampling not yet supported).

Behavioral notes:

- If the requested base item count cannot be satisfied under the provided per-prompt limits, the tool raises an error rather than silently clamping.

How sampling works (a short sketch follows this list):

- Determine per-request item count k by sampling uniformly from the integer range defined by `--random-mm-base-items-per-request` and `--random-mm-num-mm-items-range-ratio`, then clamp k to at most the sum of per-modality limits.
- For each of the k items, sample a bucket (H, W, T) according to the normalized probabilities in `--random-mm-bucket-config`, while tracking how many items of each modality have been added.
- If a modality (e.g., image) reaches its limit from `--random-mm-limit-mm-per-prompt`, all buckets of that modality are excluded and the remaining bucket probabilities are renormalized before continuing.
This should be seen as an edge case; it can be avoided by setting `--random-mm-limit-mm-per-prompt` to a large number, although doing so may trigger errors against the engine's own `--limit-mm-per-prompt` config.
- The resulting request contains synthetic image data in `multi_modal_data` (OpenAI Chat format). When `random-mm` is used with the OpenAI Chat backend, prompts remain text and MM content is attached via `multi_modal_data`.
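
The selection logic can be pictured with a minimal Python sketch. This is an illustrative approximation of the behavior described above, not the benchmark's actual implementation; the helper name `sample_mm_items` is hypothetical:

```python
import math
import random


def sample_mm_items(base_items, range_ratio, bucket_config, limit_mm_per_prompt, seed=42):
    """Illustrative approximation of random-mm item sampling (not the real vLLM code)."""
    rng = random.Random(seed)
    modality = lambda bucket: "image" if bucket[2] == 1 else "video"

    # 1. Sample k uniformly from [floor(n*(1-r)), ceil(n*(1+r))], clamped to the sum of limits.
    lo = math.floor(base_items * (1 - range_ratio))
    hi = math.ceil(base_items * (1 + range_ratio))
    k = min(rng.randint(lo, hi), sum(limit_mm_per_prompt.values()))

    # 2. Drop zero-probability buckets; remaining weights act as renormalized probabilities.
    buckets = {b: p for b, p in bucket_config.items() if p > 0}
    counts = {m: 0 for m in limit_mm_per_prompt}
    items = []
    for _ in range(k):
        # Exclude buckets whose modality already hit its per-prompt limit, then resample.
        allowed = {b: p for b, p in buckets.items()
                   if counts[modality(b)] < limit_mm_per_prompt[modality(b)]}
        if not allowed:
            break
        chosen = rng.choices(list(allowed), weights=list(allowed.values()))[0]
        counts[modality(chosen)] += 1
        items.append(chosen)  # (H, W, T) bucket for which synthetic data is generated
    return items


# Example: a base of 2 items with r=0.5 gives k in [1, 3], capped at 4 images per prompt.
print(sample_mm_items(2, 0.5, {(256, 256, 1): 0.7, (720, 1280, 1): 0.3}, {"image": 4, "video": 0}))
```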

</details>
33 changes: 33 additions & 0 deletions benchmarks/kernels/benchmark_machete.py
@@ -284,6 +284,25 @@ def machete_create_bench_fn(
)


def cutlass_w4a8_create_bench_fn(
bt: BenchmarkTensors, out_type=torch.dtype, schedule=None
) -> Callable:
w_q = bt.w_q.t().contiguous().t() # make col major
w_q = ops.cutlass_encode_and_reorder_int4b(w_q)
# expects fp8 scales
w_s = ops.cutlass_pack_scale_fp8(bt.w_g_s.to(torch.float8_e4m3fn))

return lambda: ops.cutlass_w4a8_mm(
a=bt.a,
b_q=w_q,
b_group_scales=w_s,
b_group_size=bt.group_size,
b_channel_scales=bt.w_ch_s,
a_token_scales=bt.w_tok_s,
maybe_schedule=schedule,
)


# impl

# bench
@@ -385,6 +404,20 @@ def bench(
)
)

# cutlass w4a8
if types.act_type == torch.float8_e4m3fn and group_size == 128:
timers.append(
bench_fns(
label,
sub_label,
f"cutlass w4a8 ({name_type_string})",
[
cutlass_w4a8_create_bench_fn(bt, out_type=types.output_type)
for bt in benchmark_tensors
],
)
)

if sweep_schedules:
global _SWEEP_SCHEDULES_RESULTS

50 changes: 37 additions & 13 deletions benchmarks/kernels/benchmark_trtllm_decode_attention.py
@@ -9,8 +9,11 @@
import flashinfer
import torch

from vllm.utils import round_up

FLOAT32_BYTES = torch.finfo(torch.float).bits // 8
FP8_DTYPE = torch.float8_e4m3fn
FP4_DTYPE = torch.uint8


def to_float8(x, dtype=torch.float8_e4m3fn):
@@ -61,28 +64,27 @@ def benchmark_decode(
else:
raise ValueError(f"Invalid kv_layout: {kv_layout}")

query = torch.randn(batch_size, num_qo_heads, head_size, dtype=dtype)
# Always using 1.0 scale to reflect the real perf in benchmarking
q_scale = 1.0
ref_query = torch.randn(batch_size, num_qo_heads, head_size, dtype=dtype)
if q_quant_dtype == FP8_DTYPE:
query, q_scale = to_float8(query)
ref_query = query.to(dtype) * q_scale
query, _ = to_float8(ref_query)
else:
q_scale = 1.0
ref_query = query
query = ref_query

kv_lens = torch.randint(1, max_seq_len, (batch_size,), dtype=torch.int32)
kv_lens[-1] = max_seq_len

seq_lens = kv_lens
max_seq_len = torch.max(seq_lens).item()

kv_cache = torch.randn(kv_cache_shape, dtype=dtype)
# Always using 1.0 scale to reflect the real perf in benchmarking
k_scale = v_scale = 1.0
ref_kv_cache = torch.randn(kv_cache_shape, dtype=dtype)
if kv_quant_dtype == FP8_DTYPE:
kv_cache, kv_scale = to_float8(kv_cache)
ref_kv_cache = kv_cache.to(dtype) * kv_scale
kv_cache, _ = to_float8(ref_kv_cache)
else:
kv_scale = 1.0
ref_kv_cache = kv_cache
k_scale = v_scale = kv_scale
kv_cache = ref_kv_cache

max_num_blocks_per_seq = (max_seq_len + block_size - 1) // block_size
block_tables = torch.randint(
@@ -142,11 +144,31 @@ def time_fn(fn, warmup=10, trials=20):
return sum(times) / len(times), torch.std(torch.tensor(times))

o_scale = 1.0
o_sf_scale = None
output_baseline = torch.empty(ref_query.shape, dtype=dtype)
output_trtllm = torch.empty(query.shape, dtype=o_quant_dtype)
if o_quant_dtype == FP4_DTYPE:
o_sf_scale = 500.0
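# FP4 packs two 4-bit values per uint8, so the data tensor halves the last dim;
# the second tensor holds FP8 block scales (one per 16 elements), padded to
# multiples of 128 x 4 as the shapes below show.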
output_trtllm = flashinfer.utils.FP4Tensor(
torch.empty(query.shape[:-1] + (query.shape[-1] // 2,), dtype=torch.uint8),
torch.empty(
(
round_up(query.shape[0], 128),
round_up(query.shape[1] * query.shape[2] // 16, 4),
),
dtype=torch.float8_e4m3fn,
),
)
else:
output_trtllm = torch.empty(query.shape, dtype=o_quant_dtype)

def baseline_decode():
return wrapper.run(ref_query, ref_kv_cache, out=output_baseline)
return wrapper.run(
ref_query,
ref_kv_cache,
k_scale=k_scale,
v_scale=v_scale,
out=output_baseline,
)

def trtllm_decode():
return flashinfer.decode.trtllm_batch_decode_with_kv_cache(
@@ -158,6 +180,7 @@ def trtllm_decode():
max_seq_len=max_seq_len,
bmm1_scale=q_scale * k_scale * sm_scale,
bmm2_scale=v_scale / o_scale,
o_sf_scale=o_sf_scale,
out=output_trtllm,
)

@@ -237,6 +260,7 @@ def write_results_to_csv(results, filename=None):
(None, None, None),
(None, FP8_DTYPE, None),
(FP8_DTYPE, FP8_DTYPE, FP8_DTYPE),
(FP8_DTYPE, FP8_DTYPE, FP4_DTYPE),
]

for quant_dtype in quant_dtypes: