Draft pull request: changes from all commits (109 commits)
b692e9c
[Misc] Fix skipped max-model-len validation when deriving max model l…
yeqcharlotte Jun 16, 2025
a77aea5
[TPU] support attention head dim smaller than 128 (#19620)
yaochengji Jun 16, 2025
26bc46e
[MISC] typo fix (#19672)
andyxning Jun 16, 2025
f40f763
[CI] Add mteb testing for rerank models (#19344)
noooop Jun 16, 2025
8d12070
[Docs] Move multiproc doc to v1 dir (#19651)
russellb Jun 16, 2025
dec66d2
[Kernel] GGUF MMVQ kernel for multiple input vectors (#18754)
SzymonOzog Jun 16, 2025
ee35e96
[BugFix] Don't catch BaseException when dumping execute_model errors …
njhill Jun 16, 2025
3e75069
[DOC] Add reasoning capability to vLLM streamlit code (#19557)
Navanit-git Jun 16, 2025
4d54240
[Feature]:Allow for Granite MoE Hybrid models with _only_ shared expe…
shawntan Jun 16, 2025
1173804
[Bugfix] Fix TP inference for Flex attention backend (#19657)
Isotr0py Jun 16, 2025
c3fec47
[MISC] bump huggingface_hub pkg to 0.33.0 (#19547)
andyxning Jun 16, 2025
836d4ce
[Bugfix] fix missing 'finish_reason': null in streaming chat (#19662)
chaunceyjiang Jun 16, 2025
5e5baa9
[Kernels] Use empty for modular MoE workspaces (#19667)
bnellnm Jun 16, 2025
387bdf0
[Model] Add support for MiniMaxM1ForCausalLM (shares architecture wit…
qscqesze Jun 16, 2025
90f9c2e
[V1] Change return type on get_multimodal_embeddings() (#19446)
russellb Jun 16, 2025
6bc7b57
[Quantization] Remove FP4 emulation; Fall-back to marlin for device <…
dsikka Jun 16, 2025
0860087
[Fix] Fall back to Gloo when NCCL backend is unavailable (#19641)
conroy-cheers Jun 17, 2025
119f683
[doc] add project flag to gcloud TPU command (#19664)
davidxia Jun 17, 2025
0733495
[Wheel Size] Only build FA2 8.0+PTX (#19336)
LucasWilkinson Jun 17, 2025
ede5c4e
[Frontend] add chunking audio for > 30s audio (#19597)
nguyenhoangthuan99 Jun 17, 2025
5b3ad5e
[DOC] fix doc typos (#19600)
diliu0349 Jun 17, 2025
ddfed31
Fixes IMA for TP w/ flex-attention (#19712)
drisspg Jun 17, 2025
5c76b9c
[Core] add remove_seq_from_computed_blocks_tracker to BlockSpaceManag…
quanliu1991 Jun 17, 2025
aed8468
[Doc] Add missing llava family multi-image examples (#19698)
Isotr0py Jun 17, 2025
c48c6c4
Add a doc on how to update PyTorch version (#19705)
huydhn Jun 17, 2025
ccd7c05
[Kernel] Add Split-KV Support to Unified Triton Attention Kernel (#19…
jvlunteren Jun 17, 2025
154d063
[doc][mkdocs] Add edit button to documentation (#19637)
reidliu41 Jun 17, 2025
93aee29
[doc] split "Other AI Accelerators" tabs (#19708)
davidxia Jun 17, 2025
4c8f64f
[V1][Kernel] Flashinfer HND KV cache layout (#19280)
NickLucche Jun 17, 2025
5a1c2e1
[Mis] remove duplicate engine status checks (#19647)
googs1025 Jun 17, 2025
ca94d7f
[Bugfix] Update multimodel models mapping to fit new checkpoint after…
Isotr0py Jun 17, 2025
ffb2cd6
[Perf] Optimize `moe_align_block_size` CUDA kernel (#19572)
yewentao256 Jun 17, 2025
bf57ccc
Remove sm120 arch from sm100 cutlass kernel arch list (#19716)
mgoin Jun 17, 2025
cda9230
[Misc] Update lmcache connector with the latest connector apis (#19441)
YaoJiayi Jun 17, 2025
b447624
[Bugfix] Fix faulty triton importing logic when using Ray for DP (#19…
mgoin Jun 17, 2025
a44b1c9
[Feature][ROCm] Add full graph capture support for TritonAttentionBac…
charlifu Jun 17, 2025
dac8cc4
[TPU] Update torch version to include paged attention kernel change (…
Chenyaaang Jun 17, 2025
c53711b
[MISC] correct copy_blocks src_to_dists param type (#19696)
andyxning Jun 18, 2025
6e9cc73
[MISC] correct DeviceConfig device field static type analysis (#19699)
andyxning Jun 18, 2025
d4629dc
[Misc] Add __str__ for RequestStatus (#19780)
lk-chen Jun 18, 2025
5f52a84
[V1] Add API docs for EncoderCacheManager (#19294)
russellb Jun 18, 2025
eccdc83
[V1][P/D] An native implementation of xPyD based on P2P NCCL (#18242)
Abatom Jun 18, 2025
19a53b2
[V1] Decouple GPU and TPU `InputBatch` (#19778)
afeldman-nm Jun 18, 2025
f04d604
[Minor] Zero-initialize attn output buffer (#19784)
WoosukKwon Jun 18, 2025
cca91a7
[doc] fix the incorrect label (#19787)
reidliu41 Jun 18, 2025
257ab95
[Platform] Allow platform use V1 Engine by default (#19792)
wangxiyuan Jun 18, 2025
735a9de
[Qwen] Add tagging rule for Qwen related PRs (#19799)
houseroad Jun 18, 2025
8b6e1d6
[Hardware][AMD] integrate aiter chunked prefill into vllm (#18596)
Zzz9990 Jun 18, 2025
12575cf
[Bugfix] fix RAY_CGRAPH_get_timeout is not set successfully (#19725)
chaunceyjiang Jun 18, 2025
ffacb22
[Docs] Add Huzaifa Sidhpurwala to vuln mgmt team doc (#19808)
russellb Jun 18, 2025
a89209b
[v1] Support mamba2 (#19327)
heheda12345 Jun 18, 2025
9206d0f
docs: fix Slack bulletpoint in README (#19811)
nathan-weinberg Jun 18, 2025
16c1630
Disable "Forbid direct 'import triton'" check for `vllm/triton_utils/…
afeldman-nm Jun 18, 2025
3b523e3
[Core] Do not copy array during hashing (#19484)
lgeiger Jun 18, 2025
04fefe7
[TPU] Update torch-xla version to include paged attention tuned block…
QiliangCui Jun 18, 2025
14fdd21
[Core] More fixes to MultiModalEmbeddings type handling (#19715)
russellb Jun 18, 2025
d49adea
[Multimodal] Use fast processor for Qwen2/2.5-VL (#19789)
WoosukKwon Jun 18, 2025
ed33349
[BugFix] Fix use_cudagraph=False (#19612)
zou3519 Jun 19, 2025
dfada85
[Frontend] Expose custom args in OpenAI APIs (#16862)
afeldman-nm Jun 19, 2025
36239f7
Fix FA2 fallback for Blackwell V1 (#19781)
mgoin Jun 19, 2025
8d1e89d
[Misc][ROCm] Enforce no unused variable in ROCm C++ files (#19796)
houseroad Jun 19, 2025
4959915
[Quantization] Modify the logic of BNB double quantization (#19742)
jeejeelee Jun 19, 2025
799397e
Support embedding models in V1 (#16188)
maxdebayser Jun 19, 2025
b1098b4
[Bugfix] Fix the linter (#19826)
houseroad Jun 19, 2025
e2148dc
[Bugfix] Add check_health to v1 async client. (#19821)
kouroshHakha Jun 19, 2025
83ca9ae
Mark invariant normalizer in Gemma as non-persistent (#19788)
yhtang Jun 19, 2025
2de12be
[ROCm] [AITER] [Bugfix] Patch for AITER commit `648764942e552a8bb5fe1…
tjtanaa Jun 19, 2025
aa20d10
[Misc] [ROCm] Prevent surplus tensor reshape (#19803)
zsolt-borbely-htec Jun 19, 2025
c7b370c
raise exception for pin_lora (#19809)
andyxning Jun 19, 2025
6021999
[Minor] Allow redirecting model path for HfRunner in test (#19795)
Isotr0py Jun 19, 2025
1d0ae26
Add xLAM tool parser support (#17148)
zuxin666 Jun 19, 2025
466166d
[Frontend] Add optional token-level progress bar to `LLM.beam_search`…
NekoMimiUnagi Jun 19, 2025
4719460
Fixing Chunked Prefill Test. (#19762)
Alexei-V-Ivanov-AMD Jun 19, 2025
6f68c49
[Doc] Update V1 user guide for embedding models (#19842)
22quinn Jun 19, 2025
01220ce
[CI][CPU] Improve dummy Triton interfaces and fix the CPU CI (#19838)
bigPYJ1151 Jun 19, 2025
ead2110
[Core][Bugfix] Fix Online MM Beam Search (#19688)
alex-jw-brooks Jun 19, 2025
ea10dd9
[Frontend] early return chat format resolution when specified (#19735)
xzbdmw Jun 19, 2025
10d82f9
[Benchmark][Bugfix] Fix Dataset Length Calculation (#19868)
robertgshaw2-redhat Jun 20, 2025
ee9a153
[CI/Build][Bugfix] Fix deadlock on v1 engine test CI (#19872)
Isotr0py Jun 20, 2025
b6bad3d
[CI][Neuron] Fail and exit on first error (#19622)
elaineyz Jun 20, 2025
5aa4a01
[Benchmark] Fix `Value of type "SampleRequest" is not indexable` (#18…
b8zhong Jun 20, 2025
e41bf15
[Chore]: qwen3-moe-type-hints-mistake (#19860)
Xerxes-cn Jun 20, 2025
e3a3e4d
[Bugfix] Enable PP with AITER+V1 (#19822)
qli88 Jun 20, 2025
5e666f7
[Bugfix][Ray] Set the cuda context eagerly in the ray worker (#19583)
kouroshHakha Jun 20, 2025
089a306
[Misc] update cuda version (#19526)
reidliu41 Jun 20, 2025
e384f2f
[Misc] refactor example - openai_transcription_client (#19851)
reidliu41 Jun 20, 2025
71d1219
[Kernel] correct cpu worker function parameter type (#19745)
andyxning Jun 20, 2025
7771d1d
[Fix] import regex instead of re (#19875)
tdoublep Jun 20, 2025
f1e840e
[Model] GPT2ForSequenceClassification model (#19663)
nie3e Jun 20, 2025
7e8977f
[custom_op][vllm-plugin] update custom_op class to use op_registry (#…
xuechendi Jun 20, 2025
2e3e3c8
Export NaNs in logits to scheduler_stats if output is corrupted (#18777)
vladmihailescu Jun 20, 2025
79f2f1c
[CPU][CI] Fallback sliding window to v0 and fix CPU pooling model tes…
bigPYJ1151 Jun 20, 2025
71baf85
[Kernel] mark TorchSDPABackend swap_blocks NotImplementedError (#19749)
andyxning Jun 20, 2025
e773a9e
[Misc] Clean up useless code (#19889)
wangxiyuan Jun 20, 2025
8ca81bb
Fix: Check the type of params to be a Sequence not list. (#19910)
rabinadk1 Jun 20, 2025
6f170f1
[Bugfix] Fix bnb 8bit model weights loading (#19917)
Isotr0py Jun 21, 2025
c3bf9ba
[New model support]Support Tarsier2 (#19887)
princepride Jun 21, 2025
ba347cd
add option "--expand-tools-even-if-tool-choice-none"
okdshin Jun 13, 2025
5b1b5e7
fix format
okdshin Jun 13, 2025
38eea42
fix format
okdshin Jun 14, 2025
bc031d9
remove the new option and change the default behaviour
okdshin Jun 17, 2025
b5e4164
Revert "remove the new option and change the default behaviour"
okdshin Jun 19, 2025
e3733a9
update help
okdshin Jun 21, 2025
1ffd7f0
add warn
okdshin Jun 21, 2025
e282ebb
fix docs
okdshin Jun 21, 2025
b51c7b9
fix format
okdshin Jun 21, 2025
0a0920a
fix format
okdshin Jun 21, 2025
90eabb2
Apply suggestions from code review
okdshin Jun 22, 2025
8a45fa8
remove assert
okdshin Jun 22, 2025
.buildkite/scripts/hardware_ci/run-neuron-test.sh (3 changes: 2 additions & 1 deletion)
@@ -54,10 +54,11 @@ docker run --rm -it --device=/dev/neuron0 --network bridge \
--name "${container_name}" \
${image_name} \
/bin/bash -c "
set -e; # Exit on first error
python3 /workspace/vllm/examples/offline_inference/neuron.py;
python3 -m pytest /workspace/vllm/tests/neuron/1_core/ -v --capture=tee-sys;
for f in /workspace/vllm/tests/neuron/2_core/*.py; do
echo 'Running test file: '$f;
echo \"Running test file: \$f\";
python3 -m pytest \$f -v --capture=tee-sys;
done
"
.buildkite/test-pipeline.yaml (11 changes: 10 additions & 1 deletion)
@@ -89,7 +89,7 @@ steps:
- VLLM_TEST_ENABLE_ARTIFICIAL_PREEMPT=1 pytest -v -s basic_correctness/test_preemption.py

- label: Chunked Prefill Test
mirror_hardwares: [amdexperimental]
mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies:
- vllm/
- tests/basic_correctness/test_chunked_prefill
@@ -271,6 +271,15 @@
commands:
- pytest -v -s prefix_caching


- label: Platform Tests (CUDA)
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/
- tests/cuda
commands:
- pytest -v -s cuda/test_cuda_context.py

- label: Samplers Test # 36min
mirror_hardwares: [amdexperimental]
source_file_dependencies:
.github/mergify.yml (15 changes: 15 additions & 0 deletions)
@@ -65,6 +65,21 @@ pull_request_rules:
add:
- multi-modality

- name: label-qwen
description: Automatically apply qwen label
conditions:
- or:
- files~=^examples/.*qwen.*\.py
- files~=^tests/.*qwen.*\.py
- files~=^vllm/model_executor/models/.*qwen.*\.py
- files~=^vllm/reasoning/.*qwen.*\.py
- title~=(?i)Qwen
- body~=(?i)Qwen
actions:
label:
add:
- qwen

- name: label-rocm
description: Automatically apply rocm label
conditions:
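As a quick way to see what the new `label-qwen` rule will catch, the `files~=` patterns above can be exercised with Python's `re` module. This is only an illustrative sketch: the paths are made up for the example, and it covers just the file conditions, not the title/body matches, whose exact semantics belong to Mergify.

```python
import re

# The files~= patterns from the label-qwen rule above.
patterns = [
    r"^examples/.*qwen.*\.py",
    r"^tests/.*qwen.*\.py",
    r"^vllm/model_executor/models/.*qwen.*\.py",
    r"^vllm/reasoning/.*qwen.*\.py",
]

# Made-up paths, used only to illustrate which files would trigger the label.
candidate_paths = [
    "vllm/model_executor/models/qwen2_vl.py",  # expected to match
    "tests/models/test_qwen.py",               # expected to match
    "vllm/attention/backends/flash_attn.py",   # expected not to match
]

for path in candidate_paths:
    matched = any(re.search(pattern, path) for pattern in patterns)
    print(f"{path}: {'qwen label' if matched else 'no label'}")
```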
CMakeLists.txt (6 changes: 3 additions & 3 deletions)
@@ -420,9 +420,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
endif()
endif()

# The cutlass_scaled_mm kernels for Blackwell (c3x, i.e. CUTLASS 3.x) require
# CUDA 12.8 or later
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a;12.0a" "${CUDA_ARCHS}")
# The cutlass_scaled_mm kernels for Blackwell SM100 (c3x, i.e. CUTLASS 3.x)
# require CUDA 12.8 or later
cuda_archs_loose_intersection(SCALED_MM_ARCHS "10.0a;10.1a" "${CUDA_ARCHS}")
if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.8 AND SCALED_MM_ARCHS)
set(SRCS
"csrc/quantization/cutlass_w8a8/scaled_mm_c3x_sm100.cu"
README.md (2 changes: 1 addition & 1 deletion)
@@ -156,7 +156,7 @@ If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs

- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues) or [Discussions](https://github.com/vllm-project/vllm/discussions)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [vllm-questions@lists.berkeley.edu](mailto:vllm-questions@lists.berkeley.edu)

benchmarks/benchmark_dataset.py (2 changes: 1 addition & 1 deletion)
@@ -353,7 +353,7 @@ def sample(
: input_lens[i]
]
prompt = tokenizer.decode(re_encoded_sequence)
total_input_len = prefix_len + int(input_lens[i])
total_input_len = len(re_encoded_sequence)
requests.append(
SampleRequest(
prompt=prompt,
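The `total_input_len` fix above counts the tokens that actually end up in the prompt (the decoded, re-encoded, and truncated sequence) instead of the nominal `prefix_len + input_len`, which can drift because tokenizer round trips are not guaranteed to preserve token counts. A rough sketch of the same bookkeeping, assuming a Hugging Face tokenizer is available locally; the model name and lengths are illustrative:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer

prefix_len, input_len = 4, 8  # nominal budget for one sampled request
token_ids = tokenizer.encode("vLLM is a fast and easy-to-use library for LLM inference")

# Mirror the sample() logic: decode, re-encode, then truncate to the target length.
re_encoded_sequence = tokenizer.encode(tokenizer.decode(token_ids))[:input_len]
prompt = tokenizer.decode(re_encoded_sequence)

total_input_len = len(re_encoded_sequence)  # the fixed computation: count what is sent
nominal_len = prefix_len + input_len        # the old computation, which can drift
print(prompt)
print(total_input_len, nominal_len)
```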
benchmarks/benchmark_throughput.py (2 changes: 1 addition & 1 deletion)
@@ -97,7 +97,7 @@ def run_vllm(
assert lora_requests is None, "BeamSearch API does not support LoRA"
prompts = [request.prompt for request in requests]
# output_len should be the same for all requests.
output_len = requests[0][2]
output_len = requests[0].expected_output_len
for request in requests:
assert request.expected_output_len == output_len
start = time.perf_counter()
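The indexing fix above is needed because `SampleRequest` is a plain dataclass: it supports attribute access but not positional indexing, which is exactly the `Value of type "SampleRequest" is not indexable` error in the commit title. A self-contained sketch with a simplified stand-in class (field names chosen for the example):

```python
from dataclasses import dataclass

@dataclass
class SampleRequest:  # simplified stand-in for the benchmark's SampleRequest dataclass
    prompt: str
    prompt_len: int
    expected_output_len: int

req = SampleRequest(prompt="hello", prompt_len=1, expected_output_len=128)

print(req.expected_output_len)  # 128: attribute access, as in the fixed line
try:
    req[2]  # positional indexing is not defined for a plain dataclass
except TypeError as err:
    print("not indexable:", err)
```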
benchmarks/kernels/benchmark_moe_align_block_size.py (159 changes: 159 additions & 0 deletions)
@@ -0,0 +1,159 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
import argparse
import itertools

import torch

from vllm import _custom_ops as ops
from vllm.model_executor.layers.fused_moe.moe_align_block_size import (
moe_align_block_size_triton,
)
from vllm.triton_utils import triton


def get_topk_ids(num_tokens: int, num_experts: int, topk: int) -> torch.Tensor:
return torch.stack(
[
torch.randperm(num_experts, dtype=torch.int32, device="cuda")[:topk]
for _ in range(num_tokens)
]
)


def check_correctness(num_tokens, num_experts=256, block_size=256, topk=8):
"""
Verifies vllm vs. Triton
"""
topk_ids = get_topk_ids(num_tokens, num_experts, topk)

# 1. malloc space for triton and vllm
# malloc enough space (max_num_tokens_padded) for the sorted ids
max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids_triton = torch.empty(
(max_num_tokens_padded,), dtype=torch.int32, device="cuda"
)
sorted_ids_triton.fill_(topk_ids.numel()) # fill with sentinel value
expert_ids_triton = torch.zeros(
(max_num_tokens_padded // block_size,), dtype=torch.int32, device="cuda"
)
num_tokens_post_pad_triton = torch.empty((1,), dtype=torch.int32, device="cuda")

sorted_ids_vllm = torch.empty_like(sorted_ids_triton)
sorted_ids_vllm.fill_(topk_ids.numel())
expert_ids_vllm = torch.zeros_like(expert_ids_triton)
num_tokens_post_pad_vllm = torch.empty_like(num_tokens_post_pad_triton)

# 2. run implementations
moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids_triton,
expert_ids_triton,
num_tokens_post_pad_triton,
)

ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids_vllm,
expert_ids_vllm,
num_tokens_post_pad_vllm,
)
print(f"✅ VLLM implementation works with {num_experts} experts!")

# 3. compare results
if torch.allclose(expert_ids_triton, expert_ids_vllm) and torch.allclose(
num_tokens_post_pad_triton, num_tokens_post_pad_vllm
):
print("✅ Triton and VLLM implementations match.")
else:
print("❌ Triton and VLLM implementations DO NOT match.")
print("Triton expert_ids:", expert_ids_triton)
print("VLLM expert_ids:", expert_ids_vllm)
print("Triton num_tokens_post_pad:", num_tokens_post_pad_triton)
print("VLLM num_tokens_post_pad:", num_tokens_post_pad_vllm)


# test configurations
num_tokens_range = [1, 16, 256, 4096]
num_experts_range = [16, 64, 224, 256, 280, 512]
topk_range = [1, 2, 8]
configs = list(itertools.product(num_tokens_range, num_experts_range, topk_range))


@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=["num_tokens", "num_experts", "topk"],
x_vals=configs,
line_arg="provider",
line_vals=["vllm", "triton"], # "triton"
line_names=["VLLM", "Triton"], # "Triton"
plot_name="moe-align-block-size-performance",
args={},
)
)
def benchmark(num_tokens, num_experts, topk, provider):
"""Benchmark function for Triton."""
block_size = 256
topk_ids = get_topk_ids(num_tokens, num_experts, topk)

max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)
sorted_ids = torch.empty((max_num_tokens_padded,), dtype=torch.int32, device="cuda")
sorted_ids.fill_(topk_ids.numel())
max_num_m_blocks = max_num_tokens_padded // block_size
expert_ids = torch.empty((max_num_m_blocks,), dtype=torch.int32, device="cuda")
num_tokens_post_pad = torch.empty((1,), dtype=torch.int32, device="cuda")

quantiles = [0.5, 0.2, 0.8]

if provider == "vllm":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: ops.moe_align_block_size(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
quantiles=quantiles,
)
elif provider == "triton":
ms, min_ms, max_ms = triton.testing.do_bench(
lambda: moe_align_block_size_triton(
topk_ids,
num_experts,
block_size,
sorted_ids.clone(),
expert_ids.clone(),
num_tokens_post_pad.clone(),
),
quantiles=quantiles,
)

return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--num_experts",
type=int,
default=64,
choices=[8, 16, 32, 64, 128, 256],
)
parser.add_argument(
"--topk",
type=int,
default=8,
choices=[2, 4, 8],
help="Top-k value for correctness check.",
)
args = parser.parse_args()

print("Running correctness check...")
check_correctness(num_tokens=1024, num_experts=args.num_experts, topk=args.topk)
benchmark.run(print_data=True, show_plots=True)
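For context on the allocation in the script above: `max_num_tokens_padded = topk_ids.numel() + num_experts * (block_size - 1)` reserves space for the worst case in which every expert's token count has to be rounded up to a full block, so each expert contributes at most `block_size - 1` padding slots. A back-of-the-envelope check with sizes drawn from the benchmark's test ranges:

```python
# Sizes drawn from the benchmark's test ranges above.
num_tokens, topk, num_experts, block_size = 4096, 8, 256, 256

total_topk_ids = num_tokens * topk                   # topk_ids.numel()
worst_case_padding = num_experts * (block_size - 1)  # every expert one slot short of a block
max_num_tokens_padded = total_topk_ids + worst_case_padding

print(total_topk_ids, worst_case_padding, max_num_tokens_padded)  # 32768 65280 98048
```

With vllm, Triton, and a CUDA GPU available, the script can be run as `python benchmarks/kernels/benchmark_moe_align_block_size.py --num_experts 256 --topk 8`, which performs the correctness check and then prints the perf report.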
cmake/external_projects/vllm_flash_attn.cmake (2 changes: 1 addition & 1 deletion)
@@ -38,7 +38,7 @@ else()
FetchContent_Declare(
vllm-flash-attn
GIT_REPOSITORY https://github.com/vllm-project/flash-attention.git
GIT_TAG 8798f27777fb57f447070301bf33a9f9c607f491
GIT_TAG 763ad155a1c826f71ff318f41edb1e4e5e376ddb
GIT_PROGRESS TRUE
# Don't share the vllm-flash-attn build between build types
BINARY_DIR ${CMAKE_BINARY_DIR}/vllm-flash-attn
cmake/utils.cmake (1 change: 1 addition & 0 deletions)
@@ -122,6 +122,7 @@ function (get_torch_gpu_compiler_flags OUT_GPU_FLAGS GPU_LANG)
"-DENABLE_FP8"
"-U__HIP_NO_HALF_CONVERSIONS__"
"-U__HIP_NO_HALF_OPERATORS__"
"-Werror=unused-variable"
"-fno-gpu-rdc")

endif()