Closed
552 commits
7b09c92
[BugFix]: Batch generation from prompt_embeds fails for long prompts …
KazusatoOoko Jul 24, 2025
c0a91ba
[BugFix] Fix KVConnector TP worker aggregation (#21473)
njhill Jul 24, 2025
648cc37
[DP] Internal Load Balancing Per Node [`one-pod-per-node`] (#21238)
robertgshaw2-redhat Jul 24, 2025
7228303
Dump input metadata on crash for async scheduling (#21258)
WoosukKwon Jul 24, 2025
1d44e07
[BugFix] Set CUDA_VISIBLE_DEVICES before spawning the subprocesses (#…
yinghai Jul 24, 2025
972af6b
Add think chunk (#21333)
juliendenize Jul 24, 2025
eca7bb1
Deduplicate Transformers backend code using inheritance (#21461)
hmellor Jul 24, 2025
e4fa7e2
[Bugfix][ROCm] Fix for warp_size uses on host (#21205)
gshtras Jul 24, 2025
7614653
[TPU][Bugfix] fix moe layer (#21340)
yaochengji Jul 24, 2025
a02bedc
[v1][Core] Clean up usages of `SpecializedManager` (#21407)
zhouwfang Jul 24, 2025
4990b3a
[Misc] Fix duplicate FusedMoEConfig debug messages (#21455)
njhill Jul 24, 2025
c630bf2
[Core] Support model loader plugins (#21067)
22quinn Jul 24, 2025
a2a71ff
remove GLM-4.5 quantization wrong Code (#21435)
zRzRzRzRzRzRzR Jul 24, 2025
c9a71b7
Replace `--expand-tools-even-if-tool-choice-none` with `--exclude-too…
okdshin Jul 24, 2025
426adc0
[Misc] Improve comment for DPEngineCoreActor._set_cuda_visible_device…
ruisearch42 Jul 24, 2025
340d23b
[Feat] Allow custom naming of vLLM processes (#21445)
chaunceyjiang Jul 24, 2025
6d84270
bump `flashinfer` to `v0.2.8` (#21385)
cjackal Jul 24, 2025
841628b
[Attention] Optimize FlashInfer MetadataBuilder Build call (#21137)
LucasWilkinson Jul 24, 2025
b355317
[Model] Officially support Emu3 with Transformers backend (#21319)
hmellor Jul 24, 2025
8d0cf97
[Bugfix] Fix CUDA arch flags for MoE permute (#21426)
minosfuture Jul 24, 2025
54870f6
[Fix] Update mamba_ssm to 2.2.5 (#21421)
elvischenv Jul 24, 2025
35f6013
[Docs] Update Tensorizer usage documentation (#21190)
sangstar Jul 24, 2025
79796dc
[Docs] Rewrite Distributed Inference and Serving guide (#20593)
crypdick Jul 24, 2025
99277f8
[Bug] Fix Compressed Tensor NVFP4 `cutlass_fp4_group_mm` illegal memo…
yewentao256 Jul 24, 2025
3ee60ba
Update flashinfer CUTLASS MoE Kernel (#21408)
wenscarl Jul 24, 2025
16ad88e
[XPU] Conditionally import CUDA-specific passes to avoid import error…
chaojun-zhang Jul 24, 2025
bf8af92
[P/D] Move FakeNixlWrapper to test dir (#21328)
ruisearch42 Jul 24, 2025
2bfe64e
The logging within the EmbeddingMixin class has been optimized, and t…
x22x22 Jul 24, 2025
c35f334
[P/D] Support CPU Transfer in NixlConnector (#18293)
juncgu Jul 24, 2025
e639d1b
[Docs][minor] Fix broken gh-file link in distributed serving docs (#2…
crypdick Jul 24, 2025
106943d
[Docs] Add Expert Parallelism Initial Documentation (#21373)
simon-mo Jul 24, 2025
0d98f49
update flashinfer to v0.2.9rc1 (#21485)
weireweire Jul 24, 2025
0d1a43f
[TPU][TEST] HF_HUB_DISABLE_XET=1 the test 3. (#21539)
QiliangCui Jul 24, 2025
bcfbeca
[MoE] More balanced expert sharding (#21497)
WoosukKwon Jul 24, 2025
14c7ae7
[Frontend] `run-batch` supports V1 (#21541)
DarkLight1337 Jul 25, 2025
32164eb
[Docs] Fix `site_url` for RunLLM (#21564)
hmellor Jul 25, 2025
a3fd326
[Bug] Fix DeepGemm Init Error (#21554)
yewentao256 Jul 25, 2025
82d366e
Fix GLM-4 PP Missing Layer When using with PP. (#21531)
zRzRzRzRzRzRzR Jul 25, 2025
687a044
[Kernel] adding fused_moe configs for upcoming granite4 (#21332)
bringlein Jul 25, 2025
5946508
[Bugfix] DeepGemm utils : Fix hardcoded type-cast (#21517)
varun-sundar-rabindranath Jul 25, 2025
933017a
[DP] Support api-server-count > 0 in hybrid DP LB mode (#21510)
njhill Jul 25, 2025
ad42701
[TPU][Test] Temporarily suspend this MoE model in test_basic.py. (#21…
QiliangCui Jul 25, 2025
cd79556
[Docs] Add `requirements/common.txt` to run unit tests (#21572)
zhouwfang Jul 25, 2025
e8bb378
Integrate TensorSchema with shape validation for Phi3VImagePixelInput…
bbeckca Jul 25, 2025
38ef6d8
[CI] Update CODEOWNERS for CPU and Intel GPU (#21582)
bigPYJ1151 Jul 25, 2025
289ca2d
[Bugfix] fix modelscope snapshot_download serialization (#21536)
andyxning Jul 25, 2025
2eebbd9
[Model] Support tensor parallel for timm ViT in Deepseek_vl2 (#21494)
wzqd Jul 25, 2025
d09777e
[Model] Fix a check for None but the return value was empty list in G…
hfan Jul 25, 2025
c1d3d7c
[Misc][Tools] make max-model-len a parameter in auto_tune script (#21…
yaochengji Jul 25, 2025
b634dcf
[CI/Build] fix cpu_extension for apple silicon (#21195)
ignaciosica Jul 25, 2025
b5ed2e1
[Misc] Removed undefined cmake variables MOE_PERMUTE_ARCHS (#21262)
chenyang78 Jul 25, 2025
83ae2b9
[TPU][Bugfix] fix OOM issue in CI test (#21550)
yaochengji Jul 25, 2025
bf2960e
[Tests] Harden DP tests (#21508)
njhill Jul 25, 2025
7438f06
Add H20-3e fused MoE kernel tuning configs for Qwen3-Coder-480B-A35B-…
Xu-Wenqing Jul 25, 2025
10ed9ad
[Bugfix] GGUF: fix AttributeError: 'PosixPath' object has no attribut…
kebe7jun Jul 25, 2025
7a251e9
[Quantization] Enable BNB support for more MoE models (#21370)
jeejeelee Jul 25, 2025
fe2ded5
[V1] Get supported tasks from model runner instead of model config (#…
DarkLight1337 Jul 25, 2025
59f108c
[Bugfix][Logprobs] Fix logprobs op to support more backend (#21591)
MengqingCao Jul 25, 2025
b81f414
[Model] Fix Ernie4.5MoE e_score_correction_bias parameter (#21586)
xyxinyang Jul 25, 2025
1dce089
[MODEL] New model support for naver-hyperclovax/HyperCLOVAX-SEED-Visi…
bigshanedogg Jul 25, 2025
0895e62
[Frontend] Add request_id to the Request object so they can be contro…
kouroshHakha Jul 25, 2025
9dd0973
[Model] Replace Mamba2 RMSNorm Gated with Fused Triton Kernel (#20839)
cyang49 Jul 25, 2025
a1d17e9
[ROCm][AITER] Enable fp8 kv cache on rocm aiter backend. (#20295)
fsx950223 Jul 25, 2025
2640f7b
[Kernel] Improve machete memory bound perf (#21556)
czhu-cohere Jul 25, 2025
bab02d9
Add support for Prithvi in Online serving mode (#21518)
mgazz Jul 25, 2025
94d9349
[CI] Unifying Dockerfiles for ARM and X86 Builds (#21343)
kebe7jun Jul 25, 2025
b084cd2
[Docs] add auto-round quantization readme (#21600)
wenhuach21 Jul 25, 2025
4bcf8bd
[TPU][Test] Rollback PR-21550. (#21619)
QiliangCui Jul 25, 2025
a2b51da
Add Unsloth to RLHF.md (#21636)
danielhanchen Jul 26, 2025
94ca46e
[Perf] Cuda Kernel for Int8 Per Token Group Quant (#21476)
yewentao256 Jul 26, 2025
9ddd8f6
Add interleaved RoPE test for Llama4 (Maverick) (#21478)
sarckk Jul 26, 2025
87a9c1b
[Bugfix] Fix sync_and_slice_intermediate_tensors (#21537)
ruisearch42 Jul 26, 2025
04daaa7
[Bugfix] Always set RAY_ADDRESS for Ray actor before spawn (#21540)
ruisearch42 Jul 26, 2025
94c289e
[TPU] Update ptxla nightly version to 20250724 (#21555)
yaochengji Jul 26, 2025
306d170
[Feature] Add support for MoE models in the calibration-free RTN-base…
sakogan Jul 26, 2025
7ccd355
[Model] Ultravox: Support Llama 4 and Gemma 3 backends (#17818)
farzadab Jul 26, 2025
7f67a54
[Docs] add offline serving multi-modal video input expamle Qwen2.5-VL…
david6666666 Jul 26, 2025
f2629ed
Correctly kill vLLM processes after finishing serving benchmarks (#21…
huydhn Jul 26, 2025
3339fbb
[Bugfix] Fix isinstance check for tensor types in _load_prompt_embeds…
Mitix-EPI Jul 26, 2025
b539c9a
[TPU][Test] Divide TPU v1 Test into 2 parts. (#21431)
QiliangCui Jul 26, 2025
1e43655
Support Intern-S1 (#21628)
lvhan028 Jul 26, 2025
4cd6f4b
[Misc] remove unused try-except in pooling config check (#21618)
reidliu41 Jul 26, 2025
f1ae9a3
[Take 2] Correctly kill vLLM processes after benchmarks (#21646)
huydhn Jul 26, 2025
a667a1d
Migrate AriaImagePixelInputs to TensorSchema for shape validation (#2…
bbeckca Jul 26, 2025
d8e9999
Migrate AyaVisionImagePixelInputs to TensorSchema for shape validatio…
bbeckca Jul 26, 2025
9eb1def
[Bugfix] Investigate Qwen2-VL failing test (#21527)
Isotr0py Jul 26, 2025
bde05ae
Support encoder-only models without KV-Cache (#21270)
maxdebayser Jul 26, 2025
041989d
[Bug] Fix `has_flashinfer_moe` Import Error when it is not installed …
yewentao256 Jul 26, 2025
ef49644
[Misc] Improve memory profiling debug message (#21429)
yeqcharlotte Jul 26, 2025
50bf836
[BugFix] Fix shared storage connector load kv only load attention lay…
david6666666 Jul 26, 2025
3613fa7
[Refactor] Remove `moe_align_block_size_triton` (#21335)
yewentao256 Jul 26, 2025
f26e7ba
[Bugfix][Apple Silicon] fix missing symbols when build from source on…
zhouyeju Jul 26, 2025
a6f9d9f
[CI/Build][Doc] Move existing benchmark scripts in CI/document/exampl…
yeqcharlotte Jul 26, 2025
77c88a2
[NVIDIA] Explicitly disable shuffled weights for flashinfer blockscal…
kaixih Jul 26, 2025
48bdbba
Remove xformers requirement for Mistral-format Pixtral and Mistral3 (…
wenchen76 Jul 26, 2025
c3aa77a
support `torch.compile` for bailing moe (#21664)
jinzhen-lin Jul 26, 2025
d1e142a
Migrate Blip2ImagePixelInputs and Blip2ImageEmbeddingInputs to Tensor…
bbeckca Jul 27, 2025
eecdffe
Migrate DeepseekVL2ImageInputs to TensorSchema (#21658)
bbeckca Jul 27, 2025
9a29266
Migrate FuyuImagePatchInputs to TensorSchema (#21662)
bbeckca Jul 27, 2025
0a3b908
Migrate ChameleonImagePixelInputs to TensorSchema (#21657)
bbeckca Jul 27, 2025
186db1b
[VLM] Support HF format Phi-4-MM model (#17121)
Isotr0py Jul 27, 2025
aa93d19
Handle non-serializable objects in vllm bench (#21665)
huydhn Jul 27, 2025
e96cc23
[CI/Build][Doc] Clean up more docs that point to old bench scripts (#…
yeqcharlotte Jul 27, 2025
a97742b
Refactor: Remove numpy dependency from LoggingStatLogger (#20529)
skyloevil Jul 27, 2025
a2be04b
[Misc] add default value for file pattern arg (#21659)
andyxning Jul 27, 2025
4e94dbb
Migrate Florence2ImagePixelInputs to TensorSchema (#21663)
bbeckca Jul 27, 2025
d458d99
[VLM] Add video support for Intern-S1 (#21671)
Isotr0py Jul 27, 2025
e10a1ed
[Refactor] Refactor MOE NVFP4 Code Base: ModelOpt + Compressed Tensor…
yewentao256 Jul 27, 2025
ba10d6d
Fix CUDA permute/unpermute for use with DeepGemm Moe (#17934)
CalebDu Jul 27, 2025
ea47401
[Misc] Refactor vllm config str (#21666)
andyxning Jul 27, 2025
358394b
[Attention] Make CutlassMLA the default backend for SM100 (blackwell)…
alexm-redhat Jul 27, 2025
1b15af0
[Deprecation][2/N] Replace `--task` with `--runner` and `--convert` (…
DarkLight1337 Jul 28, 2025
a905376
Fix typo for limit-mm-per-prompt in docs (#21697)
joa-stdn Jul 28, 2025
12f6203
Fix GLM tool parser (#21668)
zRzRzRzRzRzRzR Jul 28, 2025
2b2ee89
[Misc] Add fused_moe configs for Qwen3-Coder-480B-A35B-Instruct-FP8 …
jeejeelee Jul 28, 2025
123c1d2
[V1] Exception Handling when Loading KV Cache from Remote Store (#21534)
liuyumoye Jul 28, 2025
768e9a1
[Model] Support TP/PP/mamba2 kernel for PLaMo2 (#19674)
Alnusjaponica Jul 28, 2025
59817fe
[FEAT] [ROCm] [AITER]: Add AITER HIP block quant kernel (#21242)
tjtanaa Jul 28, 2025
bb3fc77
Migrate Gemma3ImagePixelInputs to TensorSchema (#21676)
bbeckca Jul 28, 2025
0abc5e3
Migrate Glm4vImageInputs, Glm4vVideoInputs to TensorSchema (#21678)
bbeckca Jul 28, 2025
2a1a842
Migrate GLMVImagePixelInputs to TensorSchema (#21679)
bbeckca Jul 28, 2025
7037e02
Migrate GraniteSpeechAudioInputs to TensorSchema (#21682)
bbeckca Jul 28, 2025
c8f5dcc
Migrate Idefics3ImagePixelInputs and Idefics3ImageEmbeddingInputs to …
bbeckca Jul 28, 2025
6a78fa5
[Bugfix] [issue-21565] Fix the incompatibility issue with stream and …
hsliuustc0106 Jul 28, 2025
095a8d0
[bugfix] fix profile impact benchmark results (#21507)
lengrongfu Jul 28, 2025
c776ecf
[Bugfix] Fix shape checking for Fuyu (#21709)
DarkLight1337 Jul 28, 2025
654f6b9
[Bugfix] fix max-file-size type from str to int (#21675)
andyxning Jul 28, 2025
3deaff7
[BugFix] Fix ChunkedLocalAttention when the hybrid kv-cache is disabl…
LucasWilkinson Jul 28, 2025
c1fe6c8
[v1][mamba] Added mamba_type into MambaSpec (#21715)
Josephasafg Jul 28, 2025
97207c5
Migrate KeyeImageInputs and KeyeVideoInputs to TensorSchema (#21686)
bbeckca Jul 28, 2025
868d5bf
[Model] Prioritize Transformers fallback over suffix matching (#21719)
DarkLight1337 Jul 28, 2025
0003992
[feature] add log non default args in LLM (#21680)
lengrongfu Jul 28, 2025
940a625
[Bugfix] Fix Ernie4_5_MoeForCausalLM shared experts (#21717)
jeejeelee Jul 28, 2025
8d31842
[Bugfix] Fix environment variable setting in CPU Dockerfile (#21730)
bigPYJ1151 Jul 28, 2025
4b12ff9
[Bugfix] Fix glm4.1v video_grid_thw tensor shape scheme (#21744)
Isotr0py Jul 28, 2025
251cb6f
[PD] let p2p nccl toy proxy handle /chat/completions (#21734)
chaunceyjiang Jul 28, 2025
d670f80
[`Ernie 4.5`] Name Change for Base 0.3B Model (#21735)
vasqu Jul 28, 2025
e1181c5
[Bugfix] Improve JSON extraction in LlamaToolParser (#19024)
key4ng Jul 28, 2025
3fc005b
[Docs] Add revision date to rendered docs (#21752)
hmellor Jul 28, 2025
55d2ae1
[Bugfix]check health for engine core process exiting unexpectedly (#2…
wuhang2014 Jul 28, 2025
8386ef2
[Bugfix][CI/Build] Update peft version in test requirement (#21729)
Isotr0py Jul 28, 2025
660a86c
[Logs] Change flashinfer sampler logs to once (#21759)
mgoin Jul 28, 2025
22e01f4
[Misc] Reduce logs for model resolution (#21765)
DarkLight1337 Jul 28, 2025
e3edc06
[Bugfix] Mistral crashes on tool with no description (#21167)
HugoMichard Jul 28, 2025
3c9b79b
[CI/Build] Fix plugin tests (#21758)
DarkLight1337 Jul 28, 2025
8bf3bd2
[XPU] IPEX-optimized Punica Wrapper on XPU (#21703)
chaojun-zhang Jul 28, 2025
cd3ef49
[Bugfix] Fix granite speech shape validation (#21762)
DarkLight1337 Jul 28, 2025
3e46af9
[P/D] Log warnings related to prefill KV expiry (#21753)
njhill Jul 28, 2025
d36e859
Use `metavar` to list the choices for a CLI arg when custom values ar…
hmellor Jul 28, 2025
00931b9
update flashinfer to v0.2.9rc2 (#21701)
weireweire Jul 28, 2025
1529efb
[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes …
rasmith Jul 28, 2025
d58e07b
[Bug] Enforce contiguous input for `dynamic_scaled_fp8_quant` and `st…
yewentao256 Jul 28, 2025
23ae696
[AMD][CI/Build] Fix the AMD issue caused by inappropriate of symbol e…
houseroad Jul 28, 2025
8899747
Revert "[V1] Exception Handling when Loading KV Cache from Remote Sto…
KuntaiDu Jul 28, 2025
964b7c5
[Bugfix] DeepGEMM is not enabled on B200 due to `_lazy_init()` (#21472)
smarterclayton Jul 28, 2025
154107d
[Feat]: Add support for Dynamic Quant 4 bit CPU kleidiai kernels (#17…
nikhil-arm Jul 28, 2025
024f5de
[Perf] Disable chunked local attention by default with llama4 (#21761)
LucasWilkinson Jul 28, 2025
d963847
[Kernel] SM90 CUTLASS FP8 GEMM: add support for swap AB + kernel tuni…
LyrisZhong Jul 28, 2025
11e6681
[Docs] Minimize spacing for supported_hardware.md table (#21779)
mgoin Jul 29, 2025
bb65d56
[Refactor] Merge Compressed Tensor FP8 `CompressedTensorsW8A8Fp8MoEMe…
yewentao256 Jul 29, 2025
e350cf8
[CI] Parallelize Kernels MoE Test (#21764)
mgoin Jul 29, 2025
4755076
skip fusedmoe layer for start_load_kv (#21378)
calvin0327 Jul 29, 2025
9520afd
[AMD][CI/Build][Bugfix] Guarding CUDA specific functions by ifndef RO…
gshtras Jul 29, 2025
b80cfd3
Migrate InternVLImageInputs and InternVLVideoInputs to TensorSchema (…
bbeckca Jul 29, 2025
dcd98bd
[Misc] Rework process titles (#21780)
njhill Jul 29, 2025
d88c646
[Model]: Fused MoE for nomic-embed-text-v2-moe (#18321)
Isotr0py Jul 29, 2025
9c7e98e
[V0 deprecation] Guided decoding (#21347)
rzabarazesh Jul 29, 2025
54af49c
[Model] Refactor JambaForCausalLM (#21394)
jeejeelee Jul 29, 2025
efb310d
[Docs] Fix the outdated URL for installing from vLLM binaries (#21523)
yankay Jul 29, 2025
391201b
[KVCache] Make KVCacheSpec hashable (#21791)
heheda12345 Jul 29, 2025
7ce0a8e
[Doc] Update compatibility matrix for pooling and multimodal models (…
DarkLight1337 Jul 29, 2025
3f733da
[Bugfix] VLLM_V1 supports passing other compilation levels (#19340)
zou3519 Jul 29, 2025
c64a790
[Docs] Merge design docs for a V1 only future (#21832)
hmellor Jul 29, 2025
63401ae
[TPU] Add an optimization doc on TPU (#21155)
bvrockwell Jul 29, 2025
11e3042
[Bugfix]fix mixed bits and visual language model quantization in Auto…
wenhuach21 Jul 29, 2025
e21f1cd
[Bugfix] Fix workspace buffer None issue for Flashinfer TRTLLM Backen…
elvischenv Jul 29, 2025
e84d4da
[Docs] use `uv` in GPU installation docs (#20277)
davidxia Jul 29, 2025
4e37a04
[Doc] Add FusedMoE Modular Kernel Documentation (#21623)
varun-sundar-rabindranath Jul 29, 2025
55d0a23
[Doc] update Contributing page's testing section (#18272)
davidxia Jul 29, 2025
c79c338
Add `flashinfer_python` to CUDA wheel requirements (#21389)
mgoin Jul 29, 2025
5d8ed8a
docker: docker-aware precompiled wheel support (#21127)
dougbtv Jul 29, 2025
8b78f98
Revert "[AMD][CI/Build] Fix the AMD issue caused by inappropriate of …
gshtras Jul 29, 2025
9513a0b
[BugFix] Fix interleaved sliding window not set for Gemma3n (#21863)
sarckk Jul 29, 2025
fa3ac7e
[ci] add b200 test placeholder (#21866)
simon-mo Jul 30, 2025
cc64453
[ci] mark blackwell test optional for now (#21878)
simon-mo Jul 30, 2025
7351db9
[Bugfix] Correct max tokens for non-contiguous embeds (#21798)
milesial Jul 30, 2025
8f99a83
[v1][attention] Support Hybrid Allocator + FlashInfer (#21412)
heheda12345 Jul 30, 2025
3285bc4
[Docs] Switch to better markdown linting pre-commit hook (#21851)
hmellor Jul 30, 2025
94efcfe
[DOC] Fix path of v1 related figures (#21868)
heheda12345 Jul 30, 2025
8b68315
[Docs] Update docker.md with HF_TOKEN, new model, and podman fix (#21…
mgoin Jul 30, 2025
55cde0a
Expose PyTorch profiler configuration to environment variables (#21803)
Csrayz Jul 30, 2025
3846108
[Bugfix] Fix shape mismatch assertion error when loading Gemma3n mode…
sydarb Jul 30, 2025
6defc83
[Bugfix] Fix comment typo of get_num_common_prefix_blocks() (#21827)
MingzhenHan Jul 30, 2025
c604c01
[Bugfix] Actually disable processing cache when API server is scaled …
DarkLight1337 Jul 30, 2025
b083e07
[Perf] Using `__nv_fp8_e4m3` instead of `c10::e4m3` for `per_token_gr…
yewentao256 Jul 30, 2025
61f843b
[Frontend] Add LLM.reward specific to reward models (#21720)
noooop Jul 30, 2025
3da602d
[XPU] use `ZE_AFFINITY_MASK` for device select on xpu (#21815)
jikunshang Jul 30, 2025
e725b72
Add @sighingnow as maintainer of qwen's related files. (#21895)
sighingnow Jul 30, 2025
a63c8b0
[CI/Build] Fix pre-commit failure in docs (#21897)
DarkLight1337 Jul 30, 2025
0d9cc64
[Docs] Expand introduction to Ray in Multi-node deployment section (#…
crypdick Jul 30, 2025
7ac1bce
Update vLLM Benchmark Suite for Xeon based on 0.9.2 release (#21486)
louie-tsai Jul 30, 2025
d56483b
[Misc] Remove redundant config definitions (#21891)
DarkLight1337 Jul 30, 2025
f5b3503
[CI] rollback lint-and-deploy pipeline using amd machine (#21912)
kebe7jun Jul 30, 2025
825959c
[Tests] Fixing bug inside MultiModalProfiler. (#21842)
shenoyvvarun Jul 30, 2025
fe78c9e
[Model] Remove DSV2 unused code (#21903)
jeejeelee Jul 30, 2025
f38d30b
[benchmark] add max-concurrency in result table (#21095)
panpan0000 Jul 30, 2025
0a4aaa0
[Doc] Update partial support (#21916)
DarkLight1337 Jul 30, 2025
1232257
[Docs] Fix the example code of streaming chat completions in reasonin…
hsliuustc0106 Jul 30, 2025
1f303cc
Add @patrickvonplaten as maintainer of mistral's related files. (#21928)
patrickvonplaten Jul 30, 2025
8f69358
[Hardware][CPU] Build fix for ARM without BF16 (#21848)
ericcurtin Jul 30, 2025
ffd6073
[Feature][EPLB] Add eplb support for Qwen3 (#20815)
aladerran Jul 30, 2025
3803e15
[Doc] Remove vLLM prefix and add citation for PagedAttention (#21910)
DarkLight1337 Jul 30, 2025
5689504
[Bugfix] we should use metavar is not choices (#21902)
lengrongfu Jul 30, 2025
44827e4
[Feature] Support multiple api keys in server (#18548)
Yanpas Jul 30, 2025
c8183ec
[misc] skip p2p check by default (#21904)
youkaichao Jul 30, 2025
a03b93a
[Test] Add Benchmark and Unit Test for `per_token_group_quant` (#21860)
yewentao256 Jul 30, 2025
ed8b20c
[CI/Build] Only run markdownlint in CI (#21892)
DarkLight1337 Jul 30, 2025
842736e
Reduce time wasted in GitHub Actions using `concurrency` (#21919)
hmellor Jul 30, 2025
4884f02
[Misc] Improve code readability of KVCacheManager (#21673)
tanruixiang Jul 30, 2025
88243b4
[NVIDIA] Fix Llama4 Scout FP4 functionality issues (#21499)
nvpohanh Jul 30, 2025
5b559a6
[Docs] Reduce the size of the built docs (#21920)
hmellor Jul 30, 2025
8b16774
[Bugfix] Fix OOM tests in initialization test (#21921)
Isotr0py Jul 30, 2025
c31cf24
[Bugfix] Fix multi-api server not working for text models (#21933)
DarkLight1337 Jul 30, 2025
b64872d
Override attention metadata for fast prefill in some KV sharing setup…
sarckk Jul 30, 2025
731cbb8
[Bugfix] Fix TypeError in scheduler when comparing mixed request_id t…
chi2liu Jul 30, 2025
13264ca
[CI/Build] Fix registry tests (#21934)
DarkLight1337 Jul 30, 2025
04dc2e7
[Bugfix] SharedStorage Connector for V1 PD multimodal (#21611)
fake0fan Jul 30, 2025
e45a9b2
feat(distributed): add `get_required_kvcache_layout` class method to …
wxsms Jul 30, 2025
931d776
[TPU] Support Pathways in vLLM (#21417)
wenxindongwork Jul 30, 2025
45447ab
[Misc] Support more collective_rpc return types (#21845)
njhill Jul 30, 2025
e8e693a
For VLLM_USE_PRECOMPILED, only compiled .so files should be extracted…
dougbtv Jul 30, 2025
ace708f
[Misc] Use dracut on CentOS and skip clone if repo exists for EP kern…
minosfuture Jul 30, 2025
d8a2eae
[Feature] Add async tensor parallelism for scaled mm (#20155)
cascade812 Jul 30, 2025
181202f
[Bugfix] Fix None value handling in trace span creation for cancelled…
br4mm Jul 30, 2025
3b91b17
[Core] Move EngineCoreRequest to Request conversion out of EngineCore…
linzebing Jul 30, 2025
f261288
[Example] Add `async_llm_streaming.py` example for AsyncLLM streaming…
mgoin Jul 31, 2025
1843059
[Bugfix] Relax lang pin for voxtral (#21833)
sanchit-gandhi Jul 31, 2025
474f25d
[UX] Rename CUTLASS_MLA_VLLM_V1 to CUTLASS_MLA (#21966)
mgoin Jul 31, 2025
07be3b9
[Misc] Expand SUPPORTED_HIDDEN_SIZES for DeepEP low-latency kernels …
jeejeelee Jul 31, 2025
52f1a7e
[CI Bugfix] Fix CI OOM for `test_shared_storage_connector_hashes` (#2…
mgoin Jul 31, 2025
3eee204
[Bugfix]: fix metadata file copy in test_sharded_state_loader (#21830)
andyxning Jul 31, 2025
6967d0e
[Deprecation] Remove deprecated args and methods (#21907)
DarkLight1337 Jul 31, 2025
f97ff9b
[CI/Build] get rid of unused VLLM_FA_CMAKE_GPU_ARCHES (#21599)
dtrifiro Jul 31, 2025
b28069d
merge mian to feat/support-long-text-embedding
x22x22 Jul 31, 2025
a0955f8
The files `diff_config.py` and `diff_serving_embedding.py` have been …
x22x22 Jul 31, 2025
5478169
The files `diff_config.py` and `diff_serving_embedding.py` have been …
x22x22 Jul 31, 2025
971aa11
The block processing logic in the embedding processing has been optim…
x22x22 Jul 31, 2025
b3204df
The processing logic for long - text embeddings has been updated. Mea…
x22x22 Jul 31, 2025
5536db0
Error: docs/models/pooling_models.md:261:110 MD047/single-trailing-ne…
x22x22 Jul 31, 2025
a835f52
The latest update introduces new long-text embedding examples and ser…
x22x22 Aug 5, 2025
15 changes: 10 additions & 5 deletions .buildkite/nightly-benchmarks/README.md
Expand Up @@ -28,6 +28,7 @@ See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performanc
## Trigger the benchmark

Performance benchmark will be triggered when:

- A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label AND `ready` label.

Expand All @@ -38,6 +39,7 @@ bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
```

Runtime environment variables:

- `ON_CPU`: set the value to '1' on Intel® Xeon® Processors. Default value is 0.
- `SERVING_JSON`: JSON file to use for the serving tests. Default value is empty string (use default file).
- `LATENCY_JSON`: JSON file to use for the latency tests. Default value is empty string (use default file).
Expand All @@ -46,12 +48,14 @@ Runtime environment variables:
- `REMOTE_PORT`: Port for the remote vLLM service to benchmark. Default value is empty string.

Nightly benchmark will be triggered when:

- Every commit for those PRs with `perf-benchmarks` label and `nightly-benchmarks` label.

## Performance benchmark details

See [performance-benchmarks-descriptions.md](performance-benchmarks-descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.
> NOTE: For Intel® Xeon® Processors, use `tests/latency-tests-cpu.json`, `tests/throughput-tests-cpu.json`, `tests/serving-tests-cpu.json` instead.
>
### Latency test

Here is an example of one test inside `latency-tests.json`:
Expand All @@ -74,21 +78,21 @@ Here is an example of one test inside `latency-tests.json`:
In this example:

- The `test_name` attribute is a unique identifier for the test. In `latency-tests.json`, it must start with `latency_`.
- The `parameters` attribute control the command line arguments to be used for `benchmark_latency.py`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `benchmark_latency.py`. For example, the corresponding command line arguments for `benchmark_latency.py` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
- The `parameters` attribute control the command line arguments to be used for `vllm bench latency`. Note that please use underline `_` instead of the dash `-` when specifying the command line arguments, and `run-performance-benchmarks.sh` will convert the underline to dash when feeding the arguments to `vllm bench latency`. For example, the corresponding command line arguments for `vllm bench latency` will be `--model meta-llama/Meta-Llama-3-8B --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`
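
The underscore-to-dash conversion described above can be sketched in Python. This is only an illustration: the helper name `params_to_cli_args` is hypothetical, the real conversion happens inside `run-performance-benchmarks.sh`, and treating an empty-string value as a value-less flag is an assumption made for this sketch.

```python
def params_to_cli_args(parameters: dict) -> list[str]:
    """Turn JSON-style keys (with underscores) into dash-style CLI flags.

    Hypothetical helper for illustration; not the pipeline's own code.
    """
    args = []
    for key, value in parameters.items():
        # "tensor_parallel_size" -> "--tensor-parallel-size"
        args.append("--" + key.replace("_", "-"))
        # Assumption: an empty string marks a flag that takes no value.
        if value != "":
            args.append(str(value))
    return args


params = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "tensor_parallel_size": 1,
    "load_format": "dummy",
    "num_iters_warmup": 5,
    "num_iters": 15,
}
print(" ".join(params_to_cli_args(params)))
```

With the parameters from the example above, the joined string matches the command-line arguments quoted in the bullet.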

Note that the performance numbers are highly sensitive to the value of the parameters. Please make sure the parameters are set correctly.

WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.

### Throughput test

The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `benchmark_throughput.py`.
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except for that the parameters will be fed forward to `vllm bench throughput`.

The result of this test is also stable, so even a slight change in this number can indicate a large difference in performance.

### Serving test

We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:
We test the throughput by using `vllm bench serve` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```json
[
Expand Down Expand Up @@ -118,8 +122,8 @@ Inside this example:

- The `test_name` attribute is also a unique identifier for the test. It must start with `serving_`.
- The `server-parameters` includes the command line arguments for vLLM server.
- The `client-parameters` includes the command line arguments for `benchmark_serving.py`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `benchmark_serving.py`
- The `client-parameters` includes the command line arguments for `vllm bench serve`.
- The `qps_list` controls the list of qps for test. It will be used to configure the `--request-rate` parameter in `vllm bench serve`
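
As a rough sketch of how `qps_list` maps onto `--request-rate`, the snippet below expands one serving test into one client run per QPS value. The function name `expand_qps_list` is hypothetical, and the `<test_name>_qps_<qps>` naming is an assumption inferred from result names such as `serving_llama8B_tp1_sharegpt_qps_1`; the benchmark script itself may do this differently.

```python
def expand_qps_list(test_name: str, qps_list: list) -> list[tuple[str, list[str]]]:
    """Return one (run_name, client_args) pair per QPS value.

    Hypothetical helper for illustration; not the pipeline's own code.
    """
    runs = []
    for qps in qps_list:
        # Assumed naming scheme: "<test_name>_qps_<qps>"
        run_name = f"{test_name}_qps_{qps}"
        # Each QPS value becomes the --request-rate of one client run.
        runs.append((run_name, ["--request-rate", str(qps)]))
    return runs


for name, args in expand_qps_list("serving_llama8B_tp1_sharegpt", [1, 4, 16, "inf"]):
    print(name, " ".join(args))
```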

The result of this test is less stable than the delay and latency benchmarks (due to randomized sharegpt dataset sampling inside `benchmark_serving.py`), but a large change in this number (e.g. a 5% change) still indicates a significant difference in performance.

Expand Down Expand Up @@ -149,6 +153,7 @@ Here is an example using the script to compare result_a and result_b without det

Here is an example using the script to compare result_a and result_b with detail test name.
`python3 compare-json-results.py -f results_a/benchmark_results.json -f results_b/benchmark_results.json`

| | results_a/benchmark_results.json_name | results_a/benchmark_results.json | results_b/benchmark_results.json_name | results_b/benchmark_results.json | perf_ratio |
|---|---------------------------------------------|----------------------------------------|---------------------------------------------|----------------------------------------|----------|
| 0 | serving_llama8B_tp1_sharegpt_qps_1 | 142.633982 | serving_llama8B_tp1_sharegpt_qps_1 | 156.526018 | 1.097396 |
Expand Down
21 changes: 11 additions & 10 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -1,3 +1,4 @@
# Nightly benchmark annotation

## Description

Expand All @@ -13,15 +14,15 @@ Please download the visualization scripts in the post

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code:

```bash
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.
34 changes: 17 additions & 17 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
Expand Up @@ -13,25 +13,25 @@ Latest reproduction guilde: [github issue link](https://github.com/vllm-project/
## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we uses r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
- Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (queries per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns follow a Poisson process, all with a fixed random seed.
- Evaluation metrics: Throughput (higher is better), TTFT (time to the first token, lower is better), ITL (inter-token latency, lower is better).
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
- Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (queries per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns follow a Poisson process, all with a fixed random seed.
- Evaluation metrics: Throughput (higher is better), TTFT (time to the first token, lower is better), ITL (inter-token latency, lower is better).
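
The workload description mentions Poisson arrivals with a fixed seed. As a rough illustration only (this is not taken from the benchmark client, and the function name `poisson_arrival_times` is hypothetical), a reproducible arrival schedule for a given average QPS could be sketched like this:

```python
import random


def poisson_arrival_times(num_requests, qps, seed=0):
    """Return cumulative send times (in seconds) for a Poisson arrival process.

    Hypothetical helper for illustration; the real benchmark client may
    generate its schedule differently.
    """
    rng = random.Random(seed)  # fixed seed: the same schedule on every run
    if qps == float("inf"):
        # "inf" QPS means every request is sent immediately.
        return [0.0] * num_requests
    times = []
    now = 0.0
    for _ in range(num_requests):
        # Gaps between Poisson arrivals are exponential with mean 1/qps.
        now += rng.expovariate(qps)
        times.append(now)
    return times


if __name__ == "__main__":
    schedule = poisson_arrival_times(num_requests=5, qps=4.0, seed=0)
    print([round(t, 3) for t in schedule])
```

Because the generator is seeded, two runs at the same QPS see the same arrival pattern, which is what makes results comparable across engines.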

## Known issues

@@ -1,3 +1,4 @@
# Performance benchmarks descriptions

## Latency tests

@@ -44,6 +44,7 @@
"test_name": "Test name",
"gpu_type": "GPU",
"completed": "# of req.",
"max_concurrency": "# of max concurrency.",
"request_throughput": "Tput (req/s)",
"total_token_throughput": "Total Token Tput (tok/s)",
"output_throughput": "Output Tput (tok/s)",
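
The hunk above adds a `max_concurrency` entry to a mapping from raw result keys to display headers. A minimal sketch of how such a mapping can be applied, assuming plain dictionaries (the real script may do this differently, and `COLUMN_NAMES` and `rename_columns` are hypothetical names):

```python
# Subset of the key-to-header mapping shown in the hunk above.
COLUMN_NAMES = {
    "test_name": "Test name",
    "completed": "# of req.",
    "max_concurrency": "# of max concurrency.",
    "request_throughput": "Tput (req/s)",
}


def rename_columns(raw_result, mapping=COLUMN_NAMES):
    """Keep only the mapped keys and replace them with display headers."""
    return {mapping[key]: value for key, value in raw_result.items() if key in mapping}


if __name__ == "__main__":
    raw = {"test_name": "serving_demo", "completed": 200, "max_concurrency": 8}
    print(rename_columns(raw))
```

A key that is missing from the mapping is silently dropped in this sketch, which is why a new metric such as `max_concurrency` has to be added to the mapping before it shows up in a report.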
@@ -100,7 +101,7 @@ def get_size_with_unit(bytes, suffix="B"):
raw_result = json.loads(f.read())

if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`
# this result is generated via `vllm bench serve` command

# attach the benchmarking command to raw_result
try:
@@ -120,7 +121,7 @@ def get_size_with_unit(bytes, suffix="B"):
continue

elif "latency" in f.name:
# this result is generated via `benchmark_latency.py`
# this result is generated via `vllm bench latency` command

# attach the benchmarking command to raw_result
try:
@@ -148,7 +149,7 @@ def get_size_with_unit(bytes, suffix="B"):
continue

elif "throughput" in f.name:
# this result is generated via `benchmark_throughput.py`
# this result is generated via `vllm bench throughput` command

# attach the benchmarking command to raw_result
try:
40 changes: 21 additions & 19 deletions .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
@@ -73,7 +73,7 @@ get_current_llm_serving_engine() {
echo "Container: vllm"
# move to a completely irrelevant directory, to avoid importing vllm from the current folder
export CURRENT_LLM_SERVING_ENGINE=vllm

return
fi
}
@@ -95,12 +95,14 @@ json2args() {
}

kill_gpu_processes() {
pkill -f python
pkill -f python3
pkill -f tritonserver
pkill -f pt_main_thread
pkill -f text-generation
pkill -f lmdeploy
pkill -f '[p]ython'
pkill -f '[p]ython3'
pkill -f '[t]ritonserver'
pkill -f '[p]t_main_thread'
pkill -f '[t]ext-generation'
pkill -f '[l]mdeploy'
# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pkill -f '[V]LLM'

while [ "$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n 1)" -ge 1000 ]; do
sleep 1
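
The hunk above rewrites each `pkill -f` pattern as a bracket expression such as `'[p]ython'`. The diff does not state the reason, but the usual motivation is that `pkill -f` matches against full command lines, so a wrapper process whose arguments contain the literal pattern text can match too. The sketch below imitates that matching with Python's `re` module (an approximation of pkill's extended regex matching; the command-line strings are made-up examples):

```python
import re


def would_match(pattern, command_line):
    """Approximate `pkill -f`: regex search over a full command line."""
    return re.search(pattern, command_line) is not None


# Example of a process we do want to kill.
TARGET = "python3 -m vllm.entrypoints.openai.api_server"
# Hypothetical wrapper shells whose own arguments contain the pattern text.
WRAPPER_PLAIN = "bash -c pkill -f python"
WRAPPER_BRACKET = "bash -c pkill -f [p]ython"

if __name__ == "__main__":
    # Both spellings match the real target.
    print(would_match("python", TARGET), would_match("[p]ython", TARGET))
    # The plain pattern also matches the wrapper that carries it.
    print(would_match("python", WRAPPER_PLAIN))
    # The bracketed pattern does not match its own literal text.
    print(would_match("[p]ython", WRAPPER_BRACKET))
```

The bracket expression `[p]` still matches the single character `p`, so the set of target processes is unchanged; only command lines that contain the literal text `[p]ython` stop matching.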
@@ -125,7 +127,7 @@ ensure_installed() {
}

run_serving_tests() {
# run serving tests using `benchmark_serving.py`
# run serving tests using `vllm bench serve` command
# $1: a json file specifying serving test cases

local serving_test_file
@@ -225,7 +227,7 @@ run_serving_tests() {

if [[ "$dataset_name" = "sharegpt" ]]; then

client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@@ -246,7 +248,7 @@ run_serving_tests() {
sonnet_output_len=$(echo "$common_params" | jq -r '.sonnet_output_len')
sonnet_prefix_len=$(echo "$common_params" | jq -r '.sonnet_prefix_len')

client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--backend $backend \
--tokenizer /tokenizer_cache \
--model $model \
@@ -265,13 +267,13 @@ run_serving_tests() {
$client_args"

else

echo "The dataset name must be either 'sharegpt' or 'sonnet'. Got $dataset_name."
exit 1

fi



echo "Running test case $test_name with qps $qps"
echo "Client command: $client_command"
@@ -302,7 +304,7 @@ run_serving_tests() {
}

run_genai_perf_tests() {
# run genai-perf tests
# run genai-perf tests

# $1: a json file specifying genai-perf test cases
local genai_perf_test_file
@@ -311,14 +313,14 @@ run_genai_perf_tests() {
# Iterate over genai-perf tests
jq -c '.[]' "$genai_perf_test_file" | while read -r params; do
# get the test name, and append the GPU type back to it.
test_name=$(echo "$params" | jq -r '.test_name')
test_name=$(echo "$params" | jq -r '.test_name')

# if TEST_SELECTOR is set, only run the test cases that match the selector
if [[ -n "$TEST_SELECTOR" ]] && [[ ! "$test_name" =~ $TEST_SELECTOR ]]; then
echo "Skip test case $test_name."
continue
fi

# prepend the current serving engine to the test name
test_name=${CURRENT_LLM_SERVING_ENGINE}_${test_name}

@@ -369,10 +371,10 @@ run_genai_perf_tests() {
qps=$num_prompts
echo "now qps is $qps"
fi

new_test_name=$test_name"_qps_"$qps
backend=$CURRENT_LLM_SERVING_ENGINE

if [[ "$backend" == *"vllm"* ]]; then
backend="vllm"
fi
@@ -413,7 +415,7 @@ prepare_dataset() {
do
cat sonnet.txt >> sonnet_4x.txt
done

}

main() {
@@ -33,7 +33,7 @@ check_gpus() {

check_cpus() {
# check the number of CPUs and NUMA Node and GPU type.
declare -g numa_count=$(python3 -c "from numa import info;numa_size = info.get_num_configured_nodes(); print(numa_size)")
declare -g numa_count=$(lscpu | grep "NUMA node(s):" | awk '{print $3}')
if [[ $numa_count -gt 0 ]]; then
echo "NUMA found."
echo $numa_count
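
The new `numa_count` line reads the node count from the `NUMA node(s):` row of `lscpu` instead of importing the Python `numa` module. For readers who want to see the parsing step in isolation, here is a minimal sketch of the same extraction over captured text (`parse_numa_count` and the sample output are hypothetical; real `lscpu` output varies by system and locale):

```python
import re


def parse_numa_count(lscpu_output):
    """Return the integer on the 'NUMA node(s):' line, or 0 if it is absent."""
    for line in lscpu_output.splitlines():
        match = re.match(r"\s*NUMA node\(s\):\s*(\d+)", line)
        if match:
            return int(match.group(1))
    return 0


# Made-up sample in the general shape of `lscpu` output.
SAMPLE = """\
Architecture:        x86_64
CPU(s):              64
NUMA node(s):        2
NUMA node0 CPU(s):   0-31
NUMA node1 CPU(s):   32-63
"""

if __name__ == "__main__":
    print(parse_numa_count(SAMPLE))
```

Only the summary row matches, because the per-node rows such as `NUMA node0 CPU(s):` lack the `(s):` directly after `node`.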
@@ -126,7 +126,8 @@ kill_gpu_processes() {
ps -aux
lsof -t -i:8000 | xargs -r kill -9
pgrep python3 | xargs -r kill -9

# vLLM now names the process with VLLM prefix after https://github.com/vllm-project/vllm/pull/21445
pgrep VLLM | xargs -r kill -9

# wait until GPU memory usage is below 1 GB
if command -v nvidia-smi; then
@@ -164,7 +165,7 @@ upload_to_buildkite() {
}

run_latency_tests() {
# run latency tests using `benchmark_latency.py`
# run latency tests using `vllm bench latency` command
# $1: a json file specifying latency test cases

local latency_test_file
@@ -205,7 +206,7 @@ run_latency_tests() {
fi
fi

latency_command=" $latency_envs python3 benchmark_latency.py \
latency_command=" $latency_envs vllm bench latency \
--output-json $RESULTS_FOLDER/${test_name}.json \
$latency_args"

@@ -231,7 +232,7 @@ run_latency_tests() {
}

run_throughput_tests() {
# run throughput tests using `benchmark_throughput.py`
# run throughput tests using `vllm bench throughput`
# $1: a json file specifying throughput test cases

local throughput_test_file
@@ -272,7 +273,7 @@ run_throughput_tests() {
fi
fi

throughput_command=" $throughput_envs python3 benchmark_throughput.py \
throughput_command=" $throughput_envs vllm bench throughput \
--output-json $RESULTS_FOLDER/${test_name}.json \
$throughput_args"

@@ -297,7 +298,7 @@ run_throughput_tests() {
}

run_serving_tests() {
# run serving tests using `benchmark_serving.py`
# run serving tests using `vllm bench serve` command
# $1: a json file specifying serving test cases

local serving_test_file
@@ -393,7 +394,7 @@ run_serving_tests() {

# pass the tensor parallel size to the client so that it can be displayed
# on the benchmark dashboard
client_command="python3 benchmark_serving.py \
client_command="vllm bench serve \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
@@ -447,7 +448,7 @@ main() {
(which jq) || (apt-get update && apt-get -y install jq)
(which lsof) || (apt-get update && apt-get install -y lsof)

# get the current IP address, required by benchmark_serving.py
# get the current IP address, required by `vllm bench serve` command
export VLLM_HOST_IP=$(hostname -I | awk '{print $1}')
# turn off the reporting of the status of each request, to clean up the terminal output
export VLLM_LOGGING_LEVEL="WARNING"