Commits
2873 commits
fa96fb9
Pruning kernel Core Tests (#26727)
kfhfar Oct 13, 2025
35bc22f
[ResponseAPI] Further polish message serialization and unit tests (#2…
Jialin Oct 13, 2025
d8bebb0
Add tests for chunked prefill and prefix cache with causal pooling mo…
maxdebayser Oct 13, 2025
8317f72
[Misc][DP] support customized aggregated logger for dp (#24354)
luccafong Oct 14, 2025
3e051bd
[UX] Replace VLLM_ALL2ALL_BACKEND with --all2all-backend (#26732)
mgoin Oct 14, 2025
b59dd19
[compile] Enable sequence parallelism for full cuda graph without spe…
angelayi Oct 14, 2025
cfded80
[Easy] Fix env type check errors from VLLM_DEBUG_LOG_API_SERVER_RESPO…
Jialin Oct 14, 2025
48b15f4
Merge branch 'main' into fused_moe_lora
jeejeelee Oct 14, 2025
8a0af6a
[build][torch.compile] upgrade depyf version (#26702)
youkaichao Oct 14, 2025
ed61b7a
add unit testing
B-201 Oct 14, 2025
8ae1692
[torch.compile] Unwrap fused_marlin_moe custom op (#26739)
varun-sundar-rabindranath Oct 14, 2025
2935092
[Feature][Quantization] auto_round format add support for regex (#24024)
n1ck-guo Oct 14, 2025
8c3148f
Add test
jeejeelee Oct 14, 2025
42c7abe
Adding test for gptoss
Oct 14, 2025
355967e
add test
B-201 Oct 14, 2025
1f1d5d5
Update to default_act_function and pass as callable
Oct 14, 2025
95fe27e
remove None type
Oct 14, 2025
e22782a
Update default_act function signature
Oct 14, 2025
fe3edb4
Add support for the /rerank endpoint in vllm bench serve (#26602)
maxdebayser Oct 14, 2025
a860253
Merge pull request #16 from dcmaddix/remove_torch_ops
wcwuwc Oct 14, 2025
2e36cdb
[Docs] Add a start tag to build.inc.md (#26747)
windsonsea Oct 14, 2025
4497c8f
Fix lora tests failure in TPU CI due to the removal of LoRA bias (#26…
vanbasten23 Oct 14, 2025
4821ac1
[CI] [ROCm] Automate CC list for ROCm related issue (#26753)
vllmellm Oct 14, 2025
d3cc842
[ci] Adding the test-amd.yaml for test definitions for the AMD backen…
Alexei-V-Ivanov-AMD Oct 14, 2025
481545b
scheduler.py: Update the name of the default scheduler. (#26758)
ryanli Oct 14, 2025
01ad27f
[Model][Bugfix]fix ernie45 load failed due to ernie45 eplb code (#26684)
CSWYF3634076 Oct 14, 2025
d32c611
[CI/Build] Use 127.0.0.1 instead of localhost in utils (#26750)
yeqcharlotte Oct 14, 2025
fd85c9f
[Bugfix][FE]: Always include usage with `--enable-force-include-usage…
max-wittig Oct 14, 2025
577d498
[Plugin] Make plugin group clear (#26757)
wangxiyuan Oct 14, 2025
d2f816d
[Bugfix] Standardize merging multimodal embeddings (#26771)
DarkLight1337 Oct 14, 2025
74704d4
[Model] Use merge_by_field_config for MM models (O-P) (#26776)
DarkLight1337 Oct 14, 2025
7e6edb1
[NIXL][HeteroTP] Enable KV transfer from HND prefill to NHD decode (#…
xuechendi Oct 14, 2025
d1d063a
[Chore] Use `max_transformers_version` for Qwen-VL test (#26792)
DarkLight1337 Oct 14, 2025
70b1b33
Don't allow `typos` to fix by default (#26785)
hmellor Oct 14, 2025
ef9676a
[Doc] ruff format some Python examples (#26767)
DarkLight1337 Oct 14, 2025
780eb03
[CI] Fix test_tool_id_kimi_k2 (#26787)
chaunceyjiang Oct 14, 2025
9c4cb68
[Chore] Remove `SupportsV0Only` interface and update supported models…
DarkLight1337 Oct 14, 2025
c715ba3
[Feature] Change vllm.py with pydantic validation (#26726)
VladOS95-cyber Oct 14, 2025
fdd3275
[CI/Build] Cleanup LoRA test (#26752)
jeejeelee Oct 14, 2025
ea97940
[DCP] Support Decode Context Parallel (DCP) for GQA with FlashAttenti…
FENP Oct 14, 2025
e9f1b8c
Adjusted the model order of the model registration file (#26798)
princepride Oct 14, 2025
ca683a2
use combo kernel to fuse qk-norm and qk-rope (#26682)
BoyuanFeng Oct 14, 2025
88a4974
[issues template] Encourage the author implement their own ideas (#26…
noooop Oct 14, 2025
720394d
[KVConnector][Metrics] Aggregate scheduler-side KVConnectorStats (#26…
QierLi Oct 14, 2025
c18e190
clean up the code
wcwuwc Oct 14, 2025
df850c4
[Feature][Responses API] Stream Function Call - harmony (#24317)
chaunceyjiang Oct 14, 2025
b46d9c8
represent invalid expert ids with -1
wcwuwc Oct 14, 2025
e6cdbd6
Revert "[issues template] Encourage the author implement their own id…
noooop Oct 14, 2025
6d87a28
[Config] Remove Unused Environment Variable `VLLM_DISABLE_PAD_FOR_CUD…
yewentao256 Oct 14, 2025
efc8f7d
Update coveragerc and add codecov.yml for path fixes (#26435)
rzabarazesh Oct 14, 2025
04b5f98
[CI] Raise VLLM_MAX_SIZE_MB to 500 due to failing Build wheel - CUDA …
mgoin Oct 14, 2025
aba48f7
[Kernel][MoE] Add MoE tunings for GLM 4.6-FP8 and GLM 4.5 Air on NVid…
zklapow Oct 14, 2025
c3a722f
[CI Failure] Fix tests with missing TinyLlama-1.1B-Chat-v1.0-FP8-e2e …
mgoin Oct 14, 2025
87efc68
llama4_vision_rope: add HIP override to accept (q, k) and avoid (posi…
hl475 Oct 14, 2025
82af928
[Attention][Spec Decode] FlashMLA spec decode support (#26541)
MatthewBonanni Oct 14, 2025
acaa2c0
[Core] Reuse empty block lists whenever possible in KVCacheBlocks to …
Jialin Oct 14, 2025
b92ab3d
Notice for deprecation of AutoAWQ (#26820)
HDCharles Oct 14, 2025
380f175
[Perf] Cache vllm.env.__getattr__ result to avoid recomputation (#26146)
Jialin Oct 14, 2025
0e65818
Added MoE configs for llama 4, H200 device with tp=4/8 tuning (#26837)
Dhruvilbhatt Oct 14, 2025
9d69649
fix: response_format for completion (#23212)
Nan2018 Oct 14, 2025
ff4810b
[Minor] Group async_scheduling related fields in model runner init (#…
njhill Oct 14, 2025
a86b4c5
remove attn output view kernel (#26680)
BoyuanFeng Oct 14, 2025
4aed506
[Core] Streamline some structured output related code (#26737)
njhill Oct 14, 2025
7e0ef40
[CI Failure] Fix torchao dep failure for Quantization Test (#26824)
mgoin Oct 14, 2025
0512c04
[frontend][gptoss] Add per turn stats into Harmony Context (#25061)
lacora Oct 14, 2025
579d2e5
[WideEP][P/D] Add usage stats for DP+EP and KV Connector (#26836)
tlrmchlsmth Oct 14, 2025
2dcd12d
[torch.compile] Fix tests for torch==2.9 inductor partition (#26116)
ProExpertProg Oct 14, 2025
07ca70a
[Core][Easy] Use envs.__getattr__ for all Unify to environment variab…
Jialin Oct 15, 2025
9354660
[Bugfix]fix Qwen3 xml tool parser (#26345)
Zhikaiiii Oct 15, 2025
d346000
Merge pull request #15 from dcmaddix/test_pr_final
wcwuwc Oct 15, 2025
bfad142
[BUGFIX][NIXL] quick fix for 'assert self.connector_worker is not Non…
xuechendi Oct 15, 2025
e66d787
Disable FlashInfer sampler by default (#26859)
mgoin Oct 15, 2025
96b9aa5
[Frontend][torch.compile] CompilationConfig Overhaul (#20283): name c…
morrison-turnansky Oct 15, 2025
a2986b3
[Bugfix] Fixes prefix-repetition benchmark script (#26828)
kouroshHakha Oct 15, 2025
85a65e7
[Model] Add DeepSeek-V3.1 reasoning parser (split from PR #24972) (#2…
taohui Oct 15, 2025
c43ca82
[Docs] Move build.inc into arm.inc (#26862)
windsonsea Oct 15, 2025
e471d7c
[CI/Build][Bugfix] fix qutlass cmake error when set QUTLASS_SRC_DIR (…
izhuhaoran Oct 15, 2025
a27b288
[Feature] default --extra-body param to disable thinking in vllm benc…
lengrongfu Oct 15, 2025
7cfa420
[BugFix] Patch inductor partitioning logic (#26735)
angelayi Oct 15, 2025
8c851f6
[Bugfix] Fix qwen3-omni audio truncation issue (#26815)
Isotr0py Oct 15, 2025
f0862ea
[Graph Partition] pass tests for decorator (#26831)
BoyuanFeng Oct 15, 2025
8865da1
[Bugfix][Multi Modal] Fix incorrect Molmo token processing (#26873)
sangho-vision Oct 15, 2025
302ef40
[DSA][MLA] Tiny refactor on DeepSeek to make it reusable for differen…
MengqingCao Oct 15, 2025
b8a4572
[Misc] Use helper function to generate dummy messages in OpenAI MM te…
DarkLight1337 Oct 15, 2025
efdef57
[bugfix] Lazy import cv2 (#26869)
angelayi Oct 15, 2025
f5ed68e
[Deepseek-V3.2][Kernel] Integrate cuda indexer k cache gather (#26456)
zyongye Oct 15, 2025
f3c378f
[CI/Build] Add Qwen2.5-VL-7B-Instruct ChartQA Accuracy Tests in CI (#…
zhewenl Oct 15, 2025
71557a5
[CI] Fix mypy for `vllm/executor` (#26845)
yewentao256 Oct 15, 2025
6256697
[Doc] ruff format remaining Python examples (#26795)
DarkLight1337 Oct 15, 2025
650b51f
[doc] add Context Parallel Deployment doc (#26877)
youkaichao Oct 15, 2025
5210dc3
[Misc] Update TritonLanguagePlaceholder to have attributes that are u…
madongfly Oct 15, 2025
5c3bae1
[Fix] Remove divisibility requirement between num_kv_heads and tp_siz…
ant-yy Oct 15, 2025
7f83b4e
[Easy] Get rid of unnecessary paraenthesis in kv_cache_manager (#26842)
Jialin Oct 15, 2025
db1764e
[Platform] allow platform to init dp group (#22243)
wangxiyuan Oct 15, 2025
d4d1a60
[Lora]Load tuned multi-lora kernel configs from json files (#26319)
li2haipeng Oct 15, 2025
f54f851
[Model][2/N] Improve all pooling task | Support multi-vector retrieva…
noooop Oct 15, 2025
f93e348
[Misc] Remove `isort` and `yapf` ignores (#26888)
DarkLight1337 Oct 15, 2025
8f4b313
[Misc] rename torch_dtype to dtype (#26695)
wangxiyuan Oct 15, 2025
5d59868
chore: remove unused marker (#26890)
max-wittig Oct 15, 2025
f574383
[BugFix] Patch inductor memory plan logic (#26878)
BoyuanFeng Oct 15, 2025
136a17f
[Chore] Separate out `vllm.utils.func` (#26904)
DarkLight1337 Oct 15, 2025
828523a
[Chore] Separate out `vllm.utils.async_utils` (#26913)
DarkLight1337 Oct 15, 2025
d3cbaa0
Lower sevarity of log when model info cache misses due to exception (…
hmellor Oct 15, 2025
4794c2b
Olmo 3 tool parser and tests (#26143)
pdasigi Oct 15, 2025
14f8456
[Feature]: Use pydantic validation in observability.py config (#26637)
cern1710 Oct 15, 2025
d796375
[ModelOpt] Remove NVFP4 MoE K%16==0 constraint (#26891)
XiaobingSuper Oct 15, 2025
a106362
[Chore] Clean up CODEOWNERS (#26923)
WoosukKwon Oct 15, 2025
de92d91
[NVIDIA] Add support for cudnn fp4 gemm via flashinfer (#26107)
kaixih Oct 15, 2025
1f491aa
Vectorize RMS norm variance using vectorize_read_with_alignment (#26234)
bbeckca Oct 15, 2025
0b99f5d
support flashinfer_fp4 moe for 5090 gpu (#26669)
XiaobingSuper Oct 15, 2025
e5b438a
[Bug] Temporally Disable `VLLM_ALLREDUCE_USE_SYMM_MEM` by Default (#2…
yewentao256 Oct 15, 2025
0a9ef0c
Move query quantization to attention layer for Flashinfer & Triton. (…
adabeyta Oct 15, 2025
938c43e
[ci] Adjusting AMD test composition 2025-10-14 (#26852)
Alexei-V-Ivanov-AMD Oct 15, 2025
f96bc36
[Qwen3-Next] Add tuned MoE config for Qwen3-Next FP8 on H100 tp2 (#26…
felixzhu555 Oct 16, 2025
0ecc553
[Bugfix] reasoning_parser parameter handling in run_batch.py (#26225)
inc-jeong Oct 16, 2025
1317034
[ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (#…
kliuae Oct 16, 2025
f8a0acb
[CI] Enable Blackwell Llama4 MoE tests (#26731)
mgoin Oct 16, 2025
582f2c6
[BUG] Allow runai_streamer_sharded in config check (#26958)
ahao-anyscale Oct 16, 2025
e19b16d
[bugfix] Fix SP + PP without specifying compile size (#26955)
angelayi Oct 16, 2025
9b6504c
[BugFix] Work around graph partition x torch.compile cache issue (#26…
zou3519 Oct 16, 2025
509cdc0
[DOC][XPU]update feature parity with Intel GPU (#26954)
xuechendi Oct 16, 2025
f6cdc9a
[Chore] Rename `utils` submodules (#26920)
DarkLight1337 Oct 16, 2025
785d8b6
[PERF] Qwen3-next MTP speedup (change bool mask indexing to index_sel…
vadiklyutiy Oct 16, 2025
7d8975d
Deepseek-v3 Batch Invariant on 8xH100 (#26609)
bwasti Oct 16, 2025
76f0d05
[CI/Build] Update expected beam search output for Phi3V (#26978)
DarkLight1337 Oct 16, 2025
f7d318d
[Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (…
Akashcodes732 Oct 16, 2025
44c8555
[CI/Build] Fix AMD import failures in CI (#26841)
zhewenl Oct 16, 2025
17838e5
[Benchmark] Use truncation by default for pooling benchmarks (#26992)
DarkLight1337 Oct 16, 2025
d2740fa
[Chore] Separate out `vllm.utils.collections` (#26990)
DarkLight1337 Oct 16, 2025
e519287
[Model][Bugfix] fix ernie45 vl run failed from shared experts optimiz…
CSWYF3634076 Oct 16, 2025
ed344f4
Cleanup code after Python 3.10 upgrade (#26520)
lgeiger Oct 16, 2025
00417f4
[MISC] fix import violations for re and triton modules (#26654)
llsj14 Oct 16, 2025
dcbb3f1
[Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (#27008)
bogdanminko Oct 16, 2025
334535b
[Benchmark] Show E2EL by default for pooling models (#27014)
DarkLight1337 Oct 16, 2025
314fa8a
[Attention] Tune CUTLASS MLA num_splits (#26846)
MatthewBonanni Oct 16, 2025
4a510ab
[NIXL] Improve request_finished() debug logs (#25665)
markmc Oct 16, 2025
02d709a
[docs] standardize Hugging Face env var to `HF_TOKEN` (deprecates `HU…
yankay Oct 16, 2025
8f4a4be
Add test
jeejeelee Oct 16, 2025
43721bc
[CI] Replace large models with tiny alternatives in tests (#24057)
tahsintunan Oct 16, 2025
5afd327
[Feature] Add process_weights_after_loading to AttentionImpl (#26870)
lengrongfu Oct 16, 2025
9f4e309
[Model] Fix Qwen3VL mm mapping (#27027)
jeejeelee Oct 16, 2025
7bb736d
Fix Qwen2.5 VL image grid docstring (#27033)
skyloevil Oct 16, 2025
aa255ff
Support `set` in the CLI generation (#27031)
hmellor Oct 16, 2025
e6ba200
[gpt-oss][1/N] EZ: refactor serving_responses for modularity (#26948)
qandrew Oct 16, 2025
ac3ed5a
Support block size of 256 used by Intel HPU (#26883)
mandy-li Oct 16, 2025
a5464dc
[Compressed Tensors] Always clone output for compile robustness (#26849)
kylesayrs Oct 16, 2025
013abde
Adding Warmup to Benchmark Serving (#26943)
kimbochen Oct 16, 2025
2ed8b6b
[Bug] Fix batch invariant test `has` to `is` (#27032)
yewentao256 Oct 16, 2025
fb0571b
[GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kern…
varun-sundar-rabindranath Oct 16, 2025
b3dda72
[Feature] Migrate DeepGEMM API from `get_m_alignment_for_contiguous_l…
yewentao256 Oct 16, 2025
01c977e
[CI] Prune Quantization Tests and skip compilation (#27038)
mgoin Oct 16, 2025
23583ee
[Bug] Add Assertion for `random-input-len` / `random-output-len` (#26…
yewentao256 Oct 16, 2025
b2f78cb
[small][batch invariance] Rename the env and internal flags to simpli…
bwasti Oct 16, 2025
fb5e10d
Refactor Transformers backend to use mixins (#26906)
hmellor Oct 16, 2025
41d3071
[NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughp…
jiahanc Oct 16, 2025
11ae016
[torch.compile] Passing only necessary compilation config to inductor…
luccafong Oct 17, 2025
4d4d6ba
[Chore] Separate out `vllm.utils.importlib` (#27022)
DarkLight1337 Oct 17, 2025
17c540a
[torch.compile] fix simple inductor graph partition test (#27050)
BoyuanFeng Oct 17, 2025
4d055ef
Remove unused imports (#26972)
lgeiger Oct 17, 2025
965c5f4
vllm bench serve shows num of failed requests (#26478)
tomasruizt Oct 17, 2025
4ffd6e8
[Docs] Reduce custom syntax used in docs (#27009)
hmellor Oct 17, 2025
ab81379
[Perf] Exploit out-of-band buffers in shm_broadcast (#26961)
njhill Oct 17, 2025
0840560
disable graph partition in custom op (#26952)
BoyuanFeng Oct 17, 2025
bde9e22
[Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually …
sighingnow Oct 17, 2025
fe3b937
[Core] Change `execute_model_with_error_logging()` to be a ctx manage…
njhill Oct 17, 2025
87bc0c4
[Bugfix] Fix ReplicatedLinearWithLoRA (#27065)
jeejeelee Oct 17, 2025
fec2b34
[Kernel] Lazy import FlashInfer (#26977)
jeejeelee Oct 17, 2025
9c2c228
[CI/Build] Update Llama4 eval yaml (#27070)
zhewenl Oct 17, 2025
8c017b3
[Model] Always use Transformers backend for PaliGemma and Gemma3-MM (…
DarkLight1337 Oct 17, 2025
3aeb19a
[Model] Add support for LightOnOCR (#26916)
staghado Oct 17, 2025
35eb0e9
fix precommit ofr fused_marlin_moe
Oct 17, 2025
5550ff9
[CI/Build] Update compressed tensor test path to fix CPU CI (#27068)
bigPYJ1151 Oct 17, 2025
75c7ad9
[Kernel][Performance] Fuse float cast and renormalize to topk softmax…
izhuhaoran Oct 17, 2025
acb1bfa
[CI] fix docs build failed (#27082)
chaunceyjiang Oct 17, 2025
bbc1b29
Update troubleshooting.md and remind VLLM_TRACE_FUNCTION usage (#27069)
Prowindy Oct 17, 2025
e20eba7
[VLM][Refactor] Remove useless func `get_input_positions` in `MRotary…
MengqingCao Oct 17, 2025
483ea64
[Docs] Replace all explicit anchors with real links (#27087)
hmellor Oct 17, 2025
6c9fdbf
[Docs] Replace `rst` style double-backtick with `md` single-backtick …
hmellor Oct 17, 2025
daec4d2
[Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapp…
jeejeelee Oct 17, 2025
c253745
[Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350…
rkarhila-amd Oct 17, 2025
be429d0
Fix incorrect docstring for stop_profile() method (#27101)
hyongtao-code Oct 17, 2025
bd7157a
[torch.compile] Enable attention and allreduce fusion without custom …
ProExpertProg Oct 17, 2025
2ba60ec
[CI] Nixl integration tests (#27010)
NickLucche Oct 17, 2025
b038d9c
[Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (#26…
patrickvonplaten Oct 17, 2025
4c91a28
[bugfix] Qwen3-VL fix video incorrect timestamp calculations while do…
wulipc Oct 17, 2025
99722d5
[CI] Remove forbidden slash (#27112)
NickLucche Oct 17, 2025
0925b28
[ROCM] MoE fp4 CK kernel (#26545)
maleksan85 Oct 17, 2025
b10c64c
[ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_mo…
rasmith Oct 17, 2025
e33ee23
[Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused Mo…
vllmellm Oct 17, 2025
3125d79
[Chore] Remove unused `PolyNorm` layer (#27110)
Isotr0py Oct 17, 2025
950cf9e
[Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131…
mgoin Oct 17, 2025
d29483b
[Minor] Remove unnecessary error message (#27115)
zhuohan123 Oct 17, 2025
acedc74
[V1][Spec Decode] Fix greedy temperature detection after sampler refa…
Pradyun92 Oct 17, 2025
f50cc22
[Test] Make `test_failure` more stable for batch invariance (#27054)
yewentao256 Oct 17, 2025
6367bde
[BugFix][Core] Fix error when enable async-scheduling in multi-node e…
lhtin Oct 17, 2025
c981f0e
[Perf] Add H100 fused MoE config (#25398)
skyloevil Oct 18, 2025
c312320
[CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV…
hl475 Oct 18, 2025
7c57254
[GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (#25515)
Hanchenli Oct 18, 2025
30a33b9
[Misc] Rev DeepEP (#27122)
varun-sundar-rabindranath Oct 18, 2025
12e2170
[DOC][FEATURES][CPU]update cpu feature for v1 (#27135)
xuechendi Oct 18, 2025
8300402
[Test] Add test for /health endpoint on engine failure (#26074)
dongbo910220 Oct 18, 2025
29dc8eb
added fused_moe_lora_kernel test code and fixed its bug
wcwuwc Oct 18, 2025
1d165d6
[Chore] Separate out `vllm.utils.mem_utils` (#27143)
iAmir97 Oct 18, 2025
7d320c6
Merge pull request #19 from dcmaddix/test_pr_10_14
wcwuwc Oct 18, 2025
245e4f2
[Feature] Batch Invariant: Support DeepGEMM and Blackwell (#27127)
yewentao256 Oct 18, 2025
ab4be40
[fix][cpu] fix prefill attention in CPU attention backend (#27035)
fadara01 Oct 18, 2025
b26b70b
[Misc] Refactor `get_kv_cache_spec` into `AttentionLayerBase` (#26587)
NickLucche Oct 18, 2025
5d2e4d8
avoid tensor.full improve performance
wcwuwc Oct 18, 2025
5c2acb2
[Models][QwenVL] Remove unnecessary `.contiguous()` calls (#27106)
lgeiger Oct 18, 2025
a7148f8
clean
wcwuwc Oct 18, 2025
6ac5e06
[Chore] Clean up pytorch helper functions in `vllm.utils` (#26908)
Isotr0py Oct 18, 2025
168e578
Fix incorrect string formatting in barrier timeout exceptions (#27149)
hyongtao-code Oct 18, 2025
3b45075
[Minor] Add some clarifying comments to recent changes (#27130)
njhill Oct 18, 2025
9f020f4
[BugFix] Fix failing gemma-3-1b-it test: `test_lm_eval_accuracy_v1_en…
LucasWilkinson Oct 18, 2025
a1946c9
[Chore] Separate out profiling utilities from vllm.utils (#27150)
dongbo910220 Oct 18, 2025
e133d6d
[BugFix] fix graph partition signature (#27139)
BoyuanFeng Oct 18, 2025
c2bba69
[BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (#27121)
LucasWilkinson Oct 18, 2025
83e760c
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` …
ptovam Oct 18, 2025
6b2aa4a
clean
wcwuwc Oct 19, 2025
d7057cd
clean
wcwuwc Oct 19, 2025
fb86067
[Minor] Remove unused env variable (#27161)
WoosukKwon Oct 19, 2025
191eed0
[BugFix] Fix lazy imports involving outlines_core (#27158)
22quinn Oct 19, 2025
8a29711
[Chore] Separate out hashing utilities from vllm.utils (#27151)
dongbo910220 Oct 19, 2025
b3aba04
[Benchmark] Convenience script for multiple parameter combinations (#…
DarkLight1337 Oct 19, 2025
221bf72
output type conversion fix (#27159)
jianyuh Oct 19, 2025
7a6c8c3
[Chore] Separate out `vllm.utils.network_utils` (#27164)
iAmir97 Oct 19, 2025
449f41a
Fix tes
jeejeelee Oct 19, 2025
7af116d
Fix test
jeejeelee Oct 19, 2025
d31f784
[Misc] Move utils to avoid conflicts with stdlib, and move tests (#27…
DarkLight1337 Oct 19, 2025
e03c892
merge main into fused_moe_lora
wcwuwc Oct 19, 2025
528900f
clean
wcwuwc Oct 19, 2025
0a29b72
enable passing activation func so act_wrapper in lora will be called …
Oct 19, 2025
bb01436
move default act to beginning of file
Oct 19, 2025
f0ba7dd
clean
wcwuwc Oct 19, 2025
07f7b7c
clean
wcwuwc Oct 19, 2025
570df8c
Merge pull request #20 from dcmaddix/test_memory_err
wcwuwc Oct 19, 2025
f6fdacd
[Bugfix] Fix error with penalties when speculative decoding and struc…
southfreebird Oct 19, 2025
8a81d77
Fix typo in ValueError message: use `kv_role` instead of `kv_disagg_r…
hyongtao-code Oct 19, 2025
4ac5bfd
Merge branch 'main' into fused_moe_lora
jeejeelee Oct 20, 2025
f32bf75
[Model][VLM] Support Bee-8B Model (#27012)
uyzhang Oct 20, 2025
b63f214
[LoRA] LoRA cuda graph specialization (#25914)
andylolu2 Oct 20, 2025
9fce7be
[Kernel] Accelerate solve_tril with TMA (#26746)
ZJY0516 Oct 20, 2025
1c691f4
AArch64 CPU Docker pipeline (#26931)
ioghiban Oct 20, 2025
e93ff6c
Nemotron Nano V2 VL + EVS Video Support (#27107)
BloodAxe Oct 20, 2025
4d0f266
[Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on…
shivampr Oct 20, 2025
f9e7ad5
[Bugfix][CI] Fix `Distributed Tests (4 GPUs)` async_sched+ray test (#…
NickLucche Oct 20, 2025
5b7ae52
Merge branch 'main' into fused_moe_lora
ywang96 Oct 20, 2025
87778d5
[Feature][Quantization] auto_round support for mixed bits quantizatio…
n1ck-guo Oct 20, 2025
6cba4c6
Merge branch 'main' into fused_moe_lora
ywang96 Oct 20, 2025
ee0ec49
Load tuned fused_moe_lora shrink and expand kernel configs separately…
yugong333 Oct 21, 2025
8 changes: 4 additions & 4 deletions .buildkite/check-wheel-size.py
@@ -5,11 +5,11 @@
import sys
import zipfile

-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 400 MiB
-# Note that we have 400 MiB quota, please use it wisely.
-# See https://github.com/pypi/support/issues/3792 .
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 500 MiB
+# Note that we have 800 MiB quota, please use it wisely.
+# See https://github.com/pypi/support/issues/6326 .
# Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 400))
+VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 500))


def print_top_10_largest_files(zip_file):
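The gate itself is just an environment override plus a size comparison. A minimal sketch of that pattern (illustrative only; how the real script locates the wheel and measures its size may differ):

import os
import sys

# Default mirrors the new 500 MiB limit; CI can still override it, e.g.
# VLLM_MAX_SIZE_MB=400 python check_wheel.py dist/vllm-*.whl  (hypothetical invocation)
limit_mb = int(os.environ.get("VLLM_MAX_SIZE_MB", 500))


def check_wheel(path: str) -> int:
    # Compare the wheel's on-disk size against the limit and report the result.
    size_mb = os.path.getsize(path) / (1024 * 1024)
    print(f"{path}: {size_mb:.1f} MiB (limit {limit_mb} MiB)")
    return 0 if size_mb <= limit_mb else 1


if __name__ == "__main__":
    sys.exit(check_wheel(sys.argv[1]))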
23 changes: 21 additions & 2 deletions .buildkite/generate_index.py
@@ -8,7 +8,8 @@
<html>
<body>
<h1>Links for vLLM</h1/>
-<a href="../{wheel_html_escaped}">{wheel}</a><br/>
+<a href="../{x86_wheel_html_escaped}">{x86_wheel}</a><br/>
+<a href="../{arm_wheel_html_escaped}">{arm_wheel}</a><br/>
</body>
</html>
"""
@@ -21,7 +22,25 @@

with open("index.html", "w") as f:
print(f"Generated index.html for {args.wheel}")
# sync the abi tag with .buildkite/scripts/upload-wheels.sh
if "x86_64" in filename:
x86_wheel = filename
arm_wheel = filename.replace("x86_64", "aarch64").replace(
"manylinux1", "manylinux2014"
)
elif "aarch64" in filename:
x86_wheel = filename.replace("aarch64", "x86_64").replace(
"manylinux2014", "manylinux1"
)
arm_wheel = filename
else:
raise ValueError(f"Unsupported wheel: {filename}")
# cloudfront requires escaping the '+' character
f.write(
template.format(wheel=filename, wheel_html_escaped=filename.replace("+", "%2B"))
template.format(
x86_wheel=x86_wheel,
x86_wheel_html_escaped=x86_wheel.replace("+", "%2B"),
arm_wheel=arm_wheel,
arm_wheel_html_escaped=arm_wheel.replace("+", "%2B"),
)
)
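To make the new branching concrete, here is the transformation applied to a hypothetical wheel filename (the filename is invented for illustration; the arch and manylinux swaps are exactly the ones in the diff above):

filename = "vllm-0.11.1+cu129-cp38-abi3-manylinux1_x86_64.whl"  # hypothetical

if "x86_64" in filename:
    x86_wheel = filename
    arm_wheel = filename.replace("x86_64", "aarch64").replace("manylinux1", "manylinux2014")
elif "aarch64" in filename:
    x86_wheel = filename.replace("aarch64", "x86_64").replace("manylinux2014", "manylinux1")
    arm_wheel = filename
else:
    raise ValueError(f"Unsupported wheel: {filename}")

print(x86_wheel)  # vllm-0.11.1+cu129-cp38-abi3-manylinux1_x86_64.whl
print(arm_wheel)  # vllm-0.11.1+cu129-cp38-abi3-manylinux2014_aarch64.whl
# CloudFront requires escaping '+', hence the "%2B" substitution in the hrefs.
print(arm_wheel.replace("+", "%2B"))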
@@ -0,0 +1,12 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
  metrics:
  - name: "relaxed_accuracy,none"
    # TODO(zhewenl): model card is 0.90, but the actual score is 0.80.
    value: 0.80
limit: 100
num_fewshot: 0
@@ -0,0 +1,10 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "mmlu_pro"
  metrics:
  - name: "exact_match,custom-extract"
    value: 0.80
limit: 250 # will run on 250 * 14 subjects = 3500 samples
num_fewshot: 5
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
+# For vllm script, with -t option (tensor parallel size)
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -l 1319 -t 1
model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
tasks:
- name: "gsm8k"
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m Qwen/Qwen2.5-VL-7B-Instruct -l 2500 -t 1

model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
  metrics:
  - name: "relaxed_accuracy,none"
    value: 0.855
limit: 2500
num_fewshot: 0
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large-h100.txt
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
1 change: 0 additions & 1 deletion .buildkite/lm-eval-harness/configs/models-large.txt
@@ -3,4 +3,3 @@ Meta-Llama-3-70B-Instruct.yaml
Mixtral-8x7B-Instruct-v0.1.yaml
Qwen2-57B-A14-Instruct.yaml
DeepSeek-V2-Lite-Chat.yaml
-Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-mm-small.txt
@@ -0,0 +1 @@
Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,44 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.9

usage() {
echo``
echo "Runs lm eval harness on ChartQA using multimodal vllm."
echo "This pathway is intended to be used to create baselines for "
echo "our correctness tests in vllm's CI."
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -l - limit number of samples to run"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:l:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm-vlm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE" \
--tasks chartqa \
--batch_size auto \
--apply_chat_template \
--limit $LIMIT
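For reference, the CI correctness test (see test_lm_eval_correctness.py further below) drives the same backend through lm-eval's Python API rather than this CLI wrapper. A hedged sketch of the equivalent ChartQA call, reusing the Qwen2.5-VL config values from above (exact result keys and defaults may differ between lm-eval versions):

import lm_eval  # pip install lm-eval==0.4.9

results = lm_eval.simple_evaluate(
    model="vllm-vlm",
    model_args="pretrained=Qwen/Qwen2.5-VL-7B-Instruct,tensor_parallel_size=1",
    tasks=["chartqa"],
    num_fewshot=0,
    limit=2500,
    batch_size="auto",
    apply_chat_template=True,  # mirrors --apply_chat_template in the shell script
)
print(results["results"]["chartqa"])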
2 changes: 1 addition & 1 deletion .buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh
100644 → 100755
@@ -2,7 +2,7 @@
# We can use this script to compute baseline accuracy on GSM for transformers.
#
# Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.4
+# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
@@ -3,7 +3,7 @@
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
-# pip install lm-eval==0.4.4
+# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
50 changes: 50 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh
@@ -0,0 +1,50 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on MMLUPRO for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
echo "Runs lm eval harness on MMLU Pro using huggingface transformers."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow"
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:b:l:f:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
--tasks mmlu_pro --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size auto
12 changes: 9 additions & 3 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -19,21 +19,27 @@
def launch_lm_eval(eval_config, tp_size):
    trust_remote_code = eval_config.get("trust_remote_code", False)
    max_model_len = eval_config.get("max_model_len", 4096)
+    batch_size = eval_config.get("batch_size", "auto")
+    backend = eval_config.get("backend", "vllm")
    model_args = (
        f"pretrained={eval_config['model_name']},"
        f"tensor_parallel_size={tp_size},"
        f"enforce_eager=true,"
        f"add_bos_token=true,"
        f"trust_remote_code={trust_remote_code},"
-        f"max_model_len={max_model_len}"
+        f"max_model_len={max_model_len},"
    )
    results = lm_eval.simple_evaluate(
-        model="vllm",
+        model=backend,
        model_args=model_args,
        tasks=[task["name"] for task in eval_config["tasks"]],
        num_fewshot=eval_config["num_fewshot"],
        limit=eval_config["limit"],
-        batch_size="auto",
+        # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed help
+        # text models. however, this is regressing measured strict-match for
+        # existing text models in CI, so only apply it for mm.
+        apply_chat_template=backend == "vllm-vlm",
+        batch_size=batch_size,
    )
    return results

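With the new "backend" and "batch_size" keys, the multimodal ChartQA configs added earlier run through the same harness: when backend is "vllm-vlm", launch_lm_eval() passes model="vllm-vlm" and enables the chat template. A rough sketch of how one of those configs might be loaded and its metrics checked (the tolerance and the comparison loop are illustrative assumptions, not the actual CI code):

import yaml

# launch_lm_eval is the helper shown in the diff above (test_lm_eval_correctness.py).
RTOL = 0.08  # illustrative tolerance

with open(".buildkite/lm-eval-harness/configs/Qwen2.5-VL-7B-Instruct.yaml") as f:
    eval_config = yaml.safe_load(f)

results = launch_lm_eval(eval_config, tp_size=1)

for task in eval_config["tasks"]:
    for metric in task["metrics"]:
        measured = results["results"][task["name"]][metric["name"]]
        assert measured >= metric["value"] - RTOL, (task["name"], metric["name"], measured)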