Closed

Changes from all commits (317 commits)
742dad0
Fixed model compilation bugs
Akshat-Tripathi Feb 24, 2025
43efe69
Minor changes
Akshat-Tripathi Feb 25, 2025
2eae384
[V1] Get input tokens from scheduler (#13339)
WoosukKwon Feb 17, 2025
70837d6
[V1][PP] Fix intermediate tensor values (#13417)
comaniac Feb 17, 2025
0bbf7db
[V1][Spec decode] Move drafter to model runner (#13363)
WoosukKwon Feb 17, 2025
b7b9248
[Bugfix][CI][V1] Work around V1 + CUDA Graph + torch._scaled_mm fallb…
tlrmchlsmth Feb 18, 2025
f551ab5
[Misc] Remove dangling references to `SamplingType.BEAM` (#13402)
hmellor Feb 18, 2025
d32dd01
[Model] Enable quantization support for `transformers` backend (#12960)
Isotr0py Feb 18, 2025
cfcb3f2
[ROCm] fix get_device_name for rocm (#13438)
divakar-amd Feb 18, 2025
cabcc6e
[v1] fix parallel config rank (#13445)
youkaichao Feb 18, 2025
79dd067
[Quant] Molmo SupportsQuant (#13336)
kylesayrs Feb 18, 2025
682099f
[Quant] Arctic SupportsQuant (#13366)
kylesayrs Feb 18, 2025
ec2ec12
[Bugfix] Only print out chat template when supplied (#13444)
terrytangyuan Feb 18, 2025
f87ec52
[core] fix sleep mode in pytorch 2.6 (#13456)
youkaichao Feb 18, 2025
6329db4
[Quant] Aria SupportsQuant (#13416)
kylesayrs Feb 18, 2025
770eab0
[V1][PP] Fix & Pin Ray version in requirements-cuda.txt (#13436)
WoosukKwon Feb 18, 2025
49d0819
Add outlines fallback when JSON schema has enum (#13449)
mgoin Feb 18, 2025
13c152e
[Bugfix] Ensure LoRA path from the request can be included in err msg…
terrytangyuan Feb 18, 2025
64ebfa3
[Bugfix] Fix failing transformers dynamic module resolving with spawn…
Isotr0py Feb 18, 2025
70fc779
[Doc]: Improve feature tables (#13224)
hmellor Feb 18, 2025
053e304
[Bugfix] Remove noisy error logging during local model loading (#13458)
Isotr0py Feb 18, 2025
151fdfc
[ROCm] Make amdsmi import optional for other platforms (#13460)
DarkLight1337 Feb 18, 2025
7482ce2
[Bugfix] Handle content type with optional parameters (#13383)
zifeitong Feb 18, 2025
bbfee9a
[Bugfix] Fix invalid rotary embedding unit test (#13431)
liangfu Feb 18, 2025
168a9ff
[CI/Build] migrate static project metadata from setup.py to pyproject…
dtrifiro Feb 18, 2025
80f3dc3
[V1][PP] Enable true PP with Ray executor (#13472)
WoosukKwon Feb 18, 2025
5b612c4
[misc] fix debugging code (#13487)
youkaichao Feb 18, 2025
d15cbbe
[V1][Tests] Adding additional testing for multimodal models to V1 (#1…
andoorve Feb 18, 2025
d9b7062
[V1] Optimize handling of sampling metadata and req_ids list (#13244)
njhill Feb 18, 2025
1590617
Pin Ray version to 2.40.0 (#13490)
WoosukKwon Feb 18, 2025
1104f29
[V1][Spec Decode] Optimize N-gram matching with Numba (#13365)
WoosukKwon Feb 18, 2025
ca4b020
[Misc] Remove dangling references to `--use-v2-block-manager` (#13492)
hmellor Feb 19, 2025
8e711a6
[Hardware][Gaudi][Feature] Support Contiguous Cache Fetch (#12139)
Feb 19, 2025
52b3192
[perf-benchmark] Allow premerge ECR (#13509)
khluu Feb 19, 2025
ec0057f
[ROCm][MoE configs] mi325 mixtral & mi300 qwen_moe (#13503)
divakar-amd Feb 19, 2025
85c2fd6
[Doc] Add clarification note regarding paligemma (#13511)
ywang96 Feb 19, 2025
aabd263
[1/n][CI] Load models in CI from S3 instead of HF (#13205)
khluu Feb 19, 2025
66a5a16
[perf-benchmark] Fix ECR path for premerge benchmark (#13512)
khluu Feb 19, 2025
ea48908
use device param in load_model method (#13037)
Zzhiter Feb 19, 2025
7615128
[Bugfix] Fix Positive Feature Layers in Llava Models (#13514)
alex-jw-brooks Feb 19, 2025
25f0d87
[Model][Speculative Decoding] DeepSeek MTP spec decode (#12755)
luccafong Feb 19, 2025
9288c58
[V1][Core] Generic mechanism for handling engine utility (#13060)
njhill Feb 19, 2025
b5c9857
[Feature] Pluggable platform-specific scheduler (#13161)
yannicks1 Feb 19, 2025
7862a7c
[CI/Build] force writing version file (#13544)
dtrifiro Feb 19, 2025
81c6d14
[doc] clarify profiling is only for developers (#13554)
youkaichao Feb 19, 2025
b6e3057
[VLM][Bugfix] Pass processor kwargs properly on init (#13516)
DarkLight1337 Feb 19, 2025
4b2dd14
[Bugfix] Fix device ordinal for multi-node spec decode (#13269)
ShangmingCai Feb 19, 2025
c1e40f8
[doc] clarify multi-node serving doc (#13558)
youkaichao Feb 19, 2025
10020cb
Fix copyright year to auto get current year (#13561)
wilsonwu Feb 19, 2025
f1ead2c
[MISC] Logging the message about Ray teardown (#13502)
comaniac Feb 19, 2025
9e112ca
[Misc] Avoid calling unnecessary `hf_list_repo_files` for local model…
Isotr0py Feb 19, 2025
4b3a90d
[BugFix] Avoid error traceback in logs when V1 `LLM` terminates (#13565)
njhill Feb 20, 2025
ad6b051
[3/n][CI] Load Quantization test models with S3 (#13570)
khluu Feb 20, 2025
3f8fdc1
[Misc] Qwen2.5 VL support LoRA (#13261)
jeejeelee Feb 20, 2025
861c978
[ci] Add AWS creds for AMD (#13572)
khluu Feb 20, 2025
f684038
[ROCm][MoE] mi300 mixtral8x7B perf for specific BS (#13577)
divakar-amd Feb 20, 2025
5cde704
[core] add sleep and wake up endpoint and v1 support (#12987)
youkaichao Feb 20, 2025
82a666b
[bugfix] spec decode worker get tp group only when initialized (#13578)
simon-mo Feb 20, 2025
26232b7
[Misc] Warn if the vLLM version can't be retrieved (#13501)
alex-jw-brooks Feb 20, 2025
4c3177b
[Misc] add mm_processor_kwargs to extra_body for Qwen2.5-VL (#13533)
wulipc Feb 20, 2025
1d457c7
[ROCm] MI300A compile targets deprecation (#13560)
gshtras Feb 20, 2025
3e32a6a
[API Server] Add port number range validation (#13506)
terrytangyuan Feb 20, 2025
4ba521b
[CI/Build] Use uv in the Dockerfile (#13566)
mgoin Feb 20, 2025
a633353
[ci] Fix spec decode test (#13600)
khluu Feb 20, 2025
66b0fd7
[2/n][ci] S3: Use full model path (#13564)
khluu Feb 20, 2025
a1ddfdd
[Kernel] LoRA - Refactor sgmv kernels (#13110)
varun-sundar-rabindranath Feb 20, 2025
ae77311
Merge similar examples in `offline_inference` into single `basic` exa…
hmellor Feb 20, 2025
ecfd03e
[Bugfix] Fix deepseekv3 grouped topk error (#13474)
Chen-XiaoBing Feb 20, 2025
55d7ec5
Update `pre-commit`'s `isort` version to remove warnings (#13614)
hmellor Feb 20, 2025
63645d2
[V1][Minor] Print KV cache size in token counts (#13596)
WoosukKwon Feb 20, 2025
6b81301
fix neuron performance issue (#13589)
ajayvohra2005 Feb 20, 2025
07f3c1e
[Frontend] Add backend-specific options for guided decoding (#13505)
joerunde Feb 20, 2025
cfb63a6
[Bugfix] Fix max_num_batched_tokens for MLA (#13620)
mgoin Feb 21, 2025
1296538
[Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to ma…
lingfanyu Feb 21, 2025
0fad7f4
Add llmaz as another integration (#13643)
kerthcet Feb 21, 2025
b4c6249
[Misc] Adding script to setup ray for multi-node vllm deployments (#…
Edwinhr716 Feb 21, 2025
13058c7
[NVIDIA] Fix an issue to use current stream for the nvfp4 quant (#13632)
kaixih Feb 21, 2025
75c35cb
Use pre-commit to update `requirements-test.txt` (#13617)
hmellor Feb 21, 2025
761be24
[Bugfix] Add `mm_processor_kwargs` to chat-related protocols (#13644)
ywang96 Feb 21, 2025
d10b6ad
[V1][Sampler] Avoid an operation during temperature application (#13587)
njhill Feb 21, 2025
3fa24d2
Missing comment explaining VDR variable in GGUF kernels (#13290)
SzymonOzog Feb 21, 2025
6cf9d59
[FEATURE] Enables /score endpoint for embedding models (#12846)
gmarinho2 Feb 21, 2025
fe51a3d
[ci] Fix metrics test model path (#13635)
khluu Feb 21, 2025
0161781
[Kernel]Add streamK for block-quantized CUTLASS kernels (#12978)
Hongbosherlock Feb 21, 2025
a9c0ffb
[Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation …
Isotr0py Feb 21, 2025
feaf88e
fix typo of grafana dashboard, with correct datasource (#13668)
johnzheng1975 Feb 21, 2025
855ecfb
[Attention] MLA with chunked prefill (#12639)
LucasWilkinson Feb 21, 2025
fe90015
[Misc] Fix yapf linting tools etc not running on pre-commit (#13695)
Isotr0py Feb 22, 2025
025d6e6
docs: Add a note on full CI run in contributing guide (#13646)
terrytangyuan Feb 22, 2025
3752fb3
[HTTP Server] Make model param optional in request (#13568)
youngkent Feb 22, 2025
74b3937
[Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid…
WangErXiao Feb 22, 2025
3cb4af2
[Misc] Capture and log the time of loading weights (#13666)
waltforme Feb 22, 2025
ecfe822
[ROCM] fix native attention function call (#13650)
gongdao123 Feb 22, 2025
836ec6b
[Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA (#13687)
2015aroras Feb 22, 2025
ce71cab
[Misc] Bump compressed-tensors (#13619)
dsikka Feb 22, 2025
225c16d
[Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend …
WangErXiao Feb 22, 2025
b8466ec
[v1] Support allowed_token_ids in v1 Sampler (#13210)
houseroad Feb 22, 2025
623a414
[Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejectio…
JenZhao Feb 22, 2025
2bf686d
Correction to TP logic for Mamba Mixer 2 when Num Groups not divisibl…
fabianlim Feb 22, 2025
c4dbd88
[V1][Metrics] Support `vllm:cache_config_info` (#13299)
markmc Feb 22, 2025
35380d4
[Metrics] Add `--show-hidden-metrics-for-version` CLI arg (#13295)
markmc Feb 22, 2025
e4f5b9c
[Misc] Reduce LoRA-related static variable (#13166)
jeejeelee Feb 22, 2025
439a0ea
[CI/Build] Fix pre-commit errors (#13696)
DarkLight1337 Feb 22, 2025
f1a809e
[core] set up data parallel communication (#13591)
youkaichao Feb 22, 2025
4392f14
[ci] fix linter (#13701)
youkaichao Feb 22, 2025
f605734
Support SSL Key Rotation in HTTP Server (#13495)
youngkent Feb 22, 2025
b296490
[NVIDIA] Support nvfp4 cutlass gemm (#13571)
kaixih Feb 22, 2025
3ffae46
[V1][Kernel] Refactor the prefix_prefill kernel so that the caller no…
SageMoore Feb 22, 2025
7626844
[ROCm] Apply FP8 weights padding to values not divisible by 512 bytes…
gshtras Feb 22, 2025
886189b
[Doc] Dockerfile instructions for optional dependencies and dev trans…
DarkLight1337 Feb 22, 2025
994dad5
[Bugfix] Fix boolean conversion for OpenVINO env variable (#13615)
helena-intel Feb 22, 2025
04cb85a
[XPU]fix setuptools version for xpu (#13548)
yma11 Feb 22, 2025
8f46c13
[CI/Build] fix uv caching in Dockerfile (#13611)
dtrifiro Feb 22, 2025
e46908b
[CI/Build] Fix pre-commit errors from #13571 (#13709)
ywang96 Feb 23, 2025
e5ad78f
[BugFix] Minor: logger import in attention backend (#13706)
andylolu2 Feb 23, 2025
5c71345
[ci] Use env var to control whether to use S3 bucket in CI (#13634)
khluu Feb 23, 2025
e684adb
[Quant] BaiChuan SupportsQuant (#13710)
kylesayrs Feb 23, 2025
aad6f8a
[LMM] Implement merged multimodal processor for whisper (#13278)
Isotr0py Feb 23, 2025
efddd99
[Core][Distributed] Use IPC (domain socket) ZMQ socket for local comm…
njhill Feb 23, 2025
1d8bcf7
[Misc] Deprecate `--dataset` from `benchmark_serving.py` (#13708)
ywang96 Feb 23, 2025
dc8db38
[v1] torchrun compatibility (#13642)
youkaichao Feb 23, 2025
9226797
[V1][BugFix] Fix engine core client shutdown hangs (#13298)
njhill Feb 23, 2025
bc7c0aa
Fix some issues with benchmark data output (#13641)
huydhn Feb 24, 2025
fa87a0a
[ci] Add logic to change model to S3 path only when S3 CI env var is …
khluu Feb 24, 2025
49f7ae2
[V1][Core] Fix memory issue with logits & sampling (#13721)
ywang96 Feb 24, 2025
72f1743
[model][refactor] remove cuda hard code in models and layers (#13658)
MengqingCao Feb 24, 2025
458d3a9
[Bugfix] fix(logging): add missing opening square bracket (#13011)
bufferoverflow Feb 24, 2025
449c61f
[CI/Build] add python-json-logger to requirements-common (#12842)
bufferoverflow Feb 24, 2025
521dce4
Expert Parallelism (EP) Support for DeepSeek V2 (#12583)
cakeng Feb 24, 2025
140913e
[BugFix] Illegal memory access for MoE On H20 (#13693)
Abatom Feb 24, 2025
06b6876
[Misc][Docs] Raise error when flashinfer is not installed and `VLLM_A…
NickLucche Feb 24, 2025
0395274
[V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) …
afeldman-nm Feb 24, 2025
26b42da
Revert "[V1][Core] Fix memory issue with logits & sampling" (#13775)
ywang96 Feb 24, 2025
c912e7f
Fix precommit fail in fused_moe intermediate_cache2 chunking (#13772)
mgoin Feb 24, 2025
92b6a8f
[Misc] Clean Up `EngineArgs.create_engine_config` (#13734)
robertgshaw2-redhat Feb 24, 2025
2a643b9
[Misc][Chore] Clean Up `AsyncOutputProcessing` Logs (#13780)
robertgshaw2-redhat Feb 25, 2025
ae7df05
Remove unused kwargs from model definitions (#13555)
hmellor Feb 25, 2025
9d08ecc
[Doc] arg_utils.py: fixed a typo (#13785)
eli-b Feb 25, 2025
82eef03
[Misc] set single whitespace between log sentences (#13771)
cjackal Feb 25, 2025
0fe54d7
[Bugfix][Quantization] Fix FP8 + EP (#13784)
tlrmchlsmth Feb 25, 2025
8c6e1ca
[Misc][Attention][Quantization] init property earlier (#13733)
wangxiyuan Feb 25, 2025
e48f1b8
[V1][Metrics] Implement vllm:lora_requests_info metric (#13504)
markmc Feb 25, 2025
6ff6dd0
[Bugfix] Fix deepseek-v2 error: "missing 1 required positional argume…
LucasWilkinson Feb 25, 2025
3660d88
[Bugfix] Support MLA for CompressedTensorsWNA16 (#13725)
mgoin Feb 25, 2025
9127c8e
Fix CompressedTensorsWNA16MoE with grouped scales (#13769)
mgoin Feb 25, 2025
36646a8
[Core] LoRA V1 - Add add/pin/list/remove_lora functions (#13705)
varun-sundar-rabindranath Feb 25, 2025
4d5b2e3
[Misc] Check that the model can be inspected upon registration (#13743)
DarkLight1337 Feb 25, 2025
53fd485
[Core] xgrammar: Expand list of unsupported jsonschema keywords (#13783)
russellb Feb 25, 2025
a6a99fe
[Bugfix] Modify modelscope api usage in transformer_utils (#13807)
shen-shanshan Feb 25, 2025
77c117b
[misc] Clean up ray compiled graph type hints (#13731)
ruisearch42 Feb 25, 2025
a5f5674
[Feature] Support KV cache offloading and disagg prefill with LMCache…
YaoJiayi Feb 25, 2025
d8c31f3
[ROCm][Quantization][Kernel] Using HIP FP8 header (#12593)
gshtras Feb 25, 2025
ab1bdb3
[CI/Build] Fix V1 LoRA failure (#13767)
jeejeelee Feb 25, 2025
2248377
[Misc]Clarify Error Handling for Non-existent Model Paths and HF Repo…
Chen-0210 Feb 25, 2025
5bf3a9b
[Bugfix] Initialize attention bias on the same device as Query/Key/Va…
edwardzjl Feb 25, 2025
be48b41
[Bugfix] Flush TunableOp results before worker processes are destroye…
naromero77amd Feb 25, 2025
6d05cde
[Bugfix] Fix deepseek-vl2 inference with more than 2 images (#13818)
Isotr0py Feb 25, 2025
f63bfea
Fix `/v1/audio/transcriptions ` Bad Request Error (#13811)
HermitSun Feb 25, 2025
d8265fd
[Bugfix] Revert inspection code in #13743 (#13832)
DarkLight1337 Feb 25, 2025
8c32ae8
Fix string parsing error (#13825)
Chen-0210 Feb 25, 2025
fecd1c2
[Neuron] Add custom_ops for neuron backend (#13246)
liangfu Feb 25, 2025
e78fc4b
Fix failing `MyGemma2Embedding` test (#13820)
hmellor Feb 25, 2025
5b754aa
[Model] Support Grok1 (#13795)
mgoin Feb 26, 2025
1f90308
DeepSeek V2/V3/R1 only place `lm_head` on last pp rank (#13833)
hmellor Feb 26, 2025
1b1e51d
[misc] Show driver IP info when Ray fails to allocate driver worker (…
ruisearch42 Feb 26, 2025
cfb690d
[V1][Spec Decode] Change Spec Decode Rejection Sampling API (#13729)
LiuXiaoxuanPKU Feb 26, 2025
c1bc1ac
[Misc]Code Cleanup (#13859)
noemotiovon Feb 26, 2025
db656e2
[Kernel][Build/CI] Bump CUTLASS to 3.8 and add initializers for cutla…
henrylhtsang Feb 26, 2025
043428f
Improve pipeline partitioning (#13839)
hmellor Feb 26, 2025
734cb2e
[Doc] fix the incorrect module path of tensorize_vllm_model (#13863)
tianyuzhou95 Feb 26, 2025
021ef73
[ROCm] Disable chunked prefill/prefix caching when running MLA on non…
SageMoore Feb 26, 2025
77ca08e
[v0][Core] Use xgrammar shared context to avoid copy overhead for off…
sethkimmel3 Feb 26, 2025
241fa24
[Misc] Improve LoRA spelling (#13831)
Akshat-Tripathi Mar 3, 2025
101ff85
[Misc] Fix input processing for Ultravox (#13871)
ywang96 Feb 26, 2025
9001578
[Bugfix] Add test example for Ultravox v0.5 (#13890)
DarkLight1337 Feb 26, 2025
b8fe8c1
Add comments on accessing `kv_cache` and `attn_metadata` (#13887)
hmellor Feb 26, 2025
71fa6b5
[Bugfix] Handle None parameters in Mistral function calls. (#13786)
fgreinacher Feb 26, 2025
87b6aeb
[Misc]: Add support for goodput on guided benchmarking + TPOT calcula…
b8zhong Feb 26, 2025
ddd560f
[Bugfix] Do not crash V0 engine on input errors (#13101)
joerunde Feb 26, 2025
afdf702
[Bugfix] Update expected token counts for Ultravox tests (#13895)
DarkLight1337 Feb 26, 2025
b7a622d
[TPU] use torch2.6 with whl package (#13860)
Chenyaaang Feb 26, 2025
20f3457
[Misc] fixed qwen_vl_utils parameter error (#13906)
chaunceyjiang Feb 26, 2025
2462b65
[Bugfix] Backend option to disable xgrammar any_whitespace (#12744)
wallashss Feb 26, 2025
5eb0d63
[BugFix] Make FP8 Linear compatible with torch.compile (#13918)
WoosukKwon Feb 26, 2025
b6ce762
[Kernel] FlashMLA integration (#13747)
LucasWilkinson Feb 27, 2025
375e570
[ROCm][Quantization][Kernel] Use FP8 FNUZ when OCP flag is 0 or undef…
HollowMan6 Feb 27, 2025
258b598
Use CUDA 12.4 as default for release and nightly wheels (#12098)
mgoin Feb 27, 2025
f81c37f
[misc] Rename Ray ADAG to Compiled Graph (#13928)
ruisearch42 Feb 27, 2025
3f10862
[ROCm][V1] Update reshape_and_cache to properly work with CUDA graph …
SageMoore Feb 27, 2025
c9095e0
[V1][Metrics] Handle preemptions (#13169)
markmc Feb 27, 2025
96faffa
[CI/Build] Add examples/ directory to be labelled by `mergify` (#13944)
b8zhong Feb 27, 2025
20b147a
[Misc] fixed 'required' is an invalid argument for positionals (#13948)
chaunceyjiang Feb 27, 2025
f0a2f15
[PP] Correct cache size check (#13873)
zhengy001 Feb 27, 2025
c3128cb
Fix test_block_fp8.py test for MoE (#13915)
mgoin Feb 27, 2025
93a4a50
[VLM] Support multimodal inputs for Florence-2 models (#13320)
Isotr0py Feb 27, 2025
feaa8ce
[Model] Deepseek GGUF support (#13167)
SzymonOzog Feb 27, 2025
954a82a
Update quickstart.md (#13958)
observerw Feb 27, 2025
e2ec2eb
Deduplicate `.pre-commit-config.yaml`'s `exclude` (#13967)
hmellor Feb 27, 2025
4ef625a
[bugfix] Fix profiling for RayDistributedExecutor (#13945)
ruisearch42 Feb 27, 2025
d805e8d
Update LMFE version to v0.10.11 to support new versions of transforme…
noamgat Feb 27, 2025
41ea542
[Bugfix] Fix qwen2.5-vl overflow issue (#13968)
Isotr0py Feb 27, 2025
af836be
[VLM] Generalized prompt updates for multi-modal processor (#13964)
DarkLight1337 Feb 27, 2025
c845298
[Attention] MLA support for V1 (#13789)
chenyang78 Feb 27, 2025
284a899
Bump azure/setup-helm from 4.2.0 to 4.3.0 (#13742)
dependabot[bot] Feb 27, 2025
3e575f6
[VLM] Deprecate legacy input mapper for OOT multimodal models (#13979)
DarkLight1337 Feb 27, 2025
5d11292
[ROCm] Fix the Kernels, Core, and Prefix Caching AMD CI groups (#13970)
SageMoore Feb 27, 2025
810e7c5
[V1][Minor] Minor cleanup for GPU Model Runner (#13983)
WoosukKwon Feb 27, 2025
9d019ef
[core] Perf improvement for DSv3 on AMD GPUs (#13718)
qli88 Feb 27, 2025
4962680
[Attention] Flash MLA for V1 (#13867)
LucasWilkinson Feb 27, 2025
b5ae7d9
[Model][Speculative Decoding] Expand DeepSeek MTP code to support k >…
benchislett Feb 27, 2025
76f457e
[Misc] Print FusedMoE detail info (#13974)
jeejeelee Feb 27, 2025
8d02f59
[V1]`SupportsV0Only` protocol for model definitions (#13959)
ywang96 Feb 28, 2025
0e21ae3
[Bugfix] Check that number of images matches number of <|image|> toke…
tjohnson31415 Feb 28, 2025
f00446b
[Doc] Move multimodal Embedding API example to Online Serving page (#…
DarkLight1337 Feb 28, 2025
8f1c663
[Bugfix][Disaggregated] patch the inflight batching on the decode nod…
hasB4K Feb 28, 2025
377095a
Use smaller embedding model when not testing model specifically (#13891)
hmellor Feb 28, 2025
24a3865
[Hardware][Intel-Gaudi] Regional compilation support (#13213)
Kacper-Pietkun Feb 28, 2025
181d4bf
[V1][Minor] Restore V1 compatibility with LLMEngine class (#13090)
Ryp Feb 28, 2025
d65f74b
Update AutoAWQ docs (#14042)
hmellor Feb 28, 2025
367db4b
[Bugfix] Fix MoeWNA16Method activation (#14024)
jeejeelee Feb 28, 2025
c3eca7b
[VLM][Bugfix] Enable specifying prompt target via index (#14038)
DarkLight1337 Feb 28, 2025
50bf059
[Bugfix] Initialize attention bias on the same device as Query/Key/Va…
LouieYang Feb 28, 2025
e613446
[Doc] Fix ROCm documentation (#14041)
b8zhong Feb 28, 2025
f0c3ab6
Fix entrypoint tests for embedding models (#14052)
hmellor Feb 28, 2025
a99cf1d
[V1][TPU] Integrate the new ragged paged attention kernel with vLLM v…
Akshat-Tripathi Mar 3, 2025
25a1bdf
[v1] Cleanup the BlockTable in InputBatch (#13977)
heheda12345 Feb 28, 2025
8212e03
Add RELEASE.md (#13926)
atalman Feb 28, 2025
0532919
[v1] Move block pool operations to a separate class (#13973)
heheda12345 Feb 28, 2025
728088f
[core] Bump ray to 2.43 (#13994)
ruisearch42 Feb 28, 2025
27aacf9
[torch.compile] Fix RMSNorm + quant fusion in the non-cutlass-fp8 cas…
ProExpertProg Feb 28, 2025
c875893
[Docs] Add `pipeline_parallel_size` to optimization docs (#14059)
b8zhong Mar 1, 2025
1a94642
[Bugfix] Add file lock for ModelScope download (#14060)
jeejeelee Mar 1, 2025
d92cda2
[Misc][Kernel]: Add GPTQAllSpark Quantization (#12931)
wyajieha Mar 1, 2025
39a2024
[Bugfix][V1][Minor] Fix shutting_down flag checking in V1 MultiprocEx…
bigPYJ1151 Mar 1, 2025
0daae74
[Documentation] Add more deployment guide for Kubernetes deployment (…
KuntaiDu Mar 1, 2025
b469f95
[Doc] Consolidate `whisper` and `florence2` examples (#14050)
Isotr0py Mar 1, 2025
8aea81e
[V1][Minor] Do not print attn backend twice (#13985)
WoosukKwon Mar 1, 2025
2658c53
[ROCm][V1][Bugfix] Add get_builder_cls method to the ROCmAttentionBac…
SageMoore Mar 1, 2025
b029cc4
[v1][Bugfix] Only cache blocks that are not in the prefix cache (#14073)
heheda12345 Mar 1, 2025
a7ca7a6
[v1] Add `__repr__` to KVCacheBlock to avoid recursive print (#14081)
heheda12345 Mar 1, 2025
615e492
[Model] Add LoRA support for TransformersModel (#13770)
jeejeelee Mar 2, 2025
6bab1e2
[Misc] Accurately capture the time of loading weights (#14063)
waltforme Mar 2, 2025
657beea
[Doc] Source building add clone step (#14086)
qux-bbb Mar 2, 2025
0aba218
[v0][structured output] Support reasoning output (#12955)
gaocegege Mar 2, 2025
3610d54
Update deprecated Python 3.8 typing (#13971)
hmellor Mar 3, 2025
06c6a48
[Bugfix] Explicitly include "omp.h" for MacOS to avoid installation f…
realShengYao Mar 3, 2025
0da18f9
[Misc] duplicate code in deepseek_v2 (#14106)
noooop Mar 3, 2025
50c2cf8
[Misc][Platform] Move use allgather to platform (#14010)
MengqingCao Mar 3, 2025
9bad4fd
[Build] Make sure local main branch is synced when VLLM_USE_PRECOMPIL…
comaniac Mar 3, 2025
c6cc1f5
[V1] Refactor parallel sampling support (#13774)
markmc Mar 3, 2025
96 changes: 91 additions & 5 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -10,12 +10,18 @@ steps:
- image: badouralix/curl-jq
command:
- sh .buildkite/nightly-benchmarks/scripts/wait-for-image.sh

- label: "Cleanup H100"
agents:
queue: H100
depends_on: ~
command: docker system prune -a --volumes --force

- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- kubernetes:
podSpec:
@@ -50,6 +56,7 @@ steps:
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
@@ -70,20 +77,99 @@ steps:
#key: block-h100
#depends_on: ~

- label: "Cleanup H100"
- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: ~
command: docker system prune -a --volumes --force
depends_on: wait-for-container-image
if: build.branch == "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all # see CUDA_VISIBLE_DEVICES for actual GPUs used
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

# Premerge benchmark
- label: "A100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: A100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory

- label: "H200"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H200
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: 4,5,6,7
volumes:
- /data/benchmark-hf-cache:/root/.cache/huggingface
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN

#- block: "Run H100 Benchmark"
#key: block-h100
#depends_on: ~

- label: "H100"
# skip: "use this flag to conditionally skip the benchmark step, useful for PR testing"
agents:
queue: H100
depends_on: wait-for-container-image
if: build.branch != "main"
plugins:
- docker#v5.12.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-postmerge-repo:$BUILDKITE_COMMIT
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/scripts/run-performance-benchmarks.sh
@@ -84,8 +84,13 @@ def results_to_json(latency, throughput, serving):
# this result is generated via `benchmark_serving.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
try:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
except OSError as e:
print(e)
continue

raw_result.update(command)

# update the test name of this result
@@ -99,8 +104,13 @@ def results_to_json(latency, throughput, serving):
# this result is generated via `benchmark_latency.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
try:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
except OSError as e:
print(e)
continue

raw_result.update(command)

# update the test name of this result
@@ -121,8 +131,13 @@ def results_to_json(latency, throughput, serving):
# this result is generated via `benchmark_throughput.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
try:
with open(test_file.with_suffix(".commands")) as f:
command = json.loads(f.read())
except OSError as e:
print(e)
continue

raw_result.update(command)

# update the test name of this result
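The three hunks above add the same guard around reading each test's ".commands" sidecar file. A minimal standalone sketch of that pattern is below; the `attach_command` helper name and its packaging as a function are illustrative, not part of the PR.

```python
import json
from pathlib import Path


def attach_command(test_file: Path, raw_result: dict) -> dict | None:
    """Merge the saved benchmarking command into a raw result.

    Mirrors the guarded reads added above: if the ".commands" file is missing
    or unreadable, log the error and skip this result instead of aborting the
    whole results-collection run.
    """
    try:
        with open(test_file.with_suffix(".commands")) as f:
            command = json.loads(f.read())
    except OSError as e:
        print(e)
        return None
    raw_result.update(command)
    return raw_result
```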
@@ -309,11 +309,14 @@ run_serving_tests() {

new_test_name=$test_name"_qps_"$qps

# pass the tensor parallel size to the client so that it can be displayed
# on the benchmark dashboard
client_command="python3 benchmark_serving.py \
--save-result \
--result-dir $RESULTS_FOLDER \
--result-filename ${new_test_name}.json \
--request-rate $qps \
--metadata "tensor_parallel_size=$tp" \
$client_args"

echo "Running test case $test_name with qps $qps"
6 changes: 5 additions & 1 deletion .buildkite/nightly-benchmarks/scripts/wait-for-image.sh
@@ -1,6 +1,10 @@
#!/bin/sh
TOKEN=$(curl -s -L "https://public.ecr.aws/token?service=public.ecr.aws&scope=repository:q9t5s3a7/vllm-ci-postmerge-repo:pull" | jq -r .token)
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
if [[ "$BUILDKITE_BRANCH" == "main" ]]; then
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-postmerge-repo/manifests/$BUILDKITE_COMMIT"
else
URL="https://public.ecr.aws/v2/q9t5s3a7/vllm-ci-test-repo/manifests/$BUILDKITE_COMMIT"
fi

TIMEOUT_SECONDS=10

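For readers skimming the pipeline, a rough Python rendering of what this branch-aware wait does is sketched below. The repository names, token request, and branch check mirror the diff; the polling loop and retry interval are assumptions, and the real logic lives in the shell script above.

```python
import time

import requests


def wait_for_image(commit: str, branch: str, timeout_s: int = 10) -> bool:
    # Anonymous pull token for the public ECR registry, as in the shell script.
    token = requests.get(
        "https://public.ecr.aws/token",
        params={
            "service": "public.ecr.aws",
            "scope": "repository:q9t5s3a7/vllm-ci-postmerge-repo:pull",
        },
    ).json()["token"]

    # Post-merge builds (main) and premerge builds (other branches) are pushed
    # to different repositories, so the manifest URL is chosen per branch.
    repo = "vllm-ci-postmerge-repo" if branch == "main" else "vllm-ci-test-repo"
    url = f"https://public.ecr.aws/v2/q9t5s3a7/{repo}/manifests/{commit}"

    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get(url, headers={"Authorization": f"Bearer {token}"})
        if resp.ok:
            return True  # the image for this commit has been published
        time.sleep(2)
    return False
```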
3 changes: 1 addition & 2 deletions .buildkite/nightly-benchmarks/tests/serving-tests.json
@@ -66,8 +66,7 @@
"swap_space": 16,
"speculative_model": "turboderp/Qwama-0.5B-Instruct",
"num_speculative_tokens": 4,
"speculative_draft_tensor_parallel_size": 1,
"use_v2_block_manager": ""
"speculative_draft_tensor_parallel_size": 1
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
2 changes: 1 addition & 1 deletion .buildkite/nightly-benchmarks/tests/throughput-tests.json
@@ -32,4 +32,4 @@
"backend": "vllm"
}
}
]
]
13 changes: 12 additions & 1 deletion .buildkite/release-pipeline.yaml
@@ -1,4 +1,15 @@
steps:
- label: "Build wheel - CUDA 12.4"
agents:
queue: cpu_queue_postmerge
commands:
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag vllm-ci:build-image --target build --progress plain ."
- "mkdir artifacts"
- "docker run --rm -v $(pwd)/artifacts:/artifacts_host vllm-ci:build-image bash -c 'cp -r dist /artifacts_host && chmod -R a+rw /artifacts_host'"
- "bash .buildkite/upload-wheels.sh"
env:
DOCKER_BUILDKIT: "1"

- label: "Build wheel - CUDA 12.1"
agents:
queue: cpu_queue_postmerge
@@ -37,7 +48,7 @@ steps:
queue: cpu_queue_postmerge
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.1.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "DOCKER_BUILDKIT=1 docker build --build-arg max_jobs=16 --build-arg USE_SCCACHE=1 --build-arg GIT_REPO_CHECK=1 --build-arg CUDA_VERSION=12.4.0 --tag public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT --target vllm-openai --progress plain ."
- "docker push public.ecr.aws/q9t5s3a7/vllm-release-repo:$BUILDKITE_COMMIT"

- label: "Build and publish TPU release image"
8 changes: 7 additions & 1 deletion .buildkite/run-amd-test.sh
@@ -92,7 +92,9 @@ if [[ $commands == *" kernels "* ]]; then
--ignore=kernels/test_moe.py \
--ignore=kernels/test_prefix_prefill.py \
--ignore=kernels/test_rand.py \
--ignore=kernels/test_sampler.py"
--ignore=kernels/test_sampler.py \
--ignore=kernels/test_cascade_flash_attn.py \
--ignore=kernels/test_mamba_mixer2.py"
fi

#ignore certain Entrypoints tests
@@ -121,6 +123,8 @@ if [[ $commands == *"--shard-id="* ]]; then
--rm \
-e HIP_VISIBLE_DEVICES="${GPU}" \
-e HF_TOKEN \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
--name "${container_name}_${GPU}" \
@@ -148,6 +152,8 @@ else
--rm \
-e HIP_VISIBLE_DEVICES=0 \
-e HF_TOKEN \
-e AWS_ACCESS_KEY_ID \
-e AWS_SECRET_ACCESS_KEY \
-v "${HF_CACHE}:${HF_MOUNT}" \
-e "HF_HOME=${HF_MOUNT}" \
--name "${container_name}" \
2 changes: 1 addition & 1 deletion .buildkite/run-cpu-test.sh
@@ -30,7 +30,7 @@ function cpu_tests() {
# offline inference
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-avx2-"$NUMA_NODE" bash -c "
set -e
python3 examples/offline_inference/basic.py"
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m"

# Run basic model test
docker exec cpu-test-"$BUILDKITE_BUILD_NUMBER"-"$NUMA_NODE" bash -c "
2 changes: 1 addition & 1 deletion .buildkite/run-gh200-test.sh
@@ -24,5 +24,5 @@ remove_docker_container

# Run the image and test offline inference
docker run -e HF_TOKEN -v /root/.cache/huggingface:/root/.cache/huggingface --name gh200-test --gpus=all --entrypoint="" gh200-test bash -c '
python3 examples/offline_inference/cli.py --model meta-llama/Llama-3.2-1B
python3 examples/offline_inference/basic/generate.py --model meta-llama/Llama-3.2-1B
'
2 changes: 1 addition & 1 deletion .buildkite/run-hpu-test.sh
@@ -20,5 +20,5 @@ trap remove_docker_container_and_exit EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic.py
docker run --runtime=habana --name=hpu-test --network=host -e HABANA_VISIBLE_DEVICES=all -e VLLM_SKIP_WARMUP=true --entrypoint="" hpu-test-env python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
EXITCODE=$?
2 changes: 1 addition & 1 deletion .buildkite/run-openvino-test.sh
@@ -13,4 +13,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic.py
docker run --network host --env VLLM_OPENVINO_KVCACHE_SPACE=1 --name openvino-test openvino-test python3 /workspace/examples/offline_inference/basic/generate.py --model facebook/opt-125m
4 changes: 2 additions & 2 deletions .buildkite/run-xpu-test.sh
@@ -14,6 +14,6 @@ remove_docker_container

# Run the image and test offline inference/tensor parallel
docker run --name xpu-test --device /dev/dri -v /dev/dri/by-path:/dev/dri/by-path --entrypoint="" xpu-test sh -c '
python3 examples/offline_inference/basic.py
python3 examples/offline_inference/cli.py -tp 2
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m -tp 2
'