forked from opendatahub-io/vllm
nm vllm ent 0.8.5 sync #139
Merged
Conversation
…m-project#16801) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: Luka Govedič <lgovedic@redhat.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
…16796) Signed-off-by: Nathan Weinberg <nweinber@redhat.com>
…ect#16809) Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Jonghyun Choe <andy.choe729@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…llm-project#16829) Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
…nfig info (vllm-project#16857) Signed-off-by: jmho <jaylenho734@gmail.com>
Signed-off-by: omrishiv <327609+omrishiv@users.noreply.github.com>
…ect#15130) Signed-off-by: fyabc <suyang.fy@alibaba-inc.com> Signed-off-by: Roger Wang <ywang@roblox.com> Co-authored-by: Roger Wang <136131678+ywang96@users.noreply.github.com> Co-authored-by: Roger Wang <ywang@roblox.com> Co-authored-by: Xiong Wang <wangxiongts@163.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
…llm-project#16591) Signed-off-by: Jannis Schönleber <joennlae@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: Jannis Schönleber <joennlae@gmail.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
…vllm-project#16460) Signed-off-by: vie-serendipity <2733147505@qq.com>
… V1 (vllm-project#15477) Signed-off-by: Isotr0py <2037008807@qq.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: reidliu41 <reid201711@gmail.com> Co-authored-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: Staszek Pasko <staszek@gmail.com> Co-authored-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: rzou <zou3519@gmail.com>
Signed-off-by: qizixi <qizixi@meta.com>
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
- remove build steps/dependencies - allow for installing pre-built flash-attention/vllm wheels - default ROCM_VERSION to 6.3.4, allowing override with env vars - cleanup rocm docker bake, defaults - amdsmi: use setup.py to build - add amdsmi bind mount - remove flashinfer from rocm target - bump vllm-tgis-adapter to 0.7.0 - Dockerfile*.ubi: bump ubi base
Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
- remove build steps/dependencies - allow for installing pre-built flash-attention/vllm wheels - default ROCM_VERSION to 6.3.4, allowing override with env vars - cleanup rocm docker bake, defaults - amdsmi: use setup.py to build - add amdsmi bind mount - remove flashinfer from rocm target - bump vllm-tgis-adapter to 0.7.0 - Dockerfile*.ubi: bump ubi base
…-project#17303) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…vllm-project#17255) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…rides are ordered (vllm-project#17256) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
…17197) Signed-off-by: Russell Bryant <rbryant@redhat.com>
Signed-off-by: Aaron Pham <contact@aarnphm.xyz> Co-authored-by: Russell Bryant <rbryant@redhat.com>
…t have shape (metadata_size) (vllm-project#17283) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
…_after_loading`. (vllm-project#16854) Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: simon-mo <xmo@berkeley.edu>
…#17328) Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: andy-neuma <andy@neuralmagic.com>
Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>
…ct results (vllm-project#17574) Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>
…client' (vllm-project#17434) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: Rahul Tuli <rtuli@redhat.com> Co-authored-by: mgoin <mgoin64@gmail.com>
…7315) Signed-off-by: Lucia Fang <fanglu@fb.com>
Syncing midstream NM fork to Upstream tag of [v0.8.5.post1](https://github.com/vllm-project/vllm/tree/v0.8.5.post1) + cherry pick of vllm-project@be633fb needed for benchmarks + [CP](neuralmagic/nm-vllm-ent@1fe447d) for compressed tensor bump + [CP](vllm-project#17677) for lora on AMD + [CP](vllm-project#17315) for llama4 w/ pure dense layers

```
commit 31c73ba (HEAD -> upstream-v0.8.5, nm-fork/upstream-v0.8.5)
Author: Chauncey <chaunceyjiang@gmail.com>
Date:   Wed Apr 30 15:11:04 2025 +0800

    [Bugfix] Fix AttributeError: 'State' object has no attribute 'engine_client' (vllm-project#17434)

    Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

commit f8db0bd
Author: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Date:   Fri May 2 14:01:38 2025 -0400

    [BugFix][Attention] Fix sliding window attention in V1 giving incorrect results (vllm-project#17574)

    Signed-off-by: Lucas Wilkinson <lwilkinson@neuralmagic.com>

commit e335c34
Author: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com>
Date:   Fri May 2 04:07:03 2025 -0400

    [BugFix] Fix Memory Leak (vllm-project#17567)

    Signed-off-by: rshaw@neuralmagic.com <robertgshaw2@gmail.com>

commit cc463fe
Merge: 1e358ff ba41cc9
Author: Selbi Nuryyeva <selbi@redhat.com>
Date:   Tue Apr 29 12:34:57 2025 -0400

    Merge branch 'tag-upstream-v0.8.5' into upstream-v0.8.5

commit ba41cc9 (tag: v0.8.5, tag-upstream-v0.8.5)
Author: Michael Goin <mgoin64@gmail.com>
Date:   Mon Apr 28 16:20:24 2025 -0600

    [Model] Add tuned triton fused_moe configs for Qwen3Moe (vllm-project#17328)

    Signed-off-by: mgoin <mgoin64@gmail.com>

commit dcbac4c
Author: Simon Mo <simon.mo@hey.com>
Date:   Mon Apr 28 14:12:01 2025 -0700

    [Model] Qwen3 Dense FP8 Compat Fixes (vllm-project#17318)

    Signed-off-by: simon-mo <xmo@berkeley.edu>

[...]
```

Commands

```
git fetch upstream
git checkout -b upstream-v0.8.5
git merge upstream/v0.8.5
git cherry-pick be633fb
```

TEST PLAN
accept sync: https://github.com/neuralmagic/nm-cicd/actions/runs/14841223552
related PR in cicd: neuralmagic/nm-cicd#99
release workflow: https://github.com/neuralmagic/nm-cicd/actions/runs/14845693864
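For reviewers who want to confirm locally that the cherry-picks listed above landed on the sync branch, a minimal check could look like the following (remote and branch names are taken from the log above; adjust them to your local setup):

```
# Assumes a remote named nm-fork pointing at the midstream fork, as in the log above.
git fetch nm-fork upstream-v0.8.5
git log --oneline v0.8.5..nm-fork/upstream-v0.8.5                                     # commits carried on top of the upstream tag
git log --oneline nm-fork/upstream-v0.8.5 | grep -E "17434|17574|17567|17315|17677"   # spot-check the CPs by PR number
```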
This bumps the CUDA version in the base layer to 12-8 instead of 12-4. This could break something if, during dependency install, we have to build a dependency from source, since the wheels we bring in later in the prepare stage are now built against 12.8.
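The mismatch risk described above can be sanity-checked by comparing the toolkit in the base layer with the CUDA version the pre-built wheels target. These commands are illustrative only and assume nvcc and a torch wheel are present in the image; they are not taken from Dockerfile.ubi:

```
# Run inside the built image; tool availability is an assumption, not taken from the Dockerfile.
nvcc --version                                         # toolkit used if a dependency has to be built from source
python3 -c "import torch; print(torch.version.cuda)"   # CUDA version the pre-built wheels were compiled against
```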
Notable conflicts were in Dockerfile.rocm.ubi and Dockerfile.ubi. Up to date with the upstream v0.8.5.post1 tag and includes CPs for lora, llama4, and the compressed tensors bump.
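If the merge needs to be reproduced or audited, the conflicted Dockerfiles can be inspected with standard git tooling while the merge from the Commands section above is in progress (a sketch, not part of the original test plan):

```
# During `git merge upstream/v0.8.5`, before resolving:
git diff --name-only --diff-filter=U            # should list Dockerfile.rocm.ubi and Dockerfile.ubi
git log --merge --oneline -- Dockerfile.ubi     # commits on both sides that touched the conflicted file
```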
- Improve configs - `SchedulerConfig` (vllm-project/vllm#16533)
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 (vllm-project/vllm#16596)
- [BugFix]: Update minimum `pyzmq` version (vllm-project/vllm#16549)
- Add `vllm bench [latency, throughput]` CLI commands (vllm-project/vllm#16508)
- [Misc] Update `compressed-tensors` WNA16 to support zero-points (vllm-project/vllm#14211)
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` (vllm-project/vllm#16578)
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook (vllm-project/vllm#16405)
- Improve configs - `TokenizerPoolConfig` + `DeviceConfig` (vllm-project/vllm#16603)
- [TPU][V1] Fix padding recompilation when `max-num-batched-tokens` is not even (vllm-project/vllm#16726)
- [Doc] Improve help examples for `--compilation-config` (vllm-project/vllm#16729)
- [V1][Structured Output] Minor modification to `_validate_structured_output()` (vllm-project/vllm#16748)
- Improve configs - `MultiModalConfig` + `PoolerConfig` + `DecodingConfig` (vllm-project/vllm#16789)
- Fix `nullable_kvs` fallback (vllm-project/vllm#16837)
- [Frontend] Add sampling params to `v1/audio/transcriptions` endpoint (vllm-project/vllm#16591)
- Improve configs - `CacheConfig` (vllm-project/vllm#16835)
- [Perf] Optimize `_update_states` for GPU model runner (vllm-project/vllm#16910)
- Improve configs - `SpeculativeConfig` (vllm-project/vllm#16971)
- [BugFix] Remove default multiproc executor `collective_rpc` timeout (vllm-project/vllm#17000)
- Categorize `tests/kernels/` based on kernel type (vllm-project/vllm#16799)
- Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` (vllm-project/vllm#17051)
- `CacheConfig.block_size` should always be `int` when used (vllm-project/vllm#17052)
- Use `@property` and private field for `data_parallel_rank_local` (vllm-project/vllm#17053)
- Simplify `TokenizerGroup` (vllm-project/vllm#16790)
- Improve static type checking in `LoRAModelRunnerMixin` (vllm-project/vllm#17104)
- [CI] Add automation for the `tool-calling` github label (vllm-project/vllm#17118)
- Add `:markdownhelp:` to `EngineArgs` docs so markdown docstrings render properly (vllm-project/vllm#17124)
- Improve configs - `LoRAConfig` + `PromptAdapterConfig` (vllm-project/vllm#16980)
- Move missed `SchedulerConfig` args into scheduler config group in `EngineArgs` (vllm-project/vllm#17131)
- Use Transformers helper `get_text_config()` instead of checking for `text_config` (vllm-project/vllm#17105)
- [BugFix][Frontend] Fix `LLM.chat()` tokenization (vllm-project/vllm#16081)
- [Bugfix] Fix missing int type for `-n` in multi-image example (vllm-project/vllm#17223)
- [V1] Add `structural_tag` support using xgrammar (vllm-project/vllm#17085)
- [Chore] added stubs for `vllm_flash_attn` during development mode (vllm-project/vllm#17228)
- [Bugfix] fix error due to an uninitialized tokenizer when using `skip_tokenizer_init` with `num_scheduler_steps` (vllm-project/vllm#9276)
- [Misc] Validate `stop_token_ids` contents (vllm-project/vllm#17268)
- Add missing class docstring for `PromptAdapterConfig` (vllm-project/vllm#17302)
- [Bugfix] Add missing `get_language_model` to new MLLMs (vllm-project/vllm#17300)
- [Misc] Minor typo/grammar in `platforms/interface.py` (vllm-project/vllm#17307)
- Make name of `compressed-tensors` quant method consistent across vLLM (vllm-project/vllm#17255)
- [Bugfix] Fix moe weight losing all extra attrs after `process_weights_after_loading` (vllm-project/vllm#16854)