
MatthewBonanni (Contributor) commented Oct 30, 2025

Purpose

Reduce the FlashMLA reorder_batch_threshold from 512 to 128. This fixes the LM Eval Large Models test when applied on top of 82af928 (the commit that merged #26541), but it does not appear to be sufficient to fix the test on tip of tree (TOT). More than one commit may be involved in that failure.
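As a rough illustration of what the threshold controls, here is a minimal stand-alone Python sketch. It assumes that a request is routed to the decode path when its number of new query tokens in the current step is at most `reorder_batch_threshold`, and to the prefill path otherwise. The `Request` class and `classify` helper are hypothetical and are not vLLM code. Under that assumption, lowering the threshold from 512 to 128 moves a 300-token request from the decode path to the prefill path.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    query_len: int  # new tokens scheduled for this request in the current step


def classify(requests, reorder_batch_threshold):
    """Split request IDs into (decode_path, prefill_path) by query length.

    Hypothetical helper: a request counts as a decode when its query length
    is <= reorder_batch_threshold, otherwise it is treated as a prefill.
    """
    decode, prefill = [], []
    for r in requests:
        target = decode if r.query_len <= reorder_batch_threshold else prefill
        target.append(r.req_id)
    return decode, prefill


batch = [Request("a", 1), Request("b", 4), Request("c", 300), Request("d", 2048)]

for threshold in (512, 128):
    decode, prefill = classify(batch, threshold)
    print(threshold, decode, prefill)
# 512 ['a', 'b', 'c'] ['d']
# 128 ['a', 'b'] ['c', 'd']
```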

Test Plan

pytest -s -v test_lm_eval_correctness.py::test_lm_eval_correctness_param[config_filename4] --config-list-file=configs/models-large.txt --tp-size=4

Test Result

Passes (on 82af928)


Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
@mergify mergify bot added the v1 label Oct 30, 2025
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request reduces reorder_batch_threshold in FlashMLA from 512 to 128. This is a bugfix to prevent Out-Of-Memory errors for large models by routing longer prefill sequences to a more memory-efficient computation path. My review includes a suggestion to improve the code comment for this critical parameter, explaining the rationale behind the value to improve maintainability and guide future tuning efforts.
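The review above describes the change as routing longer prefill sequences to a different computation path. Assuming the batch is reordered so that decode-path requests form a contiguous prefix ahead of prefill-path requests (the "reorder" in `reorder_batch_threshold`), the hypothetical `reorder_decodes_first` helper below sketches that partition and the number of tokens each path receives. It uses a stable partition for readability; vLLM's actual reordering may swap requests in place and need not preserve their relative order.

```python
def reorder_decodes_first(query_lens, threshold):
    """Hypothetical sketch: put decode-path requests first.

    Returns (order, num_decodes, num_decode_tokens, num_prefill_tokens),
    where `order` lists original batch indices after reordering.
    """
    decode_idx = [i for i, q in enumerate(query_lens) if q <= threshold]
    prefill_idx = [i for i, q in enumerate(query_lens) if q > threshold]
    order = decode_idx + prefill_idx  # decodes form a contiguous prefix
    num_decode_tokens = sum(query_lens[i] for i in decode_idx)
    num_prefill_tokens = sum(query_lens[i] for i in prefill_idx)
    return order, len(decode_idx), num_decode_tokens, num_prefill_tokens


query_lens = [2048, 1, 300, 4]
print(reorder_decodes_first(query_lens, 512))  # ([1, 2, 3, 0], 3, 305, 2048)
print(reorder_decodes_first(query_lens, 128))  # ([1, 3, 0, 2], 2, 5, 2348)
```

With the lower threshold, the 300-token request no longer counts toward the decode path, so that path only sees very short queries in this example.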

yewentao256 (Member) left a comment

I don't fully understand the context. Could you add more detail to the PR description?
E.g., what is the original issue, why is there a failure, and how does this change fix it?

yewentao256 (Member) left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 30, 2025
@yewentao256 yewentao256 merged commit 145c00a into vllm-project:main Nov 3, 2025
49 checks passed
@MatthewBonanni MatthewBonanni deleted the fix_flashmla branch November 3, 2025 20:34
zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request Nov 4, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
juliendenize pushed a commit to juliendenize/vllm that referenced this pull request Nov 6, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
zWaNg3 added a commit to fangyuchu/vllm that referenced this pull request Nov 7, 2025
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

2 participants