@atalhens
Contributor

@atalhens atalhens commented Oct 21, 2025

Signed off: sneh.lata@nutanix.com

Purpose

This PR fixes #27301 and #27298. It adds a new Prometheus metric, vllm:num_requests_corrupted, to track and monitor corrupted requests in vLLM. The metric gives visibility into request corruption that can occur during inference, which makes such issues easier to monitor and debug.

  • New Prometheus gauge metric vllm:num_requests_corrupted
  • Console logging of corrupted request counts per logging interval.
  • Gated by the VLLM_COMPUTE_NANS_IN_LOGITS environment variable for performance control. It defaults to False, so nothing changes unless the flag is enabled.
  • Updated the documentation and the example dashboard configs.

Performance Considerations:

  • Corrupted request counting is optional and disabled by default
  • When disabled, it adds no computational overhead

Test Plan

  • Not needed, as the metric is disabled by default.

Test Result

  • N/A

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify

mergify bot commented Oct 21, 2025

Documentation preview: https://vllm--27306.org.readthedocs.build/en/27306/

@mergify mergify bot added the documentation and v1 labels Oct 21, 2025
Member

@markmc markmc left a comment

Thanks for all your work on this - it's pretty obvious that #18777 has very limited value without this

And ... I know the way SchedulerStats.num_corrupted_reqs is currently implemented as a "count of currently running requests that are corrupted" leads to using a Gauge ...

But I really think a Counter is a more natural use of a time-series metric for this information. Probably that would mean adding is_output_corrupted to FinishedRequestStats and removing num_corrupted_reqs from SchedulerStats?
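To make the Gauge-versus-Counter point concrete, here is a minimal, purely illustrative sketch in plain Python. It does not use vLLM or a Prometheus client; the Step record, the sample data, and both scrape_* helpers are hypothetical. It assumes a gauge reports the number of corrupted requests that are running at scrape time, while a counter accumulates corrupted requests as they finish.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One engine step (hypothetical record, for illustration only)."""
    running_corrupted: int   # corrupted requests still running at this step
    finished_corrupted: int  # corrupted requests that finished at this step


# Two requests become corrupted at step 2 and both finish before step 5.
STEPS = [Step(0, 0), Step(2, 0), Step(1, 1), Step(0, 1), Step(0, 0)]


def scrape_gauge(steps: list[Step], scrape_every: int) -> list[int]:
    """A gauge exposes only the value that is current when it is scraped."""
    return [
        step.running_corrupted
        for i, step in enumerate(steps, start=1)
        if i % scrape_every == 0
    ]


def scrape_counter(steps: list[Step], scrape_every: int) -> list[int]:
    """A counter only grows, so every scrape sees the running total."""
    total = 0
    samples = []
    for i, step in enumerate(steps, start=1):
        total += step.finished_corrupted
        if i % scrape_every == 0:
            samples.append(total)
    return samples


if __name__ == "__main__":
    print("gauge samples:  ", scrape_gauge(STEPS, scrape_every=5))
    print("counter samples:", scrape_counter(STEPS, scrape_every=5))
```

With one scrape every five steps, the gauge sample is 0 and both corrupted requests go unnoticed, whereas the counter sample is 2.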

Member

@markmc markmc left a comment

(sorry didn't mean to approve, wrong button!)

@mergify

mergify bot commented Oct 22, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @atalhens.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@atalhens
Contributor Author

atalhens commented Oct 24, 2025

Thanks for the suggestion. Given the way SchedulerStats was implemented, a gauge did make sense previously, but for a time-series metric like this a counter is the better fit. I have made the following changes, in line with your idea of changing SchedulerStats and introducing a new field on FinishedStats.

  1. Added an is_corrupted flag to FinishedStats, defaulting to False.
  2. Added a num_corrupted_request field to IterationStats, whose update_from_output() updates a request's stats from EngineCoreOutput. However, it currently has no context about the NaNs in the output logits.
  3. Thus, added a num_nans_in_logits field that gets populated from the already implemented request.num_nans_in_logits.
  4. Added a num_corrupted_reqs field to LoggingStatsLogger, which gets populated from the iteration stats.
  5. Added the Prometheus metric corrupted_requests: Number of requests corrupted out of running requests.
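The data flow described in the list above could be sketched roughly as follows. This is a simplified, hypothetical model in plain Python, not the vLLM code: the class and field names loosely follow the list, while record_finished, CorruptedRequestsCounter, and the rule "any NaN in the logits means the request is corrupted" are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FinishedStats:
    """Per-request stats recorded when a request finishes (simplified)."""
    is_corrupted: bool = False


@dataclass
class IterationStats:
    """Stats gathered during one engine iteration (simplified)."""
    num_corrupted_request: int = 0
    finished: list[FinishedStats] = field(default_factory=list)

    def record_finished(self, num_nans_in_logits: int) -> None:
        # Assumption: a request counts as corrupted if any NaN was seen
        # in its logits.
        is_corrupted = num_nans_in_logits > 0
        self.finished.append(FinishedStats(is_corrupted=is_corrupted))
        if is_corrupted:
            self.num_corrupted_request += 1


class CorruptedRequestsCounter:
    """Stand-in for a Prometheus counter: the value never decreases."""

    def __init__(self) -> None:
        self.value = 0

    def record(self, stats: IterationStats) -> None:
        self.value += stats.num_corrupted_request


if __name__ == "__main__":
    counter = CorruptedRequestsCounter()
    # NaN counts of the requests finishing in three successive iterations.
    for nan_counts in ([0, 3], [0], [1, 7, 0]):
        stats = IterationStats()
        for n in nan_counts:
            stats.record_finished(n)
        counter.record(stats)
    print("corrupted requests so far:", counter.value)
```

Because each iteration contributes only the requests that finished in it, the counter can be summed or rated over time without double counting long-running requests.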

cc: @markmc

Signed-off-by: atalhens <sneh.lata@nutanix.com>
@atalhens atalhens reopened this Oct 24, 2025
@atalhens atalhens marked this pull request as ready for review October 24, 2025 18:42
@markmc markmc added the ready label and removed the performance, structured-output, frontend, speculative-decoding, ci/build, tool-calling, and qwen labels Oct 28, 2025
@markmc markmc removed this from Tool Calling Oct 28, 2025
simon-mo previously approved these changes Oct 28, 2025
Collaborator

@simon-mo simon-mo left a comment

@markmc reviewed

@simon-mo simon-mo dismissed their stale review October 28, 2025 17:53

Have concern over the code

Collaborator

@simon-mo simon-mo left a comment

Can you further justify the need for VLLM_COMPUTE_NANS_IN_LOGITS env var and why can't this be enabled by default? I cannot find a place where we were doing CPU <> GPU syncs in counting NaNs here.

@markmc
Member

markmc commented Oct 28, 2025

Can you further justify the need for VLLM_COMPUTE_NANS_IN_LOGITS env var and why can't this be enabled by default?

Looks like it was added in #18777 in response to this comment

I cannot find a place where we were doing CPU <> GPU syncs in counting NaNs here.

Yeah, looks like the original performance overhead concern wasn't very specific

@atalhens
Contributor Author

atalhens commented Oct 28, 2025

Thanks for pointing this out, Simon.
The env variable was kept to address the overhead concern raised in the previous PR that Mark mentioned above. That PR introduced the implementation that counts the NaNs, _get_nans_in_logits.

As for the GPU <> CPU sync when counting NaNs, the call chain is as follows.
request.num_nans_in_logits = num_nans_in_logits[req_id] gets populated from

num_nans_in_logits = model_runner_output.num_nans_in_logits, and the model runner output in turn relies on

def _get_nans_in_logits(

to count the NaNs. Because that function was already gated here:
if envs.VLLM_COMPUTE_NANS_IN_LOGITS:
by the environment variable and disabled by default, I reused the same variable to stay consistent.
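The gating pattern described above can be illustrated with a small standalone sketch. It is not the vLLM implementation: nan_counting_enabled and count_nans_in_logits are hypothetical helpers, plain Python lists stand in for GPU tensors, and the way the flag string is parsed is an assumption.

```python
import math
import os
from typing import Mapping, Sequence


def nan_counting_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    # Assumption: the flag is an integer-like string and is off when unset.
    return bool(int(environ.get("VLLM_COMPUTE_NANS_IN_LOGITS", "0")))


def count_nans_in_logits(
    logits_by_request: Mapping[str, Sequence[float]],
    enabled: bool,
) -> dict[str, int]:
    """Return the NaN count per request id, or an empty dict when disabled."""
    if not enabled:
        # Skip the scan entirely so the disabled path costs nothing.
        return {}
    return {
        req_id: sum(1 for value in logits if math.isnan(value))
        for req_id, logits in logits_by_request.items()
    }


if __name__ == "__main__":
    logits = {"req-0": [0.1, float("nan"), 2.0], "req-1": [1.0, 2.0]}
    print(count_nans_in_logits(logits, enabled=nan_counting_enabled()))
```

In a GPU setting the per-request counts would have to be read back on the host, which is presumably where any extra cost would come from; that part is not modeled here.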

cc: @simon-mo , @markmc

@simon-mo
Collaborator

Can we quantify this for the users?

When running model X on hardware Y, there's x% throughput regression when sending metrics requests every 1s...

So users can understand the implication of turning this on.

@atalhens atalhens changed the title from "Add corrupted req metric" to "[Feature]: Add corrupted request metric to V1 metrics system." Oct 28, 2025
@atalhens
Contributor Author

atalhens commented Nov 4, 2025

Hi, I ran the vLLM performance benchmarks with and without the flag enabled to measure the performance impact on a node with 2 H100 GPUs.
I observed that enabling the flag adds the following overhead:

  • About 1-2% in latency
  • Surprisingly, the impact on throughput was small: less than 1% in most scenarios, and around 3-4% in high-throughput scenarios.

This doc includes the benchmark results: Doc Link

Throughput Tests

| Test name | Metric | Native | With VLLM_COMPUTE_NANS_IN_LOGITS flag enabled | Change (%) |
| --- | --- | --- | --- | --- |
| throughput_llama8B_tp1 | Tput (req/s) | 22.7456 | 22.7126 | -0.14 |
| throughput_llama8B_tp1 | Tput (tok/s) | 9803.83 | 9789.58 | -0.15 |
| throughput_llama8B_tp1 | Elapsed time (s) | 8.79289 | 8.80569 | +0.15 |
| throughput_mixtral8x7B_tp2 | Tput (req/s) | 6.5888 | 6.5919 | +0.05 |
| throughput_mixtral8x7B_tp2 | Tput (tok/s) | 3250.45 | 3251.98 | +0.05 |
| throughput_mixtral8x7B_tp2 | Elapsed time (s) | 30.3546 | 30.3403 | -0.05 |
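As a quick sanity check, the Change (%) column can be recomputed from the other two columns. The sketch below assumes the change is defined as (with flag - native) / native × 100; that definition and the helper name percent_change are assumptions, and the numbers are copied from the throughput table above.

```python
ROWS = [
    # (test name, metric, native, with flag, reported change in %)
    ("throughput_llama8B_tp1", "Tput (req/s)", 22.7456, 22.7126, -0.14),
    ("throughput_llama8B_tp1", "Tput (tok/s)", 9803.83, 9789.58, -0.15),
    ("throughput_llama8B_tp1", "Elapsed time (s)", 8.79289, 8.80569, 0.15),
    ("throughput_mixtral8x7B_tp2", "Tput (req/s)", 6.5888, 6.5919, 0.05),
    ("throughput_mixtral8x7B_tp2", "Tput (tok/s)", 3250.45, 3251.98, 0.05),
    ("throughput_mixtral8x7B_tp2", "Elapsed time (s)", 30.3546, 30.3403, -0.05),
]


def percent_change(native: float, with_flag: float) -> float:
    """Relative change of the flagged run versus the native run, in percent."""
    return (with_flag - native) / native * 100.0


if __name__ == "__main__":
    for name, metric, native, with_flag, reported in ROWS:
        computed = percent_change(native, with_flag)
        print(f"{name:28s} {metric:18s} computed={computed:+.3f} reported={reported:+.2f}")
```

Recomputing from the rounded values shown in the table agrees with the reported column to within 0.01 percentage points; the last digit can differ because the inputs are already rounded.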

In conclusion, I think this difference is significant enough to warrant flagging this feature.

Thanks!
Snehlata

cc: @markmc @simon-mo

@atalhens atalhens requested a review from simon-mo November 4, 2025 16:16
Collaborator

@simon-mo simon-mo left a comment

Thank you for the benchmarks

@simon-mo simon-mo merged commit e156017 into vllm-project:main Nov 5, 2025
47 checks passed
zWaNg3 added a commit to fangyuchu/vllm that referenced this pull request Nov 7, 2025
* add fault_report_addr in FaultToleranceConfig

* add handle fault&get_fault_info api

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* remove fault_report_address in CoreEngineActorManager __init__

Signed-off-by: a798347923 <2645302020@qq.com>

* ruff format

Signed-off-by: a798347923 <2645302020@qq.com>

* add handle fault&get_fault_info api

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix one bug.

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* add fault_report_port in FaultToleranceConfig

Signed-off-by: a798347923 <2645302020@qq.com>

* add zmq_addr concatenate with fault_report_addr and fault_report_port

Signed-off-by: a798347923 <2645302020@qq.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix some bug

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fault reporter bug fix

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* remove fault_report_addr in FaultToleranceConfig

Signed-off-by: a798347923 <2645302020@qq.com>

* refactor: relocate method serialization functions to serial_util.py

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* fix actor bug

* fix actor bug

* add engine_core_cmd_addr in FaultToleranceConfig

Signed-off-by: a798347923 <2645302020@qq.com>

* add and use _stop_worker_execution in EngineCoreGuard

Signed-off-by: a798347923 <2645302020@qq.com>

* add and use run in WorkerGuard

Signed-off-by: a798347923 <2645302020@qq.com>

* fix actor bug

* fix bug

* fix sentinel

* fix bug vllm/v1/engine/core.py:847: error: Missing positional argument "tp_size" in call to "EngineCoreGuard"

Signed-off-by: a798347923 <2645302020@qq.com>

* fix bug error: Missing positional arguments "length", "byteorder" in call to "to_bytes" of "int"

Signed-off-by: a798347923 <2645302020@qq.com>

* fix bug in fault tolerance mode

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix bug in fault tolerance mode

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* change fault_report_port to internal_fault_report_port
add external_fault_notify_port

Signed-off-by: a798347923 <2645302020@qq.com>

* change fault_report_port to internal_fault_report_port
add external_fault_notify_port

Signed-off-by: a798347923 <2645302020@qq.com>

* add _recv_cmd func
use deserialize_method_call and run_method in run func

Signed-off-by: a798347923 <2645302020@qq.com>

* Update core.py

fix bug error: Need type annotation for "kwargs" (hint: "kwargs: dict[<type>, <type>] = ...")

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* add self.ctx.term() in shutdown()

Signed-off-by: a798347923 <2645302020@qq.com>

* changed import deserialize_method_call,serialize_method_call

Signed-off-by: a798347923 <2645302020@qq.com>

* changed init worker_guard in init_device

Signed-off-by: a798347923 <2645302020@qq.com>

* Update core.py

add import serialize_method_call

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Update gpu_worker.py

changed init WorkerGuard in init_device

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Update gpu_worker.py

FIX BUG self.worker_guard: WorkerGuard|None = None

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Update gpu_worker.py

fix bug error: Argument 1 to "deserialize_method_call" has incompatible type "str | None"; expected "str"  [arg-type]

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Update gpu_worker.py

ruff format

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Update core.py

ruff-format

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* actively send exception information

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* actively send exception information

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* actively send exception information

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses

Signed-off-by: a798347923 <2645302020@qq.com>

* change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses

Signed-off-by: a798347923 <2645302020@qq.com>

* Update utils.py

delete engine_core_cmd_addr in EngineZmqAddresses

Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>

* Remove redundant configuration: fault-pub-port

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* Send pause instructions after receiving fault info in ClientGuard

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* change engine_core_guard_identities from dict[int, bytes] to list[bytes]

Signed-off-by: a798347923 <2645302020@qq.com>

* fix bug "only the worker guard of engine core 0 can receive messages sent from engine core guard

Signed-off-by: a798347923 <2645302020@qq.com>

* change local_rank to rank_in_group in WorkerGuard

Signed-off-by: a798347923 <2645302020@qq.com>

* changed del self.client_cmd_registry[int(unhealthy_engine.engine_id)]

Signed-off-by: a798347923 <2645302020@qq.com>

* add gloo communication timeout

* fix some bug

* add  stateless_process_group gloo_comm_timeout

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix some bug

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* reconstruct fault receiver&fault handler

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix return format

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix return format

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* fix return format

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* add abort request

* fix some bug

* fix some bug

* fix some bug

* add dt for client guard

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* add dt for client guard

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* add dt for client guard

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* Implementation of two types of pause: a soft one by using flag signals and a hard one by aborting nccl communicators.

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* Refine certain log forms and fix a minor bug in pause function.

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* Refactor and abstract the recv_msg logic in CG,ECG,WG.

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* [Frontend] Align finish_reason when tool is called with OpenAI (vllm-project#25054)

Signed-off-by: Sungyoon Jeong <sungyoon.jeong@furiosa.ai>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>

* [Hybrid] Pass kernel block size to builders (vllm-project#27753)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [Bugfix] Padded Eagle Specdec with Chunked Prefill (vllm-project#26263)

Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Signed-off-by: remi <remi@mistral.ai>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>

* [XPU]Refine Dockerfile.xpu, avoid oneccl dependency issue (vllm-project#27964)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* Add and check method uuid when sending commands and receiving results.

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* Add ORCA endpoint load metrics support (vllm-project#24905)

Signed-off-by: Misha Efimov <mef@google.com>

* [CI/Build] Remove the flaky gpt-oss lora test (vllm-project#27966)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* Abstract the logic of sending instructions and waiting responses from FaultHandler

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* [Model] Add PaddleOCR-VL Model Support  (vllm-project#27758)

Signed-off-by: zhangyue <zhangyue66@baidu.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* Add options in EngineCoreGuard to recv execution results from WorkerGuard

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* Early exit for MoE LoRA kernels (vllm-project#27131)

Signed-off-by: gnovack <gnovack@amazon.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [Bugfix] Skip gs:// model paths for speculator detection (vllm-project#27846)

Signed-off-by: Peter Schuurman <psch@google.com>

* [BUG] Make 'binary' default option for saving torch compile artifacts when using standalone_compile (vllm-project#27616)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>

* [CI/Testing] Add basic single node dual batch overlap test (vllm-project#27235)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

* [Spec Decode] Integrate Suffix Decoding from Arctic Inference (vllm-project#25784)

Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>

* [Feature][Benchmarks] Support `inf` burstiness (vllm-project#26941)

Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>

* [Bugfix][Qwen][Multimodal] Move Qwen2_5_vl sdpa to custom op and reenable compile (vllm-project#27764)

Signed-off-by: Lucas Kabela <lucaskabela@meta.com>

* [Bugfix] change FlashMLA reorder_batch_threshold (vllm-project#27777)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

* [Docs] add runai_streamer_sharded to LoadConfig (vllm-project#27937)

Signed-off-by: Andy Xie <andy.xning@gmail.com>

* Add TP parameter to attention tests (vllm-project#27683)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

* [Bugfix][plugin] fla crash on plugin (vllm-project#27322)

* [Bugfix] Fix MoE Routing Simulation (vllm-project#28002)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

* Remove the tpu docker image nightly build. (vllm-project#27997)

Signed-off-by: Qiliang Cui <derrhein@gmail.com>

* [Bugfix][ROCm] Fix ViT rotary embeddings for torch.compile compatibility on ROCm (vllm-project#27748)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [LoRA] Lora shrink swizzle (vllm-project#27694)

Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [Refactor] Lazy import tool_parser (vllm-project#27974)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [NIXL][XPU] Pin NIXL version to 0.7.0 (vllm-project#27849)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Metrics] Enable sleep state metric outside of dev mode (vllm-project#27867)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [Bug] Batch invariant: Fix flash attn MLA `RuntimeError: scheduler_metadata must have shape (metadata_size)` (vllm-project#27884)

* [CPU]Improve dynamic 4bit moe performance (vllm-project#27240)

Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>

* [CI/Build] Update LM Eval Version in AMD CI (vllm-project#27944)

Signed-off-by: zhewenli <zhewenli@meta.com>

* [KV Connector] Make KVCacheConfig an explicit constructor argument (vllm-project#27887)

Signed-off-by: Mark McLoughlin <markmc@redhat.com>

* [Model] fix ernie45 reasoning_parser (vllm-project#27973)

Signed-off-by: wangyafeng <wangyafeng@baidu.com>

* [CI/Build] Fix OpenAI API correctness on AMD CI (vllm-project#28022)

Signed-off-by: zhewenli <zhewenli@meta.com>

* [BugFix][Performance] Restore flashinfer autotuning for all scenarios (vllm-project#27904)

* Support worker reinitialization after hard pause; add task queue in FaultHandler to ensure sequential task execution

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* Load tuned fused_moe_lora shrink and expand kernel configs separately (vllm-project#27435)

Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* resolve conflicts

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* Support using Int4PreshuffledTensor after loading (vllm-project#26066)

Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>

* [Core] Enable StatLogger in LLMEngine (vllm-project#28020)

Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>

* [Model][Bugfix] fix pipeline parallelism support for NemotronH (vllm-project#27968)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

* [Model] add optimal triton fused moe configs for NemotronH MoE (vllm-project#27967)

Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>

* [Kernels] Isolate modular kernel code from FusedMoEMethodBase subclasses. (vllm-project#27123)

* [BugFix] Fix incorrect preallocated sampled_token_ids tensor size (vllm-project#28025)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [Perf] SM100 - add swap AB optimization to CUTLASS FP8 GEMM (vllm-project#27284)

Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

* [PERF] Decouple projections from GDN custom op (vllm-project#27512)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

* [model] Add support for openPangu_Ultra_MoE (vllm-project#27521)

Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>

* [PerfFix] Avoid separate thread for MP executor shm spin (vllm-project#28012)

Signed-off-by: Nick Hill <nhill@redhat.com>

* [AsyncScheduling] Don't schedule past request max_tokens (vllm-project#27922)

Signed-off-by: Nick Hill <nhill@redhat.com>

* Remove deprecated `--rope-scaling` and `--rope-theta` (vllm-project#28006)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [ROCm][Perf] New design on ROCm AITER MHA backend Implementation (vllm-project#25763)

Signed-off-by: ganyi <ygan@amd.com>

* Added disable rule to track files under benchmarks/lib (vllm-project#28048)

Signed-off-by: Nadav Kluger <nadav.k@fmr.ai>

* [Multimodal] Make MediaConnector extensible. (vllm-project#27759)

Signed-off-by: Chenheli Hua <huachenheli@outlook.com>

* [ROCm] gemm_a16w16 upstreaming (vllm-project#26969)

Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>

* Revert "[PERF] Decouple projections from GDN custom op" (vllm-project#28080)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

* add engine core ut

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* add engine core ut

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* [Qwen3-Next] MOE configs for A100-SXM4-80GB TP4 TP8 (vllm-project#27740)

* [XPU] Add gpt-oss model support for Intel GPU (vllm-project#27786)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [CI/Build] Enable some fixed tests in AMD CI (vllm-project#28078)

Signed-off-by: zhewenli <zhewenli@meta.com>

* [V0 deprecation] Remove VLLM_USE_V1 usage in most modules (vllm-project#27955)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* [Bugfix] Fix encoder-only model support for transformers backend (vllm-project#28021)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [BugFix] Fix DCP Assert (AssertionError: DCP not support reorder_batch_threshold > 1 now.) (vllm-project#28100)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

* [Model, Core] Support Granite Speech & LoRA for STT (vllm-project#24455)

* [Refactor] Lazy-loaded reasoning_parser (vllm-project#28092)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Refactor] to simplify and extract the shared logic between chat completion and responses (vllm-project#27961)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [bugfix] fix wrong `dcp_local_seq_lens` calc (vllm-project#27518)

Signed-off-by: Qiu <qiuchunshuo@huawei.com>

* [Hybrid allocator + kv connector] revert connector test changes related to hybrid allocator (vllm-project#28011)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

* [Misc] fix import error for DeepSeekR1ReasoningParser (vllm-project#28114)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* Fix excessive logging noise by reducing the log level of the MinimaxM2ToolParser import success message (vllm-project#27635)

Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>

* Bugfix: Cutlass FP8 FusedMoE bad scaling factors (vllm-project#27255)

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>

* [Graph Partition][Cache] Use inductor partition ops config (vllm-project#27702)

Signed-off-by: Boyuan Feng <boyuan@meta.com>

* [XPU] Enable custom routing functions in IPEX for Llama4 (vllm-project#28004)

Signed-off-by: frost-intel <frost.mitchell@intel.com>

* add kimi reasoning parser (vllm-project#28128)

Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
Co-authored-by: wangzhengtao <wangzhengtao@msh.team>

* [DCP] check return_lse for all layers in dcp (vllm-project#27929)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [BugFix] Support EP/DP + EPLB with MTP (vllm-project#25311)

Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>

* Enabling cooperative multi-gpu tests on multi-gpu nodes (vllm-project#27986)

Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>

* [ROCm][MLA] Support block-size > 1 for AITER MLA backend  (vllm-project#27224)

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>

* [Bugfix] Validate custom logits processor xargs for online serving (vllm-project#27560)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [misc] add vLLM Beijing Meetup (vllm-project#28127)

Signed-off-by: Jiaju Zhang <jjzhang@redhat.com>

* [Kernel] Fuse computation of g and beta for Gated Delta Net (vllm-project#28095)

Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>

* [Core] add support for reasoning parser plugins (vllm-project#28075)

Signed-off-by: walter beller-morales <walter.beller.morales@gmail.com>

* [Bugfix] vLLM should check Inductor config for compile cache enablement status (vllm-project#27637)

Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>

* [FlashInfer] Avoid FlashInfer block_size 16 + head_size 256 on blackwell (vllm-project#27994)

Signed-off-by: Chen Zhang <zhangch99@outlook.com>

* [CI]: Add LMCacheConnector Unit Tests (vllm-project#27852)

Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>

* [Feature] Extend batch invariant torch.compile to B200 (vllm-project#27856)

Signed-off-by: PaulZhang12 <paulzhan@fb.com>

* [Bugfix] Fix Qwen3-Reranker-8B load (vllm-project#28117)

Signed-off-by: wang.yuqi <noooop@126.com>

* [Docs] Clean up README_TUNING.md (vllm-project#28088)

Signed-off-by: windsonsea <haifeng.yao@daocloud.io>

* [Hardware][IBM Z] Optimize s390x Dockerfile (vllm-project#28023)

Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>

* [Chore] Remove Nemotron-Nano-VL config copy (vllm-project#28126)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Docs] Add guide to debugging vLLM-torch.compile integration (vllm-project#28094)

Signed-off-by: Richard Zou <zou3519@gmail.com>

* [Feature]: Add corrupted request metric to V1 metrics system. (vllm-project#27306)

Signed-off-by: atalhens <sneh.lata@nutanix.com>

* [CI/Build] Update checking logic in cutlass_group_gemm_supported  (vllm-project#27948)

Signed-off-by: zhewenli <zhewenli@meta.com>

* [CI/Build] Fix `test_defaults_with_usage_context` in AMD CI (vllm-project#27926)

Signed-off-by: zhewenli <zhewenli@meta.com>

* [Core][Hybrid allocator + connector 2/n] Unify `remove_skipped_blocks` by `get_last_useful_token` (vllm-project#25431)

Signed-off-by: KuntaiDu <kuntai@uchicago.edu>

* [Debugging] Add annotation for easier trace analysis (vllm-project#22496)

* [PERF] Decouple projections from GDN custom op. Attempt 2 (vllm-project#28083)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

* [Bug] Fix cpu disable shared_experts `VLLM_DISABLE_SHARED_EXPERTS_STREAM` (vllm-project#28157)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Bug] Fix env string `"0"` same to `True` (vllm-project#28159)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Ensure WorkerGuard command execution returns result; fix missing set_device when TP>1

Signed-off-by: fangyuchu <fangyuchu@qq.com>

* [Feature] Enable TP + EP `shared_experts` overlap with router, 3.7% E2E performance improvement (vllm-project#28164)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [CI Failure] `nm-testing/Qwen2-0.5B-Instruct-FP8-SkipQKV` was removed from HF. Skip it in tests (vllm-project#28170)

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

* [Misc] Remove the duplicate code (vllm-project#28111)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* rename& format logger

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* rename& format logger

Signed-off-by: w00689259 <wangzhuo66@huawei.com>

* feat(nccl): enable non-blocking NCCL communicators to support ncclCommAbort

Signed-off-by: fangyuchu <fangyuchu@qq.com>

---------

Signed-off-by: w00689259 <wangzhuo66@huawei.com>
Signed-off-by: a798347923 <2645302020@qq.com>
Signed-off-by: fangyuchu <fangyuchu@qq.com>
Signed-off-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com>
Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com>
Signed-off-by: Sungyoon Jeong <sungyoon.jeong@furiosa.ai>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Signed-off-by: remi <remi@mistral.ai>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Misha Efimov <mef@google.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: zhangyue <zhangyue66@baidu.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: zhangyue66 <zhangyue66@baidu.com>
Signed-off-by: gnovack <gnovack@amazon.com>
Signed-off-by: Peter Schuurman <psch@google.com>
Signed-off-by: ahao-anyscale <ahao@anyscale.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Sophie du Couédic <sop@zurich.ibm.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Andy Xie <andy.xning@gmail.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Qiliang Cui <derrhein@gmail.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Signed-off-by: Haipeng Li <li2haipeng@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Zhang Xiangze <Xiangze.Zhang@arm.com>
Signed-off-by: zhewenli <zhewenli@meta.com>
Signed-off-by: wangyafeng <wangyafeng@baidu.com>
Signed-off-by: Yu Gong <yu3.gong@gmail.com>
Signed-off-by: Jerry Zhang <jerryzh168@gmail.com>
Signed-off-by: Zhuohan Li <zhuohan123@gmail.com>
Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com>
Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Faqin Zhong <faqin.zhong@gmail.com>
Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>
Signed-off-by: yuantao <2422264527@qq.com>
Signed-off-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: ganyi <ygan@amd.com>
Signed-off-by: Nadav Kluger <nadav.k@fmr.ai>
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Qiu <qiuchunshuo@huawei.com>
Signed-off-by: KuntaiDu <kuntai@uchicago.edu>
Signed-off-by: minatoaquaMK2 <jiacheng.yue@foxmail.com>
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Signed-off-by: Boyuan Feng <boyuan@meta.com>
Signed-off-by: frost-intel <frost.mitchell@intel.com>
Signed-off-by: wangzhengtao <wangzhengtao@msh.team>
Signed-off-by: Chen Zhang <zhangch99@outlook.com>
Signed-off-by: ilmarkov <markovilya197@gmail.com>
Signed-off-by: Sage Moore <sage@neuralmagic.com>
Signed-off-by: Alexei V. Ivanov <alexei.ivanov@amd.com>
Signed-off-by: Jiaju Zhang <jjzhang@redhat.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: walter beller-morales <walter.beller.morales@gmail.com>
Signed-off-by: Yanan Cao <gmagogsfm@gmail.com>
Signed-off-by: Samuel Shen <slshen@uchciago.edu>
Signed-off-by: PaulZhang12 <paulzhan@fb.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: windsonsea <haifeng.yao@daocloud.io>
Signed-off-by: Rehan Khan <Rehan.Khan7@ibm.com>
Signed-off-by: Richard Zou <zou3519@gmail.com>
Signed-off-by: atalhens <sneh.lata@nutanix.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: fangyuchu <fangyuchu@qq.com>
Co-authored-by: a798347923 <2645302020@qq.com>
Co-authored-by: w00689259 <wangzhuo66@huawei.com>
Co-authored-by: fangyuchu <569160112@qq.com>
Co-authored-by: TianZhuo <2770730562@qq.com>
Co-authored-by: a798347923 <39047817+a798347923@users.noreply.github.com>
Co-authored-by: Sungyoon Jeong <157349761+n0gu-furiosa@users.noreply.github.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Misha Efimov <mef@google.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: zhang-prog <69562787+zhang-prog@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: gnovack <gnovack@amazon.com>
Co-authored-by: pwschuurman <psch@google.com>
Co-authored-by: ahao-anyscale <ahao@anyscale.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Aurick Qiao <aurickq@users.noreply.github.com>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Sophie du Couédic <sop@zurich.ibm.com>
Co-authored-by: Lucas Kabela <lucaskabela@meta.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Ning Xie <andy.xning@gmail.com>
Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: QiliangCui <derrhein@gmail.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: li2haipeng <44383182+li2haipeng@users.noreply.github.com>
Co-authored-by: liuzhenwei <zhenwei.liu@intel.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: xiangze-arm <Xiangze.Zhang@arm.com>
Co-authored-by: Zhewen Li <zhewenli@meta.com>
Co-authored-by: CSWYF3634076 <wangyafeng@baidu.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: yugong333 <yu3.gong@gmail.com>
Co-authored-by: Jerry Zhang <jerryzh168@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com>
Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com>
Co-authored-by: Nick Hill <nhill@redhat.com>
Co-authored-by: lyrisz <145491716+LyrisZhong@users.noreply.github.com>
Co-authored-by: Faqin Zhong <zhofaqin@amazon.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>
Co-authored-by: yt0428 <51468697+yt0428@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Pleaplusone <pleaplusone.gy@gmail.com>
Co-authored-by: nadavkluger <nadav.kluger@gmail.com>
Co-authored-by: Chenheli Hua <huachenheli@outlook.com>
Co-authored-by: Aleksandr Malyshev <164964928+maleksan85@users.noreply.github.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: tou <57480529+toulzx@users.noreply.github.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Alex Brooks <alex.brooks@ibm.com>
Co-authored-by: Qiu <chunshuoq@gmail.com>
Co-authored-by: Kuntai Du <kuntai@uchicago.edu>
Co-authored-by: Eric Yue <jiacheng.yue@foxmail.com>
Co-authored-by: amirkl94 <203507526+amirkl94@users.noreply.github.com>
Co-authored-by: Boyuan Feng <boyuan@meta.com>
Co-authored-by: Frost Mitchell <frost.mitchell@intel.com>
Co-authored-by: bigmoyan <moyan_work@foxmail.com>
Co-authored-by: wangzhengtao <wangzhengtao@msh.team>
Co-authored-by: Chen Zhang <zhangch99@outlook.com>
Co-authored-by: Ilya Markov <markovilya197@gmail.com>
Co-authored-by: Sage Moore <sage@neuralmagic.com>
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
Co-authored-by: Jiaju Zhang <jjzhang@redhat.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Walter Beller-Morales <walterbm@users.noreply.github.com>
Co-authored-by: gmagogsfm <gmagogsfm@users.noreply.github.com>
Co-authored-by: Samuel Shen <102553648+sammshen@users.noreply.github.com>
Co-authored-by: Samuel Shen <slshen@uchciago.edu>
Co-authored-by: Yihua Cheng <yihua98@uchicago.edu>
Co-authored-by: Paul Zhang <paulzhan@fb.com>
Co-authored-by: wang.yuqi <noooop@126.com>
Co-authored-by: Michael Yao <haifeng.yao@daocloud.io>
Co-authored-by: R3hankhan <Rehan.Khan7@ibm.com>
Co-authored-by: Richard Zou <zou3519@users.noreply.github.com>
Co-authored-by: Snehlata <sneh.lata@nutanix.com>
Co-authored-by: Dayeol Lee <dayeolee@gmail.com>
ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025

Labels

ready (ONLY add when PR is ready to merge/full CI is needed), v1

Development

Successfully merging this pull request may close these issues.

[Feature]: Add num_corrupted_request metric to V1 metrics system.

3 participants