Add ORCA endpoint load metrics support #24905

efimki · 2025-09-15T19:36:02Z

Purpose

Add ORCA endpoint load metrics for /chat/completion and /completion responses.

The Engine V1 implementation uses get_named_metrics_from_prometheus() to collect metrics.
Engine metrics are reported in response headers only when the request sets the "endpoint-load-metrics-format" header; the supported formats are "text" and "json".
Requests that do not set the header add no load to the response path.
No new computation is added to the engine.
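
As a rough client-side illustration of the opt-in described above, the sketch below builds the request header and parses a reported value. Only the request header name and the "text"/"json" formats come from this description. The response header name `endpoint-load-metrics`, the `TEXT k=v, ...` / `JSON {...}` value layout, and the metric names in the usage note are assumptions based on the general ORCA convention, not confirmed by this PR.

```python
import json

# Request header named in the PR description.
FORMAT_HEADER = "endpoint-load-metrics-format"
# Assumption: response header name following the ORCA convention.
REPORT_HEADER = "endpoint-load-metrics"


def request_headers(fmt: str) -> dict[str, str]:
    """Build the opt-in header; the PR lists "text" and "json" as supported."""
    if fmt not in ("text", "json"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return {FORMAT_HEADER: fmt}


def parse_load_report(value: str) -> dict[str, float]:
    """Flatten an assumed 'TEXT k=v, k=v' or 'JSON {...}' value into one dict."""
    kind, _, payload = value.strip().partition(" ")
    if kind.upper() == "JSON":
        flat: dict[str, float] = {}
        for key, val in json.loads(payload).items():
            if isinstance(val, dict):
                # Nested maps (e.g. named metrics) become dotted keys.
                for sub, subval in val.items():
                    flat[f"{key}.{sub}"] = float(subval)
            else:
                flat[key] = float(val)
        return flat
    if kind.upper() == "TEXT":
        pairs = (item.split("=", 1) for item in payload.split(",") if item.strip())
        return {k.strip(): float(v) for k, v in pairs}
    raise ValueError(f"unrecognized load report: {value!r}")
```

A caller would send `request_headers("json")` with a completion request and pass the value of the report header, if present, to `parse_load_report`. For example, a hypothetical value `TEXT named_metrics.kv_cache_usage_perc=0.25` would parse to a single dotted key.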

Test Plan

Added tests/entrypoints/openai/test_orca_metrics.py
Run: pytest -s -v tests/entrypoints/openai/test_orca_metrics.py

Test Result

Pass

FIX

#10086

mergify · 2025-09-18T11:53:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @efimki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-09-21T01:07:22Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @efimki.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

robertgshaw2-redhat · 2025-09-22T15:39:25Z

@markmc

efimki · 2025-09-29T16:40:35Z

@markmc - do you have any comments / suggestions?

markmc · 2025-09-30T17:03:07Z

If you rebase, you'll see v0 engine is removed by #25033 and I think that means you can remove the InbandEngineStats stuff

Also, can 'get_metrics_snapshot()' do what you want? Maybe we can add a filter to that API so you can just get the two metrics you want?

markmc · 2025-09-30T17:04:49Z

For reference, (even for myself), this background info is useful as justification for adding these in-band metrics: #14906 (comment)

efimki · 2025-10-07T21:08:39Z

> If you rebase, you'll see v0 engine is removed by #25033 and I think that means you can remove the InbandEngineStats stuff

Excellent point, it made this PR much simpler!

> Also, can 'get_metrics_snapshot()' do what you want? Maybe we can add a filter to that API so you can just get the two metrics you want?

The get_metrics_snapshot() API looks good. Adding a filter could be useful to avoid the overhead of converting all the metrics. At this point I'm not entirely sure which metrics will be used for load balancing, but we can tweak it later.
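
To make the filter idea concrete, here is a minimal sketch of selecting a couple of metrics from a snapshot by name. It does not call vLLM: the `Gauge` dataclass is a stand-in for whatever entries the snapshot returns, `filter_snapshot` is a hypothetical helper, and the metric names are illustrative only.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Gauge:
    """Stand-in for one snapshot entry (hypothetical, not vLLM's class)."""
    name: str
    value: float


def filter_snapshot(metrics: Iterable[Gauge], wanted: set[str]) -> dict[str, float]:
    """Keep only the requested metrics so the rest are never converted."""
    return {m.name: m.value for m in metrics if m.name in wanted}


# Illustrative snapshot; names and values are made up for the example.
snapshot = [
    Gauge("vllm:kv_cache_usage_perc", 0.25),
    Gauge("vllm:num_requests_waiting", 2.0),
    Gauge("vllm:num_requests_running", 5.0),
]
selected = filter_snapshot(
    snapshot, {"vllm:kv_cache_usage_perc", "vllm:num_requests_waiting"}
)
```

Filtering by name before any formatting keeps the per-response cost proportional to the number of requested metrics rather than to the size of the whole registry, which is the overhead concern raised above.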

efimki · 2025-10-13T15:33:18Z

Friendly ping. :)

Any comments and suggestions would be greatly appreciated. :)

efimki · 2025-10-24T16:10:46Z

@markmc - thanks for the approval! Do you know who else needs to approve it for merging?

simon-mo

@markmc reviewed

markmc · 2025-10-30T08:15:31Z

Needs a rebase to fix conflicts and pick up #27610 which fixes the buildkite/ci/pr/entrypoints-integration-test-api-server failure

efimki · 2025-10-30T18:27:27Z

> Needs a rebase to fix conflicts and pick up #27610 which fixes the buildkite/ci/pr/entrypoints-integration-test-api-server failure

Thanks for the suggestion! I'm having some issues with my environment, which is missing the required version of torch:


  × No solution found when resolving dependencies:
  ╰─▶ Because there is no version of torch==2.9.0+cu129 and you require torch==2.9.0+cu129, we can conclude that your requirements are unsatisfiable.

I'll try to fix that ASAP.

markmc · 2025-10-31T17:52:55Z

Another rebase/merge needed to pick up #27861 and #27811

efimki added 2 commits September 15, 2025 17:43

Add ORCA endpoint load metrics

31b2ddf

Forked from vllm-project#14906 Use `get_named_metrics_from_prometheus()` to collect metrics for Engine V1. Signed-off-by: Misha Efimov <mef@google.com>

Minor cleanup

533a8f4

Signed-off-by: Misha Efimov <mef@google.com>

mergify bot added the frontend label Sep 15, 2025

mergify bot added the needs-rebase label Sep 18, 2025

Merge remote-tracking branch 'upstream/main' into orca4

e1074b4

Signed-off-by: Misha Efimov <mef@google.com>

mergify bot removed the needs-rebase label Sep 18, 2025

efimki marked this pull request as ready for review September 18, 2025 16:32

efimki requested review from DarkLight1337, NickLucche, aarnphm, alexm-redhat, chaunceyjiang, comaniac, njhill, robertgshaw2-redhat, simon-mo, youkaichao and zhuohan123 as code owners September 18, 2025 16:32

mergify bot added the needs-rebase label Sep 21, 2025

efimki added 2 commits October 7, 2025 17:27

Merge remote-tracking branch 'upstream/main' into orca4

9aadc87

Signed-off-by: Misha Efimov <mef@google.com>

Remove InbandEngineStats previously used for Engine V0.

14ced00

Signed-off-by: Misha Efimov <mef@google.com>

mergify bot removed the needs-rebase label Oct 7, 2025

efimki added 2 commits October 7, 2025 17:49

Remove old comment.

4711e96

Signed-off-by: Misha Efimov <mef@google.com>

Use vllm.v1.metrics.reader.get_metrics_snapshot() to collect metrics.

4da709a

Signed-off-by: Misha Efimov <mef@google.com>
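
For readers following this commit, here is a minimal sketch of the flow it describes: filter a metrics snapshot down to a few gauges, then serialize them in the "text" or "json" format requested through the `endpoint-load-metrics-format` request header. It does not import vLLM. `GaugeSample`, `REPORTED`, `select_named_metrics` and `format_load_report` are hypothetical names for illustration, and the reported metric names and the `TEXT`/`JSON` serialization are assumptions rather than the PR's actual code.

```python
import json
from dataclasses import dataclass


@dataclass
class GaugeSample:
    """Hypothetical stand-in for one gauge in a metrics snapshot."""
    name: str
    value: float


# Assumed mapping from Prometheus-style names to short reported names.
REPORTED = {
    "vllm:kv_cache_usage_perc": "kv_cache_usage_perc",
    "vllm:num_requests_waiting": "num_requests_waiting",
}


def select_named_metrics(snapshot):
    """Keep only the gauges we want to report, keyed by short name."""
    return {REPORTED[m.name]: m.value for m in snapshot if m.name in REPORTED}


def format_load_report(named, fmt):
    """Serialize metrics for the requested format; None means send nothing."""
    fmt = (fmt or "").strip().lower()
    if fmt == "json":
        return "JSON " + json.dumps({"named_metrics": named}, sort_keys=True)
    if fmt == "text":
        pairs = (f"named_metrics.{k}={v}" for k, v in sorted(named.items()))
        return "TEXT " + ", ".join(pairs)
    return None  # header absent or unsupported: no extra work on the response


# Illustrative values only; a real snapshot would come from the engine.
snapshot = [
    GaugeSample("vllm:kv_cache_usage_perc", 0.25),
    GaugeSample("vllm:num_requests_waiting", 2.0),
    GaugeSample("vllm:num_requests_running", 1.0),
]
named = select_named_metrics(snapshot)
print(format_load_report(named, "text"))
print(format_load_report(named, "json"))
```

Returning `None` when the request header is missing mirrors the PR's stated goal of adding no load to responses in the default case.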

efimki added 2 commits October 7, 2025 19:45

Fix formatting.

d9a6b92

Signed-off-by: Misha Efimov <mef@google.com>

Fix mypy presubmit finding.

77302d5

Signed-off-by: Misha Efimov <mef@google.com>

efimki added 3 commits October 13, 2025 15:34

Merge remote-tracking branch 'upstream/main' into orca4

a108c78

Merge remote-tracking branch 'upstream/main' into orca4

13b89ce

Switch from Optional to | None.

3756104

Signed-off-by: Misha Efimov <mef@google.com>

markmc approved these changes Oct 22, 2025

Merge remote-tracking branch 'upstream/main' into orca4

f2e48df

simon-mo approved these changes Oct 28, 2025

simon-mo enabled auto-merge (squash) October 28, 2025 17:50

github-actions bot added the ready label ("ONLY add when PR is ready to merge/full CI is needed") Oct 28, 2025

Merge remote-tracking branch 'upstream/main' into orca4

785bac3

Signed-off-by: Misha Efimov <mef@google.com>

Merge remote-tracking branch 'upstream/main' into orca4

8ac7fb2

simon-mo merged commit ba464e6 into vllm-project:main Nov 3, 2025
48 checks passed

soaringk pushed a commit to soaringk/vllm that referenced this pull request Nov 3, 2025

Add ORCA endpoint load metrics support (vllm-project#24905)

5a75fb6

Signed-off-by: Misha Efimov <mef@google.com> Signed-off-by: soaringk <k3vin.zhang@gmail.com>

zhaozuy pushed a commit to zhaozuy/vllm that referenced this pull request Nov 4, 2025

Add ORCA endpoint load metrics support (vllm-project#24905)

c5179ad

Signed-off-by: Misha Efimov <mef@google.com>

omerpaz95 pushed a commit to omerpaz95/vllm that referenced this pull request Nov 4, 2025

Add ORCA endpoint load metrics support (vllm-project#24905)

aa2f697

Signed-off-by: Misha Efimov <mef@google.com>

juliendenize pushed a commit to juliendenize/vllm that referenced this pull request Nov 6, 2025

Add ORCA endpoint load metrics support (vllm-project#24905)

2932b3c

Signed-off-by: Misha Efimov <mef@google.com>

ZhengHongming888 pushed a commit to ZhengHongming888/vllm that referenced this pull request Nov 8, 2025

Add ORCA endpoint load metrics support (vllm-project#24905)

e28d588

Signed-off-by: Misha Efimov <mef@google.com>

efimki commented Oct 7, 2025

efimki commented Oct 13, 2025

efimki commented Oct 24, 2025

simon-mo left a comment

markmc commented Oct 30, 2025

efimki commented Oct 30, 2025 (edited)

markmc commented Oct 31, 2025

4 participants
