
Conversation

@coolkp (Contributor) commented Mar 16, 2025

FIX #10086
Implemented for the V0 engine only, for the /chat/completions and /completions endpoints (a client-side sketch follows the list below):

  • Engine metrics are reported in response headers when enabled through the request header "endpoint-load-metrics-format".
  • Supported formats: "text" and "json".
  • No added overhead on responses in the default (disabled) case.
  • No new computation in the engine.
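
For illustration, a minimal client-side sketch of the opt-in flow. The request header name and the "text"/"json" formats come from the description above; the endpoint path, model name, response-header name, and metric keys shown are assumptions, not necessarily what the implementation uses.

```python
# Illustrative sketch only. "endpoint-load-metrics-format" is the request
# header described above; the response-header name and metric keys below
# are assumptions made for demonstration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"endpoint-load-metrics-format": "text"},  # or "json"
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 8},
)

# The engine load report is expected back in a response header, e.g. an
# ORCA-style header (name assumed):
print(resp.headers.get("endpoint-load-metrics"))
# e.g. 'named_metrics.kv_cache_utilization=0.30, named_metrics.waiting=2'
```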

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 16, 2025
@coolkp coolkp marked this pull request as ready for review March 16, 2025 23:06
@simon-mo (Collaborator)

@youngkent @houseroad, can you help review this, as it might conflict with the feature your team recently added for load measurement? Additionally, it would be useful to get a review of code quality and of whether you think the feature is implemented in the right way. Finally, we are heading toward a V0 feature freeze and should focus only on V1.

@houseroad (Collaborator) left a comment

In general, I don't think this would conflict with Meta internal features. Wondering how production-stack folks would like to collect such load metrics?

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

@coolkp coolkp requested a review from houseroad March 17, 2025 15:52
@coolkp (Contributor, Author) commented Mar 17, 2025

@simon-mo

@youngkent @houseroad, can you help review this, as it might conflict with the feature your team recently added for load measurement? Additionally, it would be useful to get a review of code quality and of whether you think the feature is implemented in the right way. Finally, we are heading toward a V0 feature freeze and should focus only on V1.

In general, I don't think this would conflict with Meta internal features. Wondering how production-stack folks would like to collect such load metrics?

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

Would production-stack rely on inband metrics as opposed to querying the prometheus metrics?
Also, I am not sure whether this will make it to V1 at all. We will have to do more scaled testing to determine whether this kind of metrics gathering has an advantage over out-of-band metrics in Prometheus; the concept is not entirely validated yet. Getting it into V0 will allow us to test faster, since we won't have to maintain our own images and fall out of sync with other vLLM features. I can send a quick follow-up for V1 within a week if this tests well; the only changes required will be in the engine, while entrypoints and sequence shouldn't need changes. I also didn't see RequestMetrics being populated in V1. Is it already implemented?

@coolkp (Contributor, Author) commented Mar 19, 2025

Hi, gentle ping on this

@simon-mo (Collaborator) commented Mar 19, 2025

The metrics should be there for V1; cc @markmc, who implemented the stack.

As we have turned on V1 by default, we would like any feature introduced to vLLM to be implemented in both V0 and V1, or in V1 only, to minimize porting cost.

@simon-mo simon-mo requested review from markmc and removed request for houseroad March 19, 2025 23:44
@markmc (Member) commented Mar 20, 2025

There's a lot of useful info in #10086, but this PR seems (at a glance) to focus on the proposal for metrics to be reported in the response headers using the ORCA format.

I think it could be really useful to document a proposal on just the inband metrics piece specifically, and I'd especially appreciate an explanation of whether and how it relates to other Kubernetes-associated load-balancing efforts I tried to capture here: https://docs.vllm.ai/en/stable/design/v1/metrics.html#autoscaling-and-load-balancing

@Shaoting-Feng (Contributor) commented Mar 21, 2025

Would production-stack rely on inband metrics as opposed to querying the prometheus metrics?

The production stack relies on Prometheus metrics rather than inband metrics. So as long as Prometheus scraping is unchanged and inband metrics are purely additive, the production stack shouldn't be affected.

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

The production stack supports session sticky routing, i.e., routing the request to the appropriate engine URL according to the request headers.

@coolkp coolkp force-pushed the endpoint-load-metrics branch from 83ab707 to e1f8925 Compare March 21, 2025 17:29
@Jeffwan (Contributor) commented Mar 23, 2025

AIBrix scrapes the metrics directly from the engine rather than from a Prometheus source at the moment. We talked with the inference gateway project about this earlier, and it won't affect AIBrix's future plans. The change looks good to us.

efimki added a commit to efimki/vllm that referenced this pull request Mar 26, 2025
@liu-cong

Thank you @Jeffwan and @Shaoting-Feng for confirming this change won't conflict with your features!

@simon-mo, given that this feature is controlled by a user-provided header and is disabled by default, can we get this into V0 for validation and follow up on V1 later?

coolkp added 4 commits March 26, 2025 20:17
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
@coolkp coolkp force-pushed the endpoint-load-metrics branch from e1f8925 to cdc1ac7 Compare March 26, 2025 20:18
@simon-mo (Collaborator)

You can do it as a follow-up, but we need the V1 PR within two weeks or we will have to revert this PR, given our policy that V0 and V1 need to have full parity.

@simon-mo (Collaborator)

@houseroad can you do a round of code quality review?

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@houseroad (Collaborator) left a comment

I think we need to add some unit tests and e2e tests. I'm also wondering if we can do some profiling of the e2e perf; to be safe, I would like to ensure the e2e perf doesn't regress noticeably.

seq_group.maybe_set_first_token_time(now)
if not seq_group.is_prefill():
    seq_group.set_last_token_time(now)
stats_snapshot = self._get_stats(scheduler_outputs, outputs,

nit: create a function that combines the get_stats call and the set-inband_stats logic, since this pattern repeats.
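
For illustration, a rough sketch of the kind of helper this nit points at; the InbandEngineStats fields and the _get_stats call are loose assumptions based on the diff context, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass
class InbandEngineStats:
    # Illustrative fields only; the PR's actual dataclass may differ.
    num_waiting_requests: int = 0
    kv_cache_usage: float = 0.0


def collect_stats_and_inband(engine, scheduler_outputs, outputs,
                             finished_before=None):
    """Compute engine stats once and derive the inband view from them, so the
    repeated get-stats-then-set-inband pattern lives in a single place."""
    stats = engine._get_stats(scheduler_outputs, outputs, finished_before)
    return stats, InbandEngineStats(
        num_waiting_requests=getattr(stats, "num_waiting_sys", 0),
        kv_cache_usage=getattr(stats, "gpu_cache_usage_sys", 0.0),
    )
```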


# Tuple[ChatCompletionResponse, Optional[InbandEngineStats]]
elif isinstance(generator, tuple):
    return JSONResponse(content=generator[0].model_dump(),

Also check len(generator) == 2 and the corresponding types? Maybe create a helper function.
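
One possible shape for such a helper, sketched with generic checks since the concrete response types aren't shown in this hunk; the name is a placeholder.

```python
# Hypothetical helper: validate the (response, inband_stats) tuple shape once
# instead of indexing into it positionally at every call site.
def unpack_response_with_stats(result):
    if not (isinstance(result, tuple) and len(result) == 2):
        raise TypeError(
            f"expected a (response, inband_engine_stats) pair, got {type(result)!r}")
    response, inband_stats = result
    return response, inband_stats
```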

outputs: list[CompletionOutput],
finished: bool,
metrics: Optional[RequestMetrics] = None,
inband_engine_stats: Optional[InbandEngineStats] = None,

Why not add it at the end of the parameter list?

@@ -0,0 +1,85 @@
# SPDX-License-Identifier: Apache-2.0

Add some unit tests?
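
As an example of the kind of unit test that could cover the new module, a self-contained sketch; `encode_load_metrics` and the exact output layout are placeholders standing in for whatever the file actually exposes.

```python
import json


# Placeholder encoder standing in for the module under test.
def encode_load_metrics(metrics: dict, fmt: str) -> str:
    if fmt == "json":
        return json.dumps({"named_metrics": metrics})
    # ORCA-style text layout (assumed)
    return ", ".join(f"named_metrics.{k}={v}" for k, v in metrics.items())


def test_encode_load_metrics_text_and_json():
    metrics = {"kv_cache_utilization": 0.25}
    assert encode_load_metrics(metrics, "text") == \
        "named_metrics.kv_cache_utilization=0.25"
    assert json.loads(encode_load_metrics(metrics, "json")) == \
        {"named_metrics": metrics}
```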

@mergify mergify bot removed the tpu Related to Google TPUs label Mar 28, 2025
@markmc (Member) commented Apr 1, 2025

Calling engine._get_stats() is a bit of a red flag - we already do this once per step in do_log_stats() so that heavy computation should not be repeated. Since we compute all of these metrics and store them in-memory in Prometheus collectors, why not just query those collectors from the OpenAI request handler for the values in order to build the response headers?

I'd also like to see this highly isolated to orca_metrics.py - i.e. as much of the ORCA-specific code encapsulated there - and also behind an off-by-default --enable-experimental-orca-inband-metrics CLI argument for now because AIUI this is all rather experimental at this stage? This isn't about performance, but rather avoiding committing to maintaining this metric format long term just yet

Also agree with Simon's stance on V1 - it might make sense to accept a V1-only implementation, but not a V0-only implementation

Hope that helps.
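
To make the suggestion concrete, a minimal sketch of reading already-collected values out of the in-memory prometheus_client registry instead of recomputing stats; the metric names (e.g. vllm:num_requests_waiting, vllm:gpu_cache_usage_perc) and the response-header name are assumptions, not a confirmed API.

```python
# Sketch only: metric and header names are assumptions, not vLLM's actual API.
from prometheus_client import REGISTRY

WANTED = {
    "vllm:num_requests_waiting": "waiting",
    "vllm:gpu_cache_usage_perc": "kv_cache_utilization",
}


def build_load_headers() -> dict[str, str]:
    """Read current values straight from the in-memory Prometheus collectors
    (no recomputation) and fold them into a single response header."""
    values = {}
    for family in REGISTRY.collect():
        if family.name in WANTED and family.samples:
            values[WANTED[family.name]] = family.samples[0].value
    text = ", ".join(f"named_metrics.{k}={v}" for k, v in values.items())
    return {"endpoint-load-metrics": text} if text else {}
```

The request handler could then merge the returned dict into the response headers only for requests that opted in via the request header.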

@coolkp (Contributor, Author) commented Apr 1, 2025

Calling engine._get_stats() is a bit of a red flag - we already do this once per step in do_log_stats() so that heavy computation should not be repeated. Since we compute all of these metrics and store them in-memory in Prometheus collectors, why not just query those collectors from the OpenAI request handler for the values in order to build the response headers?

I'd also like to see this highly isolated to orca_metrics.py - i.e. as much of the ORCA-specific code encapsulated there - and also behind an off-by-default --enable-experimental-orca-inband-metrics CLI argument for now because AIUI this is all rather experimental at this stage? This isn't about performance, but rather avoiding committing to maintaining this metric format long term just yet

Also agree with Simon's stance on V1 - it might make sense to accept a V1-only implementation, but not a V0-only implementation

Hope that helps.

It is off by default; we are using an HTTP request header to enable this metric since the metric is passed back in response headers, so we don't need a flag. Ack on the point about in-memory collectors. Do you know the frequency of the Stats computation? I can do some profiling of get_stats.

@markmc (Member) commented Apr 1, 2025

It is off by default; we are using an HTTP request header to enable this metric since the metric is passed back in response headers, so we don't need a flag.

The CLI arg would be the operator acknowledging that they are enabling an experimental feature.

@liu-cong commented Apr 1, 2025

Thanks @simon-mo and @markmc for the comments.

/hold

Let's hold for now until we have the V1 change prioritized, to avoid a potential rollback. I will follow up again.

@markmc (Member) commented Apr 2, 2025

Useful background from @smarterclayton on Slack worth capturing here for reference:

Utilization based balancing is a general construct (implemented in Envoy now as client-weighted round robin) to allow backends to report a utilization factor that the balancer can use to weight the decision of which backend to select with low-latency and relatively higher efficiency. If I were to put it in an operational framing:

  1. Model servers deal with inherently unpredictable request and response costs
  2. Balancers need some improved signals to make decisions
  3. We would also like to enable vLLM to expose those signals efficiently

In terms of rough complexity and runtime cost both to the model server and the balancer, there are three levels of signal exposure that we were exploring in order

  1. Frequently scrape the model server for metrics that allow the balancer to make a better decision (high value for operators, high cost to scrape frequently, interval-limited accuracy)
  2. Identify a limited set of continuous signals that could be returned per request to indicate load, using a protocol that Envoy supports OOTB and others are interested in (high value for balancers, lower cost to scrape, best accuracy at high qps)
  3. Using learnings from 2, implement probing load balancing / synchronous load detection with a limited set of metrics scraped by each balancer before requests are dispatched (still being developed in Envoy / others, lowest cost to scrape, best accuracy when fast)

So 1 right now is the work to add golden signals, to round out the metrics in vLLM and others so that most model-server deployers can rely on common operational patterns that carry over between model servers, and then to use that for the "inefficient scrape all the time" basic load-balancer operation.
Exposing a subset of those metrics to 2 allows the balancers (envoy ootb today) to natively perform utilization based balancing - anyone using envoy in front of vLLM would be able to use that to get the minimal algorithmic improvements gateway brings (kv-cache usage based balancing, etc). Gateway would also be able to use those signals as it is an envoy callout to make more efficient decisions and remove the aggressive polling loop.
The north star / ideal architecture in the long run is probing load balancing - 3 - but we believe the benefit of 2 minimizes the runtime load on vLLM while guiding us to that right set. We could consider an alternate path where we expose a “fast scrape endpoint” but it would have the same rough runtime cost as 2 (i.e. any locking / coordination to be able to sample a set of metrics at the native QPS rate)

A better version of the above should probably be in a simple Google doc that we could share between multiple communities, since we're attempting to align Envoy OSS + distinct model server communities + non-Envoy balancers on top of model servers.

@markmc (Member) commented Apr 2, 2025

My suggestion:

It might be worth considering whether this could be an external project that provides a prometheus_client-integrated middleware reusable across projects: just configure the middleware with a mapping of response header name to prometheus_client collector.
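
A sketch of that middleware idea, assuming Starlette/FastAPI and prometheus_client; the class name and the header-to-metric mapping are illustrative, not an existing package.

```python
# Sketch of a reusable middleware mapping response headers to Prometheus
# collectors; names here are illustrative.
from prometheus_client import REGISTRY
from starlette.middleware.base import BaseHTTPMiddleware


class PrometheusHeaderMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, header_to_metric: dict[str, str]):
        super().__init__(app)
        # e.g. {"x-vllm-num-requests-waiting": "vllm:num_requests_waiting"}
        self.header_to_metric = header_to_metric

    async def dispatch(self, request, call_next):
        response = await call_next(request)
        for header, metric_name in self.header_to_metric.items():
            value = REGISTRY.get_sample_value(metric_name)
            if value is not None:
                response.headers[header] = str(value)
        return response
```

An application could then opt in with app.add_middleware(PrometheusHeaderMiddleware, header_to_metric={"x-num-requests-waiting": "vllm:num_requests_waiting"}).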

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
efimki added a commit to efimki/vllm that referenced this pull request Jun 25, 2025
Signed-off-by: Misha Efimov <mef@google.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Jul 11, 2025
@mergify bot commented Jul 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @coolkp.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 11, 2025
@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jul 13, 2025
efimki added a commit to efimki/vllm that referenced this pull request Sep 15, 2025
efimki added a commit to efimki/vllm that referenced this pull request Sep 15, 2025
Forked from vllm-project#14906

Use `get_named_metrics_from_prometheus()` to collect metrics for Engine V1.

Signed-off-by: Misha Efimov <mef@google.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added stale Over 90 days of inactivity and removed unstale Received activity after being labelled stale labels Oct 12, 2025
Labels: frontend, needs-rebase, stale (Over 90 days of inactivity)

Successfully merging this pull request may close these issues:

[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management