
Conversation

@coolkp (Contributor) commented Mar 16, 2025

FIX #10086
Implemented for the V0 engine only, for the /chat/completions and /completions endpoints (a client-side sketch follows the list below):

  • Engine metrics are reported in response headers when enabled through the request header "endpoint-load-metrics-format".
  • Supported formats: "text" and "json".
  • No added overhead on responses in the default (disabled) case.
  • No new computation in the engine.
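
For illustration, a minimal client-side sketch of the opt-in flow. The request header name and the "text"/"json" formats come from the description above; the endpoint path, model name, response-header name, and metric keys shown are assumptions, not necessarily what the implementation uses.

```python
# Illustrative sketch only. "endpoint-load-metrics-format" is the request
# header described above; the response-header name and metric keys below
# are assumptions made for demonstration.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    headers={"endpoint-load-metrics-format": "text"},  # or "json"
    json={"model": "my-model", "prompt": "Hello", "max_tokens": 8},
)

# The engine load report is expected back in a response header, e.g. an
# ORCA-style header (name assumed):
print(resp.headers.get("endpoint-load-metrics"))
# e.g. 'named_metrics.kv_cache_utilization=0.30, named_metrics.waiting=2'
```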

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the frontend label Mar 16, 2025
@coolkp coolkp marked this pull request as ready for review March 16, 2025 23:06
@simon-mo (Collaborator)

@youngkent @houseroad, can you help review this, as it might conflict with the feature your team recently added for load measurement? Additionally, it would be useful to get a review of code quality and of whether you think the feature is implemented in the right way. Finally, we are heading toward a V0 feature freeze and should focus only on V1.

@houseroad (Collaborator) left a comment

In general, I don't think this would conflict with Meta internal features. Wondering how production-stack folks would like to collect such load metrics?

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

@coolkp coolkp requested a review from houseroad March 17, 2025 15:52
@coolkp (Contributor, Author) commented Mar 17, 2025

@simon-mo

@youngkent @houseroad, can you help review this, as it might conflict with the feature your team recently added for load measurement? Additionally, it would be useful to get a review of code quality and of whether you think the feature is implemented in the right way. Finally, we are heading toward a V0 feature freeze and should focus only on V1.

In general, I don't think this would conflict with Meta internal features. Wondering how production-stack folks would like to collect such load metrics?

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

Would production-stack rely on inband metrics as opposed to querying the prometheus metrics?
Also, I am not sure whether this will make it to V1 at all. We will have to do more scaled testing to determine whether this kind of metrics gathering has an advantage over out-of-band metrics in Prometheus; the concept is not entirely validated yet. Getting it into V0 will allow us to test faster, since we won't have to maintain our own images and fall out of sync with other vLLM features. I can send a quick follow-up for V1 within a week if this tests well; the only changes required will be in the engine, while entrypoints and sequence shouldn't need changes. I also didn't see RequestMetrics being populated in V1. Is it already implemented?

@coolkp (Contributor, Author) commented Mar 19, 2025

Hi, gentle ping on this

@simon-mo (Collaborator) commented Mar 19, 2025

The metrics should be there for V1; cc @markmc, who implemented the stack.

As we have turned on V1 by default, we would like any feature introduced to vLLM to be implemented in both V0 and V1, or in V1 only, to minimize porting cost.

@simon-mo simon-mo requested review from markmc and removed request for houseroad March 19, 2025 23:44
@markmc (Member) commented Mar 20, 2025

There's a lot of useful info in #10086, but this PR seems (at a glance) to focus on the proposal for metrics to be reported in the response headers using the ORCA format.

I think it could be really useful to document a proposal on just the inband metrics piece specifically, and I'd especially appreciate an explanation of whether and how it relates to other Kubernetes-associated load-balancing efforts I tried to capture here: https://docs.vllm.ai/en/stable/design/v1/metrics.html#autoscaling-and-load-balancing

@Shaoting-Feng (Contributor) commented Mar 21, 2025

Would production-stack rely on inband metrics as opposed to querying the prometheus metrics?

The production stack relies on Prometheus metrics rather than inband metrics. So as long as Prometheus scraping is unchanged and inband metrics are purely additive, the production stack shouldn't be affected.

Besides load, we may also consider caching distribution, doing something like sticky routing, etc. prod-stack should also cover this, right?

The production stack supports session sticky routing, i.e., routing the request to the appropriate engine URL according to the request headers.

@coolkp coolkp force-pushed the endpoint-load-metrics branch from 83ab707 to e1f8925 Compare March 21, 2025 17:29
@Jeffwan (Contributor) commented Mar 23, 2025

AIBrix scrapes the metrics directly from the engine rather than from a Prometheus source at the moment. We talked with the inference gateway project about this earlier, and it won't affect AIBrix's future plans. The change looks good to us.

efimki added a commit to efimki/vllm that referenced this pull request Mar 26, 2025
@liu-cong

Thank you @Jeffwan and @Shaoting-Feng for confirming this change won't conflict with your features!

@simon-mo, given that this feature is controlled by a user-provided header and is disabled by default, can we get this into V0 for validation and follow up on V1 later?

coolkp added 4 commits March 26, 2025 20:17
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
Signed-off-by: kunjan <kunjanp@google.com>
@coolkp coolkp force-pushed the endpoint-load-metrics branch from e1f8925 to cdc1ac7 Compare March 26, 2025 20:18
@simon-mo (Collaborator)

You can do it as a follow-up, but we need the V1 PR within two weeks or we will have to revert this PR, given our policy that V0 and V1 need to have full parity.

@simon-mo (Collaborator)

@houseroad can you do a round of code quality review?

@mergify mergify bot added the tpu Related to Google TPUs label Mar 27, 2025
@houseroad (Collaborator) left a comment

I think we need to add some unit tests and e2e tests. I'm also wondering if we can do some profiling of the e2e perf; to be safe, I would like to ensure the e2e perf doesn't regress noticeably.

seq_group.maybe_set_first_token_time(now)
if not seq_group.is_prefill():
    seq_group.set_last_token_time(now)
stats_snapshot = self._get_stats(scheduler_outputs, outputs,

nit: create a function that combines the get_stats call and the set-inband_stats logic, since this pattern repeats.
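
For illustration, a rough sketch of the kind of helper this nit points at; the InbandEngineStats fields and the _get_stats call are loose assumptions based on the diff context, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass
class InbandEngineStats:
    # Illustrative fields only; the PR's actual dataclass may differ.
    num_waiting_requests: int = 0
    kv_cache_usage: float = 0.0


def collect_stats_and_inband(engine, scheduler_outputs, outputs,
                             finished_before=None):
    """Compute engine stats once and derive the inband view from them, so the
    repeated get-stats-then-set-inband pattern lives in a single place."""
    stats = engine._get_stats(scheduler_outputs, outputs, finished_before)
    return stats, InbandEngineStats(
        num_waiting_requests=getattr(stats, "num_waiting_sys", 0),
        kv_cache_usage=getattr(stats, "gpu_cache_usage_sys", 0.0),
    )
```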


# Tuple[ChatCompletionResponse, Optional[InbandEngineStats]]
elif isinstance(generator, tuple):
    return JSONResponse(content=generator[0].model_dump(),

Also check len(generator) == 2 and the corresponding types? Maybe create a helper function.
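
One possible shape for such a helper, sketched with generic checks since the concrete response types aren't shown in this hunk; the name is a placeholder.

```python
# Hypothetical helper: validate the (response, inband_stats) tuple shape once
# instead of indexing into it positionally at every call site.
def unpack_response_with_stats(result):
    if not (isinstance(result, tuple) and len(result) == 2):
        raise TypeError(
            f"expected a (response, inband_engine_stats) pair, got {type(result)!r}")
    response, inband_stats = result
    return response, inband_stats
```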

outputs: list[CompletionOutput],
finished: bool,
metrics: Optional[RequestMetrics] = None,
inband_engine_stats: Optional[InbandEngineStats] = None,

Why not add it at the end of the parameter list?

@@ -0,0 +1,85 @@
# SPDX-License-Identifier: Apache-2.0

Add some unit tests?
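
As an example of the kind of unit test that could cover the new module, a self-contained sketch; `encode_load_metrics` and the exact output layout are placeholders standing in for whatever the file actually exposes.

```python
import json


# Placeholder encoder standing in for the module under test.
def encode_load_metrics(metrics: dict, fmt: str) -> str:
    if fmt == "json":
        return json.dumps({"named_metrics": metrics})
    # ORCA-style text layout (assumed)
    return ", ".join(f"named_metrics.{k}={v}" for k, v in metrics.items())


def test_encode_load_metrics_text_and_json():
    metrics = {"kv_cache_utilization": 0.25}
    assert encode_load_metrics(metrics, "text") == \
        "named_metrics.kv_cache_utilization=0.25"
    assert json.loads(encode_load_metrics(metrics, "json")) == \
        {"named_metrics": metrics}
```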

@mergify mergify bot removed the tpu Related to Google TPUs label Mar 28, 2025
@markmc (Member) commented Apr 1, 2025

Calling engine._get_stats() is a bit of a red flag - we already do this once per step in do_log_stats() so that heavy computation should not be repeated. Since we compute all of these metrics and store them in-memory in Prometheus collectors, why not just query those collectors from the OpenAI request handler for the values in order to build the response headers?

I'd also like to see this highly isolated to orca_metrics.py - i.e. as much of the ORCA-specific code encapsulated there - and also behind an off-by-default --enable-experimental-orca-inband-metrics CLI argument for now because AIUI this is all rather experimental at this stage? This isn't about performance, but rather avoiding committing to maintaining this metric format long term just yet

Also agree with Simon's stance on V1 - it might make sense to accept a V1-only implementation, but not a V0-only implementation

Hope that helps.
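
To make the suggestion concrete, a minimal sketch of reading already-collected values out of the in-memory prometheus_client registry instead of recomputing stats; the metric names (e.g. vllm:num_requests_waiting, vllm:gpu_cache_usage_perc) and the response-header name are assumptions, not a confirmed API.

```python
# Sketch only: metric and header names are assumptions, not vLLM's actual API.
from prometheus_client import REGISTRY

WANTED = {
    "vllm:num_requests_waiting": "waiting",
    "vllm:gpu_cache_usage_perc": "kv_cache_utilization",
}


def build_load_headers() -> dict[str, str]:
    """Read current values straight from the in-memory Prometheus collectors
    (no recomputation) and fold them into a single response header."""
    values = {}
    for family in REGISTRY.collect():
        if family.name in WANTED and family.samples:
            values[WANTED[family.name]] = family.samples[0].value
    text = ", ".join(f"named_metrics.{k}={v}" for k, v in values.items())
    return {"endpoint-load-metrics": text} if text else {}
```

The request handler could then merge the returned dict into the response headers only for requests that opted in via the request header.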

@coolkp (Contributor, Author) commented Apr 1, 2025

Calling engine._get_stats() is a bit of a red flag - we already do this once per step in do_log_stats() so that heavy computation should not be repeated. Since we compute all of these metrics and store them in-memory in Prometheus collectors, why not just query those collectors from the OpenAI request handler for the values in order to build the response headers?

I'd also like to see this highly isolated to orca_metrics.py - i.e. as much of the ORCA-specific code encapsulated there - and also behind an off-by-default --enable-experimental-orca-inband-metrics CLI argument for now because AIUI this is all rather experimental at this stage? This isn't about performance, but rather avoiding committing to maintaining this metric format long term just yet

Also agree with Simon's stance on V1 - it might make sense to accept a V1-only implementation, but not a V0-only implementation

Hope that helps.

It is off by default; we are using an HTTP request header to enable this metric since the metric is passed back in response headers, so we don't need a flag. Ack on the point about in-memory collectors. Do you know the frequency of the Stats computation? I can do some profiling of get_stats.

@markmc (Member) commented Apr 1, 2025

It is off by default; we are using an HTTP request header to enable this metric since the metric is passed back in response headers, so we don't need a flag.

The CLI arg would be the operator acknowledging that they are enabling an experimental feature.

@liu-cong commented Apr 1, 2025

Thanks @simon-mo and @markmc for the comments.

/hold

Let's hold for now until we have the V1 change prioritized, to avoid a potential rollback. I will follow up again.

@markmc (Member) commented Apr 2, 2025

Useful background from @smarterclayton on Slack worth capturing here for reference:

Utilization based balancing is a general construct (implemented in Envoy now as client-weighted round robin) to allow backends to report a utilization factor that the balancer can use to weight the decision of which backend to select with low-latency and relatively higher efficiency. If I were to put it in an operational framing:

  1. Model servers deal with inherently unpredictable request and response costs
  2. Balancers need some improved signals to make decisions
  3. We would also like to enable vLLM to expose those signals efficiently

In terms of rough complexity and runtime cost both to the model server and the balancer, there are three levels of signal exposure that we were exploring in order

  1. Frequently scrape the model server for metrics that allow the balancer to make a better decision (high value for operators, high cost to scrape frequently, interval-limited accuracy)
  2. Identify a limited set of continuous signals that could be returned per request to indicate load, using a protocol that Envoy supports OOTB and others are interested in (high value for balancers, lower cost to scrape, best accuracy at high qps)
  3. Using learnings from 2, implement probing load balancing / synchronous load detection with a limited set of metrics scraped by each balancer before requests are dispatched (still being developed in Envoy / others, lowest cost to scrape, best accuracy when fast)

So 1 right now is the work to add golden signals, to round out the metrics in vLLM and others so that most model-server deployers can rely on common operational patterns that carry over between model servers, and then to use that for the "inefficient scrape all the time" basic load-balancer operation.
Exposing a subset of those metrics to 2 allows the balancers (envoy ootb today) to natively perform utilization based balancing - anyone using envoy in front of vLLM would be able to use that to get the minimal algorithmic improvements gateway brings (kv-cache usage based balancing, etc). Gateway would also be able to use those signals as it is an envoy callout to make more efficient decisions and remove the aggressive polling loop.
The north star / ideal architecture in the long run is probing load balancing - 3 - but we believe the benefit of 2 minimizes the runtime load on vLLM while guiding us to that right set. We could consider an alternate path where we expose a “fast scrape endpoint” but it would have the same rough runtime cost as 2 (i.e. any locking / coordination to be able to sample a set of metrics at the native QPS rate)

A better version of the above should probably be in a simple Google doc that we could share between multiple communities, since we're attempting to align Envoy OSS + distinct model server communities + non-Envoy balancers on top of model servers.

@markmc (Member) commented Apr 2, 2025

My suggestion:

It might be worth considering whether this could be an external project that provides a prometheus_client-integrated middleware reusable across projects: just configure the middleware with a mapping of response header name to prometheus_client collector.
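
A sketch of that middleware idea, assuming Starlette/FastAPI and prometheus_client; the class name and the header-to-metric mapping are illustrative, not an existing package.

```python
# Sketch of a reusable middleware mapping response headers to Prometheus
# collectors; names here are illustrative.
from prometheus_client import REGISTRY
from starlette.middleware.base import BaseHTTPMiddleware


class PrometheusHeaderMiddleware(BaseHTTPMiddleware):
    def __init__(self, app, header_to_metric: dict[str, str]):
        super().__init__(app)
        # e.g. {"x-vllm-num-requests-waiting": "vllm:num_requests_waiting"}
        self.header_to_metric = header_to_metric

    async def dispatch(self, request, call_next):
        response = await call_next(request)
        for header, metric_name in self.header_to_metric.items():
            value = REGISTRY.get_sample_value(metric_name)
            if value is not None:
                response.headers[header] = str(value)
        return response
```

An application could then opt in with app.add_middleware(PrometheusHeaderMiddleware, header_to_metric={"x-num-requests-waiting": "vllm:num_requests_waiting"}).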

@mergify mergify bot added tpu Related to Google TPUs and removed tpu Related to Google TPUs labels Apr 9, 2025
efimki added a commit to efimki/vllm that referenced this pull request Jun 25, 2025
Signed-off-by: Misha Efimov <mef@google.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Jul 11, 2025
@mergify bot commented Jul 11, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @coolkp.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 11, 2025
@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jul 13, 2025
efimki added a commit to efimki/vllm that referenced this pull request Sep 15, 2025
efimki added a commit to efimki/vllm that referenced this pull request Sep 15, 2025
Forked from vllm-project#14906

Use `get_named_metrics_from_prometheus()` to collect metrics for Engine V1.

Signed-off-by: Misha Efimov <mef@google.com>
@github-actions

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added stale Over 90 days of inactivity and removed unstale Received activity after being labelled stale labels Oct 12, 2025
Labels: frontend, needs-rebase, stale (Over 90 days of inactivity)

Successfully merging this pull request may close these issues:

[Feature]: Enhance integration with advanced LB/gateways with better load/cost reporting and LoRA management