Add endpoint load metrics #14906
base: main
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
@youngkent @houseroad can you help review this, as it might conflict with the feature your team recently added for load measurement? Additionally, it would be useful to get a review for code quality and whether you think the feature is implemented in the right way. Finally, we are approaching the V0 feature freeze and should only focus on V1. |
houseroad
left a comment
In general, I don't think this would conflict with Meta internal features. Wondering how the production-stack folks would like to collect such load metrics?
Besides load, we may also consider caching distribution, doing something like sticky routing, etc.; prod-stack should also cover this, right?
Would production-stack rely on inband metrics as opposed to querying the Prometheus metrics? |
|
Hi, gentle ping on this |
|
The metrics should be there for V1 (cc @markmc, who implemented the stack). As we have turned on V1 by default, we would like any feature introduced to vLLM to be implemented in both V0 and V1, or V1 only, to minimize porting cost. |
|
There's a lot of useful info in #10086, but this PR seems (at a glance) to focus on the proposal for metrics to be reported in the response headers using the ORCA format. I think it could be really useful to document a proposal on just the inband metrics piece specifically, and I'd especially appreciate an explanation of whether and how it relates to other Kubernetes-associated load-balancing efforts I tried to capture here: https://docs.vllm.ai/en/stable/design/v1/metrics.html#autoscaling-and-load-balancing |
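For illustration only, here is a minimal sketch of consuming ORCA-style inband load metrics from a response header. The header value format and metric names below are assumptions loosely based on the ORCA per-request reporting idea, not this PR's final API:

```python
# Hypothetical sketch: parse ORCA-style load metrics returned in a response
# header. Header value format and metric names are assumed for illustration.
from typing import Dict


def parse_orca_text_header(value: str) -> Dict[str, float]:
    """Parse a comma-separated 'key=value' load report."""
    metrics: Dict[str, float] = {}
    for part in value.split(","):
        key, _, raw = part.strip().partition("=")
        if key and raw:
            metrics[key] = float(raw)
    return metrics


# Example usage with an assumed header value:
header = "named_metrics.kv_cache_utilization=0.35, named_metrics.num_waiting_requests=2"
print(parse_orca_text_header(header))
```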
Production stack relies on the Prometheus metrics instead of inband metrics. So if Prometheus scraping is unchanged, and inband metrics are purely additive, production stack shouldn't be affected.
Production stack supports session sticky routing, i.e., routing the request to the appropriate engine URL according to the request headers. |
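As a rough illustration of the sticky-routing idea mentioned above (the names below are made up, not production-stack's actual implementation), routing by a session header can be as simple as a deterministic mapping from session ID to engine URL:

```python
# Minimal sketch of header-based session-sticky routing (illustrative only).
import hashlib
from typing import Mapping, Sequence


def pick_engine(headers: Mapping[str, str],
                engine_urls: Sequence[str],
                session_header: str = "x-session-id") -> str:
    """Route requests carrying the same session ID to the same engine URL."""
    session_id = headers.get(session_header)
    if session_id is None:
        # No session header: a real router would load-balance here.
        return engine_urls[0]
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return engine_urls[int(digest, 16) % len(engine_urls)]


print(pick_engine({"x-session-id": "abc"},
                  ["http://engine-0:8000", "http://engine-1:8000"]))
```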
83ab707 to e1f8925
|
aibrix scrapes the metrics from the engine directly instead of from a Prometheus source at this moment. We talked with the inference gateway project earlier on this, and it won't affect aibrix's future plans. The change looks good to us. |
Forked from vllm-project#14906
|
Thank you @Jeffwan and @Shaoting-Feng for confirming this change won't conflict with your features! @simon-mo, given this feature is controlled by a user-provided header and is disabled by default, can we get this into V0 for validation and follow up on V1 later? |
Signed-off-by: kunjan <kunjanp@google.com>
e1f8925 to cdc1ac7
|
You can do it as a follow-up, but we need the V1 PR within two weeks or we will have to revert this PR, given our policy that V0 and V1 need to have full parity. |
|
@houseroad can you do a round of code quality review? |
houseroad
left a comment
I think we need to add some unit tests and an e2e test. Also wondering if we can have some profiling on the e2e perf; to be safe, I'd like to ensure the e2e perf doesn't noticeably regress.
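For the profiling part, something like the rough micro-benchmark below could give a first read on the cost of the stats path; the `collect_stats` callable is a placeholder standing in for the engine's real stats computation, not the PR's code:

```python
# Rough sketch of a micro-benchmark for the stats path (illustrative only).
# Replace `collect_stats` with the real call under test.
import statistics
import time


def benchmark(fn, iterations: int = 1000) -> None:
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    print(f"mean={statistics.mean(samples) * 1e6:.1f}us "
          f"p99={statistics.quantiles(samples, n=100)[-1] * 1e6:.1f}us")


def collect_stats() -> None:
    # Placeholder for the engine's stats computation.
    sum(range(1000))


benchmark(collect_stats)
```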
seq_group.maybe_set_first_token_time(now)
if not seq_group.is_prefill():
    seq_group.set_last_token_time(now)
stats_snapshot = self._get_stats(scheduler_outputs, outputs,
nit: create a function that wraps the get_stats call plus the set inband_stats logic, since it repeats; see the sketch below.
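Something along these lines, perhaps; the names and signatures are guessed from the diff context above, so treat this as a sketch of the suggested helper rather than the actual change:

```python
# Sketch of the suggested helper (names and signature assumed, not the PR's code).
from typing import Any


class _EngineSketch:
    def _get_stats(self, scheduler_outputs: Any, outputs: Any) -> dict:
        return {"num_waiting": 0}  # stand-in for the real stats snapshot

    def _record_stats(self, stats: dict) -> None:
        pass  # stand-in for the existing logger path

    def _get_and_record_stats(self, scheduler_outputs: Any, outputs: Any) -> dict:
        """Compute the stats snapshot once, feed the loggers, and return it so
        the caller can also attach it as inband stats, avoiding the repeated
        get_stats + inband_stats sequence at each call site."""
        stats = self._get_stats(scheduler_outputs, outputs)
        self._record_stats(stats)
        return stats
```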
|
# Tuple[ChatCompletionResponse, Optional[InbandEngineStats]]
elif isinstance(generator, tuple):
    return JSONResponse(content=generator[0].model_dump(),
Also check len(generator) == 2, and the corresponding types? Maybe create a helper function.
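A possible shape for that helper, assuming the (response, inband stats) tuple convention shown in the diff above; the names are illustrative, not the PR's actual code:

```python
# Sketch of a helper validating the (response, inband stats) tuple shape.
from typing import Any, Optional, Tuple


def split_response_and_stats(generator: Any) -> Tuple[Any, Optional[Any]]:
    """Return (response, inband_stats), enforcing the expected 2-tuple shape."""
    if isinstance(generator, tuple):
        if len(generator) != 2:
            raise TypeError(
                f"expected (response, inband_stats) 2-tuple, got {len(generator)} items")
        return generator[0], generator[1]
    return generator, None
```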
outputs: list[CompletionOutput],
finished: bool,
metrics: Optional[RequestMetrics] = None,
inband_engine_stats: Optional[InbandEngineStats] = None,
Why not add it at the end of the parameter list?
@@ -0,0 +1,85 @@
# SPDX-License-Identifier: Apache-2.0
Add some unit tests?
|
Calling … I'd also like to see this highly isolated to … Also agree with Simon's stance on V1: it might make sense to accept a V1-only implementation, but not a V0-only implementation. Hope that helps. |
It is off by default. We are using an HTTP request header to enable this metric, since the metric is passed back in response headers, so we don't need a flag. Ack on the point of in-memory collectors. Do you know the frequency of Stats computation? I can do some profiling on get_stats. |
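For context, roughly how a client might opt in via a request header and read the metrics back; the header names here are hypothetical placeholders, since the exact header isn't quoted in this thread:

```python
# Illustrative client flow (header names are hypothetical placeholders;
# assumes the `requests` package is installed and a server is running).
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "my-model", "prompt": "hello", "max_tokens": 8},
    # Hypothetical opt-in header; metrics are returned only when requested.
    headers={"x-request-load-metrics": "true"},
)
# Hypothetical response header carrying the inband engine stats.
print(resp.headers.get("x-engine-load-metrics"))
```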
The CLI arg would be the operator acknowledging that they are enabling an experimental feature.
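i.e. something along these lines; the flag name is made up purely to illustrate the opt-in gate:

```python
# Hypothetical CLI gate for the experimental feature (flag name is invented).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--enable-inband-load-metrics",
    action="store_true",
    help="Experimental: allow per-request load metrics in response headers.")
args = parser.parse_args([])  # empty list so the sketch runs without CLI input
print(args.enable_inband_load_metrics)  # False unless the operator opts in
```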
|
Useful background from @smarterclayton on Slack worth capturing here for reference:
|
|
My suggestion:
|
Signed-off-by: Misha Efimov <mef@google.com>
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
|
This pull request has merge conflicts that must be resolved before it can be merged. |
Forked from vllm-project#14906
Forked from vllm-project#14906 Use `get_named_metrics_from_prometheus()` to collect metrics for Engine V1. Signed-off-by: Misha Efimov <mef@google.com>
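A rough idea of what collecting named metrics from the in-process Prometheus registry might look like; this is a sketch under stated assumptions, not the fork's actual `get_named_metrics_from_prometheus()` implementation:

```python
# Sketch of pulling named metric samples from the in-process Prometheus
# registry (illustrative; not the fork's actual implementation).
from typing import Dict, Iterable

from prometheus_client import REGISTRY


def get_named_metrics(names: Iterable[str]) -> Dict[str, float]:
    wanted = set(names)
    values: Dict[str, float] = {}
    for metric in REGISTRY.collect():
        for sample in metric.samples:
            if sample.name in wanted:
                values[sample.name] = sample.value
    return values


# Example query; the metric name is only registered inside a running vLLM engine.
print(get_named_metrics(["vllm:num_requests_waiting"]))
```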
|
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you! |
FIX #10086
Implemented for the V0 engine only, for the /chat/completions and /completions endpoints.