-
-
Notifications
You must be signed in to change notification settings - Fork 11.1k
[KVConnector][Metrics] Aggregate scheduler-side KVConnectorStats #26046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@NickLucche a small extension on top of your #24786 :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request extends metrics reporting for KVConnector to include stats from the Scheduler. The change correctly identifies the place to aggregate scheduler-side stats with worker-side stats. However, there is a potential issue in how the aggregation is performed. The aggregate method on KVConnectorStats might return a new stats object rather than modifying it in-place. The current implementation discards the return value, which could lead to scheduler-side stats not being reported. I've suggested a fix to address this.
vllm/v1/core/sched/scheduler.py
Outdated
| if kv_connector_stats and self.connector: | ||
| stats = self.connector.get_kv_connector_stats() | ||
| if stats: | ||
| kv_connector_stats.aggregate(stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The aggregate method of KVConnectorStats might return a new instance rather than modifying the object in-place. Other parts of the codebase, like KVConnectorLogging, reassign the result of aggregate. To ensure correctness and prevent potential loss of the aggregated stats, you should assign the result of aggregate back to kv_connector_stats.
| kv_connector_stats.aggregate(stats) | |
| kv_connector_stats = kv_connector_stats.aggregate(stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think gemini's concern is valid given the interface is pretty flexible here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @QierLi thanks a lot for adding this feature!
Although the change is very small due to the flexibility of the existing infra, I think it would be great if you could add a simple unit test showcasing scheduler<>worker metrics merging!
vllm/v1/core/sched/scheduler.py
Outdated
| if kv_connector_stats and self.connector: | ||
| stats = self.connector.get_kv_connector_stats() | ||
| if stats: | ||
| kv_connector_stats.aggregate(stats) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think gemini's concern is valid given the interface is pretty flexible here
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Qier Li <kevin44036@gmail.com>
Signed-off-by: Qier Li <kevin44036@gmail.com>
Signed-off-by: Qier Li <kevin44036@gmail.com>
Signed-off-by: Qier Li <kevin44036@gmail.com>
Added a test to guard the scheduler stat aggregation via update_from_output path. : ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
|
@QierLi , thanks for quick reply, I initiated a quick fix, |
|
I'm wondering whether this could introduce some confusion. Currently the connector interface methods are either scheduler-side or worker-side. We're now saying one of them will be called on both sides. I'm thinking we should even separate the KVConnector interface (abstract class) into two separate ones for the worker side and scheduler side. And maybe there should be a separate method for querying the scheduler-side stats. |
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: bbartels <benjamin@bartels.dev>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…m-project#26046) Signed-off-by: Qier Li <kevin44036@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Purpose
#22188 introduced generic metrics reporting for KVConnector. Building on that, this PR extends metrics to include stats from the Scheduler, in addition to those from Workers.
The Scheduler-side KVConnector should export several important metrics, such as prefix matching for CPU KV (get_num_new_matched_tokens - duration, token counts), KV Cache event's (take_events) etc, while Workers' export KV transfers'.
Test Plan
Existed tests.
The changes are a no-op if no KVConnectorStats are reported from the Scheduler.
Test Result
No new breakages.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.