-
-
Notifications
You must be signed in to change notification settings - Fork 11k
[P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy
#22188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces logging for KVTransferMetrics, which is a valuable addition for monitoring performance. However, my review identified a few critical issues in the implementation. There's a bug in the metrics aggregation logic that would lead to incorrect, doubled statistics. Additionally, the logging class for these metrics is functionally broken and would cause a runtime error. I've also pointed out a potential issue with returning a mutable global object, which could lead to subtle bugs. I've provided detailed comments and code suggestions to address these issues.
ec74abc to
405dd79
Compare
|
/gemini review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a framework for logging KV transfer metrics, which is a valuable addition for monitoring performance. The changes include new classes for metrics collection, plumbing the metrics from workers to the scheduler, and integrating with the logging system. The implementation for Nixl is also included, though it's in an early stage with placeholder values.
I've identified a couple of issues in the new metrics collection logic that should be addressed. One is a critical bug in defaultdict usage that would cause a runtime error, and the other is a potential issue with in-place modification of a list that could lead to subtle bugs. Addressing these will improve the robustness and correctness of the new feature.
|
This pull request has merge conflicts that must be resolved before it can be |
KVTransferMetrics and aggregation strategy
e2443f4 to
73afb31
Compare
|
This pull request has merge conflicts that must be resolved before it can be |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution @NickLucche — I’ve added a few inline comments on API boundaries and maintainability.
cd02840 to
b630bc7
Compare
| """ | ||
| return None | ||
|
|
||
| def get_kv_transfer_stats(self) -> Optional["KVTransferStats"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickLucche I was thinking this could return the optional dict... KVTransferStats type / implementation would then only be used on the scheduler side
|
@lacora2017 has imported this pull request. If you are a Meta employee, you can view this in D82610371. |
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
|
@lacora2017 has imported this pull request. If you are a Meta employee, you can view this in D82610371. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR :) Looks great overall only some nits
vllm/distributed/kv_transfer/kv_connector/v1/multi_connector.py
Outdated
Show resolved
Hide resolved
|
Thanks for reviewing ! |
Signed-off-by: NickLucche <nlucches@redhat.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NickLucche
| All sub-classes need to be serializable as stats are sent from worker to | ||
| logger process. | ||
| """ | ||
| data: dict[str, Any] = field(default_factory=dict) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not sure how I feel about having a base class with a dict like this and think it would be cleaner to keep KVConnectorStats just on the scheduler side and just use dicts directly on the worker side. Perhaps we can experiment with that as a follow-on change though and compare what it looks like.
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com>
| output.finished_sending, output.finished_recving = ( | ||
| kv_connector.get_finished(scheduler_output.finished_req_ids)) | ||
|
|
||
| kv_connector.clear_connector_metadata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NickLucche According to the KV Connector API documentation, clear_connector_metadata should be invoked after every model execution. Could you clarify why it was removed?
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com>
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: charlifu <charlifu@amd.com>
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com>
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com>
…llm-project#22188) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
This PR provides support for a general
KVTransferMetricsobject while laying the ground for recording stats, but without tracking actual metrics just yet, waiting for NIXL to expose actual telemetry.Adding those should be a matter of expanding
record_transferand providingaggregate+reduceoperations.Therefore, this PR is focusing on interfaces and plumbing required for aggregating and reducing KVStats across multiple different kvconnectors (when
MultiKVConnectoris used) as well as across TP ranks.Aggregation across connectors happens in the MultiKVConnector class (newly added
get_kv_transfer_statsinterface) at the worker-level.Aggregation across ranks happens in the MultiProcExecutor (same as finished_reqs sets) on the main/fe process.
Final reduction happens on logger in main/fe process.
A few notes on the
KVTransferStatsinterface:.aggregateis for "fusing" two stats objects into one.reduceis for computing representative values (eg avg/median..) from the collected series, for printing/storing.With this initial enablement PR we're also setting the stage for more complex metric management, including exposing them to Prometheus. For now, we just print them to stdout with the default logger.
Example:
Test with
Update
I've re-worked this PR a bit to address the disalignment between the async READ xfer-related telemetry collection and the scheduler->logger workflow.
Basically it can happen that telemetry from nixl is received on a scheduler step where no ECOs are to be sent to the AsyncLLM.
To address that, we introduce a buffer (which can conveniently use.aggregate) to address this lag.Buffered stats will then only be forwarded (and reset) when the scheduler is set to produce ECOs (checking
num_scheduled_tokens).I've actually ended up leveraging @njhill PR #22995 which was facing the same issue.
I've also integrated suggestions from @sdavidbd review (thanks!):
KVTransferStatsby connector type in a transparent way to the interfaceUpdate 2
Discussing with @njhill , we've decided to move to a process where the a generic data representation (a dict) is sent from P1->P0, in order to address the limitations of msgspec and OOT plugins, which would require dynamic typing.
At the same time, we still allow every connector to define their custom behavior (eg what stats to expose) by inheriting
KVConnectorStatsand operating on the data dictionary. Also, the data collected remains totally in control of the Connector, as long as it returned as a generic serializable dictionary.Follow up PRs: