Conversation

@NickLucche
Collaborator

@NickLucche NickLucche commented Aug 4, 2025

This PR adds a general KVTransferMetrics object and lays the groundwork for recording stats. It does not track actual metrics yet, since we are waiting for NIXL to expose telemetry.
Adding those should be a matter of expanding record_transfer and providing the aggregate+reduce operations.

This PR therefore focuses on the interfaces and plumbing required to aggregate and reduce KVStats across multiple kvconnectors (when MultiKVConnector is used) as well as across TP ranks.

Aggregation across connectors happens in the MultiKVConnector class (via the newly added get_kv_transfer_stats interface) at the worker level.
Aggregation across ranks happens in the MultiProcExecutor (same as the finished_reqs sets) on the main/fe process.
Final reduction happens in the logger, also on the main/fe process.

A few notes on the KVTransferStats interface:

  • .aggregate "fuses" two stats objects into one
  • .reduce computes representative values (e.g. avg/median) from the collected series, for printing/storing
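
As a rough sketch of how these two operations could fit together (not the code in this PR: SketchTransferStats and the per-transfer duration metric are made up for illustration, since NIXL does not expose real telemetry yet):

```python
from dataclasses import dataclass, field
from statistics import mean, median


@dataclass
class SketchTransferStats:
    """Hypothetical stand-in for a KVTransferStats-like object."""

    # One entry per recorded transfer (assumed metric: duration in seconds).
    transfer_durations: list[float] = field(default_factory=list)

    def record_transfer(self, duration: float) -> None:
        self.transfer_durations.append(duration)

    def aggregate(self, other: "SketchTransferStats") -> "SketchTransferStats":
        # Fuse two stats objects by concatenating their series.
        # A new object is returned, so neither input is mutated.
        return SketchTransferStats(
            self.transfer_durations + other.transfer_durations)

    def reduce(self) -> dict[str, float]:
        # Collapse the collected series into representative values.
        if not self.transfer_durations:
            return {"num_transfers": 0}
        return {
            "num_transfers": len(self.transfer_durations),
            "avg_duration": mean(self.transfer_durations),
            "median_duration": median(self.transfer_durations),
        }


# Stats coming from two TP ranks are fused first, then reduced once.
rank0 = SketchTransferStats([0.10, 0.30])
rank1 = SketchTransferStats([0.20])
summary = rank0.aggregate(rank1).reduce()
```

Keeping the raw series until the final reduce means values such as the median are computed over all ranks, rather than averaged from per-rank summaries.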

With this initial enablement PR we're also setting the stage for more complex metric management, including exposing the metrics to Prometheus. For now, we just print them to stdout with the default logger.

Example:

(APIServer pid=3026493) INFO 08-05 17:32:58 [loggers.py:126] Engine 000: Avg prompt throughput: 53.2 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=3026493) INFO 08-05 17:32:58 [metrics.py:122] KVConnectorType: KVConnectorType.NIXL, KV Transfer metrics: num_successful_transfers=1

Test with

tests/v1/kv_connector/unit/test_nixl_connector.py::test_kv_transfer_stats
tests/v1/kv_connector/unit/test_nixl_connector.py::test_kv_transfer_stats_aggregation

Update

I've reworked this PR a bit to address the misalignment between the asynchronous collection of READ-transfer telemetry and the scheduler->logger workflow.
Telemetry from NIXL can arrive on a scheduler step where no ECOs (EngineCoreOutputs) are to be sent to the AsyncLLM. To handle this lag, we introduce a buffer, which can conveniently reuse .aggregate.
Buffered stats are then forwarded (and reset) only when the scheduler is set to produce ECOs, which we detect by checking num_scheduled_tokens.
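
The buffering idea can be sketched roughly as follows. This is an illustration only: StatsBuffer, add and flush_if_scheduled are hypothetical names, stats are simplified to plain counters, and num_scheduled_tokens is assumed to be a single integer total for the step.

```python
from typing import Optional


class StatsBuffer:
    """Hypothetical buffer; stats are plain counters keyed by metric name."""

    def __init__(self) -> None:
        self._pending: Optional[dict[str, int]] = None

    def add(self, stats: Optional[dict[str, int]]) -> None:
        # Merge newly received telemetry into whatever is already pending
        # (this plays the role of .aggregate in the real interface).
        if not stats:
            return
        if self._pending is None:
            self._pending = dict(stats)
            return
        for name, value in stats.items():
            self._pending[name] = self._pending.get(name, 0) + value

    def flush_if_scheduled(
            self, num_scheduled_tokens: int) -> Optional[dict[str, int]]:
        # Forward (and reset) only on steps that actually produce outputs.
        if num_scheduled_tokens == 0 or self._pending is None:
            return None
        out, self._pending = self._pending, None
        return out


buffer = StatsBuffer()
buffer.add({"num_successful_transfers": 1})
# Nothing is scheduled on this step, so the stats are held back.
held_back = buffer.flush_if_scheduled(num_scheduled_tokens=0)
buffer.add({"num_successful_transfers": 2})
# Tokens are scheduled now, so the accumulated stats go out once.
forwarded = buffer.flush_if_scheduled(num_scheduled_tokens=16)
after_reset = buffer.flush_if_scheduled(num_scheduled_tokens=16)
```

Here held_back and after_reset are None, while forwarded carries the sum of both updates, so no telemetry is dropped and none is reported twice.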

I ended up leveraging @njhill's PR #22995, which was facing the same issue.

I've also integrated suggestions from @sdavidbd review (thanks!):

  • MultiConnector manages KVTransferStats per connector type, transparently to the interface, which makes the interface cleaner all around
  • KVTransferStats moves to SchedulerStats, as it does not depend on ECOs and must only be sent once

Update 2

After discussing with @njhill, we've decided to send a generic data representation (a dict) from P1->P0. This works around the limitations of msgspec with OOT plugins, which would otherwise require dynamic typing.
At the same time, every connector can still define its own custom behavior (e.g. which stats to expose) by inheriting from KVConnectorStats and operating on the data dictionary. The collected data also remains fully under the Connector's control, as long as it is returned as a generic serializable dictionary.
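
A minimal sketch of this dict-based flow, under stated assumptions: SketchConnectorStats and SketchNixlStats are hypothetical names, and the standard json module stands in for the actual serialization layer, just to show that only a plain dict crosses the process boundary.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SketchConnectorStats:
    """Hypothetical base class: all state lives in a serializable dict."""

    data: dict[str, Any] = field(default_factory=dict)

    def aggregate(self, other: "SketchConnectorStats"):
        raise NotImplementedError

    def reduce(self) -> dict[str, Any]:
        raise NotImplementedError


class SketchNixlStats(SketchConnectorStats):
    """Hypothetical connector-specific subclass operating on `data`."""

    KEY = "num_successful_transfers"

    def record_transfer(self) -> None:
        self.data[self.KEY] = self.data.get(self.KEY, 0) + 1

    def aggregate(self, other: "SketchConnectorStats") -> "SketchNixlStats":
        total = self.data.get(self.KEY, 0) + other.data.get(self.KEY, 0)
        return SketchNixlStats(data={self.KEY: total})

    def reduce(self) -> dict[str, Any]:
        return {self.KEY: self.data.get(self.KEY, 0)}


# Worker side (P1): only the plain dict is serialized and sent.
worker_stats = SketchNixlStats()
worker_stats.record_transfer()
payload = json.dumps(worker_stats.data)

# Front-end side (P0): the typed object is rebuilt around the dict.
received = SketchNixlStats(data=json.loads(payload))
merged = received.aggregate(SketchNixlStats(data={"num_successful_transfers": 2}))
reduced = merged.reduce()
```

Because the wire format is just a dict, the receiving side does not need to know the concrete stats type at deserialization time; it only needs to pick the right class when rebuilding the object.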

Follow up PRs:

  • Support OOT KVTransferStats, see review comment
  • Track actual metrics from NIXL once available
  • Support PP

@mergify mergify bot added the v1 and tpu labels Aug 4, 2025
@mergify

mergify bot commented Aug 4, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @NickLucche.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 4, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces logging for KVTransferMetrics, which is a valuable addition for monitoring performance. However, my review identified a few critical issues in the implementation. There's a bug in the metrics aggregation logic that would lead to incorrect, doubled statistics. Additionally, the logging class for these metrics is functionally broken and would cause a runtime error. I've also pointed out a potential issue with returning a mutable global object, which could lead to subtle bugs. I've provided detailed comments and code suggestions to address these issues.

@NickLucche
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a framework for logging KV transfer metrics, which is a valuable addition for monitoring performance. The changes include new classes for metrics collection, plumbing the metrics from workers to the scheduler, and integrating with the logging system. The implementation for Nixl is also included, though it's in an early stage with placeholder values.

I've identified a couple of issues in the new metrics collection logic that should be addressed. One is a critical bug in defaultdict usage that would cause a runtime error, and the other is a potential issue with in-place modification of a list that could lead to subtle bugs. Addressing these will improve the robustness and correctness of the new feature.

@mergify mergify bot added the needs-rebase label Aug 5, 2025
@NickLucche NickLucche changed the title from "[P/D][Nixl] Log KVTransferMetrics" to "[P/D][Nixl] Introduce KVTransferMetrics and aggregation strategy" Aug 5, 2025
@NickLucche NickLucche marked this pull request as ready for review August 5, 2025 17:55
@mergify mergify bot removed the needs-rebase label Aug 5, 2025
@NickLucche
Collaborator Author

This is ready for review now @lk-chen @sdavidbd @njhill

@mergify mergify bot added the needs-rebase label Aug 11, 2025
Contributor

@sdavidbd sdavidbd left a comment

Thanks for the contribution @NickLucche — I’ve added a few inline comments on API boundaries and maintainability.

        """
        return None

    def get_kv_transfer_stats(self) -> Optional["KVTransferStats"]:
Member

@NickLucche I was thinking this could return the optional dict... KVTransferStats type / implementation would then only be used on the scheduler side

@facebook-github-bot

@lacora2017 has imported this pull request. If you are a Meta employee, you can view this in D82610371.

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Contributor

@lacora lacora left a comment

Thanks for this PR :) Looks great overall, only some nits.

@NickLucche
Collaborator Author

Thanks for reviewing!

Signed-off-by: NickLucche <nlucches@redhat.com>
Member

@njhill njhill left a comment

Thanks @NickLucche

    All sub-classes need to be serializable as stats are sent from worker to
    logger process.
    """
    data: dict[str, Any] = field(default_factory=dict)
Member

I'm still not sure how I feel about having a base class with a dict like this and think it would be cleaner to keep KVConnectorStats just on the scheduler side and just use dicts directly on the worker side. Perhaps we can experiment with that as a follow-on change though and compare what it looks like.

@NickLucche NickLucche enabled auto-merge (squash) September 18, 2025 12:59
@github-actions github-actions bot added the ready label Sep 18, 2025
@mergify mergify bot added the kv-connector label Sep 18, 2025
@NickLucche NickLucche requested a review from ApostaC as a code owner September 18, 2025 14:12
@NickLucche NickLucche merged commit a3d087a into vllm-project:main Sep 19, 2025
45 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
output.finished_sending, output.finished_recving = (
    kv_connector.get_finished(scheduler_output.finished_req_ids))

kv_connector.clear_connector_metadata()
Contributor

@NickLucche According to the KV Connector API documentation, clear_connector_metadata should be invoked after every model execution. Could you clarify why it was removed?

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
…llm-project#22188)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: charlifu <charlifu@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…llm-project#22188)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…llm-project#22188)

Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Labels

kv-connector, ready, v1

7 participants