@QierLi QierLi commented Oct 1, 2025

Purpose

#22188 introduced generic metrics reporting for KVConnector. Building on that, this PR extends metrics to include stats from the Scheduler, in addition to those from Workers.

The Scheduler-side KVConnector should export several important metrics, such as prefix matching for CPU KV (get_num_new_matched_tokens: duration, token counts) and KV cache events (take_events), while Workers export KV transfer metrics.
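
As a rough illustration of the scheduler-side metrics described above, the sketch below times a prefix-matching call and records its duration and matched-token count. `SchedulerSideStats`, `timed_prefix_lookup`, and `fake_lookup` are hypothetical names made up for this example and are not part of vLLM; the real `get_num_new_matched_tokens` signature is not reproduced here.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SchedulerSideStats:
    """Hypothetical stats container (an assumption, not vLLM's KVConnectorStats)."""

    data: dict = field(
        default_factory=lambda: {"lookup_seconds": [], "matched_tokens": []}
    )

    def record_lookup(self, seconds: float, matched: int) -> None:
        # Keep one sample per prefix-matching call.
        self.data["lookup_seconds"].append(seconds)
        self.data["matched_tokens"].append(matched)


def timed_prefix_lookup(lookup, stats, request_tokens):
    """Run a prefix-matching callable and record its duration and hit count."""
    start = time.perf_counter()
    matched = lookup(request_tokens)
    stats.record_lookup(time.perf_counter() - start, matched)
    return matched


def fake_lookup(tokens):
    # Stand-in for a real prefix lookup: pretend half the prompt is cached.
    return len(tokens) // 2


stats = SchedulerSideStats()
timed_prefix_lookup(fake_lookup, stats, list(range(16)))
```

Keeping raw samples rather than a running average is one possible design; it leaves the choice of histogram or percentile to whatever consumes the stats.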

Test Plan

Existing tests.
The changes are a no-op if no KVConnectorStats are reported from the Scheduler.

Test Result

No new breakages.



QierLi commented Oct 1, 2025

@NickLucche a small extension on top of your #24786 :)

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request extends metrics reporting for KVConnector to include stats from the Scheduler. The change correctly identifies the place to aggregate scheduler-side stats with worker-side stats. However, there is a potential issue in how the aggregation is performed. The aggregate method on KVConnectorStats might return a new stats object rather than modifying it in-place. The current implementation discards the return value, which could lead to scheduler-side stats not being reported. I've suggested a fix to address this.

if kv_connector_stats and self.connector:
    stats = self.connector.get_kv_connector_stats()
    if stats:
        kv_connector_stats.aggregate(stats)

high

The aggregate method of KVConnectorStats might return a new instance rather than modifying the object in-place. Other parts of the codebase, like KVConnectorLogging, reassign the result of aggregate. To ensure correctness and prevent potential loss of the aggregated stats, you should assign the result of aggregate back to kv_connector_stats.

Suggested change
- kv_connector_stats.aggregate(stats)
+ kv_connector_stats = kv_connector_stats.aggregate(stats)
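
To make the concern concrete, here is a minimal sketch with two hypothetical stats classes (`InPlaceStats` and `CopyingStats`, neither taken from vLLM): one mutates itself in `aggregate`, the other returns a new object. Reassigning the result works under both conventions, whereas discarding it silently drops the merged data in the copying case.

```python
from dataclasses import dataclass, field


@dataclass
class InPlaceStats:
    """Hypothetical stats class whose aggregate() mutates self."""

    data: dict = field(default_factory=dict)

    def aggregate(self, other):
        for key, value in other.data.items():
            self.data[key] = self.data.get(key, 0) + value
        return self


@dataclass
class CopyingStats:
    """Hypothetical stats class whose aggregate() returns a new object."""

    data: dict = field(default_factory=dict)

    def aggregate(self, other):
        merged = dict(self.data)
        for key, value in other.data.items():
            merged[key] = merged.get(key, 0) + value
        return CopyingStats(merged)


def merge(worker_stats, scheduler_stats):
    # Reassigning keeps the merged result under either convention.
    worker_stats = worker_stats.aggregate(scheduler_stats)
    return worker_stats


# Discarding the return value loses the scheduler-side entry here.
lost = CopyingStats({"transfers": 2})
lost.aggregate(CopyingStats({"prefix_hits": 5}))
```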

Collaborator

I think gemini's concern is valid given the interface is pretty flexible here

@NickLucche NickLucche left a comment

Hey @QierLi thanks a lot for adding this feature!

Although the change is very small due to the flexibility of the existing infra, I think it would be great if you could add a simple unit test showcasing scheduler<>worker metrics merging!

mergify bot commented Oct 6, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @QierLi.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: Qier Li <kevin44036@gmail.com>

QierLi commented Oct 13, 2025

> Hey @QierLi thanks a lot for adding this feature!
>
> Although the change is very small due to the flexibility of the existing infra, I think it would be great if you could add a simple unit test showcasing scheduler<>worker metrics merging!

Added a test to guard the scheduler stats aggregation via the update_from_output path. :)
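
For readers without the diff at hand, a test of this kind might look roughly like the sketch below. It does not reproduce the test added in this PR: `FakeStats`, `FakeConnector`, and `merge_stats` are hypothetical stand-ins, and `merge_stats` only mirrors the merge step quoted in the review rather than the real `update_from_output`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FakeStats:
    """Hypothetical stand-in for a connector stats object."""

    data: dict = field(default_factory=dict)

    def aggregate(self, other: "FakeStats") -> "FakeStats":
        merged = dict(self.data)
        for key, value in other.data.items():
            merged[key] = merged.get(key, 0) + value
        return FakeStats(merged)


class FakeConnector:
    """Hypothetical scheduler-side connector that may or may not report stats."""

    def __init__(self, stats: Optional[FakeStats]):
        self._stats = stats

    def get_kv_connector_stats(self) -> Optional[FakeStats]:
        return self._stats


def merge_stats(worker_stats, connector):
    # Mirrors the merge step quoted in the review (an assumption, simplified).
    if worker_stats and connector:
        stats = connector.get_kv_connector_stats()
        if stats:
            worker_stats = worker_stats.aggregate(stats)
    return worker_stats


def test_scheduler_stats_are_merged():
    worker = FakeStats({"num_transfers": 3})
    connector = FakeConnector(FakeStats({"prefix_hits": 7}))
    merged = merge_stats(worker, connector)
    assert merged.data == {"num_transfers": 3, "prefix_hits": 7}


def test_noop_without_scheduler_stats():
    worker = FakeStats({"num_transfers": 3})
    merged = merge_stats(worker, FakeConnector(None))
    assert merged.data == {"num_transfers": 3}
```

The second test covers the no-op case from the PR description, where the Scheduler reports no stats.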

@QierLi QierLi requested a review from NickLucche October 13, 2025 17:54

@NickLucche NickLucche left a comment

LGTM

@NickLucche NickLucche enabled auto-merge (squash) October 14, 2025 12:50
@github-actions github-actions bot added the ready label Oct 14, 2025
@NickLucche NickLucche merged commit 720394d into vllm-project:main Oct 14, 2025
46 checks passed
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
…m-project#26046)

Signed-off-by: Qier Li <kevin44036@gmail.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
@xuechendi
Contributor

I found that this PR crashes the following test:
bash vllm/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh

with this error:

  File "/workspace/vllm/vllm/v1/engine/core.py", line 328, in step
    engine_core_outputs = self.scheduler.update_from_output(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/v1/core/sched/scheduler.py", line 926, in update_from_output
    stats = self.connector.get_kv_connector_stats()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 244, in get_kv_connector_stats
    assert self.connector_worker is not None

The scheduler calls get_kv_connector_stats(), but connector_worker is None for the SCHEDULER role.

Should we enable something else to initialize connector_worker for the SCHEDULER role?

QierLi commented Oct 14, 2025

> I found this PR will crash:
> bash vllm/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
>
> with error
>
>   File "/workspace/vllm/vllm/v1/engine/core.py", line 328, in step
>     engine_core_outputs = self.scheduler.update_from_output(
>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/workspace/vllm/vllm/v1/core/sched/scheduler.py", line 926, in update_from_output
>     stats = self.connector.get_kv_connector_stats()
>             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/workspace/vllm/vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py", line 244, in get_kv_connector_stats
>     assert self.connector_worker is not None
>
> It calls get_kv_connector_stats() in scheduler, connector_worker for SCHEDULER role is None
>
> Would like to know should we enable something else to init connector_worker for SCHEDULER ?

Thanks for catching this. I think returning None from get_kv_connector_stats() when the role is SCHEDULER should fix it. Do you want to open a PR to patch it? I can also file one later today.
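
The proposed fix could be sketched as below, under the assumption that a scheduler-role connector simply has no worker object. `Role`, `SketchWorker`, and `SketchConnector` are hypothetical names for illustration and do not reproduce the NIXL connector; only the `connector_worker` attribute and the `get_kv_connector_stats()` method name come from the traceback.

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    """Hypothetical role enum for this sketch."""

    SCHEDULER = 0
    WORKER = 1


class SketchWorker:
    """Hypothetical worker-side helper that owns the transfer stats."""

    def get_stats(self) -> dict:
        return {"num_transfers": 3}


class SketchConnector:
    def __init__(self, role: Role):
        self.role = role
        # Only the worker role gets a worker object, as in the report above.
        self.connector_worker = SketchWorker() if role is Role.WORKER else None

    def get_kv_connector_stats(self) -> Optional[dict]:
        # Return None instead of asserting, so the scheduler-side call is safe.
        if self.connector_worker is None:
            return None
        return self.connector_worker.get_stats()
```

With this guard, the `if stats:` check in the merge step skips aggregation for connectors that report nothing on the scheduler side.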

@xuechendi
Contributor

@QierLi, thanks for the quick reply. I opened a quick fix.
When will we initialize connector_worker for the Scheduler?

njhill commented Oct 15, 2025

I'm wondering whether this could introduce some confusion. Currently the connector interface methods are either scheduler-side or worker-side. We're now saying one of them will be called on both sides.

I'm thinking we should even separate the KVConnector interface (abstract class) into two separate ones for the worker side and scheduler side.

And maybe there should be a separate method for querying the scheduler-side stats.

cc @NickLucche @ApostaC
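
One way to picture the split suggested above is a pair of abstract classes, each with its own stats method, so no method is called on both sides. This is only a sketch of the idea: the class names, `get_scheduler_stats`, and `transfer_kv` are hypothetical, and the `get_num_new_matched_tokens` signature is simplified rather than taken from vLLM.

```python
from abc import ABC, abstractmethod
from typing import Optional


class SchedulerSideConnector(ABC):
    """Hypothetical scheduler-side interface."""

    @abstractmethod
    def get_num_new_matched_tokens(self, request_tokens: list) -> int:
        """Simplified signature, for illustration only."""

    def get_scheduler_stats(self) -> Optional[dict]:
        # Hypothetical scheduler-only stats hook; None means nothing to report.
        return None


class WorkerSideConnector(ABC):
    """Hypothetical worker-side interface."""

    @abstractmethod
    def transfer_kv(self) -> None:
        """Hypothetical placeholder for worker-side KV transfer work."""

    def get_kv_connector_stats(self) -> Optional[dict]:
        # Worker-only stats hook; None means nothing to report.
        return None


class DemoSchedulerConnector(SchedulerSideConnector):
    def get_num_new_matched_tokens(self, request_tokens: list) -> int:
        return len(request_tokens) // 2

    def get_scheduler_stats(self) -> Optional[dict]:
        return {"prefix_lookups": 1}


class DemoWorkerConnector(WorkerSideConnector):
    def transfer_kv(self) -> None:
        pass
```

A split like this would make a scheduler-side call to a worker-only method a type error rather than a runtime assertion failure.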


Labels

kv-connector, ready, v1
