Conversation

@ptovam ptovam commented Oct 5, 2025

Add prefix-cache metrics for KV connectors

Introduces connector-agnostic metrics to track KV connector cache efficiency,
enabling clearer insights into cache effectiveness and overall system performance.

mergify bot commented Oct 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ptovam.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds metrics for KV connector prefix cache hit rate. The changes span across the KV connector base, scheduler, and metrics loggers to collect, propagate, and log these new stats.

My review has two main points:

  1. A critical issue in scheduler.py where the number of queries for the connector cache is miscalculated, leading to incorrect hit rate metrics.
  2. A suggestion to address the FIXME in base.py for conditionally initializing PrefixCacheStats to improve performance when stats are disabled.

Overall, the changes are well-structured to introduce the new metrics. Addressing these points will ensure the correctness and efficiency of the implementation.

Comment on lines 498 to 500:

```python
self.connector.update_prefix_cache_stats(
    request.num_tokens, num_external_computed_tokens)
```
critical

The number of queries to the connector cache is being miscalculated. request.num_tokens represents the total prompt length, but the query to the connector is only for the tokens not found in the local prefix cache.

Using the total prompt length overestimates the number of queries and will lead to an incorrectly low hit rate metric. The number of queries should be request.num_tokens - num_new_local_computed_tokens. The num_new_local_computed_tokens variable is available in this scope.

Suggested change:

```diff
-        self.connector.update_prefix_cache_stats(
-            request.num_tokens, num_external_computed_tokens)
+        self.connector.update_prefix_cache_stats(
+            request.num_tokens - num_new_local_computed_tokens,
+            num_external_computed_tokens)
```
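To make the reviewer's point concrete, here is a toy sketch with invented token counts (the variable names mirror those used in the scheduler; the numbers are made up for illustration):

```python
# Illustration of the two candidate denominators for the connector hit rate.
# Tokens already served by the local prefix cache were never queried against
# the connector, so counting them in the denominator dilutes the metric.

def hit_rate(queries: int, hits: int) -> float:
    """Fraction of queried tokens served by the cache."""
    return hits / queries if queries else 0.0

num_tokens = 1000                    # total prompt length
num_new_local_computed_tokens = 600  # tokens found in vLLM's local prefix cache
num_external_computed_tokens = 300   # tokens served by the KV connector

# Denominator as in the PR: the whole prompt.
rate_total = hit_rate(num_tokens, num_external_computed_tokens)

# Denominator as suggested in the review: only tokens actually queried
# against the connector (i.e. not already covered locally).
rate_queried = hit_rate(num_tokens - num_new_local_computed_tokens,
                        num_external_computed_tokens)

print(rate_total)    # 0.3  -> connector's contribution relative to the prompt
print(rate_queried)  # 0.75 -> connector's efficiency on what it was asked for
```

Both numbers are meaningful; they just answer different questions, which is the crux of the disagreement below.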

@ptovam (Contributor, Author) replied:
I actually prefer the current counting method — it reflects the connector’s contribution relative to the entire prompt, which gives a clearer view of overall cache effectiveness and system-level savings.
That said, I’m open to adjustments — maybe we can combine both approaches to expose connector-level stats as well.

@ptovam ptovam force-pushed the connector-hit-rate branch from 00f903d to ae2b26e Compare October 5, 2025 09:40
@mergify mergify bot removed the needs-rebase label Oct 5, 2025
@NickLucche NickLucche left a comment

I think this specific PrefixCacheStats is orthogonal to the general KVConnector logging we tried to design in #22188.

One example of customizing the logs is in #25388.
And one more example of getting this to work on the scheduler side is #26046 by @QierLi.

Have you considered the options above for implementing your use case?

@kfirwolfson
Good initiative, very interesting. Following.

mergify bot commented Oct 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ptovam.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 10, 2025
ptovam commented Oct 15, 2025

> I think this specific PrefixCacheStats is orthogonal to the general KVConnector logging we tried to design here #22188.
>
> One example of customizing the logs here #25388. And one more example of getting this to work with Scheduler-side by #26046 by @QierLi.
>
> Have you considered the options above for implementing your use-case?

Thanks @NickLucche!
I followed the KVConnector logging work, and it's really useful for me in other use cases, but for this specific one I believe it's less appropriate for a few reasons:

  • In my view, KVConnectorStats is great for connector-internal telemetry, but the prefix-cache hit rate is a system-level performance metric, not a connector-level one. It's closely related to the other cache metrics vLLM already exposes (like the GPU prefix hit rate), so it makes sense to report it alongside them.
  • If we used the KVConnectorStats, each connector would have to implement its own hit-rate tracking.
    Since this metric is fundamental and shared across connectors, I think it’s better provided centrally.
  • The current KVConnector logging only prints to stdout, not to Prometheus.
    (I did look at #26811, but it doesn’t seem generic yet for OOT connectors.)
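The "provided centrally" idea can be sketched as a small stats object that the scheduler owns and records into, so no connector has to implement its own tracking. This is a minimal sketch under assumed names, not the PR's actual PrefixCacheStats:

```python
# Hypothetical connector-agnostic stats holder. The scheduler records into
# it; connectors never see it. Names here are illustrative only.
from dataclasses import dataclass


@dataclass
class ExternalPrefixCacheStats:
    requests: int = 0  # requests that queried a connector cache
    queries: int = 0   # tokens queried against the connector cache
    hits: int = 0      # tokens the connector cache served

    def record(self, queries: int, hits: int) -> None:
        self.requests += 1
        self.queries += queries
        self.hits += hits

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0


stats = ExternalPrefixCacheStats()
stats.record(queries=400, hits=300)
stats.record(queries=200, hits=50)
print(f"{stats.hit_rate:.2f}")  # 0.58
```

Because every connector is measured by the same object, the metric stays comparable across connector implementations.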

@ptovam ptovam force-pushed the connector-hit-rate branch from 41b04ab to 5144e4f Compare October 15, 2025 18:35
@mergify mergify bot removed the needs-rebase label Oct 15, 2025
ApostaC commented Oct 15, 2025

Very interesting and useful feature! Looking forward to it 👀

@ApostaC ApostaC left a comment


Aside from @NickLucche's comment, just some other comments regarding the current implementation:
Since the scheduler has enough information to calculate the number of requests, request tokens, and hit tokens, why do we need to add `update_prefix_cache_stats` and `make_prefix_cache_stats` to `KVConnectorBase_V1` at all?

We can just add a helper function in the scheduler and make it fully "connector agnostic".
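A rough sketch of that suggestion, with hypothetical method and attribute names (this is not vLLM's actual Scheduler, just the shape of the idea):

```python
# Instead of routing stats through KVConnectorBase_V1, the scheduler
# computes them itself from values it already has at scheduling time.
class Scheduler:  # illustrative fragment only
    def __init__(self) -> None:
        self.external_queries = 0
        self.external_hits = 0

    def _record_external_cache_stats(self,
                                     num_prompt_tokens: int,
                                     num_local_hit_tokens: int,
                                     num_external_hit_tokens: int) -> None:
        # Tokens not covered by the local prefix cache are the ones
        # that were queried against the connector.
        self.external_queries += num_prompt_tokens - num_local_hit_tokens
        self.external_hits += num_external_hit_tokens


sched = Scheduler()
sched._record_external_cache_stats(1000, 600, 300)
print(sched.external_queries, sched.external_hits)  # 400 300
```

Since all inputs are already in scope in the scheduler, no connector needs to be aware of the accounting at all.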

NickLucche commented Oct 16, 2025

Hey @ptovam, thanks for elaborating! I think I initially misunderstood your intent, as I was confused by the "agnostic" and the proposed changes:

> the prefix-cache hit rate is a system-level performance metric, not connector

I see your point, but looking at the impl this looks very connector-oriented, given all the values come from PrefixCacheStats, which is entirely the responsibility of the KVConnector. I am not against that, but if we want something connector-agnostic I agree with @ApostaC's comment.
At that point we could promote it to "first-class" metrics without having to bundle it with connectors.

> (I did look at #26811, but it doesn’t seem generic yet for OOT connectors.)

Good point, I will look to expand that. Regardless, my initial suggestion was based on the (wrong) assumption that we wanted to do something connector-centric here.
I think we can safely ignore my first comment in light of the current discussion.

@NickLucche

cc @markmc for the prefix cache hit metric

@ptovam ptovam force-pushed the connector-hit-rate branch from 5144e4f to 8e8ea29 Compare October 16, 2025 14:40
markmc commented Oct 17, 2025

>> But ... maybe renaming them from connector_prefix_cache to remote_prefix_cache would be a good compromise?
>
> @markmc indeed "remote" cache is one of the connector use-cases, relating to remote key-value stores or file systems, but this is not the only use-case: the connector is used for all caches external to vLLM's APC (residing in GPU VRAM). For example, the OffloadingConnector manages cache in CPU DRAM, and using the term "remote" for it might be misleading.

Yep, fair - "remote" wouldn't match the offloading use case well.

> I suggest keeping connector_prefix_cache, which is self-explanatory.

Uh, my feedback is because it wasn't self-explanatory to me!

> An alternative can possibly be external_prefix_cache, hinting at external to vLLM's APC, though it doesn't have to be "external" to vLLM (if using a connector in vLLM codebase) so might also be somewhat confusing.

"External" is the language used in the scheduler, so that makes sense.

The offloading example does make me wonder, though - if we add it the way it is proposed, I suspect it won't be long before users of e.g. offload with P/D want metrics from the individual connectors ... which brings me back to the idea of a single prefix cache metric with labels like source=local, source=external, source=cpu-offload, source=nixl, etc.

We can maintain the existing label-free metric for backwards compatibility and add a new one with labels.
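The labeled-metric idea can be sketched in plain Python (a stand-in for a labeled Prometheus counter family; the metric name `vllm:prefix_cache_hit_rate` and the exact label values here are assumptions taken from the suggestion, not names vLLM actually ships):

```python
# One metric family keyed by a `source` label, instead of a separate
# metric name per cache tier.
from collections import defaultdict

prefix_cache_queries = defaultdict(int)
prefix_cache_hits = defaultdict(int)


def record(source: str, queries: int, hits: int) -> None:
    prefix_cache_queries[source] += queries
    prefix_cache_hits[source] += hits


record("local", 1000, 600)       # vLLM's GPU-resident APC
record("cpu-offload", 400, 300)  # e.g. OffloadingConnector
record("nixl", 200, 120)         # e.g. a P/D transfer connector

# Prometheus-style exposition of per-source hit rates.
for source in prefix_cache_queries:
    rate = prefix_cache_hits[source] / prefix_cache_queries[source]
    print(f'vllm:prefix_cache_hit_rate{{source="{source}"}} {rate:.2f}')
```

With labels, new cache tiers become new label values rather than new metric names, and dashboards can aggregate or break down by `source` as needed.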

kfirwolfson commented Oct 18, 2025

>> I suggest keeping connector_prefix_cache, which is self-explanatory.
>
> Uh, my feedback is because it wasn't self-explanatory to me!

Point taken 😃. I meant that it literally relates to all cache managed by the connector, so it cannot be misinterpreted. I agree it's not actually explanatory, as it does not explain what that cache is. Like you said, "external" is the name for connector-managed cache in the scheduler, so that sounds good to me.

ptovam commented Oct 19, 2025

Thanks @markmc and @kfirwolfson!

> A test would be good also

Added.

> The offloading example does make me wonder though - if we add it the way it is proposed, I suspect it won't be long before users of e.g. offload with P/D want metrics from the individual connectors ... which brings me back to the idea of a single prefix cache metric with labels like source=local, source=external, source=cpu-offload, source=nixl, etc.

The distinction between connector sources isn’t always straightforward - some connectors rely on multiple sources, and MultiConnector would require separate counters per connector.
This would defeat the goal of keeping the implementation fully connector-agnostic and would move the metric tracking logic back into the connector layer.

To keep things simple and consistent, I’m keeping the existing vllm:prefix_cache_queries metric unchanged, and adding vllm:external_prefix_cache_queries for connector-managed caches.
Happy to hear your thoughts on this.
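The chosen layout can be sketched as two parallel, label-free counter pairs: the existing local-cache counters stay as-is, and an aggregate `external_*` pair covers all connector-managed caches. The queries names come from the comment above; the matching `_hits` names are an assumption for illustration:

```python
# Hypothetical sketch of the final metric layout: unchanged local counters
# plus an aggregate pair for connector-managed ("external") caches.
counters = {
    "vllm:prefix_cache_queries": 0,           # existing, unchanged
    "vllm:prefix_cache_hits": 0,              # existing, unchanged
    "vllm:external_prefix_cache_queries": 0,  # new, all connectors combined
    "vllm:external_prefix_cache_hits": 0,     # new (name assumed here)
}


def record_request(local_queries: int, local_hits: int,
                   external_queries: int, external_hits: int) -> None:
    counters["vllm:prefix_cache_queries"] += local_queries
    counters["vllm:prefix_cache_hits"] += local_hits
    counters["vllm:external_prefix_cache_queries"] += external_queries
    counters["vllm:external_prefix_cache_hits"] += external_hits


record_request(local_queries=1000, local_hits=600,
               external_queries=400, external_hits=300)
```

Aggregating across connectors sidesteps the MultiConnector per-source bookkeeping problem described above, at the cost of not distinguishing individual connectors.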

ptovam commented Oct 19, 2025

Additionally, I noticed the recent PR "Don't count preempted tokens in prefix cache hit rate" and applied the same logic to the connector metrics as well.

ptovam commented Oct 20, 2025

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces metrics for KV connector prefix cache efficiency, which is a valuable addition for monitoring performance. The implementation is clean, refactoring the stat collection logic into a record method on PrefixCacheStats and adding a new test case. However, I've identified an issue where statistics for preempted requests are not being included in the final metrics for either Prometheus or the standard logger. This could lead to an incomplete picture of cache performance. I've provided specific comments and suggestions to address this and ensure the metrics are comprehensive.

@NickLucche NickLucche left a comment

Nice work @ptovam , thanks for the great input everyone!

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 22, 2025
@NickLucche NickLucche merged commit 88afa11 into vllm-project:main Oct 23, 2025
46 checks passed
@ptovam ptovam deleted the connector-hit-rate branch October 23, 2025 11:20

Labels

kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1
