[Metrics] [KVConnector] Add connector prefix cache hit rate stats #26245
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds metrics for KV connector prefix cache hit rate. The changes span across the KV connector base, scheduler, and metrics loggers to collect, propagate, and log these new stats.
My review has two main points:
- A critical issue in `scheduler.py` where the number of queries for the connector cache is miscalculated, leading to incorrect hit rate metrics.
- A suggestion to address the `FIXME` in `base.py` for conditionally initializing `PrefixCacheStats` to improve performance when stats are disabled.
Overall, the changes are well-structured to introduce the new metrics. Addressing these points will ensure the correctness and efficiency of the implementation.
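To illustrate the second point, here is a minimal sketch of the conditional-initialization idea. The `PrefixCacheStats` here is a simplified stand-in for the vLLM class of the same name, and `ConnectorStatsHolder` is a hypothetical container, not the actual code in `base.py`:

```python
from dataclasses import dataclass

# Simplified stand-in for vLLM's PrefixCacheStats (illustrative only).
@dataclass
class PrefixCacheStats:
    requests: int = 0
    queries: int = 0
    hits: int = 0

class ConnectorStatsHolder:
    """Hypothetical holder that skips stats bookkeeping when disabled."""

    def __init__(self, log_stats: bool) -> None:
        # Allocate the stats object only when logging is enabled, so the
        # hot path pays just a cheap "is None" check per request.
        self.prefix_cache_stats = PrefixCacheStats() if log_stats else None

    def record(self, num_queries: int, num_hits: int) -> None:
        if self.prefix_cache_stats is None:
            return  # stats disabled; no-op
        stats = self.prefix_cache_stats
        stats.requests += 1
        stats.queries += num_queries
        stats.hits += num_hits
```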
vllm/v1/core/sched/scheduler.py
Outdated
```python
self.connector.update_prefix_cache_stats(
    request.num_tokens, num_external_computed_tokens)
```
The number of queries to the connector cache is being miscalculated. request.num_tokens represents the total prompt length, but the query to the connector is only for the tokens not found in the local prefix cache.
Using the total prompt length overestimates the number of queries and will lead to an incorrectly low hit rate metric. The number of queries should be request.num_tokens - num_new_local_computed_tokens. The num_new_local_computed_tokens variable is available in this scope.
```diff
-self.connector.update_prefix_cache_stats(
-    request.num_tokens, num_external_computed_tokens)
+self.connector.update_prefix_cache_stats(
+    request.num_tokens - num_new_local_computed_tokens,
+    num_external_computed_tokens)
```
I actually prefer the current counting method — it reflects the connector’s contribution relative to the entire prompt, which gives a clearer view of overall cache effectiveness and system-level savings.
That said, I’m open to adjustments — maybe we can combine both approaches to expose connector-level stats as well.
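The difference between the two counting bases can be sketched as follows. This is an illustrative calculation only; the function and argument names are hypothetical, not the actual vLLM scheduler variables:

```python
def connector_hit_rates(num_tokens: int,
                        num_local_hits: int,
                        num_external_hits: int) -> tuple[float, float]:
    """Return (whole-prompt hit rate, connector-query hit rate)."""
    # Current PR: connector hits relative to the entire prompt,
    # reflecting system-level savings.
    whole_prompt = num_external_hits / num_tokens
    # Review suggestion: hits relative to only the tokens actually
    # queried from the connector (those the local prefix cache missed).
    queried = num_tokens - num_local_hits
    connector_only = num_external_hits / queried if queried else 0.0
    return whole_prompt, connector_only

# Example: 1000-token prompt, 400 local hits, 300 connector hits.
rates = connector_hit_rates(1000, 400, 300)
# whole-prompt rate: 0.3; connector-query rate: 0.5
```

Exposing both, as suggested, would let dashboards show system-level savings alongside connector-level effectiveness.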
Force-pushed from 00f903d to ae2b26e
NickLucche left a comment
I think this specific PrefixCacheStats is orthogonal to the general KVConnector logging we tried to design here #22188.
One example of customizing the logs here #25388.
And one more example of getting this to work on the Scheduler side in #26046 by @QierLi.
Have you considered the options above for implementing your use-case?
Good initiative, very interesting. Following.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks @NickLucche
Force-pushed from 41b04ab to 5144e4f
Very interesting and useful feature! Looking forward to it 👀
ApostaC left a comment
Aside from @NickLucche's comment, just some other comments regarding the current implementation:
Since the scheduler has enough information to calculate the number of requests, request tokens, and hit tokens, why do we need to add update_prefix_cache_stats and make_prefix_cache_stats to the KVConnectorBase_V1 at all?
We can just add a helper function in the scheduler and make it fully "connector agnostic".
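A minimal sketch of what such a scheduler-side helper could look like. All names here (`SchedulerStatsMixin`, `ConnectorPrefixCacheStats`, `_update_connector_prefix_cache_stats`) are hypothetical, chosen only to illustrate the "connector agnostic" idea:

```python
from dataclasses import dataclass

# Hypothetical stats container; vLLM's actual class is PrefixCacheStats.
@dataclass
class ConnectorPrefixCacheStats:
    requests: int = 0
    queries: int = 0
    hits: int = 0

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

class SchedulerStatsMixin:
    """Illustrative scheduler-side helper: tallies connector cache stats
    without adding any methods to the connector base class."""

    def __init__(self) -> None:
        self.connector_prefix_cache_stats = ConnectorPrefixCacheStats()

    def _update_connector_prefix_cache_stats(
            self, num_queried_tokens: int,
            num_external_computed_tokens: int) -> None:
        # The scheduler already knows both values, so no connector
        # cooperation is needed.
        stats = self.connector_prefix_cache_stats
        stats.requests += 1
        stats.queries += num_queried_tokens
        stats.hits += num_external_computed_tokens
```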
Hey @ptovam, thanks for elaborating! I think I initially misunderstood your intent, as I was confused by the "agnostic" and the proposed changes:

I see your point, but looking at the impl this looks very connector-oriented given all the values come from

Good point, I will look to expand that. Regardless, my initial suggestion was based on the (wrong) assumption that we wanted to do something connector-centric here.
cc @markmc for the prefix cache hit metric
Force-pushed from 5144e4f to 8e8ea29
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Yep, fair - "remote" wouldn't match the offloading use case well
Uh, my feedback is because it wasn't self-explanatory to me!
"External" is the language used in the scheduler, so that makes sense.

The offloading example does make me wonder though - if we add it the way it is proposed, I suspect it won't be long before users of e.g. offload with P/D want metrics from the individual connectors ... which brings me back to the idea of a single prefix cache metric with labels like

We can maintain the existing label-free metric for backwards compatibility and add a new one with labels.
Point taken 😃. I meant it just literally relates to all cache managed by the connector, so cannot be misinterpreted. I agree it's not actually explanatory, as it does not explain what that cache is. Like you said,
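The labeled-metric idea above can be sketched with a pure-stdlib stand-in. In a real deployment this would be a Prometheus counter with a `source` label; the class and metric names below are hypothetical:

```python
from collections import defaultdict

class LabeledCounter:
    """Stdlib sketch of a Prometheus-style counter with one label."""

    def __init__(self) -> None:
        self._values: dict[str, int] = defaultdict(int)

    def inc(self, source: str, amount: int) -> None:
        self._values[source] += amount

    def value(self, source: str) -> int:
        return self._values[source]

# One metric, distinguished by where the hit came from, instead of a
# separate metric per cache tier.
prefix_cache_hit_tokens = LabeledCounter()
prefix_cache_hit_tokens.inc("local", 400)      # GPU prefix cache
prefix_cache_hit_tokens.inc("connector", 300)  # KV connector
```

With labels, per-connector sources (e.g. offload vs. P/D) could later be added without another backwards-incompatible metric rename.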
Signed-off-by: tovam <tovam@pliops.com>
… to external_prefix_cache_*. Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Thanks @markmc and @kfirwolfson!

Added.

The distinction between connector sources isn't always straightforward - some connectors rely on multiple sources, and

To keep things simple and consistent, I'm keeping the existing
Additionally, I noticed the recent PR "Don't count preempted tokens in prefix cache hit rate" and applied the same logic to the connector metrics as well.

/gemini review
Code Review
This pull request introduces metrics for KV connector prefix cache efficiency, which is a valuable addition for monitoring performance. The implementation is clean, refactoring the stat collection logic into a record method on PrefixCacheStats and adding a new test case. However, I've identified an issue where statistics for preempted requests are not being included in the final metrics for either Prometheus or the standard logger. This could lead to an incomplete picture of cache performance. I've provided specific comments and suggestions to address this and ensure the metrics are comprehensive.
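The `record` refactor described above might look roughly like this. This is a hedged sketch: the field and method names mirror the review's description of `PrefixCacheStats`, but the actual vLLM implementation may differ:

```python
from dataclasses import dataclass

@dataclass
class PrefixCacheStats:
    """Illustrative aggregate of prefix-cache activity over an interval."""
    requests: int = 0
    queries: int = 0
    hits: int = 0

    def record(self, num_queries: int, num_hits: int) -> None:
        # Fold one request's counts into the running totals, so callers
        # (scheduler, loggers) share a single piece of collection logic.
        self.requests += 1
        self.queries += num_queries
        self.hits += num_hits

    def reset(self) -> None:
        # Loggers typically reset after each reporting interval.
        self.requests = self.queries = self.hits = 0

stats = PrefixCacheStats()
stats.record(num_queries=600, num_hits=300)
```

The preemption concern in the review amounts to making sure `record` is also invoked (or its counts preserved) for preempted requests before the logger resets the interval.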
NickLucche left a comment
Nice work @ptovam , thanks for the great input everyone!
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…o step_forward * 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits)
- [Model] Add MoE support for NemotronH (vllm-project#25863)
- [Metrics] [KVConnector] Add connector prefix cache hit rate stats (vllm-project#26245)
- [CI] Reorganize entrypoints tests (vllm-project#27403)
- add SLA information into comparison graph for vLLM Benchmark Suite (vllm-project#25525)
- [CI/Build] Fix AMD CI: test_cpu_gpu.py (vllm-project#27388)
- [Bugfix] Fix args settings for guided decoding args (vllm-project#27375)
- [CI/Build] Fix Prithvi plugin test (vllm-project#27393)
- [Chore] Remove duplicate `has_` functions in vllm.utils (vllm-project#27372)
- [Model] Add num_cached_tokens for PoolingRequestOutput (vllm-project#27378)
- [V1][spec decode] return logprobs for spec decoding (vllm-project#26060)
- [CORE] Support Prefix Caching with Prompt Embeds (vllm-project#27219)
- [Bugfix][Core] running queue index leakage exception (vllm-project#26754)
- [Bugfix] Fix incorrect kv cache metrics in grafana.json (vllm-project#27133)
- [Bugfix] Fix SLA tuner initialization (vllm-project#27355)
- [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (vllm-project#27361)
- [MLA] Bump FlashMLA (vllm-project#27354)
- [Chore] Separate out system utilities from vllm.utils (vllm-project#27201)
- [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)
- [Feature] publisher default set zmq in kv_event config (vllm-project#26915)
- [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (vllm-project#27211)
- ...
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
Add prefix-cache metrics for KV connectors
Introduces connector-agnostic metrics to track KV connector cache efficiency,
enabling clearer insights into cache effectiveness and overall system performance.