[Metrics] [KVConnector] Add connector prefix cache hit rate stats #26245
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request adds metrics for KV connector prefix cache hit rate. The changes span across the KV connector base, scheduler, and metrics loggers to collect, propagate, and log these new stats.
My review has two main points:
- A critical issue in `scheduler.py` where the number of queries for the connector cache is miscalculated, leading to incorrect hit rate metrics.
- A suggestion to address the `FIXME` in `base.py` for conditionally initializing `PrefixCacheStats` to improve performance when stats are disabled.
Overall, the changes are well-structured to introduce the new metrics. Addressing these points will ensure the correctness and efficiency of the implementation.
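To illustrate the second point, here is a minimal sketch of the conditional-initialization idea. The `PrefixCacheStats` here is a simplified stand-in for the vLLM class of the same name, and `ConnectorStatsHolder` is a hypothetical container, not the actual code in `base.py`:

```python
from dataclasses import dataclass

# Simplified stand-in for vLLM's PrefixCacheStats (illustrative only).
@dataclass
class PrefixCacheStats:
    requests: int = 0
    queries: int = 0
    hits: int = 0

class ConnectorStatsHolder:
    """Hypothetical holder that skips stats bookkeeping when disabled."""

    def __init__(self, log_stats: bool) -> None:
        # Allocate the stats object only when logging is enabled, so the
        # hot path pays just a cheap "is None" check per request.
        self.prefix_cache_stats = PrefixCacheStats() if log_stats else None

    def record(self, num_queries: int, num_hits: int) -> None:
        if self.prefix_cache_stats is None:
            return  # stats disabled; no-op
        stats = self.prefix_cache_stats
        stats.requests += 1
        stats.queries += num_queries
        stats.hits += num_hits
```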
vllm/v1/core/sched/scheduler.py
Outdated
```python
self.connector.update_prefix_cache_stats(
    request.num_tokens, num_external_computed_tokens)
```
The number of queries to the connector cache is being miscalculated. request.num_tokens represents the total prompt length, but the query to the connector is only for the tokens not found in the local prefix cache.
Using the total prompt length overestimates the number of queries and will lead to an incorrectly low hit rate metric. The number of queries should be request.num_tokens - num_new_local_computed_tokens. The num_new_local_computed_tokens variable is available in this scope.
```diff
-self.connector.update_prefix_cache_stats(
-    request.num_tokens, num_external_computed_tokens)
+self.connector.update_prefix_cache_stats(
+    request.num_tokens - num_new_local_computed_tokens,
+    num_external_computed_tokens)
```
I actually prefer the current counting method — it reflects the connector’s contribution relative to the entire prompt, which gives a clearer view of overall cache effectiveness and system-level savings.
That said, I’m open to adjustments — maybe we can combine both approaches to expose connector-level stats as well.
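The difference between the two counting bases can be sketched as follows. This is an illustrative calculation only; the function and argument names are hypothetical, not the actual vLLM scheduler variables:

```python
def connector_hit_rates(num_tokens: int,
                        num_local_hits: int,
                        num_external_hits: int) -> tuple[float, float]:
    """Return (whole-prompt hit rate, connector-query hit rate)."""
    # Current PR: connector hits relative to the entire prompt,
    # reflecting system-level savings.
    whole_prompt = num_external_hits / num_tokens
    # Review suggestion: hits relative to only the tokens actually
    # queried from the connector (those the local prefix cache missed).
    queried = num_tokens - num_local_hits
    connector_only = num_external_hits / queried if queried else 0.0
    return whole_prompt, connector_only

# Example: 1000-token prompt, 400 local hits, 300 connector hits.
rates = connector_hit_rates(1000, 400, 300)
# whole-prompt rate: 0.3; connector-query rate: 0.5
```

Exposing both, as suggested, would let dashboards show system-level savings alongside connector-level effectiveness.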
Force-pushed from 00f903d to ae2b26e
NickLucche left a comment
I think this specific PrefixCacheStats is orthogonal to the general KVConnector logging we tried to design here #22188.
One example of customizing the logs here #25388.
And one more example of getting this to work on the Scheduler side in #26046 by @QierLi.
Have you considered the options above for implementing your use-case?
Good initiative, very interesting. Following.
This pull request has merge conflicts that must be resolved before it can be merged.
Thanks @NickLucche
Force-pushed from 41b04ab to 5144e4f
Very interesting and useful feature! Looking forward to it 👀
ApostaC left a comment
Aside from @NickLucche's comment, just some other comments regarding the current implementation:
Since the scheduler has enough information to calculate the number of requests, request tokens, and hit tokens, why do we need to add update_prefix_cache_stats and make_prefix_cache_stats to the KVConnectorBase_V1 at all?
We can just add a helper function in the scheduler and make it fully "connector agnostic".
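A minimal sketch of what such a scheduler-side helper could look like. All names here (`SchedulerStatsMixin`, `ConnectorPrefixCacheStats`, `_update_connector_prefix_cache_stats`) are hypothetical, chosen only to illustrate the "connector agnostic" idea:

```python
from dataclasses import dataclass

# Hypothetical stats container; vLLM's actual class is PrefixCacheStats.
@dataclass
class ConnectorPrefixCacheStats:
    requests: int = 0
    queries: int = 0
    hits: int = 0

    @property
    def hit_rate(self) -> float:
        return self.hits / self.queries if self.queries else 0.0

class SchedulerStatsMixin:
    """Illustrative scheduler-side helper: tallies connector cache stats
    without adding any methods to the connector base class."""

    def __init__(self) -> None:
        self.connector_prefix_cache_stats = ConnectorPrefixCacheStats()

    def _update_connector_prefix_cache_stats(
            self, num_queried_tokens: int,
            num_external_computed_tokens: int) -> None:
        # The scheduler already knows both values, so no connector
        # cooperation is needed.
        stats = self.connector_prefix_cache_stats
        stats.requests += 1
        stats.queries += num_queried_tokens
        stats.hits += num_external_computed_tokens
```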
Hey @ptovam, thanks for elaborating! I think I initially misunderstood your intent, as I was confused by the "agnostic" and the proposed changes:

I see your point, but looking at the impl this looks very connector-oriented given all the values come from

Good point, I will look to expand that. Regardless, my initial suggestion was based on the (wrong) assumption that we wanted to do something connector-centric here.
cc @markmc for the prefix cache hit metric
Force-pushed from 5144e4f to 8e8ea29
Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Yep, fair - "remote" wouldn't match the offloading use case well
Uh, my feedback is because it wasn't self-explanatory to me!
"External" is the language used in the scheduler, so that makes sense.

The offloading example does make me wonder though - if we add it the way it is proposed, I suspect it won't be long before users of e.g. offload with P/D want metrics from the individual connectors ... which brings me back to the idea of a single prefix cache metric with labels like

We can maintain the existing label-free metric for backwards compatibility and add a new one with labels.
Point taken 😃. I meant it just literally relates to all cache managed by the connector, so cannot be misinterpreted. I agree it's not actually explanatory, as it does not explain what that cache is. Like you said,
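The labeled-metric idea above can be sketched with a pure-stdlib stand-in. In a real deployment this would be a Prometheus counter with a `source` label; the class and metric names below are hypothetical:

```python
from collections import defaultdict

class LabeledCounter:
    """Stdlib sketch of a Prometheus-style counter with one label."""

    def __init__(self) -> None:
        self._values: dict[str, int] = defaultdict(int)

    def inc(self, source: str, amount: int) -> None:
        self._values[source] += amount

    def value(self, source: str) -> int:
        return self._values[source]

# One metric, distinguished by where the hit came from, instead of a
# separate metric per cache tier.
prefix_cache_hit_tokens = LabeledCounter()
prefix_cache_hit_tokens.inc("local", 400)      # GPU prefix cache
prefix_cache_hit_tokens.inc("connector", 300)  # KV connector
```

With labels, per-connector sources (e.g. offload vs. P/D) could later be added without another backwards-incompatible metric rename.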
Signed-off-by: tovam <tovam@pliops.com>
… to external_prefix_cache_*. Signed-off-by: tovam <tovam@pliops.com>
Signed-off-by: tovam <tovam@pliops.com>
Thanks @markmc and @kfirwolfson!

Added.

The distinction between connector sources isn't always straightforward - some connectors rely on multiple sources, and

To keep things simple and consistent, I'm keeping the existing
Additionally, I noticed the recent PR "Don't count preempted tokens in prefix cache hit rate" and applied the same logic to the connector metrics as well.

/gemini review
Code Review
This pull request introduces metrics for KV connector prefix cache efficiency, which is a valuable addition for monitoring performance. The implementation is clean, refactoring the stat collection logic into a record method on PrefixCacheStats and adding a new test case. However, I've identified an issue where statistics for preempted requests are not being included in the final metrics for either Prometheus or the standard logger. This could lead to an incomplete picture of cache performance. I've provided specific comments and suggestions to address this and ensure the metrics are comprehensive.
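The `record` refactor described above might look roughly like this. This is a hedged sketch: the field and method names mirror the review's description of `PrefixCacheStats`, but the actual vLLM implementation may differ:

```python
from dataclasses import dataclass

@dataclass
class PrefixCacheStats:
    """Illustrative aggregate of prefix-cache activity over an interval."""
    requests: int = 0
    queries: int = 0
    hits: int = 0

    def record(self, num_queries: int, num_hits: int) -> None:
        # Fold one request's counts into the running totals, so callers
        # (scheduler, loggers) share a single piece of collection logic.
        self.requests += 1
        self.queries += num_queries
        self.hits += num_hits

    def reset(self) -> None:
        # Loggers typically reset after each reporting interval.
        self.requests = self.queries = self.hits = 0

stats = PrefixCacheStats()
stats.record(num_queries=600, num_hits=300)
```

The preemption concern in the review amounts to making sure `record` is also invoked (or its counts preserved) for preempted requests before the logger resets the interval.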
NickLucche left a comment
Nice work @ptovam , thanks for the great input everyone!
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
…o step_forward * 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits)
- [Model] Add MoE support for NemotronH (vllm-project#25863)
- [Metrics] [KVConnector] Add connector prefix cache hit rate stats (vllm-project#26245)
- [CI] Reorganize entrypoints tests (vllm-project#27403)
- add SLA information into comparison graph for vLLM Benchmark Suite (vllm-project#25525)
- [CI/Build] Fix AMD CI: test_cpu_gpu.py (vllm-project#27388)
- [Bugfix] Fix args settings for guided decoding args (vllm-project#27375)
- [CI/Build] Fix Prithvi plugin test (vllm-project#27393)
- [Chore] Remove duplicate `has_` functions in vllm.utils (vllm-project#27372)
- [Model] Add num_cached_tokens for PoolingRequestOutput (vllm-project#27378)
- [V1][spec decode] return logprobs for spec decoding (vllm-project#26060)
- [CORE] Support Prefix Caching with Prompt Embeds (vllm-project#27219)
- [Bugfix][Core] running queue index leakage exception (vllm-project#26754)
- [Bugfix] Fix incorrect kv cache metrics in grafana.json (vllm-project#27133)
- [Bugfix] Fix SLA tuner initialization (vllm-project#27355)
- [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (vllm-project#27361)
- [MLA] Bump FlashMLA (vllm-project#27354)
- [Chore] Separate out system utilities from vllm.utils (vllm-project#27201)
- [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)
- [Feature] publisher default set zmq in kv_event config (vllm-project#26915)
- [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (vllm-project#27211)
- ...
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
…lm-project#26245) Signed-off-by: tovam <tovam@pliops.com>
Add prefix-cache metrics for KV connectors
Introduces connector-agnostic metrics to track KV connector cache efficiency,
enabling clearer insights into cache effectiveness and overall system performance.