Reading https://arxiv.org/abs/2311.04934 and wondering if I would gain anything from prompt cache.

My use case is prompts with overlapping prefixes (mostly a few big ones), and I already use vLLM's paged attention. Assume I only want to cache KV states for prefixes (not positioned anywhere in the prompt, like in the paper). Would there be any gain in caching prefix attention states, or is paged attention in vLLM already doing this?
Paper:

> Paged attention also demonstrates simple prefix sharing, where different prompts with an identical prefix share KV Cache.
Goal:

```
                                      shared inputs with prompt 1
                                                   |
                                                   |
+---------------------------------+         +-----+------+--------------------+
|                                 |   ...   | ////|///// |                    |
+---------------------------------+         +------------+--------------------+
             prompt 1                                  prompt 2
             request 1                                 request 2
```
- store prefix->kvs
- request
  - find shared inputs
  - assert_kv_cache(prefix-kvs)
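Roughly what I have in mind, as a toy sketch (all names below are made up for illustration; none of them are vLLM APIs):

```python
from typing import Dict, List, Tuple

PrefixKey = Tuple[int, ...]

# store prefix -> kvs: a token-id prefix mapped to an opaque KV-state handle
prefix_kv_cache: Dict[PrefixKey, object] = {}

def register_prefix(prefix_tokens: List[int], kv_state: object) -> None:
    """Store the KV states computed for a prefix so later requests can reuse them."""
    prefix_kv_cache[tuple(prefix_tokens)] = kv_state

def find_shared_inputs(prompt_tokens: List[int]) -> PrefixKey:
    """Longest registered prefix that the incoming prompt starts with (empty if none)."""
    best: PrefixKey = ()
    for prefix in prefix_kv_cache:
        if len(prefix) > len(best) and prompt_tokens[: len(prefix)] == list(prefix):
            best = prefix
    return best

def handle_request(prompt_tokens: List[int], prefill, decode):
    """prefill(new_tokens, past_kv) -> kv_state and decode(kv_state) -> text are
    stand-ins for whatever the serving backend actually does."""
    shared = find_shared_inputs(prompt_tokens)
    cached_kv = prefix_kv_cache.get(shared)          # assert_kv_cache(prefix-kvs)
    kv = prefill(prompt_tokens[len(shared):], past_kv=cached_kv)
    return decode(kv)
```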
Any gain from this idea? Does paged attention already skip the attention computation for these shared inputs, or is there anything to be gained from additionally caching prefix KVs? If it already caches across requests, what is the mechanism that keeps KV-cache entries from being evicted? I'm wondering whether there are still tweaks to make sure certain prefixes stay in the KV cache.
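For reference, the kind of usage I'm after looks like the sketch below. This assumes a vLLM version that exposes the automatic prefix caching flag (`enable_prefix_caching`); the model name is just an example.

```python
from vllm import LLM, SamplingParams

# Automatic prefix caching: KV blocks of a shared prompt prefix are kept in the
# paged KV cache and reused by later requests that start with the same tokens.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared_prefix = "You are a helpful assistant. <long shared instructions> "
prompts = [
    shared_prefix + "Summarize document A.",
    shared_prefix + "Summarize document B.",  # prefix KV blocks can be reused here
]

outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
for out in outputs:
    print(out.outputs[0].text)
```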
Thanks! @franklyd, but is there any detailed document/API regarding this mechanism? For example, how exactly the prefixes are stored, how long they last, how the matching works, etc. New to vLLM here :)
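My rough mental model of the matching part, as a toy sketch (definitely not vLLM's actual code; the block size, eviction policy, and all names here are assumptions):

```python
# Key each fixed-size block of prompt tokens by a hash of all tokens up to and
# including that block, look the hash up before recomputing, and evict the
# least recently used blocks when space runs out.
from collections import OrderedDict
from typing import List

BLOCK_SIZE = 16
MAX_CACHED_BLOCKS = 1024

# hash of (all tokens up to the end of a block) -> block id holding its KV data
cached_blocks: "OrderedDict[int, int]" = OrderedDict()

def block_hashes(token_ids: List[int]) -> List[int]:
    """One hash per full block; each hash covers the entire prefix so far."""
    return [
        hash(tuple(token_ids[: i + BLOCK_SIZE]))
        for i in range(0, len(token_ids) - BLOCK_SIZE + 1, BLOCK_SIZE)
    ]

def lookup(token_ids: List[int]) -> int:
    """Number of leading tokens whose KV blocks are already cached."""
    matched = 0
    for h in block_hashes(token_ids):
        if h not in cached_blocks:
            break
        cached_blocks.move_to_end(h)       # mark as recently used
        matched += BLOCK_SIZE
    return matched

def insert(block_hash: int, block_id: int) -> None:
    cached_blocks[block_hash] = block_id
    if len(cached_blocks) > MAX_CACHED_BLOCKS:
        cached_blocks.popitem(last=False)  # LRU eviction: drop the oldest entry
```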
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!