
Question: Does paged attention demonstrate prefix sharing? #2354

Open
bob-just-bob opened this issue Jan 5, 2024 · 5 comments

@bob-just-bob

bob-just-bob commented Jan 5, 2024

I am reading https://arxiv.org/abs/2311.04934 and wondering whether I would gain anything from a prompt cache.

My use case involves prompts with overlapping prefixes (mostly a few big ones), and I already use vLLM's paged attention.

Assume I only want to cache KV states for prefixes (not for segments positioned anywhere in the prompt, as in the paper). Would there be any gain from caching prefix attention states, or do paged attention and vLLM already do this?

Paper:

Paged attention also demonstrates simple prefix sharing, where different prompts with an identical prefix share KV cache.

Goal:

                                           shared inputs with prompt1
                                               |
                                               |
 +---------------------------------+     +-----+------+--------------------+
 |                                 | ... | ////|///// |                    |
 +---------------------------------+     +------------+--------------------+
  prompt 1                                           prompt 2
  request 1                                          request 2


- store prefix -> KVs
- on each request:
  - find shared inputs
  - assert_kv_cache(prefix_kvs)  # sketched in code below
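
Roughly what I have in mind, as a minimal Python sketch (compute_kv, attention_with_cache, and prefix_cache are made-up names for illustration, not vLLM internals):

def compute_kv(text):
    # Stand-in for the forward pass that produces KV states for `text`.
    return ("kv", text)

def attention_with_cache(cached_kv, new_text):
    # Stand-in: real code would attend over cached_kv plus the new tokens.
    return (cached_kv, new_text)

prefix_cache = {}  # prefix text -> precomputed KV states

def run_request(prompt, known_prefixes):
    # Find the longest known prefix this prompt starts with.
    match = max(
        (p for p in known_prefixes if prompt.startswith(p)),
        key=len,
        default=None,
    )
    if match is None:
        return attention_with_cache(None, prompt)
    if match not in prefix_cache:
        prefix_cache[match] = compute_kv(match)  # pay the prefill cost once
    # Only the suffix needs fresh attention computation; the prefix KV
    # states are reused across requests.
    return attention_with_cache(prefix_cache[match], prompt[len(match):])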


Any gain from this idea?

So, with paged attention, do we already skip the attention computation for the shared inputs, or is there anything to be gained from additionally caching prefix KVs?

If it already caches across requests, what mechanism keeps KV-cache entries from being evicted? I am wondering whether there are still tweaks to make sure certain prefixes stay in the KV cache.
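
To make the eviction question concrete, here is a purely conceptual Python sketch, not vLLM's actual code: hash-addressed KV blocks with reference counts, where only unreferenced blocks are eligible for eviction.

import hashlib
from collections import OrderedDict

BLOCK_SIZE = 16  # tokens per KV block; a fixed block size, as in paged attention

class BlockPool:
    """Toy model of block-level prefix sharing: KV blocks are keyed by a
    hash of the entire token prefix they cover, blocks in use are
    refcounted, and only refcount-0 blocks may be evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block key -> refcount

    def _key(self, tokens):
        return hashlib.sha1(repr(tokens).encode()).hexdigest()

    def acquire(self, prompt_tokens):
        keys = []
        for i in range(0, len(prompt_tokens), BLOCK_SIZE):
            # The hash covers the whole prefix up to this block, so two
            # prompts share a block only if everything before it matches.
            key = self._key(prompt_tokens[: i + BLOCK_SIZE])
            if key in self.blocks:
                self.blocks[key] += 1  # hit: share the cached block
            else:
                self._evict_one_if_full()
                self.blocks[key] = 1   # miss: compute KVs and store
            keys.append(key)
        return keys

    def release(self, keys):
        # Finished requests drop their references; blocks linger at
        # refcount 0 so later requests with the same prefix can hit them.
        for key in keys:
            self.blocks[key] -= 1

    def _evict_one_if_full(self):
        if len(self.blocks) < self.capacity:
            return
        for key, refs in self.blocks.items():
            if refs == 0:  # never evict a block a live request is using
                del self.blocks[key]
                return
        raise RuntimeError("all blocks are pinned by live requests")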

@William394873

Same question. Is there any update?

@franklyd

Is this related to PR #1669?

@William394873

Thanks @franklyd! But is there any detailed documentation/API for this mechanism? For example, how exactly the prefixes are stored, how long they last, how matching works, etc. New to vLLM here :)

@rkooo567
Collaborator

rkooo567 commented Mar 3, 2024

I believe #2614 can resolve your question! (It was also merged yesterday.)
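
For reference, a minimal usage sketch, assuming a vLLM release that includes that change (the model name is just an example; check the docs for the exact flag in your version):

from vllm import LLM, SamplingParams

# enable_prefix_caching turns on automatic KV-block reuse across requests
# (assuming a vLLM release that includes the change referenced above).
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)

shared = "You are a helpful assistant. " * 50  # long shared prompt prefix
params = SamplingParams(max_tokens=64)
outputs = llm.generate([shared + "Question A?", shared + "Question B?"], params)
# The second prompt should reuse the KV blocks already computed for `shared`.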

@github-actions

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024