
StreamingLLM support? #1253

Open
nivibilla opened this issue Oct 3, 2023 · 5 comments

@nivibilla

nivibilla commented Oct 3, 2023

Hey,

This was a really interesting solution to the KV cache for long context.
https://github.com/mit-han-lab/streaming-llm

I was wondering if it could be implemented here. From the looks of things, it doesn't change anything about the model itself; it's more about how the KV cache is managed.

They show coherent inference over millions of tokens.

Thanks!

@Guangxuan-Xiao

Hello vLLM Team,

I'd like to start by expressing my appreciation for your dedication to developing the vLLM framework. Tracking the project's evolution has been exciting. I believe that integrating StreamingLLM would be an enhancement of great value to many users.

To streamline a possible integration, I've outlined a suggested approach based on StreamingLLM's structure:

  1. Sliding Window KV Cache: Integrate the sliding window KV cache tailored for extensive generation tasks.

  2. Initial Token Persistence: Ensure that the KV entries of the starting tokens (e.g., the first page of tokens) are always kept within the current context window.

  3. Rotary Embedding Caching: Cache the key states before applying the rotary embedding, then reapply the rotary positional embedding to the cached keys during the generation stage.
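
To make these three steps concrete, here is a minimal PyTorch sketch of the cache policy they describe. The class name `SinkKVCache` and the `apply_rope` callable are placeholders for illustration only, not vLLM internals:

```python
import torch

class SinkKVCache:
    """Sliding-window KV cache that always keeps the first `num_sink` tokens."""

    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink   # step 2: initial tokens are never evicted
        self.window = window       # step 1: budget for the sliding window
        self.keys = []             # step 3: key states cached *before* RoPE
        self.values = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.num_sink + self.window:
            # Evict the oldest non-sink token; sink tokens stay at the front.
            del self.keys[self.num_sink]
            del self.values[self.num_sink]

    def keys_with_rope(self, apply_rope) -> torch.Tensor:
        # Re-apply rotary embedding using positions *inside the cache*
        # (0..n-1), not the tokens' original absolute positions.
        ks = torch.stack(self.keys)            # [n, num_heads, head_dim]
        positions = torch.arange(ks.shape[0])
        return apply_rope(ks, positions)
```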

Adhering to this approach, I'm confident that StreamingLLM can be integrated into the vLLM framework. The entire community, myself included, is enthusiastic about the potential of this feature!

Best,
Guangxuan

@MichaelZhouwang

Is there any update on this feature?

@Kaiyang-Chen
Contributor

Kaiyang-Chen commented Mar 19, 2024

Hi @Guangxuan-Xiao, I believe this feature is quite meaningful and I'm interested in helping to implement it. However, after some initial research, I feel there isn't a straightforward and efficient way to do it. The entries in the KV cache already include the positional embedding computed from each token's absolute position (applied right after the QKV linear projection, before caching), whereas in StreamingLLM the relative positions of the cached tokens need to change.

The most naive solution would be to store only the KV values from the QKV linear layer, before any position information is applied, and then add position information to the whole context window before the attention computation. This approach would require new CUDA kernels to handle such inputs, introduce recomputation over the current key cache at every step, and need extra memory for the intermediate key states. In addition, if we try to leverage the paged KV cache, then whenever token eviction or replacement happens we would need extra data copying, or extra space to store indices, in order to keep the tokens in fixed blocks while maintaining their relative order. I think the throughput would decrease significantly. Such a change is also completely different from the original memory-management logic and would be a significant modification, so I'm not sure the team would accept making the codebase especially complex for this feature.
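
A rough sketch of that naive path, just to illustrate the recomputation and intermediate memory it implies (the `apply_rope` helper is a placeholder, and this ignores vLLM's paged-block layout entirely):

```python
import torch

def decode_step(q, key_cache, value_cache, apply_rope):
    """One decoding step with a position-free KV cache.

    key_cache / value_cache: [n, num_heads, head_dim], stored straight from
    the QKV linear layer, i.e. *without* any rotary embedding applied.
    q: [num_heads, head_dim] query for the newest token, also pre-RoPE.
    """
    n = key_cache.shape[0]
    positions = torch.arange(n, device=key_cache.device)

    # Rotary embedding must be recomputed for every cached key on every step
    # (the recomputation cost), producing an extra intermediate tensor
    # (the extra memory for the intermediate key states).
    rotated_k = apply_rope(key_cache, positions)

    # The newest query sits at the last in-window position.
    rotated_q = apply_rope(q.unsqueeze(0), positions[-1:]).squeeze(0)

    scores = torch.einsum("hd,nhd->hn", rotated_q, rotated_k)
    scores = scores / key_cache.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hn,nhd->hd", probs, value_cache)
```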

Hi Team @WoosukKwon @zhuohan123, do you guys have any thoughts/hints about how to gracefully integrate this feature?

@halixness

+1

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024