
StreamingLLM support? #1253

Open
nivibilla opened this issue Oct 3, 2023 · 5 comments

@nivibilla

nivibilla commented Oct 3, 2023

Hey,

This was a really interesting solution to the KV cache for long context.
https://github.com/mit-han-lab/streaming-llm

I was wondering if it could be implemented here. From the looks of things, it doesn't change anything about the model itself; it's more about how the KV cache is managed.

They show coherent inference over millions of tokens.

Thanks!

@Guangxuan-Xiao

Hello vLLM Team,

I'd like to start by expressing my appreciation for your dedication to developing the vLLM framework. Tracking the project's evolution has been exciting. I believe that integrating StreamingLLM would be an enhancement of great value to many users.

To streamline a possible integration, I've outlined a suggested approach based on StreamingLLM's structure:

  1. Sliding Window KV Cache: Integrate the sliding window KV cache tailored for extensive generation tasks.

  2. Initial Token Persistence: Ensure that the KV entries of the starting tokens (e.g., the first page of tokens) are always kept within the current context window.

  3. Rotary Embedding Caching: Cache the key states before applying the rotary embedding, then reapply the rotary positional embedding to the cached keys during the generation stage.
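
To make these three steps concrete, here is a minimal PyTorch sketch of the cache policy they describe. The class name `SinkKVCache` and the `apply_rope` callable are placeholders for illustration only, not vLLM internals:

```python
import torch

class SinkKVCache:
    """Sliding-window KV cache that always keeps the first `num_sink` tokens."""

    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink   # step 2: initial tokens are never evicted
        self.window = window       # step 1: budget for the sliding window
        self.keys = []             # step 3: key states cached *before* RoPE
        self.values = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)
        if len(self.keys) > self.num_sink + self.window:
            # Evict the oldest non-sink token; sink tokens stay at the front.
            del self.keys[self.num_sink]
            del self.values[self.num_sink]

    def keys_with_rope(self, apply_rope) -> torch.Tensor:
        # Re-apply rotary embedding using positions *inside the cache*
        # (0..n-1), not the tokens' original absolute positions.
        ks = torch.stack(self.keys)            # [n, num_heads, head_dim]
        positions = torch.arange(ks.shape[0])
        return apply_rope(ks, positions)
```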

Adhering to this approach, I'm confident that StreamingLLM can be integrated into the vLLM framework. The entire community, myself included, is enthusiastic about the potential of this feature!

Best,
Guangxuan

@MichaelZhouwang

Is there any update on this feature?

@Kaiyang-Chen
Contributor

Kaiyang-Chen commented Mar 19, 2024

Hi @Guangxuan-Xiao, I believe this feature is quite meaningful and I'm interested in helping to implement it. However, after some initial research, I feel there isn't a straightforward and efficient way to do it. The entries in the KV cache already include the positional embedding computed from each token's absolute position (applied right after the QKV linear projection, before caching), whereas in StreamingLLM the relative positions of the cached tokens need to change.

The most naive solution would be to store only the KV values from the QKV linear layer, before any position information is applied, and then add position information to the whole context window before the attention computation. This approach would require new CUDA kernels to handle such inputs, introduce recomputation over the current key cache at every step, and need extra memory for the intermediate key states. In addition, if we try to leverage the paged KV cache, then whenever token eviction or replacement happens we would need extra data copying, or extra space to store indices, in order to keep the tokens in fixed blocks while maintaining their relative order. I think the throughput would decrease significantly. Such a change is also completely different from the original memory-management logic and would be a significant modification, so I'm not sure the team would accept making the codebase especially complex for this feature.
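
A rough sketch of that naive path, just to illustrate the recomputation and intermediate memory it implies (the `apply_rope` helper is a placeholder, and this ignores vLLM's paged-block layout entirely):

```python
import torch

def decode_step(q, key_cache, value_cache, apply_rope):
    """One decoding step with a position-free KV cache.

    key_cache / value_cache: [n, num_heads, head_dim], stored straight from
    the QKV linear layer, i.e. *without* any rotary embedding applied.
    q: [num_heads, head_dim] query for the newest token, also pre-RoPE.
    """
    n = key_cache.shape[0]
    positions = torch.arange(n, device=key_cache.device)

    # Rotary embedding must be recomputed for every cached key on every step
    # (the recomputation cost), producing an extra intermediate tensor
    # (the extra memory for the intermediate key states).
    rotated_k = apply_rope(key_cache, positions)

    # The newest query sits at the last in-window position.
    rotated_q = apply_rope(q.unsqueeze(0), positions[-1:]).squeeze(0)

    scores = torch.einsum("hd,nhd->hn", rotated_q, rotated_k)
    scores = scores / key_cache.shape[-1] ** 0.5
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hn,nhd->hd", probs, value_cache)
```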

Hi Team @WoosukKwon @zhuohan123, do you guys have any thoughts/hints about how to gracefully integrate this feature?

@halixness

+1

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024