StreamingLLM support? #1253
Hello vLLM Team, I'd like to start by expressing my appreciation for your dedication to developing the vLLM framework. Tracking the project's evolution has been nothing short of exhilarating. I believe that integrating StreamingLLM would be an enhancement of great value to many users. To streamline this possible integration, I've curated a suggested approach based on StreamingLLM's structure:
Adhering to this approach, I'm confident that StreamingLLM can be integrated into the vLLM framework. The entire community, myself included, is enthusiastic about the potential of this feature! Best,
Is there any update on this feature?
Hi @Guangxuan-Xiao, I believe this feature is quite meaningful and I'm interested in helping to implement it. However, after some initial research, I feel that there isn't a straightforward and efficient method. This is because the entries in the KV cache already have the rotary positional embedding applied to the keys, so evicting tokens from the middle of the cache leaves the remaining entries with stale position information. The most naive solution would be to only store KV values from before the rotary embedding is applied and re-apply it at attention time, using each token's position within the cache. Hi Team @WoosukKwon @zhuohan123, do you guys have any thoughts/hints about how to gracefully integrate this feature?
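To make that concrete, here is a minimal PyTorch sketch (not vLLM code) of the naive idea: cache keys before rotary embedding and rotate them at attention time, with positions taken from each entry's index within the cache, which is how StreamingLLM assigns positions. The `rotate_half`/`apply_rope` helpers, the tensor shapes, and the single-token `attend` function are illustrative assumptions, not vLLM's paged-attention kernels or cache layout.

```python
import torch

def rotate_half(x):
    # Swap and negate the two halves of the last dimension (standard RoPE helper).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, positions, base=10000.0):
    # x: [num_tokens, num_heads, head_dim], positions: [num_tokens] (integer tensor)
    head_dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = positions.float()[:, None] * inv_freq[None, :]    # [num_tokens, head_dim // 2]
    emb = torch.cat((angles, angles), dim=-1)                   # [num_tokens, head_dim]
    cos, sin = emb.cos()[:, None, :], emb.sin()[:, None, :]     # broadcast over heads
    return x * cos + rotate_half(x) * sin

def attend(query, cached_keys, cached_values):
    # query: [num_heads, head_dim] for the newest token, *unrotated*.
    # cached_keys / cached_values: [cache_len, num_heads, head_dim], keys *unrotated*.
    # Positions are assigned by index within the cache rather than by original text
    # position, so middle tokens can be evicted without invalidating the rotation.
    cache_len = cached_keys.shape[0]
    k = apply_rope(cached_keys, torch.arange(cache_len))
    q = apply_rope(query[None], torch.tensor([cache_len]))[0]
    scores = torch.einsum("hd,thd->ht", q, k) / (q.shape[-1] ** 0.5)
    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("ht,thd->hd", probs, cached_values)

# Tiny smoke test with random tensors: 16 cached tokens, 8 heads, head_dim 64.
keys, values = torch.randn(16, 8, 64), torch.randn(16, 8, 64)
out = attend(torch.randn(8, 64), keys, values)   # -> [8, 64]
```

The cost of this scheme is that every cached key must be re-rotated on each decoding step, which is why it is only the "most naive" starting point rather than an efficient design.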
+1
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Hey,
This is a really interesting solution to the KV-cache problem for long contexts:
https://github.com/mit-han-lab/streaming-llm
I was wondering if it could be implemented here. From the looks of things, it doesn't change anything about the model itself; it's more about how the KV cache is implemented.
They show that they can maintain coherent inference over millions of tokens.
Thanks!
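For readers who haven't looked at the repo, the cache policy itself is small: keep a handful of initial "attention sink" tokens plus a rolling window of the most recent tokens, and evict everything in between. Below is a toy Python sketch of that rule under illustrative assumptions (the class name, default sizes, and opaque (key, value) entries are made up); it is not vLLM's block-based cache nor the mit-han-lab implementation.

```python
from collections import deque

class SinkWindowKVCache:
    """Toy per-sequence cache implementing the StreamingLLM eviction rule:
    keep `num_sinks` initial tokens plus the `window` most recent tokens."""

    def __init__(self, num_sinks: int = 4, window: int = 1020):
        self.num_sinks = num_sinks
        self.sinks = []                      # first few tokens, never evicted
        self.recent = deque(maxlen=window)   # rolling window; oldest entries fall off

    def append(self, kv):
        # kv is an opaque (key, value) pair for one token.
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)           # deque evicts the oldest automatically

    def entries(self):
        # Sinks first, then the recent window. RoPE positions would be
        # re-assigned as 0 .. len-1 over this list (positions within the cache).
        return self.sinks + list(self.recent)
```

Because the cache is bounded at `num_sinks + window` entries, a sequence can keep generating indefinitely without growing memory, which is what the "millions of tokens" claim above refers to.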