Chat: dynamically set kv-cache size #1583
Conversation
Looked through it and tried it, and it works great! Thanks for the PR, Andrei.
But I think before merging we need to address Carlos's comment about torch.compile and this dynamic kv-cache size.
Can we enable this new way only if `compile` is disabled?
Exactly. Only I think a message should be printed saying that this will lead to higher memory consumption, and that if there are any OOM errors, to try without `compile`.
That's a good idea; without the message, people could think the higher memory usage is due to a bug.
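(For illustration, the suggested warning might look roughly like the sketch below; the `compile` flag name, the helper, and the exact wording are assumptions, not the PR's actual code.)

```python
def maybe_warn_about_kv_cache(compile: bool) -> None:
    # Hypothetical helper: when `compile` is on, dynamic kv-cache sizing is
    # disabled and the cache stays at the model's full context size, so warn
    # the user about the memory cost up front.
    if compile:
        print(
            "Warning: with `compile` enabled the kv-cache is allocated at the "
            "model's maximum context size, which increases memory consumption. "
            "If you hit OOM errors, try running without `compile`."
        )
```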
Well, it seems that compilation doesn't work anyway (#1584), so we were worried about the wrong thing 🙈.
Ouch. I don't have much experience with compilation, and I'm not sure how much effort it would take to fix that. For the moment, instead of keeping a bug around, what do you think about removing the compilation feature and later reintroducing it with Thunder?
Yes, I think it's better to remove it.
With the changes you added, this should be fine now. I think we should fix compilation in a separate PR; it's out of scope for this one. In that case, this looks good to merge from my side. Or do you have any concerns?
No, nothing else to add.
Awesome, thanks!
Hi there 👋
Fixes #1558
In the `chat` script the size of the kv-cache is set to the model's maximum context size, while in the `generate` script it is set to the sum of `len(prompt)` and `max_new_tokens`. This leads to unnecessary VRAM consumption, and sometimes OOM errors, in chat mode even when the generate script runs fine (since there the kv-cache can be much smaller, its size depending on the `max_new_tokens` value).

This PR adds a `max_new_tokens` argument to the chat script and sets the kv-cache in the same fashion as the generate script, only for each conversation turn.
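For illustration, here is a minimal sketch of the per-turn sizing idea. The `model.max_seq_length` attribute and `model.set_kv_cache(...)` call mirror how litgpt's generate script sizes its cache, but the helper name and its exact signature here are assumptions, not the PR's literal diff:

```python
import torch

def prepare_kv_cache_for_turn(model, prompt: torch.Tensor, max_new_tokens: int, device: torch.device) -> int:
    # Size the cache for this turn only: prompt length plus the tokens we will
    # generate, instead of the model's full maximum context size.
    max_returned_tokens = prompt.size(0) + max_new_tokens
    model.max_seq_length = max_returned_tokens
    model.set_kv_cache(batch_size=1, device=device)
    return max_returned_tokens
```

Because the cache is re-sized on every conversation turn, VRAM usage tracks the actual prompt length plus `max_new_tokens` rather than the model's full context window.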