Chat: dynamically set kv-cache size #1583

Merged · 3 commits · Jul 15, 2024

Conversation

Andrei-Aksionov (Collaborator)

Hi there 👋

Fixes #1558

In the chat script, the size of the kv-cache is set to the model's maximum context size, while in the generate script it is set to the sum of len(prompt) and max_new_tokens. This leads to unnecessary VRAM consumption and sometimes to OOM errors in chat mode, while the generate script might run fine (since its kv-cache can be much smaller, as it depends on the max_new_tokens value).

This PR adds a max_new_tokens argument to the chat script and sets the kv-cache in the same fashion as the generate script, but does so for each conversation turn, as sketched below.
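For reference, a minimal sketch of that per-turn sizing, assuming litgpt's `GPT` API (the `max_seq_length` setter, `clear_kv_cache`, and `set_kv_cache`); the helper function itself is illustrative, not the merged code:

```python
import torch
from litgpt.model import GPT  # litgpt's low-level model class


def resize_kv_cache_for_turn(model: GPT, prompt_length: int, max_new_tokens: int, device: torch.device) -> int:
    """Size the kv-cache for a single conversation turn, mirroring the generate
    script, instead of preallocating it for the model's full context length."""
    max_returned_tokens = prompt_length + max_new_tokens
    # drop the cache from the previous turn, then allocate one capped at what
    # this turn can actually produce
    model.clear_kv_cache()
    model.max_seq_length = max_returned_tokens
    model.set_kv_cache(batch_size=1, device=device)
    return max_returned_tokens
```

Calling something like this at the start of every turn keeps the cache proportional to len(prompt) + max_new_tokens rather than to the model's full block size.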

@rasbt (Collaborator) left a comment

Looked through it and tried it, and it works great! Thanks for the PR, Andrei.

@Andrei-Aksionov (Collaborator, Author)

But I think, before that, we need to address Carlos's comment about torch.compile and this dynamic kv-cache size.

@rasbt (Collaborator) commented Jul 15, 2024

Can we enable this new way when compile=False and use the old way when compile=True?

@Andrei-Aksionov (Collaborator, Author)

Exactly. Only I think a message should be printed saying that this will lead to higher memory consumption and that, if there are any OOM errors, one should try without torch.compile.
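Roughly what that could look like, as a sketch only: the `compile` flag, the warning wording, and the function name are placeholders, not the code that was merged.

```python
import torch
from litgpt.model import GPT


def enable_kv_cache(model: GPT, compile: bool, prompt_length: int, max_new_tokens: int, device: torch.device) -> None:
    if compile:
        # old behaviour: preallocate the cache for the full context length so the
        # compiled graph keeps a static shape
        print(
            "torch.compile is enabled, so the kv-cache is preallocated for the model's"
            " maximum context length; this increases memory usage. If you run into OOM"
            " errors, try again without torch.compile."
        )
        model.set_kv_cache(batch_size=1, device=device)
    else:
        # new behaviour: size the cache for this conversation turn only
        model.max_seq_length = prompt_length + max_new_tokens
        model.set_kv_cache(batch_size=1, device=device)
```

Keeping the static, full-size cache in the compile branch avoids recompilation on every turn, at the cost of the higher memory usage the warning describes.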

@rasbt (Collaborator) commented Jul 15, 2024

That's a good idea; without the message, people could think the higher memory usage is due to a bug.

@Andrei-Aksionov (Collaborator, Author)

Well, it seems that compilation doesn't work anyway (#1584), so we were worrying about the wrong thing 🙈.

@rasbt (Collaborator) commented Jul 15, 2024

Ouch. I don't have much experience with compilation, so I'm not sure how much effort it would take to fix that. For the moment, instead of shipping with a bug, what do you think about removing the compilation feature and reintroducing it later with Thunder?

@Andrei-Aksionov (Collaborator, Author)

Yes, I think it's better to remove it.
The only thing is that I wouldn't count on Thunder too much; I believe it's still at a very early stage.
So it's better to remove it, but only for a brief moment, because compilation is very important for generation.
In #924 you can see a table with benchmarks. Take a look at the utilization column: I remember that with compilation (for a non-quantized model) utilization jumped to ~90%, and TPS (tokens per second) was also much higher.

@rasbt (Collaborator) commented Jul 15, 2024

With the changes you added, this should be fine now. I think we should fix the compilation in a separate PR; it's out of scope for this one. In that case, this should be good to merge from my view, or do you have any concerns?

@Andrei-Aksionov (Collaborator, Author)

No, nothing else to add.
Let's merge.

@rasbt (Collaborator) commented Jul 15, 2024

Awesome, thanks!

rasbt merged commit ca64704 into Lightning-AI:main on Jul 15, 2024
9 checks passed
Andrei-Aksionov deleted the chat_kv_cache_preallocation branch on July 15, 2024 at 19:35
Development

Successfully merging this pull request may close these issues.

Chat consumes more VRAM than Generate