Chat: dynamically set kv-cache size #1583

Merged · 3 commits · Jul 15, 2024

Conversation

Andrei-Aksionov (Collaborator)

Hi there 👋

Fixes #1558

In the chat script, the size of the kv-cache is set to the model's maximum context size, while in the generate script it is set to the sum of len(prompt) and max_new_tokens. This leads to unnecessary VRAM consumption and sometimes to OOM errors in chat mode, while the generate script might run fine (since its kv-cache can be much smaller, as it depends on the max_new_tokens value).

This PR adds a max_new_tokens argument to the chat script and sets the kv-cache in the same fashion as the generate script, but does so for each conversation turn, as sketched below.
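For reference, a minimal sketch of that per-turn sizing, assuming litgpt's `GPT` API (the `max_seq_length` setter, `clear_kv_cache`, and `set_kv_cache`); the helper function itself is illustrative, not the merged code:

```python
import torch
from litgpt.model import GPT  # litgpt's low-level model class


def resize_kv_cache_for_turn(model: GPT, prompt_length: int, max_new_tokens: int, device: torch.device) -> int:
    """Size the kv-cache for a single conversation turn, mirroring the generate
    script, instead of preallocating it for the model's full context length."""
    max_returned_tokens = prompt_length + max_new_tokens
    # drop the cache from the previous turn, then allocate one capped at what
    # this turn can actually produce
    model.clear_kv_cache()
    model.max_seq_length = max_returned_tokens
    model.set_kv_cache(batch_size=1, device=device)
    return max_returned_tokens
```

Calling something like this at the start of every turn keeps the cache proportional to len(prompt) + max_new_tokens rather than to the model's full block size.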

@rasbt (Collaborator) left a comment

Looked through it and tried it, and it works great! Thanks for the PR, Andrei.

@Andrei-Aksionov (Collaborator, Author)

But I think, before that, we need to address Carlos's comment about torch.compile and this dynamic kv-cache size.

@rasbt (Collaborator) commented Jul 15, 2024

Can we enable this new way when compile=False and use the old way when compile=True?

@Andrei-Aksionov (Collaborator, Author)

Exactly. Only I think a message should be printed saying that this will lead to higher memory consumption and that, if there are any OOM errors, one should try without torch.compile.
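Roughly what that could look like, as a sketch only: the `compile` flag, the warning wording, and the function name are placeholders, not the code that was merged.

```python
import torch
from litgpt.model import GPT


def enable_kv_cache(model: GPT, compile: bool, prompt_length: int, max_new_tokens: int, device: torch.device) -> None:
    if compile:
        # old behaviour: preallocate the cache for the full context length so the
        # compiled graph keeps a static shape
        print(
            "torch.compile is enabled, so the kv-cache is preallocated for the model's"
            " maximum context length; this increases memory usage. If you run into OOM"
            " errors, try again without torch.compile."
        )
        model.set_kv_cache(batch_size=1, device=device)
    else:
        # new behaviour: size the cache for this conversation turn only
        model.max_seq_length = prompt_length + max_new_tokens
        model.set_kv_cache(batch_size=1, device=device)
```

Keeping the static, full-size cache in the compile branch avoids recompilation on every turn, at the cost of the higher memory usage the warning describes.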

@rasbt (Collaborator) commented Jul 15, 2024

That's a good idea; without the message, people could think the higher memory usage is due to a bug.

@Andrei-Aksionov (Collaborator, Author)

Well, it seems that compilation doesn't work anyway (#1584), so we were worrying about the wrong thing 🙈.

@rasbt (Collaborator) commented Jul 15, 2024

Ouch. I don't have much experience with compilation, so I'm not sure how much effort it would take to fix that. For the moment, instead of shipping with a bug, what do you think about removing the compilation feature and reintroducing it later with Thunder?

@Andrei-Aksionov (Collaborator, Author)

Yes, I think it's better to remove it.
The only thing is that I wouldn't count on Thunder too much; I believe it's still at a very early stage.
So it's better to remove it, but only for a brief moment, because compilation is very important for generation.
In #924 you can see a table with benchmarks. Take a look at the utilization column: I remember that with compilation (for a non-quantized model) utilization jumped to ~90%, and TPS (tokens per second) was also much higher.

@rasbt (Collaborator) commented Jul 15, 2024

With the changes you added, this should be fine now. I think we should fix the compilation in a separate PR; it's out of scope for this one. In that case, this should be good to merge from my view, or do you have any concerns?

@Andrei-Aksionov (Collaborator, Author)

No, nothing else to add.
Let's merge.

@rasbt (Collaborator) commented Jul 15, 2024

Awesome, thanks!

rasbt merged commit ca64704 into Lightning-AI:main on Jul 15, 2024
9 checks passed
Andrei-Aksionov deleted the chat_kv_cache_preallocation branch on July 15, 2024 at 19:35
Development

Successfully merging this pull request may close these issues.

Chat consumes more VRAM than Generate