whisper : fix excessive memory usage #2443
Conversation
OK, thanks very much @ggerganov.
I now have to decrease the beam size to 4 in order to be able to decode anything (using the v3 model) without OOM (6 GB GPU).
This PR also increases only the decoder KV cache. The KV cache size for the large model is 84 MB per decoder, so if you want to run 8 beams, you will need an extra 672 MB of VRAM.
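For reference, the per-beam scaling above can be sanity-checked with a quick back-of-the-envelope calculation (a minimal sketch; the 84 MB per-decoder figure is taken from the comment above, the rest is just arithmetic):

```bash
# Rough extra-VRAM estimate for the decoder KV cache of the large model.
# Assumes ~84 MB of KV cache per decoder (one decoder per beam), as stated above.
kv_per_decoder_mb=84
n_beams=8
echo "extra VRAM for ${n_beams} beams: $((n_beams * kv_per_decoder_mb)) MB"   # -> 672 MB
```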
No, it doesn't. I don't think this PR uses more memory than it should. Could you double-check your findings?
Sure.
Today's master, beam size 5: no OOM, just running out of KV cache slots.
With your current change (2443):
beam size 4: OOM after 7 min.
beam size 3: OOM at the end :(
So I double-confirm: for some reason, doesn't this PR use more memory than expected?
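A reproduction along these lines might look roughly like the following (a hedged sketch: the model and audio paths are placeholders, and using the `main` example is an assumption; `-bs` sets the beam size in whisper.cpp's CLI):

```bash
# Run master vs. this branch with decreasing beam sizes and watch for OOM.
# Model and audio paths below are placeholders, not taken from the discussion.
./main -m models/ggml-large-v3.bin -f samples/long-audio.wav -bs 5
./main -m models/ggml-large-v3.bin -f samples/long-audio.wav -bs 4
./main -m models/ggml-large-v3.bin -f samples/long-audio.wav -bs 3
```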
Good news - I looked deeper into this and indeed found a huge memory overhead across the implementation. The regression started back in ggerganov/ggml#731; I wasn't careful to adapt the …

I pushed a fix into this branch which should significantly reduce the memory usage. Also, it would be useful if you run some additional tests with Flash Attention enabled (add …).

Looking for extra feedback whenever you get the time. If everything is good, we should make a new release as soon as possible to resolve this problem. Thanks.

P.S. I temporarily lost access to my CUDA workstation for the next few days, so I am unable to run tests with CUDA and will rely on feedback from this discussion.
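The Flash Attention test requested above (the exact flag is elided in the comment) would presumably look something like this; the `-fa` / `--flash-attn` flag is an assumption here, not confirmed by the comment:

```bash
# Re-run the same test with Flash Attention enabled (flag assumed to be -fa).
./main -m models/ggml-large-v3.bin -f samples/long-audio.wav -bs 5 -fa
```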
Of course.
I even suspect that the quality is now better than the original Python implementation.
Thanks, I also ran some tests today and all seems good.
* whisper : fix KV cache allocation
* whisper : reduce memory overhead from unused input tensors
alt #2433