I am loading google/gemma-7b-it on a V100 GPU using both the CLI and the Python API. With the CLI it loads perfectly, but with the Python API it returns a CUDA out-of-memory error. The same quantization and precision settings were used in both cases.
The issue is the same as for the chat script (#1558): during the Python API's `load` method, the KV cache is allocated with a size equal to the model's maximum context length, rather than the length actually needed for generation. It should be changed as in PR #1583.
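For scale, here is a rough estimate of the KV-cache footprint when it is sized to Gemma-7B's full 8192-token context versus a typical generation budget. The layer/head figures below come from the published Gemma-7B config and are assumptions of this sketch, not measurements from litgpt:

```python
# Back-of-envelope KV-cache size for google/gemma-7b-it.
# Assumed Gemma-7B config: 28 layers, 16 KV heads (plain MHA, no GQA),
# head dim 256, bf16 cache entries.
n_layers = 28
n_kv_heads = 16
head_dim = 256
bytes_per_elem = 2  # bf16
batch_size = 1

def kv_cache_bytes(seq_len: int) -> int:
    # Factor of 2 covers the separate key and value tensors in each layer.
    return 2 * n_layers * batch_size * n_kv_heads * seq_len * head_dim * bytes_per_elem

print(f"sized to max context (8192): {kv_cache_bytes(8192) / 2**30:.2f} GiB")  # ~3.50 GiB
print(f"sized to prompt + 256 new tokens (~384): {kv_cache_bytes(384) / 2**30:.2f} GiB")  # ~0.16 GiB
```

On a 16 GB V100, for example, that extra ~3.3 GiB on top of the quantized weights can be the difference between fitting and OOM. litgpt's generate scripts right-size the cache by setting `model.max_seq_length` to the number of returned tokens before calling `model.set_kv_cache(batch_size=1)`, which is presumably the pattern PR #1583 brings to the Python API's load path.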
Command:

```bash
litgpt generate google/gemma-7b-it --quantize bnb.nf4 --precision bf16-true --max_new_tokens 256
```
Python code:
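A minimal sketch of the Python API equivalent of the command above, assuming litgpt's `LLM.load` forwards `quantize` and `precision` the same way the CLI flags do (the exact keyword names may differ between litgpt versions):

```python
# Hypothetical reproduction via the Python API; assumes LLM.load accepts
# quantize/precision keywords mirroring the CLI flags. Adjust to your
# litgpt version if the signature differs.
from litgpt import LLM

llm = LLM.load(
    "google/gemma-7b-it",
    quantize="bnb.nf4",
    precision="bf16-true",
)
print(llm.generate("What do llamas eat?", max_new_tokens=256))
```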