Out of memory error using Python API but not with CLI #1582

Closed · shubhdotai opened this issue Jul 15, 2024 · 1 comment · Fixed by #1590
@shubhdotai

I am loading google/gemma-7b-it on a V100 GPU using both the CLI and the Python API. With the CLI it loads perfectly, but the Python API returns a CUDA out-of-memory error.

The same quantization and precision settings are used in both cases.

Command -
litgpt generate google/gemma-7b-it --quantize bnb.nf4 --precision bf16-true --max_new_tokens 256

Python code -

from litgpt import LLM
llm = LLM.load("google/gemma-7b-it", quantize="bnb.nf4", precision="bf16-true")
shubhdotai added the question (Further information is requested) label on Jul 15, 2024
@Andrei-Aksionov (Collaborator)

Hello @shubhdotai

Thanks for the report.

This is the same issue as with the chat script: #1558
During the load method, the KV cache is created with a size equal to the model's maximum context length, so far more memory is reserved than a short generation actually needs.
It should be changed as in PR #1583
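
For context, the CLI path avoids this because the generate script caps the model's context before allocating the cache. Below is a minimal sketch of that pattern, assuming litgpt's GPT.max_seq_length setter and set_kv_cache behave as in the generate script; the helper name is hypothetical and this is not the actual diff in #1583:

```python
# Minimal sketch (not the exact fix in #1583): size the KV cache for the
# tokens we actually plan to produce instead of the model's full context.
import torch
from litgpt.model import GPT

def set_small_kv_cache(model: GPT, prompt_len: int, max_new_tokens: int) -> None:
    # Hypothetical helper, for illustration only.
    max_returned_tokens = prompt_len + max_new_tokens
    # Capping max_seq_length means set_kv_cache allocates K/V buffers for
    # max_returned_tokens positions rather than the full context window.
    model.max_seq_length = max_returned_tokens
    model.set_kv_cache(batch_size=1, device=torch.device("cuda"))
```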
