Out of memory error using Python API but not with CLI #1582

Closed · shubhdotai opened this issue Jul 15, 2024 · 1 comment · Fixed by #1590
@shubhdotai

I am loading google/gemma-7b-it on a V100 GPU using both the CLI and the Python API. With the CLI it loads perfectly, but the Python API returns a CUDA out-of-memory error.

The same quantization and precision settings are used in both cases.

Command -
litgpt generate google/gemma-7b-it --quantize bnb.nf4 --precision bf16-true --max_new_tokens 256

Python code -

from litgpt import LLM
llm = LLM.load("google/gemma-7b-it", quantize="bnb.nf4", precision="bf16-true")
shubhdotai added the question (Further information is requested) label on Jul 15, 2024
@Andrei-Aksionov (Collaborator)

Hello @shubhdotai

Thanks for the report.

This is the same issue as with the chat script: #1558
During the load method, the KV cache is created with a size equal to the model's maximum context length, so far more memory is reserved than a short generation actually needs.
It should be changed as in PR #1583
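
For context, the CLI path avoids this because the generate script caps the model's context before allocating the cache. Below is a minimal sketch of that pattern, assuming litgpt's GPT.max_seq_length setter and set_kv_cache behave as in the generate script; the helper name is hypothetical and this is not the actual diff in #1583:

```python
# Minimal sketch (not the exact fix in #1583): size the KV cache for the
# tokens we actually plan to produce instead of the model's full context.
import torch
from litgpt.model import GPT

def set_small_kv_cache(model: GPT, prompt_len: int, max_new_tokens: int) -> None:
    # Hypothetical helper, for illustration only.
    max_returned_tokens = prompt_len + max_new_tokens
    # Capping max_seq_length means set_kv_cache allocates K/V buffers for
    # max_returned_tokens positions rather than the full context window.
    model.max_seq_length = max_returned_tokens
    model.set_kv_cache(batch_size=1, device=torch.device("cuda"))
```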
