[User] ggml_tensor->extra(s) are not freed at the end of llama_eval, causing a memory leak #2145

Closed
eugen-ajechiloae-clearml opened this issue Jul 8, 2023 · 0 comments · Fixed by #2220


eugen-ajechiloae-clearml commented Jul 8, 2023

Expected Behavior

llama.cpp should not leak memory when compiled with LLAMA_CUBLAS=1

Current Behavior

llama.cpp leaks memory when compiled with LLAMA_CUBLAS=1

Environment and Context

You can find my environment below, but we were able to reproduce this issue on multiple machines.

CPU: AMD Ryzen 7 3700X 8-Core Processor
GPU: NVIDIA GeForce RTX 2070 Super

OS: Ubuntu 22.04.1

Python3 version: 3.10.6
Make version: 4.3
g++ version: 11.3.0

Failure Information

In ggml-cuda.cu, the function ggml_cuda_transform_tensor

struct ggml_tensor_extra_gpu * extra = new struct ggml_tensor_extra_gpu;

and the function ggml_cuda_assign_buffers_impl

struct ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu;

both allocate a ggml_tensor_extra_gpu on the host (CPU) with new, and this memory is never freed anywhere.
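
To make the pattern concrete, here is a minimal, self-contained sketch of the leak. The struct and function here are simplified stand-ins, not the real ggml-cuda.cu code: each call allocates a fresh extra with new, stores it in tensor->extra, and nothing ever deletes it, so repeated llama_eval calls keep accumulating host allocations.

#include <cstdio>

// simplified stand-in for ggml_tensor_extra_gpu
struct extra_gpu_sketch {
    void * data_device; // device pointer(s) would live here
};

// simplified stand-in for ggml_tensor
struct tensor_sketch {
    void * extra; // extra things e.g. for ggml-cuda.cu
};

// stand-in for ggml_cuda_assign_buffers_impl, called for every offloaded tensor
static void assign_buffers_sketch(tensor_sketch * tensor) {
    extra_gpu_sketch * extra = new extra_gpu_sketch{}; // host allocation
    tensor->extra = extra; // old value (if any) is overwritten, never deleted
}

int main() {
    tensor_sketch t{};
    for (int i = 0; i < 1000; ++i) {
        assign_buffers_sketch(&t); // one extra allocated per call, none freed
    }
    std::printf("done; 999 of the 1000 extras are now unreachable and leaked\n");
}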

Steps to Reproduce

# docker image capable of running llama.cpp, you could also just run the other instructions on your local machine
docker run -it --rm --ipc host --network host --gpus all nvcr.io/nvidia/pytorch:22.11-py3 bash

# install
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python/vendor
rmdir llama.cpp/
git clone https://github.com/ggerganov/llama.cpp.git

cd ../..

# download model
python3 llama-cpp-python/docker/open_llama/hug_model.py -a SlyEcho -s open_llama_3b -f "q5_1"

# build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python/

Then run the following Python script:

from time import sleep
import gc
import llama_cpp
import multiprocessing
import psutil
import os

N_THREADS = multiprocessing.cpu_count() // 2

prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"

lparams = llama_cpp.llama_context_default_params()
lparams.n_gpu_layers = 43
lparams.n_ctx = 1024 * 2
lparams.use_mmap = True
lparams.low_vram = False
ctx = llama_cpp.llama_init_from_file(b"model.bin", lparams)

# determine the required inference memory per token:
tmp = [0, 1, 2, 3]


def test():
    prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
    n_past = 0
    prompt = b" " + prompt
    embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()
    n_of_tok = llama_cpp.llama_tokenize(ctx, prompt, embd_inp, len(embd_inp), True)
    embd_inp = embd_inp[:n_of_tok]
    n_ctx = llama_cpp.llama_n_ctx(ctx)
    n_predict = 20
    n_predict = min(n_predict, n_ctx - len(embd_inp))
    input_consumed = 0
    input_noecho = False
    remaining_tokens = n_predict
    embd = []
    last_n_size = 64
    last_n_tokens_data = [0] * last_n_size
    n_batch = 24
    last_n_repeat = 64
    repeat_penalty = 1
    frequency_penalty = 0.0
    presence_penalty = 0.0
    while remaining_tokens > 0:
        buf = (llama_cpp.c_int * len(embd))(*embd)
        if len(embd) > 0:
            llama_cpp.llama_eval(
                ctx, buf, len(embd), n_past, N_THREADS
            )
        n_past += len(embd)
        embd = []
        if len(embd_inp) <= input_consumed:
            logits = llama_cpp.llama_get_logits(ctx)
            n_vocab = llama_cpp.llama_n_vocab(ctx)
            _arr_0 = (llama_cpp.llama_token_data * n_vocab)(*[
                llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)
                for token_id in range(n_vocab)
            ])
            candidates_p = llama_cpp.ctypes.pointer(llama_cpp.llama_token_data_array(_arr_0, len(_arr_0), False))
            
            _arr = (llama_cpp.c_int * len(last_n_tokens_data))(*last_n_tokens_data)
            llama_cpp.llama_sample_repetition_penalty(ctx, candidates_p,
                _arr,
                last_n_repeat, repeat_penalty)
            llama_cpp.llama_sample_frequency_and_presence_penalties(ctx, candidates_p,
                _arr,
                last_n_repeat, frequency_penalty, presence_penalty)
            llama_cpp.llama_sample_top_k(ctx, candidates_p, 40, llama_cpp.c_size_t(1))
            llama_cpp.llama_sample_top_p(ctx, candidates_p, 0.8, llama_cpp.c_size_t(1))
            llama_cpp.llama_sample_temperature(ctx, candidates_p, 0.2)
            id = llama_cpp.llama_sample_token(ctx, candidates_p)
            last_n_tokens_data = last_n_tokens_data[1:] + [id]
            embd.append(id)
            input_noecho = False
            remaining_tokens -= 1
            del _arr_0
            del candidates_p
            del _arr
            del logits
            del n_vocab
        else:
            while len(embd_inp) > input_consumed:
                embd.append(embd_inp[input_consumed])
                last_n_tokens_data = last_n_tokens_data[1:] + [embd_inp[input_consumed]]
                input_consumed += 1
                if len(embd) >= n_batch:
                    break
        if not input_noecho:
            for id in embd:
                print(
                    llama_cpp.llama_token_to_str(ctx, id).decode("utf-8", errors="ignore"),
                    end="",
                    flush=True,
                )
        if len(embd) > 0 and embd[-1] == llama_cpp.llama_token_eos():
            del buf
            break
        del buf
    print("Memory used: ", psutil.Process(os.getpid()).memory_percent())
    print()
    llama_cpp.llama_print_timings(ctx)
    del n_of_tok
    del embd_inp
    gc.collect()


while True:
    test()

You will notice that the CPU memory used by the program increases slowly but steadily with each call to test().

Proposed fix

Allocate the extra field of ggml_tensor

void * extra; // extra things e.g. for ggml-cuda.cu

statically, instead of creating a new ggml_tensor_extra_gpu on the heap for it on every call.

#2146
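
A minimal sketch of one way to realize this, assuming the extras that are recreated on every eval can be handed out from a reusable, statically sized pool instead of individual new calls. The names and the pool size below are illustrative only and not taken from the actual fix in #2220; the extras created once for persistent weight tensors (in ggml_cuda_transform_tensor) would still need their own lifetime handling.

#include <cstddef>

// simplified stand-in for ggml_tensor_extra_gpu
struct extra_gpu_sketch {
    void * data_device;
};

constexpr size_t MAX_EXTRAS = 4096; // hypothetical upper bound on offloaded tensors per graph

static extra_gpu_sketch g_extra_pool[MAX_EXTRAS]; // static storage, never leaked
static size_t           g_extra_pool_used = 0;

// called once at the start of each eval to recycle every slot
static void reset_extra_pool() {
    g_extra_pool_used = 0;
}

// would replace `new ggml_tensor_extra_gpu` in the buffer-assignment path
static extra_gpu_sketch * alloc_extra() {
    if (g_extra_pool_used >= MAX_EXTRAS) {
        return nullptr; // caller must handle exhaustion
    }
    return &g_extra_pool[g_extra_pool_used++];
}

int main() {
    reset_extra_pool();                    // at the start of an eval
    extra_gpu_sketch * e = alloc_extra();  // instead of `new ggml_tensor_extra_gpu`
    return e != nullptr ? 0 : 1;
}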
