[User] ggml_tensor->extra(s) are not freed at the end of llama_eval, causing a memory leak #2145

Closed
eugen-ajechiloae-clearml opened this issue Jul 8, 2023 · 0 comments · Fixed by #2220


eugen-ajechiloae-clearml commented Jul 8, 2023

Expected Behavior

llama.cpp should not leak memory when compiled with LLAMA_CUBLAS=1

Current Behavior

llama.cpp leaks memory when compiled with LLAMA_CUBLAS=1

Environment and Context

You can find my environment below, but we were able to reproduce this issue on multiple machines.

CPU: AMD Ryzen 7 3700X 8-Core Processor
GPU: NVIDIA GeForce RTX 2070 Super

OS: Ubuntu 22.04.1

Python3 version: 3.10.6
Make version: 4.3
g++ version: 11.3.0

Failure Information

In ggml-cuda.cu, the function ggml_cuda_transform_tensor

struct ggml_tensor_extra_gpu * extra = new struct ggml_tensor_extra_gpu;

and the function ggml_cuda_assign_buffers_impl

struct ggml_tensor_extra_gpu * extra = new ggml_tensor_extra_gpu;

both allocate a ggml_tensor_extra_gpu on the host (CPU) with new, and this memory is never freed anywhere.
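
To make the pattern concrete, here is a minimal, self-contained sketch of the leak. The struct and function here are simplified stand-ins, not the real ggml-cuda.cu code: each call allocates a fresh extra with new, stores it in tensor->extra, and nothing ever deletes it, so repeated llama_eval calls keep accumulating host allocations.

#include <cstdio>

// simplified stand-in for ggml_tensor_extra_gpu
struct extra_gpu_sketch {
    void * data_device; // device pointer(s) would live here
};

// simplified stand-in for ggml_tensor
struct tensor_sketch {
    void * extra; // extra things e.g. for ggml-cuda.cu
};

// stand-in for ggml_cuda_assign_buffers_impl, called for every offloaded tensor
static void assign_buffers_sketch(tensor_sketch * tensor) {
    extra_gpu_sketch * extra = new extra_gpu_sketch{}; // host allocation
    tensor->extra = extra; // old value (if any) is overwritten, never deleted
}

int main() {
    tensor_sketch t{};
    for (int i = 0; i < 1000; ++i) {
        assign_buffers_sketch(&t); // one extra allocated per call, none freed
    }
    std::printf("done; 999 of the 1000 extras are now unreachable and leaked\n");
}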

Steps to Reproduce

# docker image capable of running llama.cpp, you could also just run the other instructions on your local machine
docker run -it --rm --ipc host --network host --gpus all nvcr.io/nvidia/pytorch:22.11-py3 bash

# install
git clone https://github.com/abetlen/llama-cpp-python.git
cd llama-cpp-python/vendor
rmdir llama.cpp/
git clone https://github.com/ggerganov/llama.cpp.git

cd ../..

# download model
python3 llama-cpp-python/docker/open_llama/hug_model.py -a SlyEcho -s open_llama_3b -f "q5_1"

# build
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python/

Then run the following Python script:

from time import sleep
import gc
import llama_cpp
import multiprocessing
import psutil
import os

N_THREADS = multiprocessing.cpu_count() // 2

prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"

lparams = llama_cpp.llama_context_default_params()
lparams.n_gpu_layers = 43
lparams.n_ctx = 1024 * 2
lparams.use_mmap = True
lparams.low_vram = False
ctx = llama_cpp.llama_init_from_file(b"model.bin", lparams)

# determine the required inference memory per token:
tmp = [0, 1, 2, 3]


def test():
    prompt = b"\n\n### Instruction:\nWhat is the capital of France?\n\n### Response:\n"
    n_past = 0
    prompt = b" " + prompt
    embd_inp = (llama_cpp.llama_token * (len(prompt) + 1))()
    n_of_tok = llama_cpp.llama_tokenize(ctx, prompt, embd_inp, len(embd_inp), True)
    embd_inp = embd_inp[:n_of_tok]
    n_ctx = llama_cpp.llama_n_ctx(ctx)
    n_predict = 20
    n_predict = min(n_predict, n_ctx - len(embd_inp))
    input_consumed = 0
    input_noecho = False
    remaining_tokens = n_predict
    embd = []
    last_n_size = 64
    last_n_tokens_data = [0] * last_n_size
    n_batch = 24
    last_n_repeat = 64
    repeat_penalty = 1
    frequency_penalty = 0.0
    presence_penalty = 0.0
    while remaining_tokens > 0:
        buf = (llama_cpp.c_int * len(embd))(*embd)
        if len(embd) > 0:
            llama_cpp.llama_eval(
                ctx, buf, len(embd), n_past, N_THREADS
            )
        n_past += len(embd)
        embd = []
        if len(embd_inp) <= input_consumed:
            logits = llama_cpp.llama_get_logits(ctx)
            n_vocab = llama_cpp.llama_n_vocab(ctx)
            _arr_0 = (llama_cpp.llama_token_data * n_vocab)(*[
                llama_cpp.llama_token_data(token_id, logits[token_id], 0.0)
                for token_id in range(n_vocab)
            ])
            candidates_p = llama_cpp.ctypes.pointer(llama_cpp.llama_token_data_array(_arr_0, len(_arr_0), False))
            
            _arr = (llama_cpp.c_int * len(last_n_tokens_data))(*last_n_tokens_data)
            llama_cpp.llama_sample_repetition_penalty(ctx, candidates_p,
                _arr,
                last_n_repeat, repeat_penalty)
            llama_cpp.llama_sample_frequency_and_presence_penalties(ctx, candidates_p,
                _arr,
                last_n_repeat, frequency_penalty, presence_penalty)
            llama_cpp.llama_sample_top_k(ctx, candidates_p, 40, llama_cpp.c_size_t(1))
            llama_cpp.llama_sample_top_p(ctx, candidates_p, 0.8, llama_cpp.c_size_t(1))
            llama_cpp.llama_sample_temperature(ctx, candidates_p, 0.2)
            id = llama_cpp.llama_sample_token(ctx, candidates_p)
            last_n_tokens_data = last_n_tokens_data[1:] + [id]
            embd.append(id)
            input_noecho = False
            remaining_tokens -= 1
            del _arr_0
            del candidates_p
            del _arr
            del logits
            del n_vocab
        else:
            while len(embd_inp) > input_consumed:
                embd.append(embd_inp[input_consumed])
                last_n_tokens_data = last_n_tokens_data[1:] + [embd_inp[input_consumed]]
                input_consumed += 1
                if len(embd) >= n_batch:
                    break
        if not input_noecho:
            for id in embd:
                print(
                    llama_cpp.llama_token_to_str(ctx, id).decode("utf-8", errors="ignore"),
                    end="",
                    flush=True,
                )
        if len(embd) > 0 and embd[-1] == llama_cpp.llama_token_eos():
            del buf
            break
        del buf
    print("Memory used: ", psutil.Process(os.getpid()).memory_percent())
    print()
    llama_cpp.llama_print_timings(ctx)
    del n_of_tok
    del embd_inp
    gc.collect()


while True:
    test()

You will notice that the CPU memory used by the program increases slowly but steadily with each call to test().

Proposed fix

Allocate the extra field of ggml_tensor

void * extra; // extra things e.g. for ggml-cuda.cu

statically, instead of creating a new ggml_tensor_extra_gpu on the heap for it on every call.

#2146
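
A minimal sketch of one way to realize this, assuming the extras that are recreated on every eval can be handed out from a reusable, statically sized pool instead of individual new calls. The names and the pool size below are illustrative only and not taken from the actual fix in #2220; the extras created once for persistent weight tensors (in ggml_cuda_transform_tensor) would still need their own lifetime handling.

#include <cstddef>

// simplified stand-in for ggml_tensor_extra_gpu
struct extra_gpu_sketch {
    void * data_device;
};

constexpr size_t MAX_EXTRAS = 4096; // hypothetical upper bound on offloaded tensors per graph

static extra_gpu_sketch g_extra_pool[MAX_EXTRAS]; // static storage, never leaked
static size_t           g_extra_pool_used = 0;

// called once at the start of each eval to recycle every slot
static void reset_extra_pool() {
    g_extra_pool_used = 0;
}

// would replace `new ggml_tensor_extra_gpu` in the buffer-assignment path
static extra_gpu_sketch * alloc_extra() {
    if (g_extra_pool_used >= MAX_EXTRAS) {
        return nullptr; // caller must handle exhaustion
    }
    return &g_extra_pool[g_extra_pool_used++];
}

int main() {
    reset_extra_pool();                    // at the start of an eval
    extra_gpu_sketch * e = alloc_extra();  // instead of `new ggml_tensor_extra_gpu`
    return e != nullptr ? 0 : 1;
}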
