
Fix ggml_tensor_extra_gpu memory leak #2146

Conversation

@eugen-ajechiloae-clearml commented Jul 8, 2023

Fixes #2145. Allocate the `extra` member of `ggml_tensor` statically.
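For illustration, a rough sketch of the idea: keep the backend-specific data inside `ggml_tensor` itself so it is freed together with the tensor, instead of pointing `extra` at a separately heap-allocated struct. The member names and layout here are assumptions, not the actual diff:

```c
// Sketch only: member names and layout are assumptions, not the actual diff.
#define GGML_CUDA_MAX_DEVICES 16

struct ggml_tensor_extra_gpu {
    void * data_device[GGML_CUDA_MAX_DEVICES];  // one device pointer per GPU
};

struct ggml_tensor {
    // ... existing fields (type, ne, nb, op, src, data, name, ...) ...

    void * extra;                            // current: points to a heap-allocated struct
                                             // that is never freed for temporary tensors

    struct ggml_tensor_extra_gpu extra_gpu;  // proposed: storage lives inside the tensor,
                                             // so it goes away together with the tensor
};
```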
@JohannesGaessler (Collaborator) left a comment


Would be fine with me, but ggml changes will need to be approved by @ggerganov.

@mqy (Collaborator) commented Jul 8, 2023

#2099 uses `extra`.

@ggerganov (Owner)

Not a good change - the CUDA implementation should clean up the allocated memory.

@ggerganov closed this Jul 9, 2023
@eugen-ajechiloae-clearml (Author)

Hi @ggerganov! Is there a reason why you allocate the memory dynamically? Cleaning it up seems very difficult: the same memory region appears to be used by multiple instances of `ggml_tensor`, and tracking every allocated pointer is not easy, at least for me.
I think that adding a new member to `ggml_tensor` with type `ggml_tensor_extra_cpu` would make memory management a lot easier, and I really don't think it adds significant overhead. We could also leave the `extra` pointer untouched, so that #2099 is unaffected.
What do you think?

@ggerganov (Owner)

The goal is to keep `ggml_tensor` general-purpose. Adding CUDA-specific members is not desired since it incentivises doing the same for the other backends, and this does not scale well.

> tracking every pointer allocated is not easy to me at least.

Is it really so difficult? I guess we could pre-allocate a single large-enough buffer in `ggml_init_cublas()` and then emplace each `ggml_tensor_extra_gpu` inside it. Keep a counter and increase it for each new struct. Reset the counter at the start of each eval. Finally, free the buffer upon exit.

If it is indeed too difficult, we can think of some alternative approaches, but the main requirement is to not couple `ggml_tensor` with GPU-specific things.
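For concreteness, a minimal sketch of the pre-allocated buffer plus counter approach described above; the pool size and the helper names are hypothetical, not part of the existing CUDA backend:

```c
// Minimal sketch of the suggested approach: a pre-allocated pool of
// ggml_tensor_extra_gpu structs with a counter that is reset every eval.
// Pool size, struct layout, and helper names are assumptions for illustration.
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GGML_CUDA_MAX_DEVICES 16        // mirrors the per-GPU pointer array in the CUDA backend
#define GGML_CUDA_MAX_EXTRAS  16384     // hypothetical upper bound on extras per graph

struct ggml_tensor_extra_gpu {
    void * data_device[GGML_CUDA_MAX_DEVICES];   // device pointers, one per GPU
};

static struct ggml_tensor_extra_gpu g_extra_pool[GGML_CUDA_MAX_EXTRAS];
static size_t g_extra_count = 0;

// Hand out the next struct from the pool instead of heap-allocating one.
static struct ggml_tensor_extra_gpu * ggml_cuda_alloc_extra(void) {
    assert(g_extra_count < GGML_CUDA_MAX_EXTRAS);
    struct ggml_tensor_extra_gpu * extra = &g_extra_pool[g_extra_count++];
    memset(extra, 0, sizeof(*extra));
    return extra;
}

// Called at the start of each eval: every extra handed out for the previous
// graph becomes reusable at once, so nothing is leaked.
static void ggml_cuda_reset_extras(void) {
    g_extra_count = 0;
}
```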

@bullno1 (Contributor) commented Jul 11, 2023

AFAIK, there are two types of `ggml_tensor_extra_gpu`: those bound to the model/context and those bound to the temporary context in `llama_eval_internal`. Only the latter get leaked; the former are freed in the destructor.

Since their lifecycle is tied to `ctx0`, they should also be freed at the same time as `ctx0`, and their lifecycle should be tracked separately from the model/context-bound ones.

All temporary allocations are made in `ggml_cuda_assign_buffers`, so they can be tracked right there.
I think it's better to make offloading contextful:

typedef void (*offload_func_t)(void * offload_ctx, struct ggml_tensor * tensor);

int llama_eval_internal(...) {
    struct ggml_context * ctx0 = ggml_init(...);

#ifdef GGML_USE_CUBLAS
    // Create a context that owns every temporary ggml_tensor_extra_gpu
    void * offload_ctx = ggml_cuda_init_offload_ctx(...);
#endif

    // Now the offload_func will have to be called like this
    offload_func(offload_ctx, tensor);

#ifdef GGML_USE_CUBLAS
    // Frees all extras allocated through this context
    ggml_cuda_free_offload_ctx(offload_ctx);
#endif

    ggml_free(ctx0);
}

It's a bit more verbose, but we correctly track the lifecycle of those temporary allocations.
A free list or bump allocator can be used.

`ggml_cuda_assign_buffers` would have to take a context pointer and would look different from the rest of the family (`ggml_cuda_assign_buffers_no_scratch`, `ggml_cuda_assign_buffers_force_inplace`), but it is used in a very different manner anyway.

The `offload_ctx` could also be dropped, but then allocations would have to be linked to some global/TLS state.
It would just be `ggml_cuda_offload_begin()` and `ggml_cuda_offload_end()`.
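A minimal sketch of what such a contextful offload could look like, implemented as a bump allocator; all names and sizes here are hypothetical and only illustrate the proposal, not existing API:

```c
// Hypothetical offload context: a bump allocator for the temporary
// ggml_tensor_extra_gpu structs created during one eval.
#include <assert.h>
#include <stdlib.h>

#define GGML_CUDA_MAX_DEVICES 16

struct ggml_tensor_extra_gpu {
    void * data_device[GGML_CUDA_MAX_DEVICES];
};

struct ggml_cuda_offload_ctx {
    struct ggml_tensor_extra_gpu * extras;  // pre-allocated pool
    size_t capacity;                        // number of structs in the pool
    size_t used;                            // bump pointer
};

// Allocate the pool at the start of llama_eval_internal (or reuse a cached one).
static struct ggml_cuda_offload_ctx * ggml_cuda_init_offload_ctx(size_t capacity) {
    struct ggml_cuda_offload_ctx * ctx = calloc(1, sizeof(*ctx));
    ctx->extras   = calloc(capacity, sizeof(*ctx->extras));
    ctx->capacity = capacity;
    return ctx;
}

// Bump-allocate one extra; this is what ggml_cuda_assign_buffers(ctx, tensor)
// would call instead of heap-allocating per tensor.
static struct ggml_tensor_extra_gpu * ggml_cuda_offload_alloc_extra(struct ggml_cuda_offload_ctx * ctx) {
    assert(ctx->used < ctx->capacity);
    return &ctx->extras[ctx->used++];
}

// Free every temporary extra in one go, right before ggml_free(ctx0).
static void ggml_cuda_free_offload_ctx(struct ggml_cuda_offload_ctx * ctx) {
    free(ctx->extras);
    free(ctx);
}
```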

@eugen-ajechiloae-clearml (Author)

Memory management seems a bit too complicated for me in this case, so I just wanted to let you know that I will not open another PR for this issue. Thank you guys for your input!

Linked issue: [User] ggml_tensor->extra(s) are not freed at the end of llama_eval, causing a memory leak