Closed
Description
I am wondering what has happened and whether we can do something about it. Is this some kind of memory pool that has a larger size? Can we reduce this size if we want to? I have noticed this issue with a model that used to fit into my GPU before, but it now reports out of memory when I offload all layers to the GPU.
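For reference, the invocation I'm using is along these lines (the model path and layer count are just placeholders, not my exact command):

```
# -ngl / --n-gpu-layers controls how many layers are offloaded to the GPU;
# a large value like 99 effectively offloads all of them.
./main -m ./models/my-model.gguf -ngl 99
```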
@slaren, is it possible that this has something to do with the work you have done recently on managing GPU memory?
Will selecting the LLAMA_CUDA_F16 option at compile time decrease GPU memory use during inference?
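For context, by LLAMA_CUDA_F16 I mean the build-time option, e.g. enabled in a CMake build roughly like this (exact flag names may differ between versions):

```
# Assumed CMake configuration with CUDA enabled and the F16 option turned on.
cmake -B build -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_F16=ON
cmake --build build --config Release
```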