Description
I'm interested in implementing code for automatically determining the optimal runtime parameters given some model and memory constraints. I imagine the implementation would use something like a "dummy" parameter which, when set, does not result in any actual memory allocations but enables the creation of `llama_model` and `llama_context` dummies that can be used to determine how much memory would be used for a given choice of `llama_model_params` and `llama_context_params`. By comparing the amount of memory used for the dummies with the amount of memory that is actually available, the implementation could then iteratively optimize parameters such as the context size or the number of GPU layers.
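As a very rough sketch (not working code), the optimization loop could look something like the following. The `dummy` flag in `llama_model_params` and the `llama_dummy_allocated_bytes()` helper are made-up names purely for illustration; only the default-params, load, and free functions are existing llama.cpp API, and even those may differ between versions.

```c
// Sketch only: `mparams.dummy` and llama_dummy_allocated_bytes() are
// hypothetical and just illustrate the idea of measuring memory use
// without performing real allocations.
#include <stdint.h>

#include "llama.h"

// Returns true if a dummy model + context for the given parameters would fit
// into the given amount of device memory.
static bool fits(const char * path, int32_t n_gpu_layers, uint32_t n_ctx, size_t mem_available) {
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = n_gpu_layers;
    mparams.dummy        = true; // hypothetical: suppresses all real allocations

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = n_ctx;

    struct llama_model   * model = llama_model_load_from_file(path, mparams);
    struct llama_context * ctx   = model ? llama_init_from_model(model, cparams) : NULL;

    // hypothetical: how much memory the dummies would have allocated on device 0
    const size_t needed = ctx ? llama_dummy_allocated_bytes(ctx, /*device =*/ 0) : SIZE_MAX;

    llama_free(ctx);
    llama_model_free(model);
    return needed <= mem_available;
}

// Naive search: start with full offload and reduce the number of GPU layers
// until the dummy fits. A real implementation would search more cleverly
// (e.g. binary search, or also adjusting n_ctx).
static int32_t max_gpu_layers(const char * path, uint32_t n_ctx, size_t mem_available) {
    for (int32_t ngl = 999; ngl >= 0; ngl--) {
        if (fits(path, ngl, n_ctx, mem_available)) {
            return ngl;
        }
    }
    return -1; // does not fit at all for this n_ctx
}
```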
One roadblock that I have run into is how to make this implementation minimally invasive for the rest of the code. Right now I think the way to do it would be:
- Extend `ggml_backend_device` to track the amount of memory that has been allocated to the device by the current process.
- Add a function like `ggml_backend_dev_get_device_dummy` that returns a dummy instead of the actual device.
- In llama.cpp, conditionally fetch the dummy devices. Some additional logic in `llama-model-load.cpp` will still be needed to avoid temporarily loading data from disk to RAM.
- Extend the logic of `llama_decode` a bit to allow for determining the allocated size of the worst-case graph.
- In the runtime parameter optimization code, simply iterate over the dummy devices and retrieve the amount of memory that was allocated (see the sketch after this list).
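As a rough sketch of that last step, assuming the extensions above: `ggml_backend_dev_get_device_dummy()` is the proposed function (its signature is guessed to mirror `ggml_backend_dev_get()`), and `ggml_backend_dev_allocated_bytes()` stands in for however the per-process allocation counter from the first bullet ends up being exposed; neither exists in ggml yet. The other calls are the existing ggml-backend device API.

```c
// Sketch only: ggml_backend_dev_get_device_dummy() and
// ggml_backend_dev_allocated_bytes() are the *proposed* additions, not
// existing ggml functions; the rest is the current ggml-backend device API.
#include <stdio.h>

#include "ggml-backend.h"

static void report_dummy_usage(void) {
    for (size_t i = 0; i < ggml_backend_dev_count(); i++) {
        // proposed: returns a dummy that counted allocations instead of
        // performing them
        ggml_backend_dev_t dev = ggml_backend_dev_get_device_dummy(i);

        // existing API: memory actually available on the underlying device
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total);

        // proposed: bytes the dummy would have allocated for the current
        // choice of llama_model_params/llama_context_params
        const size_t needed = ggml_backend_dev_allocated_bytes(dev);

        printf("%s: would use %zu bytes, %zu free / %zu total\n",
            ggml_backend_dev_name(dev), needed, free, total);
    }
}
```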
I'm very much open to suggestions, particularly from @slaren.