
Automatic optimization of runtime parameters such as -ngl given memory constraints #13860

@JohannesGaessler

Description


I'm interested in implementing code for automatically determining optimal runtime parameters given some model and memory constraints. I imagine the implementation would use something like a "dummy" parameter which, when set, does not result in any actual memory allocations but enables the creation of llama_model and llama_context dummies. These dummies can be used to determine how much memory would be used for a given choice of llama_model_params and llama_context_params. By comparing the memory that would be used by the dummies with the memory that is actually available, the implementation could then iteratively optimize parameters such as the context size or the number of GPU layers.
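
To make the intended workflow concrete, here is a minimal sketch of the optimizer side. It assumes a hypothetical oracle `fits_in_memory()` that builds the llama_model/llama_context dummies for a given parameter set (via the proposed "dummy" parameter) and reports whether their accounted memory fits into what is actually available; neither the oracle nor the dummy flag exists in the current API, and `optimize_n_gpu_layers` is just one possible shape of the iteration:

```cpp
#include "llama.h"

// Hypothetical oracle: loads llama_model/llama_context dummies for the given parameters
// (setting the proposed "dummy" flag, so no real allocations happen) and returns whether
// the memory they would have allocated fits on every device. Not part of the current API.
bool fits_in_memory(const char * path_model, llama_model_params mparams, llama_context_params cparams);

// Example of iterative optimization: binary search for the largest -ngl that still fits,
// with the requested context size held fixed. Assumes memory use is monotonic in -ngl.
static int32_t optimize_n_gpu_layers(const char * path_model, llama_context_params cparams, int32_t n_layer) {
    int32_t lo = 0;           // offloading no layers is taken as the fallback that always fits
    int32_t hi = n_layer + 1; // +1 so that offloading the output layer is also considered
    while (lo < hi) {
        const int32_t mid = (lo + hi + 1)/2;
        llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = mid;
        if (fits_in_memory(path_model, mparams, cparams)) {
            lo = mid;
        } else {
            hi = mid - 1;
        }
    }
    return lo;
}
```

The same kind of loop could then be repeated or interleaved for the context size and other parameters.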

One roadblock that I have run into is how to make this implementation minimally invasive for the rest of the code. Right now I think the way to do it would be:

  • Extend ggml_backend_device to track the amount of memory that has been allocated to this device by the current process.
  • Add a function like ggml_backend_dev_get_device_dummy that returns a dummy instead of the actual device.
  • In llama.cpp, conditionally fetch the dummy devices. Some additional logic in llama-model-load.cpp will still be needed to avoid temporarily loading data from disk to RAM.
  • Extend the logic of llama_decode a bit to allow for determining the allocated size of the worst-case graph.
  • In the runtime parameter optimization code, simply iterate over the dummy devices and retrieve the amount of memory that was allocated (see the sketch after this list).
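
As a rough illustration of the first, second, and last bullets, this is how the optimizer could consume the proposed ggml-backend additions. `ggml_backend_dev_get_device_dummy` is the function proposed above; `ggml_backend_dev_get_allocated` and the `device_budget` struct are placeholder names invented for this sketch, not existing API:

```cpp
#include <vector>
#include "ggml-backend.h"

// Proposed additions from the list above -- these do NOT exist in ggml today.
// The second name is only an assumption for how the tracked allocation size could be read back.
ggml_backend_dev_t ggml_backend_dev_get_device_dummy(ggml_backend_dev_t dev);
size_t             ggml_backend_dev_get_allocated   (ggml_backend_dev_t dev);

struct device_budget {
    ggml_backend_dev_t dummy; // dummy counterpart that only tracks would-be allocations
    size_t             free;  // memory currently available on the real device
};

// Snapshot the available memory per device and fetch the corresponding dummy devices.
static std::vector<device_budget> collect_device_budgets() {
    std::vector<device_budget> budgets;
    for (size_t i = 0; i < ggml_backend_dev_count(); ++i) {
        ggml_backend_dev_t dev = ggml_backend_dev_get(i);
        size_t free, total;
        ggml_backend_dev_memory(dev, &free, &total);
        budgets.push_back({ggml_backend_dev_get_device_dummy(dev), free});
    }
    return budgets;
}

// After loading the dummy llama_model/llama_context and allocating the worst-case
// graph against the dummy devices, check whether everything would have fit.
static bool all_devices_fit(const std::vector<device_budget> & budgets) {
    for (const device_budget & b : budgets) {
        if (ggml_backend_dev_get_allocated(b.dummy) > b.free) {
            return false;
        }
    }
    return true;
}
```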

I'm very much open to suggestions, particularly from @slaren.
