
ggml-backend : add load_tensor() to backend API #13106


Draft
rgerganov wants to merge 2 commits into master

Conversation

rgerganov (Collaborator)

This patch proposes a new backend API for loading a tensor's data from a precomputed hash stored in the model KV metadata.
The main use case is faster model load times with the RPC backend when multiple hosts are used for distributed LLM inference.
When the model is loaded with mmap and a precomputed hash is available for the current tensor, we can skip reading the actual data and instead ask the backend to load the data identified by that hash. The RPC backend already supports this with the RPC_CMD_SET_TENSOR_HASH command (which may be renamed to RPC_CMD_LOAD_TENSOR).
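
A rough sketch of the loader-side logic (the function names below are illustrative stand-ins, not the exact API added by this patch):

```cpp
// Illustrative sketch only: load_tensor_by_hash() and set_tensor_data() are
// hypothetical stand-ins for the proposed backend call and the existing
// upload path; they are not the actual names in this patch.
#include <cstddef>
#include <cstdint>

bool load_tensor_by_hash(uint64_t hash);               // proposed: resolve data from a backend-side cache
void set_tensor_data(const void * data, size_t size);  // existing path: upload the raw bytes

void load_one_tensor(const void * mmap_data, size_t size, bool has_hash, uint64_t hash) {
    if (has_hash && load_tensor_by_hash(hash)) {
        return;                       // backend already has the data -> no transfer over the wire
    }
    set_tensor_data(mmap_data, size); // no hash or cache miss -> send the bytes as before
}
```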

In this PoC I have modified llama-gguf-hash to generate a new GGUF file with a .rpc suffix which contains FNV-1a hashes for all of the tensors.
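
For reference, this is the standard 64-bit FNV-1a over each tensor's raw bytes; a minimal standalone version looks like this (a sketch, not the actual llama-gguf-hash code):

```cpp
// Standard FNV-1a (64-bit) over a byte buffer -- reference sketch only.
#include <cstddef>
#include <cstdint>
#include <cstdio>

static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV-1a 64-bit prime
    }
    return h;
}

int main() {
    const uint8_t sample[] = { 0x01, 0x02, 0x03, 0x04 };
    printf("%016llx\n", (unsigned long long) fnv1a_64(sample, sizeof(sample)));
    return 0;
}
```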

With the rpc-server running on localhost I am seeing huge improvements in model load time, e.g.:

  • gemma-3-4b-it-q4_0.gguf - 5052.50 ms
  • gemma-3-4b-it-q4_0.gguf.rpc - 2148.51 ms

I will do more testing over a local network soon.

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here.
Let me know what you think.

Add a new backend API which allows loading a tensor's data from precomputed
hashes stored in the model KV.

ref: ggml-org#12954
@rgerganov rgerganov requested a review from slaren April 25, 2025 11:00
@github-actions github-actions bot added the examples and ggml labels Apr 25, 2025
@rgerganov (Collaborator, Author)

Steps to try this out:

  1. Generate a GGUF with precomputed hashes using llama-gguf-hash:

$ bin/llama-gguf-hash --fnv gemma-3-4b-it-q4_0.gguf

This will create gemma-3-4b-it-q4_0.gguf.rpc, which is the same as gemma-3-4b-it-q4_0.gguf plus precomputed hashes for all tensors.

  2. Start rpc-server with a local cache:

$ bin/rpc-server -c

  3. Run bin/llama-cli -m gemma-3-4b-it-q4_0.gguf.rpc --rpc localhost:50052 -ngl 99 .... The first run will populate the RPC cache, and the load time for the second run should be much smaller.

@github-actions github-actions bot added the Nvidia GPU, Vulkan, SYCL, Apple Metal and Kompute labels Apr 25, 2025
@slaren (Member) commented Apr 25, 2025

I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.
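
Roughly like this; the ggml_backend_rpc_load_tensor name and its signature are hypothetical, only the ggml_backend_reg_* calls are existing ggml API:

```cpp
// Sketch: obtain an RPC-specific entry point instead of extending the base
// backend interface. The symbol name and signature below are hypothetical.
#include "ggml-backend.h"
#include <cstdint>

typedef bool (*rpc_load_tensor_t)(ggml_backend_t backend, struct ggml_tensor * tensor, uint64_t hash);

static rpc_load_tensor_t get_rpc_load_tensor(void) {
    ggml_backend_reg_t reg = ggml_backend_reg_by_name("RPC");
    if (!reg) {
        return nullptr; // RPC backend not available
    }
    // returns nullptr if the backend does not export this symbol
    return (rpc_load_tensor_t) ggml_backend_reg_get_proc_address(reg, "ggml_backend_rpc_load_tensor");
}
```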

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.

@steampunque

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here. Let me know what you think.

I think this is the best way to manage large models using RPC. Performance gains will start at 2x for 2 RPC servers and go up from there (3x for 3, 4x for 4, etc.), predominantly when loading models over USB3 HDDs, which will be a very typical use case (I don't store models on SSDs to avoid degrading their lifetime, particularly big models). Since RPC is a core feature and arguably one of the best features llama.cpp offers for running big models on limited-VRAM commodity GPUs, I think it makes sense to alter APIs or add hooks as necessary to let it be as efficient as possible. Once the model is cached in system RAM the gains will be smaller, but for iterating on versions of big models (lately I have been testing hybrid layer quants on Llama Scout, for example) it will save a huge amount of time when loading and testing new versions.

@rgerganov (Collaborator, Author)

I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.

Ah, right, I forgot about this feature, thanks. I will rework the patch to use ggml_backend_reg_get_proc_address.

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.

I guess we can switch to the 128-bit version of the FNV hash and bump the version of the RPC protocol. I don't think we need anything bigger than 128 bits here. As for the hash function, I think that a non-cryptographic hash like FNV is fine since we use it for caching purposes.

If we go with 128-bit hashes then I guess we should store them as hex strings in the GGUF metadata, right?
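
Something along these lines, as a sketch (assumes GCC/Clang's unsigned __int128; the constants are the published FNV-128 parameters, and the hex helper is just one way to serialize the value into the GGUF KV):

```cpp
// Sketch of a 128-bit FNV-1a plus hex serialization -- illustrative only,
// not part of this patch. Requires GCC/Clang (unsigned __int128).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

using u128 = unsigned __int128;

static u128 fnv1a_128(const uint8_t * data, size_t len) {
    // FNV-128 offset basis: 0x6c62272e07bb014262b821756295c58d
    u128 h = ((u128) 0x6c62272e07bb0142ULL << 64) | 0x62b821756295c58dULL;
    // FNV-128 prime: 2^88 + 2^8 + 0x3b
    const u128 prime = ((u128) 0x0000000001000000ULL << 64) | 0x000000000000013bULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= prime;
    }
    return h;
}

// 32-character hex string, e.g. the value stored under a per-tensor KV key
static std::string fnv1a_128_hex(const uint8_t * data, size_t len) {
    u128 h = fnv1a_128(data, len);
    char buf[33];
    snprintf(buf, sizeof(buf), "%016llx%016llx",
             (unsigned long long) (uint64_t) (h >> 64), (unsigned long long) (uint64_t) h);
    return std::string(buf);
}
```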
