
ggml-backend : add load_tensor() to backend API #13106


Draft
rgerganov wants to merge 2 commits into master

Conversation

rgerganov (Collaborator)

This patch proposes a new backend API for loading a tensor's data from a precomputed hash stored in the model KV metadata.
The main use case is faster model load times with the RPC backend when multiple hosts are used for distributed LLM inference.
When the model is loaded with mmap and a precomputed hash is available for the current tensor, we can skip reading the actual data and instead ask the backend to load the data identified by that hash. The RPC backend already supports this with the RPC_CMD_SET_TENSOR_HASH command (which may be renamed to RPC_CMD_LOAD_TENSOR).
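
A rough sketch of the loader-side logic (the function names below are illustrative stand-ins, not the exact API added by this patch):

```cpp
// Illustrative sketch only: load_tensor_by_hash() and set_tensor_data() are
// hypothetical stand-ins for the proposed backend call and the existing
// upload path; they are not the actual names in this patch.
#include <cstddef>
#include <cstdint>

bool load_tensor_by_hash(uint64_t hash);               // proposed: resolve data from a backend-side cache
void set_tensor_data(const void * data, size_t size);  // existing path: upload the raw bytes

void load_one_tensor(const void * mmap_data, size_t size, bool has_hash, uint64_t hash) {
    if (has_hash && load_tensor_by_hash(hash)) {
        return;                       // backend already has the data -> no transfer over the wire
    }
    set_tensor_data(mmap_data, size); // no hash or cache miss -> send the bytes as before
}
```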

In this PoC I have modified llama-gguf-hash to generate a new GGUF file with a .rpc suffix which contains FNV-1a hashes for all of the tensors.
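
For reference, this is the standard 64-bit FNV-1a over each tensor's raw bytes; a minimal standalone version looks like this (a sketch, not the actual llama-gguf-hash code):

```cpp
// Standard FNV-1a (64-bit) over a byte buffer -- reference sketch only.
#include <cstddef>
#include <cstdint>
#include <cstdio>

static uint64_t fnv1a_64(const uint8_t * data, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;   // FNV-1a 64-bit offset basis
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;            // FNV-1a 64-bit prime
    }
    return h;
}

int main() {
    const uint8_t sample[] = { 0x01, 0x02, 0x03, 0x04 };
    printf("%016llx\n", (unsigned long long) fnv1a_64(sample, sizeof(sample)));
    return 0;
}
```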

With the rpc-server running on localhost I am seeing huge improvements in model load time, e.g.:

  • gemma-3-4b-it-q4_0.gguf - 5052.50 ms
  • gemma-3-4b-it-q4_0.gguf.rpc - 2148.51 ms

I will do more testing over a local network soon.

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here.
Let me know what you think.

Add a new backend API which allows loading a tensor's data from precomputed
hashes stored in the model KV.

ref: ggml-org#12954
@rgerganov rgerganov requested a review from slaren April 25, 2025 11:00
@github-actions github-actions bot added the examples and ggml labels Apr 25, 2025
@rgerganov (Collaborator, Author)

Steps to try this out:

  1. Generate a GGUF with precomputed hashes using llama-gguf-hash:

$ bin/llama-gguf-hash --fnv gemma-3-4b-it-q4_0.gguf

This will create gemma-3-4b-it-q4_0.gguf.rpc, which is the same as gemma-3-4b-it-q4_0.gguf plus precomputed hashes for all tensors.

  2. Start rpc-server with a local cache:

$ bin/rpc-server -c

  3. Run bin/llama-cli -m gemma-3-4b-it-q4_0.gguf.rpc --rpc localhost:50052 -ngl 99 .... The first run will populate the RPC cache, and the load time for the second run should be much smaller.

@github-actions github-actions bot added the Nvidia GPU, Vulkan, SYCL, Apple Metal and Kompute labels Apr 25, 2025
@slaren (Member) commented Apr 25, 2025

I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.
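
Roughly like this; the ggml_backend_rpc_load_tensor name and its signature are hypothetical, only the ggml_backend_reg_* calls are existing ggml API:

```cpp
// Sketch: obtain an RPC-specific entry point instead of extending the base
// backend interface. The symbol name and signature below are hypothetical.
#include "ggml-backend.h"
#include <cstdint>

typedef bool (*rpc_load_tensor_t)(ggml_backend_t backend, struct ggml_tensor * tensor, uint64_t hash);

static rpc_load_tensor_t get_rpc_load_tensor(void) {
    ggml_backend_reg_t reg = ggml_backend_reg_by_name("RPC");
    if (!reg) {
        return nullptr; // RPC backend not available
    }
    // returns nullptr if the backend does not export this symbol
    return (rpc_load_tensor_t) ggml_backend_reg_get_proc_address(reg, "ggml_backend_rpc_load_tensor");
}
```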

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.

@steampunque

I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here. Let me know what you think.

I think this is the best way to manage large models using RPC. Performance gains will start at 2x for 2 RPC servers and go up from there (3x for 3, 4x for 4, etc.), predominantly when loading models over USB3 HDDs, which will be a very typical use case (I don't store models on SSDs to avoid degrading their lifetime, particularly big models). Since RPC is a core feature and arguably one of the best features llama.cpp offers for running big models on limited-VRAM commodity GPUs, I think it makes sense to alter APIs or add hooks as necessary to let it be as efficient as possible. Once the model is cached in system RAM the gains will be smaller, but for iterating on versions of big models (lately I have been testing hybrid layer quants on Llama Scout, for example) it will save a huge amount of time when loading and testing new versions.

@rgerganov (Collaborator, Author)

I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via ggml_backend_reg_get_proc_address.

Ah, right, I forgot about this feature, thanks. I will rework the patch to use ggml_backend_reg_get_proc_address.

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.

I guess we can switch to the 128-bit version of the FNV hash and bump the version of the RPC protocol. I don't think we need anything bigger than 128 bits here. As for the hash function, I think that a non-cryptographic hash like FNV is fine since we use it for caching purposes.

If we go with 128-bit hashes then I guess we should store them as hex strings in the GGUF metadata, right?
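
Something along these lines, as a sketch (assumes GCC/Clang's unsigned __int128; the constants are the published FNV-128 parameters, and the hex helper is just one way to serialize the value into the GGUF KV):

```cpp
// Sketch of a 128-bit FNV-1a plus hex serialization -- illustrative only,
// not part of this patch. Requires GCC/Clang (unsigned __int128).
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

using u128 = unsigned __int128;

static u128 fnv1a_128(const uint8_t * data, size_t len) {
    // FNV-128 offset basis: 0x6c62272e07bb014262b821756295c58d
    u128 h = ((u128) 0x6c62272e07bb0142ULL << 64) | 0x62b821756295c58dULL;
    // FNV-128 prime: 2^88 + 2^8 + 0x3b
    const u128 prime = ((u128) 0x0000000001000000ULL << 64) | 0x000000000000013bULL;
    for (size_t i = 0; i < len; ++i) {
        h ^= data[i];
        h *= prime;
    }
    return h;
}

// 32-character hex string, e.g. the value stored under a per-tensor KV key
static std::string fnv1a_128_hex(const uint8_t * data, size_t len) {
    u128 h = fnv1a_128(data, len);
    char buf[33];
    snprintf(buf, sizeof(buf), "%016llx%016llx",
             (unsigned long long) (uint64_t) (h >> 64), (unsigned long long) (uint64_t) h);
    return std::string(buf);
}
```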
