ggml-backend : add load_tensor() to backend API #13106
base: master
Conversation
Add a new backend API which allows loading a tensor's data from a precomputed hash stored in the model KV. ref: ggml-org#12954
Steps to try this out:
$ bin/llama-gguf-hash --fnv gemma-3-4b-it-q4_0.gguf
This will create a new GGUF file with a `.rpc` suffix, with FNV-1a hashes for all of the tensors added to the KV metadata.
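For context, the kind of backend entry point the title refers to might look roughly like the sketch below. This is only an illustration of the idea (load a tensor's contents from a precomputed content hash, with a boolean result so the caller can fall back to a normal upload); the function name, hash width, and signature are assumptions, not taken from the PR.

```c
// Hedged sketch of a load-by-hash backend entry point. All names here are
// illustrative; the actual PR may use a different signature and placement.
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct ggml_tensor;                            // defined in ggml.h
typedef struct ggml_backend * ggml_backend_t;  // as declared in ggml-backend.h

// Fill `tensor` with the data identified by `hash`.
// Returns true if the backend could resolve the hash (e.g. the rpc-server
// already has the bytes cached), false if the caller must upload the data.
typedef bool (*ggml_backend_load_tensor_fn)(
    ggml_backend_t       backend,
    struct ggml_tensor * tensor,
    uint64_t             hash,
    size_t               offset,
    size_t               size);
```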
I don't think this function needs to be part of the base ggml interface; it is just never going to be used outside of the RPC backend. It would be better to use a custom function that is obtained via the backend registry.

I also do not think that a 64-bit hash is enough to uniquely identify a tensor. The chance of collisions is too high for this to be acceptable.
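If the function is exposed that way, the caller looks it up at runtime instead of calling through the base interface. A minimal sketch of that pattern, assuming the existing registry API in ggml-backend.h (`ggml_backend_reg_by_name`, `ggml_backend_reg_get_proc_address`); the exported symbol name and its signature below are placeholders:

```c
// Look up an optional, RPC-specific load-by-hash function at runtime instead
// of adding it to the base backend interface. The symbol name
// "ggml_backend_rpc_load_tensor" and its signature are placeholders.
#include <stdbool.h>
#include <stdint.h>
#include "ggml-backend.h"

typedef bool (*rpc_load_tensor_fn)(ggml_backend_t backend, struct ggml_tensor * tensor,
                                   const uint8_t * hash, size_t hash_size);

static rpc_load_tensor_fn get_rpc_load_tensor(void) {
    ggml_backend_reg_t reg = ggml_backend_reg_by_name("RPC");
    if (reg == NULL) {
        return NULL; // RPC backend not compiled in / not registered
    }
    // returns NULL if the backend does not export this optional function
    return (rpc_load_tensor_fn) ggml_backend_reg_get_proc_address(reg, "ggml_backend_rpc_load_tensor");
}
```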
I think this is the best way to manage large models using RPC. Performance gains will start at 2x for 2 RPC hosts and go up from there (3x for 3, 4x for 4, etc.), predominantly when loading models from USB3 HDDs, which will be a very typical use case (I don't store models on SSDs to avoid degrading their lifetime, particularly big models). Since RPC is a core feature, and arguably one of the best features llama.cpp offers for running big models on commodity GPUs with limited VRAM, I think it makes sense to alter APIs or add hooks as necessary to let it be as efficient as possible. Once the model is cached in system RAM the gains will be smaller, but for iterating on versions of big models (lately I have been testing hybrid layer quants on Llama Scout, for example) it will remove a huge amount of the time needed to load and test new versions.
Ah, right, I forgot about this feature, thanks. Will rework the patch with that approach.
I guess we can switch to the 128-bit version of the FNV hash and bump the version of the RPC protocol; I don't think we need anything bigger than 128 bits here. As for the hash function, I think that a non-cryptographic hash like FNV is fine since we use it for caching purposes. If we go with 128-bit hashes, then I guess we should store them as hex strings in the GGUF metadata, right?
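For reference, a rough sketch of what the 128-bit variant and the hex-string encoding could look like, using the published FNV-1a 128-bit parameters and the GCC/Clang `__int128` extension; this is an illustration of the scheme being discussed, not code from the PR:

```c
// FNV-1a 128-bit sketch: hash the tensor bytes, then format the result as a
// 32-character hex string (the form that would go into the GGUF KV metadata).
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 uint128_t; // GCC/Clang extension

// FNV-1a 128-bit parameters:
//   offset basis = 0x6c62272e07bb014262b821756295c58d
//   prime        = 2^88 + 2^8 + 0x3b = 0x0000000001000000000000000000013b
static uint128_t fnv1a_128(const uint8_t * data, size_t len) {
    const uint128_t prime = ((uint128_t)0x0000000001000000ULL << 64) | 0x000000000000013BULL;
    uint128_t hash        = ((uint128_t)0x6C62272E07BB0142ULL << 64) | 0x62B821756295C58DULL;
    for (size_t i = 0; i < len; i++) {
        hash ^= data[i];
        hash *= prime;
    }
    return hash;
}

// format as a 32-char lowercase hex string (high 64 bits first)
static void fnv1a_128_hex(uint128_t h, char out[33]) {
    snprintf(out, 33, "%016llx%016llx",
             (unsigned long long)(uint64_t)(h >> 64), (unsigned long long)(uint64_t)h);
}

int main(void) {
    const uint8_t sample[] = { 0xde, 0xad, 0xbe, 0xef }; // stand-in for tensor data
    char hex[33];
    fnv1a_128_hex(fnv1a_128(sample, sizeof(sample)), hex);
    printf("%s\n", hex); // e.g. the value stored under a per-tensor KV key
    return 0;
}
```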
This patch proposes a new backend API for loading a tensor's data from a precomputed hash stored in the model KV.
The main use case is to allow faster model load times with the RPC backend when multiple hosts are used for distributed LLM inference.
When the model is being loaded with `mmap` and there is a precomputed hash available for the current tensor, we can skip reading the actual data and ask the backend to load the data with the specified hash. The RPC backend already supports this with the `RPC_CMD_SET_TENSOR_HASH` command (which may be renamed to `RPC_CMD_LOAD_TENSOR`).

In this PoC I have modified `llama-gguf-hash` to generate a new GGUF file with a `.rpc` suffix which has FNV-1a hashes for all of the tensors.

With `rpc-server` running on localhost I am seeing huge improvements in model load time, e.g.:

`gemma-3-4b-it-q4_0.gguf` - 5052.50 ms
`gemma-3-4b-it-q4_0.gguf.rpc` - 2148.51 ms

I will do more testing on a local LAN soon.
I understand that making API changes for one particular use case is generally not a good idea, but there is a significant performance gain here.
Let me know what you think.
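To make the proposed load path concrete, here is a minimal sketch of the per-tensor decision during an mmap-based load. The helpers `get_precomputed_hash` and `load_tensor_by_hash` are placeholders, not the PR's actual functions; `ggml_backend_tensor_set` is the existing ggml API used on the fallback path.

```c
// Illustrative per-tensor load path: if the model KV carries a precomputed
// hash and the backend can resolve it, skip sending the mmap'd bytes;
// otherwise fall back to the regular upload. Helper functions are stubs.
#include <stdbool.h>
#include <stdint.h>
#include "ggml-backend.h"

// stub: a real implementation would read the hash from the GGUF KV metadata
static bool get_precomputed_hash(const char * tensor_name, uint64_t * out_hash) {
    (void) tensor_name; (void) out_hash;
    return false;
}

// stub: a real implementation would issue an RPC command carrying the hash
static bool load_tensor_by_hash(ggml_backend_t backend, struct ggml_tensor * t, uint64_t hash) {
    (void) backend; (void) t; (void) hash;
    return false;
}

static void upload_tensor(ggml_backend_t backend, struct ggml_tensor * t,
                          const void * mapped_data, size_t size) {
    uint64_t hash;
    if (get_precomputed_hash(t->name, &hash) && load_tensor_by_hash(backend, t, hash)) {
        return; // the server already had the data: nothing sent over the wire
    }
    // existing path: upload the mmap'd bytes to the backend
    ggml_backend_tensor_set(t, mapped_data, 0, size);
}
```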