How to use the gguf-split / Model sharding demo #6404
- Maybe I am wrong, but I couldn't make …
- On Windows you can compile llama.cpp by opening a VS native tools command prompt (i.e. …
- @dranger003 @phymbert may I ask how to compile gguf-split on Mac? (llamacpp) taozhiyu@603e5f4a42f1 llama.cpp-master % gguf-split …
- Please also include more clear and specific instructions for --merge?
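Regarding the --merge question above: a minimal sketch of merging shards back into a single file, assuming gguf-split's --merge mode; the shard and output file names are hypothetical:

```shell
# Merge mode: pass the first shard and the desired output path;
# the remaining shards are found from the metadata in the first file
./gguf-split --merge my-model-00001-of-00009.gguf my-model-merged.gguf
```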
Context
Distributing and storing GGUFs is difficult for 70B+ models, especially in F16. Lots of issues can happen during file transfers, for example:
- Typically, GGUFs need to be transferred from Hugging Face to an internal storage like S3, MinIO, Git LFS, Nexus, or Artifactory, then downloaded by the inference server and stored locally (or on a Kubernetes PVC, for example).
- Storage solutions and filesystems poorly support large GGUF files; typically, HF does not support files larger than 50GB.
- Such limits also exist on Artifactory.
Solution
We recently introduced the gguf-split CLI and added support for loading sharded GGUF models in llama.cpp:
Download a model
Convert to GGUF F16
```shell
python -u convert-hf-to-gguf.py \
  ~/.cache/huggingface/hub/models--keyfan--grok-1-hf/snapshots/64e7373053c1bc7994ce427827b78ec11c181b3e/ \
  --outfile grok-1-f16.gguf \
  --outtype f16
```
NOTE: Follow the llama.cpp build instructions (e.g. `make`) to generate all tools/CLIs.
Quantize (optional)
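A minimal sketch of the optional quantization step, assuming the quantize binary produced by the build; Q4_K_M is an illustrative type choice, not one mandated by the original post:

```shell
# Assumes the quantize binary was built via `make`
# Q4_K_M is an illustrative quantization type; any supported ggml type works
./quantize grok-1-f16.gguf grok-1-Q4_K_M.gguf Q4_K_M
```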
Build model shards
It is possible to use different sharding strategies:
- `--split-max-tensors 256`
- `--split-max-size 48G`
With `--split-max-tensors 256`, it will produce 9 files with at most 256 tensors in each; a sketch of the invocation follows.
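The `--split-max-tensors` and `--split-max-size` options come from the post above; the output prefix here is an assumption, with gguf-split appending the -0000N-of-0000M.gguf suffix to each shard:

```shell
# Split on tensor count; grok-1-f16 is the output prefix, producing
# grok-1-f16-00001-of-00009.gguf ... grok-1-f16-00009-of-00009.gguf
./gguf-split --split --split-max-tensors 256 grok-1-f16.gguf grok-1-f16
```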
You can then upload the sharded model to your HF Repo:
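A minimal sketch of the upload, assuming the huggingface-cli upload command; the target repo name is hypothetical:

```shell
# <user>/grok-1-gguf is a hypothetical target repo;
# uploads all shards from the current directory
huggingface-cli upload <user>/grok-1-gguf . . --include "grok-1-f16-*.gguf"
```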
Files produced by gguf-split are valid GGUFs, so you can visualize them on HF.
Load sharded model
llama_load_model_from_file will detect the number of files and load the additional tensors from the remaining files, as in the sketch below.
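A minimal sketch of loading a sharded model from the command line; only the first shard is passed, and the prompt is an arbitrary example:

```shell
# Pass only the first shard; the loader detects and reads the rest
./main -m grok-1-f16-00001-of-00009.gguf -p "Hello"
```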
Load sharded model from a remote URL
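A minimal sketch of loading from a remote URL, assuming the --model-url option available in llama.cpp builds with libcurl support; the repo URL is hypothetical, pointing at the first shard:

```shell
# Requires a build with libcurl support (LLAMA_CURL=1)
./server --model-url https://huggingface.co/<user>/grok-1-gguf/resolve/main/grok-1-f16-00001-of-00009.gguf
```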