llamacpp_en
Using the llama.cpp tool as an example, we'll discuss the detailed steps for model quantization and local deployment. For Windows, additional tools like cmake may be required. For a quick local deployment experience, it is recommended to use the instruction-tuned Llama-3-Chinese-Instruct model with 6-bit or 8-bit quantization. Before proceeding, ensure that:
- Your system has make (included with macOS/Linux) or cmake (Windows users must install it separately).
- It is recommended to use Python 3.10 or higher for compiling and running the tool.
- (Optional) If you have an older version of the repository downloaded, it's recommended to git pull the latest code and run make clean to clean up previous builds.
- Pull the latest version of the llama.cpp repository:
$ git clone https://github.com/ggerganov/llama.cpp
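The compile, conversion, and quantization commands below are run from the repository root, so change into the cloned directory first:
$ cd llama.cpp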
- Compile the llama.cpp project to generate the ./main (for inference) and ./quantize (for quantization) binaries.
$ make
For Windows/Linux users, it's recommended to compile with BLAS (or cuBLAS if you have a GPU) to improve prompt processing speed. Below is the command for compiling with cuBLAS, suitable for NVIDIA GPUs. Refer to: llama.cpp#blas-build
$ make LLAMA_CUDA=1
For macOS users, no extra steps are necessary; llama.cpp is already optimized for ARM NEON, and BLAS is automatically enabled. For M-series chips, it's recommended to enable GPU inference with Metal to significantly increase speed by changing the compile command as follows. Refer to: llama.cpp#metal-build
$ LLAMA_METAL=1 make
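Optionally, before moving on, you can confirm that the two binaries were built, for example:
$ ls -l ./main ./quantize
$ ./main -h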
💡 You can also directly download pre-quantized GGUF models from: Download Link
Currently, llama.cpp supports converting .safetensors files and Hugging Face format .bin files to FP16 GGUF format.
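Before running the conversion script, make sure llama.cpp's Python dependencies are installed; assuming you are in the repository root, a typical way is:
$ pip install -r requirements.txt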
$ python convert-hf-to-gguf.py llama-3-chinese-8b-instruct
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q4_0.gguf q4_0
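The introduction above recommends 6-bit or 8-bit quantization for the Instruct model; the quantize binary accepts other type names as well, e.g., q8_0 for 8-bit (the output filename here is only illustrative):
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q8_0.gguf q8_0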
Since the project's Llama-3-Chinese-Instruct uses the original Llama-3-Instruct instruction template, first copy the project's scripts/llama_cpp/chat.sh to the root directory of llama.cpp. The contents of chat.sh are shown below; it embeds the chat template and some default parameters, which can be modified as needed.
- For GPU inference: when compiled with cuBLAS/Metal, specify the number of offloaded layers in ./main, e.g., -ngl 40 to offload 40 layers of model parameters to the GPU.
- (New) Enable FlashAttention: specify -fa to accelerate inference (depending on the computing device); see the example after the script below.
FIRST_INSTRUCTION=$2
SYSTEM_PROMPT="You are a helpful assistant. 你是一个乐于助人的助手。"
./main -m $1 --color -i \
-c 0 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 \
-r '<|eot_id|>' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' \
--in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
-p "<|start_header_id|>system<|end_header_id|>\n\n$SYSTEM_PROMPT<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$FIRST_INSTRUCTION<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Use the following command to start chatting.
$ chmod +x chat.sh
$ ./chat.sh ggml-model-q4_0.gguf 你好
Enter your prompt after the > symbol, use cmd/ctrl+c to interrupt output, and end multi-line messages with a \. For help and parameter explanations, execute the ./main -h command.
For more detailed official instructions, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main