
Using llama.cpp for Quantization and Local Deployment

Using the llama.cpp tool as an example, this guide walks through the detailed steps for model quantization and local deployment. On Windows, additional tools such as cmake may be required. For a quick local deployment experience, it is recommended to use the instruction-tuned Llama-3-Chinese-Instruct model with 6-bit or 8-bit quantization. Before proceeding, ensure that:

  1. Your system has make (included with macOS/Linux) or cmake (Windows users must install it separately).
  2. Python 3.10 or higher is recommended for compiling and running the tool.
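
As a quick sanity check, you can print the versions of the build tool and Python before continuing (a minimal sketch; use cmake --version instead of make --version on Windows):

$ make --version
$ python3 --version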

Step 1: Clone and Compile llama.cpp

⚠️ llama.cpp introduced breaking changes to the Llama-3 pre-tokenizer. Please pull the latest code!

  1. (Optional) If you have an older version of the repository downloaded, it is recommended to run git pull to fetch the latest code and then make clean to remove previous build artifacts.
  2. Clone the latest version of the llama.cpp repository:
$ git clone https://github.com/ggerganov/llama.cpp
  3. Compile the llama.cpp project to generate the ./main (for inference) and ./quantize (for quantization) binaries:
$ make

For Windows/Linux users who want GPU inference, it is recommended to compile with BLAS (or cuBLAS if you have an NVIDIA GPU) to improve prompt-processing speed. Below is the command for compiling with cuBLAS, suitable for NVIDIA GPUs; a cmake-based alternative for Windows is sketched after it. Refer to: llama.cpp#blas-build

$ make LLAMA_CUDA=1
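
For Windows users who build with cmake instead of make, an equivalent CUDA-enabled build might look like the following. This is a sketch, and option names can change between llama.cpp versions; note that cmake typically places the resulting binaries under build/bin rather than the repository root.

$ cmake -B build -DLLAMA_CUDA=ON
$ cmake --build build --config Release -j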

For macOS users, no extra steps are necessary; llama.cpp is already optimized for ARM NEON, and BLAS is enabled automatically. For M-series chips, it is recommended to enable GPU inference with Metal to significantly increase speed: simply change the compile command to LLAMA_METAL=1 make. Refer to llama.cpp#metal-build

$ LLAMA_METAL=1 make

Step 2: Generate a Quantized Model Version

💡 You can also directly download pre-quantized GGUF models from: Download Link

Currently, llama.cpp supports converting .safetensors files and Hugging Face format .bin files to FP16 GGUF format.

# convert the Hugging Face model directory to FP16 GGUF (written as ggml-model-f16.gguf inside the model directory)
$ python convert-hf-to-gguf.py llama-3-chinese-8b-instruct
# quantize the FP16 GGUF to 4-bit (q4_0)
$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q4_0.gguf q4_0
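
If you prefer the 6-bit or 8-bit quantization recommended above, the same ./quantize binary accepts those types as well, for example q6_k or q8_0 (running ./quantize with no arguments prints the list of types supported by your build):

$ ./quantize llama-3-chinese-8b-instruct/ggml-model-f16.gguf llama-3-chinese-8b-instruct/ggml-model-q8_0.gguf q8_0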

Step 3: Load and Start the Model

Since the project's Llama-3-Chinese-Instruct uses the original Llama-3-Instruct instruction template, first copy the project's scripts/llama_cpp/chat.sh to the root directory of llama.cpp. The contents of chat.sh are shown below; the script embeds the chat template and some default parameters, which can be modified as needed.
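
For example, assuming the project repository has been cloned next to llama.cpp (the path below is purely illustrative and should be adjusted to your setup):

$ cp ../Chinese-LLaMA-Alpaca-3/scripts/llama_cpp/chat.sh .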

  • For GPU inference: when compiled with cuBLAS/Metal, specify the number of layers to offload in ./main, e.g., -ngl 40 to offload 40 layers of model parameters to the GPU.
  • (New) Enable FlashAttention: specify -fa to accelerate inference (support depends on the computing device); see the quick test after the script below.
FIRST_INSTRUCTION=$2
SYSTEM_PROMPT="You are a helpful assistant. 你是一个乐于助人的助手。"

./main -m $1 --color -i \
-c 0 -t 6 --temp 0.2 --repeat_penalty 1.1 -ngl 999 \
-r '<|eot_id|>' \
--in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' \
--in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
-p "<|start_header_id|>system<|end_header_id|>\n\n$SYSTEM_PROMPT<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n$FIRST_INSTRUCTION<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

Use the following command to start chatting.

$ chmod +x chat.sh
$ ./chat.sh ggml-model-q4_0.gguf 你好

Enter your prompt after the > symbol, use cmd/ctrl+c to interrupt output, and end multi-line messages with a \. For help and parameter explanations, execute the ./main -h command.

For more detailed official instructions, please refer to: https://github.com/ggerganov/llama.cpp/tree/master/examples/main
