Skip to content

Latest commit

 

History

History
149 lines (109 loc) · 5.25 KB

README.md

File metadata and controls

149 lines (109 loc) · 5.25 KB

Fast-LLaMA: A High-Performance Inference Engine

image

Descriptions

fast-llama is a super high-performance inference engine for LLMs like LLaMA (2.5x of llama.cpp) written in pure C++. It can run a 8-bit quantized LLaMA2-7B model on a cpu with 56 cores in speed of ~25 tokens / s. It outperforms all current open-source inference engines, especially when compared to the renowned llama.cpp, with ~2.5 times better inference speed on a CPU.

Features

Feature Name Current Support Future Suport
Model Types ✅LLaMA2 Others LLMs like Baichuan, StableDiffusion
Quantization ✅INT16, ✅INT8 INT4
Model Formats ✅HuggingFace, ✅gguf(by llama.cpp), ✅flm
Systems ✅Linux, ✅Windows Macbook, Android, iOS
CPU/GPU ✅X86/64 CPU ARM, Apple Mx CPUs, GPU, CPU+GPU
Architectures ✅UMA, ✅NUMA

Advantages

Why you should use Fast-LLaMA?

  • Fast
    • Extremely fast on CPU. Faster than any other engines on Github including llama.cpp.
  • Simple
    • Totally less than 7k lines of C++ codes with well-orgnized code structures and no dependencies except NUMA (if needed for multi-cpus).
  • "Easy To Use" (target ☺️

Quick Start

Compile

Only Linux is supported currently. Support of other platforms including Windows, Mac, GPU is coming soon.

Requirements

  • GCC 10.x or newer versions
  • libnuma-dev if your computer has more than one physical CPUs
    • Linux Kernel v5.x or higher is needed for NUMA

Compilation

Method 1. Using the provided build script:

bash ./build.sh

Method 2. Using Make:

make -j 4

Run

1. Run with llama2.c models:

Step 1: Download a model

See llama2.c

Step 2: Run the model

./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'
image

2. Run with hugging face format models

Step 1: Download a model

See Chinese-LLaMA-Alpaca-2

Step 2: Convert the model info FLM format

python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8

Step 3: Run the model

./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'

image

All supported command-line options are as follows:

  • -c: Path to the model file
  • -f: Model file format (e.g., gguf)
  • -j: Number of threads to use (e.g., 56)
  • -q: Quantization mode (e.g., int8)
  • -n: Number of tokens to generate (e.g., 200)
  • -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
  • -h: show usage information

Performance

Below are some incomplete test results

Testing Result:

Model Model Size OutputSpeed/8 threads OutputSpeed/28 threads OutputSpeed/56 threads
stories110M 110M 237tps 400tps 440tps
Chinese-LLaMA-1.3B 1.3B 38.9tps 127tps 155tps
Chinese-LLaMA-7B 7B 7.4tps 17.4tps 23.5tps
  • Note: tps = tokens / second

Testing Conditions

  • Testing Prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
  • Quantization: int8
  • NUMA: 2 sockets
    • Note: Make sure that NUMA is truely available if you expect to accelerate with NUMA)
  • System: (uname -a)Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • CPU: 56 physical cores, AVX-512
Architecture:            x86_64
Model name:              Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
CPU(s):                  112 (56 physical cores)
Thread(s) per core:      2
Core(s) per socket:      28
Socket(s):               2

2a58bda471f0aa2770f349dba73a530d

Latancy of first token will be optimized laterly.

Why

Why is it so fast?

  • Ultimate memory efficiency
    • Zero memory allocations and frees during inferencing.
    • Maximization of memory locality.
  • Well-designed thread scheduling algorithm
  • Optimized operators
    • Fuse all operators that can be fused together
    • Optmize calculation of several operators
  • Proper Quantizations

License

fast-llama is licensed under the MIT.

Acknowledgements

Special thanks to AlpinDale for his professional, meticulous, and patient guidance and assistance.

Contact

Email: 📩topcoderlsf@gmail.com

Contact me if you any questions.