TPI-LLM (Tensor Parallelism Inference for Large Language Models) is an LLM serving system designed to bring LLM functions to low-resource edge devices. While cloud LLM services have achieved great success, they raise privacy concerns: users may not want their conversations uploaded to the cloud, as these conversations could involve sensitive personal information.
Our TPI-LLM system addresses the privacy issue by enabling LLM inference on edge devices with limited computing and memory resources. The system leverages multiple edge devices to perform inference through tensor parallelism, combined with a sliding window memory scheduler to reduce peak memory footprint. Currently, TPI-LLM can run Yi-34B in full precision on 4 laptops with 5GB of memory on each laptop, and run Llama 2-70B on 8 devices with 3GB of memory on each device. Furthermore, TPI-LLM has demonstrated 80%-90% lower TTFT (time to first token) and token latency compared to Transformers, Accelerate, and Galaxy, and 43%-55% lower compared to llama.cpp on larger models (>13B).
Note: All computations ran in full precision solely on CPUs, except for llama.cpp, which used Apple Metal Graphics and Q8 quantization for acceleration.
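To illustrate how a sliding memory window can bound peak memory, here is a minimal sketch. It is not TPI-LLM's scheduler: the function name `run_with_memory_window` and the `load`, `compute`, and `release` callbacks are hypothetical, and any overlap between loading and computation is left out. The only idea shown is that at most `window` layers stay resident at any time.

```python
from collections import deque


def run_with_memory_window(num_layers, window, load, compute, release):
    """Run layers in order while keeping at most `window` layers loaded.

    Hypothetical helper for illustration only; returns the peak number
    of layers that were resident at the same time.
    """
    if window < 2:
        raise ValueError("window should be at least 2")
    resident = deque()  # indices of layers whose weights are in memory
    peak = 0
    for layer in range(num_layers):
        if len(resident) == window:
            # The window is full: evict the oldest layer before loading.
            release(resident.popleft())
        load(layer)
        resident.append(layer)
        peak = max(peak, len(resident))
        compute(layer)
    while resident:
        release(resident.popleft())
    return peak


if __name__ == "__main__":
    log = []
    peak = run_with_memory_window(
        num_layers=6,
        window=2,
        load=lambda i: log.append(f"load {i}"),
        compute=lambda i: log.append(f"compute {i}"),
        release=lambda i: log.append(f"release {i}"),
    )
    print(peak, log)
```

With this scheme the peak footprint depends on the window size rather than on the total number of layers, at the cost of reloading weights on every pass.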
- Clone this repo and enter the project folder.
- Add PYTHONPATH to .bashrc:
> vim ~/.bashrc
export PYTHONPATH=<PATH-TO-TPI-LLM>/src
- Create a new conda environment and install dependencies:
> conda create -n tpi-llm python=3.9
> conda activate tpi-llm
(tpi-llm) > pip install -r requirements.txt
We provide Docker images for TPI-LLM, available on Docker Hub. This is the easiest way to get started, but running inside a container may slow down inference.
If the container is a master node, use docker cp <HOST_MODEL_PATH> master:/root/TPI-LLM/ to copy the pretrained model files to the container of the master node.
If you prefer to build the Docker image yourself, you can modify and use the provided Dockerfile in our repo.
> docker build -t tpi-llm:local .
> docker run -dit --name master tpi-llm:local
To get started, you’ll need to download the pretrained model weights from Hugging Face:
- Llama 2 series, for example, Meta/Llama-2-7b-hf
- Llama 3 series, for example, Meta/Llama-3-8b
- Llama 3.1 series, for example, Meta/Llama-3.1-8b-Instruct
- 01 AI Yi series, for example, chargoddard/Yi-34B-Llama
Please make sure that the downloaded weight files conform to the HuggingFace format.
After downloading, save the model files in a directory of your choice, which we’ll refer to as /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft.
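Before running anything, it can help to sanity-check the model directory. The sketch below is an assumption about what "Hugging Face format" means in practice (a `config.json` plus at least one `*.safetensors` or `*.bin` weight file); the helper name `missing_checkpoint_parts` is hypothetical, and TPI-LLM may require additional files.

```python
from pathlib import Path


def missing_checkpoint_parts(model_path):
    """Return a list of things that appear to be missing from model_path.

    Assumed layout (not confirmed by TPI-LLM): config.json plus at least
    one *.safetensors or *.bin weight file in the top-level directory.
    """
    root = Path(model_path)
    problems = []
    if not (root / "config.json").is_file():
        problems.append("config.json")
    has_weights = any(root.glob("*.safetensors")) or any(root.glob("*.bin"))
    if not has_weights:
        problems.append("weight files (*.safetensors or *.bin)")
    return problems


if __name__ == "__main__":
    path = "/root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft"
    problems = missing_checkpoint_parts(path)
    print("looks OK" if not problems else f"missing: {problems}")
```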
Run the example script for a trial:
> python examples/run_multiprocess.py --world_size 4 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4
This command will run 4 processes on a single machine, creating a pseudo-distributed environment that leverages tensor parallelism for Llama inference.
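As a rough picture of what tensor parallelism means here, the sketch below splits one matrix-vector product across `world_size` simulated workers and sums their partial results, which is the role an all-reduce plays in a real distributed run. It uses plain Python lists and hypothetical helper names; how TPI-LLM actually partitions its layers is not described in this README.

```python
def matvec(x, w):
    """Multiply vector x (length n) by matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def shard(seq, world_size):
    """Split a sequence into world_size contiguous, equally sized shards."""
    if len(seq) % world_size != 0:
        raise ValueError("length must be divisible by world_size")
    step = len(seq) // world_size
    return [seq[r * step:(r + 1) * step] for r in range(world_size)]


def tensor_parallel_matvec(x, w, world_size):
    """Row-parallel product: each worker holds a slice of x and of w's rows."""
    x_shards = shard(x, world_size)
    w_shards = shard(w, world_size)
    # Each simulated worker computes a partial output from its own shard.
    partials = [matvec(xs, ws) for xs, ws in zip(x_shards, w_shards)]
    # Stand-in for all-reduce: element-wise sum of the partial outputs.
    return [sum(vals) for vals in zip(*partials)]


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    w = [[1, 0], [0, 1], [1, 1], [2, 0]]
    print(matvec(x, w), tensor_parallel_matvec(x, w, world_size=4))
```

Each worker only needs its own slice of the weights in memory, which is why adding devices lowers the per-device footprint.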
First-Time Setup:
If this is your first time running the task, the master node will automatically slice the pretrained weight files. Assuming 4 nodes in total (including the master node), the model directory with the sliced weight files should look like this:
> ls <PATH-TO-MODEL-FILES>
|- config.json
|- model-00001-of-00004.safetensors
|- model-00002-of-00004.safetensors
|- model-00003-of-00004.safetensors
|- model-00004-of-00004.safetensors
|- model.safetensors.index.json
|- ...
|- split/
|--- node_0
|--- node_1
|--- node_2
|--- node_3
Subsequent Runs:
For subsequent runs, the sliced model weight files can be reused. Alternatively, include the --split_bin option to re-split them.
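The reuse-or-re-split choice can be pictured as a small check on the directory layout shown above. This is a sketch under assumptions: the helper `needs_split` is hypothetical, and it only looks for the `split/node_<rank>` directories from the listing; TPI-LLM's own logic may check more than that.

```python
from pathlib import Path


def needs_split(model_path, world_size, save_dir="split", split_bin=False):
    """Return True if the weights should be (re-)sliced.

    Assumption for illustration: slices are reusable when every
    <model_path>/<save_dir>/node_<rank> directory exists.
    """
    if split_bin:
        return True  # the user explicitly asked for a re-split
    split_root = Path(model_path) / save_dir
    return not all(
        (split_root / f"node_{rank}").is_dir() for rank in range(world_size)
    )


if __name__ == "__main__":
    print(needs_split("/root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft", 4))
```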
Assume we have 2 laptops with IP addresses as follows:
IP of host 1: 192.168.2.1 (master node)
IP of host 2: 192.168.2.2 (worker node)
The master node acts as the task publisher: it initiates the prompt and displays the generated text to users. It also slices the pretrained weight files and serves as a file server that distributes the sliced files to the worker nodes.
Step 1: To launch the master node, run the following command on laptop 1:
# Run the master node on laptop 1 (IP: 192.168.2.1, RANK = 0)
> python examples/run_multihost.py --rank 0 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4
NOTE: Please make sure that all other nodes can connect to the master node. The master node also participates in tensor-parallel inference.
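One way to check connectivity from a worker is a plain TCP connection attempt to the master's address. The helper below is a hypothetical sketch, not part of TPI-LLM; it only succeeds once the master process is already listening on the given port (29500 in the example above).

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print(can_reach("192.168.2.1", 29500))
```

A failed check can also mean a firewall is blocking the port, so it is worth testing the file server port as well.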
Step 2: To launch other worker nodes, use the following command on other laptops (e.g., laptop 2):
# Run the worker node on laptop 2 (IP: 192.168.2.2, RANK = 1)
> python examples/run_multihost.py --rank 1 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/sync --memory_window 4
The worker nodes will automatically download their weight files from the master node. If you have downloaded the files before, you can use the option --force_download to force a re-download.
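With more than two hosts, the per-rank commands differ only in `--rank` and in the master-only arguments. The sketch below generates them using only the flags shown above; the function `multihost_commands` and its parameter names are hypothetical conveniences, not part of the repo.

```python
import shlex


def multihost_commands(world_size, master_ip, model_path, worker_model_path,
                       prompt, master_port=29500, length=20, memory_window=4):
    """Build one launch command per rank; rank 0 is the master node."""
    commands = []
    for rank in range(world_size):
        args = [
            "python", "examples/run_multihost.py",
            "--rank", str(rank),
            "--world_size", str(world_size),
            "--master_ip", master_ip,
            f"--master_port={master_port}",
            "--model_type", "llama",
        ]
        if rank == 0:
            # Only the master holds the full weights and the prompt.
            args += ["--model_path", model_path,
                     "--prompt", prompt, "--length", str(length)]
        else:
            # Workers point at the directory that receives their slices.
            args += ["--model_path", worker_model_path]
        args += ["--memory_window", str(memory_window)]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in multihost_commands(
        world_size=2,
        master_ip="192.168.2.1",
        model_path="/root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft",
        worker_model_path="/root/TPI-LLM/pretrained_models/sync",
        prompt="how are you?",
    ):
        print(cmd)
```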
TPI-LLM provides several optional parameters that you can customize to control various aspects of the inference process. Below is a list of these options:
Argument | Default | Type | Description |
---|---|---|---|
--prompt | "" | str | The input prompt. |
--length | 20 | int | Maximum length of the generated sequence. |
--prefix | "" | str | Text added prior to input for context. |
--split_bin | False | bool | Split the pretrained model file. (available only on the master node) |
--save_dir | "split" | str | The directory to save split model files. |
--seed | 42 | int | Random seed for reproducibility. |
--file_port | 29600 | int | Port number on the master node where the file server is listening. |
--force_download | False | bool | Force worker nodes to re-download model weight slices. (available only on the non-master node) |
--temperature | 1.0 | float | Sampling temperature for text generation. (available only on the master node) |
--k | 0 | int | Number of highest probability tokens to keep for top-k sampling. (available only on the master node) |
--p | 0.9 | float | Cumulative probability for nucleus sampling (top-p). (available only on the master node) |
--disable_memory_schedule | False | bool | Set to True to disable memory window scheduling; this may lead to higher speed. |
--memory_window | 2 | int | Size of the memory window used during inference. Should be at least 2. |
--torch_dist | False | bool | Whether to use torch.distributed. |
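To show how `--temperature`, `--k`, and `--p` typically interact, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p filtering. It is a generic illustration, not TPI-LLM's sampling code: the function names are hypothetical, and treating `k = 0` as "top-k disabled" is an assumption.

```python
import math
import random


def sampling_distribution(logits, temperature=1.0, k=0, p=0.9):
    """Return {token_id: probability} after temperature, top-k, and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank tokens from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k > 0:  # assumption: k = 0 leaves top-k filtering off
        order = order[:k]
    mass = sum(probs[i] for i in order)

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= p:
            break
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


def sample_token(logits, rng, **kwargs):
    """Draw one token id from the filtered distribution."""
    dist = sampling_distribution(logits, **kwargs)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]


if __name__ == "__main__":
    rng = random.Random(42)  # mirrors the default --seed value
    print(sample_token([2.0, 1.0, 0.0], rng, temperature=1.0, k=0, p=0.9))
```

Lower temperatures sharpen the distribution, while smaller `k` or `p` values restrict sampling to fewer candidate tokens.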