TPI-LLM: Serving 70b-scale LLMs Efficiently on Low-resource Edge Devices

Lizonghang/TPI-LLM

TPI-LLM (Tensor Parallelism Inference for Large Language Models) is an LLM serving system designed to bring LLM functions to low-resource edge devices. Cloud LLM services have achieved great success, but they raise privacy concerns: users may not want their conversations uploaded to the cloud, since those conversations can contain sensitive personal information.

Our TPI-LLM system addresses the privacy issue by enabling LLM inference on edge devices with limited computing and memory resources. The system leverages multiple edge devices to perform inference through tensor parallelism, combined with a sliding window memory scheduler to reduce peak memory footprint. Currently, TPI-LLM can run Yi-34B in full precision on 4 laptops with 5GB of memory each, and Llama 2-70B on 8 devices with 3GB of memory each. Furthermore, TPI-LLM has demonstrated 80%-90% lower time-to-first-token (TTFT) and token latency than Transformers, Accelerate, and Galaxy, and 43%-55% lower than llama.cpp on larger models (>13B).

Note: All computations ran in full precision solely on CPUs, except for llama.cpp, which used Apple Metal Graphics and Q8 quantization for acceleration.
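To make the sliding window idea above more concrete, here is a minimal Python sketch. It is not TPI-LLM's actual scheduler: the class name `SlidingWindowLoader`, the `load_fn` and `apply_fn` callbacks, and the synchronous prefetch are all assumptions made for illustration. It only shows the general principle of keeping a bounded number of layer blocks in memory while walking through the model.

```python
from collections import OrderedDict


class SlidingWindowLoader:
    """Illustrative only: keep at most `window` layer blocks resident at once."""

    def __init__(self, load_fn, window=2):
        if window < 2:
            raise ValueError("window should be at least 2")
        self.load_fn = load_fn         # hypothetical: loads the weights of one block
        self.window = window
        self.resident = OrderedDict()  # block index -> weights, oldest first
        self.peak = 0                  # largest number of blocks held at once

    def _ensure(self, idx):
        """Load block `idx` if needed, evicting the oldest blocks first."""
        if idx not in self.resident:
            while len(self.resident) >= self.window:
                self.resident.popitem(last=False)
            self.resident[idx] = self.load_fn(idx)
            self.peak = max(self.peak, len(self.resident))
        return self.resident[idx]

    def run(self, x, num_blocks, apply_fn):
        """Apply blocks 0..num_blocks-1 in order, loading one block ahead."""
        for idx in range(num_blocks):
            weights = self._ensure(idx)
            if idx + 1 < num_blocks:
                self._ensure(idx + 1)  # a real system would likely do this in the background
            x = apply_fn(x, weights)
        return x


# Toy usage: each "block" is a single number that gets added to the input.
loader = SlidingWindowLoader(load_fn=lambda i: i + 1, window=2)
result = loader.run(0, num_blocks=5, apply_fn=lambda x, w: x + w)
print(result, loader.peak)
```

In this toy run, five blocks are processed while never more than two are held at the same time, which is the effect a small `--memory_window` is meant to have on peak memory.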

Installation

Use the Source Code

  1. Clone this repo and enter the project folder.

  2. Add PYTHONPATH to .bashrc:

> vim ~/.bashrc
export PYTHONPATH=<PATH-TO-TPI-LLM>/src

  3. Create a new conda environment and install dependencies:

> conda create -n tpi-llm python=3.9
> conda activate tpi-llm
(tpi-llm) > pip install -r requirements.txt

Using Pre-built Docker Image

We provide Docker images for TPI-LLM, available on Docker Hub. This is the easiest way to get started, but running inside a container may slow down inference.

If the container is the master node, use docker cp <HOST_MODEL_PATH> master:/root/TPI-LLM/ to copy the pretrained model files into the master node's container.

Build from Dockerfile

If you prefer to build the Docker image yourself, you can modify and use the provided Dockerfile in our repo.

> docker build -t tpi-llm:local .
> docker run -dit --name master tpi-llm:local

How to Use?

Download Pretrained Model Weights

To get started, you’ll need to download the pretrained model weights from Hugging Face:

Please make sure that the downloaded weight files conform to the Hugging Face format.

After downloading, save the model files in a directory of your choice, which we’ll refer to as /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft.

Run on Your Laptop

Run the example script for a trial:

> python examples/run_multiprocess.py --world_size 4 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4

This command will run 4 processes on a single machine, creating a pseudo-distributed environment that leverages tensor parallelism for Llama inference.
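As a rough illustration of what tensor parallelism means here, the sketch below splits one weight matrix column-wise across `world_size` ranks, lets each rank compute its slice of the output, and then concatenates the slices. The function names are hypothetical and the code uses plain Python lists; it is not how TPI-LLM partitions Llama weights, only the underlying idea.

```python
def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, world_size):
    """Give each rank a contiguous slice of the output columns."""
    cols = len(w[0])
    if cols % world_size != 0:
        raise ValueError("column count must be divisible by world_size")
    step = cols // world_size
    return [[row[r * step:(r + 1) * step] for row in w] for r in range(world_size)]


def tensor_parallel_matvec(x, w, world_size):
    """Each rank multiplies by its own shard; the results are then gathered."""
    shards = split_columns(w, world_size)
    partials = [matvec(x, shard) for shard in shards]  # done in parallel in practice
    return [value for part in partials for value in part]


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(tensor_parallel_matvec(x, w, world_size=4))  # [11, 14, 17, 20]
```

Each rank only needs its own shard of `w` in memory, which is why spreading a model over more processes or devices lowers the memory needed per device.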

First-Time Setup:

If this is your first time running the task, the master node will automatically slice the pretrained weight files. With 4 nodes in total (including the master node), the sliced weight files should look like this:

> ls <PATH-TO-MODEL-FILES>
|- config.json
|- model-00001-of-00004.safetensors
|- model-00002-of-00004.safetensors
|- model-00003-of-00004.safetensors
|- model-00004-of-00004.safetensors
|- model.safetensors.index.json
|- ...
|- split/
|--- node_0
|--- node_1
|--- node_2
|--- node_3
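If you want to confirm that slicing finished, a small check like the following may help. It is a sketch that assumes the layout shown above (a `split/` folder containing one `node_<rank>` entry per node); the helper name `missing_splits` is hypothetical and not part of TPI-LLM.

```python
from pathlib import Path


def missing_splits(model_path, world_size, save_dir="split"):
    """Return the node_<rank> entries that are absent under <model_path>/<save_dir>."""
    root = Path(model_path) / save_dir
    return [
        f"node_{rank}"
        for rank in range(world_size)
        if not (root / f"node_{rank}").exists()
    ]
```

For example, `missing_splits("/root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft", 4)` returns an empty list when all four slices are present.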

Subsequent Runs:

For subsequent runs, the sliced model weight files are reused. To re-split them, include the --split_bin option.

Run on Multiple Hosts

Assume we have 2 laptops with IP addresses as follows:

IP of host 1: 192.168.2.1 (master node)
IP of host 2: 192.168.2.2 (worker node)

The master node acts as the task publisher: it initiates the prompt and displays the generated text to users. It also slices the pretrained weight files and serves as a file server that distributes the sliced files to the worker nodes.

Step 1: To launch the master node, run the following command on laptop 1:

# Run the master node on laptop 1 (IP: 192.168.2.1, RANK = 0)
> python examples/run_multihost.py --rank 0 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft --prompt "how are you?" --length 20 --memory_window 4

NOTE: Please make sure the master node is reachable from all other nodes. The master node also participates in tensor-parallel inference.
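One way to test reachability from a worker is a plain TCP connection attempt, sketched below with Python's standard `socket` module. The helper name `can_reach` is hypothetical, and the check only succeeds once a process on the master is actually listening on the given port, so run it after the master has been started.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For instance, `can_reach("192.168.2.1", 29500)` checks the master port used in this example. A `False` result usually points to a firewall rule, a wrong IP address, or a master node that is not running yet.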

Step 2: To launch other worker nodes, use the following command on other laptops (e.g., laptop 2):

# Run the worker node on host 2 (IP: 192.168.2.2, RANK = 1)
> python examples/run_multihost.py --rank 1 --world_size 2 --master_ip 192.168.2.1 --master_port=29500 --model_type llama --model_path /root/TPI-LLM/pretrained_models/sync --memory_window 4

The worker nodes will automatically download their weight files from the master node. If you have downloaded the files before, you can use the option --force_download to force a re-download.
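With more than two hosts, writing each command by hand gets error-prone. The sketch below assembles the command line for a given rank from the flags shown in the two steps above. The helper `build_command` is hypothetical and its defaults simply mirror this example; it only builds a string and does not launch anything.

```python
import shlex


def build_command(rank, world_size, master_ip, model_path, master_port=29500,
                  model_type="llama", memory_window=4, prompt=None, length=20):
    """Build the run_multihost.py command line for one node."""
    argv = [
        "python", "examples/run_multihost.py",
        "--rank", str(rank),
        "--world_size", str(world_size),
        "--master_ip", master_ip,
        f"--master_port={master_port}",
        "--model_type", model_type,
        "--model_path", model_path,
    ]
    # In the example above, only the master node (rank 0) is given a prompt.
    if rank == 0 and prompt is not None:
        argv += ["--prompt", prompt, "--length", str(length)]
    argv += ["--memory_window", str(memory_window)]
    return shlex.join(argv)  # quotes arguments that contain spaces


print(build_command(0, 2, "192.168.2.1",
                    "/root/TPI-LLM/pretrained_models/Llama-2-1.1b-ft",
                    prompt="how are you?"))
print(build_command(1, 2, "192.168.2.1", "/root/TPI-LLM/pretrained_models/sync"))
```

Every node must receive the same `--world_size`, `--master_ip`, and `--master_port`; only `--rank` and, in this example, `--model_path` differ between hosts.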

Other Arguments

TPI-LLM provides several optional parameters that you can customize to control various aspects of the inference process. Below is a list of these options:

| Argument | Default | Type | Description |
| --- | --- | --- | --- |
| --prompt | "" | str | The input prompt. |
| --length | 20 | int | Maximum length of the generated sequence. |
| --prefix | "" | str | Text added prior to input for context. |
| --split_bin | False | bool | Split the pretrained model file. (available only on the master node) |
| --save_dir | "split" | str | The directory to save split model files. |
| --seed | 42 | int | Random seed for reproducibility. |
| --file_port | 29600 | int | Port number on the master node that the file server listens on. |
| --force_download | False | bool | Force worker nodes to re-download model weight slices. (available only on the non-master node) |
| --temperature | 1.0 | float | Sampling temperature for text generation. (available only on the master node) |
| --k | 0 | int | Number of highest probability tokens to keep for top-k sampling. (available only on the master node) |
| --p | 0.9 | float | Cumulative probability for nucleus sampling (top-p). (available only on the master node) |
| --disable_memory_schedule | False | bool | Set to True to disable memory window scheduling. This may lead to higher speed. |
| --memory_window | 2 | int | Size of the memory window used during inference. Should be at least 2. |
| --torch_dist | False | bool | Whether to use torch.distributed. |
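The sampling options --temperature, --k, and --p follow common text-generation practice. The sketch below shows how such parameters are typically combined when picking the next token. It is a generic illustration in plain Python, not TPI-LLM's sampling code; the function name `sample_token` is hypothetical, and treating `k=0` as "top-k disabled" is an assumption based on the usual convention.

```python
import math
import random


def sample_token(logits, temperature=1.0, k=0, p=0.9, rng=random):
    """Pick a token index using temperature scaling, top-k, and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    # (temperature must be positive; 0 would divide by zero.)
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep only the k most likely tokens (assumed disabled when k == 0).
    if k > 0:
        ranked = ranked[:k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= p:
            break

    ids = [idx for _, idx in kept]
    weights = [prob for prob, _ in kept]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lower temperatures and smaller `k` or `p` values concentrate sampling on the most likely tokens, while higher values allow more varied output. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.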
