Commit 631b253: Add llama2 text completion example with streaming response support
Naman Nandan committed Jul 28, 2023
1 parent: 71e1c2e
Showing 10 changed files with 221 additions and 262 deletions.

Deleted files include examples/large_models/inferentia2/llama/inf2_handler.py (167 deletions).

# Large model inference on Inferentia2

This document describes how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with streaming response support.

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.

Let's take a look at the steps to prepare the model for inference on Inf2 instances.

**Note**: To run on an Inf2 instance, the model is compiled as a preprocessing step, and the compilation generates the model graph for a specific batch size. When running inference, requests must use the same batch size that was used during compilation. This example uses a batch size of 1 to demonstrate real-time inference with streaming responses.

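Because the response is streamed, the custom handler pushes partial results back to the client as they are generated. The snippet below is a minimal sketch of that pattern using TorchServe's `send_intermediate_predict_response` API; it is illustrative only, not the actual `inf2_handler.py` from this example, and the surrounding function and variable names are assumptions.

```python
# Minimal sketch of streaming from a TorchServe handler (illustrative only,
# not the actual inf2_handler.py). Assumes TorchServe exposes
# send_intermediate_predict_response for pushing partial results.
from ts.protocol.otf_message_handler import send_intermediate_predict_response

def stream_generated_text(token_texts, context):
    """Send each decoded chunk to the client as soon as it is available."""
    for text in token_texts:
        send_intermediate_predict_response(
            [text], context.request_ids, "Intermediate prediction", 200, context
        )
    # The handler's final return value terminates the streamed response.
    return [""]
```
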
### Step 1: Inf2 instance

Get an Inf2 instance (this example was tested on the `inf2.24xlarge` instance type), ssh to it, and make sure to use the following DLAMI, as it comes with PyTorch and the necessary packages for the AWS Neuron SDK pre-installed.
DLAMI Name: `Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)`

### Step 2: Package installations

Follow the steps below to complete the package installations.

```bash
sudo apt-get update
sudo apt-get upgrade

# Update Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y

# Activate Python venv
source /opt/aws_neuron_venv_pytorch/bin/activate

# Clone the TorchServe git repository
git clone https://github.com/pytorch/serve.git
cd serve

# Install dependencies
python ts_scripts/install_dependencies.py --neuronx

# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Update Neuron Compiler, Framework and Transformers
python -m pip install --upgrade neuronx-cc torch-neuronx transformers-neuronx

# Install additional necessary packages
python -m pip install --upgrade transformers tokenizers sentencepiece
```

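Optionally, you can confirm that the Neuron devices on the instance are visible before moving on; `neuron-ls` ships with the Neuron tooling on this DLAMI (assuming the tools are on the default path).

```bash
# List the Inferentia2 devices and NeuronCores visible to the Neuron runtime
neuron-ls
```
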
### Step 3: Save the model split checkpoints compatible with `transformers-neuronx`

Log in to Hugging Face:

```bash
huggingface-cli login
```

Navigate to the `large_models/inferentia2/llama2` directory and run the following script:

```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```

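For context, this step loads the Hugging Face checkpoint and re-saves it in the split layout that `transformers-neuronx` expects. The snippet below is a minimal sketch of that idea, assuming the `save_pretrained_split` helper from `transformers-neuronx`; the bundled `inf2_save_split_checkpoints.py` utility may differ in its details.

```python
# Minimal sketch of checkpoint splitting (the bundled utility script may differ).
# Assumes transformers-neuronx provides the save_pretrained_split helper.
import torch
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
)
# Writes the weights split across files under the target directory
save_pretrained_split(model, "./llama-2-13b-split")
```
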
### Step 4: Generate the Tar/MAR file

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py --extra-files ./llama-2-13b-split -r requirements.txt --config-file model-config.yaml --archive-format no-archive
```

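The archiver command above references `model-config.yaml`, which is not reproduced in this view. Below is an illustrative sketch of what such a config might look like; the handler-specific keys and values (checkpoint directory, precision, tensor-parallel degree, generation length) are assumptions and should be taken from the actual file in the example.

```yaml
# Illustrative model-config.yaml sketch; the real file in the example may differ.
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 10800
batchSize: 1            # must match the batch size used when compiling the model

# Handler-specific settings (assumed names/values, shown for illustration)
handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 12
    max_length: 100
```
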
### Step 5: Add the MAR file to the model store

```bash
mkdir model_store
mv llama-2-13b model_store
```

### Step 6: Start TorchServe

```bash
torchserve --ncs --start --model-store model_store
```

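Once the server is up, you can check its health through the inference API (assuming the default port 8080):

```bash
curl http://localhost:8080/ping
```
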
### Step 7: Register the model

```bash
curl -X POST "http://localhost:8081/models?url=llama-2-13b"
```

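To confirm the registration and check the worker status, you can query the management API:

```bash
curl http://localhost:8081/models/llama-2-13b
```
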
### Step 8: Run inference

```bash
python test_stream_response.py
```

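The test script prints the completion as it is streamed back. The snippet below is a rough sketch of such a streaming client, assuming the `requests` library and the default inference port; the prompt text is a placeholder and the bundled `test_stream_response.py` may differ.

```python
# Rough sketch of a streaming inference client (the bundled script may differ).
import requests

response = requests.post(
    "http://localhost:8080/predictions/llama-2-13b",
    data="Today the weather is really nice and I am planning on ",  # placeholder prompt
    stream=True,
)
for chunk in response.iter_content(chunk_size=None):
    if chunk:
        # Print each streamed chunk as soon as it arrives
        print(chunk.decode("utf-8"), end="", flush=True)
print()
```
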