Add llama2 text completion example with streaming response support
Naman Nandan committed Jul 28, 2023
1 parent 71e1c2e commit 631b253
Showing 10 changed files with 221 additions and 262 deletions.
77 changes: 0 additions & 77 deletions examples/large_models/inferentia2/llama/Readme.md

This file was deleted.

167 changes: 0 additions & 167 deletions examples/large_models/inferentia2/llama/inf2_handler.py

This file was deleted.

12 changes: 0 additions & 12 deletions examples/large_models/inferentia2/llama/model-config.yaml

This file was deleted.

5 changes: 0 additions & 5 deletions examples/large_models/inferentia2/llama/requirements.txt

This file was deleted.

1 change: 0 additions & 1 deletion examples/large_models/inferentia2/llama/sample_text.txt

This file was deleted.

93 changes: 93 additions & 0 deletions examples/large_models/inferentia2/llama2/Readme.md
@@ -0,0 +1,93 @@
# Large model inference on Inferentia2

This document describes how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with streaming response support.

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.

Let's take a look at the steps to prepare our model for inference on Inf2 instances.

**Note**: To run the model on an Inf2 instance, the model is compiled as a preprocessing step. The compilation process generates the model graph for a specific batch size, so the same batch size must be passed when running inference. This example uses a batch size of 1 to demonstrate real-time inference with streaming response.
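
To make the role of compilation concrete, the sketch below shows roughly how a Llama checkpoint is compiled for a fixed batch size with `transformers-neuronx`. It is only an illustration: the checkpoint path, tensor-parallel degree, and precision are assumed values, not settings taken from this example's handler.

```python
# Illustrative sketch only -- path, tp_degree and amp are assumed values.
from transformers_neuronx.llama.model import LlamaForSampling

# Load split checkpoints (produced later in Step 3) and declare the batch size
# that the compiled graph will be specialized for.
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",  # assumed checkpoint path
    batch_size=1,           # inference must later use this same batch size
    tp_degree=12,           # NeuronCores to shard across (assumed)
    amp="f16",              # precision (assumed)
)
model.to_neuron()           # compiles the model graph for batch_size=1
```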

### Step 1: Inf2 instance

Get an Inf2 instance (note: this example was tested on the `inf2.24xlarge` instance type), ssh to it, and make sure to use the following DLAMI, as it comes with PyTorch and the necessary packages for the AWS Neuron SDK pre-installed.
DLAMI Name: `Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)`

### Step 2: Package Installations

Follow the steps below to complete the package installations.

```bash
sudo apt-get update
sudo apt-get upgrade

# Update Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y

# Activate Python venv
source /opt/aws_neuron_venv_pytorch/bin/activate

# Clone Torchserve git repository
git clone https://github.com/pytorch/serve.git
cd serve

# Install dependencies
python ts_scripts/install_dependencies.py --neuronx

# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Update Neuron Compiler, Framework and Transformers
python -m pip install --upgrade neuronx-cc torch-neuronx transformers-neuronx

# Install additional necessary packages
python -m pip install --upgrade transformers tokenizers sentencepiece

```
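
As an optional sanity check (not part of the original steps), the following snippet simply confirms that the Neuron packages import correctly inside the activated virtual environment:

```python
# Optional sanity check: verify the Neuron-related packages are importable.
import torch
import torch_neuronx
import transformers
import transformers_neuronx

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```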



### Step 3: Save the model split checkpoints compatible with `transformers-neuronx`
Log in to Hugging Face:
```bash
huggingface-cli login
```

Navigate to the `examples/large_models/inferentia2/llama2` directory and run the following script:

```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```
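
For reference, the core of what this utility does looks roughly like the sketch below; the actual `inf2_save_split_checkpoints.py` script may differ in details such as dtype handling and argument parsing.

```python
# Rough sketch of saving split checkpoints for transformers-neuronx
# (the real inf2_save_split_checkpoints.py may differ).
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
# Writes the checkpoint as per-layer shards that transformers-neuronx can load.
save_pretrained_split(model, "./llama-2-13b-split")
```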


### Step 4: Generate Tar/MAR file

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py --extra-files ./llama-2-13b-split -r requirements.txt --config-file model-config.yaml --archive-format no-archive
```

### Step 5: Add the MAR file to the model store

```bash
mkdir model_store
mv llama-2-13b model_store
```

### Step 6: Start TorchServe

```bash
torchserve --ncs --start --model-store model_store
```

### Step 7: Register the model

```bash
curl -X POST "http://localhost:8081/models?url=llama-2-13b"
```
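
Optionally, registration can be verified with TorchServe's describe-model management API before running inference (this check is not part of the original steps):

```python
# Optional: confirm the model is registered and its workers are up.
import requests

status = requests.get("http://localhost:8081/models/llama-2-13b")
print(status.json())  # includes worker status and model settings
```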

### Step 8: Run inference

```bash
python test_stream_response.py
```
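
`test_stream_response.py` sends a prompt to the TorchServe inference endpoint and prints the generated text as it streams back. A minimal client along the same lines might look like the sketch below; the prompt shown is an assumption, not the one used by the shipped script.

```python
# Minimal streaming client sketch (the shipped test_stream_response.py may differ).
import requests

url = "http://localhost:8080/predictions/llama-2-13b"  # TorchServe inference API
prompt = "Today the weather is really nice and I am planning on"  # assumed prompt

with requests.post(url, data=prompt, stream=True) as response:
    # Print each chunk of the streamed response as it arrives.
    for chunk in response.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)
print()
```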
