Add llama2 text completion example with streaming response support
Naman Nandan committed Jul 28, 2023
1 parent 71e1c2e commit 631b253
Showing 10 changed files with 221 additions and 262 deletions.
77 changes: 0 additions & 77 deletions examples/large_models/inferentia2/llama/Readme.md

This file was deleted.

167 changes: 0 additions & 167 deletions examples/large_models/inferentia2/llama/inf2_handler.py

This file was deleted.

12 changes: 0 additions & 12 deletions examples/large_models/inferentia2/llama/model-config.yaml

This file was deleted.

5 changes: 0 additions & 5 deletions examples/large_models/inferentia2/llama/requirements.txt

This file was deleted.

1 change: 0 additions & 1 deletion examples/large_models/inferentia2/llama/sample_text.txt

This file was deleted.

93 changes: 93 additions & 0 deletions examples/large_models/inferentia2/llama2/Readme.md
@@ -0,0 +1,93 @@
# Large model inference on Inferentia2

This document describes how to serve the [Llama 2](https://huggingface.co/meta-llama) model on [AWS Inferentia2](https://aws.amazon.com/ec2/instance-types/inf2/) with streaming response support.

Inferentia2 uses the [Neuron SDK](https://aws.amazon.com/machine-learning/neuron/), which is built on top of the PyTorch XLA stack. For large model inference, the [`transformers-neuronx`](https://github.com/aws-neuron/transformers-neuronx) package is used, which takes care of model partitioning and running inference.

Let's take a look at the steps to prepare our model for inference on Inf2 instances.

**Note**: To run the model on an Inf2 instance, the model is compiled as a preprocessing step. The compilation process generates the model graph for a specific batch size, so the same batch size must be passed when running inference. This example uses a batch size of 1 to demonstrate real-time inference with streaming response.
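
To make the role of compilation concrete, the sketch below shows roughly how a Llama checkpoint is compiled for a fixed batch size with `transformers-neuronx`. It is only an illustration: the checkpoint path, tensor-parallel degree, and precision are assumed values, not settings taken from this example's handler.

```python
# Illustrative sketch only -- path, tp_degree and amp are assumed values.
from transformers_neuronx.llama.model import LlamaForSampling

# Load split checkpoints (produced later in Step 3) and declare the batch size
# that the compiled graph will be specialized for.
model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split",  # assumed checkpoint path
    batch_size=1,           # inference must later use this same batch size
    tp_degree=12,           # NeuronCores to shard across (assumed)
    amp="f16",              # precision (assumed)
)
model.to_neuron()           # compiles the model graph for batch_size=1
```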

### Step 1: Inf2 instance

Get an Inf2 instance (note: this example was tested on the `inf2.24xlarge` instance type), ssh to it, and make sure to use the following DLAMI, as it comes with PyTorch and the necessary packages for the AWS Neuron SDK pre-installed.
DLAMI Name: `Deep Learning AMI Neuron PyTorch 1.13 (Ubuntu 20.04) 20230720 Amazon Machine Image (AMI)`

### Step 2: Package Installations

Follow the steps below to complete the package installations.

```bash
sudo apt-get update
sudo apt-get upgrade

# Update Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y

# Activate Python venv
source /opt/aws_neuron_venv_pytorch/bin/activate

# Clone Torchserve git repository
git clone https://github.com/pytorch/serve.git
cd serve

# Install dependencies
python ts_scripts/install_dependencies.py --neuronx

# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com

# Update Neuron Compiler, Framework and Transformers
python -m pip install --upgrade neuronx-cc torch-neuronx transformers-neuronx

# Install additional necessary packages
python -m pip install --upgrade transformers tokenizers sentencepiece

```
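
As an optional sanity check (not part of the original steps), the following snippet simply confirms that the Neuron packages import correctly inside the activated virtual environment:

```python
# Optional sanity check: verify the Neuron-related packages are importable.
import torch
import torch_neuronx
import transformers
import transformers_neuronx

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
```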



### Step 3: Save the model split checkpoints compatible with `transformers-neuronx`
Log in to Hugging Face:
```bash
huggingface-cli login
```

Navigate to the `examples/large_models/inferentia2/llama2` directory and run the following script:

```bash
python ../util/inf2_save_split_checkpoints.py --model_name meta-llama/Llama-2-13b-hf --save_path './llama-2-13b-split'
```
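
For reference, the core of what this utility does looks roughly like the sketch below; the actual `inf2_save_split_checkpoints.py` script may differ in details such as dtype handling and argument parsing.

```python
# Rough sketch of saving split checkpoints for transformers-neuronx
# (the real inf2_save_split_checkpoints.py may differ).
from transformers import AutoModelForCausalLM
from transformers_neuronx.module import save_pretrained_split

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-hf")
# Writes the checkpoint as per-layer shards that transformers-neuronx can load.
save_pretrained_split(model, "./llama-2-13b-split")
```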


### Step 4: Generate Tar/MAR file

```bash
torch-model-archiver --model-name llama-2-13b --version 1.0 --handler inf2_handler.py --extra-files ./llama-2-13b-split -r requirements.txt --config-file model-config.yaml --archive-format no-archive
```

### Step 5: Add the MAR file to the model store

```bash
mkdir model_store
mv llama-2-13b model_store
```

### Step 6: Start TorchServe

```bash
torchserve --ncs --start --model-store model_store
```

### Step 7: Register the model

```bash
curl -X POST "http://localhost:8081/models?url=llama-2-13b"
```
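
Optionally, registration can be verified with TorchServe's describe-model management API before running inference (this check is not part of the original steps):

```python
# Optional: confirm the model is registered and its workers are up.
import requests

status = requests.get("http://localhost:8081/models/llama-2-13b")
print(status.json())  # includes worker status and model settings
```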

### Step 8: Run inference

```bash
python test_stream_response.py
```
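
`test_stream_response.py` sends a prompt to the TorchServe inference endpoint and prints the generated text as it streams back. A minimal client along the same lines might look like the sketch below; the prompt shown is an assumption, not the one used by the shipped script.

```python
# Minimal streaming client sketch (the shipped test_stream_response.py may differ).
import requests

url = "http://localhost:8080/predictions/llama-2-13b"  # TorchServe inference API
prompt = "Today the weather is really nice and I am planning on"  # assumed prompt

with requests.post(url, data=prompt, stream=True) as response:
    # Print each chunk of the streamed response as it arrives.
    for chunk in response.iter_content(chunk_size=None):
        if chunk:
            print(chunk.decode("utf-8"), end="", flush=True)
print()
```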
