Commit

Merge branch 'master' into iterative-typing

agunapal authored Aug 29, 2023
2 parents b425439 + fc814be commit 866bf0d
Showing 10 changed files with 231 additions and 7 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -52,4 +52,4 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch ea
* [TorchServe on Kubernetes](https://github.com/pytorch/serve/blob/master/kubernetes/README.md#torchserve-on-kubernetes) - Demonstrates a Torchserve deployment in Kubernetes using Helm Chart supported in both Azure Kubernetes Service and Google Kubernetes service
* [mlflow-torchserve](https://github.com/mlflow/mlflow-torchserve) - Deploy mlflow pipeline models into TorchServe
* [Kubeflow pipelines](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples) - Kubeflow pipelines and Google Vertex AI Managed pipelines
* [NVIDIA MPS](mps.md) - Use NVIDIA MPS to optimize multi-worker deployment on a single GPU
* [NVIDIA MPS](nvidia_mps.md) - Use NVIDIA MPS to optimize multi-worker deployment on a single GPU
2 changes: 1 addition & 1 deletion docs/contents.rst
@@ -16,7 +16,7 @@
model_zoo
request_envelopes
server
mps
nvidia_mps
snapshot
torchserve_on_win_native
torchserve_on_wsl
8 changes: 4 additions & 4 deletions docs/mps.md → docs/nvidia_mps.md
@@ -60,7 +60,7 @@ Please note that we set the concurrency level to 600 which will make sure that t
We first perform the single worker benchmark for the G4 instance.
In the figure below we see a steady increase in throughput up to a batch size of four.

![G4 benchmark, single worker](images/mps_g4_single.png)
![G4 benchmark, single worker](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_g4_single.png)

Next, we increase the number of workers to two in order to compare the throughput with and without MPS running.
To enable MPS for the second set of runs we first set the exclusive processing mode for the GPU and then start the MPS daemon as shown above.
@@ -69,19 +69,19 @@ We select the batch size between one and eight according to our previous finding
In the figure we can see that the throughput can be higher for batch sizes 1 and 8 (up to +18%), while it can be lower for the other batch sizes (down to -11%).
One interpretation of this result is that the G4 instance does not have many resources to share when a BERT model is already running in one of the workers.

![G4 benchmark, two workers](images/mps_g4_two_worker.png)
![G4 benchmark, two workers](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_g4_two_worker.png)

### P3 instance
Next, we will run the same experiment with the bigger p3.2xlarge instance.
With a single worker we get the following throughput values:

![P3 benchmark, single worker](images/mps_p3_single.png)
![P3 benchmark, single worker](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_p3_single.png)

We can see that the throughput increases steadily, but for batch sizes over eight we see diminishing returns.
Finally, we deploy two workers on the P3 instance and compare running them with and without MPS.
We can see that for batch sizes between 1 and 32 the throughput is consistently higher (up to +25%) with MPS enabled, with the exception of batch size 16.

![P3 benchmark, two workers](images/mps_p3_two_worker.png)
![P3 benchmark, two workers](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_p3_two_worker.png)

## Summary
In the previous section we saw that enabling MPS for two workers running the same model gives mixed results.
2 changes: 1 addition & 1 deletion docs/performance_guide.md
@@ -69,7 +69,7 @@ While NVIDIA GPUs allow multiple processes to run on CUDA kernels, this comes wi
* The execution of the kernels is generally serialized
* Each processes creates its own CUDA context which occupies additional GPU memory

To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe [here](mps.md).
To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe [here](nvidia_mps.md).

<h6> NVIDIA DALI</h6>

60 changes: 60 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/Readme.md
@@ -0,0 +1,60 @@
# Loading meta-llama/Llama-2-70b-chat-hf on AWS EC2 g5.24xlarge using accelerate

This document describes how to serve large Hugging Face models with limited resources using accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
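
For illustration only, here is a minimal sketch of this loading path outside of TorchServe; the model name and settings mirror this example, and this is not the handler code itself:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: with low_cpu_mem_usage=True the weights are materialized shard by
# shard instead of building the full model in CPU memory first; device_map lets
# accelerate place the shards across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",  # gated model, requires granted access
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="balanced",
)
```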

### Step 1: Get permission to download the model

Follow [these instructions](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to get permission.

Log in with a Hugging Face account:
```bash
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

```bash
python ../Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-chat-hf
```
The model will be saved at the following path: `model/models--meta-llama--Llama-2-70b-chat-hf`.

### Step 2: Generate MAR file

Add the downloaded path to `model_path:` in `model-config.yaml` and run the following:

```bash
torch-model-archiver --model-name llama2-70b-chat --version 1.0 --handler custom_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive
```

If you are using conda and notice issues with mpi4py, you will need to install openmpi-mpicc using the following command:

```bash
conda install -c conda-forge openmpi-mpicc
```

### Step 3: Add the MAR file to the model store

```bash
mkdir model_store
mv llama2-70b-chat model_store
mv model model_store/llama2-70b-chat
```

### Step 4: Start TorchServe

Update `config.properties` and start TorchServe:

```bash
torchserve --start --ncs --ts-config config.properties --model-store model_store --models llama2-70b-chat
```

### Step 5: Run inference

```bash
curl -v "http://localhost:8080/predictions/llama2-70b-chat" -T sample_text.txt
```

This results in the following output:
```
Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings'
```
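
The prediction endpoint can also be called from Python; here is a small sketch using the `requests` library, assuming TorchServe is running locally as started above:

```python
import requests

# Send the prompt file as the request body, just like the curl call above.
with open("sample_text.txt", "rb") as f:
    response = requests.post(
        "http://localhost:8080/predictions/llama2-70b-chat", data=f
    )
print(response.text)
```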
6 changes: 6 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/config.properties
@@ -0,0 +1,6 @@
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
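# enable_envvars_config allows TorchServe settings to be overridden via environment variables;
# install_py_dep_per_model tells TorchServe to install the model's bundled requirements.txt at load time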
enable_envvars_config=true
install_py_dep_per_model=true

139 changes: 139 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/custom_handler.py
@@ -0,0 +1,139 @@
import logging
from abc import ABC

import torch
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from accelerate import init_empty_weights
from accelerate import load_checkpoint_and_dispatch

from ts.context import Context
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)


class LlamaHandler(BaseHandler, ABC):
"""
Transformers handler class for sequence, token classification and question answering.
"""

def __init__(self):
super(LlamaHandler, self).__init__()
self.max_length = None
self.max_new_tokens = None
self.tokenizer = None
self.initialized = False

def initialize(self, ctx: Context):
"""In this initialize function, the HF large model is loaded and
partitioned using DeepSpeed.
Args:
ctx (context): It is a JSON Object containing information
pertaining to the model artifacts parameters.
"""
model_dir = ctx.system_properties.get("model_dir")
self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
model_name = ctx.model_yaml_config["handler"]["model_name"]
model_path = f'{model_dir}/{ctx.model_yaml_config["handler"]["model_path"]}'
seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
torch.manual_seed(seed)

logger.info("Model %s loading tokenizer", ctx.model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="balanced",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_8bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens(
            {
                "pad_token": "<PAD>",
            }
        )
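        # Grow the embedding matrix by one row to account for the newly added <PAD> token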
self.model.resize_token_embeddings(self.model.config.vocab_size + 1)

logger.info("Model %s loaded successfully", ctx.model_name)
self.initialized = True

def preprocess(self, requests):
"""
Basic text preprocessing, based on the user's choice of application mode.
Args:
requests (list): A list of dictionaries with a "data" or "body" field, each
containing the input text to be processed.
Returns:
tuple: A tuple with two tensors: the batch of input ids and the batch of
attention masks.
"""
input_texts = [data.get("data") or data.get("body") for data in requests]
input_ids_batch, attention_mask_batch = [], []
for input_text in input_texts:
input_ids, attention_mask = self.encode_input_text(input_text)
input_ids_batch.append(input_ids)
attention_mask_batch.append(attention_mask)
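        # Concatenate the per-request tensors into one batch and move it to the model's device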
        input_ids_batch = torch.cat(input_ids_batch, dim=0).to(self.model.device)
        attention_mask_batch = torch.cat(attention_mask_batch, dim=0).to(self.model.device)
return input_ids_batch, attention_mask_batch

def encode_input_text(self, input_text):
"""
Encodes a single input text using the tokenizer.
Args:
input_text (str): The input text to be encoded.
Returns:
tuple: A tuple with two tensors: the encoded input ids and the attention mask.
"""
if isinstance(input_text, (bytes, bytearray)):
input_text = input_text.decode("utf-8")
logger.info("Received text: '%s'", input_text)
inputs = self.tokenizer.encode_plus(
input_text,
max_length=self.max_length,
padding=True,
add_special_tokens=True,
return_tensors="pt",
truncation=True,
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
return input_ids, attention_mask

def inference(self, input_batch):
"""
Predicts the class (or classes) of the received text using the serialized transformers
checkpoint.
Args:
input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
of attention masks, as returned by the preprocess function.
Returns:
list: A list of strings with the predicted values for each input text in the batch.
"""
input_ids_batch, attention_mask_batch = input_batch
        input_ids_batch = input_ids_batch.to(self.model.device)
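        # Generate up to max_new_tokens continuation tokens for each sequence in the batch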
outputs = self.model.generate(
input_ids_batch,
attention_mask=attention_mask_batch,
            max_new_tokens=self.max_new_tokens,
)

inferences = self.tokenizer.batch_decode(
outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

logger.info("Generated text: %s", inferences)
return inferences

def postprocess(self, inference_output):
"""Post Process Function converts the predicted response into Torchserve readable format.
Args:
inference_output (list): It contains the predicted response of the input text.
Returns:
(list): Returns a list of the Predictions and Explanations.
"""
return inference_output
13 changes: 13 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/model-config.yaml
@@ -0,0 +1,13 @@
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
deviceType: "gpu"

handler:
model_name: "meta-llama/Llama-2-70b-chat-hf"
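    # model_path is resolved relative to the model directory created by torch-model-archiver (see custom_handler.initialize)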
model_path: "model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/9ff8b00464fc439a64bb374769dec3dd627be1c2"
max_length: 50
max_new_tokens: 50
manual_seed: 40
5 changes: 5 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/requirements.txt
@@ -0,0 +1,5 @@
transformers==4.31.0
accelerate
bitsandbytes
scipy
mpi4py
1 change: 1 addition & 0 deletions examples/large_models/Huggingface_accelerate/llama2/sample_text.txt
@@ -0,0 +1 @@
what is the recipe of mayonnaise?
