Commit

Merge branch 'master' into iterative-typing

agunapal authored Aug 29, 2023
2 parents b425439 + fc814be commit 866bf0d
Showing 10 changed files with 231 additions and 7 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -52,4 +52,4 @@ TorchServe is a performant, flexible and easy to use tool for serving PyTorch ea
* [TorchServe on Kubernetes](https://github.com/pytorch/serve/blob/master/kubernetes/README.md#torchserve-on-kubernetes) - Demonstrates a Torchserve deployment in Kubernetes using Helm Chart supported in both Azure Kubernetes Service and Google Kubernetes service
* [mlflow-torchserve](https://github.com/mlflow/mlflow-torchserve) - Deploy mlflow pipeline models into TorchServe
* [Kubeflow pipelines](https://github.com/kubeflow/pipelines/tree/master/samples/contrib/pytorch-samples) - Kubeflow pipelines and Google Vertex AI Managed pipelines
* [NVIDIA MPS](mps.md) - Use NVIDIA MPS to optimize multi-worker deployment on a single GPU
* [NVIDIA MPS](nvidia_mps.md) - Use NVIDIA MPS to optimize multi-worker deployment on a single GPU
2 changes: 1 addition & 1 deletion docs/contents.rst
@@ -16,7 +16,7 @@
model_zoo
request_envelopes
server
mps
nvidia_mps
snapshot
torchserve_on_win_native
torchserve_on_wsl
8 changes: 4 additions & 4 deletions docs/mps.md → docs/nvidia_mps.md
@@ -60,7 +60,7 @@ Please note that we set the concurrency level to 600 which will make sure that t
We first perform the single worker benchmark for the G4 instance.
In the figure below we see a steady increase in throughput up to a batch size of four.

![G4 benchmark, single worker](images/mps_g4_single.png)
![G4 benchmark, single worker](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_g4_single.png)

Next, we increase the number of workers to two in order to compare the throughput with and without MPS running.
To enable MPS for the second set of runs we first set the exclusive processing mode for the GPU and then start the MPS daemon as shown above.
@@ -69,19 +69,19 @@ We select the batch size between one and eight according to our previous finding
In the figure we can see that the throughput can be higher for batch sizes 1 and 8 (up to +18%), while it can be lower for the other batch sizes (down to -11%).
One interpretation of this result is that the G4 instance does not have many resources to share when a BERT model is already running in one of the workers.

![G4 benchmark, two workers](images/mps_g4_two_worker.png)
![G4 benchmark, two workers](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_g4_two_worker.png)

### P3 instance
Next, we will run the same experiment with the bigger p3.2xlarge instance.
With a single worker we get the following throughput values:

![P3 benchmark, single worker](images/mps_p3_single.png)
![P3 benchmark, single worker](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_p3_single.png)

We can see that the throughput increases steadily, but for batch sizes over eight we see diminishing returns.
Finally, we deploy two workers on the P3 instance and compare running them with and without MPS.
We can see that for batch sizes between 1 and 32 the throughput is consistently higher (up to +25%) with MPS enabled, with the exception of batch size 16.

![P3 benchmark, two workers](images/mps_p3_two_worker.png)
![P3 benchmark, two workers](https://raw.githubusercontent.com/pytorch/serve/master/docs/images/mps_p3_two_worker.png)

## Summary
In the previous section we saw that enabling MPS for two workers running the same model gives mixed results.
2 changes: 1 addition & 1 deletion docs/performance_guide.md
@@ -69,7 +69,7 @@ While NVIDIA GPUs allow multiple processes to run on CUDA kernels, this comes wi
* The execution of the kernels is generally serialized
* Each processes creates its own CUDA context which occupies additional GPU memory

To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe [here](mps.md).
To get around these drawbacks, you can utilize the NVIDIA Multi-Process Service (MPS) to increase performance. You can find more information on how to utilize NVIDIA MPS with TorchServe [here](nvidia_mps.md).

<h6> NVIDIA DALI</h6>

60 changes: 60 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/Readme.md
@@ -0,0 +1,60 @@
# Loading meta-llama/Llama-2-70b-chat-hf on AWS EC2 g5.24xlarge using accelerate

This document describes how to serve large Hugging Face models with limited resources using accelerate. This option can be activated with `low_cpu_mem_usage=True`. The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
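
For illustration only, here is a minimal sketch of this loading path outside of TorchServe; the model name and settings mirror this example, and this is not the handler code itself:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch only: with low_cpu_mem_usage=True the weights are materialized shard by
# shard instead of building the full model in CPU memory first; device_map lets
# accelerate place the shards across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",  # gated model, requires granted access
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map="balanced",
)
```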

### Step 1: Get permission to download the model

Follow [these instructions](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) to get permission.

Log in with a Hugging Face account:
```bash
huggingface-cli login
# or using an environment variable
huggingface-cli login --token $HUGGINGFACE_TOKEN
```

```bash
python ../Download_model.py --model_path model --model_name meta-llama/Llama-2-70b-chat-hf
```
The model will be saved at the following path: `model/models--meta-llama--Llama-2-70b-chat-hf`.

### Step 2: Generate MAR file

Add the downloaded path to `model_path:` in `model-config.yaml` and run the following:

```bash
torch-model-archiver --model-name llama2-70b-chat --version 1.0 --handler custom_handler.py --config-file model-config.yaml -r requirements.txt --archive-format no-archive
```

If you are using conda and notice issues with mpi4py, you will need to install openmpi-mpicc using the following command:

```bash
conda install -c conda-forge openmpi-mpicc
```

### Step 3: Add the MAR file to the model store

```bash
mkdir model_store
mv llama2-70b-chat model_store
mv model model_store/llama2-70b-chat
```

### Step 4: Start TorchServe

Update `config.properties` and start TorchServe:

```bash
torchserve --start --ncs --ts-config config.properties --model-store model_store --models llama2-70b-chat
```

### Step 5: Run inference

```bash
curl -v "http://localhost:8080/predictions/llama2-70b-chat" -T sample_text.txt
```

This results in the following output:
```
Mayonnaise is a thick, creamy condiment made from a mixture of egg yolks, oil, vinegar or lemon juice, and seasonings'
```
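
The prediction endpoint can also be called from Python; here is a small sketch using the `requests` library, assuming TorchServe is running locally as started above:

```python
import requests

# Send the prompt file as the request body, just like the curl call above.
with open("sample_text.txt", "rb") as f:
    response = requests.post(
        "http://localhost:8080/predictions/llama2-70b-chat", data=f
    )
print(response.text)
```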
6 changes: 6 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/config.properties
@@ -0,0 +1,6 @@
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
metrics_address=http://0.0.0.0:8082
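# enable_envvars_config allows TorchServe settings to be overridden via environment variables;
# install_py_dep_per_model tells TorchServe to install the model's bundled requirements.txt at load time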
enable_envvars_config=true
install_py_dep_per_model=true

139 changes: 139 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/custom_handler.py
@@ -0,0 +1,139 @@
import logging
from abc import ABC

import torch
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from accelerate import init_empty_weights
from accelerate import load_checkpoint_and_dispatch

from ts.context import Context
from ts.torch_handler.base_handler import BaseHandler

logger = logging.getLogger(__name__)
logger.info("Transformers version %s", transformers.__version__)


class LlamaHandler(BaseHandler, ABC):
"""
Transformers handler class for sequence, token classification and question answering.
"""

def __init__(self):
super(LlamaHandler, self).__init__()
self.max_length = None
self.max_new_tokens = None
self.tokenizer = None
self.initialized = False

def initialize(self, ctx: Context):
"""In this initialize function, the HF large model is loaded and
partitioned using DeepSpeed.
Args:
ctx (context): It is a JSON Object containing information
pertaining to the model artifacts parameters.
"""
model_dir = ctx.system_properties.get("model_dir")
self.max_length = int(ctx.model_yaml_config["handler"]["max_length"])
self.max_new_tokens = int(ctx.model_yaml_config["handler"]["max_new_tokens"])
model_name = ctx.model_yaml_config["handler"]["model_name"]
model_path = f'{model_dir}/{ctx.model_yaml_config["handler"]["model_path"]}'
seed = int(ctx.model_yaml_config["handler"]["manual_seed"])
torch.manual_seed(seed)

logger.info("Model %s loading tokenizer", ctx.model_name)
self.model = AutoModelForCausalLM.from_pretrained(
model_path,
device_map="balanced",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_8bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.tokenizer.add_special_tokens(
            {
                "pad_token": "<PAD>",
            }
        )
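        # Grow the embedding matrix by one row to account for the newly added <PAD> token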
self.model.resize_token_embeddings(self.model.config.vocab_size + 1)

logger.info("Model %s loaded successfully", ctx.model_name)
self.initialized = True

def preprocess(self, requests):
"""
Basic text preprocessing, based on the user's choice of application mode.
Args:
requests (list): A list of dictionaries with a "data" or "body" field, each
containing the input text to be processed.
Returns:
tuple: A tuple with two tensors: the batch of input ids and the batch of
attention masks.
"""
input_texts = [data.get("data") or data.get("body") for data in requests]
input_ids_batch, attention_mask_batch = [], []
for input_text in input_texts:
input_ids, attention_mask = self.encode_input_text(input_text)
input_ids_batch.append(input_ids)
attention_mask_batch.append(attention_mask)
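        # Concatenate the per-request tensors into one batch and move it to the model's device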
        input_ids_batch = torch.cat(input_ids_batch, dim=0).to(self.model.device)
        attention_mask_batch = torch.cat(attention_mask_batch, dim=0).to(self.model.device)
return input_ids_batch, attention_mask_batch

def encode_input_text(self, input_text):
"""
Encodes a single input text using the tokenizer.
Args:
input_text (str): The input text to be encoded.
Returns:
tuple: A tuple with two tensors: the encoded input ids and the attention mask.
"""
if isinstance(input_text, (bytes, bytearray)):
input_text = input_text.decode("utf-8")
logger.info("Received text: '%s'", input_text)
inputs = self.tokenizer.encode_plus(
input_text,
max_length=self.max_length,
padding=True,
add_special_tokens=True,
return_tensors="pt",
truncation=True,
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
return input_ids, attention_mask

def inference(self, input_batch):
"""
Predicts the class (or classes) of the received text using the serialized transformers
checkpoint.
Args:
input_batch (tuple): A tuple with two tensors: the batch of input ids and the batch
of attention masks, as returned by the preprocess function.
Returns:
list: A list of strings with the predicted values for each input text in the batch.
"""
input_ids_batch, attention_mask_batch = input_batch
        input_ids_batch = input_ids_batch.to(self.model.device)
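        # Generate up to max_new_tokens continuation tokens for each sequence in the batch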
outputs = self.model.generate(
input_ids_batch,
attention_mask=attention_mask_batch,
            max_new_tokens=self.max_new_tokens,
)

inferences = self.tokenizer.batch_decode(
outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

logger.info("Generated text: %s", inferences)
return inferences

def postprocess(self, inference_output):
"""Post Process Function converts the predicted response into Torchserve readable format.
Args:
inference_output (list): It contains the predicted response of the input text.
Returns:
(list): Returns a list of the Predictions and Explanations.
"""
return inference_output
13 changes: 13 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/model-config.yaml
@@ -0,0 +1,13 @@
# TorchServe frontend parameters
minWorkers: 1
maxWorkers: 1
maxBatchDelay: 100
responseTimeout: 1200
deviceType: "gpu"

handler:
model_name: "meta-llama/Llama-2-70b-chat-hf"
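    # model_path is resolved relative to the model directory created by torch-model-archiver (see custom_handler.initialize)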
model_path: "model/models--meta-llama--Llama-2-70b-chat-hf/snapshots/9ff8b00464fc439a64bb374769dec3dd627be1c2"
max_length: 50
max_new_tokens: 50
manual_seed: 40
5 changes: 5 additions & 0 deletions examples/large_models/Huggingface_accelerate/llama2/requirements.txt
@@ -0,0 +1,5 @@
transformers==4.31.0
accelerate
bitsandbytes
scipy
mpi4py
1 change: 1 addition & 0 deletions examples/large_models/Huggingface_accelerate/llama2/sample_text.txt
@@ -0,0 +1 @@
what is the recipe of mayonnaise?
