Philip/even more docs (#563)
* tutorial refactor

* CLI reference

* usage

* user guide

* fix up 3 tutorials

* 2 more tutorials

* VLLM tutorial
philipkiely-baseten authored Aug 15, 2023
1 parent 5f0fc96 commit fd9b905
Showing 31 changed files with 1,382 additions and 383 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -8,7 +8,7 @@
## Why Truss?

* **Write once, run anywhere:** Package and test model code, weights, and dependencies with a model server that behaves the same in development and production.
* **Fast developer loop:** Implement your model with fast feedback from a live reload server, and skip Docker and Kubernetes configuration with a batteries-included model serving environment.
* **Support for all Python frameworks:** From `transformers` and `diffusers` to `PyTorch` and `TensorFlow` to `XGBoost` and `sklearn`, Truss supports models created with any framework, even entirely custom models.

See Trusses for popular models including:
16 changes: 8 additions & 8 deletions docs/examples/models/overview.mdx
@@ -1,20 +1,20 @@
---
title: Example foundation models
description: "Step-by-step packaging instructions"
---

<CardGroup cols={3}>
<Card title="Llama-2" icon="horse" href="/examples/models/llama-2">
A commercially-licensed LLM by Meta
</Card>
<Card title="Stable Diffusion XL" icon="palette" href="/examples/models/sdxl">
A text to image model by Stability AI
</Card>
<Card title="Whisper" icon="ear-listen" href="/examples/models/whisper">
An audio transcription model by OpenAI
</Card>
</CardGroup>

<Card title="More" icon="ear-listen" href="#">
Lorem
</Card>
<Card title="More examples on GitHub" icon="github" href="https://github.com/basetenlabs/truss-examples">
See Trusses for dozens of models on GitHub.
</Card>
220 changes: 219 additions & 1 deletion docs/examples/performance/cached-weights.mdx
@@ -1,4 +1,222 @@
---
title: Load cached model weights
description: "Description"
description: "Deploy a model with private Hugging Face weights"
---

In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b`, a popular open-source large language model, as our example. To follow along, you need to request access to Llama 2.

1. First, [sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Then, request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets.
1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>

### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option, then hit `y` to confirm Truss creation. Navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```

### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.


In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object with `model` and `tokenizer` properties and access to the Truss secrets
* `load`, which runs once when the model server is spun up and loads the model and tokenizer from Hugging Face
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We will also create a helper function `format_prompt` outside of the `Model` class to appropriately format the incoming text according to the Llama 2 specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"

class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        inputs = self.tokenizer(prompt, return_tensors="pt")

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```
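
To sanity-check the prompt template locally, you can call `format_prompt` on its own. A quick sketch, assuming the constants and function defined above are in scope:

```python
# Preview the formatted prompt without loading the model.
formatted = format_prompt("What is a large language model?")
print(formatted)
# -> "[INST] <<SYS>>\n You are a helpful, respectful and honest assistant. \n<</SYS>>\n\n What is a large language model? [/INST]"
```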

### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other dependencies:

```yaml config.yaml
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
```
<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so you want to have an up-to-date version of each package while also being protected from breaking changes.
</Note>

### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
  - repo_id: "meta-llama/Llama-2-7b-chat-hf"
    ignore_patterns:
      - "*.bin"
```

In this configuration:
- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- Under `ignore_patterns`, we use a wildcard to skip all `.bin` files in the repo. The weights are published in both `.bin` and `.safetensors` formats, and we only want to cache the `.safetensors` files.
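
If you want to preview which files a pattern like this would keep, you can list the repo's files and apply the same filter yourself. This is only an illustrative sketch using `huggingface_hub` (it is not how Truss applies `ignore_patterns` internally), and the token value is a placeholder:

```python
from fnmatch import fnmatch

from huggingface_hub import list_repo_files

# List every file in the gated repo; requires a valid Hugging Face token.
files = list_repo_files("meta-llama/Llama-2-7b-chat-hf", token="hf_...")  # placeholder token

ignore_patterns = ["*.bin"]

# Keep only files that don't match any ignore pattern.
kept = [f for f in files if not any(fnmatch(f, p) for p in ignore_patterns)]
print(kept)  # .safetensors weights, tokenizer files, config files, etc.
```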


### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Make sure you added your Hugging Face access token to your organization's secrets as `hf_access_token`.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy!

```sh
truss push --trusted
```

### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
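
You can also invoke the deployed model over HTTP. A minimal sketch with `requests`, assuming a Baseten-hosted deployment; the URL shape and the `MODEL_ID` and `BASETEN_API_KEY` values are placeholders to replace with the details from your model dashboard:

```python
import requests

MODEL_ID = "MODEL_ID"          # placeholder: your Baseten model ID
BASETEN_API_KEY = "API_KEY"    # placeholder: your Baseten API key

resp = requests.post(
    f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
    headers={"Authorization": f"Api-Key {BASETEN_API_KEY}"},
    json={"prompt": "What is a large language model?"},
)
print(resp.json()["response"])
```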

<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
- accelerate==0.21.0
- safetensors==0.3.2
- torch==2.0.1
- transformers==4.30.2
hf_cache:
  - repo_id: "meta-llama/Llama-2-7b-chat-hf"
    ignore_patterns:
      - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: true
  accelerator: A10G
secrets: {}
```

```python model/model.py
from typing import Dict, List
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        return {"response": response}

def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>
96 changes: 95 additions & 1 deletion docs/examples/performance/vllm-server.mdx
@@ -1,4 +1,98 @@
---
title: Serve models with vLLM
description: "Deploy a language model using vLLM"
---

[vLLM](https://github.com/vllm-project/vllm) is a Python-based package that optimizes the Attention layer in Transformer models. By better allocating memory used during the attention computation, vLLM can reduce the memory footprint of a model and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We're going to walk through deploying a vLLM-optimized [OPT-125M model](https://huggingface.co/facebook/opt-125m).
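
For a sense of what vLLM does under the hood, here is a minimal offline-inference sketch using the `vllm` Python API directly. Truss runs vLLM's OpenAI-compatible server for you, so you won't write this code in this tutorial; it's only for illustration:

```python
from vllm import LLM, SamplingParams

# Load OPT-125M into vLLM's engine.
llm = LLM(model="facebook/opt-125m")

# Sample up to 64 new tokens per prompt.
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What is a large language model?"], params)

for output in outputs:
    print(output.outputs[0].text)
```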

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model

### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:
1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `Completions` endpoint.
3. Give your model a name like `OPT-125M`.

<Note>
The underlying server that we use is OpenAI compatible. If you plan on using the model as a chat model, then select `ChatCompletion`. OPT-125M is not a chat model, so we selected `Completions` (see the payload sketch below).
</Note>
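
The practical difference is the request body each endpoint expects: completions-style requests take a bare prompt, while chat-style requests take a list of role-tagged messages. A rough illustration of the two payload shapes:

```python
# Completions-style payload: a single prompt string.
completions_request = {"prompt": "What is a large language model?"}

# ChatCompletions-style payload: a list of role-tagged messages.
chat_completions_request = {
    "messages": [
        {"role": "user", "content": "What is a large language model?"}
    ]
}
```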

Finally, navigate to the directory:

```sh
cd opt125
```

### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```
Also notice the `build` key, which specifies the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server, which you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py).

### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M vLLM model.

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```

<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>