Commit fd9b905

* tutorial refactor
* CLI reference
* usage
* user guide
* fix up 3 tutorials
* 2 more tutorials
* VLLM tutorial

1 parent 5f0fc96 · 31 changed files with 1,382 additions and 383 deletions
---
title: Example foundation models
description: "Step-by-step packaging instructions"
---

<CardGroup cols={3}>
  <Card title="Llama-2" icon="horse" href="/examples/models/llama-2">
    A commercially-licensed LLM by Meta
  </Card>
  <Card title="Stable Diffusion XL" icon="palette" href="/examples/models/sdxl">
    A text-to-image model by Stability AI
  </Card>
  <Card title="Whisper" icon="ear-listen" href="/examples/models/whisper">
    An audio transcription model by OpenAI
  </Card>
</CardGroup>

<Card title="More examples on GitHub" icon="github" href="https://github.com/basetenlabs/truss-examples">
  See Trusses for dozens of models on GitHub.
</Card>
---
title: Load cached model weights
description: "Deploy a model with private Hugging Face weights"
---

In this example, we will cover how you can use the `hf_cache` key in your Truss's `config.yaml` to automatically bundle model weights from a private Hugging Face repo.

<Tip>
Bundling model weights can significantly reduce cold start times because your instance won't waste time downloading the model weights from Hugging Face's servers.
</Tip>

We use `Llama-2-7b`, a popular open-source large language model, as an example. To follow along, you need access to Llama 2:

1. First, [sign up for a Hugging Face account](https://huggingface.co/join) if you don't already have one.
2. Request access to Llama 2 from [Meta's website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/).
3. Next, request access to Llama 2 on [Hugging Face](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) by clicking the "Request access" button on the model page.

<Tip>
If you want to deploy on Baseten, you also need to create a Hugging Face API token and add it to your organization's secrets:

1. [Create a Hugging Face API token](https://huggingface.co/settings/tokens) and copy it to your clipboard.
2. Add the token with the key `hf_access_token` to [your organization's secrets](https://app.baseten.co/settings/secrets) on Baseten.
</Tip>

### Step 0: Initialize Truss

Get started by creating a new Truss:

```sh
truss init llama-2-7b-chat
```

Select the `TrussServer` option, then hit `y` to confirm Truss creation. Then navigate to the newly created directory:

```sh
cd llama-2-7b-chat
```
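
Running `truss init` scaffolds a working directory for you. The exact scaffolding can vary between Truss versions, but the two files this tutorial edits should be there, roughly like so:

```text
llama-2-7b-chat/
├── config.yaml    # model server configuration (Steps 2 and 3)
└── model/
    └── model.py   # the Model class (Step 1)
```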

### Step 1: Implement Llama 2 7B in Truss

Next, we'll fill out the `model.py` file to implement Llama 2 7B in Truss.

In `model/model.py`, we write the class `Model` with three member functions:

* `__init__`, which creates an instance of the object and initializes properties like the model and tokenizer
* `load`, which runs once when the model server is spun up and loads the model weights and tokenizer from Hugging Face
* `predict`, which runs each time the model is invoked and handles the inference. It can use any JSON-serializable type as input and output.

We will also create a helper function `format_prompt` outside of the `Model` class to appropriately format the incoming text according to the Llama 2 specification.

[Read the quickstart guide](/quickstart) for more details on `Model` class implementation.

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        # Runs once at server startup to load the weights and tokenizer.
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)

        inputs = self.tokenizer(prompt, return_tensors="pt")

        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

        return {"response": response}


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```
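
As a quick illustration of what `format_prompt` produces (this is just the formatted string for an assumed user prompt, not actual model output), a request with the prompt "What is a large language model?" is wrapped roughly like this before tokenization:

```text
[INST] <<SYS>>
 You are a helpful, respectful and honest assistant. 
<</SYS>>

 What is a large language model? [/INST]
```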

### Step 2: Set Python dependencies

Now, we can turn our attention to configuring the model server in `config.yaml`.

In addition to `transformers`, Llama 2 has three other dependencies. We list them below:

```yaml config.yaml
requirements:
  - accelerate==0.21.0
  - safetensors==0.3.2
  - torch==2.0.1
  - transformers==4.30.2
```

<Note>
Always pin exact versions for your Python dependencies. The ML/AI space moves fast, so you want to have an up-to-date version of each package while also being protected from breaking changes.
</Note>

### Step 3: Configure Hugging Face caching

Finally, we can configure Hugging Face caching in `config.yaml` by adding the `hf_cache` key. When building the image for your Llama 2 deployment, the Llama 2 model weights will be downloaded and cached for future use.

```yaml config.yaml
hf_cache:
  - repo_id: "meta-llama/Llama-2-7b-chat-hf"
    ignore_patterns:
      - "*.bin"
```

In this configuration:

- `meta-llama/Llama-2-7b-chat-hf` is the `repo_id`, pointing to the exact model to cache.
- We use a wildcard under `ignore_patterns` to skip all `.bin` files in the model directory. This is because the model weights are stored in both `.bin` and `safetensors` format, and we only want to cache the `safetensors` files.

### Step 4: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step. Make sure you added your Hugging Face access token with the key `hf_access_token` to your organization's secrets.
</Note>

We have successfully packaged Llama 2 as a Truss. Let's deploy! The `--trusted` flag gives the deployed model access to secrets like `hf_access_token`:

```sh
truss push --trusted
```

### Step 5: Invoke the model

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```
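
Because `predict` returns `{"response": response}`, the output is a JSON object with a single `response` key. The completion text below is a made-up placeholder, not real model output:

```json
{
  "response": "... A large language model is a neural network trained on vast amounts of text ..."
}
```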

<RequestExample>

```yaml config.yaml
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: null
python_version: py39
requirements:
  - accelerate==0.21.0
  - safetensors==0.3.2
  - torch==2.0.1
  - transformers==4.30.2
hf_cache:
  - repo_id: "NousResearch/Llama-2-7b-chat-hf"
    ignore_patterns:
      - "*.bin"
resources:
  cpu: "4"
  memory: 30Gi
  use_gpu: true
  accelerator: A10G
secrets: {}
```

```python model/model.py
from typing import Dict, List

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

DEFAULT_SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"


class Model:
    def __init__(self, **kwargs) -> None:
        self._data_dir = kwargs["data_dir"]
        self._config = kwargs["config"]
        self._secrets = kwargs["secrets"]
        self.model = None
        self.tokenizer = None

    def load(self):
        self.model = LlamaForCausalLM.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"],
            torch_dtype=torch.float16,
            device_map="auto"
        )
        self.tokenizer = LlamaTokenizer.from_pretrained(
            "meta-llama/Llama-2-7b-chat-hf",
            use_auth_token=self._secrets["hf_access_token"]
        )

    def predict(self, request: Dict) -> Dict[str, List]:
        prompt = request.pop("prompt")
        prompt = format_prompt(prompt)
        inputs = self.tokenizer(prompt, return_tensors="pt")
        outputs = self.model.generate(**inputs, do_sample=True, num_beams=1, max_new_tokens=100)
        response = self.tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
        return {"response": response}


def format_prompt(prompt: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    return f"{B_INST} {B_SYS} {system_prompt} {E_SYS} {prompt} {E_INST}"
```

</RequestExample>
---
title: Serve models with vLLM
description: "Deploy a language model using vLLM"
---

[vLLM](https://github.com/vllm-project/vllm) is a Python-based package that optimizes the Attention layer in Transformer models. By better allocating memory used during the attention computation, vLLM can reduce the memory footprint of a model and significantly improve inference speed. Truss supports vLLM out of the box, so you can deploy vLLM-optimized models with ease. We're going to walk through deploying a vLLM-optimized [OPT-125M model](https://huggingface.co/facebook/opt-125m).

<Tip>
You can see the config for the finished model on the right. Keep reading for step-by-step instructions on how to generate it.
</Tip>

This example will cover:

1. Generating the base Truss
2. Setting sufficient model resources for inference
3. Deploying the model

### Step 1: Generating the base Truss

Get started by creating a new Truss:

```sh
truss init opt125
```

You're going to see a couple of prompts. Follow along with the instructions below:

1. Type `facebook/opt-125M` when prompted for `model`.
2. Press the `tab` key when prompted for `endpoint`. Select the `Completions` endpoint.
3. Give your model a name like `OPT-125M`.

<Note>
The underlying server that we use is OpenAI compatible. If you plan on using the model as a chat model, select `ChatCompletion` instead. OPT-125M is not a chat model, so we selected `Completions`.
</Note>

Finally, navigate to the directory:

```sh
cd opt125
```

### Step 2: Setting resources and other arguments

You'll notice that there's a `config.yaml` in the new directory. This is where we'll set the resources and other arguments for the model. Open the file in your favorite editor.

OPT-125M will need a GPU, so let's set the correct resources. Update the `resources` key with the following:

```yaml config.yaml
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
```

Also notice the `build` key, which contains the `model_server` we're using as well as other arguments. These arguments are passed to the underlying vLLM server, which you can find [here](https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py). A sketch of this key is shown below.
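
Based on the choices made during `truss init` above (and matching the full config shown on the right), the `build` key should look roughly like this:

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
```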

### Step 3: Deploy the model

<Note>
You'll need a [Baseten API key](https://app.baseten.co/settings/account/api_keys) for this step.
</Note>

Let's deploy our OPT-125M vLLM model:

```sh
truss push
```

You can invoke the model with:

```sh
truss predict -d '{"prompt": "What is a large language model?"}'
```

<RequestExample>

```yaml config.yaml
build:
  arguments:
    endpoint: Completions
    model: facebook/opt-125M
  model_server: VLLM
environment_variables: {}
external_package_dirs: []
model_metadata: {}
model_name: OPT-125M
python_version: py39
requirements: []
resources:
  accelerator: T4
  cpu: "4"
  memory: 16Gi
  use_gpu: true
secrets: {}
system_packages: []
```

</RequestExample>