Merge pull request #83 from anyscale/remove-nfs
Remove NFS mentions, update LLM serving template to use ipynb
ericl authored Feb 21, 2024
2 parents b4f4032 + b1c84cd commit 5768dfd
Showing 4 changed files with 217 additions and 64 deletions.
168 changes: 168 additions & 0 deletions templates/endpoints_v2/README.ipynb
@@ -0,0 +1,168 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Endpoints - Deploy, configure, and serve LLMs \n",
"\n",
"The guide below walks you through the steps required for deployment of LLM endpoints. Based on Ray Serve and RayLLM, the foundation for [Anyscale-Hosted Endpoints](http://anyscale.com/endpoints), the Endpoints template provides an easy to configure solution for ML Platform teams, Infrastructure engineers, and Developers who want to deploy optimized LLMs in production. We have provided a number of examples for popular open-source models (Llama2, Mistral, Mixtral, embedding models, and more) with different GPU accelerator and tensor-parallelism configurations in the `models` directory. \n",
"\n",
"# Step 1 - Run the model locally in the Workspace\n",
"\n",
"The llm-serve.yaml file in this example runs the Mistral-7B model. There are 2 important configurations you would need to modify:\n",
"1. The `models` config in `llm-serve.yaml` contains a list of YAML files for the models you want to deploy. You can run any of the models in the `models` directory or define your own model YAML file and run that instead. All config files follow the naming convention `{model_name}_{accelerator_type}_{tensor_parallelism}`. Follow the CustomModels [guide](CustomModels.md) for bringing your own models.\n",
"2. `HUGGING_FACE_HUB_TOKEN` - The Meta Llama-2 family of models need the HUGGING_FACE_HUB_TOKEN variable to be set to a Hugging Face Access Token for an account with permissions to download the model.\n",
"\n",
"From the terminal use the Ray Serve CLI to deploy a model. It will be run locally in this workspace's cluster:"
]
},
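{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before running the serve command below: if you are deploying a gated model such as one of the Meta Llama-2 family, set your Hugging Face token first. The following cell is a minimal sketch with a hypothetical placeholder token; it assumes the model config reads `HUGGING_FACE_HUB_TOKEN` from the environment inherited by `serve run`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Hypothetical placeholder -- replace with a real Hugging Face access token.\n",
"# Commands launched from this notebook (such as `serve run` below) inherit this environment variable.\n",
"os.environ[\"HUGGING_FACE_HUB_TOKEN\"] = \"<YOUR_HF_TOKEN>\""
]
},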
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deploy the Mistral-7b model locally in the workspace.\n",
"\n",
"!serve run --non-blocking llm-serve.yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Step 2 - Query the model\n",
"\n",
"Once deployed you can use the OpenAI SDK to interact with the models, ensuring an easy integration for your applications.\n",
"\n",
"Run the following command to query. You should get the following output:\n",
"```\n",
"The top rated restaurants in San Francisco include:\n",
" • Chez Panisse\n",
" • Momofuku Noodle Bar\n",
" • Nopa\n",
" • Saison\n",
" • Mission Chinese Food\n",
" • Sushi Nakazawa\n",
" • The French Laundry\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the local service we just deployed.\n",
"\n",
"!python llm-query.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Endpoints uses an OpenAI-compatible API, allowing us to use the OpenAI SDK to access Endpoint backends."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
" base_url=\"http://localhost:8000/v1\",\n",
" api_key=\"NOT A REAL KEY\",\n",
")\n",
"\n",
"# List all models.\n",
"models = client.models.list()\n",
"print(models)\n",
"\n",
"# Note: not all arguments are currently supported and will be ignored by the backend.\n",
"chat_completion = client.chat.completions.create(\n",
" model=\"mistralai/Mistral-7B-Instruct-v0.1\",\n",
" messages=[\n",
" {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
" {\"role\": \"user\", \"content\": \"What are some of the highest rated restaurants in San Francisco?'.\"},\n",
" ],\n",
" temperature=0.01\n",
")\n",
"\n",
"print(chat_completion)"
]
},
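{
"cell_type": "markdown",
"metadata": {},
"source": [
"Streaming is listed among the supported performance features, so you can also ask the OpenAI SDK to stream tokens as they are generated. The cell below is a sketch using the standard `stream=True` option of the OpenAI client; it assumes the deployed backend honors streaming for this model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Stream the response token by token instead of waiting for the full completion.\n",
"stream = client.chat.completions.create(\n",
"    model=\"mistralai/Mistral-7B-Instruct-v0.1\",\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
"        {\"role\": \"user\", \"content\": \"What are some of the highest rated restaurants in San Francisco?\"},\n",
"    ],\n",
"    temperature=0.01,\n",
"    stream=True,\n",
")\n",
"\n",
"for chunk in stream:\n",
"    # Each chunk carries a small delta of the generated text; role-only chunks have no content.\n",
"    content = chunk.choices[0].delta.content\n",
"    if content:\n",
"        print(content, end=\"\", flush=True)"
]
},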
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 3 - Deploying a production service\n",
"\n",
"To deploy an application with one model as an Anyscale Service you can run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deploy the serve app to production with a given service name.\n",
"\n",
"!serve deploy --name=my_service_name service.yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is setup to run the Mistral-7B model, but can be easily modified to run any of the other models in this repo.\n",
"\n",
"# Step 4 - Query the service endpoint\n",
"\n",
"In order to query the endpoint, you can modify the `llm-query.py` script, replacing the query url with the Service URL found in the Service UI.\n",
"\n",
"Note: please make sure to include the path \"/v1\" at the end of the Service url.\n",
"\n",
"# More Guides\n",
"\n",
"Endpoints makes it easy for LLM Developers to interact with OpenAI compatible APIs for their applications by providing an easy to manage backend for serving OSS LLMs. It does this by:\n",
"\n",
"- Providing an extensive suite of pre-configured open source LLMs and embedding models, with defaults that work out of the box. \n",
"- Simplifying the addition of new LLMs.\n",
"- Simplifying the deployment of multiple LLMs\n",
"- Offering unique autoscaling support, including scale-to-zero.\n",
"- Fully supporting multi-GPU & multi-node model deployments.\n",
"- Offering high performance features like continuous batching, quantization and streaming.\n",
"- Providing a REST API that is similar to OpenAI's to make it easy to migrate and integrate with other tools.\n",
"\n",
"Look at the following guides for more advanced use-cases -\n",
"* [Deploy models for embedding generation](EmbeddingModels.md)\n",
"* [Learn how to bring your own models](CustomModels.md)\n",
"* [Deploy multiple LoRA fine-tuned models](DeployLora.md)\n",
"* [Deploy Function calling models](DeployFunctionCalling.md)\n",
"* [Learn how to leverage different configurations that can optimize the latency and throughput of your models](OptimizeModels.md)\n",
"* [Learn how to fully configure your deployment including auto-scaling, optimization parameters and tensor-parallelism](AdvancedModelConfigs.md)\n",
"\n",
"# Application Examples\n",
"See examples of building applications with your deployed endpoint on the [Anyscale Endpoints](https://docs.endpoints.anyscale.com/category/examples) page.\n",
"\n",
"Be sure to update the api_base and token for your private deployment. This can be found under the \"Serve deployments\" tab on the \"Query\" button when deploying on your Workspace.\n"
]
}
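,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of querying your private deployment, the cell below points the OpenAI client at the production Service instead of localhost. The URL and bearer token values are hypothetical placeholders; copy the real values from your deployed Service (for example, via the \"Query\" button) and keep the \"/v1\" suffix."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"# Hypothetical placeholders -- replace with your Service URL and bearer token.\n",
"service_client = OpenAI(\n",
"    base_url=\"https://<YOUR_SERVICE_URL>/v1\",\n",
"    api_key=\"<YOUR_SERVICE_BEARER_TOKEN>\",\n",
")\n",
"\n",
"response = service_client.chat.completions.create(\n",
"    model=\"mistralai/Mistral-7B-Instruct-v0.1\",\n",
"    messages=[{\"role\": \"user\", \"content\": \"What are some of the highest rated restaurants in San Francisco?\"}],\n",
"    temperature=0.01,\n",
")\n",
"print(response.choices[0].message.content)"
]
}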
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}
45 changes: 29 additions & 16 deletions templates/endpoints_v2/README.md
@@ -2,29 +2,28 @@

The guide below walks you through the steps required to deploy LLM endpoints. Based on Ray Serve and RayLLM, the foundation for [Anyscale-Hosted Endpoints](http://anyscale.com/endpoints), the Endpoints template provides an easy-to-configure solution for ML platform teams, infrastructure engineers, and developers who want to deploy optimized LLMs in production. We have provided a number of examples for popular open-source models (Llama2, Mistral, Mixtral, embedding models, and more) with different GPU accelerator and tensor-parallelism configurations in the `models` directory.

# Step 1 - Deploy the model on Workspace
# Step 1 - Run the model locally in the Workspace

The `llm-serve.yaml` file in this example runs the Mistral-7B model. There are two important configurations you may need to modify:
1. The `models` config in `llm-serve.yaml` contains a list of YAML files for the models you want to deploy. You can run any of the models in the `models` directory or define your own model YAML file and run that instead. All config files follow the naming convention `{model_name}_{accelerator_type}_{tensor_parallelism}`. Follow the CustomModels [guide](CustomModels.md) for bringing your own models.
2. `HUGGING_FACE_HUB_TOKEN` - The Meta Llama-2 family of models requires the `HUGGING_FACE_HUB_TOKEN` environment variable to be set to a Hugging Face access token for an account with permissions to download the model.
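
Before starting the serve app below: if you are deploying a gated model such as one of the Meta Llama-2 family, set your Hugging Face token first. The snippet below is a minimal sketch with a hypothetical placeholder token; it assumes the model config reads `HUGGING_FACE_HUB_TOKEN` from the environment:

```python
import os

# Hypothetical placeholder -- replace with a real Hugging Face access token.
os.environ["HUGGING_FACE_HUB_TOKEN"] = "<YOUR_HF_TOKEN>"
```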

From the terminal use the Ray Serve CLI to deploy a model:
From the terminal, use the Ray Serve CLI to deploy a model. It will run locally on this workspace's cluster:

```shell
# Deploy the Mistral-7b model.

serve run llm-serve.yaml
```python
# Deploy the Mistral-7b model locally in the workspace.

!serve run --non-blocking llm-serve.yaml
```


# Step 2 - Query the model

Once deployed you can use the OpenAI SDK to interact with the models, ensuring an easy integration for your applications. Run the following command in a separate terminal to query.
Once deployed, you can use the OpenAI SDK to interact with the models, ensuring easy integration for your applications.

```shell
python llm-query.py
Run the following command to query the model. You should see output similar to the following:
```
```text
Output:
The top rated restaurants in San Francisco include:
• Chez Panisse
• Momofuku Noodle Bar
@@ -35,8 +34,16 @@ The top rated restaurants in San Francisco include:
• The French Laundry
```


```python
# Query the local service we just deployed.

!python llm-query.py
```

Endpoints uses an OpenAI-compatible API, allowing us to use the OpenAI SDK to access Endpoint backends.


```python
from openai import OpenAI

@@ -52,20 +59,25 @@ print(models)
# Note: not all arguments are currently supported and will be ignored by the backend.
chat_completion = client.chat.completions.create(
model="mistralai/Mistral-7B-Instruct-v0.1",
messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are some of the highest rated restaurants in San Francisco?'."}],
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What are some of the highest rated restaurants in San Francisco?'."},
],
temperature=0.01
)

print(chat_completion)

```

# Step 3 - Deploying a production service - TODO : Update with new CLI
# Step 3 - Deploying a production service

To deploy an application with one model as an Anyscale Service you can run:

To deploy an application with one model on an Anyscale Service you can run:

```shell
anyscale service rollout -f service.yaml --name {ENTER_NAME_FOR_SERVICE_HERE}
```python
# Deploy the serve app to production with a given service name.

!serve deploy --name=my_service_name service.yaml
```

This is set up to run the Mistral-7B model, but it can easily be modified to run any of the other models in this repo.
@@ -100,3 +112,4 @@ Look at the following guides for more advanced use-cases -
See examples of building applications with your deployed endpoint on the [Anyscale Endpoints](https://docs.endpoints.anyscale.com/category/examples) page.

Be sure to update the `api_base` and token for your private deployment. These can be found via the "Query" button under the "Serve deployments" tab when deploying on your Workspace.
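
As a minimal sketch of querying your private deployment (with hypothetical placeholder values for the URL and token), point the OpenAI client at the Service URL and keep the "/v1" suffix:

```python
from openai import OpenAI

# Hypothetical placeholders -- replace with your Service URL and bearer token.
client = OpenAI(
    base_url="https://<YOUR_SERVICE_URL>/v1",
    api_key="<YOUR_SERVICE_BEARER_TOKEN>",
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "What are some of the highest rated restaurants in San Francisco?"}],
    temperature=0.01,
)
print(response.choices[0].message.content)
```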

39 changes: 10 additions & 29 deletions templates/intro-workspaces/README.ipynb
@@ -1,4 +1,3 @@

{
"cells": [
{
@@ -10,7 +9,7 @@
"Welcome! You are currently in a Workspace, which is a persistent cloud IDE connected to a Ray cluster.\n",
"\n",
"In this tutorial, you will learn:\n",
"1. Basic workspace features such as git repo persistence, NFS mounts, cloud storage, and SSH authentication.\n",
"1. Basic workspace features such as git repo persistence, cloud storage, and SSH authentication.\n",
"2. Ray cluster management features, such as adding multiple worker nodes.\n",
"3. Ray monitoring features such as viewing tasks in the dashboard.\n",
"4. Dependency management.\n",
@@ -81,31 +80,11 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"### NFS Mounts\n",
"### Cloud Storage\n",
"\n",
"Workspace local storage is limited to 1GB, so we recommend only using it to store git repos and smaller files. To persist larger files, you can save data to NFS mounts and cloud storage.\n",
"Workspace local storage is limited to 1GB, so we recommend only using it to store git repos and smaller files. To persist larger files, you can save data to cloud storage.\n",
"\n",
"Here are a few handy NFS mounts included:\n",
"- `/mnt/shared_storage` is a mount shared across all users of your organization\n",
"- `/mnt/user_storage` is a mount for your user account\n",
"\n",
"NFS storage can be read and written from the workspace, as well as from any node in the Ray cluster:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!echo \"hello world\" > /mnt/user_storage/persisted_file.txt && cat /mnt/user_storage/persisted_file.txt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Cloud Storage\n",
"Cloud storage can be read and written from the workspace, as well as from any node in the Ray cluster.\n",
"\n",
"Access built-in cloud storage using the `$ANYSCALE_ARTIFACT_STORAGE` URI as a prefix:"
]
@@ -116,7 +95,8 @@
"metadata": {},
"outputs": [],
"source": [
"!aws s3 cp /mnt/user_storage/persisted_file.txt $ANYSCALE_ARTIFACT_STORAGE/persisted_object.txt"
"# Note: \"gsutil cp\" instead of \"aws s3 cp\" in GCP clouds.\n",
"!echo \"hello world\" > /tmp/input.txt && aws s3 cp /tmp/input.txt $ANYSCALE_ARTIFACT_STORAGE/saved.txt"
]
},
{
@@ -125,7 +105,8 @@
"metadata": {},
"outputs": [],
"source": [
"!aws s3 cp $ANYSCALE_ARTIFACT_STORAGE/persisted_object.txt /tmp/object.txt && cat /tmp/object.txt"
"# Note: \"gsutil cp\" instead of \"aws s3 cp\" in GCP clouds.\n",
"!aws s3 cp $ANYSCALE_ARTIFACT_STORAGE/saved.txt /tmp/output.txt && cat /tmp/output.txt"
]
},
{
@@ -157,9 +138,9 @@
"<img src=\"assets/add-node-type.png\" height=300px/>\n",
"<img src=\"assets/add-node-dialog.png\" height=300px/>\n",
"\n",
"### Using \"Auto\" workers mode\n",
"### Using \"Auto-select workers\" mode\n",
"\n",
"To let Ray automatically select what kind of worker nodes to add to the cluster, check the \"Auto-select machines\" box. Ray will try to autoscale cluster worker nodes to balance cost and performance. In auto mode, you cannot configure worker node types, but the resources panel will show which node types have been launched.\n",
"To let Ray automatically select what kind of worker nodes to add to the cluster, check the \"Auto-select workers\" box. Ray will try to autoscale cluster worker nodes to balance cost and performance. In auto mode, you cannot configure worker node types, but the resources panel will show which node types have been launched.\n",
"\n",
"We recommend using auto mode if you do not have specific cluster requirements, and are ok with waiting for the autoscaler to add nodes on-demand to the cluster."
]