Merge pull request #83 from anyscale/remove-nfs
Remove NFS mentions, update LLM serving template to use ipynb
Showing 4 changed files with 217 additions and 64 deletions.
@@ -0,0 +1,168 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Endpoints - Deploy, configure, and serve LLMs\n",
"\n",
"The guide below walks you through the steps required to deploy LLM endpoints. Based on Ray Serve and RayLLM, the foundation for [Anyscale-Hosted Endpoints](http://anyscale.com/endpoints), the Endpoints template provides an easy-to-configure solution for ML platform teams, infrastructure engineers, and developers who want to deploy optimized LLMs in production. The `models` directory provides a number of examples for popular open-source models (Llama 2, Mistral, Mixtral, embedding models, and more) with different GPU accelerator and tensor-parallelism configurations.\n",
"\n",
"# Step 1 - Run the model locally in the Workspace\n",
"\n",
"The `llm-serve.yaml` file in this example runs the Mistral-7B model. There are two important configurations you may need to modify:\n",
"1. The `models` config in `llm-serve.yaml` contains a list of YAML files for the models you want to deploy (see the sketch below). You can run any of the models in the `models` directory or define your own model YAML file and run that instead. All config files follow the naming convention `{model_name}_{accelerator_type}_{tensor_parallelism}`. Follow the CustomModels [guide](CustomModels.md) to bring your own models.\n",
"2. `HUGGING_FACE_HUB_TOKEN` - The Meta Llama-2 family of models requires the `HUGGING_FACE_HUB_TOKEN` variable to be set to a Hugging Face access token for an account with permission to download the model.\n",
"\n",
"From the terminal use the Ray Serve CLI to deploy a model. It will be run locally in this workspace's cluster:" | ||
] | ||
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deploy the Mistral-7B model locally in the workspace.\n",
"\n",
"!serve run --non-blocking llm-serve.yaml"
]
},
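{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deployment can take a few minutes while model weights download and replicas start. One way to check on it is the `serve status` command from the standard Ray Serve CLI (not specific to this template; output format may vary across Ray versions):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Check the status of the locally running Serve application.\n",
"\n",
"!serve status"
]
},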
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Step 2 - Query the model\n",
"\n",
"Once deployed, you can use the OpenAI SDK to interact with the models, making it easy to integrate with your applications.\n",
"\n",
"Run the following command to query the service. You should get output similar to the following:\n",
"```\n",
"The top rated restaurants in San Francisco include:\n",
" • Chez Panisse\n",
" • Momofuku Noodle Bar\n",
" • Nopa\n",
" • Saison\n",
" • Mission Chinese Food\n",
" • Sushi Nakazawa\n",
" • The French Laundry\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the local service we just deployed.\n",
"\n",
"!python llm-query.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Endpoints uses an OpenAI-compatible API, so you can use the OpenAI SDK to access Endpoints backends directly."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from openai import OpenAI\n",
"\n",
"client = OpenAI(\n",
"    base_url=\"http://localhost:8000/v1\",\n",
"    api_key=\"NOT A REAL KEY\",\n",
")\n",
"\n",
"# List all models.\n",
"models = client.models.list()\n",
"print(models)\n",
"\n",
"# Note: not all arguments are currently supported; unsupported ones are ignored by the backend.\n",
"chat_completion = client.chat.completions.create(\n",
"    model=\"mistralai/Mistral-7B-Instruct-v0.1\",\n",
"    messages=[\n",
"        {\"role\": \"system\", \"content\": \"You are a helpful assistant.\"},\n",
"        {\"role\": \"user\", \"content\": \"What are some of the highest rated restaurants in San Francisco?\"},\n",
"    ],\n",
"    temperature=0.01\n",
")\n",
"\n",
"print(chat_completion)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Step 3 - Deploying a production service\n",
"\n",
"To deploy an application with one model as an Anyscale Service, you can run:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Deploy the serve app to production with a given service name.\n",
"\n",
"!serve deploy --name=my_service_name service.yaml"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is set up to run the Mistral-7B model, but it can easily be modified to run any of the other models in this repo.\n",
"\n",
"# Step 4 - Query the service endpoint\n",
"\n",
"To query the endpoint, modify the `llm-query.py` script, replacing the query URL with the Service URL found in the Service UI.\n",
"\n",
"Note: make sure to include the path \"/v1\" at the end of the Service URL. For illustration, the modified client setup might look like the sketch below; the URL and token values are placeholders, so copy the real ones from the Service UI:\n",
"\n", | ||
"# More Guides\n", | ||
"\n", | ||
"Endpoints makes it easy for LLM Developers to interact with OpenAI compatible APIs for their applications by providing an easy to manage backend for serving OSS LLMs. It does this by:\n", | ||
"\n", | ||
"- Providing an extensive suite of pre-configured open source LLMs and embedding models, with defaults that work out of the box. \n", | ||
"- Simplifying the addition of new LLMs.\n", | ||
"- Simplifying the deployment of multiple LLMs\n", | ||
"- Offering unique autoscaling support, including scale-to-zero.\n", | ||
"- Fully supporting multi-GPU & multi-node model deployments.\n", | ||
"- Offering high performance features like continuous batching, quantization and streaming.\n", | ||
"- Providing a REST API that is similar to OpenAI's to make it easy to migrate and integrate with other tools.\n", | ||
"\n", | ||
"Look at the following guides for more advanced use-cases -\n", | ||
"* [Deploy models for embedding generation](EmbeddingModels.md)\n", | ||
"* [Learn how to bring your own models](CustomModels.md)\n", | ||
"* [Deploy multiple LoRA fine-tuned models](DeployLora.md)\n", | ||
"* [Deploy Function calling models](DeployFunctionCalling.md)\n", | ||
"* [Learn how to leverage different configurations that can optimize the latency and throughput of your models](OptimizeModels.md)\n", | ||
"* [Learn how to fully configure your deployment including auto-scaling, optimization parameters and tensor-parallelism](AdvancedModelConfigs.md)\n", | ||
"\n", | ||
"# Application Examples\n", | ||
"See examples of building applications with your deployed endpoint on the [Anyscale Endpoints](https://docs.endpoints.anyscale.com/category/examples) page.\n", | ||
"\n", | ||
"Be sure to update the api_base and token for your private deployment. This can be found under the \"Serve deployments\" tab on the \"Query\" button when deploying on your Workspace.\n" | ||
] | ||
} | ||
],
"metadata": {
"language_info": {
"name": "python"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}