diff --git a/examples/Xinference_Quick_Start.ipynb b/examples/Xinference_Quick_Start.ipynb index e7db75268f..e0e04db15a 100644 --- a/examples/Xinference_Quick_Start.ipynb +++ b/examples/Xinference_Quick_Start.ipynb @@ -2,27 +2,30 @@ "cells": [ { "cell_type": "markdown", - "source": [ - "> **NOTE**: This tutorial will demonstrate how to utilize the GPU provided by Colab to run LLM with Xinference local server, and how to interact with the model in different ways (OpenAI-Compatible endpoints/Xinference's builtin Client/LangChain).\n" - ], "metadata": { "id": "WoegBf2gjiW4" - } + }, + "source": [ + "> **NOTE**: This tutorial will demonstrate how to utilize the GPU provided by Colab to run LLM with Xinference local server, and how to interact with the model in different ways (OpenAI-Compatible endpoints/Xinference's builtin Client/LangChain).\n" + ] }, { "cell_type": "markdown", + "metadata": { + "id": "FAhwDgtUGIEo" + }, "source": [ "# Xinference\n", "\n", "Xorbits Inference (Xinference) is an open-source platform to streamline the operation and integration of a wide array of AI models. With Xinference, you’re empowered to run inference using any open-source LLMs, embedding models, and multimodal models either in the cloud or on your own premises, and create robust AI-driven applications.\n", "\n" - ], - "metadata": { - "id": "FAhwDgtUGIEo" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "WzJegrOpGH4N" + }, "source": [ "\n", "* [Docs](https://inference.readthedocs.io/en/latest/index.html)\n", @@ -30,48 +33,31 @@ "* [Custom Models](https://inference.readthedocs.io/en/latest/models/custom.html)\n", "* [Deployment Docs](https://inference.readthedocs.io/en/latest/getting_started/using_xinference.html)\n", "* [Examples and Tutorials](https://inference.readthedocs.io/en/latest/examples/index.html)\n" - ], - "metadata": { - "id": "WzJegrOpGH4N" - } + ] }, { "cell_type": "markdown", + "metadata": { + "id": "LovckG0kGr9j" + }, "source": [ "## Set up the environment\n", "\n", "> **NOTE**: We recommend you run this demo on a GPU. 
To change the runtime type: In the toolbar menu, click **Runtime** > **Change runtime type** > **Select the GPU (T4)**\n"
-   ],
-   "metadata": {
-    "id": "LovckG0kGr9j"
-   }
+   ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "### Check memory and GPU resources"
-   ],
   "metadata": {
    "id": "bPDeDltCGABt"
-   }
+   },
+   "source": [
+    "### Check memory and GPU resources"
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "import psutil\n",
-    "import torch\n",
-    "\n",
-    "\n",
-    "ram = psutil.virtual_memory()\n",
-    "ram_total = ram.total / (1024**3)\n",
-    "print('RAM: %.2f GB' % ram_total)\n",
-    "\n",
-    "print('=============GPU INFO=============')\n",
-    "if torch.cuda.is_available():\n",
-    " !/opt/bin/nvidia-smi || ture\n",
-    "else:\n",
-    " print('GPU NOT available')"
-   ],
+   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -79,11 +65,10 @@
    "id": "qhoItBBhF7uY",
    "outputId": "99209d5a-78a2-405b-c8b2-4b840ecc78f1"
   },
-   "execution_count": 1,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      "RAM: 12.67 GB\n",
      "=============GPU INFO=============\n",
@@ -109,16 +94,31 @@
      "+---------------------------------------------------------------------------------------+\n"
     ]
    }
+   ],
+   "source": [
+    "import psutil\n",
+    "import torch\n",
+    "\n",
+    "\n",
+    "ram = psutil.virtual_memory()\n",
+    "ram_total = ram.total / (1024**3)\n",
+    "print('RAM: %.2f GB' % ram_total)\n",
+    "\n",
+    "print('=============GPU INFO=============')\n",
+    "if torch.cuda.is_available():\n",
+    " !/opt/bin/nvidia-smi || true\n",
+    "else:\n",
+    " print('GPU NOT available')"
   ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "### Install Xinference and dependencies"
-   ],
   "metadata": {
    "id": "eFzlnU4gG_JL"
-   }
+   },
+   "source": [
+    "### Install Xinference and dependencies"
+   ]
  },
  {
   "cell_type": "code",
@@ -128,14 +128,12 @@
   },
   "outputs": [],
   "source": [
-    "!pip install -U -q xinference[transformers] openai langchain"
+    "%pip install -U -q typing_extensions==4.5.0 xinference[transformers] openai langchain"
   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "!pip show xinference"
-   ],
+   "execution_count": 3,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -143,11 +141,10 @@
    "id": "vPq_TWiRQCAA",
    "outputId": "4076a2ed-d42d-43ab-8e5d-cd57ffdab7ba"
   },
-   "execution_count": 3,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      "Name: xinference\n",
      "Version: 0.7.4.1\n",
@@ -161,58 +158,59 @@
      "Required-by: \n"
     ]
    }
+   ],
+   "source": [
+    "!pip show xinference"
   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "2EACA0GYHm2o"
+   },
   "source": [
    "## A Quick Start Demo\n",
    "### Start Local Server\n",
    "\n",
    "\n",
    "To start a local instance of Xinference, run `xinference-local` in the background via `nohup`:"
-   ],
-   "metadata": {
-    "id": "2EACA0GYHm2o"
-   }
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "!nohup xinference-local > xinference.log 2>&1 &"
-   ],
+   "execution_count": 4,
   "metadata": {
    "id": "5EM01Gq7IQ2y"
   },
-   "execution_count": 4,
-   "outputs": []
+   "outputs": [],
+   "source": [
+    "!nohup xinference-local > xinference.log 2>&1 &"
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "WXtUJSC3I3kh"
+   },
   "source": [
    "Congrats! You now have Xinference running on the Colab machine. The default host and port are 127.0.0.1 and 9997, respectively.\n",
    "\n",
    "\n",
    "Once Xinference is running, there are multiple ways we can try it: via the web UI, via cURL, via the command line, or via Xinference's Python client."
-   ],
-   "metadata": {
-    "id": "WXtUJSC3I3kh"
-   }
+   ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "The command line tool is `xinference`. You can list the commands that can be used by running:"
-   ],
   "metadata": {
    "id": "0mkyrGIHJekz"
-   }
+   },
+   "source": [
+    "The command line tool is `xinference`. You can list the available commands by running:"
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "!xinference --help"
-   ],
+   "execution_count": 5,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -220,11 +218,10 @@
    "id": "yayFuLIgJhYX",
    "outputId": "56a696ee-b37d-44b6-9f07-39f18cffd099"
   },
-   "execution_count": 5,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      "Usage: xinference [OPTIONS] COMMAND [ARGS]...\n",
      "\n",
@@ -250,35 +247,36 @@
      " unregister Unregisters a model from Xinference, removing it from...\n"
     ]
    }
+   ],
+   "source": [
+    "!xinference --help"
   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "wvhIEjcHKXc5"
+   },
   "source": [
    "### Run Qwen-Chat\n",
    "\n",
    "Xinference supports a variety of LLMs. Learn more at https://inference.readthedocs.io/en/latest/models/builtin/.\n",
    "\n",
    "Let’s start by running a built-in model: `Qwen-1_8B-Chat`.\n"
-   ],
-   "metadata": {
-    "id": "wvhIEjcHKXc5"
-   }
+   ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "We can specify the model’s UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate it. This create a new model instance with unique ID `my-llvm`:\n"
-   ],
   "metadata": {
    "id": "z7OyMw8sKjj6"
-   }
+   },
+   "source": [
+    "We can specify the model’s UID using the `--model-uid` or `-u` flag. If not specified, Xinference will generate it. This creates a new model instance with the unique ID `my-llm`:\n"
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "!xinference launch -u my-llm -n qwen-chat -s 1_8 -f pytorch"
-   ],
+   "execution_count": 6,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -286,74 +284,61 @@
    "id": "B_hQFqxOKiww",
    "outputId": "e46d4138-20c5-4238-e131-f48bd0cd6e94"
   },
-   "execution_count": 6,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      "Model uid: my-llm\n"
     ]
    }
+   ],
+   "source": [
+    "!xinference launch -u my-llm -n qwen-chat -s 1_8 -f pytorch"
  ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.\n"
-   ],
   "metadata": {
    "id": "Q--ic56eNDyo"
-   }
+   },
+   "source": [
+    "When you start a model for the first time, Xinference will download the model parameters from HuggingFace, which might take a few minutes depending on the size of the model weights. We cache the model files locally, so there’s no need to redownload them for subsequent starts.\n"
+   ]
  },
  {
   "cell_type": "markdown",
-   "source": [
-    "## Interact with the running model"
-   ],
   "metadata": {
    "id": "cfF-cCFlMCvE"
-   }
+   },
+   "source": [
+    "## Interact with the running model"
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "MYKNW0c-MONc"
+   },
   "source": [
    "Congrats! You now have the model running with Xinference. 
Once the model is running, we can try it out via the command line, via cURL, or via Xinference's Python client:\n",
    "\n"
-   ],
-   "metadata": {
-    "id": "MYKNW0c-MONc"
-   }
+   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "VfZZham7Lj3X"
+   },
   "source": [
    "### 1. Use the OpenAI-compatible endpoint\n",
    "\n",
    "Xinference provides OpenAI-compatible APIs for its supported models, so you can use Xinference as a local drop-in replacement for OpenAI APIs. For example:\n"
-   ],
-   "metadata": {
-    "id": "VfZZham7Lj3X"
-   }
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "import openai\n",
-    "\n",
-    "messages=[\n",
-    " {\n",
-    " \"role\": \"user\",\n",
-    " \"content\": \"Who are you?\"\n",
-    " }\n",
-    "]\n",
-    "\n",
-    "client = openai.Client(api_key=\"empty\", base_url=f\"http://0.0.0.0:9997/v1\")\n",
-    "client.chat.completions.create(\n",
-    " model=\"my-llm\",\n",
-    " messages=messages,\n",
-    ")"
-   ],
+   "execution_count": 7,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -361,40 +346,49 @@
    "id": "GOStrwtRLehN",
    "outputId": "3a67ba15-271a-4841-bdd2-d260a4ebb0ff"
   },
-   "execution_count": 7,
   "outputs": [
    {
-     "output_type": "execute_result",
     "data": {
      "text/plain": [
       "ChatCompletion(id='chat899575cc-aa0e-11ee-9dba-0242ac1c000c', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='I am an AI language model created by Alibaba Cloud. I have been trained on a vast amount of text data and can answer questions, provide suggestions, and engage in conversations with users. How may I assist you today?', role='assistant', function_call=None, tool_calls=None))], created=1704268990, model='my-llm', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=44, prompt_tokens=23, total_tokens=67))"
      ]
     },
+     "execution_count": 7,
     "metadata": {},
-     "execution_count": 7
+     "output_type": "execute_result"
    }
+   ],
+   "source": [
+    "import openai\n",
+    "\n",
+    "messages=[\n",
+    " {\n",
+    " \"role\": \"user\",\n",
+    " \"content\": \"Who are you?\"\n",
+    " }\n",
+    "]\n",
+    "\n",
+    "client = openai.Client(api_key=\"empty\", base_url=f\"http://0.0.0.0:9997/v1\")\n",
+    "client.chat.completions.create(\n",
+    " model=\"my-llm\",\n",
+    " messages=messages,\n",
+    ")"
   ]
  },
  {
   "cell_type": "markdown",
+   "metadata": {
+    "id": "AmYB-_K4aXnG"
+   },
   "source": [
    "### 2. Send request using cURL\n",
    "\n",
    "\n"
-   ],
-   "metadata": {
-    "id": "AmYB-_K4aXnG"
-   }
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "!curl -k -X 'POST' -N \\\n",
-    " 'http://127.0.0.1:9997/v1/chat/completions' \\\n",
-    " -H 'accept: application/json' \\\n",
-    " -H 'Content-Type: application/json' \\\n",
-    " -d '{ \"model\": \"my-llm\", \"messages\": [ {\"role\": \"system\", \"content\": \"You are a helpful assistant.\" }, {\"role\": \"user\", \"content\": \"What is the largest animal?\"} ]}'"
-   ],
+   "execution_count": 8,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -402,41 +396,35 @@
    "id": "n7VLGirDaaR3",
    "outputId": "ff2d2a17-1e7f-46dc-818f-72f520ee5607"
   },
-   "execution_count": 8,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      "{\"id\":\"chat8bd9a524-aa0e-11ee-9dba-0242ac1c000c\",\"object\":\"chat.completion\",\"created\":1704268994,\"model\":\"my-llm\",\"choices\":[{\"index\":0,\"message\":{\"role\":\"assistant\",\"content\":\"It is difficult to determine which animal is the largest as there is no single animal that can be considered the \\\"largest\\\" in all senses. 
The size of an animal can vary greatly depending on its habitat, species, and individual characteristics.\\n\\nFor example, giant pandas are known for their large size, with males weighing up to 200 pounds and females weighing up to 135 pounds. Other large animals include blue whales, the world's largest animal, with estimated populations ranging from over 100,000 individuals; elephants, which can weigh over 6,000 pounds and stand up to \"},\"finish_reason\":\"length\"}],\"usage\":{\"prompt_tokens\":25,\"completion_tokens\":127,\"total_tokens\":152}}" ] } + ], + "source": [ + "!curl -k -X 'POST' -N \\\n", + " 'http://127.0.0.1:9997/v1/chat/completions' \\\n", + " -H 'accept: application/json' \\\n", + " -H 'Content-Type: application/json' \\\n", + " -d '{ \"model\": \"my-llm\", \"messages\": [ {\"role\": \"system\", \"content\": \"You are a helpful assistant.\" }, {\"role\": \"user\", \"content\": \"What is the largest animal?\"} ]}'" ] }, { "cell_type": "markdown", - "source": [ - "### 3. Use Xinference's Python client" - ], "metadata": { "id": "RJ_72F51XFZY" - } + }, + "source": [ + "### 3. Use Xinference's Python client" + ] }, { "cell_type": "code", - "source": [ - "from xinference.client import RESTfulClient\n", - "client = RESTfulClient(\"http://127.0.0.1:9997\")\n", - "model = client.get_model(\"my-llm\")\n", - "model.chat(\n", - " prompt=\"hello\",\n", - " chat_history=[\n", - " {\n", - " \"role\": \"user\",\n", - " \"content\": \"What is the largest animal?\"\n", - " }]\n", - ")" - ], + "execution_count": 9, "metadata": { "colab": { "base_uri": "https://localhost:8080/" @@ -444,10 +432,8 @@ "id": "ohZPPubkXKLl", "outputId": "51ae6581-216c-4b30-8b88-0965193cd091" }, - "execution_count": 9, "outputs": [ { - "output_type": "execute_result", "data": { "text/plain": [ "{'id': 'chat8cef808c-aa0e-11ee-9dba-0242ac1c000c',\n", @@ -461,38 +447,37 @@ " 'usage': {'prompt_tokens': 31, 'completion_tokens': 29, 'total_tokens': 60}}" ] }, + "execution_count": 9, "metadata": {}, - "execution_count": 9 + "output_type": "execute_result" } + ], + "source": [ + "from xinference.client import RESTfulClient\n", + "client = RESTfulClient(\"http://127.0.0.1:9997\")\n", + "model = client.get_model(\"my-llm\")\n", + "model.chat(\n", + " prompt=\"hello\",\n", + " chat_history=[\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"What is the largest animal?\"\n", + " }]\n", + ")" ] }, { "cell_type": "markdown", - "source": [ - "### 4. Langchain intergration" - ], "metadata": { "id": "P4PaU0fpdAuB" - } + }, + "source": [ + "### 4. 
LangChain integration"
+   ]
  },
  {
   "cell_type": "code",
-   "source": [
-    "from langchain.llms import Xinference\n",
-    "from langchain.chains import LLMChain\n",
-    "from langchain.prompts import PromptTemplate\n",
-    "\n",
-    "llm = Xinference(server_url='http://127.0.0.1:9997', model_uid='my-llm')\n",
-    "\n",
-    "template = 'What is the largest {kind} on the earth?'\n",
-    "\n",
-    "prompt = PromptTemplate(template=template, input_variables=['kind'])\n",
-    "\n",
-    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
-    "\n",
-    "generated = llm_chain.run(kind='plant')\n",
-    "print(generated)"
-   ],
+   "execution_count": 10,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/"
@@ -500,11 +485,10 @@
    "id": "ijeadB9DdDO8",
    "outputId": "1b8969cc-5d05-4cee-fac6-d8aa7d0efbd4"
   },
-   "execution_count": 10,
   "outputs": [
    {
-     "output_type": "stream",
     "name": "stdout",
+     "output_type": "stream",
     "text": [
      " The answer to this question is subjective and can vary depending on factors such as the definition of \"largest\" and location. However, some estimates put the size of a particular plant at over 50 feet tall or more.\n",
      "\n",
@@ -513,13 +497,30 @@
      "Another\n"
     ]
    }
+   ],
+   "source": [
+    "from langchain.llms import Xinference\n",
+    "from langchain.chains import LLMChain\n",
+    "from langchain.prompts import PromptTemplate\n",
+    "\n",
+    "llm = Xinference(server_url='http://127.0.0.1:9997', model_uid='my-llm')\n",
+    "\n",
+    "template = 'What is the largest {kind} on the earth?'\n",
+    "\n",
+    "prompt = PromptTemplate(template=template, input_variables=['kind'])\n",
+    "\n",
+    "llm_chain = LLMChain(prompt=prompt, llm=llm)\n",
+    "\n",
+    "generated = llm_chain.run(kind='plant')\n",
+    "print(generated)"
   ]
  }
 ],
 "metadata": {
+  "accelerator": "GPU",
  "colab": {
-   "provenance": [],
-   "gpuType": "T4"
+   "gpuType": "T4",
+   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
@@ -527,9 +528,8 @@
  },
  "language_info": {
   "name": "python"
-  },
-  "accelerator": "GPU"
+  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
-}
\ No newline at end of file
+}
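
A possible follow-up to the OpenAI-compatible example in the notebook above: a minimal streaming sketch. It assumes the local Xinference server started in the notebook is still listening on 127.0.0.1:9997 and that the model keeps the UID "my-llm"; the stream=True flag is the standard option of the OpenAI Python client, and whether streaming works end to end depends on the model and Xinference version, so treat this as an illustration rather than a tested part of the notebook.

import openai

# Same endpoint and model UID as in the notebook above (assumed still running).
client = openai.Client(api_key="empty", base_url="http://127.0.0.1:9997/v1")

# stream=True asks the server to return the reply incrementally as chunks.
stream = client.chat.completions.create(
    model="my-llm",
    messages=[{"role": "user", "content": "What is the largest animal?"}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a small delta of the assistant message; print it as it arrives.
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()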