diff --git a/examples/getting_started/olive-awq-ft-llama.ipynb b/examples/getting_started/olive-awq-ft-llama.ipynb
new file mode 100644
index 000000000..29afe3ea1
--- /dev/null
+++ b/examples/getting_started/olive-awq-ft-llama.ipynb
@@ -0,0 +1,290 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "tv6vx7wooDfk"
+   },
+   "source": [
+    "# ✨ Quantize & Fine-tune an SLM with Olive\n",
+    "\n",
+    "> ⚠️ **This notebook quantizes a Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU.**\n",
+    "\n",
+    "In this notebook, you will:\n",
+    "\n",
+    "1. Quantize the Llama-3.2-1B-Instruct model using the [AWQ algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).\n",
+    "1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.\n",
+    "1. Optimize the fine-tuned model for ONNX Runtime.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🐍 Install Python dependencies\n",
+    "\n",
+    "The following cells create a pip requirements file and then install the libraries."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%writefile requirements.txt\n",
+    "\n",
+    "olive-ai==0.7.1\n",
+    "transformers==4.44.2\n",
+    "autoawq==0.2.6\n",
+    "optimum==1.23.1\n",
+    "peft==0.13.2\n",
+    "accelerate>=0.30.0\n",
+    "scipy==1.14.1\n",
+    "onnxruntime-genai==0.5.0\n",
+    "torchvision==0.18.1\n",
+    "tabulate==0.9.0"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "ZtY3VYxCoDfm"
+   },
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "\n",
+    "%pip install -r requirements.txt"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🤗 Log in to Hugging Face\n",
+    "\n",
+    "In this notebook, you'll be fine-tuning [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face, so you will need to request access to the model. Once you have access, log in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download the model. Replace `USER_ACCESS_TOKEN` in the cell below with your token."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!huggingface-cli login --token USER_ACCESS_TOKEN"
+   ]
+  },
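+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🖥️ Check for a CUDA GPU (optional)\n",
+    "\n",
+    "Since AWQ quantization requires an Nvidia GPU, the next cell is a quick sanity check before you start. It is a minimal sketch that assumes PyTorch is importable (it is pulled in as a dependency of the packages installed above)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "\n",
+    "# AWQ quantization runs on CUDA, so verify a suitable GPU is visible.\n",
+    "if torch.cuda.is_available():\n",
+    "    print(f\"CUDA device: {torch.cuda.get_device_name(0)}\")\n",
+    "else:\n",
+    "    print(\"No CUDA device found. AWQ quantization requires an Nvidia GPU (e.g. A10/A100).\")"
+   ]
+  },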
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🗜️ Quantize the model using AWQ\n",
+    "\n",
+    "First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.\n",
+    "\n",
+    "You can choose a different Hugging Face model to quantize; just update the `--model_name_or_path` argument.\n",
+    "\n",
+    "> ⏳ **It takes ~6 minutes to complete the AWQ quantization**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!olive quantize \\\n",
+    "    --model_name_or_path \"meta-llama/Llama-3.2-1B-Instruct\" \\\n",
+    "    --trust_remote_code \\\n",
+    "    --algorithm awq \\\n",
+    "    --output_path models/llama/awq \\\n",
+    "    --log_level 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "nxJCT5wioDfp"
+   },
+   "source": [
+    "## 🏃 Train the model\n",
+    "\n",
+    "Fine-tuning a language model helps when you need very specific outputs. In this example, you'll fine-tune the **AWQ-quantized variant** of Llama-3.2-1B-Instruct from the previous cell to respond to an English phrase with a single-word answer that classifies the phrase as one of surprise/fear/joy/sadness. Here is a sample of the data used for fine-tuning:\n",
+    "\n",
+    "```jsonl\n",
+    "{\"phrase\": \"The sudden thunderstorm caught me off guard.\", \"tone\": \"surprise\"}\n",
+    "{\"phrase\": \"The creaking door at night is quite spooky.\", \"tone\": \"fear\"}\n",
+    "{\"phrase\": \"Celebrating my birthday with friends is always fun.\", \"tone\": \"joy\"}\n",
+    "{\"phrase\": \"Saying goodbye to my pet was heart-wrenching.\", \"tone\": \"sadness\"}\n",
+    "```\n",
+    "\n",
+    "Fine-tuning *after* quantization provides an opportunity to recover some of the accuracy lost during quantization and enhance model quality. For more details on quantization and fine-tuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).\n",
+    "\n",
+    "In the following `olive finetune` command, the `--data_name` argument points to the Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.\n",
+    "\n",
+    "> ⏳ **It takes ~6 minutes to complete the fine-tuning**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "8t36pRF2oDfq"
+   },
+   "outputs": [],
+   "source": [
+    "!olive finetune \\\n",
+    "    --method lora \\\n",
+    "    --model_name_or_path models/llama/awq \\\n",
+    "    --trust_remote_code \\\n",
+    "    --data_name xxyyzzz/phrase_classification \\\n",
+    "    --text_template \"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n{tone}<|eot_id|>\" \\\n",
+    "    --max_steps 300 \\\n",
+    "    --output_path models/llama/ft \\\n",
+    "    --log_level 1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "7woNXLDF0bhh"
+   },
+   "source": [
+    "## 🪄 Automatic model optimization with Olive\n",
+    "\n",
+    "Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:\n",
+    "\n",
+    "1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.\n",
+    "1. Optimize the ONNX graph (for example, by fusing nodes).\n",
+    "1. Extract the fine-tuned LoRA weights and place them in a separate file.\n",
+    "\n",
+    "> ⏳ **It takes ~2 minutes for the automatic optimization to complete**"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "M-prKBy20U5m"
+   },
+   "outputs": [],
+   "source": [
+    "!olive auto-opt \\\n",
+    "    --model_name_or_path models/llama/ft/model \\\n",
+    "    --adapter_path models/llama/ft/adapter \\\n",
+    "    --device cpu \\\n",
+    "    --provider CPUExecutionProvider \\\n",
+    "    --use_ort_genai \\\n",
+    "    --output_path models/llama/onnx-ao \\\n",
+    "    --log_level 1"
+   ]
+  },
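+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🔎 Inspect the optimized output (optional)\n",
+    "\n",
+    "Before running inference, you can list what `auto-opt` wrote to disk. The base ONNX model and the extracted LoRA adapter weights should appear as separate files. This is a minimal sketch that assumes the output landed under `models/llama/onnx-ao` (the `--output_path` used above); the exact file names may vary between Olive versions."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pathlib import Path\n",
+    "\n",
+    "# List the files produced by auto-opt, with sizes in MB.\n",
+    "# Adjust output_dir if you changed --output_path above.\n",
+    "output_dir = Path(\"models/llama/onnx-ao\")\n",
+    "for f in sorted(output_dir.rglob(\"*\")):\n",
+    "    if f.is_file():\n",
+    "        print(f\"{f.relative_to(output_dir)} ({f.stat().st_size / 1e6:.1f} MB)\")"
+   ]
+  },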
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "8Uwm432loDfr"
+   },
+   "source": [
+    "## 🧠 Inference\n",
+    "\n",
+    "The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: \"Cricket is a wonderful game\") and the app will output a chat completion using:\n",
+    "\n",
+    "1. The base model only (no adapter). You should notice that the model gives a verbose response.\n",
+    "1. The base model **plus adapter**. You should notice that the model responds with a single-word classification.\n",
+    "\n",
+    "In the code, you'll notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.\n",
+    "\n",
+    "While the inference code uses the ONNX Runtime Python API, language bindings are also available for [Java, C#, and C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).\n",
+    "\n",
+    "To exit the chat interface, enter `exit` or press `Ctrl+C`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "puMdoAxjoDfr"
+   },
+   "outputs": [],
+   "source": [
+    "import onnxruntime_genai as og\n",
+    "\n",
+    "model_path = \"models/llama/onnx-ao/model\"\n",
+    "\n",
+    "model = og.Model(model_path)\n",
+    "adapters = og.Adapters(model)\n",
+    "adapters.load(f'{model_path}/adapter_weights.onnx_adapter', \"classifier\")\n",
+    "tokenizer = og.Tokenizer(model)\n",
+    "tokenizer_stream = tokenizer.create_stream()\n",
+    "\n",
+    "# Keep asking for input prompts in a loop\n",
+    "while True:\n",
+    "    phrase = input(\"Phrase: \")\n",
+    "    if phrase.strip().lower() == \"exit\":\n",
+    "        break\n",
+    "    prompt = f\"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
+    "    input_tokens = tokenizer.encode(prompt)\n",
+    "\n",
+    "    # First, run without the adapter\n",
+    "    params = og.GeneratorParams(model)\n",
+    "    params.set_search_options(past_present_share_buffer=False)\n",
+    "    params.input_ids = input_tokens\n",
+    "    generator = og.Generator(model, params)\n",
+    "\n",
+    "    print()\n",
+    "    print(\"Output from Base Model (notice verbosity): \", end='', flush=True)\n",
+    "\n",
+    "    while not generator.is_done():\n",
+    "        generator.compute_logits()\n",
+    "        generator.generate_next_token()\n",
+    "\n",
+    "        new_token = generator.get_next_tokens()[0]\n",
+    "        print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
+    "    print()\n",
+    "    print()\n",
+    "\n",
+    "    # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
+    "    del generator\n",
+    "\n",
+    "    # Now run with the adapter\n",
+    "    generator = og.Generator(model, params)\n",
+    "    # Set the adapter as active for this response\n",
+    "    generator.set_active_adapter(adapters, \"classifier\")\n",
+    "\n",
+    "    print()\n",
+    "    print(\"Output from Base Model + Adapter (notice single word response): \", end='', flush=True)\n",
+    "\n",
+    "    while not generator.is_done():\n",
+    "        generator.compute_logits()\n",
+    "        generator.generate_next_token()\n",
+    "\n",
+    "        new_token = generator.get_next_tokens()[0]\n",
+    "        print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
+    "    print()\n",
+    "    print()\n",
+    "    del generator"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "gpuType": "A100",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}