Add a notebook example of AWQ->FT->AutoOpt #1517

Merged (3 commits) on Dec 10, 2024
290 changes: 290 additions & 0 deletions examples/getting_started/olive-awq-ft-llama.ipynb
@@ -0,0 +1,290 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "tv6vx7wooDfk"
},
"source": [
"# ✨ Quantize & Finetune an SLM with Olive\n",
"\n",
"> ⚠️ **This notebook will quantize an Small Language Model (SLM) using the AWQ algorithm, which requires an Nvidia A10 or A100 GPU device.**\n",
"\n",
"In this notebook, you will:\n",
"\n",
"1. Quantize Llama-3.2-1B-Instruct model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978).\n",
"1. Fine-tune the quantized model to classify English phrases into Surprise/Joy/Fear/Sadness.\n",
"1. Optimize the fine-tuned model for the ONNX Runtime.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🐍 Install Python dependencies\n",
"\n",
"The following cells create a pip requirements file and then install the libraries."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%writefile requirements.txt\n",
"\n",
"olive-ai==0.7.1\n",
"transformers==4.44.2\n",
"autoawq==0.2.6\n",
"optimum==1.23.1\n",
"peft==0.13.2\n",
"accelerate>=0.30.0\n",
"scipy==1.14.1\n",
"onnxruntime-genai==0.5.0\n",
"torchvision==0.18.1\n",
"tabulate==0.9.0"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "ZtY3VYxCoDfm"
},
"outputs": [],
"source": [
"%%capture\n",
"\n",
"%pip install -r requirements.txt"
]
},
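{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before going further, it can help to confirm that a CUDA-capable GPU is visible to the notebook, since the AWQ quantization step later on requires an Nvidia A10 or A100. The next cell is a minimal, optional sanity check; it assumes `torch` is pulled in as a dependency of the packages installed above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional sanity check: report whether a CUDA GPU is visible.\n",
"# Assumes torch is installed as a dependency of the packages above.\n",
"import torch\n",
"\n",
"if torch.cuda.is_available():\n",
"    print(\"CUDA device:\", torch.cuda.get_device_name(0))\n",
"else:\n",
"    print(\"No CUDA device found; the AWQ quantization step below will not run.\")"
]
},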
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🤗 Login to Hugging Face\n",
"\n",
"In this notebook you'll be finetuning [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), which is *gated* on Hugging Face and therefore you will need to request access to the model. Once you have access to the model, you'll need to log-in to Hugging Face with a [user access token](https://huggingface.co/docs/hub/security-tokens) so that Olive can download it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!huggingface-cli login --token USER_ACCESS_TOKEN"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 🗜️ Quantize the model using AWQ\n",
"First, you'll quantize the [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) model using the [AWQ Algorithm](https://ar5iv.labs.arxiv.org/html/2306.00978). Olive also supports other quantization algorithms, such as GPTQ, HQQ, and RTN.\n",
"\n",
"You can choose a different model to quantize from Hugging-Face, just update the `--model_name_or_path` argument.\n",
"> ⏳ **It takes approximately ~6mins to complete the AWQ quantization**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!olive quantize \\\n",
" --model_name_or_path \"meta-llama/Llama-3.2-1B-Instruct\" \\\n",
" --trust_remote_code \\\n",
" --algorithm awq \\\n",
" --output_path models/llama/awq \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nxJCT5wioDfp"
},
"source": [
"## 🏃 Train the model\n",
"\n",
"Fine-tuning language models helps when we desire very specific outputs. In this example, you'll fine-tune the **AWQ quantized model variant** of Llama-3.2-1B-instruct from the previous cell to respond to an English phrase with a single word answer that classifies the phrases into one of surprise/fear/joy/sadness categories. Here is a sample of the data used for fine-tuning:\n",
"\n",
"```jsonl\n",
"{\"phrase\": \"The sudden thunderstorm caught me off guard.\", \"tone\": \"surprise\"}\n",
"{\"phrase\": \"The creaking door at night is quite spooky.\", \"tone\": \"fear\"}\n",
"{\"phrase\": \"Celebrating my birthday with friends is always fun.\", \"tone\": \"joy\"}\n",
"{\"phrase\": \"Saying goodbye to my pet was heart-wrenching.\", \"tone\": \"sadness\"}\n",
"```\n",
"\n",
"Fine-tuning *after* quantization provides an opportunity to recover some of the loss from the quantization process and enhance the model quality. For more details on quantization and finetuning, read [Is it better to quantize before or after finetuning?](https://onnxruntime.ai/blogs/olive-quant-ft).\n",
"\n",
"In the following `olive finetune` command the `--data_name` argument is a Hugging Face dataset [xxyyzzz/phrase_classification](https://huggingface.co/datasets/xxyyzzz/phrase_classification). You can also provide your own data from local disk using the `--data_files` argument.\n",
"\n",
"> ⏳ **It takes ~6mins to complete the Finetuning**"
]
},
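{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally, before launching fine-tuning, you can peek at a few rows of the dataset yourself. The next cell is a minimal sketch; it assumes the `datasets` library is available in the environment (run `%pip install datasets` first if it is not)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: inspect a few rows of the fine-tuning dataset.\n",
"# Assumes the `datasets` library is available; `%pip install datasets` if it is not.\n",
"from datasets import load_dataset\n",
"\n",
"ds = load_dataset(\"xxyyzzz/phrase_classification\")\n",
"print(ds)  # shows the available splits and row counts\n",
"\n",
"split = list(ds.keys())[0]\n",
"for row in ds[split].select(range(3)):\n",
"    print(row)"
]
},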
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "8t36pRF2oDfq"
},
"outputs": [],
"source": [
"!olive finetune \\\n",
" --method lora \\\n",
" --model_name_or_path models/llama/awq \\\n",
" --trust_remote_code \\\n",
" --data_name xxyyzzz/phrase_classification \\\n",
" --text_template \"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n{tone}<|eot_id|>\" \\\n",
" --max_steps 300 \\\n",
" --output_path models/llama/ft \\\n",
" --log_level 1"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "7woNXLDF0bhh"
},
"source": [
"## 🪄 Automatic model optimization with Olive\n",
"\n",
"Next, you'll execute Olive's automatic optimizer using the `auto-opt` CLI command, which will:\n",
"\n",
"1. Capture the fine-tuned model into an ONNX graph and convert the weights into the ONNX format.\n",
"1. Optimize the ONNX graph (e.g. fuse nodes, reshape, etc).\n",
"1. Extract the fine-tuned LoRA weights and place them into a separate file.\n",
"\n",
"> ⏳**It takes ~2mins for the automatic optimization to complete**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "M-prKBy20U5m"
},
"outputs": [],
"source": [
"!olive auto-opt \\\n",
" --model_name_or_path models/llama/ft/model \\\n",
" --adapter_path models/llama/ft/adapter \\\n",
" --device cpu \\\n",
" --provider CPUExecutionProvider \\\n",
" --use_ort_genai \\\n",
" --output_path models/llama/onnx-ao \\\n",
" --log_level 1"
]
},
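{
"cell_type": "markdown",
"metadata": {},
"source": [
"The optimized ONNX model and the extracted adapter weights are written under the `--output_path` given above. The next cell is an optional check that simply lists what was produced; the exact file names may vary between Olive versions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: list the files produced by auto-opt.\n",
"# The path matches the --output_path used above; exact file names may vary by Olive version.\n",
"from pathlib import Path\n",
"\n",
"output_dir = Path(\"models/llama/onnx-ao\")\n",
"for f in sorted(output_dir.rglob(\"*\")):\n",
"    if f.is_file():\n",
"        print(f\"{f.relative_to(output_dir)}  ({f.stat().st_size / 1e6:.1f} MB)\")"
]
},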
{
"cell_type": "markdown",
"metadata": {
"id": "8Uwm432loDfr"
},
"source": [
"## 🧠 Inference\n",
"\n",
"The code below creates a test app that consumes the model in a simple console chat interface. You will be prompted to enter an English phrase (for example: \"Cricket is a wonderful game\") and the app will output a chat completion using:\n",
"\n",
"1. The base model only (no adapter). You should notice that the model gives a verbose response.\n",
"1. The base model **plus adapter**. You should notice that we get one word classification. \n",
"\n",
"In the code, you'll notice that ONNX Runtime allows you to hot-swap adapters for different tasks, which is often referred to as *multi-LoRA* serving.\n",
"\n",
"Whilst the inference code uses the Python API for the ONNX Runtime, other language bindings are available in [Java, C#, C++](https://github.com/microsoft/onnxruntime-genai/tree/main/examples).\n",
"\n",
"To exit the chat interface, enter `exit` or select `Ctrl+c`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "puMdoAxjoDfr"
},
"outputs": [],
"source": [
"import onnxruntime_genai as og\n",
"\n",
"model_path = \"models/llama/onnx-ao/model\"\n",
"\n",
"model = og.Model(f'{model_path}')\n",
"adapters = og.Adapters(model)\n",
"adapters.load(f'{model_path}/adapter_weights.onnx_adapter', \"classifier\")\n",
"tokenizer = og.Tokenizer(model)\n",
"tokenizer_stream = tokenizer.create_stream()\n",
"\n",
"# Keep asking for input prompts in a loop\n",
"while True:\n",
" phrase = input(\"Phrase: \")\n",
" prompt = f\"<|start_header_id|>user<|end_header_id|>\\n{phrase}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\\n\"\n",
" input_tokens = tokenizer.encode(prompt)\n",
" \n",
" # first run without the adapter\n",
" params = og.GeneratorParams(model)\n",
" params.set_search_options(past_present_share_buffer=False)\n",
" params.input_ids = input_tokens\n",
" generator = og.Generator(model, params)\n",
"\n",
" print()\n",
" print(\"Output from Base Model (notice verbosity): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" \n",
" # Delete the generator to free the captured graph for the next generator, if graph capture is enabled\n",
" del generator\n",
" \n",
" # now run with adapter\n",
" generator = og.Generator(model, params)\n",
" # set the adapter to active for this response\n",
" generator.set_active_adapter(adapters, \"classifier\")\n",
"\n",
" print()\n",
" print(\"Output from Base Model + Adapter (notice single word response): \", end='', flush=True)\n",
"\n",
" while not generator.is_done():\n",
" generator.compute_logits()\n",
" generator.generate_next_token()\n",
"\n",
" new_token = generator.get_next_tokens()[0]\n",
" print(tokenizer_stream.decode(new_token), end='', flush=True)\n",
" print()\n",
" print()\n",
" del generator"
]
}
],
"metadata": {
"accelerator": "GPU",
"colab": {
"gpuType": "A100",
"provenance": [],
"toc_visible": true
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 0
}