diff --git a/tutorials/image_classification_tutorial.ipynb b/tutorials/image_classification_tutorial.ipynb
new file mode 100644
index 0000000000..ee6fc80edd
--- /dev/null
+++ b/tutorials/image_classification_tutorial.ipynb
@@ -0,0 +1,426 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "_X9GuXoSXleA"
+   },
+   "source": [
+    "<center>\n",
+    "    <a href=\"https://docs.arize.com/phoenix/\">Docs</a>\n",
+    "    |\n",
+    "    <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
+    "    |\n",
+    "    Community\n",
+    "</center>\n",
+    "\n",
+    "# Active Learning for a Drifting Image Classification Model\n",
\n", + "\n", + "Imagine you're in charge of maintaining a model that classifies the action of people in photographs. Your model initially performs well in production, but its performance gradually degrades over time.\n", + "\n", + "Phoenix helps you surface the reason for this regression by analyzing the embeddings representing each image. Your model was trained on crisp and high-resolution images, but as you'll discover, it's encountering blurred and noisy images in production that it can't correctly classify.\n", + "\n", + "In this tutorial, you will:\n", + "\n", + "- Download curated datasets of embeddings and predictions\n", + "- Define a schema to describe the format of your data\n", + "- Launch Phoenix to visually explore your embeddings\n", + "- Investigate problematic clusters\n", + "- Export problematic production data for labeling and fine-tuning\n", + "\n", + "Let's get started!\n", + "\n", + "## 1. Install Dependencies and Import Libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install arize-phoenix" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QvPo5LKZjpfs" + }, + "outputs": [], + "source": [ + "import uuid\n", + "from dataclasses import replace\n", + "from datetime import datetime\n", + "\n", + "from IPython.display import display, HTML\n", + "import pandas as pd\n", + "import phoenix as px" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OFeF5_Bysd2f" + }, + "source": [ + "## 2. Download and Inspect the Data\n", + "\n", + "Download the curated dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_df = pd.read_parquet(\"https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_training.parquet\")\n", + "prod_df = pd.read_parquet(\"https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/cv/human-actions/human_actions_production.parquet\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "View the first few rows of the training DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The columns of the DataFrame are:\n", + "- **prediction_id:** a unique identifier for each data point\n", + "- **prediction_ts:** the Unix timestamps of your predictions\n", + "- **url:** a link to the image data\n", + "- **image_vector:** the embedding vectors representing each review\n", + "- **actual_action:** the ground truth for each image (sleeping, eating, running, etc.)\n", + "- **predicted_action:** the predicted class for the image\n", + "\n", + "View the first few rows of the production DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prod_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Notice that the production data is missing ground truth, i.e., has no \"actual_action\" column.\n", + "\n", + "Display a few images alongside their predicted and actual labels. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def display_examples(df):\n", + " \"\"\"\n", + " Displays each image alongside the actual and predicted classes.\n", + " \"\"\"\n", + " sample_df = df[[\"actual_action\", \"predicted_action\", \"url\"]].rename(columns={\"url\": \"image\"})\n", + " html = sample_df.to_html(escape=False, index=False, formatters={\"image\": lambda url: f''})\n", + " display(HTML(html))\n", + " \n", + "display_examples(train_df.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0BIeGAemfziv" + }, + "source": [ + "## 3. Prepare the Data\n", + "\n", + "The original data is from April 2022. Update the timestamps to the current time." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xzYoV-hemYsE" + }, + "outputs": [], + "source": [ + "latest_timestamp = max(prod_df['prediction_ts'])\n", + "current_timestamp = datetime.timestamp(datetime.now())\n", + "delta = current_timestamp - latest_timestamp\n", + "\n", + "train_df['prediction_ts'] = (train_df['prediction_ts'] + delta).astype(float)\n", + "prod_df['prediction_ts'] = (prod_df['prediction_ts'] + delta).astype(float)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Launch Phoenix\n", + "\n", + "### a) Define Your Schema\n", + "To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), embeddings, etc.\n", + "\n", + "The trickiest part is defining embedding features. In this case, each embedding feature has two pieces of information: the embedding vector itself contained in the \"image_vector\" column and the link to the image contained in the \"url\" column.\n", + "\n", + "Define a schema for your training data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "train_schema = px.Schema(\n", + " timestamp_column_name=\"prediction_ts\",\n", + " prediction_label_column_name=\"predicted_action\",\n", + " actual_label_column_name=\"actual_action\",\n", + " embedding_feature_column_names={\n", + " \"image_embedding\": px.EmbeddingColumnNames(\n", + " vector_column_name=\"image_vector\",\n", + " link_to_data_column_name=\"url\",\n", + " ),\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The schema for your production data is the same, except it does not have an actual label column." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prod_schema = replace(train_schema, actual_label_column_name=None)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### b) Define Your Datasets\n", + "Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data." 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Load and View Exported Data\n",
+    "\n",
+    "View your exported files."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "session.exports"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Load your most recent exported data back into a DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "export_df = session.exports[0].dataframe\n",
+    "export_df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Display a few examples from your export."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "display_examples(export_df.head())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Congrats! You've pinpointed the blurry or noisy images that are hurting your model's performance in production. As an actionable next step, you can label your exported production data and fine-tune your model to improve performance."
+   ]
+  },
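+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For example, you might hand the flagged images off to your labeling team as a file. This is a minimal sketch: the file name is hypothetical, and any format your labeling workflow supports will do."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save the image URLs and model predictions for labeling (hypothetical file name).\n",
+    "export_df[[\"url\", \"predicted_action\"]].to_csv(\"cluster_to_label.csv\", index=False)"
+   ]
+  },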
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7. Close the App\n",
+    "\n",
+    "When you're done, don't forget to close the app."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "px.close_app()"
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "collapsed_sections": [
+    "QOudyT6lPBqp"
+   ],
+   "machine_shape": "hm",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.15"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}