From 213f70423ab6fba39f3dc1fc434ee3510d70ecba Mon Sep 17 00:00:00 2001 From: Alexander Song Date: Sat, 8 Apr 2023 16:50:39 -1000 Subject: [PATCH 1/2] docs: credit card fraud tutorial notebook update --- tutorials/credit_card_fraud_tutorial.ipynb | 283 ++++++++++++--------- 1 file changed, 163 insertions(+), 120 deletions(-) diff --git a/tutorials/credit_card_fraud_tutorial.ipynb b/tutorials/credit_card_fraud_tutorial.ipynb index 01b1fcd88e..48fd351f03 100644 --- a/tutorials/credit_card_fraud_tutorial.ipynb +++ b/tutorials/credit_card_fraud_tutorial.ipynb @@ -2,29 +2,36 @@ "cells": [ { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "2WSnqj3dtf4F" + }, "source": [ - "#
Phoenix in Flight
\n", - "##
Surfacing Feature Drift and Data Quality Issues for a Fraud-Detection Model
\n", + "
\n", + "

\n", + " \"phoenix\n", + "
\n", + " Docs\n", + " |\n", + " GitHub\n", + " |\n", + " Community\n", + "

\n", + "
\n", + "

Detecting Fraud with Tabular Embeddings

\n", "\n", "Imagine you maintain a fraud-detection service for your e-commerce company. In the past few weeks, there's been an alarming spike in undetected cases of fraudulent credit card transactions. These false negatives are hurting your bottom line, and you've been tasked with solving the issue.\n", "\n", - "Phoenix provides opinionated workflows to surface feature drift and data quality issues quickly so you can get straight to the root-cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, and a missing feature in your pipeline is causing your model's false negative rate to skyrocket.\n", + "Phoenix provides opinionated workflows to surface feature drift and data quality issues quickly so you can get straight to the root-cause of the problem. As you'll see, your fraud-detection service is receiving more and more traffic from an untrustworthy merchant, causing your model's false negative rate to skyrocket.\n", "\n", "In this tutorial, you will:\n", "* Download curated datasets of credit card transaction and fraud-detection data\n", - "* Investigate troublesome \"slices\" of your features to detect drift caused by a fraudulent merchant\n", - "* Uncover a data quality issue causing a spike in false negatives\n", - "* Generate a report to share these insights with your co-workers and other stakeholders at your company\n", + "* Compute tabular embeddings to represent each transaction\n", + "* Pinpoint fraudulent transactions from a suspicious merchant\n", + "* Export data from this merchant to retrain your model\n", "\n", - "Let's get started!" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 1. Install Dependencies and Import Libraries 📚" + "Let's get started!\n", + "\n", + "## 1. 
Install Dependencies and Import Libraries" ] }, { @@ -33,7 +40,7 @@ "metadata": {}, "outputs": [], "source": [ - "%pip install -q arize-phoenix" + "!pip install -q arize-phoenix \"arize[AutoEmbeddings]\"" ] }, { @@ -42,17 +49,21 @@ "metadata": {}, "outputs": [], "source": [ + "from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures\n", "import pandas as pd\n", - "import phoenix as px" + "import phoenix as px\n", + "import torch" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "2ux2rILWtf4I" + }, "source": [ - "### 2. Download the Data 📊\n", + "## 2. Download the Data\n", "\n", - "Load your training and production data into two pandas dataframes and inspect a few rows of the training dataframe." + "Load your training and production data into two pandas DataFrames and inspect a few rows of the training DataFrame." ] }, { @@ -72,7 +83,9 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "ZebmnDlutf4J" + }, "source": [ "The columns of the dataframe are:\n", "- **prediction_id:** the unique ID for each prediction\n", @@ -80,31 +93,15 @@ "- **predicted_label:** the label your model predicted\n", "- **predicted_score:** the score of each prediction\n", "- **actual_label:** the true, ground-truth label for each prediction (fraud vs. not_fraud)\n", + "- **tabular_vector:** pre-computed tabular embeddings for each row of data\n", "- **age:** a tag used to filter your data in the Phoenix UI\n", - "- the rest of the columns are features" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 3. Generate Embeddings using Arize AutoEmbeddings\n", + "- the rest of the columns are features\n", "\n", - "We can generate an embedding vector per row of our dataframe using `airze[AutoEmbeddings]`. Arize offers the ability of generating embeddings seemlessly using large pre-trained models. 
In this example, we will use the pre-trained language model `distilbert-base-uncased`.\n", + "## 3. Compute Embeddings\n", "\n", - "**NOTE: The use of GPUs is recommended for embedding generation. If you are running in Colab, we encourage upgrading to Colab Pro.** \n", + "Run the cell below if you have a CUDA-enabled GPU and want to compute embeddings for your tabular data from scratch; otherwise, skip this step to use the pre-computed embeddings downloaded with the rest of your data in step 2.\n", "\n", - "The large language models that Arize's embedding generators use have already been trained in such a huge amount of data that the embeddings can capture relevant structure in your data without being fine-tuned." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "%pip install -q arize[AutoEmbeddings]\n", - "from arize.pandas.embeddings.tabular_generators import EmbeddingGeneratorForTabularFeatures" + "`EmbeddingGeneratorForTabularFeatures` represents each row of your DataFrame as a piece of text and computes an embedding for that text using a pre-trained large language model (in this case, \"distilbert-base-uncased\"). For example, if a row of your DataFrame represents a transaction in the state of California from a merchant named \"Leannon Ward\" with a FICO score of 616 and a merchant risk score of 23, `EmbeddingGeneratorForTabularFeatures` computes an embedding for the text: \"The state is CA. The merchant ID is Leannon Ward. The fico score is 616. 
The merchant risk score is 23...\"" ] }, { @@ -113,57 +110,59 @@ "metadata": {}, "outputs": [], "source": [ - "generator = EmbeddingGeneratorForTabularFeatures(\n", - " model_name=\"distilbert-base-uncased\",\n", - " tokenizer_max_length=512,\n", - ")\n", - "\n", - "selected_cols = [\n", - " \"fico_score\",\n", - " \"merchant_risk_score\",\n", - " \"loan_amount\",\n", - " \"term\",\n", - " \"interest_rate\",\n", - " \"installment\",\n", - " \"grade\",\n", - " \"home_ownership\",\n", - " \"annual_income\",\n", - " \"verification_status\",\n", - " \"num_credit_lines\",\n", - " \"dti\",\n", - " \"delinq_2yrs\",\n", - " \"inq_last_6mths\",\n", - " \"mths_since_last_delinq\",\n", - " \"open_acc\",\n", - " \"revol_bal\",\n", - " \"state\",\n", - " \"age\",\n", + "feature_column_names = [\n", + " 'fico_score',\n", + " 'loan_amount',\n", + " 'term',\n", + " 'interest_rate',\n", + " 'installment',\n", + " 'grade',\n", + " 'home_ownership',\n", + " 'annual_income',\n", + " 'verification_status',\n", + " 'pymnt_plan',\n", + " 'addr_state',\n", + " 'dti',\n", + " 'delinq_2yrs',\n", + " 'inq_last_6mths',\n", + " 'mths_since_last_delinq',\n", + " 'mths_since_last_record',\n", + " 'open_acc',\n", + " 'pub_rec',\n", + " 'revol_bal',\n", + " 'revol_util',\n", + " 'state',\n", + " 'merchant_ID',\n", + " 'merchant_risk_score',\n", "]\n", "\n", - "train_df[\"tabular_vector\"] = generator.generate_embeddings(\n", - " train_df,\n", - " selected_columns=selected_cols,\n", - ")\n", - "prod_df[\"tabular_vector\"] = generator.generate_embeddings(\n", - " prod_df,\n", - " selected_columns=selected_cols,\n", - ")" + "if torch.cuda.is_available():\n", + " generator = EmbeddingGeneratorForTabularFeatures(\n", + " model_name=\"distilbert-base-uncased\",\n", + " )\n", + " train_df[\"tabular_vector\"] = generator.generate_embeddings(\n", + " train_df,\n", + " selected_columns=feature_column_names,\n", + " )\n", + " prod_df[\"tabular_vector\"] = generator.generate_embeddings(\n", + " prod_df,\n", + 
" selected_columns=feature_column_names,\n", + " )\n", + "else:\n", + " print(\"CUDA is not available. Using pre-computed embeddings.\")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "MW-npos-tf4J" + }, "source": [ - "### 4. Launch Phoenix 🔥🐦" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### a) Define Your Schema\n", + "## 4. Launch Phoenix\n", + "\n", + "### a) Define Your Schema\n", "\n", - "To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your dataframes correspond to features, predictions, actuals (i.e., ground truth), tags, etc." + "To launch Phoenix with your data, you first need to define a schema that tells Phoenix which columns of your DataFrames correspond to features, predictions, actuals (i.e., ground truth), tags, etc." ] }, { @@ -172,35 +171,31 @@ "metadata": {}, "outputs": [], "source": [ - "embedding_features = {\n", - " \"tabular_embedding\": px.EmbeddingColumnNames(\n", - " vector_column_name=\"tabular_vector\",\n", - " ),\n", - "}\n", - "\n", "schema = px.Schema(\n", " prediction_id_column_name=\"prediction_id\",\n", " prediction_label_column_name=\"predicted_label\",\n", " prediction_score_column_name=\"predicted_score\",\n", " actual_label_column_name=\"actual_label\",\n", " timestamp_column_name=\"prediction_timestamp\",\n", + " feature_column_names=feature_column_names,\n", " tag_column_names=[\"age\"],\n", - " embedding_feature_column_names=embedding_features,\n", + " embedding_feature_column_names={\n", + " \"tabular_embedding\": px.EmbeddingColumnNames(\n", + " vector_column_name=\"tabular_vector\",\n", + " ),\n", + " },\n", ")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "vmM01H1Ytf4K" + }, "source": [ - "You'll notice that the schema above doesn't explicitly specify features. 
That's because feature columns are automatically inferred if you don't pass `feature_column_names` to your `Schema` object." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### b) Define Your Datasets \n", + "The schema above lists your features explicitly via `feature_column_names`. If you omit `feature_column_names` from your `Schema` object, Phoenix infers the feature columns automatically.\n", + "\n", + "### b) Define Your Datasets \n", "Next, define your primary and reference datasets. In this case, your reference dataset contains training data and your primary dataset contains production data." ] }, @@ -210,15 +205,17 @@ "metadata": {}, "outputs": [], "source": [ - "primary_dataset = px.Dataset(dataframe=prod_df, schema=schema, name=\"primary\")\n", - "reference_dataset = px.Dataset(dataframe=train_df, schema=schema, name=\"reference\")" + "prod_ds = px.Dataset(dataframe=prod_df, schema=schema, name=\"production\")\n", + "train_ds = px.Dataset(dataframe=train_df, schema=schema, name=\"training\")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "y0LzPbuytf4K" + }, "source": [ - "#### c) Create a Phoenix Session" + "### c) Create a Phoenix Session" ] }, { @@ -227,21 +224,36 @@ "metadata": {}, "outputs": [], "source": [ - "session = px.launch_app(primary=primary_dataset, reference=reference_dataset)" + "session = px.launch_app(primary=prod_ds, reference=train_ds)" ] }, { "cell_type": "markdown", + "metadata": { + "id": "jzTRaybgtf4K" + }, + "source": [ + "### d) Launch the Phoenix UI\n", + "\n", + "You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "#### d) Launch the Phoenix UI" + "session.url" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "hkTRHpI1tf4L" + }, "source": [ - "You can open Phoenix by copying and pasting the output of `session.url` into a new browser tab." + "Alternatively, you can open the Phoenix UI in your notebook with" ] }, { @@ -250,14 +262,42 @@ "metadata": {}, "outputs": [], "source": [ - "session.url" + "session.view()" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "jFYdi3vktf4L" + }, "source": [ - "Alternatively, you can open the Phoenix UI in your notebook with" + "## 5. Find and Export Fraudulent Transactions\n", + "\n", + "### Steps\n", + "\n", + "1. Click on \"tabular_embedding\" in the \"Embeddings\" section.\n", + "1. In the Euclidean distance graph at the top of the page, select a point on the graph where the Euclidean distance is high.\n", + "1. In the display settings in the bottom left, select \"dimension\" in the \"Color By\" dropdown. Then select the \"merchant_ID\" feature in the \"Dimension\" dropdown.\n", + "1. Click on the top cluster in the panel on the left.\n", + "1. Click on the \"Export\" button to save your cluster.\n", + "\n", + "### Questions:\n", + "\n", + "1. What does the Euclidean distance graph measure?\n", + "1. What do the points in the point cloud represent?\n", + "1. What do you notice about the cluster you selected?\n", + "1. What is the cause of your model's high false negative rate in production?\n", + "\n", + "### Answers\n", + "\n", + "1. This graph measures the drift of your production data relative to your training data over time.\n", + "1. Each point in the point cloud represents an individual credit card transaction.\n", + "1. It consists mostly of production data from the Scammeds merchant.\n", + "1. 
Your model was trained on relatively little data from the Scammeds merchant, but is seeing a high volume of transactions from this merchant in production.\n", + "\n", + "## 6. Load and View Exported Data\n", + "\n", + "View your most recently exported data as a DataFrame." ] }, { @@ -266,23 +306,26 @@ "metadata": {}, "outputs": [], "source": [ - "session.view()" + "export_df = session.exports[-1]\n", + "export_df.head()" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "UhL6OoH5zs7G" + }, "source": [ - "### 4. Explore Your Data 📈\n", - "\n", - "Phoenix is under active development. At the moment, we display your model schema and a few data quality statistics. Check back soon for more updates." + "Congrats! You've successfully pinpointed a cluster of fraudulent transactions. You can now fine-tune your model on the exported data in order to detect similar cases of fraud in the future." ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "id": "siUbGmK2tf4L" + }, "source": [ - "### 5. Close the App 🧹\n", + "## 7. Close the App\n", "\n", "When you're done, don't forget to close the app." 
] @@ -303,5 +346,5 @@ } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 1 } From 82df3e405a2eeb4fe453d3993afd56817dfc760b Mon Sep 17 00:00:00 2001 From: Alexander Song Date: Sat, 8 Apr 2023 16:53:28 -1000 Subject: [PATCH 2/2] style notebook --- tutorials/credit_card_fraud_tutorial.ipynb | 46 +++++++++++----------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/tutorials/credit_card_fraud_tutorial.ipynb b/tutorials/credit_card_fraud_tutorial.ipynb index 48fd351f03..801d7793e1 100644 --- a/tutorials/credit_card_fraud_tutorial.ipynb +++ b/tutorials/credit_card_fraud_tutorial.ipynb @@ -111,29 +111,29 @@ "outputs": [], "source": [ "feature_column_names = [\n", - " 'fico_score',\n", - " 'loan_amount',\n", - " 'term',\n", - " 'interest_rate',\n", - " 'installment',\n", - " 'grade',\n", - " 'home_ownership',\n", - " 'annual_income',\n", - " 'verification_status',\n", - " 'pymnt_plan',\n", - " 'addr_state',\n", - " 'dti',\n", - " 'delinq_2yrs',\n", - " 'inq_last_6mths',\n", - " 'mths_since_last_delinq',\n", - " 'mths_since_last_record',\n", - " 'open_acc',\n", - " 'pub_rec',\n", - " 'revol_bal',\n", - " 'revol_util',\n", - " 'state',\n", - " 'merchant_ID',\n", - " 'merchant_risk_score',\n", + " \"fico_score\",\n", + " \"loan_amount\",\n", + " \"term\",\n", + " \"interest_rate\",\n", + " \"installment\",\n", + " \"grade\",\n", + " \"home_ownership\",\n", + " \"annual_income\",\n", + " \"verification_status\",\n", + " \"pymnt_plan\",\n", + " \"addr_state\",\n", + " \"dti\",\n", + " \"delinq_2yrs\",\n", + " \"inq_last_6mths\",\n", + " \"mths_since_last_delinq\",\n", + " \"mths_since_last_record\",\n", + " \"open_acc\",\n", + " \"pub_rec\",\n", + " \"revol_bal\",\n", + " \"revol_util\",\n", + " \"state\",\n", + " \"merchant_ID\",\n", + " \"merchant_risk_score\",\n", "]\n", "\n", "if torch.cuda.is_available():\n",