From c61443adcae23e79401a6c71d9c09a458daccb69 Mon Sep 17 00:00:00 2001 From: Ian Spektor Date: Mon, 18 Dec 2023 16:17:02 -0300 Subject: [PATCH 1/2] add titles to fraud tutorial --- .vscode/settings.json | 2 +- .../bank_fraud_detection_with_tfdf.ipynb | 24 +++++++++++++++---- 2 files changed, 21 insertions(+), 5 deletions(-) diff --git a/.vscode/settings.json b/.vscode/settings.json index d8e37865a..ebf2ca965 100644 --- a/.vscode/settings.json +++ b/.vscode/settings.json @@ -41,7 +41,7 @@ 80 ], "editor.codeActionsOnSave": { - "source.organizeImports": false + "source.organizeImports": "never" }, "files.trimFinalNewlines": true, "files.trimTrailingWhitespace": true, diff --git a/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb b/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb index a7879baab..ddf03ab88 100644 --- a/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb +++ b/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb @@ -103,6 +103,8 @@ "id": "2c8ab4a8-db54-47f3-a212-5ef785fa607b", "metadata": {}, "source": [ + "## Load the data\n", + "\n", "The dataset consists of banking transactions sampled between April 1, 2018 and September 30, 2018. The transactions are stored in CSV files, one for each day. The transactions from April 1, 2018 to August 31, 2018 (inclusive) are used for training, while the transactions from September 1, 2018 to September 30, 2018 are used for evaluation." ] }, @@ -421,6 +423,8 @@ "id": "47568218-a7c0-4f2b-8de6-4a738ec1c23f", "metadata": {}, "source": [ + "## Create and plot an EventSet\n", + "\n", "Convert the Pandas DataFrame into a Temporian EventSet." ] }, @@ -530,6 +534,8 @@ "id": "b2f4a4fb-fc01-48a8-b69c-6fdf96ea65eb", "metadata": {}, "source": [ + "## Compute features\n", + "\n", "After exploring the dataset, we want to compute some augmented features that may correlate with fraudulent activities. We will compute the following three features:\n", "\n", "**Calendar features**: Extract the hour of the day and the day of the week as individual features. 
This is useful because fraudulent transactions may be more likely to occur at specific times.\n",
@@ -626,12 +632,12 @@
     "    augmented_transactions = feature_per_terminal.drop_index().join(\n",
     "        feature_per_customer.drop_index()[[\"per_customer.moving_sum_frauds\",\"transaction_id\"]],\n",
     "        on=\"transaction_id\")\n",
-    "    \n",
+    "\n",
     "    # Join the calendar features\n",
     "    augmented_transactions = augmented_transactions.join(\n",
     "        calendar[[\"calendar_hour\", \"calendar_day_of_week\", \"transaction_id\"]],\n",
     "        on=\"transaction_id\")\n",
-    "    \n",
+    "\n",
     "    print(\"AUGMENTED TRANSACTIONS:\\n\", augmented_transactions.schema)\n",
     "\n",
     "    return {\"output\": augmented_transactions}\n",
@@ -674,6 +680,8 @@
   "id": "83bd7b19-f7f4-44f6-b6c9-2baa0ce6c85d",
   "metadata": {},
   "source": [
+    "## Save the Temporian program\n",
+    "\n",
     "Save the Temporian program that computes the augmented transactions to disk.\n",
     "We will not use this program again in this notebook, but in practice, this data augmentation stage should be included with the model.\n",
     "\n",
     "A saved Temporian program can be reloaded later or applied on a large dataset using the [Apache Beam](https://beam.apache.org/) backend."
   ]
  },
@@ -727,7 +735,9 @@
   "id": "74afcc35-2103-4b13-8513-fa76dcfbf15b",
   "metadata": {},
   "source": [
-    "Plot the relation between the augmented features and the label.\n",
+    "## Analyze the engineered features\n",
+    "\n",
+    "Plot the relation between the engineered features and the label.\n",
     "\n",
-    "**Observations:** The feature `per_terminal.moving_sum_frauds` and `per_customer.moving_sum_frauds` seems to discriminate between fraudulent and non-fraudulent transactions, while the calendar features are not discriminative."
+    "**Observations:** The features `per_terminal.moving_sum_frauds` and `per_customer.moving_sum_frauds` seem to discriminate between fraudulent and non-fraudulent transactions, while the calendar features are not discriminative."
   ]
  },
@@ -775,6 +785,8 @@
   "id": "a9c52af7-be82-45ee-a6a3-df7fbf90f635",
   "metadata": {},
   "source": [
+    "## Split the data\n",
+    "\n",
     "The next step is to split the dataset into a training and testing dataset.\n",
     "\n",
     "One common approach is to use the `EventSet.timestamps()` operator. This operator converts the timestamp of a transaction into a feature that can be compared to `train_test_split`."
   ]
  },
@@ -849,7 +861,7 @@
     "# Plot\n",
     "train_test_switch_tp.plot()\n",
     "\n",
-    "# All the transactions before the demarcating event are part of the training dataset (i.e. `is_train=True`) \n",
+    "# All the transactions before the demarcating event are part of the training dataset (i.e. `is_train=True`)\n",
     "is_train = train_test_switch_tp.since_last(sampling=augmented_dataset_tp).isnan()\n",
     "is_test = ~is_train\n",
     "\n",
@@ -905,6 +917,8 @@
   "id": "1182fb3d-f39f-4b76-8e62-b19a29c91998",
   "metadata": {},
   "source": [
-    "We first convert the Temporal EventSets into Pandas DataFrames. Then, we use the `tfdf.keras.pd_dataframe_to_tf_dataset` function to convert these DataFrames into TensorFlow datasets that can be used by TensorFlow Decision Forests."
+    "## Train a model\n",
+    "\n",
+    "We first convert the Temporian EventSets into Pandas DataFrames. Then, we use the `tfdf.keras.pd_dataframe_to_tf_dataset` function to convert these DataFrames into TensorFlow datasets that can be used by TensorFlow Decision Forests."
   ]
  },
@@ -982,6 +996,23 @@
   "id": "c6f473ac-0746-4610-ad93-cae58ed0d336",
   "metadata": {},
   "source": [
-    "Finally, we plot the ROC (Receiver operating characteristic) curve and compute the AUC (Area Under the Curve)."
+    "## Evaluate the model\n",
+    "\n",
+    "Finally, we plot the ROC (Receiver operating characteristic) curve and compute the AUC (Area Under the Curve).\n",
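+    "\n",
+    "For reference, a minimal sketch of this evaluation step (illustrative only: it assumes the `model` and `test_ds` objects defined in the previous cells, and that scikit-learn is installed):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "from sklearn.metrics import roc_auc_score, roc_curve\n",
+    "\n",
+    "# Fraud probability for each test transaction (TF-DF predicts shape (n, 1)).\n",
+    "probs = model.predict(test_ds)[:, 0]\n",
+    "# Collect the ground-truth labels from the batched TensorFlow dataset.\n",
+    "labels = np.concatenate([y.numpy() for _, y in test_ds])\n",
+    "\n",
+    "fpr, tpr, _ = roc_curve(labels, probs)\n",
+    "print(\"AUC:\", roc_auc_score(labels, probs))\n",
+    "```"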
   ]
  },

From b493e51679a4903708e509702c4612e404ce49ed Mon Sep 17 00:00:00 2001
From: Ian Spektor
Date: Mon, 18 Dec 2023 16:22:29 -0300
Subject: [PATCH 2/2] improve text in fraud notebook

---
 .../bank_fraud_detection_with_tfdf.ipynb      | 30 +++++++++++--------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb b/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb
index ddf03ab88..17687879f 100644
--- a/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb
+++ b/docs/src/tutorials/bank_fraud_detection_with_tfdf.ipynb
@@ -12,16 +12,16 @@
     "\n",
     "Detection of fraud in online banking is critical for banks, businesses, and their consumers. The book \"[Reproducible Machine Learning for Credit Card Fraud Detection](https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html)\" by Le Borgne et al. introduces the problem of payment card fraud and shows how fraud can be detected using machine learning. However, since banking transactions are sensitive and not widely available, the book uses a synthetic dataset for practical exercises.\n",
     "\n",
     "This notebook uses the same dataset to show how to use Temporian and [TensorFlow Decision Forests](https://www.tensorflow.org/decision_forests) to detect fraud. Temporian is used for data preprocessing and feature engineering, while TensorFlow Decision Forests is used for model training. Feature engineering is often critical for temporal data, and this notebook demonstrates how complex feature engineering can be performed with ease using Temporian.\n",
     "\n",
     "The notebook is divided into three parts:\n",
     "\n",
     "1. Download the dataset and import it into Temporian.\n",
     "1. Perform various types of feature engineering and visualize the correlation between the engineered features and fraud target labels.\n",
     "1. Train and evaluate a machine learning model to detect fraud using the engineered features.\n",
     "\n",
     "\n",
     "*Note: This notebook assumes a basic understanding of Temporian. If you are not familiar with Temporian, we recommend that you read the [Getting started](https://temporian.readthedocs.io/en/stable/getting_started) guide first.*\n",
     "\n"
   ]
  },
@@ -536,7 +536,7 @@
   "source": [
     "## Compute features\n",
     "\n",
-    "After exploring the dataset, we want to compute some augmented features that may correlate with fraudulent activities. We will compute the following three features:\n",
+    "After exploring the dataset, we want to compute some features that may correlate with fraudulent activities. 
We will compute the following three features:\n", "\n", "**Calendar features**: Extract the hour of the day and the day of the week as individual features. This is useful because fraudulent transactions may be more likely to occur at specific times.\n", "\n", @@ -544,7 +544,7 @@ "\n", "**Moving sum of fraud per terminal**: For each terminal, we will extract the number of fraudulent transactions in the last 4 weeks. This is useful because some fraudulent transactions may be caused by ATM skimmers. In this case, many transactions from the same terminal may be fraudulent. However, since we only know after a week if a transaction is fraudulent, there will be a lag in this feature as well.\n", "\n", - "Data augmentation features often have parameters that need to be selected. For example, why look at the last 4 weeks instead of the last 8 weeks? In practice, you will want to compute the features with many different parameter values (e.g., 1 day, 2 days, 1 week, 2 weeks, 4 weeks, and 8 weeks). However, to keep this example simple, we will only use 4 weeks here.\n" + "Features often have parameters that need to be selected. For example, why look at the last 4 weeks instead of the last 8 weeks? In practice, you will want to compute the features with many different parameter values (e.g., 1 day, 2 days, 1 week, 2 weeks, 4 weeks, and 8 weeks). However, to keep this example simple, we will only use 4 weeks here.\n" ] }, { @@ -651,7 +651,7 @@ "id": "670dcdb5-93d4-495d-a852-5d37dc0a364e", "metadata": {}, "source": [ - "Plot the augmented features on the selected customer." + "Plot the engineered features on the selected customer." ] }, { @@ -683,7 +683,7 @@ "## Save the Temporian program\n", "\n", "Save the Temporian program that computes the augmented transactions to disk.\n", - "We will not use this program again in this notebook, but in practice, this data augmentation stage should be included with the model.\n", + "We will not use this program again in this notebook, but in practice, the feature engineering stage should be included with the model.\n", "\n", "A saved Temporian program can be reloaded later or applied on a large dataset using the [Apache Beam](https://beam.apache.org/) backend." ] @@ -1049,13 +1049,19 @@ "**Observations:**\n", "\n", "- The AUC of 0.79 shows the capability of the model to detect frauds.\n", - "- The augmented features we created are efficient at identifying some types of fraud, as evidenced by the recall at low FPR (see FPR=0.02, TPR=0.5).\n", - "- However, for FPRs greater than 0.02, the TPR increases slowly, indicating that the remaining types of fraud are more difficult to detect. We need to conduct further analysis and create new features to improve our ability to detect these remaining frauds.\n", + "- The engineered features we created are efficient at identifying some types of fraud, as evidenced by the recall at low FPR (see FPR=0.02, TPR=0.5).\n", + "- However, for FPRs greater than 0.02, the TPR increases slowly, indicating that the remaining types of fraud are more difficult to detect. We need to conduct further analysis and create new features to improve our ability to detect them.\n", "\n", - "**Homeworks**\n", + "**Homework**\n", "\n", - "Do you have any ideas for other features or feature augmentations that could improve the model's performance? For example, we could compute features per customer and per terminal, or we could create features related to transaction amount. 
These changes could help us reach an AUC of >0.88.\n"
   ]
  }
 ],
 "metadata": {