The goal of this project is to analyse the data set made available as a result of the Enron fraud case of the early 2000s, and to identify patterns in the data that could assist in predicting future fraud cases in similar industries. To help identify patterns and validate results, I used a number of machine learning techniques to automatically classify individuals as either "Person of Interest" (POI) or "Non Person of Interest" based on financial and communication (email) data.
The remainder of this document outlines how I went about this analysis and what my findings were.
The data set originally made available contains a fair amount of information about the Enron conspirators as well as a number of other employees. Some summary statistics for the data set:
- 146 records are available in the data
- 18 of those entries are known persons of interest (12.3% of the total data)
- Each record has up to 20 features available (not including the POI flag)
- However, one of those features is "email_address", which cannot be used to predict POI status given that it is a direct label for an individual and provides no meaningful data about fraud intent
The small number of records and even smaller number of known POIs is likely to present challenges in model accuracy / precision / recall as it will be difficult to generalise a model based on so few data points. Refer to the validation and evaluation section below for details on how I approached this challenge.
In analysing the data, 3 outliers were identified that had to be removed before any processing could be performed (a removal sketch follows the list):
- A record named "TOTAL", a copy/paste artefact from an Excel spreadsheet or similar, which grossly biases the financial data as it is a summation of all other salaries, bonuses, etc. (not an actual person)
- A record named "THE TRAVEL AGENCY IN THE PARK", which is clearly a business and not a person, and should be removed for this exercise
- A record named "LOCKHART EUGENE E", which has no identifying features ("NaN" for every data point) and can therefore be removed to avoid biasing the non-POI pool
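For illustration, here is a minimal sketch of how these three records could be dropped, assuming the data is loaded from the project's pickled data set into a dictionary keyed by person name (the file name below is an assumption):

```python
import pickle

# Load the project data set; the file name here is assumed to be the
# standard pickle shipped with the project.
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Remove the three outlier records identified above.
# dict.pop with a default avoids a KeyError if a record is already absent.
for outlier in ("TOTAL", "THE TRAVEL AGENCY IN THE PARK", "LOCKHART EUGENE E"):
    data_dict.pop(outlier, None)
```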
In addition, the following table summarises the sparsity of the remaining features (a snippet for reproducing these counts follows the table):
Feature | Unique Values | Null Values |
---|---|---|
salary | 94 | 49 |
to_messages | 87 | 57 |
deferral_payments | 39 | 105 |
loan_advances | 4 | 140 |
bonus | 41 | 62 |
restricted_stock_deferred | 18 | 126 |
deferred_income | 44 | 95 |
total_stock_value | 124 | 18 |
expenses | 94 | 49 |
from_poi_to_this_person | 57 | 69 |
exercised_stock_options | 101 | 42 |
from_messages | 65 | 57 |
other | 91 | 52 |
from_this_person_to_poi | 41 | 77 |
long_term_incentive | 52 | 78 |
shared_receipt_with_poi | 84 | 57 |
restricted_stock | 91 | 34 |
director_fees | 17 | 127 |
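The unique/null counts above can be reproduced with a short loop; this sketch assumes the dictionary layout from the previous snippet, with missing values stored as the string 'NaN':

```python
from collections import defaultdict

# Count distinct non-null values and 'NaN' placeholders per feature,
# assuming data_dict maps person name -> {feature: value} and missing
# values are stored as the string 'NaN'.
unique_values = defaultdict(set)
null_counts = defaultdict(int)

for person, features in data_dict.items():
    for feature, value in features.items():
        if feature in ("poi", "email_address"):
            continue
        if value == "NaN":
            null_counts[feature] += 1
        else:
            unique_values[feature].add(value)

for feature in sorted(unique_values):
    print(f"{feature}: {len(unique_values[feature])} unique, {null_counts[feature]} null")
```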
I also found two records, BELFER ROBERT and BHATNAGAR SANJAY, in which values had been transposed into the wrong columns. I manually corrected these errors using data from the provided PDF to ensure that the information used in the model is as accurate as possible.
To improve the effectiveness of my machine learning models, I needed to reduce the number of features made available. Instead of manually analysing every feature individually, I allowed the classifier search to choose between PCA (principal component analysis) across the full feature set and SelectKBest, tuned for whichever delivered the maximal f1 value (a proxy for combined recall and precision). The highest performing classifier was a Nearest Centroid leveraging SelectKBest for dimensionality reduction.
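As a rough sketch of that search setup (the pipeline step names and the classifier shown are illustrative, not the exact code used), the parameter grid can be written so that the dimensionality reduction step itself is swapped between PCA and SelectKBest:

```python
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import NearestCentroid
from sklearn.pipeline import Pipeline

# The "reduce" step is a placeholder; the parameter grid below swaps it
# between PCA and SelectKBest so the search can pick whichever scores best.
pipe = Pipeline([
    ("reduce", SelectKBest()),
    ("clf", NearestCentroid()),
])

param_grid = [
    {"reduce": [PCA()], "reduce__n_components": list(range(2, 21))},
    {"reduce": [SelectKBest()], "reduce__k": list(range(2, 21))},
]

search = GridSearchCV(pipe, param_grid, scoring="f1")
# search.fit(features, labels)  # features / labels prepared elsewhere
```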
The most performant model (Nearest Centroid) leveraged SelectKBest to select the most informative features, with 15 features chosen in total. The selected features and their scores are noted below, followed by a short snippet showing how such scores can be extracted.
Feature | Score | Selected? |
---|---|---|
total_stock_value | 14.689864354826572 | True |
exercised_stock_options | 13.714161147390753 | True |
salary | 11.196268305382173 | True |
bonus | 11.129479151071294 | True |
restricted_stock | 6.5727016131222769 | True |
deferred_income | 5.2791102364442963 | True |
expenses | 5.8987344206928967 | True |
shared_receipt_with_poi | 5.495321379933408 | True |
from_poi_to_this_person | 3.5906800150961415 | True |
long_term_incentive | 2.611274332028505 | True |
from_this_person_to_poi | 2.1438408958291655 | True |
director_fees | 1.9616084482924003 | True |
restricted_stock_deferred | 0.79383448019021663 | True |
to_messages | 0.68481475854756191 | True |
deferral_payments | 0.26396796392070854 | True |
loan_advances | 0.19714881780250351 | False |
from_messages | 0.17138368605383292 | False |
other | 0.014163413206358109 | False |
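For reference, scores and selection flags like those in the table come straight from a fitted SelectKBest. The sketch below uses the default f_classif scoring function and an abbreviated, illustrative feature list rather than the full set (with k=15) used in the real run:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Build a small numeric matrix from data_dict (sketch): 'NaN' strings become
# 0.0, mirroring the common feature-formatting approach for this project.
feature_names = ["salary", "bonus", "total_stock_value", "exercised_stock_options"]
features = np.array([
    [0.0 if rec.get(f, "NaN") == "NaN" else float(rec[f]) for f in feature_names]
    for rec in data_dict.values()
])
labels = np.array([1 if rec["poi"] else 0 for rec in data_dict.values()])

# Fit the selector and report each feature's score and whether it was kept.
selector = SelectKBest(f_classif, k=2).fit(features, labels)
for name, score, kept in zip(feature_names, selector.scores_, selector.get_support()):
    print(f"{name}: score={score:.3f}, selected={kept}")
```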
Based purely on the classifier types involved, scaling was not strictly required (the GridSearch chose between classifiers that do not benefit from scaled data). However, because PCA was also available to reduce the dimensionality of the data, I had to normalise the remaining feature values so that large values did not grossly outweigh smaller ones (e.g. salary figures versus the number of emails received). To do so, I used MaxAbsScaler, an absolute-value scaler that ensures negative values in some fields do not distort the component selection.
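A toy example of that scaling behaviour: MaxAbsScaler divides each column by its maximum absolute value, so every feature lands in [-1, 1] and negative entries keep their sign (the numbers below are made up):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Toy example: a large-magnitude financial column next to a small email-count
# column, including a negative entry (as occurs in deferred_income).
X = np.array([
    [1_000_000.0, 12.0],
    [-250_000.0, 40.0],
    [500_000.0, 3.0],
])

# Each column is divided by its maximum absolute value, so all values land
# in [-1, 1] and negative entries keep their sign.
print(MaxAbsScaler().fit_transform(X))
```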
In addition to the existing features, I engineered two of my own: converting the number of emails sent to / received from a person of interest into a fraction of the total emails that person sent or received. These ratios would logically be of more interest than absolute counts, as they allow a readier comparison between individuals regardless of how heavily each one uses email.
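A sketch of how these two ratio features can be derived; the new feature names (fraction_from_poi, fraction_to_poi) are illustrative, and missing or zero denominators leave the value as 'NaN':

```python
def add_email_ratio_features(data_dict):
    """Add fraction-of-emails-to/from-POI features to every record (sketch).

    The new feature names are illustrative; missing or zero inputs leave the
    ratio as 'NaN' so handling stays consistent with the rest of the data.
    """
    for features in data_dict.values():
        from_poi = features.get("from_poi_to_this_person", "NaN")
        to_msgs = features.get("to_messages", "NaN")
        to_poi = features.get("from_this_person_to_poi", "NaN")
        from_msgs = features.get("from_messages", "NaN")

        # Fraction of received emails that came from a POI.
        if "NaN" in (from_poi, to_msgs) or float(to_msgs) == 0:
            features["fraction_from_poi"] = "NaN"
        else:
            features["fraction_from_poi"] = float(from_poi) / float(to_msgs)

        # Fraction of sent emails that went to a POI.
        if "NaN" in (to_poi, from_msgs) or float(from_msgs) == 0:
            features["fraction_to_poi"] = "NaN"
        else:
            features["fraction_to_poi"] = float(to_poi) / float(from_msgs)

    return data_dict
```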
I ran my optimised models (see below) both with and without my features to determine whether they would be selected by SelectKBest and what scale of impact they would have on the resulting model scores. Here are the statistics for the most successful classifier both with and without my engineered features.
Type | Model | Accuracy | Precision | Recall | F1 |
---|---|---|---|---|---|
With my features | Nearest Centroid | 0.77680 | 0.33014 | 0.65500 | 0.43901 |
Without my features | Nearest Centroid | 0.74713 | 0.31311 | 0.75100 | 0.44196 |
Based on the statistics above, the best classifier with and without my features remains the same, but the f1 score improves slightly when my features are excluded. As a result, I excluded them from the final optimised classifier.
The algorithm was automatically selected by GridSearchCV, and the results of all the competing algorithms can be seen below. I set up the GridSearch to optimise for the f1 performance of the model, in order to maximise the combination of precision and recall that was the stated outcome for this project.
Algorithm | Run time | Accuracy | Precision | Recall | F1 | Winner |
---|---|---|---|---|---|---|
GaussianNB+ | 5.6s | 0.84853 | 0.41807 | 0.34700 | 0.37923 | No |
KNearestNeighbors | 208.1s | 0.88587 | 0.70168 | 0.25050 | 0.36920 | No |
NearestCentroid | 298.4s | 0.74713 | 0.31311 | 0.75100 | 0.44196 | Yes |
DecisionTree | 756.6s | 0.85327 | 0.44254 | 0.38700 | 0.41291 | No |
RandomForest+ | 916.7s* | 0.85613 | 0.39793 | 0.15400 | 0.22206 | No |
AdaBoost+ | 519.8s* | 0.83713 | 0.36403 | 0.29650 | 0.32681 | No |
Items with an * next to them above were run in multi-core mode individually to speed up processing time. Items with a + performed better with my new features left in place, and the scores above reflect that.
Tuning refers to changing the parameters of the selected algorithm to adjust how the model works and interacts with the provided data. If done incorrectly, tuning can negatively impact the performance of the model. I tuned using GridSearch and a parameter dictionary, automatically adjusting and finding the combination of parameters that maximised f1 (a proxy for the recall and precision scores we were aiming for).
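As a condensed sketch of that mechanism (the grid and split counts here are illustrative and much smaller than the real search):

```python
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import NearestCentroid

# Abbreviated sketch of the tuning loop: every combination in the parameter
# dictionary is scored with f1 under cross-validation and the best-scoring
# combination is kept. The real search covered far more parameters
# (scalers, dimensionality reduction, shrink thresholds, etc.).
param_grid = {
    "metric": ["euclidean", "manhattan"],
}

cv = StratifiedShuffleSplit(n_splits=100, test_size=0.3, random_state=42)
search = GridSearchCV(NearestCentroid(), param_grid, scoring="f1", cv=cv)
# search.fit(features, labels)
# search.best_params_ then reports the winning combination
```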
Different parameters were tuned in each of the classifiers noted above. The below outlines all the testing performed.
GaussianNB (76 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Scaler | [None, MaxAbsScaler()] | None |
Dimensionality Reducer | [PCA(), SelectKBest()] | PCA |
Number features (dim reduction) | range(2, 21) | 8 |
KNearestNeighbors (2,736 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Scaler | [None, MaxAbsScaler()] | MaxAbsScaler |
Dimensionality Reducer | [PCA(), SelectKBest()] | PCA |
Number features (dim reduction) | range(2, 21) | 2 |
Number of neighbors | [1, 3, 5, 7, 9, 10, 15, 20, 50] | 3 |
Weight Function | ['uniform','distance'] | distance |
Power | [1, 2] | 2 |
NearestCentroid (5,016 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Scaler | [None, MaxAbsScaler()] | MaxAbsScaler |
Dimensionality Reducer | [PCA(), SelectKBest()] | SelectKBest |
Number features (dim reduction) | range(2, 21) | 6 |
Distance Metric | ['cityblock', 'cosine', 'euclidean', 'l1', 'l2', 'manhattan'] | manhattan |
Shrink Threshold | [None, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | 5 |
DecisionTree (4,104 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Dimensionality Reducer | [PCA(), SelectKBest()] | SelectKBest |
Number features (dim reduction) | range(2, 21) | 13 |
Split quality criteria | ['entropy', 'gini'] | entropy |
Split strategy | ['random', 'best'] | best |
Minimum split samples | range(2, 20, 2) | 16 |
Max tree features | ['sqrt','log2', None] | None (all) |
RandomForest (5,130 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Dimensionality Reducer | [PCA(), SelectKBest()] | SelectKBest |
Number features (dim reduction) | range(2, 21) | 14 |
Number estimators | [5, 10, 15, 20, 25] | 5 |
Split quality criteria | ['entropy', 'gini'] | gini |
Minimum split samples | range(2, 20, 2) | 12 |
Max tree features | ['sqrt','log2', None] | sqrt |
AdaBoost (190 combinations)
Parameter Tested | Tested Values | Optimal Value |
---|---|---|
Dimensionality Reducer | [PCA(), SelectKBest()] | SelectKBest |
Number features (dim reduction) | range(2, 21) | 9 |
Number estimators | [25, 50, 75, 100, 500] | 100 |
Validation refers to how you confirm that your model is predictive of the outcome as desired - either through splitting your data into training/testing sets, or through a form of cross-validation. A classic validation mistake is not splitting your data and testing directly on the same data that you trained on - causing wildly inaccurate (overstated) results.
Given the very small amount of data available for training and testing (146 records), a classic single train/test split was not advisable, as we could introduce bias into the training or testing sets by pure luck given that only 12% of the entries were true positives. Using a cross-validation / k-fold technique like StratifiedShuffleSplit was therefore important, as it allowed the data to be cycled many times in small batches to train the model more thoroughly, and allowed biases in the data to be averaged out over many splits.
A stratified shuffle split ensures that every randomly drawn training/testing split preserves the overall proportion of POIs to non-POIs, while the shuffling varies which records land in each split, helping to average out biases in the underlying data.
This validation approach was used both in the final tester.py script and in my GridSearch runs to find the best classifier and hyperparameters for the problem.
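A minimal sketch of that splitting behaviour, using made-up labels with the same ~12% POI ratio, shows that each split keeps the class balance in both halves:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Illustrative labels: 18 POIs among 143 records, mirroring the class balance
# after outlier removal. The feature column is a dummy.
labels = np.array([1] * 18 + [0] * 125)
features = np.arange(len(labels), dtype=float).reshape(-1, 1)

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.3, random_state=42)
for train_idx, test_idx in sss.split(features, labels):
    # Both halves of every split keep roughly the same POI proportion.
    print("train POI ratio: %.3f, test POI ratio: %.3f"
          % (labels[train_idx].mean(), labels[test_idx].mean()))
```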
The goal of this project was to lift both recall and precision above a threshold (0.3). In real terms, this meant that fewer than 70% of the individuals we accused of fraud could be incorrectly accused (false positives), and that we could miss (fail to flag) fewer than 70% of the true fraudsters (false negatives).
In the final model that was submitted for this project, I ended up with the following average performance from the best model:
- Precision: 0.31311
- Recall: 0.75100
In human-readable terms, this means that roughly 31.3% of the people we flagged as POIs were actually POIs (as measured by the precision metric). However, based on our recall figure of 0.75100, we identified 75.1% of the possible POIs, letting only ~24.9% of fraudsters get away with their crime.
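To make those two figures concrete, here is how precision and recall fall out of raw prediction counts; the counts below are illustrative only, chosen to roughly reproduce the reported values:

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from raw prediction counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Illustrative counts only: 751 correctly flagged POIs, 1648 false alarms,
# and 249 missed POIs give roughly the reported precision and recall.
print(precision_recall(751, 1648, 249))  # ~ (0.313, 0.751)
```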