- About The Project
- Business Objective
- Business Metrics
- Getting Started
- Data Workflow
- Prediction using API and Streamlit
Fraud is dynamic and follows no fixed pattern, which makes it hard to identify. The amounts claimed in fraudulent cases are significant and can lead to serious financial losses. Such fraud occurs across all areas of insurance with high severity, including the auto sector, where fake accident claims are widespread.
This project applies multiple ML techniques to label and classify fraudulent claims. Before fitting any ML model, we use exploratory data analysis to understand how the categorical and numeric features relate to the target (fraud classification). Feature selection methods are also applied, including the Chi-squared test, Shapiro-Wilk test, probability plots, Welch's test, Pearson correlation, the Lasso method, XGBoost feature importance, and Random Forest feature importance.
During preprocessing, the dataset presents the unpleasant task of highly imbalanced fraud and non-fraud classes. Hence, Oversampling and SMOTENC are applied to the dataset, giving three data configurations: Non-Balanced, Oversampling, and SMOTENC.
The dataset comes from the auto insurance industry, and accuracy, precision, and recall from the confusion matrix, together with the Area Under the ROC Curve (AUC), are used as the evaluation metrics.
The goal of this project is to build a binary classifier able to predict fraudulent transactions. We present conclusions based on an empirical study comparing different ML models using classification metrics such as recall and AUC score.
Precision and recall are chosen among the evaluation metrics for the classification models.
Precision is the number of true positives divided by the sum of true positives and false positives. Recall is the number of true positives divided by the sum of true positives and false negatives.
A high recall indicates the model can identify the relevant (positive) cases without mislabeling them as irrelevant (negative). A high precision indicates that the values predicted as positive are indeed positive, with few irrelevant results (false positives).
A high recall score (a low number of false negatives, even at the cost of more false positives) is the main criterion for selecting the best model. The AUC score is another consideration when choosing the best model.
ROC is an evaluation metric for binary classification problems: a probability curve that plots the TPR against the FPR at various threshold values.
An excellent model has an AUC near 1, which means it has a good measure of separability. A poor model has an AUC near 0, which means it has the worst measure of separability.
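As a minimal sketch (assuming a fitted classifier model and a hold-out set x_test, y_test with labels encoded as 0 = non-fraud, 1 = fraud), these metrics can be computed with scikit-learn:

from sklearn.metrics import precision_score, recall_score, roc_auc_score

y_pred = model.predict(x_test)               # hard class predictions
y_proba = model.predict_proba(x_test)[:, 1]  # probability of the positive (fraud) class

precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
auc = roc_auc_score(y_test, y_proba)         # area under the ROC curve (TPR vs FPR)

print(f"Precision: {precision:.3f} | Recall: {recall:.3f} | AUC: {auc:.3f}")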
Zareapoor and Shamsolmoali (2015) applied different machine learning techniques, namely Naïve Bayes, KNN, SVM, and a Bagging Ensemble Classifier, to fraud detection. They note that real-world fraud data is usually highly unbalanced, with fraudulent transactions making up only about 2% of all transactions. Another major challenge frequently reported in many studies is the dynamic pattern of fraud transactions. Their methodology split the dataset into four different sets containing 20% fraud, 15% fraud, 10% fraud, and 3% fraud. They used the True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR) as evaluation metrics, since accuracy (error rate) is not well suited to imbalanced datasets. Their final result was that the KNN model had a much better false alarm rate and fraud-catching rate than the others.
- Clone the repository
$ git clone https://github.com/DandiMahendris/Auto-Insurance-Fraud-Detection.git
- Install the required libraries and packages listed in requirements.txt
- If you want to create the folder yourself instead
$ mkdir folder
$ git init
- Add the remote repository and pull
$ git remote add origin git@github.com:DandiMahendris/Auto-Insurance-Fraud-Detection.git
$ git pull origin main
In this study, we use a Kaggle-provided Auto Insurance Claim dataset containing 1,000 rows and 40 columns.
Column Name | Description |
---|---|
Policy_annual_premium | Amount the insured needs to pay over the year |
Insured_zip | Zip code of the insured |
Capital-gains | Profit earned on sales |
Capital-loss | Loss incurred |
Incident_hour_of_the_day | Hour of the day the incident occurred |
Total claim amount | Total Amount of claims |
Injury claim | Amount of injury claims |
Property claim | Amount of property claim |
Vehicle claim | Amount of vehicle claim |
Policy_state | State of policy holder |
Policy_csl | Combined single limit of policy |
Policy_deductable | Deductible amount paid first by the insured |
Insured_sex | Gender |
Insured_hobbies | Hobby of insured |
Incident_type | Type of incident occurred |
Collision_type | Type of damaged parts |
Incident_severity | Type of severity damaged |
Authorities contacted | Type of authority contacted after the incident |
Incident_state | State of incident |
Incident_city | City of incident |
Number of vehicle | Number of vehicles involved |
Property damaged | Property damaged or not |
Bodily injuries | Number of bodily injuries |
Witnesses | Number of witnesses |
Auto_year | Year of auto model |
Police_report | Police report available or not |
Auto_make | Manufacturer of the auto |
Auto_model | Auto model |
The dataset is unbalanced: 247 claims (24.7%) are labeled as fraud, while 753 claims (75.3%) are classified as non-fraud cases. You can view the full data report in data_profile_report.html, provided in this repository.
A glimpse of the dataset is shown in the picture below.
At this stage, after the data was prepared by checking its shape and value ranges, we split it into train data (70%), validation data (15%), and test data (15%) using sklearn.model_selection.train_test_split with stratify=y, because we want the imbalanced classes distributed in the same proportions across each subset.
# First split off a hold-out portion, then split that portion in half into validation and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
x_valid, x_test, y_valid, y_test = train_test_split(x_test, y_test, test_size=0.5, random_state=42, stratify=y_test)
Since we are using a classification model with labeled data, we want to check the relationship between the categorical and numerical features and the label.
What we look out for:
- Potential outliers
- Numeric features that should be categorical (indicator) features. For example, if there are only two possible values like 0 and 1.
- Boundaries that do not make sense, such as a percentage value greater than 100.
Here we provide four histograms of numerical features, separated into non-fraud (left) and fraud (right) cases.
- insured_zip values fall into two groups, below 500,000 and above 600,000.
- capital_gains is highly left-skewed, since it contains many zero values (0).
A KDE plot using klib.dist_plot gives a better view of the distribution of the numeric features, annotated with the mean, median, and standard deviation.
Several features in the dataset exhibit a bimodal distribution, meaning they have two distinct peaks or modes. These features are strongly correlated with monetary amounts, and the occurrence of zero values in them creates the bimodal shape. Their correlation with the amount of money suggests that they play a significant role in determining the monetary values observed in the dataset.
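A minimal sketch of the klib KDE plot mentioned above, assuming train_set is the training DataFrame:

import klib

# KDE plot of a single money-related feature; klib annotates mean, median,
# and standard deviation on the curve.
klib.dist_plot(train_set['total_claim_amount'])

# Passing the whole DataFrame plots every numeric column.
klib.dist_plot(train_set)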
Here, we separate the data by the target feature and observe how the numeric features are distributed among the fraud (target = 1) and non-fraud (target = 0) cases.
We can see that months_as_customer, age, insured_zip, capital_gains, capital_loss, and incident_hour_of_the_day are distributed almost identically among the non-fraud (target 0) and fraud (target 1) cases. This means these features might not be good predictors for the target.
However, for policy_annual_premium, total_claim_amount, injury_claim, property_claim, and vehicle_claim, the distributions differ noticeably between the two classes, so these could be good predictors.
We will create count plots of the data using Seaborn's catplot() function, passing kind='count' as a parameter. Notice the parameter sharey=False in catplot, which disassociates the y-scales of the two target classes so that we can better view and compare the relationships.
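A minimal sketch of such a count plot (the feature chosen here is just an example from the dataset; newer seaborn versions pass the same setting via facet_kws):

import seaborn as sns
import matplotlib.pyplot as plt

# Count plot of one categorical feature, faceted by the target class.
# facet_kws={'sharey': False} is the newer spelling of sharey=False.
sns.catplot(
    data=train_set,
    x='incident_severity',
    col='fraud_reported',
    kind='count',
    facet_kws={'sharey': False},
)
plt.show()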
These features show a significant relationship with the target (fraud_reported). Visibly, they exhibit distinct patterns between the two classes, and these patterns are strongly associated with the occurrence of fraud, making them crucial indicators for predicting fraudulent activity.
What to look out for:
- Strong relationships between numeric features; drop such features to reduce redundancy (collinearity) in the data.
- Collinearity is the state where two variables are highly correlated and contain similar information about the variance within a given dataset.
To reduce multicollinearity we can drop either total_claim_amount from the data, since total_claim_amount = injury_claim + property_claim + vehicle_claim, or the other three (injury_claim, property_claim, vehicle_claim).
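A minimal sketch of the Pearson correlation check behind this decision, assuming list_num_col holds the numeric column names:

import seaborn as sns
import matplotlib.pyplot as plt

# Pearson correlation matrix of the numeric features
corr = train_set[list_num_col].corr(method='pearson')

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pearson Correlation of Numeric Features')
plt.show()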
Null Hypothesis (H0): the two categorical variables have no relationship (independent), i.e. the feature is not important.
Alternate Hypothesis (H1): there is a relationship between the two categorical variables (dependent), i.e. the feature is important.
The null hypothesis states the probability of observing data at least as extreme as ours, assuming that all categories of the feature in reality have the same frequency of fraud.
from scipy.stats import chi2_contingency
import pandas as pd

X_cat = train_set[list_cat_col].loc[:, ~train_set[list_cat_col].columns.isin(['fraud_reported'])]
y_cat = train_set['fraud_reported']

alpha = 0.05
p_values_ = dict()

for col in X_cat.columns:
    # Contingency table of the feature against the target
    crosstab = pd.crosstab(X_cat[col], y_cat)
    chi2, p, dof, expected = chi2_contingency(crosstab)

    p_values_.setdefault('column', []).append(col)
    p_values_.setdefault('p_values', []).append(p)
    p_values_.setdefault('chi2', []).append(chi2)

    if p < alpha:
        p_values_.setdefault('result', []).append('Reject Ho')
    else:
        p_values_.setdefault('result', []).append('Failed to Reject Ho')

chi2_data = pd.DataFrame(p_values_).set_index('column')
chi2_data
Hence, the features insured_hobbies, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, and property_damage reject H0, meaning they have a statistically significant relationship with the target.
Before performing Welch's Test (or a two-sample t-test), it is appropriate to check the parametric assumptions: normality with the Shapiro-Wilk Test and the Probability (Q-Q) Plot, and equal variance with Levene's Test.
If the variances are equal, we perform the independent t-test; if not, we perform Welch's Test.
H0 : there is no difference in means between the two groups (μ₁ = μ₂)
H1 : there is a difference in means between the two groups (μ₁ ≠ μ₂)
import numpy as np
import scipy
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt

# One subplot per numeric feature (grid size assumed to fit list_num_col)
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(16, 12))

for i, ax in zip(list_num_col, axes.flatten()):
    # Split the feature by target class
    lst_0 = train_set[i][train_set['fraud_reported'] == 'N']
    lst_1 = train_set[i][train_set['fraud_reported'] == 'Y']
    degree = lst_0.count() + lst_1.count()
    x = np.arange(-10, 10, 0.1)
    # Welch's test (unequal variances) and critical t-value at alpha
    t_stat, p_value = ttest_ind(lst_0, lst_1, equal_var=False)
    t_crit = scipy.stats.t.ppf(alpha / 2, degree)
The groups with non-equal variance reject H0, so these variables are statistically significant.
As shown in the EDA section above, to reduce multicollinearity we can drop either total_claim_amount from the data, since total_claim_amount = injury_claim + property_claim + vehicle_claim, or the other three (injury_claim, property_claim, vehicle_claim).
From the multivariate analysis, we conclude that the important features left are insured_hobbies, incident_type, collision_type, incident_severity, authorities_contacted, incident_state, and property_damage, which reject H0 in the Chi-squared Test, plus total_claim_amount, injury_claim, property_claim, and vehicle_claim from Welch's Test; the numeric claim features are then reduced to total_claim_amount only, based on the Pearson correlation mentioned above.
from sklearn.linear_model import LassoCV
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

lasso_cv = LassoCV()

# Fit on the train set after StandardScaler
lasso_cv.fit(
    train_scaled.drop(columns='fraud_reported', axis=1),
    train_scaled['fraud_reported']
)

# Coefficient of each feature under the Lasso penalty
coef = pd.Series(lasso_cv.coef_,
                 index=train_scaled.drop(columns='fraud_reported', axis=1).columns)
imp_coef = coef.sort_values(ascending=False)

matplotlib.rcParams['figure.figsize'] = (8.0, 7.0)
imp_coef.plot(kind='barh')
plt.title('Feature Importance using Lasso Model')
from xgboost import XGBClassifier
import pandas as pd
import matplotlib.pyplot as plt

xgb = XGBClassifier()
xgb.fit(
    train_scaled.drop(columns='fraud_reported', axis=1),
    train_scaled['fraud_reported']
)

# Feature importances from the fitted booster
fig, ax = plt.subplots(figsize=(12, 6))
feat_imp = pd.Series(
    xgb.feature_importances_,
    index=train_scaled.drop(columns='fraud_reported', axis=1).columns.to_list()
).sort_values(ascending=True)
feat_imp.plot(kind='barh', ax=ax, title='Feature Importances')
plt.xlabel('Feature Importance Score')
plt.show()
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

rf = RandomForestClassifier()
rf.fit(
    train_scaled.drop(columns='fraud_reported', axis=1),
    train_scaled['fraud_reported']
)

# Sort features by impurity-based importance and plot them
sorted_index = rf.feature_importances_.argsort()
rf_importances = rf.feature_importances_[sorted_index]
rf_feature_names = train_scaled.drop(columns='fraud_reported', axis=1).columns[sorted_index]

fig, ax = plt.subplots(figsize=(10, 6))
fig.suptitle("Random Forest Feature Importances", fontsize=14, fontname='serif', fontweight='bold')
fig.subplots_adjust(top=0.95)
ax.barh(rf_feature_names, rf_importances)
plt.show()
Due to the imbalance issue, algorithms tend to become biased towards the majority class and perform poorly on the minority class. This difference in class frequencies affects the overall predictive power of the model, since most ML algorithms assume that the data is evenly distributed across classes. In this case, the main issue is that the algorithm will be biased towards predicting the majority class (non-fraud claims).
In this project, we compare model performance on the unbalanced dataset and on balanced datasets produced by two different sampling methods, namely RandomOverSampler and SMOTE.
def balancing(x_data, y_data):
    # --- numerical predictors from config --- #
    x_set_num = x_data[config_data['int32_col']]
    # --- categorical predictors from config --- #
    x_set_cat = x_data[config_data['object_predictor']]
    # --- concatenate predictors (numeric first, then categorical) --- #
    x_train_new = pd.concat([x_set_num, x_set_cat], axis=1)

    # --- SMOTENC: categorical_features are the column positions of the categorical block --- #
    sm = SMOTENC(random_state=42,
                 categorical_features=list(range(len(x_set_num.columns), 32)))
    x_smote, y_smote = sm.fit_resample(x_train_new, y_data)

    # --- Random oversampling of the minority class --- #
    x_over, y_over = RandomOverSampler(random_state=42).fit_resample(x_data, y_data)

    # --- concatenate predictors and label back into full train sets --- #
    train_set_smote = pd.concat([x_smote, y_smote], axis=1)
    train_set_over = pd.concat([x_over, y_over], axis=1)

    return train_set_smote, train_set_over
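A short usage sketch, assuming x_train and y_train hold the training predictors and label from the split above:

# Build the two balanced variants; the untouched train set remains the Non-Balanced configuration.
train_set_smote, train_set_over = balancing(x_train, y_train)

print(train_set_smote['fraud_reported'].value_counts())
print(train_set_over['fraud_reported'].value_counts())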
The quantile (IQR) method is used to handle outliers, but it is pivotal to understand the domain knowledge behind the features containing outliers, since outliers are a sensitive matter: should they be included in our data or not?
import seaborn as sns
import matplotlib.pyplot as plt

# One boxplot per numeric feature, with IQR-based outliers removed
fig, axs = plt.subplots(len(list_num_col_smote_eda), 1, figsize=(8, 3 * len(list_num_col_smote_eda)))

for i, col in enumerate(list_num_col_smote_eda):
    q1 = train_set_smote_eda[col].quantile(0.25)
    q3 = train_set_smote_eda[col].quantile(0.75)
    iqr = q3 - q1

    # Keep only the rows inside [q1 - 1.5*IQR, q3 + 1.5*IQR] for this feature
    train_set_smote_annual = train_set_smote_eda[~((train_set_smote_eda[col] < (q1 - 1.5*iqr)) |
                                                   (train_set_smote_eda[col] > (q3 + 1.5*iqr)))].copy()

    sns.boxplot(data=train_set_smote_annual, x=col, ax=axs[i], y='fraud_reported')
The data is separated into predictors (X) and label (y). The numerical and categorical predictors in X are split, and missing values in both are handled using SimpleImputer.
Missing values in numeric features are handled with the median strategy:
SimpleImputer(missing_values = np.nan,
              strategy = 'median')
Missing values in categorical features are handled with the constant strategy, filling them with the value UNKNOWN.
SimpleImputer(missing_values = np.nan,
strategy = 'constant',
fill_value = 'UNKNOWN')
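A brief usage sketch of the two imputers; num_cols and cat_cols are hypothetical names for the numeric and categorical predictor column lists:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

num_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
cat_imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value='UNKNOWN')

# Fit on the training predictors only, then transform
x_train_num = pd.DataFrame(num_imputer.fit_transform(x_train[num_cols]),
                           index=x_train.index, columns=num_cols)
x_train_cat = pd.DataFrame(cat_imputer.fit_transform(x_train[cat_cols]),
                           index=x_train.index, columns=cat_cols)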
Machine learning algorithms treat the order of numbers as meaningful: a higher number is read as more important than a lower one. Some distance-based algorithms would be harmed by such an implied ordering when the categorical values have no real relationship.
Hence, categorical features are separated into two groups, nominal and ordinal. Nominal categorical features are encoded with OneHotEncoder, which converts each categorical value into a new column and assigns a binary value of 1 or 0: all values are zero except the index corresponding to the category, which is set to 1.
def OHE_cat(data, encoder_col = None, encoder = None) -> pd.DataFrame:
    data_ohe = data[nominal]

    if encoder == None:
        # Create object
        encoder = OneHotEncoder(handle_unknown = 'ignore',
                                drop = 'if_binary')
        encoder.fit(data_ohe)
        encoder_col = encoder.get_feature_names_out(data_ohe.columns)

    # Transform the data
    data_encoded = encoder.transform(data_ohe).toarray()
    data_encoded = pd.DataFrame(data_encoded,
                                index = data_ohe.index,
                                columns = encoder_col)

    # Save the object
    util.pickle_dump(encoder, config_data["ohe_path"])

    return data_encoded, encoder_col, encoder
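A short usage sketch; valid_set here is an assumed name for the validation split:

# Fit the encoder on the training data only and save it...
train_encoded, ohe_columns, ohe_encoder = OHE_cat(train_set)

# ...then reuse the fitted encoder and its column names for other splits.
valid_encoded, _, _ = OHE_cat(valid_set, encoder_col=ohe_columns, encoder=ohe_encoder)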
OrdinalEncoder is applied to ordinal values, since some features have a natural ordered relationship that ML algorithms may be able to understand and harness.
def OE_cat(data, encoder = None) -> pd.DataFrame:
    data_le = data[ordinal]

    # Natural ordering of each ordinal feature
    collision_type = ['UNKNOWN', 'Side Collision', 'Rear Collision', 'Front Collision']
    incident_severity = ['Trivial Damage', 'Minor Damage', 'Major Damage', 'Total Loss']
    incident_type = ['Parked Car', 'Single Vehicle Collision', 'Multi-vehicle Collision', 'Vehicle Theft']

    if encoder == None:
        # Create object
        encoder = OrdinalEncoder(categories=[collision_type, incident_type, incident_severity])
        encoder.fit(data_le)

    # Transform the data
    data_encoded = encoder.transform(data_le)
    data_encoded = pd.DataFrame(data_encoded,
                                index = data_le.index,
                                columns = data_le.columns)

    # Save the object
    util.pickle_dump(encoder, config_data["le_path"])

    return data_encoded, encoder
A dataset that is neither normally distributed nor on a common scale can hurt ML performance and lead to unexpected results. Variables measured at different scales do not contribute equally to the analysis and can introduce bias, especially for distance-based machine learning algorithms. Standardization with StandardScaler from scikit-learn is used in this project, since most feature distributions are not Gaussian (bell-shaped) and have varying scales. Most importantly, we simply want to keep some outliers in the dataset where the domain knowledge behind them is not well understood.
def scaling_Data(data, scaler=None):
    if scaler == None:
        # Create and fit the scaler
        scaler = StandardScaler()
        scaler.fit(data)

    # Transform data
    data_scaled = scaler.transform(data)
    data_scaled = pd.DataFrame(data_scaled,
                               index = data.index,
                               columns = data.columns)

    return data_scaled, scaler
These preprocessing steps are applied separately to the train, validation, and test data, while the imputer, encoder, and scaler are fitted only on the train dataset to prevent data leakage into the validation and test datasets.
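A condensed sketch of that fit-on-train-only pattern using the scaling helper above; the *_clean names are placeholders for the imputed and encoded predictor frames:

# Fit the scaler on the training predictors only...
x_train_scaled, scaler = scaling_Data(x_train_clean)

# ...and reuse the already-fitted scaler on validation and test data,
# so no information from those splits leaks into preprocessing.
x_valid_scaled, _ = scaling_Data(x_valid_clean, scaler=scaler)
x_test_scaled, _ = scaling_Data(x_test_clean, scaler=scaler)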
A baseline (dummy) model is created to understand whether the class distribution alone is capable of predicting the likelihood that a transaction is fraudulent. The baseline model yields an F1-score of 0.5168 on the macro average, since we do not want to give extra weight to the minority class.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

def Baseline_model():
    # Stratified dummy baseline: predicts labels according to the training class distribution
    dummy_clf = DummyClassifier(strategy='stratified')
    dummy_clf.fit(x_train["nonbalance"], y_train["nonbalance"])
    y_pred = dummy_clf.predict(x_train["nonbalance"])

    report = classification_report(y_true = y_train["nonbalance"],
                                   y_pred = y_pred,
                                   output_dict = True)
    report = pd.DataFrame(report)
    return report

report = Baseline_model()
report["macro avg"]["f1-score"]
Since our objective is to compare the performance of multiple machine learning models at predicting the likelihood of fraud in each transaction, we create a training log to record recall, F1-score, and training time for each model. The machine learning models we use are listed below:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

baseline_knn = KNeighborsClassifier()
baseline_dt = DecisionTreeClassifier()
baseline_logreg = LogisticRegression()
baseline_svm = SVC()
baseline_rforest = RandomForestClassifier()
baseline_ada = AdaBoostClassifier()
baseline_grad = GradientBoostingClassifier()
baseline_xgb = XGBClassifier()
baseline_gnb = GaussianNB()
Stratified k-fold cross-validation is the same as plain k-fold cross-validation, except that it uses stratified sampling instead of random sampling. Stratified k-fold is recommended for imbalanced classes in classification tasks, since each fold then represents the class proportions of the entire dataset.
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5)

# model_object / model_name hold the estimators defined above and their display names
for index, model in enumerate(model_object):
    cvs = cross_val_score(estimator=model,
                          X=x_train['nonbalance'],
                          y=y_train['nonbalance'],
                          cv=cv,
                          scoring='f1_weighted')
    mean = np.round(cvs.mean(), 3)
    std = np.round(cvs.std(), 3)
    print(f"cross validation score for the model {model_name[index]} is {np.abs(mean)} +/- {std}.")
The top baseline models by F1-score, recall, and training time are shown below. The highest recall is achieved by the Gradient Boosting model with 0.88 and a fitting time of 0.36 s, followed by AdaBoost and Random Forest with 0.853. The highest F1-score also belongs to Gradient Boosting with 0.879, followed by AdaBoost and Random Forest. Ensemble models such as Random Forest and boosting, which build on tree-based learners, show strong baseline performance on this dataset, and the SMOTE technique for handling the imbalance further improves model performance on the target class. Let's boost the top 10 baseline models with hyperparameter tuning.
We utilize BayesSearchCV to find the best learning rate, number of estimators, and maximum tree depth for Gradient Boosting. Unlike GridSearchCV, which explores all combinations and is time-consuming, BayesSearchCV uses stepwise (Bayesian) optimization to efficiently navigate large hyperparameter spaces, aiming to discover the most promising hyperparameter values by minimizing the objective function.
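A minimal sketch of such a search with scikit-optimize's BayesSearchCV; the ranges, the scoring choice, and the 'smote' split key are illustrative assumptions, not the project's final values, and y is assumed to be binary-encoded (0 = non-fraud, 1 = fraud):

from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import GradientBoostingClassifier

# Illustrative search space over learning rate, number of trees, and tree depth
search_spaces = {
    'learning_rate': Real(0.01, 0.5, prior='log-uniform'),
    'n_estimators': Integer(10, 300),
    'max_depth': Integer(2, 6),
}

opt = BayesSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    search_spaces=search_spaces,
    n_iter=30,          # number of sampled parameter settings
    cv=5,               # stratified 5-fold CV for a classifier
    scoring='recall',
    random_state=42,
)

opt.fit(x_train['smote'], y_train['smote'])   # hypothetical key for the SMOTE configuration
print(opt.best_params_, opt.best_score_)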
Our initial effort to enhance model performance used a Decision Tree with the Gini index as the criterion and a maximum tree depth of 4. This configuration yields a commendable overall performance, with particular emphasis on recall to minimize false negatives. The achieved Area Under the Curve (AUC) score stands at an impressive 0.88; however, the confusion matrix shows a high number of false positives as the price of emphasizing recall.
Random Forest on the SMOTE configuration is the second tuned model; it uses the entropy criterion instead of the Gini index and 300 tree estimators, raising the AUC score to 0.90. However, its confusion matrix shows larger numbers of both false negatives and false positives than the previous decision tree model. This model combines many randomized weak learners (decision trees) to make performance more robust and minimize the objective function.
The previous tree-based models already perform well. With Random Forest we obtain a higher AUC score thanks to the number of tree estimators. Unlike Random Forest, which aggregates randomly built weak learners, boosting builds weak tree learners stepwise to minimize the objective function, with each tree attempting to correct the errors made by the previous one. A boosting model is built from a sequence of weak tree learners, where each tree adjusts its weights based on the accuracy of its predecessors. AdaBoost updates the weights based on the hardest cases, increasing the weight of misclassified samples.
The SMOTE configuration of the AdaBoost model employs a Decision Tree with a maximum depth of 2 as its base estimator, a learning rate of 0.3, and 30 tree estimators. The outcome of this setup is noteworthy: the AUC score reaches 0.89, and the confusion matrix reveals a more balanced distribution between false negatives and false positives than the previous models.
Unlike AdaBoost, which updates weights based on misclassified samples, Gradient Boosting updates the model based on the gradient of the loss function with respect to the predicted output. This is done in the direction that reduces the loss, essentially moving towards the minimum of the loss function. Hyperparameter tuning covers both the oversampling and SMOTE configurations for the gradient boosting model. In the SMOTE configuration, with 30 tree estimators, the model improves at reducing false negatives but still yields a high number of false positives. The oversampling configuration, with a learning rate of 0.3, a maximum depth of 4, and 70 tree estimators, achieves a better balance between false negatives and false positives but falls short of an optimal solution.
Looking at the Area Under the Curve (AUC) score, which represents the area under the curve of True Positive Rate (TPR) against False Positive Rate (FPR), it becomes evident that the SMOTE configuration outperforms the oversampling configuration. The AUC score serves as a comprehensive measure of model performance, capturing the trade-off between true positives and false positives. The graph below clearly illustrates the distinction: the SMOTE configuration achieves an impressive AUC score of 0.90, while the oversampling configuration lags behind at 0.82.
In contrast to the previous boosting models, XGBoost uses a more sophisticated approach: it adds a regularization term to the loss function, and the weight updates are based on the second-order derivative (Hessian) of the loss. This allows XGBoost to account not only for the gradient but also for the curvature of the loss function, potentially improving convergence. Furthermore, XGBoost applies tree pruning during the tree-building process to prevent overfitting, removing branches that do not contribute significantly to the improvement.
The tuned XGBoost models show similar performance in the confusion matrix. The oversampling configuration uses a gamma regularization parameter of 5, a learning rate of 0.2, a maximum tree depth of 5, and 7 tree estimators, while the SMOTE configuration uses the same gamma of 5, a learning rate of 0.1, a maximum depth of 4, and 10 tree estimators. The XGBoost model on the SMOTE configuration gives the better result in reducing false negatives.
The AUC score graph below summarizes all of the previous models, with the highest score of 0.89 obtained by XGBoost on the SMOTE configuration and the lowest of 0.83 obtained by Gradient Boosting with the oversampling method.
Based on the classification report of each model, the untuned Gradient Boosting model on the SMOTE configuration gives the best recall score of 0.88 and an F1-score of 0.879 with the lowest training time of about 0.383 s, followed by the tuned Gradient Boosting on the SMOTE configuration with a similar outcome but a training time of up to 43 s.
- Make sure you have Git installed as well. If not, search for "How to install Git on Windows/Ubuntu/the OS that you use."
- Make a clone of this repository or download the files from it:
git clone https://github.com/DandiMahendris/Auto-Insurance-Fraud-Detection.git
- Open a CMD terminal and navigate to the cloned folder's directory. To test the API, run:
python .\src\api.py
- To try Streamlit, open a CMD terminal and run:
streamlit run .\src\streamlit.py
The API input uses a JSON format like this:
{
"injury_claim": int,
"property_claim": int,
"vehicle_claim": int,
"insured_hobbies": str,
"incident_type": str,
"collision_type": str,
"incident_severity": str,
"authorities_contacted": str,
"incident_state": str,
"property_damage": str
}
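For illustration, a request could be sent as sketched below; the endpoint path, port, and the example field values are assumptions, so check src/api.py and the dataset's category values for the actual ones:

import requests

# Hypothetical endpoint; adjust host/port/path to match src/api.py
url = "http://localhost:8080/predict"

payload = {
    "injury_claim": 6500,
    "property_claim": 6500,
    "vehicle_claim": 46500,
    "insured_hobbies": "chess",
    "incident_type": "Single Vehicle Collision",
    "collision_type": "Side Collision",
    "incident_severity": "Major Damage",
    "authorities_contacted": "Police",
    "incident_state": "NY",
    "property_damage": "YES",
}

response = requests.post(url, json=payload)
print(response.json())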