Author: Marco Pellegrino
Year: 2024
This project aims to build a simple model to predict the probability of loan default based on loan application data. This information helps assess business risk and improve loan approval decisions.
The main steps consist of processing the input data, training models, and evaluating their performance.
1. Data Processing
Raw input data is processed in the following steps:
- Data Inspection
- Domain-based Feature Selection
- Values Formatting
- Correlation Analysis
- Final Distribution Inspection
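As an illustration of the correlation analysis step, the sketch below computes pairwise Pearson correlations and flags strongly correlated feature pairs. The helper names, the 0.9 threshold, and the toy values are assumptions for illustration only; the notebook may use a different method (for example `DataFrame.corr()` in pandas).

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant column carries no linear relationship
    return cov / (spread_x * spread_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return feature pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]


# Made-up values, not taken from the real dataset.
toy_columns = {
    "company_rating": [1.0, 2.0, 3.0, 4.0],
    "person_scoring": [2.0, 4.0, 6.0, 8.0],
    "net_turnover": [4.0, 1.0, 3.0, 2.0],
}
flagged = highly_correlated_pairs(toy_columns, threshold=0.9)
```

One feature of each flagged pair is a candidate for removal, since it adds little information beyond its partner.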
Note: raw data can be read from a local CSV file or from an AWS S3 bucket.
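A minimal sketch of how the two data sources could be told apart, assuming the S3 location is given as an `s3://bucket/key` URI. The function names are hypothetical and the S3 download itself is not shown; with pandas, `pd.read_csv` accepts `s3://` URLs when the optional `s3fs` package is installed.

```python
import csv
from urllib.parse import urlparse


def split_s3_uri(location):
    """Return (bucket, key) for an s3:// URI, or None for a local path."""
    parsed = urlparse(location)
    if parsed.scheme != "s3":
        return None
    return parsed.netloc, parsed.path.lstrip("/")


def read_local_csv(path):
    """Read a local CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Hypothetical bucket name, for illustration only.
s3_parts = split_s3_uri("s3://example-bucket/raw/loan_application_data-raw.csv")
local_parts = split_s3_uri("data/raw/loan_application_data-raw.csv")
```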
2. Data Modeling
A machine learning model is trained on the provided data to predict the probability of loan default. The model should predict well and remain easy for business stakeholders to interpret.
Some missing values are removed up front; the others are imputed during the training phase, so that imputation statistics are derived from training data only and data leakage is avoided.
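The leakage concern can be sketched as follows: the fill value is computed from the training split only and then reused unchanged on the held-out split. Median imputation, the helper names, and the toy values are assumptions here; the notebooks may use another strategy (a common choice is scikit-learn's `SimpleImputer` inside a `Pipeline`).

```python
from statistics import median


def fit_median(values):
    """Median of the observed (non-None) values in the training split."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill_value):
    """Replace missing entries with a fill value that was fitted elsewhere."""
    return [fill_value if v is None else v for v in values]


# Made-up values, not taken from the real dataset.
train = [10.0, None, 30.0, 50.0]
held_out = [None, 1000.0]

fill = fit_median(train)            # fitted on the training split only
train_imputed = impute(train, fill)
held_out_imputed = impute(held_out, fill)  # held-out data never influences `fill`
```

Had the median been computed on all rows together, the extreme held-out value would have shifted the fill value used during training, which is exactly the leak to avoid.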
Different models are implemented:
- Decision Tree
- Random Forest
- XGBoost
3. Data Evaluation
Model performance is evaluated using the following metrics:
- Log Loss
- AUC Score
- F1 score (for class prediction)
Data visualizations are provided to compare the models on the above metrics.
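For reference, the three metrics can be written out in plain Python as below. This is an illustrative sketch with made-up labels and probabilities, not the project's evaluation code; the function names and the 0.5 threshold are assumptions. In practice `sklearn.metrics` provides `log_loss`, `roc_auc_score`, and `f1_score`.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def pairwise_auc(y_true, y_prob):
    """Share of (default, non-default) pairs ranked correctly; ties count half."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def f1_at_threshold(y_true, y_prob, threshold=0.5):
    """F1 score after turning probabilities into class predictions."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 1 = default, 0 = no default.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

scores = {
    "log_loss": binary_log_loss(y_true, y_prob),
    "auc": pairwise_auc(y_true, y_prob),
    "f1": f1_at_threshold(y_true, y_prob),
}
```

Log Loss and AUC are computed on predicted probabilities, while F1 depends on the chosen classification threshold, which is why it is listed for class prediction.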
The dataset contains loan application data; the data itself cannot be included in this repository. Each data point includes information relevant to assessing loan risk, such as financial ratios, company demographics, and loan terms. The following features are provided:
| Feature | Description | Type | Values |
|---|---|---|---|
| r_application_id | Application ID | Integer | |
| applic_date | Application Date | Date | |
| company_type | Company Type | String | Fixed "AB" |
| company_rating | Company Rating | Float | Scale 0-100: 0=worst, 100=best |
| incorporation_date | Incorporation Date | Date | |
| net_turnover | Net Turnover | Float | |
| person_scoring | Person's Scoring | Float | Scale 0-100: 0=worst, 100=best |
| prev_contr_count | Number of Previous Loan Contracts | Integer | |
| max_late_1yr | Longest Payment Delay in Previous 12 Months | Float | |
| uc_risk_class | UC Risk Class | Integer | Scale 1-5: 1=worst, 5=best |
| default | Loan Default | Integer | Binary: 1 if loan defaulted (was sent to collection), 0 if no default |
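The documented value ranges lend themselves to a simple sanity check during data inspection. The sketch below is one possible way to do it; the `DOCUMENTED_RANGES` mapping and the `out_of_range` helper are hypothetical names, and only the ranges themselves come from the feature table.

```python
# Ranges taken from the feature table; the names below are illustrative.
DOCUMENTED_RANGES = {
    "company_rating": (0, 100),
    "person_scoring": (0, 100),
    "uc_risk_class": (1, 5),
    "default": (0, 1),
}


def out_of_range(record):
    """Return the names of fields whose value lies outside its documented range.

    Missing values (None or absent keys) are skipped, since they are handled
    separately by removal or imputation.
    """
    problems = []
    for name, (low, high) in DOCUMENTED_RANGES.items():
        value = record.get(name)
        if value is not None and not (low <= value <= high):
            problems.append(name)
    return problems


# Made-up record, not taken from the real dataset.
suspicious = out_of_range(
    {"company_rating": 120.0, "person_scoring": None, "uc_risk_class": 3, "default": 0}
)
```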
To install the required Python libraries:
pip install -r requirements.txt
Note: In some environments, use pip3 instead of pip.
The code has been tested with Python 3.11.
Note: paths of resources (data frames, plots, ...) are defined in config.py.
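As a rough idea of what such a path module can look like, here is a sketch built with `pathlib`. The constant names are hypothetical and the actual contents of config.py may differ; only the directory and file names are taken from the project layout described in this README.

```python
from pathlib import Path

# Hypothetical constant names; directory and file names follow the project layout.
DATA_DIR = Path("data")
RAW_DATA_PATH = DATA_DIR / "raw" / "loan_application_data-raw.csv"
PREPROCESSED_DATA_PATH = (
    DATA_DIR / "preprocessed" / "loan_application_data-preprocessed.csv"
)

EVALUATION_DIR = Path("evaluation")
PLOTS_DIR = Path("plots")
DECISION_TREE_RULES_PLOT = PLOTS_DIR / "rules_decision_tree.png"
```

Keeping every path in one module means the notebooks import locations rather than hard-coding them, so moving a directory requires a change in a single place.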
.
├── data/
│   ├── raw/
│   │   └── loan_application_data-raw.csv            # Raw input dataset
│   └── preprocessed/
│       └── loan_application_data-preprocessed.csv   # Preprocessed data for training
├── evaluation/
│   ├── all/
│   │   └── evaluation-*.csv                         # AUC, Log Loss, F1 scores
│   ├── tpf/
│   │   └── evaluation_tpr-*.csv                     # True Positive Rates for ROC-AUC per model
│   └── fpr/
│       └── evaluation_fpr-*.csv                     # False Positive Rates for ROC-AUC per model
├── plots/
│   ├── raw_data/                                    # Plots of raw data
│   ├── models/                                      # Plots of models
│   ├── models_feature_importance/                   # Plots of model feature importance
│   ├── model_comparison/                            # Plots to compare models
│   ├── models_auc_roc_curve/                        # Plots of ROC curves during cross-validation
│   └── rules_decision_tree.png                      # Decision rules of the Decision Tree model
├── 1_EDA-preprocessing.ipynb                        # Process raw data for modeling
├── 2-training-*.ipynb                               # Training and evaluation for different models
├── 3_compare_models.ipynb                           # Compare model performance with visualizations
├── requirements.txt                                 # Required Python libraries
├── config.py                                        # Path definitions
└── README.md                                        # This description
This repository is licensed under the GNU General Public License v3.0 (GPL-3.0). For more details, see the LICENSE file.