Credit Scoring Probability Prediction 💳

Author: Marco Pellegrino
Year: 2024

This project aims to build a simple model to predict the probability of loan default based on loan application data. This information helps assess business risk and improve loan approval decisions.

Table of Contents

  1. Description
  2. Input Data
  3. Install Required Libraries
  4. Project Structure
  5. License

Description

The main steps consist of processing the input data, training models, and evaluating their performance.

1. Data Processing

Raw input data is processed in the following steps:

  • Data Inspection
  • Domain-based Feature Selection
  • Values Formatting
  • Correlation Analysis
  • Final Distribution Inspection
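The correlation analysis step can be illustrated with a minimal sketch. It assumes Pearson correlation and a cutoff of 0.9 for flagging redundant feature pairs; both are illustrative choices, not something this README specifies, and the toy values below are invented rather than taken from the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long numeric sequences.

    Assumes both sequences have non-zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (feature_a, feature_b, r) for every pair with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Toy values for illustration only (not from the project's data).
toy = {
    "company_rating": [10.0, 40.0, 55.0, 80.0, 95.0],
    "person_scoring": [12.0, 38.0, 60.0, 78.0, 97.0],
    "prev_contr_count": [3, 0, 5, 1, 2],
}
flagged = correlated_pairs(toy)
```

Pairs that end up in `flagged` are candidates for dropping one of the two features; which one to keep is a domain decision.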

Note: raw data can be read from a local CSV file or from an AWS S3 bucket.
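As a rough sketch of how one loader could serve both sources: `pandas.read_csv` accepts local paths and, when the optional `s3fs` package is installed and AWS credentials are configured, `s3://` URIs. The helper names and the bucket name below are hypothetical and are not taken from this repository.

```python
from pathlib import Path
from urllib.parse import urlparse


def is_s3_uri(source: str) -> bool:
    """Return True when the source looks like s3://bucket/key."""
    return urlparse(source).scheme == "s3"


def load_raw(source: str):
    """Load the raw CSV from a local path or an S3 URI into a DataFrame."""
    # Fail early with a clear message for a missing local file.
    if not is_s3_uri(source) and not Path(source).is_file():
        raise FileNotFoundError(f"Raw data file not found: {source}")
    # Imported here so the helper above works even without pandas installed.
    import pandas as pd
    # For s3:// URIs pandas needs the optional s3fs package and AWS credentials.
    return pd.read_csv(source)


LOCAL_SOURCE = "data/raw/loan_application_data-raw.csv"
# Hypothetical bucket name, for illustration only.
S3_SOURCE = "s3://example-bucket/raw/loan_application_data-raw.csv"
```

Calling `load_raw(LOCAL_SOURCE)` or `load_raw(S3_SOURCE)` would then return the same kind of object, so the rest of the preprocessing does not need to know where the data came from.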

2. Data Modeling

A machine learning model is trained on the provided data to predict the probability of loan default. The goal is a model that is both accurate and easy for business stakeholders to interpret.

Some missing values are removed up front. The rest are imputed during the training phase, so that the imputation statistics are computed from training data only and no information leaks from held-out data.
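To illustrate why imputing during training avoids leakage, here is a minimal sketch that learns a fill value from the training split only and reuses it for held-out data. Median imputation and the toy numbers are assumptions made for the example; the strategy used in the notebooks may differ.

```python
from statistics import median


def fit_median(values):
    """Learn the fill value from observed (non-missing) training values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill if v is None else v for v in values]


# Toy feature values for illustration only; None marks a missing value.
train = [120.0, None, 80.0, 100.0]
held_out = [None, 5000.0]

fill = fit_median(train)                  # statistic comes from train only
train_filled = impute(train, fill)
held_out_filled = impute(held_out, fill)  # held-out data never influences fill
```

Had the median been computed over `train + held_out`, the large held-out value would have shifted the fill value, which is the kind of leakage this ordering prevents.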

Different models are implemented:

  • Decision Tree
  • Random Forest
  • XGBoost

3. Data Evaluation

Model performance is evaluated using the following metrics:

  • Log Loss
  • AUC Score
  • F1 score (for class prediction)

Data visualizations are provided to compare the models on the above metrics.
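For reference, the three metrics listed above can be written out in a few lines of plain Python. This is a sketch of the standard definitions, not the project's evaluation code, and the 0.5 threshold used to turn probabilities into classes for the F1 score is an assumption.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


def auc_score(y_true, y_prob):
    """Probability that a random default is scored above a random non-default."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_prob, threshold=0.5):
    """F1 of the positive class after thresholding the probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 1)
    fp = sum(1 for y, yh in zip(y_true, y_pred) if y == 0 and yh == 1)
    fn = sum(1 for y, yh in zip(y_true, y_pred) if y == 1 and yh == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Toy labels and predicted default probabilities, for illustration only.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
ll = log_loss(y_true, y_prob)
auc = auc_score(y_true, y_prob)
f1 = f1_score(y_true, y_prob)
```

Log Loss and AUC are computed on the predicted probabilities directly, while F1 depends on the chosen threshold, which is why it is listed as a class-prediction metric.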

Input Data

The dataset contains loan application data. Data cannot be attached to this repository. Each data point includes information relevant to assessing loan risk, such as financial ratios, company demographics, and loan terms. The following features are provided:

| Feature | Description | Type | Values |
| --- | --- | --- | --- |
| r_application_id | Application ID | Integer | |
| applic_date | Application Date | Date | |
| company_type | Company Type | String | Fixed "AB" |
| company_rating | Company Rating | Float | Scale 0-100: 0=worst, 100=best |
| incorporation_date | Incorporation Date | Date | |
| net_turnover | Net Turnover | Float | |
| person_scoring | Person's Scoring | Float | Scale 0-100: 0=worst, 100=best |
| prev_contr_count | Number of Previous Loan Contracts | Integer | |
| max_late_1yr | Longest Payment Delay in Previous 12 Months | Float | |
| uc_risk_class | UC Risk Class | Integer | Scale 1-5: 1=worst, 5=best |
| default | Loan Default | Integer | Binary: 1 if loan defaulted (was sent to collection), 0 if no default |
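The value ranges in the feature table lend themselves to a simple sanity check during data inspection. The sketch below is an assumption about how such a check might look; `check_row` is a hypothetical helper, missing feature values are represented as `None` and tolerated, and the label is assumed to be always present.

```python
def check_row(row):
    """Return a list of range violations for one loan application record."""
    problems = []
    # Both scores are documented as 0-100 scales.
    for name in ("company_rating", "person_scoring"):
        value = row.get(name)
        if value is not None and not 0 <= value <= 100:
            problems.append(f"{name} out of range 0-100: {value}")
    # UC risk class is documented as an integer scale from 1 to 5.
    risk = row.get("uc_risk_class")
    if risk is not None and risk not in {1, 2, 3, 4, 5}:
        problems.append(f"uc_risk_class not in 1-5: {risk}")
    # Company type is documented as the fixed value "AB".
    if row.get("company_type") != "AB":
        problems.append(f"unexpected company_type: {row.get('company_type')}")
    # The target must be binary.
    if row.get("default") not in {0, 1}:
        problems.append(f"default is not 0 or 1: {row.get('default')}")
    return problems


# Toy records for illustration only.
valid_row = {"company_type": "AB", "company_rating": 72.5,
             "person_scoring": None, "uc_risk_class": 4, "default": 0}
invalid_row = {"company_type": "AB", "company_rating": 130.0,
               "person_scoring": 55.0, "uc_risk_class": 7, "default": 1}
```

An empty list from `check_row` means the record respects the documented ranges; anything else points at rows worth inspecting before training.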

Install Required Libraries

To install the required Python libraries:

pip install -r requirements.txt

Note: In some environments, use pip3 instead of pip.

The code has been tested with Python 3.11.

Project Structure

Note: paths of resources (data frames, plots, ...) are defined in config.py.

.
├── data/
    ├── raw/
        └── loan_application_data-raw.csv   # raw input dataset
    ├── preprocessed/
        └── loan_application_data-preprocessed.csv   # preprocessed data for training
    └── evaluation/
        ├── all/
            └── evaluation-*.csv   # AUC, Log Loss, F1 scores
        ├── tpf/
            └── evaluation_tpr-*.csv   # True Positive Rates for the ROC curve, per model
        └── fpr/
            └── evaluation_fpr-*.csv   # False Positive Rates for the ROC curve, per model
├── plots/
    ├── raw_data/   # Plots of raw data
    └── models/   # Plots of models
        ├── models_feature_importance/    # Plots of model feature importance
        ├── model_comparison/    # Plots to compare models
        ├── models_auc_roc_curve/    # Plots of ROC curves during cross-validation
        └── rules_decision_tree.png  # Decision rules of the Decision Tree model
├── 1_EDA-preprocessing.ipynb    # Process raw data for modeling
├── 2-training-*.ipynb    # Training and evaluation for different models
├── 3_compare_models.ipynb  # Compare model performance with visualizations
├── requirements.txt    # Required Python libraries
├── config.py    # Paths definition
└── README.md    # This description
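A central paths module such as config.py could be sketched as follows. The constant names are hypothetical and the actual config.py may be organized differently; only the directory and file names are taken from the tree above, and the evaluation folder is assumed to sit under data/ as the indentation suggests.

```python
from pathlib import Path

# Top-level folders from the project tree.
DATA_DIR = Path("data")
PLOTS_DIR = Path("plots")

# Input and intermediate datasets.
RAW_DATA_PATH = DATA_DIR / "raw" / "loan_application_data-raw.csv"
PREPROCESSED_DATA_PATH = (
    DATA_DIR / "preprocessed" / "loan_application_data-preprocessed.csv"
)

# Evaluation outputs (assumed to live under data/).
EVALUATION_DIR = DATA_DIR / "evaluation"

# Plot outputs.
RAW_DATA_PLOTS_DIR = PLOTS_DIR / "raw_data"
MODEL_PLOTS_DIR = PLOTS_DIR / "models"
```

Notebooks would then import these constants instead of hard-coding strings, so moving a folder only requires a change in one place.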

License

This repository is licensed under the GNU General Public License v3.0 (GPL-3.0). For more details, see the LICENSE file.