The dataset is a subset of the U.S. SBA loan database. It records whether each loan was paid in full (PIF) or whether the SBA had to charge off any amount (CHGOFF), and how much was charged off. Loan outcome prediction was performed using the scikit-learn package. The following two classification models were trained:
Logistic Regression and Decision Tree Classifier.

I reviewed the dataset, modified and cleaned it, and performed exploratory data analysis. I implemented feature engineering, categorical encoding, and target encoding to add more information to the dataset.
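As an illustration of the target encoding step, here is a minimal sketch using pandas. It is not the project's actual code: the column names `State` and `charged_off`, the helper names, and the `smoothing` value are assumptions made for the example. It replaces each category with a smoothed mean of the target, and falls back to the global mean for categories not seen during fitting.

```python
import pandas as pd


def fit_target_encoding(df, column, target, smoothing=10.0):
    """Learn a smoothed mean-target value per category (hypothetical helper)."""
    global_mean = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    # Shrink each category mean toward the global mean; rare categories shrink more.
    smoothed = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return smoothed.to_dict(), global_mean


def apply_target_encoding(df, column, mapping, global_mean):
    """Map categories to learned values; unseen categories get the global mean."""
    return df[column].map(mapping).fillna(global_mean)


# Toy training data; column names are illustrative, not taken from the project.
train = pd.DataFrame(
    {
        "State": ["CA", "CA", "NY", "NY", "NY", "TX"],
        "charged_off": [1, 0, 0, 0, 1, 1],
    }
)

mapping, global_mean = fit_target_encoding(train, "State", "charged_off", smoothing=2.0)
train["State_te"] = apply_target_encoding(train, "State", mapping, global_mean)

# Applying the saved mapping to new rows, including an unseen category ("FL").
new_rows = pd.DataFrame({"State": ["CA", "FL"]})
new_rows["State_te"] = apply_target_encoding(new_rows, "State", mapping, global_mean)
```

Fitting the mapping on the training split only, and then reusing it at scoring time, avoids leaking test labels into the features.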
I performed hyperparameter tuning on my Decision Tree Classifier and obtained an accuracy of 93%, a weighted F1 score of 93%, and an AUC of 86% on my test dataset when predicting whether a loan was paid in full or charged off. I also saved all my artifacts for later use by the scoring function.
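The tuning, evaluation, and artifact-saving steps could look roughly like the sketch below. It is a sketch under stated assumptions, not the project's implementation: it uses synthetic data from `make_classification` in place of the SBA loans, a hypothetical parameter grid, `GridSearchCV` as the tuning method, and `joblib` with a temporary file path for the artifacts. The scores it produces will not match the figures reported above.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned loan data (1 = charged off, 0 = paid in full).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical search grid; the values actually searched in the project are not known.
param_grid = {
    "max_depth": [3, 5, 8, None],
    "min_samples_leaf": [1, 10, 50],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    scoring="f1_weighted",
    cv=5,
)
search.fit(X_train, y_train)
best_tree = search.best_estimator_

# Evaluate on the held-out test split with the three metrics named in the text.
y_pred = best_tree.predict(X_test)
y_prob = best_tree.predict_proba(X_test)[:, 1]
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "f1_weighted": f1_score(y_test, y_pred, average="weighted"),
    "auc": roc_auc_score(y_test, y_prob),
}

# Persist the fitted model and its parameters so a scoring function can reload them.
artifact_path = Path(tempfile.mkdtemp()) / "model_artifacts.joblib"
joblib.dump({"model": best_tree, "best_params": search.best_params_}, artifact_path)
restored = joblib.load(artifact_path)
```

Saving any fitted encoders or column lists in the same artifact dictionary keeps the scoring function's preprocessing consistent with training.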