This project predicts whether a bank client will subscribe to a term deposit, using several machine learning algorithms.
The following R libraries are required:
- class
- e1071
- caret
- rpart.plot
- ggplot2
- ranger
- dplyr
- corrplot
- pROC
- reshape2
- shiny
- xgboost
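
A minimal sketch for checking that the libraries above are available before running the project. The variable names `required` and `missing_pkgs` are illustrative, and the install step is left commented out on the assumption that all packages come from CRAN.

```r
# Packages listed in this README
required <- c("class", "e1071", "caret", "rpart.plot", "ggplot2", "ranger",
              "dplyr", "corrplot", "pROC", "reshape2", "shiny", "xgboost")

# Which of them are not installed yet?
missing_pkgs <- setdiff(required, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # Uncomment to install them (assumes a CRAN mirror is configured):
  # install.packages(missing_pkgs)
} else {
  # Attach everything only when all packages are available
  invisible(lapply(required, library, character.only = TRUE))
}
```
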
The project uses two data files:
- bank-full.csv: Main dataset used for training and evaluation.
- DatasetTable.csv: Detailed description of the attributes in the main dataset.
The data is preprocessed as follows:
- Check for missing values in the dataset.
- Identify and handle duplicated rows.
- Convert the target variable 'y' to a binary format (0 for "no" and 1 for "yes").
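
The preprocessing steps can be sketched in base R roughly as follows. The small `bank` data frame is a made-up stand-in for bank-full.csv so that the snippet runs on its own; its column names and values are assumptions, and the project's actual handling of duplicates may differ.

```r
# Toy stand-in for bank-full.csv (hypothetical values).
# With the real file, something like the line below would be used instead;
# the ";" separator is an assumption about how the file is delimited.
# bank <- read.csv("bank-full.csv", sep = ";")
bank <- data.frame(
  age     = c(30, 45, 30, 52, 38),
  balance = c(1200, -50, 1200, 300, NA),
  job     = c("admin.", "technician", "admin.", "retired", "services"),
  y       = c("no", "yes", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# 1. Count missing values per column
na_counts <- colSums(is.na(bank))

# 2. Count duplicated rows, then keep only the first occurrence of each row
n_dup <- sum(duplicated(bank))
bank_clean <- bank[!duplicated(bank), ]

# 3. Recode the target: 0 for "no", 1 for "yes"
bank_clean$y <- ifelse(bank_clean$y == "yes", 1L, 0L)

print(na_counts)
print(n_dup)
print(table(bank_clean$y))
```
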
Exploratory data analysis covers the following:
- Plot histograms for numeric attributes.
- Visualize the distribution of the target variable 'y'.
- Generate a correlation matrix for numeric variables.
- Display bar plots for categorical variables.
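
A rough sketch of these exploratory steps using only base R graphics. The simulated `bank` data frame and the helper `plot_eda()` are hypothetical and stand in for the real dataset and the project's own plotting code, which may rely on ggplot2 and corrplot instead.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned dataset (all values are made up)
bank <- data.frame(
  age      = round(runif(n, 18, 90)),
  balance  = round(rnorm(n, 1300, 3000)),
  duration = round(rexp(n, 1 / 250)),
  marital  = sample(c("married", "single", "divorced"), n, replace = TRUE),
  y        = rbinom(n, 1, 0.12)
)

# Split columns by type, keeping the target out of the numeric set
is_num <- sapply(bank, is.numeric)
numeric_cols     <- names(bank)[is_num & names(bank) != "y"]
categorical_cols <- names(bank)[!is_num]

# Correlation matrix for the numeric variables
cor_matrix <- cor(bank[, numeric_cols])
# It could be drawn with corrplot::corrplot(cor_matrix) if corrplot is installed.

# Share of each class in the target variable
target_share <- prop.table(table(bank$y))

# Hypothetical helper that draws histograms and bar plots
plot_eda <- function(df, numeric_cols, categorical_cols) {
  for (col in numeric_cols) {
    hist(df[[col]], main = paste("Histogram of", col), xlab = col)
  }
  barplot(table(df$y), main = "Distribution of y", xlab = "y", ylab = "Count")
  for (col in categorical_cols) {
    barplot(table(df[[col]]), main = paste("Counts of", col), las = 2)
  }
}

print(round(cor_matrix, 2))
print(target_share)
```

Calling `plot_eda(bank, numeric_cols, categorical_cols)` in an interactive session draws one plot per variable.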
The following models are trained and evaluated:
- Logistic regression: a generalized linear model (GLM) with a binomial family.
- Decision tree: a recursive partitioning method using the rpart library.
- Random forest: an ensemble learning method that constructs a multitude of decision trees at training time.
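
The sketch below fits the three kinds of model on simulated data so that it runs without bank-full.csv. The 70/30 split, the 0.5 probability cutoff, the `num.trees` value, and the use of `ranger` for the ensemble are assumptions for illustration, not necessarily the project's settings. The ensemble step is skipped if `ranger` is not installed.

```r
library(rpart)

set.seed(42)
n <- 500

# Simulated predictors and a binary target (all values are made up)
bank <- data.frame(
  age      = round(runif(n, 18, 80)),
  duration = round(rexp(n, 1 / 250)),
  balance  = round(rnorm(n, 1300, 2000))
)
p <- plogis(-2 + 0.004 * bank$duration)
bank$y <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Hold out 30% of the rows for evaluation
train_idx <- sample(n, size = round(0.7 * n))
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]

# GLM with a binomial family
glm_fit  <- glm(y ~ ., data = train, family = binomial)
glm_prob <- predict(glm_fit, newdata = test, type = "response")
glm_pred <- ifelse(glm_prob > 0.5, 1, 0)

# Recursive partitioning with rpart
tree_fit  <- rpart(y ~ ., data = train, method = "class")
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# Ensemble of decision trees, here via ranger (assumption)
if (requireNamespace("ranger", quietly = TRUE)) {
  rf_fit  <- ranger::ranger(y ~ ., data = train, num.trees = 200,
                            importance = "impurity")
  rf_pred <- predict(rf_fit, data = test)$predictions
}

print(table(actual = test$y, glm = glm_pred))
print(table(actual = test$y, tree = tree_pred))
```
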
The results of the models are then compared in terms of accuracy, sensitivity, specificity, and balanced accuracy.
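
To make the four metrics concrete, here is a small base R sketch that computes them from actual and predicted labels. The helper `classification_metrics()` and the example vectors are hypothetical. The project may instead take these values from `caret::confusionMatrix()`, which reports the same quantities.

```r
# Hypothetical helper: metrics for a binary classifier, with 1 as the positive class
classification_metrics <- function(actual, predicted, positive = 1) {
  tp <- sum(predicted == positive & actual == positive)  # true positives
  tn <- sum(predicted != positive & actual != positive)  # true negatives
  fp <- sum(predicted == positive & actual != positive)  # false positives
  fn <- sum(predicted != positive & actual == positive)  # false negatives

  sensitivity <- tp / (tp + fn)  # share of actual positives that were found
  specificity <- tn / (tn + fp)  # share of actual negatives that were found

  c(
    accuracy          = (tp + tn) / (tp + tn + fp + fn),
    sensitivity       = sensitivity,
    specificity       = specificity,
    balanced_accuracy = (sensitivity + specificity) / 2
  )
}

# Made-up labels: 3 positives and 7 negatives
actual    <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 0, 0, 0, 0, 0, 0, 1, 0)

metrics <- classification_metrics(actual, predicted)
print(round(metrics, 3))
```

Balanced accuracy is the mean of sensitivity and specificity, which makes it more informative than plain accuracy when one class is much rarer than the other.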
The results are visualized with:
- Feature importance from the Random Forest model.
- Comparison of model performance metrics using bar plots.
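
A possible shape for the metric comparison plot, using base R. The numbers in `metrics` are placeholders chosen only to make the plot render; they are not results from this project. The helper `plot_metric_comparison()` is hypothetical.

```r
# Placeholder values, NOT project results: rows are models, columns are metrics
metrics <- rbind(
  "Logistic regression" = c(accuracy = 0.80, sensitivity = 0.60,
                            specificity = 0.85, balanced_accuracy = 0.725),
  "Decision tree"       = c(accuracy = 0.78, sensitivity = 0.55,
                            specificity = 0.84, balanced_accuracy = 0.695),
  "Random forest"       = c(accuracy = 0.82, sensitivity = 0.62,
                            specificity = 0.87, balanced_accuracy = 0.745)
)

# Hypothetical helper: one group of bars per metric, one bar per model
plot_metric_comparison <- function(metrics) {
  mids <- barplot(
    metrics,
    beside      = TRUE,
    ylim        = c(0, 1),
    ylab        = "Score",
    main        = "Model comparison",
    legend.text = rownames(metrics),
    args.legend = list(x = "topright", bty = "n")
  )
  invisible(mids)  # bar midpoints, useful for adding labels later
}

# For the feature importance plot, a ranger model fitted with an importance
# mode set exposes the scores as rf_fit$variable.importance (rf_fit is hypothetical).
```

Calling `plot_metric_comparison(metrics)` draws the grouped bar chart.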
The Portuguese bank dataset used in this project comes from the following paper:
- Moro, S., Laureano, R. and Cortez, P. (2011). Using data mining for bank direct marketing: An application of the CRISP-DM methodology.