- Team: Marco Bravo, Nasim Ghazanfari Nasrabadi, Shizhe Zhang, Celeste Zhao
In this analysis, we build a predictive model to determine whether a client will subscribe to a term deposit, using data from the direct marketing campaigns (phone calls) of a Portuguese banking institution.
After exploring several models (logistic regression, KNN, decision tree, naive Bayes), we selected logistic regression as our primary predictive tool. The final model performs well when tested on an unseen dataset, with an AUC (Area Under the Curve) of 0.899, the highest we observed. This high AUC indicates that the model separates positive and negative outcomes effectively. Notably, factors such as last contact duration, last contact month of the year, and the client's type of job play a significant role in the classification decision.
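To make the AUC figure concrete: assuming it refers to the area under the ROC curve, it can be read as the probability that a randomly chosen subscriber receives a higher score than a randomly chosen non-subscriber. The following is a minimal pure-Python sketch of that definition on made-up scores. The function name `roc_auc` and the toy data are illustrative assumptions and are not taken from the project's code, which may compute the metric differently.

```python
def roc_auc(labels, scores):
    """Return the ROC AUC as a pairwise ranking probability.

    For every (positive, negative) pair, count 1 if the positive example
    has the higher score, 0.5 for a tie, and 0 otherwise, then average.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = subscribed, 0 = did not subscribe.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so under this reading 0.899 means the model ranks a subscriber above a non-subscriber in roughly nine out of ten such pairs.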
The dataset used in this project is the Bank Marketing dataset created by S. Moro, P. Rita and P. Cortez at Iscte - University Institute of Lisbon. It is available from the UCI Machine Learning Repository here. Among the four available datasets, we used bank-full.csv, which contains all examples and 17 inputs. Each row in the dataset represents an individual client and includes personal details (e.g., age, occupation, loan status), information about the client's response to the marketing campaign (e.g., outcome of the previous marketing campaign, number of contacts made during the current campaign), and the eventual subscription outcome for the term deposit.
The final report can be found here.
- Install and launch Docker on your computer.
- Clone this GitHub repository.
- Navigate to the root of this project on your computer using the command line and enter the following command:
docker compose up jupyter-lab
- In the terminal, look for a URL that starts with http://127.0.0.1:8888/lab?token= (for an example, see the highlighted text in the terminal below). Copy and paste that URL into your browser.
- There are two ways to run the analysis:
- Enter the following commands one by one:
# To open JupyterLab inside the Docker container, run the following command in a terminal at the root folder of the project
docker compose up
# The following commands should be run inside the container.
# First, open a new terminal at the root folder of the project and run the following to start a shell in the container
docker compose exec jupyter-lab /bin/bash
# Download and save data
python scripts/data_download.py \
--url='https://archive.ics.uci.edu/static/public/222/data.csv' \
--save_path='data/raw'
# Split data (train-test), process data and save the preprocessor
python scripts/split_and_process.py \
--raw_data='data/raw/bank-full.csv' \
--save_to='data/processed' \
--preprocessor_to='results/models' \
--seed=522
# eda plots
python scripts/eda.py \
--training_data='data/processed/bank_train.csv' \
--save_plot_to='results/figures'
# model training, tuning, and save plot and model
python scripts/fit_bank_classifier.py \
--resampled_training_data='data/processed/X_train_resmp.csv' \
--resampled_training_response='data/processed/y_train_resmp.csv' \
--test_data='data/processed/X_test.csv' \
--test_response='data/processed/y_test.csv' \
--preprocessor_pipe='results/models/bank_preprocessor.pickle' \
--save_pipelines_to='results/models' \
--save_plot_to='results/figures' \
--seed=522
# evaluate model
python scripts/feat_imp.py \
--transformed_training_data='data/processed/X_train_trans.csv' \
--pipeline_model='results/models/logistic_pipeline.pickle' \
--save_plot_to='results/figures' \
--seed=522
# build HTML report and copy build to docs folder
# in the terminal of the container, run
cd work
jupyter-book build report
cp -r report/_build/html/* docs
# Clean up
# To shut down the container and clean up the resources, press `Ctrl` + `C` in the terminal where you launched the container, and then run `docker compose rm`
- You can also reproduce the whole process above using the Makefile. Open a terminal in the root folder of this project and run the following commands:
# open the container, run all the scripts and build the jupyter book
make all
# remove the container, delete all the files generated by scripts and jupyter-book
make clean
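The manual steps above form a fixed sequence in which each script consumes the files written by the previous one. The sketch below is a hypothetical Python driver that prints each command in order and only executes it when `dry_run=False`. It is not part of this repository, and the Makefile remains the supported route. The script paths and options are copied from the commands listed above; the names `PIPELINE` and `run_pipeline` are assumptions made for illustration, and the commands are expected to be run inside the container.

```python
import shlex
import subprocess

# Each entry mirrors one command from the manual steps, in dependency order.
PIPELINE = [
    ["python", "scripts/data_download.py",
     "--url=https://archive.ics.uci.edu/static/public/222/data.csv",
     "--save_path=data/raw"],
    ["python", "scripts/split_and_process.py",
     "--raw_data=data/raw/bank-full.csv",
     "--save_to=data/processed",
     "--preprocessor_to=results/models",
     "--seed=522"],
    ["python", "scripts/eda.py",
     "--training_data=data/processed/bank_train.csv",
     "--save_plot_to=results/figures"],
    ["python", "scripts/fit_bank_classifier.py",
     "--resampled_training_data=data/processed/X_train_resmp.csv",
     "--resampled_training_response=data/processed/y_train_resmp.csv",
     "--test_data=data/processed/X_test.csv",
     "--test_response=data/processed/y_test.csv",
     "--preprocessor_pipe=results/models/bank_preprocessor.pickle",
     "--save_pipelines_to=results/models",
     "--save_plot_to=results/figures",
     "--seed=522"],
    ["python", "scripts/feat_imp.py",
     "--transformed_training_data=data/processed/X_train_trans.csv",
     "--pipeline_model=results/models/logistic_pipeline.pickle",
     "--save_plot_to=results/figures",
     "--seed=522"],
]


def run_pipeline(steps, dry_run=True):
    """Print every command; execute it only when dry_run is False.

    With check=True, a failing script raises CalledProcessError, so later
    steps never run on missing or partial inputs.
    Returns the rendered command strings.
    """
    rendered = []
    for step in steps:
        command = shlex.join(step)
        print(command)
        rendered.append(command)
        if not dry_run:
            subprocess.run(step, check=True)
    return rendered


if __name__ == "__main__":
    run_pipeline(PIPELINE)  # dry run: prints the five commands
```

Keeping the dry run as the default makes it safe to inspect the sequence before anything is downloaded or overwritten.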
- Add the dependency to the Dockerfile file on a new branch.
- Re-build the Docker image locally to ensure it builds and runs properly.
- Push the changes to GitHub. A new Docker image will be built and pushed to Docker Hub automatically. It will be tagged with the SHA for the commit that changed the file.
- Update the docker-compose.yml file on your branch to use the new container image (make sure to update the tag specifically).
- Send a pull request to merge the changes into the main branch.
Tests are run using the pytest command in the root of the project. More details about the test suite can be found in the tests directory.
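For readers new to pytest, the sketch below shows the general shape of a test module: plain functions whose names start with `test_` and that use bare `assert` statements. The helper `split_rows` is hypothetical and exists only to give the tests something to check; it is not a function from this project, and the real tests in the tests directory may look quite different. The default seed of 522 simply echoes the seed used in the commands in this README.

```python
import random


def split_rows(rows, test_fraction=0.2, seed=522):
    """Hypothetical helper: shuffle rows reproducibly, then split train/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def test_split_sizes():
    # 10 rows with a 20% test fraction give 8 training and 2 test rows.
    train, test = split_rows(range(10), test_fraction=0.2)
    assert len(train) == 8
    assert len(test) == 2


def test_split_is_reproducible():
    # The same seed must always produce the same split.
    assert split_rows(range(10), seed=522) == split_rows(range(10), seed=522)


def test_split_keeps_every_row():
    # No row is lost or duplicated by the split.
    train, test = split_rows(range(10))
    assert sorted(train + test) == list(range(10))
```

Saved as a file whose name starts with `test_`, a module like this would be collected automatically when pytest is run from the project root.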
The Bank Marketing dataset and materials are licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license. If re-using/re-mixing please provide attribution and link to this webpage.
Software licensed under the MIT License. See the license file for more information.
Moro, S., Rita, P., & Cortez, P. (2012). Bank Marketing. UCI Machine Learning Repository. https://doi.org/10.24432/C5K306
Davis, J., & Goadrich, M. (2006). The Relationship Between Precision-Recall and ROC Curves. https://www.biostat.wisc.edu/~page/rocpr.pdf
Saito, T., & Rehmsmeier, M. (2015). The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432
Flach, P. A., & Kull, M. (2015). Precision-Recall-Gain Curves: PR Analysis Done Right. https://papers.nips.cc/paper/2015/file/33e8075e9970de0cfea955afd4644bb2-Paper.pdf
Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. (2015, September 28). Generalization in Adaptive Data Analysis and Holdout Reuse. https://arxiv.org/pdf/1506.02629.pdf
Turkes (Vînt), M. C. (n.d.). Concept and Evolution of Bank Marketing. Transylvania University of Brasov, Faculty of Economic Sciences. https://www.researchgate.net/publication/49615486_CONCEPT_AND_EVOLUTION_OF_BANK_MARKETING/fulltext/0ffc5db50cf255165fc80b80/CONCEPT-AND-EVOLUTION-OF-BANK-MARKETING.pdf
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22-31. https://repositorio.iscte-iul.pt/bitstream/10071/9499/5/dss_v3.pdf