Fraud is a pervasive issue in the world of finance. Common types of fraud involving bank accounts include account opening fraud, money laundering, and fraudulent transactions. In this report, we present an innovative approach to enhancing fraud detection for bank accounts using machine learning techniques.
Prior to installation, clone the repository. We use Python 3.12.2, and the requirements file pins the versions of all other packages. A virtual environment ul-env (replace ul-env with any other name) can be created from the command line:
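For example, on Linux/macOS (a sketch; on Windows the activation path differs, as noted in the comments):

```shell
# Create a virtual environment named ul-env (any name works)
python -m venv ul-env

# Activate it (Linux/macOS); on Windows use ul-env\Scripts\activate
source ul-env/bin/activate

# Install the pinned package versions
pip install -r requirements.txt
```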
Fraud-Hackathon
├── LICENSE
├── Makefile <- Makefile with commands like `make dataset` or `make train`
├── README.md <- The top-level README for developers using this project.
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ ├── processed <- The final, canonical data sets for modeling.
│ └── raw <- The original, immutable data dump.
│ └── Base.zip <- Compressed version of the dataset used
│
├── docs <- A default Sphinx project; see sphinx-doc.org for details
│
├── models <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks <- Jupyter notebooks. Naming convention is a number (for ordering),
│ │ the creator's initials, and a short `-` delimited description, e.g.
│ │ `1.0-jqp-initial-data-exploration`. For more details on what each notebook
│ │ is about, refer to `notebooks/README.md`
│ │
│ ├── AdaBoost-Model.ipynb
│ ├── DecisionTree-Model.ipynb
│ ├── eda.ipynb
│ ├── GaussianNB-Model.ipynb
│ ├── Hypothesis-EDA.ipynb
│ ├── LogisticRegression-Model.ipynb
│ ├── pipeline.ipynb
│ ├── README.md
│ ├── resampling.ipynb
│ └── XGBoost-Model.ipynb
│
├── references <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
│ └── figures <- Generated graphics and figures to be used in reporting
│
├── requirements.txt <- The requirements file for reproducing the analysis environment, e.g.
│ generated with `pip freeze > requirements.txt`
│
└── src <- Source code for use in this project.
├── __init__.py <- Makes src a Python module
│
├── data <- Scripts to download or generate data
│ └── make_dataset.py <- unzips Base.zip to Base.csv
│
├── features <- Scripts to turn raw data into features for modeling
│ └── build_features.py <- Runs the full data processing pipeline and generates train and test datasets.
│
├── models <- Scripts to train models and then use trained models to make
│ │ predictions
│ ├── train_model.py <- Script to train a sklearn model or xgboost model
│ ├── predict_model.py <- Script to make a prediction using a saved model
│ └── evaluate_model.py <- Script to evaluate model performance using predictions
│
└── visualization <- Scripts to create exploratory and results oriented visualizations
├── create_xml.py <- Script to generate XML-related plots.
└── create_visualizations.py <- Script to generate visualizations used in the report
There are two methods to run the code: 1) using make, and 2) running the Python scripts directly. For the first method, please ensure that the make command is installed.
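You can verify that make is available before proceeding:

```shell
# Prints the version if make is installed; errors otherwise
make --version
```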
# To run the full pipeline (excluding visualizations and xml visualizations)
1. cd into the Fraud-Hackathon/ directory
2. make
# To run specific parts of the pipeline
1. cd into the Fraud-Hackathon/ directory
2. make <part>
Available options:
venv <- Creates the virtual environment and installs packages
dataset <- Unzips the dataset
features <- Preprocesses the data and generates train and test sets
train <- Trains a model and saves it
predict <- Makes predictions with a saved model
evaluate <- Evaluates the performance of the models using their predictions
visualize <- Generates diagrams used in the report
xml <- Generates XML plots
e.g. make train
When using make <part>, the Makefile assumes that the commands are run sequentially: venv -> dataset -> features -> train -> predict -> evaluate.
For make visualize and make xml, it assumes that venv -> dataset -> features -> train has already been run, and that the following files are present: Fraud-Hackathon/models/logisticregression.pkl, Fraud-Hackathon/data/processed/X_train_resampled.csv, and Fraud-Hackathon/data/processed/X_test.csv.
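Putting the ordering assumption together, a full run via individual targets might look like this (a sketch; target names are those listed above):

```shell
# Run each stage in the order the Makefile expects
make venv       # create the environment and install packages
make dataset    # unzip Base.zip
make features   # preprocess data into train and test sets
make train      # train and save a model
make predict    # make predictions with the saved model
make evaluate   # score the predictions
# Only after the stages above, with the required files present:
make visualize
make xml
```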
# Activate environment if necessary
1. Install the required packages
pip install -r requirements.txt
2. Change directory into the folder containing the script you wish to run
# e.g. to run models/predict_model.py,
# assuming you are in the Fraud-Hackathon folder:
cd src/models
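Then invoke the script with Python. This assumes the environment is activated and that any inputs the script needs (such as a saved model and processed data) already exist; see `src/README.md` for any script-specific arguments:

```shell
# Run the prediction script from within src/models
python predict_model.py
```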
Refer to `Fraud-Hackathon/src/README.md` for more details on running each script.