Spam Classifier

This project trains a spam classifier to differentiate between spam and ham emails. The dataset used for training is the SpamAssassin public mail corpus, which consists of a selection of mail messages labelled as spam or ham.

Overview

The goal is to train a classification algorithm to differentiate between spam and ham emails.

To reach this goal, several classification models are first trained on the Apache SpamAssassin dataset. After each classifier has been evaluated, the three best-performing ones are fine-tuned and re-evaluated. The 'best' classifier is then saved as a .pkl file.
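The selection and saving steps can be sketched as follows. This is a minimal illustration using only the standard library, not the notebook's actual code: the model names, the scores, and the helper functions `top_k`, `save_model`, and `load_model` are all hypothetical, and a plain dictionary stands in for a trained classifier.

```python
import pickle
import tempfile
from pathlib import Path


def top_k(scores, k=3):
    """Return the names of the k highest-scoring models, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def save_model(model, path):
    """Serialize a model object to a .pkl file."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a model object back from a .pkl file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical evaluation scores, invented for illustration only.
scores = {"model_a": 0.91, "model_b": 0.97, "model_c": 0.88, "model_d": 0.95}
shortlist = top_k(scores)  # candidates to fine-tune and re-evaluate

# A plain dict stands in for the fine-tuned 'best' classifier.
best_model = {"name": shortlist[0], "params": {"example_param": 1.0}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_classifier.pkl"
    save_model(best_model, path)
    restored = load_model(path)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.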

All the steps are contained and documented in the Jupyter notebook Spam classifier. To reduce the amount of code in the notebook, the functions it needs have been collected in separate scripts; the notebook calls these functions as needed.

Classification models

The classifiers used are:

Dataset

The classifiers above are trained on the SpamAssassin public mail corpus, which consists of a selection of mail messages labelled as spam or ham.
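Since the corpus consists of raw mail messages, each one has to be parsed before its text can be used as features. The sketch below shows one possible way to do this with Python's standard `email` package; the helper names `parse_email` and `email_text` and the sample message are assumptions for illustration, not the project's own preprocessing code.

```python
from email import policy
from email.parser import BytesParser


def parse_email(raw_bytes):
    """Parse raw message bytes into an EmailMessage object."""
    return BytesParser(policy=policy.default).parsebytes(raw_bytes)


def email_text(message):
    """Return the plain-text (or, failing that, HTML) body as a string."""
    body = message.get_body(preferencelist=("plain", "html"))
    return body.get_content() if body is not None else ""


# A made-up message, used only to demonstrate the helpers.
raw = (
    b"Subject: Cheap offer\n"
    b"From: someone@example.com\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"\n"
    b"Buy now!\n"
)

msg = parse_email(raw)
subject = msg["Subject"]
text = email_text(msg)
```

Real messages in the corpus may declare unknown or incorrect character sets, so a production version would likely need extra error handling around `get_content()`.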

Evaluation

Each classifier is evaluated using the following performance measures:

confusion matrix
accuracy
precision
recall
f1 score
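To make the measures above concrete, here is a minimal pure-Python sketch that derives them from the four confusion-matrix counts, treating spam as the positive class. The function names and the sample labels are hypothetical; the notebook may well compute these with a library instead.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly passed
    return tp, fp, fn, tn


def metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 3 spam and 5 ham messages.
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "spam", "ham", "ham", "ham", "ham"]

counts = confusion_counts(y_true, y_pred)
result = metrics(y_true, y_pred)
```

For a spam filter, precision reflects how often flagged mail really is spam, while recall reflects how much of the spam is caught; the F1 score is the harmonic mean of the two.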

How to start

Check that you have a working Python (preferably Python 3) and Jupyter Notebook installation
Git clone the repo and cd into the directory
Install requirements:

pip install -r requirements.txt
Check that all the modules have been installed, and download the mail datasets:

python startup.py
Use the Jupyter notebook Spam classifier to download and familiarize yourself with the data, train, evaluate and fine-tune the classifiers, and pick the best-performing ones.
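If you want to verify the installation by hand, a check along these lines is one option. This is only a sketch of the idea, not the contents of startup.py: the `missing_modules` helper and the module names passed to it are hypothetical, so substitute the top-level module names your environment actually needs.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Example names only: one standard-library module and one deliberately fake name.
missing = missing_modules(["json", "no_such_module_for_spam_demo"])
if missing:
    print("Missing modules:", ", ".join(missing))
```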

ElisaCovato/Spam-Classifier
