This application is a spam email classifier that protects its users from potentially harmful phishing or spam emails. It is implemented as a Microsoft Outlook Add-in that runs whenever the user receives new mail. The Add-in classifies the mail as either SPAM or NOT SPAM, prepends its classification and confidence to the email's subject, then moves the email into the appropriate folder (Inbox or WatsonSpam).
- Visual Studio 2017
- Microsoft Office 365 account
- Microsoft Outlook
- .NET Framework 4.6.*
- IBM Watson Natural Language Classifier
Open the SpamClassifier.sln Visual Studio solution file on a Windows computer. This opens the project in Visual Studio, where it can be built. All Add-in code written by the team is located in SpamClassifier/ThisAddIn.cs.
NOTE: The solution file will not automatically install the IBM Watson Natural Language Classifier package. Install it by running the following command in the NuGet Package Manager Console:
PM> Install-Package IBM.WatsonDeveloperCloud.NaturalLanguageClassifier.v1
NOTE: There is a known issue with the version of the System.Net.Http library that the Natural Language Classifier package depends on. Update System.Net.Http to the newest stable version through NuGet.
Two classifiers were trained using IBM Watson's Natural Language Classifier service. Both were trained on an online corpus of 4327 emails, split into 80% training data and 20% testing data. One classifier handles email subjects and the other handles email bodies.
- 92.96% accuracy
- 97.79% average confidence
- 94.77% accuracy
- 95.55% average confidence
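
The 80/20 split and the two reported metrics can be illustrated with a minimal Python sketch. It is not the code in the ClassifierCreation directory. The function names, the shuffling seed, and the reading of "average confidence" as the mean confidence of the predicted class are all assumptions made for illustration.

```python
import random


def split_corpus(examples, train_fraction=0.8, seed=0):
    """Shuffle labeled examples and split them into training and testing sets.

    The seed and the rounding of the cut point are assumptions; the source
    only states an 80% / 20% split.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def evaluate(predictions, expected_labels):
    """Return (accuracy, average_confidence) for a test run.

    predictions is a list of (label, confidence) pairs with confidence in
    [0, 1]; expected_labels holds the true label for each prediction.
    """
    if not predictions or len(predictions) != len(expected_labels):
        raise ValueError("predictions and expected_labels must match and be non-empty")
    correct = sum(
        1 for (label, _), truth in zip(predictions, expected_labels) if label == truth
    )
    accuracy = correct / len(predictions)
    average_confidence = sum(conf for _, conf in predictions) / len(predictions)
    return accuracy, average_confidence
```

With a corpus of 4327 examples, this sketch produces 3461 training and 866 testing examples; the split actually used for the classifiers may have been rounded differently.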
The creation, training and testing of these classifiers was done by Kurtis Kuszmaul. Code for these processes can be found in the ClassifierCreation directory.
The Outlook Add-in runs in the background of Outlook and fires whenever new mail is received. It locates the new mail item, extracts the subject and body from it, then classifies those two text fields using the classifiers explained above. The confidences of the two classifications are weighted and compared to determine a final classification and confidence level. The classification and confidence percentage are prepended to the subject, then the email is moved to the appropriate folder (Inbox or WatsonSpam).
- Classification done based on weighted sum of subject and body classifier confidence
- Requires 85% confidence to keep classification
- Classification of the higher of the two weighted confidences taken
- Requires 95% confidence to keep classification
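
The two combination rules listed above can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic in ThisAddIn.cs (which is C#). The weights, the conversion of a result into a spam probability, the subject tag format, and every function name are hypothetical. The source also does not say what happens when the threshold is not met, so the sketch only reports whether it was.

```python
SPAM = "SPAM"
NOT_SPAM = "NOT SPAM"

# Hypothetical weights; the source does not state the values used.
SUBJECT_WEIGHT = 0.4
BODY_WEIGHT = 0.6


def spam_probability(label, confidence):
    """Express a (label, confidence) result as a probability of SPAM."""
    return confidence if label == SPAM else 1.0 - confidence


def weighted_sum(subject, body):
    """Rule 1: classify from the weighted sum of both classifier results."""
    p = (SUBJECT_WEIGHT * spam_probability(*subject)
         + BODY_WEIGHT * spam_probability(*body))
    return (SPAM, p) if p >= 0.5 else (NOT_SPAM, 1.0 - p)


def higher_weighted(subject, body):
    """Rule 2: take the result whose weighted confidence is higher.

    Reporting the winner's unweighted confidence is an assumption.
    """
    s_label, s_conf = subject
    b_label, b_conf = body
    if SUBJECT_WEIGHT * s_conf >= BODY_WEIGHT * b_conf:
        return s_label, s_conf
    return b_label, b_conf


def meets_threshold(confidence, threshold):
    """True if the classification is confident enough to keep."""
    return confidence >= threshold


def tag_subject(subject_text, label, confidence):
    """Prepend the classification and confidence (hypothetical format)."""
    return "[{} {:.2f}%] {}".format(label, confidence * 100, subject_text)


def target_folder(label):
    """Folder names taken from the description of the Add-in."""
    return "WatsonSpam" if label == SPAM else "Inbox"
```

For example, `weighted_sum(("SPAM", 0.9), ("SPAM", 0.95))` yields a SPAM result that would pass `meets_threshold(confidence, 0.85)`, while a disagreement between the two classifiers produces a much lower combined confidence under these assumed weights.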
The design and development of this Outlook Add-in was done by Gregory Ochs, Ethan Knez, and Kurtis Kuszmaul. Code for the add-in can be found in the SpamClassifier directory.
- IBM Watson Natural Language Classifier
  - Used for custom classifications of text
- Visual Studio Tools for Office
  - Used for integrating custom Add-in functionality with Microsoft Outlook
- Ethan Knez
- Kurtis Kuszmaul
- Gregory Ochs