Made by Christos Kormaris
Programming Language: Python 3
Unzip the compressed file LingspamDataset.zip
in the same directory where the Python files are.
700 train documents and 260 test documents will reside inside the uncompressed folder.
Let's begin by denoting the variable C, which takes the values: C=1 for spam documents and C=0 for ham documents.
First, run the Python file `feature_selection_using_ig.py` to generate the feature tokens that we'll use. Run:

```shell
python feature_selection_using_ig.py
```
Feature selection of the most useful feature tokens is implemented using Information Gain (IG).
The feature tokens are boolean: a feature takes the value 0 if the token does not appear in a text and 1 if it does.
These boolean values are assigned while generating the feature vector of each text.
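For illustration, a feature vector could be built as in the following sketch; the function name and its arguments are hypothetical, not necessarily the ones used in the actual script:

```python
def build_feature_vector(document_tokens, feature_tokens):
    """Boolean feature vector: 1 if the feature token appears in the document, 0 otherwise."""
    present = set(document_tokens)
    return [1 if token in present else 0 for token in feature_tokens]
```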
At the start of the program, all the train files of the corpus are parsed and we count in how many spam documents and in how many ham documents each word appears.
The results are saved in three dictionaries, named `feature_spam_frequency`, `feature_ham_frequency` and `feature_frequency` respectively, with the feature tokens as keys and their frequencies as values.
These dictionary variables are used for the calculation of the probabilities of the Information Gain algorithm.
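As a minimal sketch of this counting step (the toy `train_documents` list of `(tokens, label)` pairs is a hypothetical stand-in for the parsed Lingspam train files):

```python
from collections import defaultdict

# Hypothetical toy input: (tokens, label) pairs, with label 1 for spam and 0 for ham.
# In the real script these would come from the Lingspam train documents.
train_documents = [
    (["free", "money", "offer", "money"], 1),
    (["meeting", "schedule", "money"], 0),
]

feature_spam_frequency = defaultdict(int)  # token -> number of spam documents containing it
feature_ham_frequency = defaultdict(int)   # token -> number of ham documents containing it
feature_frequency = defaultdict(int)       # token -> number of documents (either class) containing it

for tokens, label in train_documents:
    for token in set(tokens):  # count each token at most once per document
        feature_frequency[token] += 1
        if label == 1:
            feature_spam_frequency[token] += 1
        else:
            feature_ham_frequency[token] += 1
```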
We calculate the entropy H(C) and print it. We proceed by adding each candidate feature token to the `IG` dictionary, while calculating its IG score. Finally, we select the top m feature tokens, i.e. those with the highest Information Gain (IG) scores.
The Information Gain score of each feature token X is calculated using the formula:

$$ IG(C, X) = H(C) - \sum_{i \in \{0, 1\}} P(X = i) \, H(C \mid X = i) $$

where i=0 and i=1 are the boolean values that a feature token may take, indicating whether or not it appears in a text. Concretely, the feature X that reduces the entropy of C the most (i.e. has the highest IG score) is the most desirable candidate feature, because it discriminates the category of a document most efficiently. The number of feature tokens that the feature selection algorithm returns is set to 1000. The number of feature tokens to use depends on the classification task we want to execute; for our Naive-Bayes spam-ham classifier, 1000 feature tokens is a good choice.
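Continuing the sketch above, the entropy and the IG scores could be computed roughly like this (a sketch only, not the exact code of the script):

```python
import math

def entropy(probabilities):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

num_spam = sum(1 for _, label in train_documents if label == 1)
num_ham = sum(1 for _, label in train_documents if label == 0)
num_docs = num_spam + num_ham

# Entropy of the class variable C.
H_C = entropy([num_spam / num_docs, num_ham / num_docs])
print("H(C) =", H_C)

IG = {}
for token, doc_freq in feature_frequency.items():
    p_x1 = doc_freq / num_docs  # P(X=1): the token appears in a document
    p_x0 = 1 - p_x1             # P(X=0): the token does not appear

    # Conditional entropy of C given that the token appears.
    spam_with = feature_spam_frequency[token]
    ham_with = feature_ham_frequency[token]
    H_C_given_x1 = entropy([spam_with / doc_freq, ham_with / doc_freq])

    # Conditional entropy of C given that the token is absent.
    absent = num_docs - doc_freq
    H_C_given_x0 = 0.0
    if absent > 0:
        H_C_given_x0 = entropy([(num_spam - spam_with) / absent, (num_ham - ham_with) / absent])

    IG[token] = H_C - (p_x1 * H_C_given_x1 + p_x0 * H_C_given_x0)

# Keep the m feature tokens with the highest IG scores.
m = 1000
feature_tokens = sorted(IG, key=IG.get, reverse=True)[:m]
```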
In addition, inside block comments, there is an alternative way of scoring each candidate feature token, using the following formula:

$$ score(X) = \left| P(X = 1 \mid C = 1) - P(X = 1 \mid C = 0) \right| $$

which is the absolute difference between the conditional probabilities of the token appearing in each of the 2 classes (spam or ham).
The tokens with the largest absolute difference are selected as feature tokens of the feature vectors, based on which we classify our corpus files.
Using the feature tokens selected by this formula, slightly worse accuracy has been observed.
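A minimal sketch of this alternative score, reusing the frequency dictionaries and document counts from the snippets above:

```python
# Alternative score: absolute difference of the per-class appearance probabilities of each token.
alternative_score = {}
for token in feature_frequency:
    p_token_given_spam = feature_spam_frequency[token] / num_spam
    p_token_given_ham = feature_ham_frequency[token] / num_ham
    alternative_score[token] = abs(p_token_given_spam - p_token_given_ham)

# Tokens with the largest difference would be kept as features.
alternative_feature_tokens = sorted(alternative_score, key=alternative_score.get, reverse=True)[:m]
```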
At the end of the program `feature_selection_using_ig.py`, two dictionaries are created, containing the feature tokens that will be used by the Naive-Bayes classifier; they are saved in the files `spam_feature_dictionary.txt` and `ham_feature_dictionary.txt`.
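As a purely illustrative sketch (the actual contents and line format of these files are defined by the script and are not shown here), the two dictionaries could be written out like this:

```python
# Hypothetical format: one "token frequency" pair per line.
with open("spam_feature_dictionary.txt", "w") as spam_file:
    for token in feature_tokens:
        spam_file.write(f"{token} {feature_spam_frequency[token]}\n")

with open("ham_feature_dictionary.txt", "w") as ham_file:
    for token in feature_tokens:
        ham_file.write(f"{token} {feature_ham_frequency[token]}\n")
```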
After the feature selection step, run `naive_bayes_classifier.py` to start the classification process. Run:

```shell
python naive_bayes_classifier.py
```
First, the classifier counts the frequency of each feature token in the spam documents, and then in the ham documents.
We use boolean features. The total frequency of each feature token in each class is saved in two dictionaries: `token_frequencies_in_spam_class` and `token_frequencies_in_ham_class`.
We also count how many words the documents of the spam class contain in total, and likewise for the ham documents.
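A sketch of this counting step, assuming (as above) that `train_documents` holds `(tokens, label)` pairs and `feature_tokens` is the selected feature list:

```python
from collections import defaultdict

token_frequencies_in_spam_class = defaultdict(int)
token_frequencies_in_ham_class = defaultdict(int)
spam_word_count = 0  # total number of words in all spam train documents
ham_word_count = 0   # total number of words in all ham train documents

feature_set = set(feature_tokens)
for tokens, label in train_documents:
    if label == 1:  # spam document
        spam_word_count += len(tokens)
        target = token_frequencies_in_spam_class
    else:           # ham document
        ham_word_count += len(tokens)
        target = token_frequencies_in_ham_class
    for token in tokens:
        if token in feature_set:
            target[token] += 1
```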
We use the acquired frequencies to estimate the Laplace probabilities that will determine the correct category of the test data.
The Laplace probability estimate classification calculates the conditional probability of each test document's feature vector for each of the 2 categories (spam or ham) and assigns the document to the class for which the conditional probability is higher, i.e. for the feature vector Xi of the test document i we calculate the probabilities p(Xi|C=1) and p(Xi|C=0) (reminder: C=1 for spam, C=0 for ham). The class with the higher probability will be the predicted class of the test document.
For the calculation of the probabilities we also use Laplace smoothing, adding 1 to the numerator and |V| to the denominator (where |V| is the number of distinct words in the corpus, i.e. the vocabulary size of the corpus).
This is the formula used for calculating the probability of the feature token i belonging to a spam document:

$$ P(X_i = 1 \mid C = 1) = \frac{\mathrm{freq}(X_i, \mathit{spam}) + 1}{N_{\mathit{spam}} + |V|} $$

where freq(X_i, spam) is the total frequency of the feature token i in the spam train documents and N_spam is the total number of words in the spam train documents.
To calculate the probability of the entire feature vector belonging to the spam class, we multiply the probabilities of the separate feature tokens belonging to the spam class. The exact formula is:

$$ P(C = 1 \mid X) = \frac{P(C = 1) \prod_{i=1}^{m} P(X_i \mid C = 1)}{P(X)} $$

where X = (X_1, ..., X_m) is the feature vector of the test document.
We do the same for the ham class. We can omit the term P(featureVector), i.e. P(X), since it is the same for both classes.
Whichever of the 2 probabilities is bigger indicates the category to which the given test document and its feature vector most likely belong.
Also, to avoid numerical underflow it is a good idea to work in log space (e.g. using the numerically stable log-sum-exp trick), summing logarithms of probabilities rather than multiplying the probabilities themselves.
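A minimal sketch of this classification step, under the same assumptions as the previous snippets; the `vocabulary_size` parameter (standing for |V|) and the function names are hypothetical:

```python
import math

def log_class_score(document_tokens, class_token_frequencies, class_word_count,
                    class_prior, vocabulary_size):
    """Log of the (unnormalized) posterior for one class, with Laplace-smoothed
    per-token probabilities; the P(X) term is omitted since it is common to both classes."""
    score = math.log(class_prior)
    present = set(document_tokens)
    for token in feature_tokens:
        p_token = (class_token_frequencies[token] + 1) / (class_word_count + vocabulary_size)
        # Boolean feature: use p if the token is present in the document, (1 - p) otherwise.
        score += math.log(p_token) if token in present else math.log(1.0 - p_token)
    return score

def classify(document_tokens, vocabulary_size):
    """Return 1 (spam) or 0 (ham) for the given tokenized test document."""
    spam_score = log_class_score(document_tokens, token_frequencies_in_spam_class,
                                 spam_word_count, num_spam / num_docs, vocabulary_size)
    ham_score = log_class_score(document_tokens, token_frequencies_in_ham_class,
                                ham_word_count, num_ham / num_docs, vocabulary_size)
    return 1 if spam_score > ham_score else 0
```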
The execution of the classifier delivered an accuracy of 96.92 %, while using 1000 feature tokens, and the same accuracy while using 100 feature tokens.
- Statistics show that the Naive-Bayes classifier achieves higher accuracy on a small corpus than on a big one.
- Console execution results are included in the `console_outputs` folder.