Skip to content

ryantylerkane/Movie-Reviews

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This project implements a Naive Bayes classifier that is trained using 25,000 positive and negative reviews gathered from imdb.com. After training, the program will output the probability assigned to each lexicon during the process.
The classifier is then tested on an additional 25,000 test files which creates one output file comparing the prediction from the model to the actual label (positive or negative) provided in the test data.

pre-processing.py:
1. The vocabulary file must be named "imdb.vocab" and must appear in the same directory as the program file.
2. The program requires that the input directory contains "pos" and "neg" sub-folders, each containing their respective test/train documents.
3. The output file will take the name of the last directory in the provided input path. This should be "test" or "train" if the same directory structure for the raw data is used.
4. Unicode encoding was required to be used for the input files due to the usage of some Unicode characters in the test/training reviews.

NB.py:
1. The prior probabilities and context features will be outputted in linear space, as there was no problem with underflow when ran on the testing/training corpus.
2. The probabilities found in the prediction files will be outputted in log space. Negative log space was avoided so that the "max" function could easily be used to determine the predicted class.
3. Unicode encoding was required to be used for the input files due to the usage of some Unicode characters in the test/training reviews.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages