Skip to content

gbugaisky/cse151_boosting

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 

Repository files navigation

In this assignment, we will look at the task of spam classification using boosting. Our raw data is a set of emails, which were collected from a liguistics mailing list; the emails are labeled as spam or not spam. For your benefit, we have already preprocessed the emails to remove stop-words, punctuation, and to do some preliminary preprocessing that lemmatises the words (for example, that maps words such as include, includes and included to the same word), and converted them to vectors of features.

Download files hw6train.txt, hw6test.txt and hw6dictionary.txt from the class website. The first two files contain your training and test datasets respectively. The third file is a dictionary and contains a list of words. Each line in the files hw6train.txt and hw6test.txt correspond to an email followed a label which can be 1 or -1. An email is represented by a feature vector of length 1531; a label -1 indicates that the email is a spam message, and a label 1 indicates that it is not spam. Coordinate i of the feature vector corresponding to an email is 1 when word i in hw6dictionary.txt is present in the email and 0 otherwise. Write down the training and test errors of the classifiers obtained after t = 3;7;10;15;20 rounds of boosting. Use the following weak learning procedure. Each weak learner corresponds to a classifier hi;+ or hi;-, where i is a word in the dictionary and the classifier hi;+ is the rule: hi;+(x) = 1; if word i occurs in email x = -1; otherwise

Similarly, the classifier hi; is the rule: hi;-(x) = 1; if word i does not occur in email x = -1; otherwise The set of weak learners C is the collection of such classifiers for all i, and your weak learning procedure should select the weak learner which has the highest accuracy in C with respect to the current weighted set of examples. Based on the dictionary file, write down the words corresponding to the weak learners chosen in the first 10 rounds of boosting.

[Hint: If your code is correct, you should get a training error of 0.1140 and a test error of 0.09 after two rounds of boosting.]

Note: minor change made to test SSH connection to Bitbucket.

About

Boosting Assignment for CSE 151

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages