CS4248 Assignment 3 - Perceptron Text Classification

Performs text classification using the Perceptron Learning Algorithm. The application works with any number of classes, any class names, and any number of training texts per class.

File Structure

.
├── /stopword-list           # List of all stop words
├── /porter.py               # Porter's Stemmer Algorithm
├── /DataPrepper.py          # Handles the generation of feature vectors
├── /Tokenizer.py            # Contains tokenizer functions for the dataset
├── /tc-train.py             # Runs the training phase (tokenize, feature selection, training)
├── /tc-test.py              # Tests a trained model on a set of blind test documents
├── /tc-crossvalidation.py   # A simple script I made to help me perform cross-validation
└── README.md

Data Preparation / Text Normalization

Stop word removal
Stemming using Porter's Stemmer
Case-folding (tentative)
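
The three normalization steps listed above can be sketched roughly as follows. This is only an illustration, not the code in Tokenizer.py or DataPrepper.py: the names `normalize` and `toy_stem` are hypothetical, the regular-expression tokenizer is an assumption, and `toy_stem` is a crude stand-in for Porter's Stemmer so that the example runs on its own.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text, stopwords, stem=toy_stem):
    """Case-fold, tokenize, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())  # case-folding + tokenizing
    return [stem(t) for t in tokens if t not in stopwords]
```

In practice the `stem` argument would be replaced by a call into a real Porter stemmer, and `stopwords` would be the set of words read from stopword-list.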

Feature selection for dimensionality reduction

Rules:

Select a stemmed word as a feature for a class c if it has a high chi-squared value with respect to c
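
As a rough illustration of this rule, the sketch below scores each term against one class using the standard chi-squared statistic for a 2x2 term/class contingency table and keeps the top k terms. The function names, the representation of each document as a set of stemmed words, and the top-k cut-off are all assumptions made for the example; the README does not state how the threshold for "high" is chosen. The code assumes Python 3 division.

```python
def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:  # term or class is degenerate: no evidence either way
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, target, k):
    """Return the k terms with the highest chi-squared score for `target`.

    docs is a list of sets of stemmed words (an assumed representation).
    """
    in_class = sum(1 for y in labels if y == target)
    out_class = len(labels) - in_class
    vocab = set().union(*docs) if docs else set()
    scores = {}
    for term in vocab:
        n11 = sum(1 for d, y in zip(docs, labels) if y == target and term in d)
        n10 = sum(1 for d, y in zip(docs, labels) if y != target and term in d)
        scores[term] = chi_squared(n11, n10, in_class - n11, out_class - n10)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

Note that the statistic is symmetric: a term that appears only outside the class scores as highly as one that appears only inside it, so both kinds of term can be selected.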

Perceptron Learning Step

Note: This is a multi-class perceptron learner in which one classifier is learned for each class. Each text is assumed to belong to exactly one of the given classes.
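
A minimal sketch of the one-classifier-per-class setup described in the note, assuming dense feature vectors given as lists of numbers. The function names, the fixed number of epochs, the learning rate, and the rule of predicting the class with the largest activation are assumptions for illustration and are not taken from PerceptronClassifier.py.

```python
def train_one_vs_rest(vectors, labels, classes, epochs=10, lr=1.0):
    """Learn one binary perceptron (weights plus bias) per class."""
    dim = len(vectors[0])
    model = {c: [0.0] * (dim + 1) for c in classes}  # last slot is the bias
    for c in classes:
        w = model[c]
        for _ in range(epochs):
            for x, y in zip(vectors, labels):
                target = 1 if y == c else -1  # this class vs. all the others
                # zip stops at len(x), so the bias slot is added separately.
                activation = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
                if target * activation <= 0:  # misclassified: nudge weights
                    for i, xi in enumerate(x):
                        w[i] += lr * target * xi
                    w[-1] += lr * target
    return model


def predict(model, x):
    """Pick the class whose perceptron gives the largest activation."""
    def score(c):
        w = model[c]
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return max(model, key=score)
```

Taking the largest activation is one simple way to honour the assumption that each text belongs to exactly one class, since it always returns a single label even when several per-class classifiers fire or none do.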

Instructions

Train the text classifier:

python tc-train.py stopword-list train-class-list model

where model is the file in which the learned perceptron weights will be stored, and stopword-list is a file containing a list of stop words. train-class-list is a file in which each line gives the location of a training text followed by its class name, for example:

/home/course/cs4248/tc/c1/37261 c1
/home/course/cs4248/tc/c1/37913 c1
/home/course/cs4248/tc/c1/37914 c1
...
/home/course/cs4248/tc/c1/58343 c1
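
Reading a file in this format could look like the sketch below. The function name `read_class_list` is hypothetical, and the code assumes that the class name is the last whitespace-separated token on each line and that every non-blank line contains both a path and a class name.

```python
def read_class_list(lines):
    """Parse 'path class' lines into a list of (path, class) pairs."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        # Split once from the right so the class name is the final token.
        path, label = line.rsplit(None, 1)
        pairs.append((path, label))
    return pairs
```

Because the function accepts any iterable of lines, an open file object can be passed to it directly, for example inside `with open("train-class-list") as f:`.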

Run text classifier on given assignment test set:

python tc-test.py stopword-list model test-list test-class-list

where stopword-list is the same file containing a list of stop words, and model is the file of weights learned during training. test-list is a file listing the locations of the test texts to be classified, one per line:

/home/course/cs4248/tc/test/001
/home/course/cs4248/tc/test/002
/home/course/cs4248/tc/test/003
...

and test-class-list is a file in the same format as train-class-list.


NatashaKSS/simple-perceptron-text-classification
