
Twitter Sentiment Analysis

CS412 course project at the University of Illinois at Chicago.

Task

Perform sentiment analysis over a corpus of tweets about the candidates Barack Obama and Mitt Romney during the 2012 U.S. presidential election.

The previous best score on the test dataset was a 64% F1-score, suggesting that improvements could be obtained using modern machine learning and deep learning algorithms.

Methodology

The task proceeds in several stages:

1) Cleaning:

The provided text corpus contains a lot of irrelevant textual content, such as hashtags, URLs, and numbers, which a sentiment classifier cannot use. This content was therefore removed.
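A minimal regex-based sketch of this cleaning step (the function name and exact patterns are ours; the project only states that hashtags, URLs, and numbers were removed):

```python
import re

def clean_tweet(text):
    """Strip URLs, hashtags, and numbers from a raw tweet."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"\d+", " ", text)                    # numbers
    return re.sub(r"\s+", " ", text).strip().lower()    # tidy whitespace

print(clean_tweet("Obama leads by 5 points in #polls http://t.co/abc"))
# -> "obama leads by points in"
```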

2) Pre-processing:

The cleaned datasets were then merged into a single text corpus, an approach we describe as "Joint Training". This increases the number of samples the deep learning models can learn from, which is highly beneficial given the tendency of LSTMs to overfit.
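As a sketch, the merge might look like the following (file and column names are hypothetical; `clean_tweet` is the helper from the cleaning step above):

```python
import pandas as pd

# Hypothetical file/column names; the project provides separate
# Obama and Romney tweet sets with sentiment labels.
obama = pd.read_csv("obama_tweets.csv")    # columns: tweet, label
romney = pd.read_csv("romney_tweets.csv")

# "Joint Training": pool both candidates' tweets into one corpus so
# the deep models see roughly twice as many training samples.
corpus = pd.concat([obama, romney], ignore_index=True)
corpus["tweet"] = corpus["tweet"].map(clean_tweet)
```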

Other pre-processing steps, such as tokenization, n-gram extraction, and TF-IDF vectorization, were also performed to improve the quality of the training data. GloVe embeddings trained on the 840-billion-token Common Crawl corpus were used to provide a strong pre-trained initialization of the embedding layer in the deep learning models.
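A sketch of these steps, assuming scikit-learn for TF-IDF and the public glove.840B.300d vectors (the n-gram range, vocabulary settings, and file path are assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from tensorflow.keras.preprocessing.text import Tokenizer

# Word uni- and bi-gram TF-IDF features for the classical ML models.
tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
X_tfidf = tfidf.fit_transform(corpus["tweet"])

# Integer-index tokenization plus a GloVe-initialized embedding
# matrix for the deep models.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus["tweet"])
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 300

embedding_matrix = np.zeros((vocab_size, embedding_dim))
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        # Some GloVe-840B tokens contain spaces, so take the last 300
        # fields as the vector and rejoin the rest into the token.
        word = " ".join(parts[:-embedding_dim])
        idx = tokenizer.word_index.get(word)
        if idx is not None:
            embedding_matrix[idx] = np.asarray(parts[-embedding_dim:],
                                               dtype="float32")
```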

3) Training / Testing:

Training was split into three branches: Machine Learning, Deep Learning, and Ensemble Learning.

Machine Learning

Each machine learning model was trained with 100-fold stratified cross-validation, yielding 100 fitted models for each of the 7 ML algorithms we tried. Hyperparameters for all models were tuned using cross-validation.
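A sketch of this cross-validation setup (the report does not list all 7 classifiers, so the three below are stand-ins):

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

y = corpus["label"].values

# 100-fold stratified CV keeps the class balance in every fold and
# yields 100 fitted models per classifier.
cv = StratifiedKFold(n_splits=100, shuffle=True, random_state=42)
for clf in (LogisticRegression(max_iter=1000), LinearSVC(), MultinomialNB()):
    scores = cross_val_score(clf, X_tfidf, y, cv=cv, scoring="f1_macro")
    print(type(clf).__name__, round(scores.mean(), 3))
```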

Deep Learning

Each deep learning model was trained with 10-fold stratified cross-validation, yielding 10 fitted models for each of the 4 DL architectures we tried. The LSTM's hyperparameters were tuned via cross-validation.
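A minimal Keras sketch of the LSTM, reusing the GloVe matrix from the pre-processing step (layer sizes, sequence length, and the three-class output are assumptions, not the tuned configuration):

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = 40  # assumed maximum tweet length in tokens
X_seq = pad_sequences(tokenizer.texts_to_sequences(corpus["tweet"]),
                      maxlen=max_len)

model = Sequential([
    Embedding(vocab_size, embedding_dim,
              weights=[embedding_matrix],   # GloVe initialization
              input_length=max_len, trainable=False),
    LSTM(128, dropout=0.2, recurrent_dropout=0.2),
    Dense(3, activation="softmax"),         # positive / neutral / negative
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```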

Ensemble Learning

We performed 10-fold cross-validation on both the stacked generalization model and the soft-voting classifier.
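A sketch of both ensembles using scikit-learn (the base estimators are the same stand-ins as above; the actual base models are not specified here):

```python
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

base = [("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True)),  # probabilities needed for soft voting
        ("nb", MultinomialNB())]

# Stacked generalization: a meta-learner combines base-model predictions.
stacked = StackingClassifier(estimators=base,
                             final_estimator=LogisticRegression(), cv=10)
# Soft voting: average the base models' predicted class probabilities.
voting = VotingClassifier(estimators=base, voting="soft")

for name, ens in (("stacking", stacked), ("soft voting", voting)):
    scores = cross_val_score(ens, X_tfidf, y, cv=10, scoring="f1_macro")
    print(name, round(scores.mean(), 3))
```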

Performance

The LSTM RNN performed best, achieving a test-set F1-score of nearly 68% and beating the earlier 64% benchmark.

The strongest ensemble learning model was the stacked ensemble, which achieved the second-highest score at 66.9%.
