The build-sentiment-classifier.ipynb Jupyter Notebook builds and exports a serialized Twitter sentiment classifier, twitter_sentiment_model.pkl, for use with PL/Python on PostgreSQL, Greenplum Database, or Apache HAWQ. The classifier is based on the approach of Go et al. and is trained on the Sentiment140 data, which can be downloaded from the Sentiment140 website.
The classifier reaches 80% accuracy on a test dataset of several hundred annotated tweets. The training set consists of 1.6 million tweets that were labeled automatically by assuming that any tweet containing a positive emoticon, like :), is positive, and any tweet containing a negative emoticon, like :(, is negative. This technique is called distant supervision: the emoticons serve as noisy labels.
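The labeling rule described above can be sketched in a few lines of Python. This is only an illustration, not the code used to build Sentiment140: the emoticon lists, the function name label_tweet, the 1/0 label encoding, and the choice to skip tweets with mixed or no emoticons are all assumptions made for the example. Stripping the emoticons from the text after labeling is likewise an assumption here, made so that a classifier trained on the output cannot simply memorize them.

```python
# Hypothetical emoticon lists; the sets used to build Sentiment140 may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", ": (")


def label_tweet(text):
    """Assign a noisy sentiment label to a tweet based on its emoticons.

    Returns (label, cleaned_text), where label is 1 for positive and
    0 for negative, or None when the tweet has no emoticons or has both
    kinds (assumption: such tweets are skipped).
    """
    has_positive = any(e in text for e in POSITIVE_EMOTICONS)
    has_negative = any(e in text for e in NEGATIVE_EMOTICONS)

    # No signal, or contradictory signal: do not label the tweet.
    if has_positive == has_negative:
        return None

    label = 1 if has_positive else 0

    # Remove the emoticons so they cannot leak the label into the features.
    cleaned = text
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        cleaned = cleaned.replace(emoticon, "")

    # Collapse whitespace left behind by the removals.
    cleaned = " ".join(cleaned.split())
    return label, cleaned


if __name__ == "__main__":
    tweets = [
        "Loving this new phone :)",
        "Missed my flight :(",
        "Just landed",
    ]
    for tweet in tweets:
        print(repr(tweet), "->", label_tweet(tweet))
```

Because the labels come from a heuristic rather than from human annotators, some of them will be wrong (for example, sarcastic tweets), which is why they are described as noisy.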
Chris Rawles