This project uses machine learning methods to predict the following about tweets (posts on Twitter):
- Are tweets informative or non-informative?
- Will a tweet be re-tweeted by another user?
git clone https://github.com/bgold09/tweet_learn.git
cd tweet_learn
Use your preferred method (pip, apt-get, etc.) to install the following Python packages required by tweet_learn:
Download and unpack the Stanford Named Entity Recognizer:
wget http://nlp.stanford.edu/software/stanford-ner-2014-01-04.zip
unzip stanford-ner-2014-01-04.zip
Start a local NER Java server. Run it in a separate terminal window, because starting the process in the background causes the server to malfunction:
java -mx1000m -cp stanford-ner.jar edu.stanford.nlp.ie.NERServer -loadClassifier classifiers/ner-eng-ie.crf-3-all2008-distsim.ser.gz -port 8080 -outputFormat inlineXML
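To confirm that the server started above is reachable, a small client along the following lines may help. This is a minimal sketch for Python 3, not part of tweet_learn: the helper names `query_ner` and `parse_inline_xml` are hypothetical, and the code assumes the server reads one newline-terminated line per connection, replies with inlineXML, and then closes the connection.

```python
import re
import socket

# Matches inlineXML spans such as <PERSON>Barack Obama</PERSON>.
_ENTITY_RE = re.compile(r"<([A-Z]+)>(.*?)</\1>")


def parse_inline_xml(tagged):
    """Return (entity_text, entity_type) pairs found in inlineXML output."""
    return [(text, label) for label, text in _ENTITY_RE.findall(tagged)]


def query_ner(text, host="localhost", port=8080):
    """Send one line of text to the NER server and return its tagged reply.

    Assumption: the server handles a single newline-terminated line per
    connection and closes the socket once it has replied.
    """
    with socket.create_connection((host, port), timeout=10) as conn:
        # Collapse newlines so the whole text is sent as one line.
        conn.sendall(text.replace("\n", " ").encode("utf-8") + b"\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8").strip()
```

With the server running, `parse_inline_xml(query_ner("Barack Obama visited Paris."))` should return a list of (text, type) pairs. The port matches the `-port 8080` flag used in the command above.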
mysql -u <username> -p -e 'CREATE DATABASE twitter;'
mysql -u <username> -p twitter < data/users_backup.sql
From a Python session:
>>> import tweet_learn as tl
>>> tl.store_initial_data("train_test_set")
>>> tl.add_centrality_feature("train_test_set")
From a Python session (with tweet_learn imported as tl):
>>> ml = tl.extract_transform_data("train_test_set", 0, 1001)
See confusion.py, score.py, and roc.py for methods of evaluating the quality of your models.
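For orientation, the kind of measurement a confusion matrix supports can be sketched in plain Python. This is an illustrative example only, and it does not reproduce the interfaces of confusion.py, score.py, or roc.py. The function names are hypothetical, and labels are assumed to be binary, with 1 marking the positive class (for example, "informative").

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def precision_recall(counts):
    """Return (precision, recall); each is 0.0 when its denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = float(counts["tp"]) / predicted_pos if predicted_pos else 0.0
    recall = float(counts["tp"]) / actual_pos if actual_pos else 0.0
    return precision, recall
```

Precision measures how many predicted positives were correct, and recall measures how many actual positives were found. Reporting both guards against a model that looks accurate only because one class dominates the data.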
Copyright (c) 2014 Scott Bickel, Brian Golden, Stephen Styer
Licensed under the MIT license.