Follow the same data labeling and migration steps as before, operating on the same tables, but add a step to de-duplicate the status texts. Since the same status text can be retweeted by many users, producing many identical copies of the text, remove duplicate statuses from the training data to prevent model over-fitting:
```sql
CREATE TABLE `impeachment_production.2_community_labeled_status_texts` as (
  SELECT
    status_text
    ,count(distinct status_id) as status_occurrences
    ,avg(community_id) as avg_community_score
    -- TODO: maybe add the median or mode community score as well
  FROM `impeachment_production.2_community_labeled_tweets`
  GROUP BY 1
  -- HAVING status_occurrences > 1 and avg_community_score between 0.3 and 0.7
  -- ORDER BY 2 DESC
) -- 2,771,905 rows (unique status texts)
```
Then download a copy of that table into this directory as "data/nlp_v2/2_community_labeled_status_texts.csv".
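One way to script the download is a sketch like this, assuming the google-cloud-bigquery package (with its pandas extras) is installed and credentials are already configured:

```py
# a minimal sketch of the table download, assuming google-cloud-bigquery
# (with pandas / db-dtypes extras) and configured credentials
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT status_text, status_occurrences, avg_community_score
    FROM `impeachment_production.2_community_labeled_status_texts`
"""
df = client.query(sql).to_dataframe()
df.to_csv("data/nlp_v2/2_community_labeled_status_texts.csv", index=False)
```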
Now you are ready for training.
Train some models on the labeled training data, and save them:
APP_ENV="prodlike" python -m app.nlp_v2.model_training
Promote a given model to use for classification:
SOURCE="nlp_v2/models/2020-10-07-0220/logistic_regression" DESTINATION="nlp_v2/models/best/logistic_regression" python -m app.nlp_v2.model_promotion
SOURCE="nlp_v2/models/2020-10-07-0222/multinomial_nb" DESTINATION="nlp_v2/models/best/multinomial_nb" python -m app.nlp_v2.model_promotion
And use the trained model to make ad-hoc predictions:
```sh
python -m app.nlp_v2.client
```
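An ad-hoc prediction looks roughly like this (a sketch, assuming the promoted model was pickled as a vectorizer/model pair like the training sketch above; the file path is hypothetical):

```py
# a minimal sketch of ad-hoc prediction with the promoted model; the pickle
# path and {"vectorizer", "model"} layout follow the training sketch above
import pickle

with open("nlp_v2/models/best/logistic_regression.pkl", "rb") as f:  # hypothetical path
    artifacts = pickle.load(f)

texts = ["impeach him now", "this is a total witch hunt"]
scores = artifacts["model"].predict(artifacts["vectorizer"].transform(texts))
for text, score in zip(texts, scores):
    print(score, "|", text)
```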
Or use it to score all the unseen tweets:
APP_ENV="prodlike" LIMIT=10000 BATCH_SIZE=900 python -m app.nlp_v2.bulk_predict
Use the bot impact preparation code to produce daily "tweets.csv" and "nodes.csv" files in the "daily_active_edge_friend_graphs_v5" directory.
Upload these files to Google Drive.
Use the Colab Notebook to train a BERT model on each day's tweets, and save daily scores as "tweets_BERT_Impeachment_800KTweets.csv".
Download all the daily "tweets_BERT_Impeachment_800KTweets.csv" files into this local repo (this takes a while; an automated Google Drive downloader would be helpful).
Bulk upload all daily BERT scores to BQ (and GCS):
APP_ENV="prodlike" python -m app.nlp_v2.bert_score_uploader