# NLP v2

## Data Labeling

Follow the same data labeling and migration steps as before, operating on the same tables, but add a step to de-duplicate the status texts.

Because the same status text can be retweeted by many users, and each retweet carries identical text, duplicate statuses must be removed from the training data to prevent the model from over-fitting to the most-retweeted texts:

```sql
CREATE TABLE impeachment_production.2_community_labeled_status_texts as (
  SELECT
    status_text
    ,count(distinct status_id) as status_occurrences
    ,avg(community_id) as avg_community_score
    -- TODO: maybe add the median or mode community score as well
  FROM impeachment_production.2_community_labeled_tweets
  GROUP BY 1
  -- HAVING status_occurrences > 1 and avg_community_score between 0.3 and 0.7
  -- ORDER BY 2 DESC
) -- 2,771,905 tweets
```

Then download a copy of that table into this directory as "data/nlp_v2/2_community_labeled_status_texts.csv".
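If you prefer to script the export rather than use the BigQuery console, a minimal sketch using the `google-cloud-bigquery` and `pandas` libraries (assuming credentials are already configured via `GOOGLE_APPLICATION_CREDENTIALS`) might look like this; it is not part of the app itself:

```python
# Hypothetical helper for exporting the de-duplicated training table to CSV.
from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT * FROM `impeachment_production.2_community_labeled_status_texts`"
df = client.query(sql).to_dataframe()  # ~2.7M rows, so this can take a while
df.to_csv("data/nlp_v2/2_community_labeled_status_texts.csv", index=False)
```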

Now you are ready for training.

## Sklearn Models

### Training and Evaluation

Train some models on the labeled training data, and save them:

APP_ENV="prodlike" python -m app.nlp_v2.model_training

### Predictions

Promote a given model to use for classification:

SOURCE="nlp_v2/models/2020-10-07-0220/logistic_regression" DESTINATION="nlp_v2/models/best/logistic_regression" python -m app.nlp_v2.model_promotion

SOURCE="nlp_v2/models/2020-10-07-0222/multinomial_nb" DESTINATION="nlp_v2/models/best/multinomial_nb" python -m app.nlp_v2.model_promotion

And use the trained model to make ad-hoc predictions:

```sh
python -m app.nlp_v2.client
```
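Conceptually the client loads a promoted model and scores arbitrary text. Assuming the pipelines were serialized with joblib (as in the training sketch above), an ad-hoc prediction looks roughly like:

```python
# Hypothetical ad-hoc prediction against a promoted, joblib-serialized sklearn pipeline.
from joblib import load

pipeline = load("nlp_v2/models/best/logistic_regression/model.joblib")  # assumed filename
texts = ["RT @example: the hearings start today"]
print(pipeline.predict(texts))        # predicted community label per text
print(pipeline.predict_proba(texts))  # class probabilities per text
```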

Or to score all the unseen tweets:

APP_ENV="prodlike" LIMIT=10000 BATCH_SIZE=900 python -m app.nlp_v2.bulk_predict

## BERT Transformer

Use the bot impact preparation code to produce daily "tweets.csv" and "nodes.csv" files in the "daily_active_edge_friend_graphs_v5" directory.

Upload these files to Google Drive.

Use the Colab Notebook to train a BERT model on each day's tweets, and save daily scores as "tweets_BERT_Impeachment_800KTweets.csv".
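For reference, a condensed sketch of what that kind of fine-tuning and scoring loop can look like with the `transformers` library (column names, hyperparameters, and file paths here are assumptions; the authoritative version is the Colab notebook itself):

```python
# Condensed sketch of fine-tuning BERT on one day's tweets and saving per-tweet scores.
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizerFast

df = pd.read_csv("daily_active_edge_friend_graphs_v5/2020-01-01/tweets.csv")  # hypothetical daily file
texts = df["status_text"].astype(str).tolist()    # assumed column name
labels = df["community_id"].astype(int).tolist()  # assumed binary community label

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)),
                    batch_size=16, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        loss = model(input_ids=input_ids, attention_mask=attention_mask, labels=batch_labels).loss
        loss.backward()
        optimizer.step()

# Score the day's tweets in batches and save them in the expected daily output file.
model.eval()
scores = []
with torch.no_grad():
    for i in range(0, len(texts), 64):
        batch_enc = tokenizer(texts[i:i + 64], truncation=True, padding=True,
                              max_length=128, return_tensors="pt")
        logits = model(**batch_enc).logits
        scores.extend(torch.softmax(logits, dim=1)[:, 1].tolist())
df["bert_score"] = scores  # hypothetical score column
df.to_csv("tweets_BERT_Impeachment_800KTweets.csv", index=False)
```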

Download all of the daily "tweets_BERT_Impeachment_800KTweets.csv" files into this local repo (this takes a while; an automated Google Drive downloader, like the sketch below, would be helpful).
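One way to automate the download, assuming the daily score files live in a shared Google Drive folder, is the `gdown` package (`pip install gdown`); the folder URL and output path below are placeholders:

```python
# Hypothetical Google Drive folder download using gdown.
import gdown

folder_url = "https://drive.google.com/drive/folders/YOUR_FOLDER_ID"  # replace with the real shared folder
gdown.download_folder(folder_url, output="data/nlp_v2/daily_bert_scores", quiet=False)
```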

Bulk upload all daily BERT scores to BQ (and GCS):

APP_ENV="prodlike" python -m app.nlp_v2.bert_score_uploader