The project explores the performance of various machine learning algorithms on the retweet prediction problem, depending on the features provided.
Tweets are labeled into two classes:
- `0`: tweets retweeted only once
- `1`: tweets retweeted more than once
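The binarization above can be sketched as follows (`retweet_count` is a hypothetical field name used for illustration, not taken from the project's data schema):

```python
def label_tweet(retweet_count: int) -> int:
    """Binarize a tweet's retweet count into the two classes above.

    Returns 0 for tweets retweeted exactly once and 1 for tweets
    retweeted more than once.
    """
    if retweet_count < 1:
        raise ValueError("only retweeted tweets are labeled")
    return 0 if retweet_count == 1 else 1
```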
Types of features:
a) Content features only, extracted by the transformer model `InfoCoV/Cro-CoV-cseBERT`
b) Various tabular features representing Twitter users and their interactions
c) Features a) and b) joined
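Option c) amounts to concatenating the two feature sets row-wise per tweet. A minimal NumPy sketch (the array names and shapes are illustrative assumptions, not the project's actual variables):

```python
import numpy as np

# Hypothetical shapes: 4 tweets, 768-dim BERT content vectors,
# 5 tabular user/interaction features per tweet.
content_features = np.random.rand(4, 768)   # a) content features
tabular_features = np.random.rand(4, 5)     # b) tabular features

# c) joined features: column-wise concatenation, one row per tweet.
joined_features = np.hstack([content_features, tabular_features])
print(joined_features.shape)  # (4, 773)
```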
Tested on Python 3.9.9. Install dependencies (a virtualenv is recommended):

    pip install -r requirements.txt
Original input data: `./data/original/Org-Retweeted-Vectors_preproc.csv`
Create folders for processed data:
- `./data/intermediary/`
- `./data/prepared/`

The notebook `./notebooks/exploration/features_analysis.ipynb` explores the features.
Run the scripts in the `preparation` folder:
- `bert_extract.py`: extracts the content features
- `tran_val_test.py`: creates the train, validation, and test splits for `id_str`
- `feature_preparation.py`: builds the training, validation, and test dataframes without content features
- `full_feature_joing.py`: builds the training, validation, and test dataframes for the joined features and for the content features alone
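The split step above divides tweet IDs into three sets. A stdlib-only sketch of that idea (the 80/10/10 ratios, seed, and function name are assumptions for illustration, not read from the project's scripts):

```python
import random

def split_ids(id_strs, train=0.8, val=0.1, seed=42):
    """Shuffle tweet id_str values and cut them into
    train/validation/test lists (test gets the remainder)."""
    ids = list(id_strs)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train_ids, val_ids, test_ids = split_ids([str(i) for i in range(100)])
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```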
Alternatively, you can download all the original and prepared data from here: `data.zip`. After downloading, unzip everything into `data`.
Notebooks in the `notebooks/baselines` folder investigate MLP and Random Forest runs:
- `content_only_2_labels.ipynb`: content features only
- `features_only_2_labels`: network and tabular features without content features
- `full_features_2_labels`: all features
Run `train.py "some_config.yaml"`, providing one of the configs from the `configs` folder. The config file specifies the model type and all related parameters. The training script runs multiple experiments to find the best hyperparameters using Optuna optimization.
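The hyperparameter search the training script performs with Optuna follows the usual propose/evaluate/keep-best loop. As a dependency-free illustration of that loop only, here is a stdlib random-search sketch; the objective function, parameter names, and search ranges are invented for the example and are not the project's actual ones:

```python
import random

def objective(params):
    # Stand-in validation score; in a real training script this would
    # train a model with `params` and return its validation metric.
    return -(params["lr"] - 0.01) ** 2 - 0.1 * params["depth"]

def random_search(n_trials=50, seed=0):
    """Sample parameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "lr": rng.uniform(1e-4, 1e-1),   # learning rate
            "depth": rng.randint(1, 8),      # model depth
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
print(best_params)
```

Optuna replaces the uniform sampling here with an adaptive sampler and adds trial pruning, but the overall structure of the search is the same.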
Results and model checkpoints are stored in the `experiments` folder. You can download all results and model checkpoints from here: `experiments.zip`. After downloading, unzip everything into `experiments`.
Notebooks in `notebooks/results_analysis` analyze all the results from `experiments`.