The project explores the performance of various machine learning algorithms on the retweet prediction problem, depending on the features provided.
Tweets are labeled into two classes:
- `0`: tweets retweeted only once
- `1`: tweets retweeted more than once
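The binarization above can be sketched as follows (`retweet_count` is a hypothetical field name used for illustration, not taken from the project's data schema):

```python
def label_tweet(retweet_count: int) -> int:
    """Binarize a tweet's retweet count into the two classes above.

    Returns 0 for tweets retweeted exactly once and 1 for tweets
    retweeted more than once.
    """
    if retweet_count < 1:
        raise ValueError("only retweeted tweets are labeled")
    return 0 if retweet_count == 1 else 1
```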
Types of features:
a) Content features only, extracted by the transformer model `InfoCoV/Cro-CoV-cseBERT`
b) Various tabular features representing Twitter users and their interactions
c) Features a) and b) joined
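Option c) amounts to concatenating the two feature sets row-wise per tweet. A minimal NumPy sketch (the array names and shapes are illustrative assumptions, not the project's actual variables):

```python
import numpy as np

# Hypothetical shapes: 4 tweets, 768-dim BERT content vectors,
# 5 tabular user/interaction features per tweet.
content_features = np.random.rand(4, 768)   # a) content features
tabular_features = np.random.rand(4, 5)     # b) tabular features

# c) joined features: column-wise concatenation, one row per tweet.
joined_features = np.hstack([content_features, tabular_features])
print(joined_features.shape)  # (4, 773)
```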
Tested on Python 3.9.9. Install dependencies (a virtualenv is recommended):

    pip install -r requirements.txt
Original input data: `./data/original/Org-Retweeted-Vectors_preproc.csv`
Create folders for processed data:
- `./data/intermediary/`
- `./data/prepared/`

The notebook `./notebooks/exploration/features_analysis.ipynb` explores the features.
Run the scripts in the `preparation` folder:
- `bert_extract.py`: extracts the content features
- `tran_val_test.py`: creates the train, validation, and test splits for `id_str`
- `feature_preparation.py`: builds the training, validation, and test dataframes without content features
- `full_feature_joing.py`: builds the training, validation, and test dataframes for the joined features and for the content features alone
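The split step above divides tweet IDs into three sets. A stdlib-only sketch of that idea (the 80/10/10 ratios, seed, and function name are assumptions for illustration, not read from the project's scripts):

```python
import random

def split_ids(id_strs, train=0.8, val=0.1, seed=42):
    """Shuffle tweet id_str values and cut them into
    train/validation/test lists (test gets the remainder)."""
    ids = list(id_strs)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train)
    n_val = int(len(ids) * val)
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train_ids, val_ids, test_ids = split_ids([str(i) for i in range(100)])
print(len(train_ids), len(val_ids), len(test_ids))  # 80 10 10
```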
Alternatively, you can download all the original and prepared data from here: `data.zip`. After downloading, unzip everything into `data`.
Notebooks in the `notebooks/baselines` folder investigate MLP and Random Forest runs:
- `content_only_2_labels.ipynb`: content features only
- `features_only_2_labels`: network and tabular features without content features
- `full_features_2_labels`: all features
Run `train.py "some_config.yaml"`, providing one of the configs from the `configs` folder. The config file specifies the model type and all related parameters. The training script runs multiple experiments to find the best hyperparameters using Optuna optimization.
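The hyperparameter search the training script performs with Optuna follows the usual propose/evaluate/keep-best loop. As a dependency-free illustration of that loop only, here is a stdlib random-search sketch; the objective function, parameter names, and search ranges are invented for the example and are not the project's actual ones:

```python
import random

def objective(params):
    # Stand-in validation score; in a real training script this would
    # train a model with `params` and return its validation metric.
    return -(params["lr"] - 0.01) ** 2 - 0.1 * params["depth"]

def random_search(n_trials=50, seed=0):
    """Sample parameter sets at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {
            "lr": rng.uniform(1e-4, 1e-1),   # learning rate
            "depth": rng.randint(1, 8),      # model depth
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search()
print(best_params)
```

Optuna replaces the uniform sampling here with an adaptive sampler and adds trial pruning, but the overall structure of the search is the same.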
Results and model checkpoints are stored in the `experiments` folder. You can download all results and model checkpoints from here: `experiments.zip`. After downloading, unzip everything into `experiments`.
Notebooks in `notebooks/results_analysis` analyze all the results from `experiments`.