This repository contains code and instructions for:
- finetuning the PLMs BERT and RoBERTa
- performing Sequential Sentence Classification (SSC)
- obtaining embeddings from a PLM for further experimentation
- training a Context-Inclusive Model (CIM)
- further pre-training RoBERTa on the BASIL corpus of lexical and informational bias towards entities.
For any questions, please contact esthervdenberg [at] gmail.com.
This repository documents the experiments for a paper on the automatic detection of informational bias towards entities, using neural approaches that take context beyond the sentence into account:
@inproceedings{berg2020context,
author = {Esther van den Berg and Katja Markert},
title = {Context in Informational Bias Detection},
year = {2020},
booktitle = {Proceedings of COLING},
}
- Clone this repository, ensure you are in an environment with Python 3.7, and install the dependencies, including the appropriate CUDA version for PyTorch (the original experiments used 10.1):
git clone https://github.com/vdenberg/context-in-informational-bias-detection.git
cd context-in-informational-bias-detection
conda create -n ciib python=3.7
conda activate ciib
pip install -r requirements.txt
python -m spacy download en_core_web_sm
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
Additionally, append the project directory to your PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:/your/path/to/context-in-informational-bias-detection
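To make this setting persist across sessions, you can append the export line to your shell profile (a minimal sketch, assuming a bash setup and the placeholder path above):
echo 'export PYTHONPATH=$PYTHONPATH:/your/path/to/context-in-informational-bias-detection' >> ~/.bashrc
source ~/.bashrc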
- Download and unzip the BASIL corpus from https://github.com/marshallwhiteorg/emnlp19-media-bias/blob/master/emnlp19-BASIL.zip into context-in-informational-bias-detection/data/emnlp19-BASIL, for example with the commands sketched below.
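One way to do this from the command line (an untested sketch: it assumes wget and unzip are installed and uses GitHub's /raw/ path, which serves the file directly; adjust the target directory if your layout differs):
wget https://github.com/marshallwhiteorg/emnlp19-media-bias/raw/master/emnlp19-BASIL.zip
unzip emnlp19-BASIL.zip -d data/emnlp19-BASIL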
- Preprocess the data:
sh preprocess.sh
- Choose the experiment you're interested in and continue with the instructions below, or, to reproduce the analyses from the paper, run:
python analyses/significance_tests.py
python analyses/performance_analysis.py
# sentence classification with bert
python experiments/finetune_plm.py -clf_task sent_clf -model bert
# sentence classification with roberta
python experiments/finetune_plm.py -clf_task sent_clf -model rob_base
# token classification with bert
python experiments/finetune_plm.py -clf_task tok_clf -model bert
# token classification with roberta
python experiments/finetune_plm.py -clf_task tok_clf -model rob_base
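To run all four fine-tuning configurations above in sequence, a simple shell loop over the flags already shown works (a convenience sketch, nothing repository-specific is added):
for task in sent_clf tok_clf; do
  for model in bert rob_base; do
    python experiments/finetune_plm.py -clf_task $task -model $model
  done
done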
# SSC (no window) with a sequence length of 5
python experiments/finetune_plm.py -clf_task seq_sent_clf -seq_len 5
# Window SSC with sequence length of 10
python experiments/finetune_plm.py -clf_task seq_sent_clf -seq_len 10 -win
# prepare embeddings
python experiments/finetune_plm.py -clf_task sent_clf -model rob_base -sv 49 -embeds
# ArtCIM
python experiments/context_inclusive.py -context art -cim_type cim
# ArtCIM*
python experiments/context_inclusive.py -context art -cim_type cim*
# EvCIM
python experiments/context_inclusive.py -context ev -cim_type cim
# EvCIM*
python experiments/context_inclusive.py -context ev -cim_type cim*
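To run all four CIM variants in one pass, you can loop over the flags above (a convenience sketch; cim* is quoted so the shell does not expand the asterisk):
for ctx in art ev; do
  for cim in cim 'cim*'; do
    python experiments/context_inclusive.py -context $ctx -cim_type "$cim"
  done
done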
- Clone https://github.com/allenai/dont-stop-pretraining into the experiments directory.
- Follow the install instructions of https://github.com/allenai/dont-stop-pretraining.
- Run the following commands (from within the cloned dont-stop-pretraining directory, which the relative paths assume) to get the BASIL-adapted TAPT and DAPT+TAPT models:
python -m scripts.run_language_modeling --train_data_file ../../data/inputs/tapt/basil_train.txt \
    --line_by_line \
    --output_dir roberta-basil-tapt \
    --model_type roberta-base \
    --tokenizer_name roberta-base \
    --mlm \
    --per_gpu_train_batch_size 6 \
    --gradient_accumulation_steps 6 \
    --model_name_or_path roberta-base \
    --do_eval \
    --eval_data_file ../../data/inputs/tapt/basil_test.txt \
    --evaluate_during_training \
    --do_train \
    --num_train_epochs 100 \
    --learning_rate 0.0001 \
    --logging_steps 50

python -m scripts.run_language_modeling --train_data_file ../../data/inputs/tapt/basil_train.txt \
    --line_by_line \
    --output_dir roberta-basil-dapttapt \
    --model_type roberta-base \
    --tokenizer_name roberta-base \
    --mlm \
    --per_gpu_train_batch_size 6 \
    --gradient_accumulation_steps 6 \
    --model_name_or_path ../pretrained_models/news_roberta_base \
    --do_eval \
    --eval_data_file ../../data/inputs/tapt/basil_test.txt \
    --evaluate_during_training \
    --do_train \
    --num_train_epochs 100 \
    --learning_rate 0.0001 \
    --logging_steps 50
- Run the following command to get source-adapted models (for DAPT+TAPT, specify --output_dir roberta-fox-daptapt and --model_name_or_path ../pretrained_models/news_roberta_base instead; the assembled command is shown below):

python -m scripts.run_language_modeling --train_data_file ../../data/inputs/tapt/fox_train.txt \
    --line_by_line \
    --output_dir roberta-fox-tapt \
    --model_type roberta-base \
    --tokenizer_name roberta-base \
    --mlm \
    --per_gpu_train_batch_size 6 \
    --gradient_accumulation_steps 6 \
    --model_name_or_path roberta-base \
    --do_eval \
    --eval_data_file ../../data/inputs/tapt/basil_fox_test.txt \
    --evaluate_during_training \
    --do_train \
    --num_train_epochs 150 \
    --learning_rate 0.0001 \
    --logging_steps 50 \
    --save_total_limit 2 \
    --overwrite_output_dir
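For reference, the DAPT+TAPT variant assembled from the command above with the two substitutions mentioned (everything else unchanged):

python -m scripts.run_language_modeling --train_data_file ../../data/inputs/tapt/fox_train.txt \
    --line_by_line \
    --output_dir roberta-fox-daptapt \
    --model_type roberta-base \
    --tokenizer_name roberta-base \
    --mlm \
    --per_gpu_train_batch_size 6 \
    --gradient_accumulation_steps 6 \
    --model_name_or_path ../pretrained_models/news_roberta_base \
    --do_eval \
    --eval_data_file ../../data/inputs/tapt/basil_fox_test.txt \
    --evaluate_during_training \
    --do_train \
    --num_train_epochs 150 \
    --learning_rate 0.0001 \
    --logging_steps 50 \
    --save_total_limit 2 \
    --overwrite_output_dir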