Rabiul Awal, Roy Ka-Wei Lee, Eshaan Tanwar, Tanmay Garg, Tanmoy Chakraborty
We propose HateMAML, a model-agnostic meta-learning-based framework that effectively performs hate speech detection in low-resource languages. HateMAML utilizes a self-supervision strategy to overcome the limitation of data scarcity and produces better LM initialization for fast adaptation to an unseen target language (i.e., cross-lingual transfer) or other hate speech datasets (i.e., domain generalization).
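At its core, HateMAML adapts a shared LM initialization with MAML-style inner and outer updates over per-language tasks. The sketch below is a minimal, hypothetical first-order version in PyTorch (the function and argument names are ours for illustration, not the repository's API):

import copy
import torch

def meta_step(model, support, query, loss_fn, fast_lr=1e-4, meta_lr=1e-5):
    """One first-order MAML-style meta-update on a single (support, query) task."""
    # Inner loop: adapt a temporary copy of the model on the support batch.
    fast_model = copy.deepcopy(model)
    inner_opt = torch.optim.SGD(fast_model.parameters(), lr=fast_lr)
    inner_loss = loss_fn(fast_model(support["x"]), support["y"])
    inner_opt.zero_grad()
    inner_loss.backward()
    inner_opt.step()

    # Outer step (first-order approximation): gradients of the query loss with
    # respect to the adapted copy are applied directly to the shared initialization.
    query_loss = loss_fn(fast_model(query["x"]), query["y"])
    grads = torch.autograd.grad(query_loss, fast_model.parameters())
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.sub_(meta_lr * g)
    return query_loss.item()

In HateMAML, such meta-steps are taken over tasks sampled from the source and auxiliary languages, so that the resulting initialization adapts quickly to an unseen target language.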
Follow these steps to prepare your dataset for HateMAML:
- Save your data as a pickled Pandas DataFrame in the data/raw/ directory, formatted as described in the accompanying README.md (see the sketch after this list).
- Process your data by running scripts/prepare_data.sh, which converts the raw data into a processed format suitable for HateMAML.
- Load your data using the provided scripts in the src directory to begin training or evaluation.
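For reference, here is a minimal sketch of saving a raw dataset in the expected pickled-DataFrame form. The column names ("text", "label") and the file name are assumptions for illustration; follow the exact schema described in the accompanying README.md:

import pandas as pd

# Hypothetical example: the column names and file name are assumptions;
# use the schema from the accompanying README.md.
df = pd.DataFrame(
    {
        "text": ["example hateful post", "example benign post"],
        "label": [1, 0],
    }
)
df.to_pickle("data/raw/my_dataset.pkl")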
- HateMAML (zero-shot): Similar to X-MAML, but it uses both source- and auxiliary-language samples for meta-training, whereas X-MAML uses only auxiliary-language samples.
- HateMAML (self-training): Choose a target language for self-training on silver data. We first generate silver labels and keep only 300 samples, then apply HateMAML iteratively to the silver-labeled target-language data, as explained in Section III-D (see the sketch after this list).
- HateMAML (fine-tune): We generate meta-tasks from a set of available languages and apply HateMAML. This is very similar to fine-tuning on the aggregated training examples from the selected language set.
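To make the self-training variant concrete, the sketch below shows one plausible way to generate silver labels with a fine-tuned checkpoint and keep the 300 most confident predictions. The checkpoint path, file names, and confidence-based selection are illustrative assumptions; the exact procedure is the one described in Section III-D of the paper:

import pandas as pd
from transformers import pipeline

# Hypothetical checkpoint path and file names; adjust to your setup.
clf = pipeline(
    "text-classification",
    model="runs/finetune/founta/en/en_ft/xlm-roberta-base/seed1",
)

unlabeled = pd.read_pickle("data/raw/target_lang_unlabeled.pkl")  # assumed "text" column
preds = clf(unlabeled["text"].tolist(), truncation=True)

unlabeled["silver_label"] = [p["label"] for p in preds]
unlabeled["confidence"] = [p["score"] for p in preds]

# Keep the 300 most confident silver-labeled samples for meta-training.
silver = unlabeled.sort_values("confidence", ascending=False).head(300)
silver.to_pickle("data/processed/target_lang_silver.pkl")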
To fine-tune baseline models, use the commands provided in scripts/finetune.sh.
For example, to fine-tune on the English dataset with the pre-trained XLM-RoBERTa model:
MODEL_PATH=xlm-roberta-base
BATCH=32
LR=5e-5
DATASET=founta
FT_CHOICE=fft
python3 src/maml/finetune.py --source_lang en --target_lang en --num_train_epochs 7 --meta_lr $LR \
--seed 1 --model_name_or_path $MODEL_PATH --device_id 1 --batch_size $BATCH --dataset_name $DATASET \
--finetune_type $FT_CHOICE
For the HateMAML experiments, use the commands in scripts/meta_train.sh corresponding to Tables II, III, and IV in our paper.
For instance, to run meta-training in the zero-shot setting (Table II):
MODEL_PATH=xlm-roberta-base
MODEL_TYPE=xlm-roberta-base   # used in the fine-tuned base checkpoint path below
EXP_SETTING=hmaml_scale
BATCH=32
SHOTS=32
BASE_DATASET=founta
DATASET=semeval2020
SAMPLE_SZ=500
TYPE=zeroshot
# Example values for the remaining variables; see scripts/meta_train.sh for the
# settings used in the paper, and adjust them for your run.
A_LANG=da        # auxiliary language code
T_LANG=ar        # target language code
M_EPOCHS=10      # number of meta-training iterations
LR=5e-5          # meta (outer-loop) learning rate
FAST_LR=1e-4     # fast (inner-loop) learning rate
ID=1             # random seed
python3 src/maml/hmaml_mixer.py --source_lang en --aux_lang $A_LANG --target_lang $T_LANG --num_train_epochs 5 \
--num_meta_iterations $M_EPOCHS --meta_lr $LR --fast_lr $FAST_LR --seed $ID --model_name_or_path $MODEL_PATH \
--base_model_path runs/finetune/${BASE_DATASET}/en/en_ft/${MODEL_TYPE}/seed1 --shots $SHOTS --batch_size $BATCH \
--exp_setting $EXP_SETTING --device_id 0 --dataset_name $DATASET --num_meta_samples $SAMPLE_SZ --metatune_type $TYPE
If you use this code, please cite our paper:
@article{awal2023model,
title={Model-agnostic meta-learning for multilingual hate speech detection},
author={Awal, Md Rabiul and Lee, Roy Ka-Wei and Tanwar, Eshaan and Garg, Tanmay and Chakraborty, Tanmoy},
journal={IEEE Transactions on Computational Social Systems},
year={2023},
publisher={IEEE}
}