🔍 This is the official code and data repository for the EMNLP 2023 findings paper titled "Mitigating Biases in Hate Speech Detection from A Causal Perspective". Authors: Zhehao Zhang, Jiaao Chen, Diyi Yang
Abstract
Nowadays, many hate speech detectors are built to automatically detect hateful content. However, their training sets are sometimes skewed towards certain stereotypes (e.g., race or religion-related). As a result, the detectors are prone to depend on some shortcuts for predictions. Previous works mainly focus on token-level analysis and heavily rely on human experts' annotations to identify spurious correlations, which is not only costly but also incapable of discovering higher-level artifacts. In this work, we use grammar induction to find grammar patterns for hate speech and analyze this phenomenon from a causal perspective. Concretely, we categorize and verify different biases based on their spuriousness and influence on the model prediction. Then, we propose two mitigation approaches including Multi-Task Intervention and Data-Specific Intervention based on these confounders. Experiments conducted on 9 hate speech datasets demonstrate the effectiveness of our approaches.
All datasets are sourced from hatespeechdata.com.
📥 Download Data:
- Multimodal Meme Dataset: Download
- Hate Speech Dataset from a White Supremacist Forum: Download
- Fox News User Comments: Download
- HASOC19: Download
- ETHOS: Download
- Twitter Sentiment Analysis: Download
- Anatomy of Online Hate: Download
- Hate Speech on Twitter: Download
- Hate-Offensive (AHS): Download
- For the Multi-Task Intervention (MTI), we utilize two methods: masked language modeling (MLM) and multi-task learning (MTL). The implementation of MLM is adapted from the code provided in the Hugging Face documentation on Masked Language Modeling.
To execute the script, use the command below:
python masked_LM.py [arguments]
- `--device`: Device to use (e.g., 'cuda' or 'cpu'). Defaults to 'cuda' if available, else 'cpu'.
- `--model_checkpoint`: Model checkpoint for initialization. Default: 'bert-base-cased'.
- `--output_dir`: Directory where the trained model will be saved. Default: 'MLM_large_corpus'.
- `--evaluation_strategy`: Evaluation strategy. Default: 'epoch'.
- `--learning_rate`: Learning rate for training. Default: 2e-5.
- `--weight_decay`: Weight decay during training. Default: 0.01.
- `--train_batch_size`: Batch size during training. Default: 64.
- `--eval_batch_size`: Batch size during evaluation. Default: 64.
- `--fp16`: Flag to enable half-precision training. Default: False.
- `--datasets`: Comma-separated list of dataset names to use. This argument is required.
- `--logging_divisor`: Controls logging frequency; logs are generated every `len(train_data) // logging_divisor` steps. Default: 4.
- `--num_epochs`: Number of training epochs. Default: 5.
- `--save_strategy`: Strategy for saving the model. Default: 'epoch'.
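For example, a run might look like the following (the dataset names passed to `--datasets` are placeholders; use the names your copies of the datasets are stored under):

```bash
python masked_LM.py \
    --datasets ethos,hasoc19,fox_news \
    --model_checkpoint bert-base-cased \
    --output_dir MLM_large_corpus \
    --num_epochs 5 \
    --train_batch_size 64
```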
For MTL, you can simply concatenate the 9 datasets above and randomly shuffle the combined corpus before feeding it into the basic classification pipeline, as sketched below.
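A minimal sketch of building the shuffled corpus (file names and column layout are assumptions; adjust to how you stored the downloaded datasets):

```python
import pandas as pd

dataset_files = [
    "ethos.csv",
    "hasoc19.csv",
    "fox_news_comments.csv",
    # ... the remaining six datasets
]

# Concatenate all datasets into one large corpus
frames = [pd.read_csv(path) for path in dataset_files]
corpus = pd.concat(frames, ignore_index=True)

# Shuffle with a fixed seed so the result is reproducible
corpus = corpus.sample(frac=1.0, random_state=42).reset_index(drop=True)
corpus.to_csv("large_corpus.csv", index=False)
```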
- To perform Data-Specific Intervention (DSI), we first need to identify and mitigate the two types of correlations described in Section 3.1: Token-level Spurious Correlations and Sentence-level Spurious Correlations.
python token_bias.py [arguments]
- `--data_file`: Path to the dataset file.
- `--tokenizer`: The tokenizer to use. We use a simple word tokenizer in this work.
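One common way to surface token-level spurious correlations is to rank tokens by their pointwise mutual information (PMI) with the hateful label. The sketch below illustrates that idea; it is not necessarily the exact computation in `token_bias.py`, and the column names and 0/1 label coding are assumptions:

```python
import math
from collections import Counter
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical path; columns "text" and "label" assumed
p_hate = (df["label"] == 1).mean()

token_counts, hate_token_counts = Counter(), Counter()
for text, label in zip(df["text"], df["label"]):
    for tok in set(str(text).lower().split()):  # simple word tokenizer
        token_counts[tok] += 1
        if label == 1:
            hate_token_counts[tok] += 1

def pmi(tok):
    # PMI(token, hate) = log[ P(hate | token) / P(hate) ]
    return math.log((hate_token_counts[tok] / token_counts[tok]) / p_hate)

# Only rank reasonably frequent tokens that co-occur with the hateful label
frequent = [t for t, c in token_counts.items() if c >= 20 and hate_token_counts[t] > 0]
for tok in sorted(frequent, key=pmi, reverse=True)[:20]:
    print(tok, round(pmi(tok), 3))
```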
To find the Sentence-level Spurious Correlations, we follow previous work (Finding Dataset Shortcuts with Grammar Induction, EMNLP 2022) and utilize the code from the original ShortcutGrammar GitHub repository.
To integrate this into your workflow:
- PCFG Grammar Induction: begin by using the hate speech data to perform PCFG (Probabilistic Context-Free Grammar) grammar induction. This can be done with the scripts (src/run.py) provided in the repository above.
- Visualizing Sentence-level Biases: once you have obtained the output grammar from the PCFG grammar induction, use the `sentence_bias.ipynb` notebook to visualize and analyze the sentence-level biases in the data.
After that, you can use an LLM to generate counterfactual data according to the prompt design in Appendix A.4, as sketched below.
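A minimal sketch of counterfactual generation, assuming the openai v1 Python client with an `OPENAI_API_KEY` in the environment; the model choice is an assumption, and the prompt below is only a placeholder for the actual prompt design given in Appendix A.4:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def make_counterfactual(text: str) -> str:
    # Placeholder prompt -- substitute the prompt design from Appendix A.4
    prompt = (
        "Rewrite the following sentence as a counterfactual example.\n"
        f"Sentence: {text}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # model choice is an assumption
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```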
- After combining the counterfactual data with the original dataset and obtaining the model checkpoint from MTI, we can use the following code to perform hate speech detection (formulated as a binary classification problem).
To train the hate speech detection model, use the following script:
python classification.py [arguments]
- `--seed`: Seed for random number generation to ensure reproducibility. Default: 60.
- `--epoch`: Number of training epochs. Default: 3.
- `--bz`: Batch size for training and evaluation. Default: 10.
- `--data`: The dataset to use. It should match the folder name.
- `--lr`: Learning rate for the optimizer. Default: 2e-5.
- `--wd`: Weight decay for the optimizer. Default: 0.
- `--norm`: Max gradient norm for gradient clipping. Default: 0.8.
- `--gpu`: GPU device ID for CUDA_VISIBLE_DEVICES. Default: '6'.
- `--model_checkpoint`: Pretrained model checkpoint for initialization. Default: 'bert-base-cased'. Change it to the model obtained after MTI for the best performance.
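For example, to fine-tune from the MTI checkpoint produced by `masked_LM.py` (the `<dataset_folder>` value is a placeholder for your dataset's folder name):

```bash
python classification.py \
    --data <dataset_folder> \
    --model_checkpoint MLM_large_corpus \
    --seed 60 \
    --epoch 3 \
    --bz 10
```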
Utilize Weights & Biases for hyperparameter tuning by defining and running a wandb sweep in your script. Configure the sweep with your desired parameters, such as learning rate and batch size, and use wandb's dashboard to monitor and analyze the results. Detailed instructions and examples can be found in the official wandb documentation.
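A minimal sweep sketch; the metric name, project name, and value grids are illustrative, and the `train()` body is assumed to wrap the training code in `classification.py` and log the metric the sweep optimizes:

```python
import wandb

sweep_config = {
    "method": "grid",
    "metric": {"name": "macro_f1", "goal": "maximize"},
    "parameters": {
        "lr": {"values": [1e-5, 2e-5, 5e-5]},
        "bz": {"values": [10, 16, 32]},
    },
}

def train():
    wandb.init()
    lr, bz = wandb.config.lr, wandb.config.bz
    # ... run training/evaluation with these hyperparameters ...
    wandb.log({"macro_f1": 0.0})  # replace with the actual evaluation result

sweep_id = wandb.sweep(sweep_config, project="hate-speech-debiasing")
wandb.agent(sweep_id, function=train)
```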
Once the model is trained, you can evaluate its performance on the test set using metrics such as accuracy, micro-F1, and macro-F1. The evaluation runs automatically at the end of the training script and the results are displayed; a minimal sketch of the metric computation is given below.
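The metrics can be computed with scikit-learn (the labels and predictions here are placeholders):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 1, 0]   # gold labels (placeholder)
y_pred = [0, 1, 0, 0]   # model predictions (placeholder)

print("accuracy:", accuracy_score(y_true, y_pred))
print("micro-F1:", f1_score(y_true, y_pred, average="micro"))
print("macro-F1:", f1_score(y_true, y_pred, average="macro"))
```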
- The trained model can be saved with Hugging Face Transformers' `save_pretrained` method and loaded with `from_pretrained` for further analysis or deployment.
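For instance (the checkpoint directory name is arbitrary):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# Save the fine-tuned checkpoint
model.save_pretrained("debiased_hate_speech_model")
tokenizer.save_pretrained("debiased_hate_speech_model")

# Later, reload it for analysis or deployment
model = AutoModelForSequenceClassification.from_pretrained("debiased_hate_speech_model")
tokenizer = AutoTokenizer.from_pretrained("debiased_hate_speech_model")
```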
If you find our work useful, please consider citing our paper:
@inproceedings{zhang-etal-2023-mitigating,
title = "Mitigating Biases in Hate Speech Detection from A Causal Perspective",
author = "Zhang, Zhehao and
Chen, Jiaao and
Yang, Diyi",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.440",
pages = "6610--6625",
abstract = "Nowadays, many hate speech detectors are built to automatically detect hateful content. However, their training sets are sometimes skewed towards certain stereotypes (e.g., race or religion-related). As a result, the detectors are prone to depend on some shortcuts for predictions. Previous works mainly focus on token-level analysis and heavily rely on human experts{'} annotations to identify spurious correlations, which is not only costly but also incapable of discovering higher-level artifacts. In this work, we use grammar induction to find grammar patterns for hate speech and analyze this phenomenon from a causal perspective. Concretely, we categorize and verify different biases based on their spuriousness and influence on the model prediction. Then, we propose two mitigation approaches including Multi-Task Intervention and Data-Specific Intervention based on these confounders. Experiments conducted on 9 hate speech datasets demonstrate the effectiveness of our approaches.",
}
For any inquiries or further information regarding this project, feel free to reach out to the lead author:
- Zhehao Zhang: zhehao.zhang.gr@dartmouth.edu | Website
We welcome questions, feedback, and collaboration requests!