A Collaborative Content Moderation Framework for Toxicity Detection based on Conformalized Estimates of Annotation Disagreement
Figure 1: Review process. The STL review process begins by extracting the representation CLS(xi) of the content xi and generating its corresponding classification ŷi. After calibrating the CP algorithm, we generate the prediction set C(xi) for classification. If the size of C(xi) is 2 (i.e., the total number of possible classes in the problem), the comment is flagged for review by a human moderator. Otherwise, the classifier’s output is considered confident, and ŷi is deemed a reliable prediction of the toxicity of xi. For the composite STL models, in addition to generating the classification ŷi, we also compute the annotation disagreement estimate di(xi), a regression value learned from X and A. Once the CP algorithm for regression is calibrated, we produce the confidence interval I(xi). If this interval exceeds a predefined ambiguity threshold γ, the comment is marked for human review. Otherwise, we rely on C(xi) to assess the confidence of the classifier’s output. In the MTL approach, the only difference from the composite model is that both ŷi and di(xi) are generated from the same representation CLS(xi) of the content.
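To make the review flow above concrete, here is a minimal sketch in Python of the decision rule. The helper name is hypothetical, and "the interval exceeds γ" is read here as its width exceeding γ, which is an assumption about the exact criterion rather than the repository's implementation.

def needs_human_review(pred_set, interval=None, gamma=None):
    """Route a comment to a human moderator or trust the automatic prediction.

    pred_set: conformal prediction set C(x) over the classes {non-toxic, toxic}
    interval: conformal confidence interval I(x) for the disagreement estimate (composite/MTL models)
    gamma:    moderator-chosen ambiguity threshold
    """
    if interval is not None and gamma is not None:
        lower, upper = interval
        # Assumption: "the interval exceeds gamma" is read as its width exceeding gamma.
        if (upper - lower) > gamma:
            return True
    # A prediction set containing both classes means the classifier is not confident.
    return len(pred_set) == 2

# Example: a singleton prediction set and a narrow disagreement interval -> no review needed.
print(needs_human_review(pred_set=[1], interval=(0.10, 0.18), gamma=0.2))  # False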
Content moderation typically combines the efforts of human moderators and machine learning models. However, these systems often rely on data where significant disagreement occurs during moderation, reflecting the subjective nature of toxicity perception. Rather than dismissing this disagreement as noise, we interpret it as a valuable signal that highlights the inherent ambiguity of the content—an insight missed when only the majority label is considered. In this work, we introduce a novel content moderation framework that emphasizes the importance of capturing annotation disagreement. Our approach uses multitask learning, where toxicity classification serves as the primary task and annotation disagreement is addressed as an auxiliary task. Additionally, we leverage uncertainty estimation techniques, specifically Conformal Prediction, to account for both the ambiguity in comment annotations and the model's inherent uncertainty in predicting toxicity and disagreement. The framework also allows moderators to adjust thresholds for annotation disagreement, offering flexibility in determining when ambiguity should trigger a review. We demonstrate that our joint approach enhances model performance, calibration, and uncertainty estimation, while offering greater parameter efficiency and improving the review process in comparison to single-task methods.
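As an illustration of the multitask setup described above (toxicity classification as the primary task and disagreement regression as the auxiliary task, both computed from the same CLS(x) representation), a minimal PyTorch sketch could look like the following. The class name, encoder choice, and layer sizes are illustrative assumptions, not the paper's exact architecture.

import torch.nn as nn
from transformers import AutoModel

class MultiTaskToxicityModel(nn.Module):
    # Hypothetical dual-head model: both heads share the encoder's [CLS] representation.
    def __init__(self, model_name="bert-base-uncased", num_classes=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.classifier = nn.Linear(hidden, num_classes)  # toxicity logits (primary task)
        self.regressor = nn.Linear(hidden, 1)             # annotation disagreement estimate (auxiliary task)

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.classifier(cls), self.regressor(cls).squeeze(-1)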
This project requires Python ≥ 3.10. We have tested it on Ubuntu 22.04 with PyTorch 2.2 and CUDA 12.1.0.
Local environment:
- Create a new virtualenv environment with:
virtualenv venv --python=python3.10
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Docker version:
- With Docker installed and from the folder containing the docker-compose.yml, run:
sudo docker compose up
To access the dataset via Hugging Face, you need your Hugging Face token available in the HUGGINGFACE_TOKEN environment variable.
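For example, in a bash shell (the token value below is a placeholder):
export HUGGINGFACE_TOKEN=<your_token>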
Scheme of folder structure
├── images # Collaborative Content Moderation Diagram
├── src # Main folder
│ ├── data # Dataset folder
│ │ ├── final # Clean dataset
│ │ ├── processed # Intermediate processed dataset
│ │ └── raw # Raw dataset
│ ├── models # Folder for model checkpoints
│ ├── notebooks # Dataset analysis folder
│ ├── results # Metric store folder
│ └── src # Main code
│ ├── config # Training configuration files
│ ├── data # Code for data processing
│ ├── models # Code for model training and metrics
│ ├── utils # Code for supporting models
│ ├── visualization # Code for visuals generation
├── docker-compose.yml
├── Dockerfile
├── LICENSE
├── README.md
└── requirements.txt
Our project has three main modules you need to know about: the dataset, model training, and results.
We provide two options for obtaining the dataset: downloading it from Hugging Face or preprocessing it locally.
- To download the cleaned dataset, fetch it from Hugging Face using the reference TheMrguiller/Uncertainty_Toxicity (see the loading sketch after this list).
- To download and preprocess the raw data yourself, run:
python src/data/download_datasets.py
python src/data/data_preprocess.py
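As referenced above, the cleaned dataset can also be loaded directly in Python with the Hugging Face datasets library. This is a minimal sketch, assuming a recent datasets version whose load_dataset accepts a token argument and that HUGGINGFACE_TOKEN is set:

import os
from datasets import load_dataset

# Load the cleaned dataset published on the Hugging Face Hub.
dataset = load_dataset(
    "TheMrguiller/Uncertainty_Toxicity",
    token=os.environ.get("HUGGINGFACE_TOKEN"),
)
print(dataset)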
In src/config we provide parameterized model-training YAML files for generating the required models, including the configurations used to train the STL and MTL models reported in the article.
The fields of the YAML files are described below:
model_name: # Hugging Face model name
learning_rate: # Learning rate for the optimizer
weight_decay: # Weight decay (L2 regularization) coefficient
accumulate_grad_batches: # Number of batches over which gradients are accumulated
experiment_name: # A list of the classification losses to use for the toxicity detection models: focal_loss_Weighted or focal_loss
task: # A list of the tasks to perform: classification, regression or multitask
disagreement: # A list of the disagreement computation techniques to use: distance, variance or entropy
regression_name: # A list of the regression losses to use for training: BCE, WeightedBCE, MSE, WeightedMSE, R2ccpLoss or WeightedR2ccpLoss
dataset_name: # If the Hugging Face dataset is used, set "TheMrguiller/Uncertainty_Toxicity"
train_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_train_data_clean.csv"
test_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_test_data_clean.csv"
valid_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_valid_data_clean.csv"
data_batch_size: # Batch size for training
num_nodes: # Number of nodes used for training
fast_dev_run: # Whether to run a quick test pass instead of a full training run
num_workers: # Number of workers for dataset preparation
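For reference, a filled-in configuration might look like the sketch below. Every value is an illustrative placeholder (including the model name), not the exact settings used in the article:

model_name: "bert-base-uncased"        # illustrative placeholder
learning_rate: 2e-5                    # illustrative placeholder
weight_decay: 0.01                     # illustrative placeholder
accumulate_grad_batches: 1             # illustrative placeholder
experiment_name: ["focal_loss"]
task: ["multitask"]
disagreement: ["entropy"]
regression_name: ["MSE"]
dataset_name: "TheMrguiller/Uncertainty_Toxicity"
train_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_train_data_clean.csv"
test_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_test_data_clean.csv"
valid_file: "data/processed/jigsaw-unintended-bias-in-toxicity-classification_valid_data_clean.csv"
data_batch_size: 16                    # illustrative placeholder
num_nodes: 1
fast_dev_run: false
num_workers: 4                         # illustrative placeholder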
Once the YAML is configured, you can start training with:
python src/src/models/train.py
To obtain the metrics for the classification models and the content moderation framework, use the following scripts:
- Conformal Prediction for Classification:
python src/src/models/get_class_cp_metrics.py
- Conformal Prediction for Regression:
python src/src/models/get_regre_cp_metrics.py
- Collaborative Content Moderation performance metrics:
python src/src/models/get_class_cp_metrics.py
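For readers unfamiliar with how the conformal thresholds behind these metrics are typically obtained, here is a minimal, generic split-conformal sketch for the classification case. The function names and the 1 − softmax nonconformity score are standard choices used for illustration, not necessarily this repository's exact implementation:

import numpy as np

def conformal_quantile(true_class_probs, alpha=0.1):
    # Nonconformity score on the calibration set: 1 - softmax probability of the true class.
    n = len(true_class_probs)
    scores = 1.0 - np.asarray(true_class_probs)
    # Finite-sample corrected quantile level for (1 - alpha) marginal coverage.
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, q_level, method="higher")

def prediction_set(softmax_probs, qhat):
    # Keep every class whose nonconformity score is below the calibrated threshold.
    return [c for c, p in enumerate(softmax_probs) if 1.0 - p <= qhat]

# Calibration: softmax probabilities assigned to the true class of held-out comments.
qhat = conformal_quantile([0.95, 0.90, 0.30, 0.99, 0.85], alpha=0.1)
print(prediction_set([0.55, 0.45], qhat))  # [0, 1]: both classes, so this comment would be flagged for review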
This codebase is released under the MIT License.
If you find our work useful, please consider citing us!
@article{,
title = {A Collaborative Content Moderation Framework for
Toxicity Detection based on Conformalized Estimates of
Annotation Disagreement},
author = {Guillermo Villate-Castillo and Javier del Ser and Borja Sanz},
year = {2024},
journal = {}
}