adversarial-hatespeech

Warning

Content warning: Hateful language. Due to the nature of the task tackled in this project, the report and the accompanying code and data contain hateful words and phrases that may be upsetting. To avoid confusion with adversarial text produced through methods introduced in this project, I opted not to censor these hateful terms. Reader discretion is advised.

Installation

$ pip install transformers nltk lime pipreqs
$ pip install --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
  • nltk for detokenization (see the sketch at the end of this section)
  • lime for explaining model predictions
  • pipreqs for creating requirements.txt

Alternatively (not tested):

$ pip install -r requirements.txt
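
For orientation, nltk is used to turn the dataset's token lists back into plain strings before attacking. A minimal sketch of the standard nltk detokenizer call (the example sentence is illustrative, not from the data):

from nltk.tokenize.treebank import TreebankWordDetokenizer

# Rebuild a plain string from a list of tokens, reattaching punctuation.
tokens = ["this", "is", "a", "tokenized", "sentence", "."]
text = TreebankWordDetokenizer().detokenize(tokens)
print(text)  # this is a tokenized sentence.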

Usage

  • All important functions are documented with docstrings.

Preparation

  • Clone the Hugging Face repo of the HateXplain model (a hedged loading sketch follows):

$ git clone https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain-rationale-two
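
The cloned directory can then be loaded locally. This is a hedged sketch using the generic transformers auto classes; note that the model card for this checkpoint defines its own model class with a rationale head, so the loading code actually used in this repo may differ. The predict_proba helper is a name invented here for the sketches that follow:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_DIR = "bert-base-uncased-hatexplain-rationale-two"  # the cloned repo

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)
model.eval()

def predict_proba(texts):
    """Return class probabilities for a list of strings.
    Assumed two classes: normal vs. hate speech."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).numpy()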

1. Find adversarial examples

  • Run batchscripts/attack.sh
  • Resulting adversarial examples are saved in data/attacks_val_no-letters.json (already done in this repo)
  • Remove the --lime flag to use brute-force attacks instead (a sketch of the brute-force idea follows this list)
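
The brute-force variant is conceptually simple: score every possible single-character substitution and keep the most damaging one. Below is a minimal sketch of that idea, reusing the hypothetical predict_proba helper from the preparation step; the actual candidate character set and all options live in the code that batchscripts/attack.sh invokes (the no-letters part of the output filename suggests the replacement alphabet is restricted there):

import string

HATE = 1  # assumed index of the hate-speech class in predict_proba's output

def best_single_char_attack(text, predict_proba, charset=string.ascii_lowercase):
    """Try every single-character substitution and keep the candidate
    that lowers the hate-class probability the most."""
    best_text = text
    best_prob = predict_proba([text])[0][HATE]
    for i in range(len(text)):
        for c in charset:
            if c == text[i]:
                continue
            candidate = text[:i] + c + text[i + 1:]
            prob = predict_proba([candidate])[0][HATE]
            if prob < best_prob:
                best_text, best_prob = candidate, prob
    return best_text, best_prob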

2. Analyze adversarial examples for stats

  • Run batchscripts/analyze.sh
  • Results are printed to the terminal and saved in outputs/analyze_val_no-letters.txt (already done in this repo); the sketch below shows the kind of aggregation involved
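
As a rough illustration of the analysis step, the snippet below computes the share of successful attacks and the mean confidence drop. The JSON field names are illustrative, not the repo's actual schema:

import json

with open("data/attacks_val_no-letters.json") as f:
    attacks = json.load(f)

# Hypothetical field names, for illustration only.
flipped = [a for a in attacks if a["label_flipped"]]
drops = [a["prob_before"] - a["prob_after"] for a in attacks]

print(f"attacks: {len(attacks)}")
print(f"label flips: {len(flipped)} ({len(flipped) / len(attacks):.1%})")
print(f"mean probability drop: {sum(drops) / len(drops):.3f}")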

2.5 (Optional) Explain adversarial examples with LIME

  • Run batchscripts/explain.sh
  • Explanations are added to the existing data/attacks_val_no-letters.json (a minimal LIME sketch follows)
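
The explanations come from lime's standard text explainer. A minimal sketch using the real lime API together with the hypothetical predict_proba wrapper from the preparation step:

from lime.lime_text import LimeTextExplainer

explainer = LimeTextExplainer(class_names=["normal", "hate speech"])
explanation = explainer.explain_instance(
    "an attacked example sentence",  # placeholder text
    predict_proba,    # must map a list of strings to an (n_samples, 2) array
    num_features=10,  # number of tokens to include in the explanation
)
print(explanation.as_list())  # [(token, weight), ...]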

3. Test adversarial examples on test split

  • Run batchscripts/test.sh
  • Unsuccessful attacks are saved in data/test_val_no-letters_unsuccessful.json (already done in this repo); a rough sketch of the success check follows
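
How exactly the attacks are evaluated is defined in batchscripts/test.sh and the code it calls; as a rough illustration of the success check, with the same hypothetical schema and predict_proba helper as above:

import json

with open("data/attacks_val_no-letters.json") as f:
    attacks = json.load(f)

# Illustrative field names: keep attacks that no longer fool the model,
# i.e. the adversarial text is still classified as hate speech.
unsuccessful = [
    a for a in attacks
    if predict_proba([a["adversarial_text"]])[0][1] >= 0.5
]

with open("data/test_val_no-letters_unsuccessful.json", "w") as f:
    json.dump(unsuccessful, f, indent=2)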

About

This small university project explores how deliberately targeting and changing a single character in a hateful text can cut the performance of a hate speech detection model in half.
