Warning
Content warning: Hateful language. Due to the nature of the task tackled in this project, the report and the accompanying code and data contain hateful words and phrases that may be upsetting. To avoid confusion with adversarial text produced through methods introduced in this project, I opted not to censor these hateful terms. Reader discretion is advised.
$ pip install transformers nltk lime pipreqs
$ pip install --no-cache-dir torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
- `nltk` for detokenizing
- `lime` for explaining
- `pipreqs` for creating `requirements.txt`
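As an example of the detokenizing role, NLTK's `TreebankWordDetokenizer` can rebuild a sentence string from a token list (a minimal sketch; the tokens shown are hypothetical, the real pipeline detokenizes the attacked samples):

```python
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Rejoin a list of tokens into a normal sentence string,
# reattaching punctuation without a preceding space.
detok = TreebankWordDetokenizer()
tokens = ["this", "is", "a", "sample", "sentence", "."]
sentence = detok.detokenize(tokens)
print(sentence)  # this is a sample sentence.
```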
Alternatively (not tested):
$ pip install -r requirements.txt
- All important functions are documented with docstrings
- Clone the Hugging Face repo of the HateXplain model:

$ git clone https://huggingface.co/Hate-speech-CNERG/bert-base-uncased-hatexplain-rationale-two
- Run `batchscripts/attack.sh`
  - Resulting adversarial examples are saved in `data/attacks_val_no-letters.json` (already done in this repo)
  - Remove `--lime` to use brute-force attacks
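The attack script's internals live in the repo; as an illustration of what a brute-force variant enumerates, a character-deletion candidate generator might look like this (the function name and deletion strategy are hypothetical, not the repo's actual code):

```python
def candidate_deletions(text: str):
    """Yield every variant of `text` with one letter removed.

    Hypothetical sketch of a brute-force perturbation generator:
    a real attack would enumerate such candidates and keep the first
    one that flips the classifier's prediction.
    """
    for i, ch in enumerate(text):
        if ch.isalpha():
            yield text[:i] + text[i + 1:]

candidates = list(candidate_deletions("hate"))
print(candidates)  # ['ate', 'hte', 'hae', 'hat']
```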
- Run `batchscripts/analyze.sh`
  - Results are printed to the terminal and saved in `outputs/analyze_val_no-letters.txt` (already done in this repo)
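The analysis step aggregates the saved attack records; the core computation can be sketched as follows (the record schema shown here is hypothetical and may differ from the repo's actual JSON format):

```python
# Hypothetical records standing in for the contents of
# data/attacks_val_no-letters.json; the real schema may differ.
records = [
    {"original": "some text", "adversarial": "some txt", "success": True},
    {"original": "other text", "adversarial": None, "success": False},
]

# Fraction of attacked samples where the model's prediction flipped.
successes = sum(1 for r in records if r["success"])
rate = successes / len(records)
print(f"attack success rate: {rate:.1%}")  # attack success rate: 50.0%
```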
- Run `batchscripts/explain.sh`
  - Explanations are saved into the existing `data/attacks_val_no-letters.json`
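To illustrate the idea behind such explanations: LIME-style methods estimate each word's contribution by perturbing the input and watching the classifier's score change. The sketch below uses a toy leave-one-out scheme with a stand-in scorer (`toy_score` and the flagged-word list are hypothetical, not the HateXplain model):

```python
def toy_score(text: str) -> float:
    # Hypothetical scorer standing in for the real classifier:
    # returns the fraction of words that are on a flagged list.
    flagged = {"stupid", "idiot"}
    words = text.split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def word_importance(text: str):
    # Leave-one-out importance: the drop in score when a word is
    # removed approximates that word's contribution to the prediction.
    base = toy_score(text)
    words = text.split()
    importances = []
    for i in range(len(words)):
        reduced = " ".join(words[:i] + words[i + 1:])
        importances.append((words[i], base - toy_score(reduced)))
    return importances

imp = word_importance("you are stupid")
print(imp)  # "stupid" gets a positive importance, the other words negative
```

LIME proper fits a local linear surrogate over many random perturbations instead of single deletions, but the intuition is the same.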
- Run `batchscripts/test.sh`
  - Unsuccessful attacks are saved in `data/test_val_no-letters_unsuccessful.json` (already done in this repo)
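The test step separates out the attacks that failed to change the model's prediction; its filtering logic can be sketched like this (the record schema and the `label_flipped` field are hypothetical):

```python
import json

# Hypothetical attack records; the real input is
# data/attacks_val_no-letters.json and its schema may differ.
attacks = [
    {"text": "adv example 1", "label_flipped": True},
    {"text": "adv example 2", "label_flipped": False},
]

# Keep only the attacks that did not flip the prediction, mirroring
# how unsuccessful attacks end up in a separate JSON file.
unsuccessful = [a for a in attacks if not a["label_flipped"]]
print(json.dumps(unsuccessful))
```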