This is the code repository for the NAACL 2022 paper A Study of the Attention Abnormality in Trojaned BERTs. The paper proposes an attention-based Trojan detector that distinguishes Trojaned models from clean ones in the field of Natural Language Processing.
Run
bash setup_python_environment.sh
to set up the environment.
You can simply feed your suspicious model into the detector, and it will output the probability that the model is Trojaned.
To run the detector, simply run
python detection_attentd.py
Please set the model path and settings accordingly:

- Be sure to change the model file path 'model_filepath' in file detection_attentd.py, line 470 (see the sketch below).
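For reference, a minimal sketch of that setting is shown below. Only the variable name model_filepath comes from detection_attentd.py; the example path and the torch.load call are assumptions about how a suspicious checkpoint might be loaded, so adapt them to your own files.

```python
import torch

# Around line 470 of detection_attentd.py: point 'model_filepath' at the
# suspicious model you want to scan (the path below is only a placeholder).
model_filepath = "./models/id-00000001/model.pt"

# One common way such a checkpoint is loaded (an assumption, not necessarily
# the repository's exact loading code).
model = torch.load(model_filepath, map_location="cpu")
model.eval()
```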
Suspicious models trained with different codebases may differ, so please BE SURE that:

- you adjust the model inference code to your own case [in function gene_batch_logits in file detection_attentd.py], as sketched after this list;
- you check the tokenizer [in function baseline_trigger_reconstruction in file detection_attentd.py].
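To illustrate the kind of adjustment these two items refer to, here is a minimal sketch of a batched-logits function and a matching tokenizer for a standard Hugging Face BERT classifier. The function name gene_batch_logits is taken from detection_attentd.py, but its signature and body here are assumptions; adapt both to however your own codebase saves and calls its models, and make sure the tokenizer matches the one used in baseline_trigger_reconstruction.

```python
import torch
from transformers import BertTokenizerFast

# Assumption: the suspicious model is a Hugging Face BERT sequence classifier
# and this tokenizer matches the one the model was trained with.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def gene_batch_logits(model, sentences, device="cpu", batch_size=32):
    """Sketch: run the suspicious model over a list of sentences and return logits."""
    model.to(device).eval()
    all_logits = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(
                sentences[i:i + batch_size],
                padding=True,
                truncation=True,
                max_length=128,
                return_tensors="pt",
            ).to(device)
            outputs = model(**batch)
            # Hugging Face models return an object with a .logits attribute;
            # adjust this line if your codebase returns a raw tensor instead.
            logits = outputs.logits if hasattr(outputs, "logits") else outputs
            all_logits.append(logits.cpu())
    return torch.cat(all_logits, dim=0)
```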
Some additional information (you do not need to do anything about these items; the default settings should be fine):

a. The pre-defined trigger candidate set is in ./pre_defined_data/trigger_hub.pkl, and clean sentence samples are stored in ./pre_defined_data/dev-custom-imdb.
b. The output logs and result files are stored in ./results/cls_data and ./results/cls_results.
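If you want to inspect the pre-defined data, a minimal sketch is below. It assumes trigger_hub.pkl unpickles into an ordinary Python container (e.g., a collection of candidate trigger strings) and that dev-custom-imdb is a directory of clean sample files; check the actual contents in your checkout.

```python
import os
import pickle

# Load the pre-defined trigger candidate set (assumed to be a standard pickle).
with open("./pre_defined_data/trigger_hub.pkl", "rb") as f:
    trigger_hub = pickle.load(f)

print(type(trigger_hub))

# Peek at a few clean sentence sample files shipped with the repository.
print(os.listdir("./pre_defined_data/dev-custom-imdb")[:5])
```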