Note that this code is provided for reproducibility. A wrapper for these modules is planned.
This code is implemented using PaddlePaddle and InterpretDL. A demo using HuggingFace models and InterpretDL can be found here. A demo using PyTorch models and InterpretDL can be found here.
Paper:
Xuhong Li, Mengnan Du, Jiamin Chen, Yekun Chai, Himabindu Lakkaraju, Haoyi Xiong. “M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models.” NeurIPS 2023, Datasets and Benchmarks Track. https://openreview.net/forum?id=6zcfrSz98y.
Pipeline:
- `benchmark_data`: stores the data used by the benchmark.
- `benchmark-cv`: benchmark for the image modality.
- `benchmark-nlp`: benchmark for the text modality.
- `training`: training on the synthetic dataset of yellow patches.
- `training_nlp`: fine-tuning the NLP models on the MovieReview dataset.
To run the benchmark, follow the steps below:

1. Prepare the dataset: ImageNet or MovieReview. By using these datasets, you agree to the terms of ImageNet or MovieReview, respectively.

(Step 2 is optional. One can download or prepare a trained model.)

2. Train models (only required for NLP models). See `training_nlp/train.sh`.

(Step 3 is only required for the SynScore calculation.)

3. Train models on the synthetic dataset. See `training/train.sh`.
4. Compute explanations. See `benchmark-cv/run_expl.sh` for the image modality and `benchmark-nlp/run_expl.sh` for the text modality. All explanation results are computed at once and saved locally, which avoids recomputing them.
5. Evaluate MoRF and ABPC. See `benchmark-cv/run_eval.sh` for the image modality and `benchmark-nlp/run_eval.sh` for the text modality.
6. Evaluate the other metrics. See `benchmark-cv/run_eval-pgs.sh` for the image modality and `benchmark-nlp/run_eval2.sh` for the text modality.
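For readers unfamiliar with the MoRF and ABPC metrics evaluated above, the following is a minimal sketch of the general idea, not the code used by the `run_eval.sh` scripts. It assumes the common definitions: a MoRF (Most Relevant First) curve records the model score as features are removed in decreasing order of attribution, a LeRF (Least Relevant First) curve does the same in increasing order, and ABPC is the area between the two curves. The names `perturbation_curve`, `abpc`, and `toy_model`, the zero baseline, and the unnormalized scores are all illustrative assumptions.

```python
from typing import Callable, List, Sequence


def perturbation_curve(
    score_fn: Callable[[Sequence[float]], float],
    features: Sequence[float],
    attributions: Sequence[float],
    most_relevant_first: bool = True,
    baseline: float = 0.0,
) -> List[float]:
    """Return model scores after removing 0, 1, ..., n features.

    Features are removed in attribution order: descending for MoRF,
    ascending for LeRF. "Removing" here means replacing the value with
    `baseline`, which is an assumption of this sketch.
    """
    order = sorted(
        range(len(features)),
        key=lambda i: attributions[i],
        reverse=most_relevant_first,
    )
    perturbed = list(features)
    scores = [score_fn(perturbed)]
    for idx in order:
        perturbed[idx] = baseline
        scores.append(score_fn(perturbed))
    return scores


def abpc(morf_curve: Sequence[float], lerf_curve: Sequence[float]) -> float:
    """Mean gap between the LeRF and MoRF curves (higher is better)."""
    gaps = [l - m for l, m in zip(lerf_curve, morf_curve)]
    return sum(gaps) / len(gaps)


def toy_model(x: Sequence[float]) -> float:
    """Hypothetical linear model standing in for a real classifier."""
    weights = [4.0, 2.0, 1.0]
    return sum(w * v for w, v in zip(weights, x))


features = [1.0, 1.0, 1.0]
attributions = [4.0, 2.0, 1.0]  # a faithful explanation for toy_model

morf = perturbation_curve(toy_model, features, attributions, True)
lerf = perturbation_curve(toy_model, features, attributions, False)
morf_score = sum(morf) / len(morf)  # lower means a more faithful ranking
abpc_score = abpc(morf, lerf)
```

With a faithful attribution, the MoRF curve drops quickly while the LeRF curve stays high, so the gap between them is large. In practice, scores are often normalized by the unperturbed prediction, and features are removed in groups (image patches or tokens) rather than one at a time.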