NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks, CVPR 2022 (Oral)

fawazsammani/nlxgpt

Official Code for NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks
arXiv | video

Gradio web demo for VQA-X: Hugging Face Spaces
Gradio web demo for ACT-X: Hugging Face Spaces

[NEW] Our new work Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks won an honorable mention award at ICCVW! Check it out and check our new NLE datasets: VQA-ParaX and ImageNetX!



Requirements

  • PyTorch 1.8 or higher
  • CLIP (install with pip install git+https://github.com/openai/CLIP.git)
  • transformers (install with pip install transformers)
  • accelerate for distributed training (install with pip install git+https://github.com/huggingface/accelerate)

Images Download

We conduct experiments on four vision and vision-language (V/VL) NLE datasets: VQA-X, ACT-X, e-SNLI-VE and VCR. Our code does not use pre-cached visual features; the features are extracted directly during code execution. Please download the images from the following links into a folder named images in your directory:

  • VQA-X: COCO train2014 and val2014 images
  • ACT-X: MPI images. Rename to mpi
  • e-SNLI-VE: Flickr30K images. Rename to flickr30k
  • VCR: VCR images. Rename to vcr
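
As a quick sanity check before training, the folder layout described above can be verified with a short script. This is a minimal sketch, not part of the repository: the names mpi, flickr30k and vcr come from the list above, while train2014 and val2014 are assumed names for the two COCO subfolders, and missing_image_folders is a hypothetical helper.

```python
from pathlib import Path

# mpi, flickr30k and vcr are the folder names given in the list above.
# train2014 and val2014 are ASSUMED names for the two COCO splits.
EXPECTED_SUBFOLDERS = ["train2014", "val2014", "mpi", "flickr30k", "vcr"]


def missing_image_folders(root="images", expected=EXPECTED_SUBFOLDERS):
    """Return the expected subfolders that are absent or empty under root."""
    root = Path(root)
    missing = []
    for name in expected:
        folder = root / name
        # A folder counts as missing if it does not exist or contains nothing.
        if not folder.is_dir() or not any(folder.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_image_folders()
    if gaps:
        print(f"Missing or empty: {gaps}")
    else:
        print("All image folders found.")
```

If you only work with one dataset, pass just the folders you need, for example missing_image_folders("images", ["mpi"]).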

Annotations Download

We structure the annotations for the NLE datasets. You can download the structured annotations from here: VQA-X, ACT-X, e-SNLI-VE, VCR. Place them in the nle_data/dataset_name/ directory, where dataset_name is one of {VQA-X, ACT-X, eSNLI-VE, VCR}. The pretraining annotations are here. Please also see this issue for clarification on which pretraining annotations to use. If you want to preprocess the data yourself rather than downloading the annotations directly, the code can be found in utils/nle_preprocess.ipynb.

You also need cococaption and the annotations in the correct format in order to perform evaluation on NLG metrics. We use the cococaption python3 toolkit here. Please download it and place the cococaption folder in your directory. The annotations in the correct format can be downloaded here. Please place them in the annotations folder. If you want to convert the natural language explanations data from the source to the format that cococaption expects for evaluation manually rather than downloading it directly, the code can be found in utils/preprocess_for_cococaption_eval.ipynb.
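
For orientation, the layout that COCO caption evaluation tools generally expect can be sketched as follows. This is an illustration under stated assumptions, not the code in utils/preprocess_for_cococaption_eval.ipynb: the input dictionaries and the function name to_coco_caption_format are hypothetical, and the exact fields in the downloadable annotation files may differ.

```python
import json


def to_coco_caption_format(references, predictions):
    """Build COCO-caption style ground-truth and result structures.

    references:  dict mapping an integer sample id to a list of reference explanations
    predictions: dict mapping the same ids to one generated explanation
    Both input layouts are HYPOTHETICAL and chosen only for illustration.
    """
    images, annotations = [], []
    ann_id = 0
    for sample_id, explanations in references.items():
        images.append({"id": sample_id})
        for text in explanations:
            # Each reference explanation becomes one "caption" annotation.
            annotations.append({"image_id": sample_id, "id": ann_id, "caption": text})
            ann_id += 1
    ground_truth = {
        "info": {},
        "licenses": [],
        "type": "captions",
        "images": images,
        "annotations": annotations,
    }
    # Results are a flat list with one generated caption per sample id.
    results = [{"image_id": i, "caption": c} for i, c in predictions.items()]
    return ground_truth, results


if __name__ == "__main__":
    gt, res = to_coco_caption_format(
        {1: ["the man is riding a wave", "he is on a surfboard"]},
        {1: "he is standing on a surfboard"},
    )
    print(json.dumps(gt, indent=2))
    print(json.dumps(res, indent=2))
```

The ground-truth structure would be saved as one JSON file and the results as another; the evaluation toolkit then matches them on image_id.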

If you evaluate with BERTScore, you will also need to install it: pip install bert_score==0.3.7

Code

One GPU is enough for finetuning on NLE. However, if you wish to do distributed training, please set it up first using accelerate. Note that you can still use accelerate even if you have only one GPU. In your environment command line, type:

accelerate config

and answer the questions.

VQA-X

Please run from the command line with:

accelerate launch vqaX.py

Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.

ACT-X

Please run from the command line with:

accelerate launch actX.py

Note: To finetune from the pretrained captioning model, please set the finetune_pretrained flag to True.

e-SNLI-VE

Please run from the command line with:

accelerate launch esnlive.py

e-SNLI-VE (+ Concepts)

Please run from the command line with:

accelerate launch esnlive_concepts.py

VCR

Please run from the command line with:

accelerate launch vcr.py

This will give you the unfiltered scores. After that, we use BERTScore to filter the incorrect answers and get the filtered scores (see paper Appendix for more details). Since BERTScore takes time to calculate, it is not ideal to run it and filter scores after every epoch. Therefore, we perform this operation once on the epoch with the best unfiltered scores. Please run:

python vcr_filter.py
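
The idea behind the filtering step can be sketched as below. This is a hedged illustration and not the contents of vcr_filter.py: the sample keys, the threshold value and both function names are hypothetical, and a toy token-overlap score stands in for BERTScore so the example runs without extra dependencies.

```python
def filter_by_answer_similarity(samples, similarity, threshold=0.9):
    """Keep explanations whose predicted answer is close enough to the reference answer.

    samples:    list of dicts with HYPOTHETICAL keys 'pred_answer', 'gt_answer', 'explanation'
    similarity: callable (candidate, reference) -> float; BERTScore F1 would be one choice
    threshold:  HYPOTHETICAL cutoff, not the value used by the authors
    """
    kept = []
    for sample in samples:
        if similarity(sample["pred_answer"], sample["gt_answer"]) >= threshold:
            kept.append(sample["explanation"])
    return kept


def token_overlap(candidate, reference):
    """Toy stand-in for BERTScore: Jaccard overlap of lowercase tokens."""
    a = set(candidate.lower().split())
    b = set(reference.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    samples = [
        {"pred_answer": "he is angry", "gt_answer": "he is angry",
         "explanation": "he is frowning and clenching his fists"},
        {"pred_answer": "she is happy", "gt_answer": "he is angry",
         "explanation": "she is smiling"},
    ]
    print(filter_by_answer_similarity(samples, token_overlap, threshold=0.5))
```

Passing the similarity function as an argument keeps the expensive scorer out of the training loop, which matches the note above about running the filter only once on the best epoch.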

Models

All models can be downloaded from the links below:

  • Pretrained Model on Image Captioning: link
  • VQA-X (w/o pretraining): link
  • VQA-X (w/ pretraining): link
  • ACT-X (w/o pretraining): link
  • ACT-X (w/ pretraining): link
  • Concept Head + Wordmap (used in e-SNLI-VE w/ concepts): link
  • e-SNLI-VE (w/o concepts): link
  • e-SNLI-VE (w/ concepts): link
  • VCR: link

Note: Place the concept model and its wordmap in a folder: pretrained_model/

Results

The output results (generated text) on the test dataset can be downloaded from the links below. _filtered means that the file contains only the explanations for which the predicted answer is correct. _unfiltered means that all the explanations are included, regardless of whether the predicted answer is correct or not. _full means the full output prediction (including the answer + explanation). _exp means the explanation part only. All evaluation is performed on _exp. See Section 4 of the paper for more details.
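
To make the difference between _full and _exp concrete, the sketch below splits a full prediction into its answer and explanation parts. It assumes that the two parts are joined by the word "because"; that separator, the example sentence and the helper name split_full_output are assumptions for illustration, so check the downloaded files before relying on it.

```python
def split_full_output(full_text, separator=" because "):
    """Split a full prediction into (answer_part, explanation_part).

    The separator is an ASSUMPTION about how the answer and explanation are joined.
    If it is not found, the whole text is returned as the answer part.
    """
    answer, found, explanation = full_text.partition(separator)
    if not found:
        return full_text.strip(), ""
    return answer.strip(), explanation.strip()


if __name__ == "__main__":
    full = "the answer is yes because the man is riding a wave"
    answer, explanation = split_full_output(full)
    print(answer)       # corresponds to the answer part of a _full entry
    print(explanation)  # corresponds to what an _exp entry would contain
```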

  • VQA-X (w/o pretraining): link
  • VQA-X (w/ pretraining): link
  • ACT-X (w/o pretraining): link
  • ACT-X (w/ pretraining): link
  • e-SNLI-VE (w/o concepts): link
  • e-SNLI-VE (w/ concepts): link
  • VCR: link

Please note that in the case of VCR, the results shown on page 4 of the appendix may not correspond exactly to the results and pretrained model in the links above. We trained several models and randomly picked one for presenting the qualitative results.

Proposed Evaluation Metrics

Please see explain_predict and retrieval_attack folders.
