This repository contains the e-SNLI-VE dataset, the HTML files for the e-ViL human evaluation framework, and the e-UG model of our ICCV 2021 paper:
e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks (ICCV 2021).
The train, dev, and test splits are in the `data` folder. The `.csv` files contain Flickr30k image IDs. Flickr30k can be downloaded here.
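For readers who want to pair a split with the images directly, here is a minimal sketch; the split filename and the image-ID column name are assumptions, so inspect the actual `.csv` header first:

```python
# Minimal sketch (assumed file and column names) for pairing an e-SNLI-VE split
# with locally downloaded Flickr30k images.
import os
import pandas as pd

split = pd.read_csv("data/esnlive_train.csv")   # hypothetical filename; adjust to the real split file
print(split.columns.tolist())                   # check the actual column names first

flickr30k_dir = "path/to/flickr30k_images"      # wherever Flickr30k was downloaded
image_col = "Flickr30kID"                       # assumed name of the image-ID column
split["image_path"] = split[image_col].astype(str).apply(
    lambda name: os.path.join(flickr30k_dir, name)
)
print(split["image_path"].head())
```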
The `e-ViL_MTurk` folder contains the MTurk questionnaires for e-SNLI-VE, VQA-X, and VCR. These HTML files can be uploaded to the Amazon Mechanical Turk platform for crowd-sourced human evaluation.
e-UG uses UNITER as its vision-language model and GPT-2 to generate explanations. The UNITER implementation is based on the code of the Transformers-VQA repo, and the GPT-2 implementation is based on Marasovic et al. (2020).
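As a rough illustration of this setup (not the repository's actual code), the sketch below shows how a multimodal representation can be fed to GPT-2 as prefix embeddings so that explanation generation is conditioned on the vision-language model; the random tensor standing in for UNITER's output and the prefix length are assumptions:

```python
# Conceptual sketch: condition GPT-2 on a multimodal prefix (stand-in for UNITER output).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

# Random stand-in for UNITER's output sequence, projected to GPT-2's hidden size.
multimodal_prefix = torch.randn(1, 10, gpt2.config.n_embd)

# Embed an explanation prompt and prepend the multimodal prefix, so GPT-2
# attends to the image-text representation while generating.
prompt_ids = tokenizer("because", return_tensors="pt").input_ids
prompt_embeds = gpt2.transformer.wte(prompt_ids)
inputs_embeds = torch.cat([multimodal_prefix, prompt_embeds], dim=1)

with torch.no_grad():
    logits = gpt2(inputs_embeds=inputs_embeds).logits
next_token = logits[:, -1].argmax(-1)
print(tokenizer.decode(next_token.tolist()))
```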
The entry point for training and testing the models is `eUG.py`.
The environment file is `eUG.yml`. Create the environment by running `conda env create -f eUG.yml`.
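After creation, activate the environment before running any of the commands below (the environment name is defined inside `eUG.yml`; `eUG` is assumed here):

```bash
conda activate eUG
```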
To run the NLG evaluation in this code, you need to download the package from this Google Drive link and place it in the root directory of this project.
- Run this script to download the Faster R-CNN features for Flickr30k and store them in `data/esnlive/img_db/flickr30k/`.
- Download the `.json` files, ready to be used with e-UG, from this Google Drive link and store them in `data/esnlive/` (see the sanity-check snippet after this list).
- Download the Faster R-CNN features for the MS COCO train2014 (17 GB), val2014 (8 GB), and test2015 images:

  ```bash
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/fasterRCNN_features
  unzip data/fasterRCNN_features/train2014_obj36.zip -d data/fasterRCNN_features && rm data/fasterRCNN_features/train2014_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/fasterRCNN_features
  unzip data/fasterRCNN_features/val2014_obj36.zip -d data/fasterRCNN_features && rm data/fasterRCNN_features/val2014_obj36.zip
  wget https://nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/test2015_obj36.zip -P data/fasterRCNN_features
  unzip data/fasterRCNN_features/test2015_obj36.zip -d data/fasterRCNN_features && rm data/fasterRCNN_features/test2015_obj36.zip
  ```
- Download the VQA-X dataset from this Google Drive link and store the splits in `data/vqax/`.
- Download the Faster R-CNN features using this script and store them in `data/vcr/vcr_{split}/`.
- Download the VCR `.json` files from this Google Drive link and store them in `data/vcr/`.
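Once the `.json` splits are in place, a quick sanity check such as the one below (a minimal sketch; the top-level structure of the files is not documented here, so the code handles both a list and a dict) confirms they load and shows the fields of one example:

```python
# Load one downloaded split and print a single record to see its fields.
import json

with open("data/esnlive/esnlive_dev.json") as f:   # path matches the training example below
    data = json.load(f)

print(type(data), len(data))
sample = data[0] if isinstance(data, list) else next(iter(data.values()))
print(sample)
```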
Download the general pre-trained UNITER-base using this link. The pre-trained UNITER-base for VCR is available from this link. We use the general pre-trained model for VQA-X and e-SNLI-VE, and the VCR pre-trained one for VCR.
Check the command line arguments in `param.py`.
Here is an example of how to train the model on e-SNLI-VE:

```bash
python eUG.py --task esnlive --train data/esnlive/esnlive_train.json --val data/esnlive/esnlive_dev.json --save_steps 5000 --output experiments/esnlive_run1/train
```
The model weights, Tensorboard logs, and a text log will be saved in the given output directory.
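To monitor a run while it trains, the TensorBoard logs in the output directory can be viewed with the standard TensorBoard CLI (directory taken from the example above):

```bash
tensorboard --logdir experiments/esnlive_run1/train
```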
Check the command line arguments in `param.py`.
Here is an example of how to test a trained model on the e-SNLI-VE test set:

```bash
python eUG.py --task esnlive --test data/esnlive/esnlive_test.json --load_trained experiments/esnlive_run1/train/best_global.pth --output experiments/esnlive_run1/eval
```
All generated explanations, automatic NLG scores, and a text log will be saved in the given output directory.
If you use e-SNLI-VE, e-UG, or the e-ViL benchmark in your work, please cite our paper:
```bibtex
@InProceedings{Kayser_2021_ICCV,
  author    = {Kayser, Maxime and Camburu, Oana-Maria and Salewski, Leonard and Emde, Cornelius and Do, Virginie and Akata, Zeynep and Lukasiewicz, Thomas},
  title     = {E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {1244-1254}
}
```