This repository contains code for the paper Understanding Attention for Vision-and-Language Tasks published in COLING 2022.
Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon. (2022, October). Understanding Attention for Vision-and-Language Tasks. The 29th International Conference on Computational Linguistics (COLING 2022).
This paper analyzes the effect of different attention alignment calculation scores across the following four Vision-and-Language (VL) tasks. We follow the instructions from the respective repositories to set up the environments and prepare the datasets.
- Text-to-Image Generation: AttnGAN (Github)
- Text-and-Image Matching: SCAN (Github)
- Visual Question Answering: MAC (Github)
- Text-based Visual Question Answering: M4C (note that we built on the base code of SAM4C (Github) and modified the config to include only classic self-attention layers, which makes the model identical in structure to M4C)
The code in our repository modifies the attention calculation part of each of the above base models. Instructions for running our code/experiments are provided below:
- Text-to-Image Generation:
- Text-and-Image Matching:
- Visual Question Answering:
- Text-based Visual Question Answering:
  - experiment source code and configs
  - sample commands to run experiments on Text-VQA:
```
python train.py --config ./configs/m4c_tvqa_n4.yml --tag scaled_dot
python train.py --config ./configs/m4c_tvqa_n4_dot.yml --tag dot
python train.py --config ./configs/m4c_tvqa_n4_kwq.yml --tag general_kwq
...
python train.py --config ./configs/m4c_tvqa_n4_biased_kwq.yml --tag biased_general_kwq
```
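The `--tag` values above name the attention alignment score variants compared in the paper. As a rough PyTorch sketch of what such score functions typically compute (this is our illustration, not the repository's actual implementation; the exact parameterisation of the `general_kwq` and `biased_general_kwq` variants is an assumption):

```python
# Minimal sketch of attention alignment score variants (illustrative only).
import torch
import torch.nn as nn


class AttentionScore(nn.Module):
    """Unnormalised alignment score between a query q and a key k."""

    def __init__(self, d: int, variant: str = "scaled_dot"):
        super().__init__()
        self.variant = variant
        self.scale = d ** 0.5                        # sqrt(d) for scaled dot product
        self.W = nn.Linear(d, d, bias=False)         # bilinear weight for 'general'
        self.b = nn.Parameter(torch.zeros(1))        # additive bias for 'biased'

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (..., d); returns (...,) unnormalised scores fed to a softmax.
        if self.variant == "dot":
            return (q * k).sum(-1)                   # q^T k
        if self.variant == "scaled_dot":
            return (q * k).sum(-1) / self.scale      # q^T k / sqrt(d)
        if self.variant == "general_kwq":
            return (self.W(k) * q).sum(-1)           # q^T W k (learned bilinear form)
        if self.variant == "biased_general_kwq":
            return (self.W(k) * q).sum(-1) + self.b  # q^T W k + b
        raise ValueError(f"unknown variant: {self.variant}")


# Example: score queries against keys and normalise over the key axis.
score = AttentionScore(d=64, variant="general_kwq")
q = torch.randn(10, 64)
k = torch.randn(10, 64)
weights = torch.softmax(score(q, k), dim=0)          # attention weights over keys
```

In each base model, the chosen score function replaces the default alignment calculation before the softmax that produces the attention weights.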
If you use our code, please cite:

```
@inproceedings{cao2022attentionvl,
  title = {Understanding Attention for Vision-and-Language Tasks},
  author = {Cao, Feiqi and Han, Soyeon Caren and Long, Siqu and Xu, Changwei and Poon, Josiah},
  booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
  publisher = {International Committee on Computational Linguistics},
  month = {oct},
  year = {2022}
}
```
We visualised the prediction interpretability of the best and worst attention alignment calculation methods for each task. Here are some examples; for more details, please refer to our paper.