This repository contains code for the paper Understanding Attention for Vision-and-Language Tasks published in COLING 2022.
Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon. (2022, October). Understanding Attention for Vision-and-Language Tasks. The 29th International Conference on Computational Linguistics (COLING 2022).
This paper analyzes the effect of different attention alignment calculation scores across the following four Vision-and-Language (VL) tasks. We follow the instructions from the respective repositories to set up the environments and prepare the datasets.
- Text-to-Image Generation: AttnGAN (Github)
- Text-and-Image Matching: SCAN (Github)
- Visual Question Answering: MAC (Github)
- Text-based Visual Question Answering: M4C (note that we built on the base code of SAM4C (Github) and modified the config to include only classic self-attention layers, which makes the model identical in structure to M4C)
The code in our repository modifies the attention calculation part of each of the above base models. Instructions for running our code/experiments are provided below:
- Text-to-Image Generation:
- Text-and-Image Matching:
- Visual Question Answering:
- Text-based Visual Question Answering:
  - experiment source code and configs
  - sample commands to run experiments on Text-VQA:
```
python train.py --config ./configs/m4c_tvqa_n4.yml --tag scaled_dot
python train.py --config ./configs/m4c_tvqa_n4_dot.yml --tag dot
python train.py --config ./configs/m4c_tvqa_n4_kwq.yml --tag general_kwq
...
python train.py --config ./configs/m4c_tvqa_n4_biased_kwq.yml --tag biased_general_kwq
```
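The `--tag` values above name the attention alignment score variants compared in the paper. As a rough PyTorch sketch of what such score functions typically compute (this is our illustration, not the repository's actual implementation; the exact parameterisation of the `general_kwq` and `biased_general_kwq` variants is an assumption):

```python
# Minimal sketch of attention alignment score variants (illustrative only).
import torch
import torch.nn as nn


class AttentionScore(nn.Module):
    """Unnormalised alignment score between a query q and a key k."""

    def __init__(self, d: int, variant: str = "scaled_dot"):
        super().__init__()
        self.variant = variant
        self.scale = d ** 0.5                        # sqrt(d) for scaled dot product
        self.W = nn.Linear(d, d, bias=False)         # bilinear weight for 'general'
        self.b = nn.Parameter(torch.zeros(1))        # additive bias for 'biased'

    def forward(self, q: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
        # q, k: (..., d); returns (...,) unnormalised scores fed to a softmax.
        if self.variant == "dot":
            return (q * k).sum(-1)                   # q^T k
        if self.variant == "scaled_dot":
            return (q * k).sum(-1) / self.scale      # q^T k / sqrt(d)
        if self.variant == "general_kwq":
            return (self.W(k) * q).sum(-1)           # q^T W k (learned bilinear form)
        if self.variant == "biased_general_kwq":
            return (self.W(k) * q).sum(-1) + self.b  # q^T W k + b
        raise ValueError(f"unknown variant: {self.variant}")


# Example: score queries against keys and normalise over the key axis.
score = AttentionScore(d=64, variant="general_kwq")
q = torch.randn(10, 64)
k = torch.randn(10, 64)
weights = torch.softmax(score(q, k), dim=0)          # attention weights over keys
```

In each base model, the chosen score function replaces the default alignment calculation before the softmax that produces the attention weights.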
If you use our code, please cite:

```
@inproceedings{cao2022attentionvl,
  title = {Understanding Attention for Vision-and-Language Tasks},
  author = {Cao, Feiqi and Han, Soyeon Caren and Long, Siqu and Xu, Changwei and Poon, Josiah},
  booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
  publisher = {International Committee on Computational Linguistics},
  month = {oct},
  year = {2022}
}
```
We visualised the prediction interpretability of the best and worst attention alignment calculation methods for each task. Here are some examples; for more details, please refer to our paper.