Understanding Attention for Vision-and-Language Tasks

This repository contains code for the paper Understanding Attention for Vision-and-Language Tasks published in COLING 2022.

Feiqi Cao, Soyeon Caren Han, Siqu Long, Changwei Xu, Josiah Poon. (2022, October).
Understanding Attention for Vision-and-Language Tasks

The 29th International Conference on Computational Linguistics
(COLING 2022).

Set Up

This paper analyzes the effect of different attention alignment score calculation methods on the following four Vision-and-Language (VL) tasks. We follow the instructions from the respective repositories to set up the environments and prepare the datasets.

  • Text-to-Image Generation: AttnGAN (Github)
  • Text-and-Image Matching: SCAN (Github)
  • Visual Question Answering: MAC (Github)
  • Text-based Visual Question Answering: M4C (Note that we started from the SAM4C code base (Github) and modified its config to include only classic self-attention layers, which makes the model identical in structure to M4C.)
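To make the object of study concrete, below is a minimal sketch of a few common attention alignment score functions (dot product, scaled dot product, cosine similarity) and how a softmax turns such scores into attention weights. The function names and the exact set of scoring methods here are illustrative only; the methods actually compared in the paper, and how they are wired into each base model, are described in the paper and in the per-task code.

```python
import numpy as np

def dot_score(q, k):
    # Plain dot product between a query and a key vector.
    return q @ k

def scaled_dot_score(q, k):
    # Dot product scaled by sqrt(d), as in Transformer-style attention.
    return (q @ k) / np.sqrt(len(q))

def cosine_score(q, k):
    # Cosine similarity: dot product of L2-normalised vectors.
    return (q @ k) / (np.linalg.norm(q) * np.linalg.norm(k))

def attention_weights(query, keys, score_fn):
    # Softmax over the alignment scores of one query against all keys.
    scores = np.array([score_fn(query, k) for k in keys])
    exp = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp / exp.sum()
```

Swapping `score_fn` while keeping the rest of the model fixed is the kind of controlled comparison the experiments perform across the four tasks.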

Run Experiments

The code in our repository modifies the attention calculation component of each of the above base models. We provide instructions for running our code and experiments below:

Citation

@inproceedings{cao2022attentionvl,
  title     = {Understanding Attention for Vision-and-Language Tasks},
  author    = {Cao, Feiqi and Han, Soyeon Caren and Long, Siqu and Xu, Changwei and Poon, Josiah},
  booktitle = {Proceedings of the 29th International Conference on Computational Linguistics},
  publisher = {International Committee on Computational Linguistics},
  month     = {oct},
  year      = {2022}
}

Qualitative Examples

We visualised the prediction interpretability of the best- and worst-performing attention alignment calculation methods for each task. Some examples are shown below; for more details, please refer to our paper.

  • Text-to-Image Generation:

  • Text-and-Image Matching:

  • Visual Question Answering:

  • Text-based Visual Question Answering: