Aisha Urooj Khan, Hilde Kuehne, Chuang Gan, Niels Da Vitoria Lobo, Mubarak Shah
Official PyTorch implementation and pre-trained models for Weakly Supervised Grounding for VQA in Vision-Language Transformers (coming soon).
Transformers for visual-language representation learning have attracted a lot of interest and shown tremendous performance on visual question answering (VQA) and grounding. However, most systems that perform well on those tasks still rely on pre-trained object detectors during training, which limits their applicability to the object classes available for those detectors. To mitigate this limitation, this paper focuses on the problem of weakly supervised grounding in the context of visual question answering in transformers. The approach leverages capsules by grouping each visual token in the visual encoder and uses activations from language self-attention layers as a text-guided selection module to mask those capsules before they are forwarded to the next layer. We evaluate our approach on the challenging GQA and VQA-HAT datasets for VQA grounding. Our experiments show that, while removing the information of masked objects from standard transformer architectures leads to a significant drop in performance, the integration of capsules significantly improves the grounding ability of such systems and provides new state-of-the-art results compared to other approaches in the field.
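The text-guided capsule masking described above can be illustrated with a minimal sketch. This is not the released implementation: the function names, the top-k selection rule, and the nested-list tensor layout are all assumptions made for illustration. The sketch assumes one relevance score per capsule type, derived from the language side, and zeroes the pose vectors of every capsule type outside the top `keep` for all visual tokens before they would be passed to the next layer.

```python
from typing import List

# Hypothetical layout: tokens[t][c] is the pose vector of capsule type c at visual token t.
Tokens = List[List[List[float]]]


def text_guided_capsule_mask(text_scores: List[float], keep: int) -> List[float]:
    """Build a 0/1 mask that keeps the `keep` highest-scoring capsule types.

    `text_scores` stands in for activations taken from the language
    self-attention layers (an assumption of this sketch).
    """
    ranked = sorted(range(len(text_scores)),
                    key=lambda i: text_scores[i], reverse=True)
    kept = set(ranked[:keep])
    return [1.0 if i in kept else 0.0 for i in range(len(text_scores))]


def mask_capsules(tokens: Tokens, mask: List[float]) -> Tokens:
    """Apply the same capsule-type mask to every visual token."""
    return [
        [[value * m for value in capsule] for capsule, m in zip(token, mask)]
        for token in tokens
    ]
```

For example, with scores `[0.1, 0.7, 0.2]` and `keep=1`, only the second capsule type survives for each visual token; the remaining capsules are zeroed rather than removed, so the tensor shape is unchanged.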
This code is built upon the LXMERT code base. Thanks to Hao Tan for providing excellent code for their model.
For pretraining, we used MSCOCO and VG for image-caption pairs, and Viz7W, VQA v2.0, and GQA for question-image pairs. We followed the instructions provided by LXMERT to prepare the data, with a few changes:
- We removed the GQA validation set from the pretraining data because we use it for grounding evaluation.
- We validate our pretraining on the mscoco-minival split.
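Holding out the GQA validation set could be done with a small filter like the one below. This is only a sketch under stated assumptions: the record layout (a list of dicts carrying an `img_id` field) and the helper name `drop_heldout_images` are hypothetical and are not taken from this repository, and filtering by image id is one possible reading of the first change listed above.

```python
from typing import Dict, Iterable, List


def drop_heldout_images(records: List[Dict],
                        heldout_img_ids: Iterable[str],
                        key: str = "img_id") -> List[Dict]:
    """Return only the records whose image id is not in the held-out set.

    `key` is a hypothetical field name; adjust it to the actual data format.
    """
    heldout = set(heldout_img_ids)
    return [record for record in records if record.get(key) not in heldout]
```

A call such as `drop_heldout_images(pretrain_records, gqa_val_ids)` would then return the pretraining records with the held-out images removed, where both argument names are placeholders.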
To pretrain the backbone, use the following command:
bash run/pretrain_2stage_fulldata_no_init_16_caps.bash
See run/gqa_finetune_caps.bash for finetuning on the GQA dataset.
Finetuning on VQA-HAT is similar to finetuning on GQA. I will keep adding more concrete details in the next few days.
If this work is useful for your research, please cite our paper.
@InProceedings{10.1007/978-3-031-19833-5_38,
author="Khan, Aisha Urooj
and Kuehne, Hilde
and Gan, Chuang
and Lobo, Niels Da Vitoria
and Shah, Mubarak",
editor="Avidan, Shai
and Brostow, Gabriel
and Ciss{\'e}, Moustapha
and Farinella, Giovanni Maria
and Hassner, Tal",
title="Weakly Supervised Grounding for VQA in Vision-Language Transformers",
booktitle="Computer Vision -- ECCV 2022",
year="2022",
publisher="Springer Nature Switzerland",
address="Cham",
pages="652--670",
isbn="978-3-031-19833-5"
}
Please contact aishaurooj@gmail.com with any questions.