Welcome to the official code repository for the paper "A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models" authored by Noriyuki Kojima, Hadar Averbuch-Elor, and Yoav Artzi.
Key to tasks that require reasoning about natural language in visual contexts is grounding words and phrases to image regions. However, observing this grounding in contemporary models is complex, even if it is generally expected to take place if the task is addressed in a way that is conducive to generalization. We propose a framework to jointly study task performance and phrase grounding, together with three benchmarks to study the relation between the two. Our results show that contemporary models demonstrate inconsistency between their ability to ground phrases and to solve tasks. We show how this can be addressed through brute-force training on phrase grounding annotations, and analyze the dynamics it creates.
- Clone the repository.
- Set up the conda environment:
  ```bash
  conda create -n grounding python=3.8
  ```
- Install the necessary dependencies (see the consolidated sketch below):
  ```bash
  pip install -r requirements.txt
  ```
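For reference, the steps above are sketched as a single shell session below. The repository URL and directory name are placeholders (they are not given in this README), and the `conda activate` step is an assumption so that the dependencies land in the `grounding` environment.

```bash
# Clone the repository (placeholder URL; substitute the actual one).
git clone <repository-url>
cd <repository-directory>

# Create and activate the conda environment.
conda create -n grounding python=3.8
conda activate grounding

# Install the Python dependencies into the active environment.
pip install -r requirements.txt
```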
- `READMEs/`: Contains instructions for training and testing models.
- `src/`: Contains scripts to train and test models.
- `data/`: Stores data files.
- `results/`: Stores experimental outcomes, like model checkpoints.
- `media/`: Features images, GIFs, and videos for presentations and PRs.
To prepare and preprocess data, refer to the instructions in the `READMEs/` directory.
To train and test models, refer to the instructions in the `READMEs/` directory.
Licensed under the MIT License.
If you find our work useful in your research, please cite our paper:
```bibtex
@misc{Kojima2023:grounding,
  title = {A Joint Study of Phrase Grounding and Task Performance in Vision and Language Models},
  author = {Noriyuki Kojima and Hadar Averbuch-Elor and Yoav Artzi},
  year = {2023},
  eprint = {},
  archiveprefix = {arXiv}
}
```