This is the official implementation of "Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning", published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Our proposed framework for visual grounding. With the features from the two modalities as input, the visual-linguistic verification module and language-guided context encoder establish discriminative features for the referred object. Then, the multi-stage cross-modal decoder iteratively mulls over all the visual and linguistic features to identify and localize the object.
For different input images and texts, we visualize the verification scores, the iterative attention maps of the multi-stage decoder, and the final localization results.
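For intuition about what a per-location verification score can look like, here is a loose sketch in plain Python. It is not the paper's formulation: the scoring rule (maximum cosine similarity between a visual feature and any word feature) and the names `cosine` and `verification_scores` are assumptions made purely for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def verification_scores(visual_feats, word_feats):
    """Score each visual location by its best match to any word.

    Hypothetical rule for illustration only, not the module used in VLTVG.
    """
    return [max(cosine(v, w) for w in word_feats) for v in visual_feats]

# Toy inputs: three image locations and two word embeddings, all 2-D.
visual_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
word_feats = [[1.0, 0.0], [1.0, 1.0]]

scores = verification_scores(visual_feats, word_feats)
for location, score in enumerate(scores):
    print(f"location {location}: score {score:.3f}")
```

Locations whose features align with the text receive higher scores, which is the kind of signal the visualized verification maps convey.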
The trained models are available on Google Drive.
Backbone | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u | ReferItGame test | Flickr30k test |
---|---|---|---|---|---|---|---|---|---|---|---|
R50 | 84.53 | 87.69 | 79.22 | 73.60 | 78.37 | 64.53 | 72.53 | 74.90 | 73.88 | 71.60 | 79.18 |
R101 | 84.77 | 87.24 | 80.49 | 74.19 | 78.93 | 65.17 | 72.98 | 76.04 | 74.18 | 71.98 | 79.84 |
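The README does not state which metric the table reports. Assuming it is the accuracy commonly used on these benchmarks (a prediction counts as correct when its IoU with the ground-truth box exceeds 0.5), a minimal sketch of that computation could look as follows. The function names, the `(x1, y1, x2, y2)` box format, and the toy boxes are illustrative assumptions, not taken from this repository.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(preds, gts, threshold=0.5):
    """Percentage of predictions whose IoU with the ground truth exceeds threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)

# Toy example: the first prediction overlaps well, the second misses entirely.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 8), (20, 20, 30, 30)]
acc = grounding_accuracy(preds, gts)
print(f"accuracy: {acc:.2f}")
```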
- Clone the repository.
git clone https://github.com/yangli18/VLTVG
- Install PyTorch 1.5+ and torchvision 0.6+.
conda install -c pytorch pytorch torchvision
- Install the other dependencies.
pip install -r requirements.txt
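After installation it can help to confirm that the installed versions meet the stated minimums (PyTorch 1.5+, torchvision 0.6+). The helper below is a small sketch, not part of the repository: `version_tuple` and `meets_minimum` are hypothetical names, and the comparison is done on plain version strings so that it runs without importing either package.

```python
import re

def version_tuple(version):
    """Turn a string such as '1.5.0+cu101' into (1, 5, 0).

    Any local build suffix after '+' is ignored and missing parts are
    padded with zeros, so '1.5' and '1.5.0' compare as equal.
    """
    numeric = version.split("+")[0]
    parts = [int(p) for p in re.findall(r"\d+", numeric)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)

def meets_minimum(installed, minimum):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)

# Comparing tuples avoids the string-comparison pitfall where '0.10' < '0.6'.
print(meets_minimum("1.5.0+cu101", "1.5"))
print(meets_minimum("0.10.0", "0.6"))
print(meets_minimum("1.4.0", "1.5"))
```

With both packages installed, the strings to check are `torch.__version__` and `torchvision.__version__`.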
Please refer to get_started.md for the preparation of the datasets and pretrained checkpoints.
The following is an example of model training on the RefCOCOg dataset.
python -m torch.distributed.launch --nproc_per_node=4 --use_env train.py --config configs/VLTVG_R50_gref.py
We train the model on 4 GPUs with a total batch size of 64 for 90 epochs.
The model and training hyper-parameters are defined in the configuration file VLTVG_R50_gref.py.
We provide configuration files for the other datasets in the configs/ folder.
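The training command above starts 4 processes with `torch.distributed.launch --use_env`, which passes each process its rank through environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`) rather than a command-line argument. The sketch below shows how those variables can be read and how a total batch size of 64 divides across 4 GPUs. The helper names are hypothetical, and whether the repository's configuration stores the total or the per-GPU batch size is not stated here, so treat the split as an illustration of the arithmetic only.

```python
import os

def distributed_env(environ=os.environ):
    """Read the variables exported by torch.distributed.launch --use_env.

    Falls back to single-process defaults when they are not set.
    """
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }

def per_gpu_batch_size(total_batch_size, world_size):
    """Split a total batch size evenly across processes (one per GPU)."""
    if total_batch_size % world_size != 0:
        raise ValueError("total batch size must be divisible by the number of GPUs")
    return total_batch_size // world_size

# Simulated environment of the third of four launched processes.
env = distributed_env({"RANK": "2", "LOCAL_RANK": "2", "WORLD_SIZE": "4"})
batch = per_gpu_batch_size(64, env["world_size"])
print(f"rank {env['rank']} of {env['world_size']}: {batch} samples per step")
```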
Run the following command to evaluate the trained model on a single GPU:
python test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val
Or evaluate the trained model with 4 GPUs:
python -m torch.distributed.launch --nproc_per_node=4 --use_env test.py --config configs/VLTVG_R50_gref.py --checkpoint VLTVG_R50_gref.pth --batch_size_test 16 --test_split val
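When evaluating several configurations or checkpoints, it can be convenient to assemble these commands programmatically. The helper below is a sketch that uses only the flags shown in the two commands above; `build_eval_command` is a hypothetical name and not part of the repository.

```python
import shlex

def build_eval_command(config, checkpoint, split="val", batch_size=16, num_gpus=1):
    """Build the evaluation command as an argument list.

    Uses only the flags that appear in this README. With more than one GPU
    the script is wrapped in torch.distributed.launch.
    """
    cmd = ["python"]
    if num_gpus > 1:
        cmd += ["-m", "torch.distributed.launch",
                f"--nproc_per_node={num_gpus}", "--use_env"]
    cmd += ["test.py",
            "--config", config,
            "--checkpoint", checkpoint,
            "--batch_size_test", str(batch_size),
            "--test_split", split]
    return cmd

single = build_eval_command("configs/VLTVG_R50_gref.py", "VLTVG_R50_gref.pth")
multi = build_eval_command("configs/VLTVG_R50_gref.py", "VLTVG_R50_gref.pth", num_gpus=4)

# Print shell-ready strings; the lists can also be passed to subprocess.run.
single_str = " ".join(shlex.quote(arg) for arg in single)
multi_str = " ".join(shlex.quote(arg) for arg in multi)
print(single_str)
print(multi_str)
```

The argument lists can be passed directly to `subprocess.run(cmd, check=True)` from the repository root.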
If you find our code useful, please cite our paper.
@inproceedings{yang2022vgvl,
title={Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning},
author={Yang, Li and Xu, Yan and Yuan, Chunfeng and Liu, Wei and Li, Bing and Hu, Weiming},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2022}
}
Part of our code is based on the previous works DETR and ReSC.