This is the official code for YORO - Lightweight End-to-End Visual Grounding, accepted at the European Conference on Computer Vision (ECCV) 2022 Workshop on International Challenge on Compositional and Multimodal Perception, Tel Aviv, Israel.
Use environment/environment.yml or environment/environment_cuda102.yml, depending on your CUDA version, to create the environment:
conda env create -f environment/environment.yml
conda activate yoro
python -m spacy download en_core_web_sm
- Comment out any dataset in download.sh that is not needed; downloading all datasets takes a few hours
- Download the datasets to the "./dataset/raw" folder by running
sh download.sh
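Disabling an unwanted dataset amounts to prefixing its line in download.sh with `#`. A minimal sketch of doing this with sed, using a stand-in file since the real script contents are not shown here (the `download_refclef.sh` entry below is hypothetical):

```shell
# Stand-in for download.sh; the real entries will differ.
cat > /tmp/download_demo.sh <<'EOF'
sh download_refcoco.sh
sh download_refclef.sh
EOF
# Comment out the dataset you do not need (here: refclef).
# In sed, "&" re-inserts the matched text after the "# " prefix.
sed -i 's|^sh download_refclef.sh|# &|' /tmp/download_demo.sh
cat /tmp/download_demo.sh
```

Editing download.sh by hand in a text editor works just as well; the sed form is only convenient for scripted setups.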
Convert the raw data to Arrow format:
- Comment out any dataset in preprocess_dataset.py that is not needed
python preprocess_dataset.py
- The preprocessed dataset will be stored in "./dataset/arrow"
cd pretrained_weight
sh download_weight.sh
- Download result.zip from Google Drive
- Unzip it: unzip result.zip
For each eval.sh file under script/DATASET, change the "debug" flag to False to run the full evaluation. Below, we describe how to run eval.sh for the different datasets.
sh script/pretrain/eval.sh
sh script/RefCoco/eval.sh
sh script/RefCocoP/eval.sh
sh script/RefCocog/eval.sh
sh script/copsref/eval.sh
sh script/ReferItGame/eval.sh
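The "debug" flag can be flipped by hand or scripted. A minimal sketch assuming the flag appears in eval.sh as `debug=True` (the exact syntax in the real scripts may differ, so check before applying this):

```shell
# Stand-in for an eval.sh; the real flag syntax may differ.
printf 'python run.py with task_eval debug=True\n' > /tmp/eval_demo.sh
# Switch off debug mode to run the full evaluation
sed -i 's/debug=True/debug=False/' /tmp/eval_demo.sh
cat /tmp/eval_demo.sh
```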
For all run.sh files, please change the "debug" flag to False to run the full training.
For modulated detection pretraining, we start from an MLM-ITM pretrained model, such as the ViLT pretraining checkpoint. For example, the script below trains with 5 det tokens for 40 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/pretrain/run.sh 5 40 1
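The three numbers passed to run.sh are the det-token count, the epoch count, and the GPU count. A hypothetical sketch of how such a script might consume them (variable names are assumptions, not taken from the actual run.sh):

```shell
# Simulate "sh script/pretrain/run.sh 5 40 1" by setting positional args.
set -- 5 40 1
NUM_DET_TOKENS=$1   # number of det tokens
MAX_EPOCHS=$2       # training epochs
NUM_GPUS=$3         # number of GPUs
echo "det_tokens=$NUM_DET_TOKENS epochs=$MAX_EPOCHS gpus=$NUM_GPUS"
```

The same three-argument pattern applies to every run.sh in the sections below.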
For the RefCoco dataset, we load the pretraining checkpoint as the initial weights. For example, the script below trains with 5 det tokens for 10 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/RefCoco/run.sh 5 10 1
For the RefCoco+ dataset, we load the pretraining checkpoint as the initial weights. For example, the script below trains with 5 det tokens for 10 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/RefCocoP/run.sh 5 10 1
For the RefCocog dataset, we load the pretraining checkpoint as the initial weights. For example, the script below trains with 5 det tokens for 10 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/RefCocog/run.sh 5 10 1
For the copsref dataset, we load the pretraining checkpoint as the initial weights. For example, the script below trains with 5 det tokens for 40 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/copsref/run.sh 5 40 1
For the ReferItGame/RefClef dataset, we load the pretraining checkpoint as the initial weights. For example, the script below trains with 5 det tokens for 40 epochs on 1 GPU. Please refer to the comments in the script for more details.
sh script/ReferItGame/run.sh 5 40 1
If you find this method useful in your research, please cite this article:
@inproceedings{ho2022yoro,
title={YORO-Lightweight End to End Visual Grounding},
author={Ho, Chih-Hui and Appalaraju, Srikar and Jasani, Bhavan and Manmatha, R and Vasconcelos, Nuno},
booktitle={ECCV 2022 Workshop on International Challenge on Compositional and Multimodal Perception},
year={2022}
}
Please email Chih-Hui (John) Ho (chh279@eng.ucsd.edu) if you encounter further issues. We heavily used the code from