A TensorFlow implementation of the phrase detection framework by Bryan Plummer (bplum@bu.edu), as described in "Revisiting Image-Language Networks for Open-ended Phrase Detection". This repository is based on the TensorFlow implementation of Faster R-CNN available here, which in turn was based on the Python Caffe implementation of Faster R-CNN available here.
- A basic TensorFlow installation. The code follows the r1.2 format.
- Python packages you might not have:
nltk
cython
opencv-python
easydict==1.6
scikit-image
pyyaml
Code was tested using Python 2.7.
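Before building, it can help to confirm that the packages listed above are importable. The following is a minimal sketch and not part of the repository: the helper name `missing_packages` is hypothetical, and the import names (`Cython` for cython, `cv2` for opencv-python, `skimage` for scikit-image, `yaml` for pyyaml) are the usual module names for these packages. It does not check versions, so the `easydict==1.6` pin still has to be verified separately.

```python
import importlib

# pip package name -> module name used at import time
REQUIRED = {
    "tensorflow": "tensorflow",
    "nltk": "nltk",
    "cython": "Cython",
    "opencv-python": "cv2",
    "easydict": "easydict",
    "scikit-image": "skimage",
    "pyyaml": "yaml",
}


def missing_packages(required=None):
    """Return the pip names of packages whose module cannot be imported."""
    if required is None:
        required = REQUIRED
    missing = []
    for pip_name, module_name in sorted(required.items()):
        try:
            importlib.import_module(module_name)
        except ImportError:
            missing.append(pip_name)
    return missing
```

Calling `missing_packages()` with no arguments returns an empty list when every listed package can be imported; anything it returns can be installed with pip.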
- Clone the repository
git clone --recursive https://github.com/BryanPlummer/phrase_detection.git
We shall refer to the repo's root directory as $ROOTDIR
- Update the -arch flag in the setup script to match your GPU
cd $ROOTDIR/lib
# Change the GPU architecture (-arch) if necessary
vim setup.py
- Download the pretrained COCO models for the desired network, which were released in this repo. By default, the code assumes they have been unpacked in a directory called pretrained under $ROOTDIR. For example, after downloading the res101 COCO models, you would use:
mkdir $ROOTDIR/pretrained
cd $ROOTDIR/pretrained
tar zxvf $DOWNLOADDIR/coco_900-1190k.tgz
- Download a pretrained word embedding. By default, the code assumes you have downloaded the HGLMM 6K-D vectors from here and placed the unzipped file in the data directory. If you want to use a different word embedding, please update the pointer to the embedding file and its dimensions in lib/model/config.py. E.g.,
cd $ROOTDIR/data
unzip $DOWNLOADDIR/hglmm_6kd.zip
- Download and unpack the Flickr30K Entities and ReferIt Game datasets and build the modules and vocabularies from $ROOTDIR using,
./data/scripts/fetch_datasets.sh
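Once the steps above are done, a quick check that the expected directories exist and are populated can catch a misplaced download before training starts. This is a minimal sketch under stated assumptions: the helper name `missing_dirs` is hypothetical, and it only looks for the two directory names mentioned above (pretrained and data) rather than for specific files, since the exact file names depend on which models and embeddings you downloaded.

```python
import os


def missing_dirs(root_dir, expected=("pretrained", "data")):
    """Return the expected subdirectories of root_dir that are absent or empty."""
    problems = []
    for name in expected:
        path = os.path.join(root_dir, name)
        # A directory that exists but contains nothing is treated as a problem too.
        if not os.path.isdir(path) or not os.listdir(path):
            problems.append(name)
    return problems
```

For example, `missing_dirs(os.environ["ROOTDIR"])` returns an empty list when both directories are present and non-empty, assuming the ROOTDIR environment variable is set to the repository root.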
Assuming you completed the installation steps correctly, you should be able to train a model with,
./experiments/scripts/train_phrase_detector.sh [GPU_ID] [DATASET] [NET] [TAG]
# GPU_ID is the GPU you want to train on
# NET in {vgg16, res50, res101, res152} is the network arch to use
# DATASET {flickr, referit} is defined in train_phrase_detector.sh
# TAG is an experiment name
# Examples:
./experiments/scripts/train_phrase_detector.sh 0 flickr res101 default
./experiments/scripts/train_phrase_detector.sh 1 referit res101 default
This will train the model without the augmented phrases. To train with augmented phrases, use:
./experiments/scripts/train_augmented_phrase_detector.sh [GPU_ID] [DATASET] [NET] [TAG]
You can test your models using,
./experiments/scripts/test_phrase_detector.sh [GPU_ID] [DATASET] [NET] [TAG]
# GPU_ID is the GPU you want to test on
# NET in {vgg16, res50, res101, res152} is the network arch to use
# DATASET {flickr, referit} is defined in test_phrase_detector.sh
# TAG is an experiment name
# Examples:
./experiments/scripts/test_phrase_detector.sh 0 flickr res101 default
./experiments/scripts/test_phrase_detector.sh 1 referit res101 default
Analogously, to test with augmented phrases use:
./experiments/scripts/test_augmented_phrase_detector.sh [GPU_ID] [DATASET] [NET] [TAG]
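When launching several experiments, it can be convenient to assemble these command lines programmatically and validate the arguments first. The sketch below is not part of the repository; `build_command` is a hypothetical helper that only reproduces the four script names and the argument order shown above, and it rejects dataset or network names outside the sets listed in the comments.

```python
DATASETS = ("flickr", "referit")
NETS = ("vgg16", "res50", "res101", "res152")


def build_command(mode, gpu_id, dataset, net, tag="default", augmented=False):
    """Return the argument list for one of the four experiment scripts."""
    if mode not in ("train", "test"):
        raise ValueError("mode must be 'train' or 'test'")
    if dataset not in DATASETS:
        raise ValueError("unknown dataset: %s" % dataset)
    if net not in NETS:
        raise ValueError("unknown network: %s" % net)
    # The augmented variants differ only in the script name.
    prefix = "augmented_" if augmented else ""
    script = "./experiments/scripts/%s_%sphrase_detector.sh" % (mode, prefix)
    return [script, str(gpu_id), dataset, net, tag]
```

The returned list can be passed to `subprocess.check_call(cmd, cwd=root_dir)`, where `root_dir` is the repository root, so that the relative script path resolves correctly.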
By default, trained networks are saved under:
output/[NET]/[DATASET]/[TAG]/
If you find our code useful, please consider citing:
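To locate the results of a given run from a script, the path pattern above can be filled in directly. This is a small sketch with a hypothetical helper name, `output_dir`; it assumes the three components are substituted literally, and what [DATASET] actually expands to is determined by the experiment scripts, so it may differ from the short name passed on the command line.

```python
import os


def output_dir(net, dataset, tag="default", root_dir="."):
    """Return the default save directory following output/[NET]/[DATASET]/[TAG]/."""
    return os.path.join(root_dir, "output", net, dataset, tag)
```

Checking `os.path.isdir(output_dir(net, dataset, tag))` before testing is a cheap way to confirm that a training run with the same NET, DATASET, and TAG has written its output.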
@article{plummerPhrasedetection,
title={Revisiting Image-Language Networks for Open-ended Phrase Detection},
author={Bryan A. Plummer and Kevin J. Shih and Yichen Li and Ke Xu and Svetlana Lazebnik and Stan Sclaroff and Kate Saenko},
journal={arXiv:1811.07212},
year={2018}
}