This repository contains an implementation of visual-semantic embedding. Training and evaluation are done on the MSCOCO dataset.
- python>=3.7
- numpy
- matplotlib
- pytorch>=1.1.0
- torchvision
- Pillow
- faiss-cpu (for nearest-neighbor search)
- accimage (optional, for fast image loading)
- torchtext (for vocabulary)
- spacy (for the spacy tokenizer)
Run the command below before training:

$ python -m spacy download en
The environment.yml file contains environment details for Anaconda users. For a simple setup, run:

conda env create -f environment.yml && conda activate mse
Go to the directory where the data should be stored and run download_coco.sh. This directory is denoted $ROOTPATH below.
$ python train.py --root_path $ROOTPATH
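If the training objective follows VSE++ (Faghri et al.), it minimizes a hinge-based triplet loss using the hardest in-batch negatives. The sketch below is a minimal numpy illustration of that objective under this assumption; the function name and details are illustrative, not this repository's actual code:

```python
import numpy as np

def vse_pp_loss(im, s, margin=0.2):
    """VSE++-style max-of-hinges loss (illustrative sketch).

    im, s: (batch, dim) image and caption embeddings; row i of each is a
    matching pair. Only the hardest negative per anchor contributes.
    """
    # L2-normalize so dot products are cosine similarities
    im = im / np.linalg.norm(im, axis=1, keepdims=True)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    scores = im @ s.T                         # (batch, batch) similarity matrix
    diag = scores.diagonal()[:, None]         # positive-pair similarities
    # hinge costs for caption retrieval (rows) and image retrieval (columns)
    cost_s = np.clip(margin + scores - diag, 0, None)
    cost_im = np.clip(margin + scores - diag.T, 0, None)
    # mask out the positive pairs on the diagonal
    n = scores.shape[0]
    cost_s[np.arange(n), np.arange(n)] = 0
    cost_im[np.arange(n), np.arange(n)] = 0
    # VSE++: keep only the hardest negative per anchor
    return cost_s.max(axis=1).sum() + cost_im.max(axis=0).sum()
```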
| Model | R@1 | R@5 | R@10 | Med r |
|---|---|---|---|---|
| VSE++ | 41.3 | 71.1 | 81.2 | 2.0 |
| Our Implementation | 31.7 | 61.5 | 72.6 | 3.0 |

| Model | R@1 | R@5 | R@10 | Med r |
|---|---|---|---|---|
| VSE++ | 30.3 | 59.4 | 72.4 | 4.0 |
| Our Implementation | 22.4 | 48.8 | 61.9 | 6.0 |
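In these tables, R@K is the percentage of queries whose ground-truth match appears among the top K retrieved items, and Med r is the median rank of the ground truth. A small sketch of how such metrics can be computed, assuming one matching item per query on the diagonal of the score matrix (helper names are illustrative):

```python
import numpy as np

def recall_at_k(scores, k):
    """scores[i, j] = similarity of query i to item j; item i matches query i."""
    order = np.argsort(-scores, axis=1)  # best match first
    # 0-based rank at which the ground-truth item appears for each query
    ranks = np.argmax(order == np.arange(len(scores))[:, None], axis=1)
    return 100.0 * np.mean(ranks < k)

def median_rank(scores):
    """Median rank of the ground-truth item (1-based, as reported in tables)."""
    order = np.argsort(-scores, axis=1)
    ranks = np.argmax(order == np.arange(len(scores))[:, None], axis=1)
    return float(np.median(ranks) + 1)
```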
$ python eval.py --root_path $ROOTPATH --checkpoint hogehoge.ckpt --image_path $IMAGE --caption $CAPTION
$IMAGE denotes the path to the reference image (defaults to samples/sample1.jpg). $CAPTION denotes the reference caption (defaults to "the cat is walking on the street").
Retrieval is done on the MSCOCO validation set.
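Under the hood, retrieval is a nearest-neighbor search over the joint embedding space; faiss-cpu (listed in the requirements) accelerates this. The brute-force numpy equivalent below is only an illustration of the idea, not this repository's code:

```python
import numpy as np

def nearest_captions(image_emb, caption_embs, k=5):
    """Return indices and similarities of the k captions closest to an image."""
    # cosine similarity = inner product after L2 normalization
    q = image_emb / np.linalg.norm(image_emb)
    db = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    sims = db @ q
    top = np.argsort(-sims)[:k]  # indices of the k most similar captions
    return top, sims[top]
```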
- add Flickr8k
- add Flickr30k
- clean up validation
- find optimal hyperparameters