This is the PyTorch code repository for the ECIR 2022 reproducibility track submission "Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?" by Maurits Bleeker and Maarten de Rijke.
This project is built on top of the code of VSE++ and VSRN, and uses parts of the SmoothAP code.
Make sure to install anaconda.
$ conda env create -f environment.yml
In the root of this repository, create the following folders.
$ mkdir data
$ mkdir out
For the VSRN method, we used all the data (train/test/val + vocab) as provided by the authors.
For VSE++, we reformatted the data into a pickle file and changed the data loader to improve training time.
The data for the Flickr30k dataset (train/test/val + vocab) can be downloaded here.
The MS-COCO dataset is too big to upload as one file. However, if the MS-COCO caption data is formatted from dataset_coco.json according to the format below, the experiments can be executed.
In /notebooks, we provide an example script that shows how to format the MS-COCO data. Be aware that this is not the script we used for our experiments, so there might be some differences.
The vocab file for MS-COCO can be found here.
{
    'images': {
        img_id: {
            'image': str(binary_img_file),
            'filename': str(file_name),
            'caption_ids': [caption_id_1, ...],
        },
    },
    'captions': {
        caption_id: {
            'caption': caption['raw'],
            'imgid': caption['imgid'],
            'tokens': caption['tokens'],
            'filename': str(file_name),
        },
    },
}
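For concreteness, the snippet below sketches how such a pickle file could be built from dataset_coco.json. It is a minimal sketch, not the notebook or script we used: the assumed dataset_coco.json fields ('images', 'filename', 'filepath', 'imgid', 'sentences', 'raw', 'tokens', 'sentid'), the image root folder, and the output filename are assumptions that should be checked against your copy of the file and the data loader.

import json
import pickle

# Minimal sketch (not the script used for the paper): build the pickle
# structure above from the Karpathy-split dataset_coco.json. Field names,
# the image root, and the output filename are assumptions; verify them
# against your copy of the file and the data loader.
image_root = 'data/coco/images'

with open('dataset_coco.json') as f:
    karpathy = json.load(f)

images, captions = {}, {}
for img in karpathy['images']:
    # 'filepath' holds the train2014/val2014 subfolder in the Karpathy split file.
    path = f"{image_root}/{img.get('filepath', '')}/{img['filename']}"
    with open(path, 'rb') as img_file:
        binary_img_file = img_file.read()  # store whatever representation your data loader expects
    images[img['imgid']] = {
        'image': binary_img_file,
        'filename': img['filename'],
        'caption_ids': [sent['sentid'] for sent in img['sentences']],
    }
    for sent in img['sentences']:
        captions[sent['sentid']] = {
            'caption': sent['raw'],
            'imgid': img['imgid'],
            'tokens': sent['tokens'],
            'filename': img['filename'],
        }

with open('coco_formatted.pkl', 'wb') as out_file:
    pickle.dump({'images': images, 'captions': captions}, out_file)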
This research has been conducted using a compute cluster in combination with SLURM. In the folder /jobs, we provide all job files used to run the experiments. To run the job files yourself, change all the {path to repo} and {path to vocab} placeholders to the correct folders or use an environment variable. All hyper-parameters are given in the hyper-parameter files in /jobs/{dataset}/paper_experiments/baseline/*hyper_params.txt.
To start training (one experiment) from the command line, use the following examples:
$ python vsepp/train.py --data_name=coco --data_path={path to repo}/data/vsepp/coco --vocab_path={path to repo}/vsepp/coco --cnn_type=resnet50 --workers=10 --experiment_name=coco_triplet_max --criterion triplet --max_violation --logger_name {path to repo}/out/vsepp/out/coco/paper_experiments/triplet_max/0
$ python VSRN/train.py --data_name=coco_precomp --data_path={path to repo}/data/vsrn/coco --vocab_path={path to repo}/vsrn/coco --cnn_type=resnet50 --workers=10 --max_len 60 --experiment_name=coco_triplet_max --criterion triplet --max_violation --logger_name {path to repo}/out/vsrn/out/coco/paper_experiments/triplet_max/0 --lr_update 15
We run each experiment five times. The model files for each experiment are stored at {path to repo}/out/{method}/out/{dataset}/paper_experiments/{loss function}/{0,1,2,3,4}.
In the *.job and *hyper_params.txt files, you can set the following parameters:
--data_name: Name of the dataset used in this work. For VSE++ use f30k and coco; for VSRN use f30k_precomp and coco_precomp.
--data_path: Path where the datasets are stored.
--vocab_path: Path where the vocab files are stored.
--max_violation: Only use the maximum-violating triplet in the batch per query, i.e., Triplet loss SH (see the loss sketch after this list). Only possible in combination with the triplet loss.
--cnn_type: For VSE++ only; use resnet50.
--workers: Number of workers for the DataLoader; the default is 10.
--max_len: For VSRN only; maximum length of the caption.
--experiment_name: Name of the experiment, for wandb. Options used: triplet_max, triplet, ntxent, smoothap.
--criterion: Loss function used for training. Options used: triplet, ntxent, ranking (SmoothAP).
--ranking_based: Flag to indicate that we optimize a ranking, i.e., that we want to use all the captions that match an image.
--tau: Tau hyper-parameter value for SmoothAP and NT-Xent.
--logger_name: Path where all the log files are stored. Use the following format: {path to repo}/out/{method}/out/{dataset}/paper_experiments/{loss}/{experiment number}. For {loss} use triplet_max, triplet, ntxent, or ranking.
--num_epochs: Number of training epochs.
--lr_update: Drop the learning rate by a factor of 0.1 after n epochs.
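To make the --max_violation option concrete, the sketch below shows a bidirectional triplet loss with an optional hardest-negative reduction in PyTorch, in the spirit of the VSE++ loss. It assumes L2-normalised (N, D) embedding batches where row i of the image batch matches row i of the caption batch; it is an illustration, not the exact implementation in this repository.

import torch

def triplet_loss(im, s, margin=0.2, max_violation=False):
    # Sketch only, not the repository's implementation.
    # im, s: (N, D) L2-normalised image and caption embeddings; row i of im matches row i of s.
    scores = im @ s.t()                              # pairwise cosine similarities
    diagonal = scores.diag().view(-1, 1)
    d1 = diagonal.expand_as(scores)                  # positive score per image (rows)
    d2 = diagonal.t().expand_as(scores)              # positive score per caption (columns)

    cost_s = (margin + scores - d1).clamp(min=0)     # caption-retrieval violations
    cost_im = (margin + scores - d2).clamp(min=0)    # image-retrieval violations

    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)             # ignore the positive pairs themselves
    cost_im = cost_im.masked_fill(mask, 0)

    if max_violation:                                # keep only the hardest negative per query
        cost_s = cost_s.max(1)[0]
        cost_im = cost_im.max(0)[0]
    return cost_s.sum() + cost_im.sum()

With max_violation=False the loss sums over all negatives in the batch; with max_violation=True only the hardest negative per query contributes, which corresponds to the Triplet loss SH setting described above.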
The rest of the method-specific parameters can be found in vsepp/train.py and VSRN/train.py.
In /evaluation, we provide the notebooks used to generate all the result tables from Sections 3 and 5.
The file Evaluation notebook.ipynb is used to generate Table 3.
The gradient-counting results are generated using Gradient counting-Coco-VSE++.ipynb, Gradient counting-Coco-VSRN.ipynb, Gradient counting-F30k-VSE++-.ipynb, and Gradient counting-F30k-VSRN.ipynb.
In total, we have trained 5 * 4 * 2 * 2 = 80 models. It is not feasible to share all the parameter files. Here we provide the model checkpoints for experiments 1.2, 1.6, 2.2, and 2.6 (the experiments with Triplet loss SH/Triplet loss max).
Be aware that the checkpoint files contain a copy of all the hyper-parameters used for training, and that some of the paths in this stored config might point to folders that were set while training the models. These paths should be changed to match your own setup.
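As a hedged example of how these stored paths can be updated (the checkpoint key 'opt' and the option attributes below are assumptions based on VSE++/VSRN-style checkpoints; check the save/load code in vsepp/train.py and VSRN/train.py for the exact structure):

import torch

# Hedged sketch: patch the stored training options in a downloaded checkpoint.
# The key name 'opt' and the option attributes are assumptions; verify them
# against the checkpoint structure used in vsepp/train.py or VSRN/train.py.
checkpoint = torch.load('model_best.pth.tar', map_location='cpu')
opt = checkpoint['opt']
opt.data_path = '{path to repo}/data/vsepp/coco'    # point to your local data folder
opt.vocab_path = '{path to repo}/vsepp/coco'        # and your local vocab folder
opt.logger_name = '{path to repo}/out/vsepp/out/coco/paper_experiments/triplet_max/0'
torch.save(checkpoint, 'model_best_patched.pth.tar')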
If you use this code to produce results for your scientific work, please cite our ECIR 2022 paper:
@inproceedings{bleeker2022lessons,
title={Do Lessons from Metric Learning Generalize to Image-Caption Retrieval?},
author={Bleeker, Maurits and Rijke, Maarten de},
booktitle={European Conference on Information Retrieval},
pages={535--551},
year={2022},
organization={Springer}
}