This is the code for reproducing the results from our paper ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval, accepted at CBMI 2022.
Our code is based on OSCAR, whose repository is available here.
- Python 3.7
- PyTorch 1.2
- torchvision 0.4.0
- CUDA 10.0
# create a new environment
conda create --name oscar python=3.7
conda activate oscar
# install PyTorch 1.2
conda install pytorch==1.2.0 torchvision==0.4.0 cudatoolkit=10.0 -c pytorch
export INSTALL_DIR=$PWD
# install apex
cd $INSTALL_DIR
git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout f3a960f80244cf9e80558ab30f7f7e8cbf03c0a0
python setup.py install --cuda_ext --cpp_ext
# install this repo
cd $INSTALL_DIR
git clone --recursive https://github.com/mesnico/OSCAR-TERAN-distillation
cd OSCAR-TERAN-distillation/coco_caption
./get_stanford_models.sh
cd ..
python setup.py build develop
# install requirements
pip install -r requirements.txt
unset INSTALL_DIR
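Since the setup pins fairly old versions, it can be useful to verify the environment before training. The following is a minimal sketch, not part of this repository: `PINNED` and `find_mismatches` are hypothetical names, and the pinned values are simply copied from the requirements listed above.

```python
from typing import Dict, List, Tuple

# Versions copied from the requirements list in this README.
PINNED = {"python": "3.7", "torch": "1.2", "torchvision": "0.4.0"}


def _parts(version: str) -> Tuple[str, ...]:
    # Drop a local build suffix such as "+cu100" and split on dots.
    return tuple(version.split("+")[0].split("."))


def find_mismatches(installed: Dict[str, str]) -> List[str]:
    """Return one message per package whose version differs from the pin."""
    problems = []
    for name, wanted in PINNED.items():
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not found (expected {wanted})")
            continue
        want = _parts(wanted)
        # Compare only as many components as the pin specifies,
        # so "1.2.0" satisfies a pin of "1.2" but "1.20.0" does not.
        if _parts(found)[: len(want)] != want:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems
```

Inside the activated environment you could pass the real values, for example `torch.__version__`, `torchvision.__version__` and `platform.python_version()`, and print whatever the function returns.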
Download the checkpoint folder with azcopy:
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/model_ckpts/coco_ir/base/checkpoint-0132780/' <checkpoint-target-folder> --recursive
Download the IR data:
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/datasets/coco_ir' <data-folder> --recursive
Download the pre-extracted Bottom-Up features:
path/to/azcopy copy 'https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/' <features-folder> --recursive
cd alad
python train.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --config configs/<config>.yaml --logger_name <output-folder> --val_step 7000 --max_seq_length 50 --max_img_seq_length 34
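Because the command is long and some options depend on the chosen configuration, it may help to assemble it in a small script. The sketch below is only an illustration: `build_train_command` and `TEACHER_REQUIRED` are hypothetical names, the flags and fixed values are copied verbatim from the command above, and the rule that `alad-matching-distill-finetune` needs `--load-teacher-model` follows the configuration notes in this README.

```python
from typing import List, Optional

# Configuration that, according to this README, needs a trained teacher.
TEACHER_REQUIRED = {"alad-matching-distill-finetune"}


def build_train_command(
    data_dir: str,
    features_dir: str,
    vinvl_checkpoint_dir: str,
    config: str,
    output_dir: str,
    teacher_checkpoint: Optional[str] = None,
) -> List[str]:
    """Build the argument list for train.py (to be run from the alad folder)."""
    if config in TEACHER_REQUIRED and teacher_checkpoint is None:
        raise ValueError(f"{config} requires a teacher checkpoint")
    cmd = [
        "python", "train.py",
        "--data_dir", f"{data_dir}/coco_ir",
        "--img_feat_file", f"{features_dir}/features.tsv",
        "--eval_model_dir", f"{vinvl_checkpoint_dir}/checkpoint-0132780",
        "--config", f"configs/{config}.yaml",
        "--logger_name", output_dir,
        "--val_step", "7000",
        "--max_seq_length", "50",
        "--max_img_seq_length", "34",
    ]
    if teacher_checkpoint is not None:
        # Flag spelling copied from this README.
        cmd += ["--load-teacher-model", teacher_checkpoint]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, cwd="alad", check=True)`, which runs the command from inside the alad folder as the instructions above require.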
The --config parameter selects the training setup. Configurations are stored in YAML format inside the configs folder:
- alad-alignment-triplet.yaml: Trains the alignment head using a hinge-based triplet ranking loss, also fine-tuning the VinVL backbone.
- alad-matching-triplet-finetune.yaml: Trains only the matching head using a hinge-based triplet ranking loss. The --load-teacher-model parameter can be used to provide a backbone previously trained with the alad-alignment-triplet.yaml configuration.
- alad-matching-distill-finetune.yaml: Trains only the matching head by distilling the scores from the alignment head. In this case the --load-teacher-model parameter is required, and it must point to an alignment head previously trained with the alad-alignment-triplet.yaml configuration.
- alad-matching-triplet-e2e.yaml: Trains the matching head, also fine-tuning the VinVL backbone.
- alad-alignment-and-matching-distill.yaml: Trains the whole architecture (matching and alignment heads) end-to-end. The activate_distillation_after variable inside the configuration file controls how many epochs to wait before activating the distillation loss, so that the backbone has time to become minimally stable; alternatively, you can load a pre-trained backbone using the --load-teacher-model option.
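To make the two ingredients mentioned in these configurations more concrete, here is a minimal pure-Python sketch of a hinge-based triplet ranking loss and of a delayed activation of the distillation term. It is not the code used in this repository: the function names, the margin of 0.2, the choice of summing over all negatives (rather than mining the hardest one) and the `epoch >= activate_distillation_after` convention with zero-based epochs are all assumptions made for illustration.

```python
from typing import List


def hinge_triplet_loss(sim: List[List[float]], margin: float = 0.2) -> float:
    """Sum-of-hinges triplet ranking loss over a square similarity matrix.

    sim[i][j] is the similarity between image i and caption j; matching
    pairs are assumed to lie on the diagonal.
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i as query, caption j as negative.
            loss += max(0.0, margin + sim[i][j] - pos)
            # Caption i as query, image j as negative.
            loss += max(0.0, margin + sim[j][i] - pos)
    return loss


def distillation_weight(epoch: int, activate_distillation_after: int) -> float:
    """Return 0.0 while waiting and 1.0 once distillation is active."""
    return 1.0 if epoch >= activate_distillation_after else 0.0
```

In an end-to-end setup, the total loss could then be formed as the triplet loss plus `distillation_weight(epoch, n)` times a distillation term, so that the distillation contributes nothing during the first `n` epochs.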
Training and validation metrics, as well as model checkpoints, are saved inside the <output-folder> path.
You can monitor all the metrics live using TensorBoard:
tensorboard --logdir <output-folder>
The following script tests a model on the 1k MS-COCO test set (you can download our best model from here; it was obtained with the alad-alignment-and-matching-distill.yaml configuration):
cd alad
python test.py --data_dir <data-folder>/coco_ir --img_feat_file <features-folder>/features.tsv --eval_model_dir <checkpoint-target-folder>/checkpoint-0132780 --max_seq_length 50 --max_img_seq_length 34 --eval_img_keys_file test_img_keys_1k.tsv --load_checkpoint <path/to/checkpoint.pth.tar>
To test on the 5k test set, simply set --eval_img_keys_file test_img_keys.tsv.
If you find this code useful, please cite the following paper:
@inproceedings{messina2022aladin,
title={ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval},
author={Messina, Nicola and Stefanini, Matteo and Cornia, Marcella and Baraldi, Lorenzo and Falchi, Fabrizio and Amato, Giuseppe and Cucchiara, Rita},
booktitle={International Conference on Content-based Multimedia Indexing},
pages={64--70},
year={2022}
}