This is the implementation of our paper entitled "Improving distinctiveness in video captioning with text-video similarity".
Our approach enhances the distinctiveness of video captioning by integrating video retrieval into the training process. We calculate similarity scores between the generated text and videos, incorporating them into the training loss. Additionally, we use reference scores, representing similarity between ground truth sentences and videos, to scale the training loss. This guides the model to generate sentences that closely match the desired level of distinctiveness indicated by the reference scores.
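For intuition only (not the exact formulation from the paper or from policy_gradient.py; the function and variable names below are hypothetical), the idea can be sketched as a reward-weighted captioning loss in which the text-video similarity of a sampled caption is compared against that of the ground-truth caption:

```python
import torch

def similarity_scaled_loss(log_probs, sim_generated, sim_reference):
    """Hypothetical sketch, not the paper's exact loss: weight a
    REINFORCE-style captioning objective by how the sampled caption's
    text-video similarity compares to the ground-truth (reference) one."""
    # advantage > 0 when the sampled caption matches its video more
    # distinctively than the ground-truth caption does
    advantage = sim_generated - sim_reference              # shape: (batch,)
    # maximize the reward-weighted log-likelihood of the sampled captions
    return -(advantage.detach() * log_probs.sum(dim=-1)).mean()

# toy usage with random values
log_probs = torch.randn(4, 12)  # per-token log-probs of 4 sampled captions (toy values)
sim_gen = torch.rand(4)         # e.g. CLIP4Clip similarity: sampled captions vs. videos
sim_ref = torch.rand(4)         # e.g. CLIP4Clip similarity: ground-truth captions vs. videos
print(similarity_scaled_loss(log_probs, sim_gen, sim_ref))
```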
Experiments on MSVD and MSR-VTT demonstrate that our method improves video captioning both quantitatively and qualitatively.
An illustration of our proposed method is shown below:
Create the conda environment from the provided environment.yml file.
This conda environment was tested on an NVIDIA RTX 3090.
The details of each dependency can be found in the environment.yml file.
conda env create -f environment.yml
conda activate rl
pip install git+https://github.com/Maluuba/nlg-eval.git@master
pip install pycocoevalcap
Install torch following this page: https://pytorch.org/get-started/locally
pip install opencv-python
pip install seaborn
pip install boto3
pip install ftfy
pip install h5py
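Optionally, a quick sanity check that the environment above is complete and can see the GPU:

```python
# Quick sanity check for the environment installed above.
import torch, cv2, h5py, ftfy, seaborn, boto3

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3090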
├── dataset
│   ├── MSVD
│   │   ├── raw                      # put the 1,970 raw videos in here
│   │   ├── captions
│   │   ├── raw-captions_mapped.pkl  # mapping between video ids and their captions
│   │   ├── train_list_mapping.txt
│   │   ├── val_list_mapping.txt
│   │   ├── test_list_mapping.txt
│   ├── MSRVTT
│   │   ├── raw                      # put the 10,000 raw videos in here
│   │   ├── msrvtt.csv               # list of video ids in the MSR-VTT dataset
│   │   ├── MSRVTT_data.json         # metadata of the MSR-VTT dataset, including video URLs, video ids, and captions
Raw videos for MSVD can be downloaded from this link.
Raw videos for MSR-VTT can be downloaded from this link.
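To verify the caption files are in place, the snippet below prints one entry of raw-captions_mapped.pkl from the tree above (a hedged sketch: it assumes the pickle stores a dict mapping each video id to a list of captions, as its description suggests):

```python
import pickle

# Hedged sketch: assumes raw-captions_mapped.pkl is a dict of
# {video_id: [caption, ...]}, as described in the folder tree above.
with open("dataset/MSVD/raw-captions_mapped.pkl", "rb") as f:
    captions = pickle.load(f)

video_id, video_captions = next(iter(captions.items()))
print(video_id, "->", len(video_captions), "captions")
print(video_captions[0])
```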
- Download the extracted video features from this link.
- Put the extracted video features into the ./features folder (a quick way to inspect them is sketched below).
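The storage format of the downloaded features depends on the extraction pipeline; since h5py is among the dependencies above, the sketch below assumes an HDF5 file keyed by video id (the file name is a placeholder; adjust it and the keys to whatever the downloaded archive actually contains):

```python
import h5py

# Hedged sketch: assumes the downloaded features are stored as an HDF5
# file whose datasets are keyed by video id. The file name is a placeholder.
with h5py.File("features/msvd_features.h5", "r") as f:
    video_ids = list(f.keys())
    print(len(video_ids), "videos; example feature shape:", f[video_ids[0]].shape)
```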
In our paper, we use CLIP4Caption and CLIP4Clip as our video captioning and video retrieval models, respectively.
- Clone our implementation of CLIP4Caption to the root folder.
git clone https://github.com/Sejong-VLI/V2T-CLIP4Caption-Reproduction.git
- Rename the folder as you want.
- Modify the imports in <VIDEOCAPTIONINGFOLDER>/modules/modeling.py as follows:
from <VIDEOCAPTIONINGFOLDER>.modules.until_module import PreTrainedModel, LayerNorm, CrossEn
from <VIDEOCAPTIONINGFOLDER>.modules.module_bert import BertModel, BertConfig
from <VIDEOCAPTIONINGFOLDER>.modules.module_visual import VisualModel, VisualConfig, VisualOnlyMLMHead
from <VIDEOCAPTIONINGFOLDER>.modules.module_decoder import DecoderModel, DecoderConfig
- Modify the imports in <VIDEOCAPTIONINGFOLDER>/modules/until_module.py as follows:
from <VIDEOCAPTIONINGFOLDER>.modules.until_config import PretrainedConfig
- Replace <VIDEOCAPTIONINGFOLDER>/dataloaders with the provided dataloaders folder.
Note: Please replace <VIDEOCAPTIONINGFOLDER> with the folder name you have chosen. For example, if your folder name is CLIP4Caption, then the import in policy_gradient.py will be from CLIP4Caption.modules.tokenization import BertTokenizer.
- Clone the implementation of CLIP4Clip to the root folder.
git clone https://github.com/ArrowLuo/CLIP4Clip.git
- Rename the folder as you want.
- Replace <VIDEORETRIEVALFOLDER>/modules/modeling.py with the provided modeling.py.
- Replace <VIDEORETRIEVALFOLDER>/modules/tokenization_clip.py with the provided tokenization_clip.py.
- Modify the imports in <VIDEORETRIEVALFOLDER>/modules/until_module.py as follows:
from <VIDEORETRIEVALFOLDER>.modules.until_config import PretrainedConfig
Note: Please replace <VIDEORETRIEVALFOLDER> with the folder name you have chosen. For example, if your folder name is CLIP4Clip, then the import in /modules/until_module.py will be from CLIP4Clip.modules.until_config import PretrainedConfig.
The folder structure after setting up the video captioning and video retrieval repositories should look as follows:
├── <VIDEOCAPTIONINGFOLDER>
├── <VIDEORETRIEVALFOLDER>
├── dataset
├── features
├── pretrained
├── environment.yml
├── policy_gradient.py
├── train.py
├── converter.py
├── retrieval_utils.py
Download the pretrained model from this link and put it into the ./pretrained folder.
- Initialize our CLIP4Caption with the UniVL pretrained weights.
mkdir -p ./<VIDEOCAPTIONINGFOLDER>/weight
wget -P ./<VIDEOCAPTIONINGFOLDER>/weight https://github.com/microsoft/UniVL/releases/download/v0/univl.pretrained.bin
Note: Please replace <VIDEOCAPTIONINGFOLDER> with the folder name you have chosen. For example, if your video captioning folder name is CLIP4Caption, then the command will be mkdir -p ./CLIP4Caption/weight.
- In each train script (.sh), change the following parameters based on the specs of your machine and the data locations:
- N_GPU = [Total GPUs to use]
- N_THREAD = [Total threads to use]
- DATA_PATH = [JSON file location]
- CKPT_ROOT = [Your desired folder for saving the models and results]
- INIT_MODEL_PATH = [UniVL pretrained model location]
- FEATURES_PATH = [Generated video features path]
- MODEL_FILE_RET = [Pretrained video retrieval checkpoint]
- MODEL_FILE = [Saved video captioning model for evaluation]
- Execute the following scripts to start the training process.
- Run the following script, setting --replace_variable to your chosen folder names:
python3 converter.py --replace_variable='<VIDEOCAPTIONINGFOLDER>' --target_variable='<VIDEOCAPTIONINGFOLDER>'
python3 converter.py --replace_variable='<VIDEORETRIEVALFOLDER>' --target_variable='<VIDEORETRIEVALFOLDER>'
For example, if your video captioning folder name is CLIP4Caption, then the script will become:
python3 converter.py --replace_variable='CLIP4Caption' --target_variable='<VIDEOCAPTIONINGFOLDER>'
- Train on MSVD:
cd scripts/
./msvd_train.sh
- Train on MSR-VTT:
cd scripts/
./msrvtt_train.sh
After the training is done, an evaluation on the test set will be automatically executed using the best checkpoint among all epochs. However, if you want to evaluate a checkpoint from a specific epoch, you can use the provided training shell script by setting INIT_MODEL_PATH to the location of the desired checkpoint and replacing the --do_train flag with --do_eval.
The comparison with existing methods and the ablation study of our method can be found in our paper.
Results on MSVD:

Method | CLIP Model | BLEU@4 | METEOR | ROUGE-L | CIDEr | R@1 |
---|---|---|---|---|---|---|
Ours | ViT-B/16 | 64.77 | 42.05 | 78.77 | 124.47 | 30.8 |
Results on MSR-VTT:

Method | CLIP Model | BLEU@4 | METEOR | ROUGE-L | CIDEr | R@1 |
---|---|---|---|---|---|---|
Ours | ViT-B/16 | 48.78 | 31.28 | 65.01 | 60.51 | 17.0 |
Our code is developed based on https://github.com/microsoft/UniVL, which in turn builds on https://github.com/huggingface/transformers/tree/v0.4.0 and https://github.com/antoine77340/howto100m.
If our work helps your research, please cite our paper as follows:
@article{Velda2023,
title = {Improving distinctiveness in video captioning with text-video similarity},
author = {V. Velda and S. A. Immanuel and W. F. Hendria and C. Jeong},
journal = {Image and Vision Computing},
volume = {136},
pages = {104728},
month = aug,
year = {2023}
}