Official implementation of the paper "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos".
GVL is a grounded video-language representation learning framework for untrimmed videos. It automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and their corresponding event segments. GVL is easily extensible to tasks covering visually grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2, and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tracks of the Person In Context (PIC) Challenge 2022.
git clone --recursive https://github.com/zjr2000/GVL.git
conda create -n gvl python=3.7
conda activate gvl
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=10.1 -c pytorch
conda install ffmpeg
pip install -r requirements.txt
cd pdvc/ops
sh make.sh
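After the steps above, a quick sanity check can confirm that the pinned versions were picked up inside the `gvl` environment. The snippet below is a minimal sketch and not part of this repo; `report_versions` is a hypothetical helper name, and it only reads each package's `__version__` attribute.

```python
import importlib


def report_versions(packages=("torch", "torchvision")):
    """Return {package: version string}, with None for packages that fail to import."""
    versions = {}
    for name in packages:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Fall back to "unknown" for modules that do not expose __version__.
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    for name, version in report_versions().items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
    try:
        import torch

        # Reports whether PyTorch can see a CUDA device on this machine.
        print("CUDA available:", torch.cuda.is_available())
    except ImportError:
        pass
```

With the install commands above, the reported versions should begin with 1.7.1 for torch and 0.8.2 for torchvision.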
- Download C3D features and TSP features for ActivityNet Captions
cd data/anet/features
bash download_anet_c3d.sh
bash download_tsp_features.sh
- Download C3D features for TACoS
Download tall_c3d_features.hdf5 from the box provided by 2D-TAN. Put it under data/tacos/features.
cd data/tacos/features
python convert_c3d_h5_to_npy.py
- Download TSN features for YouCook2
cd data/yc2/features
bash download_yc2_tsn_features.sh
- Download the officially provided I3D features for YouMakeup from Google Drive. Put makeup_i3d_rgb_stride_1s.hdf5 under data/youmakeup/features.
cd data/youmakeup/features
python convert_i3d_h5_to_npy.py
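For readers adapting the conversion step to their own HDF5 features, the general idea of splitting one HDF5 file into per-video .npy files can be sketched as follows. This is an illustration only and may differ from what convert_c3d_h5_to_npy.py and convert_i3d_h5_to_npy.py actually do. It assumes the HDF5 file stores one top-level dataset per video id; the function names `export_features` and `convert_h5_to_npy` are hypothetical.

```python
import os

import numpy as np


def export_features(features, out_dir):
    """Write each entry of a name -> array mapping to out_dir/<name>.npy.

    `features` may be an open h5py.File (ASSUMED to hold one top-level
    dataset per video id) or any dict-like object with the same layout.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for video_id in features.keys():
        array = np.asarray(features[video_id])  # reads the dataset into memory
        path = os.path.join(out_dir, f"{video_id}.npy")
        np.save(path, array)
        written.append(path)
    return written


def convert_h5_to_npy(h5_path, out_dir):
    """Open an HDF5 file read-only and export every top-level dataset."""
    import h5py  # imported lazily so export_features works without h5py

    with h5py.File(h5_path, "r") as h5_file:
        return export_features(h5_file, out_dir)
```

A call such as `convert_h5_to_npy("tall_c3d_features.hdf5", "npy_out")` would then produce one file per video; the output directory name here is a placeholder, not a path the repo expects.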
You can also extract features for your own dataset with this feature extractor.
config_path=cfgs/anet_tsp_ssvg.yml
gpu_id=0
python train.py --cfg ${config_path} --gpu_id ${gpu_id}
# Checkpoints and logs will be saved under the "save/" folder.
Evaluate dense captioning performance
eval_folder=YOUR_EVAL_FOLDER # specify the folder to be evaluated
model_path=save/${eval_folder}/model-best-dvc.pth
python eval.py --eval_folder ${eval_folder} \
--gpu_id=YOUR_GPU_ID \
--eval_model_path=${model_path} \
--eval_batch_size=16 \
--eval_caption_file=data/anet/captiondata/val_1.json \
--eval_save_dir save
Evaluate video grounding performance
eval_folder=YOUR_EVAL_FOLDER # specify the folder to be evaluated
model_path=save/${eval_folder}/model-best-grounding.pth
python eval.py --eval_folder ${eval_folder} \
--gpu_id=YOUR_GPU_ID \
--eval_model_path=${model_path} \
--eval_batch_size=16 \
--eval_disable_captioning \
--eval_caption_file=data/anet/captiondata/val_2.json \
--eval_save_dir save \
--eval_gt_file_for_grounding data/anet/captiondata/grounding/val2_for_grounding.json
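The two evaluation commands above differ only in the checkpoint name, the caption file, and the grounding-specific flags. A small wrapper can assemble either one, which helps avoid copy-paste mistakes when evaluating several runs. This is a minimal sketch: `build_eval_command` is a hypothetical helper that does not exist in the repo, and it uses only the flags and paths shown in the commands above.

```python
import shlex


def build_eval_command(eval_folder, task, gpu_id="0", batch_size=16):
    """Return the eval.py argument list for task 'dvc' or 'grounding'."""
    if task not in ("dvc", "grounding"):
        raise ValueError("task must be 'dvc' or 'grounding'")
    model_path = f"save/{eval_folder}/model-best-{task}.pth"
    # Dense captioning is evaluated on val_1, grounding on val_2 (as above).
    caption_file = "val_1.json" if task == "dvc" else "val_2.json"
    cmd = [
        "python", "eval.py",
        "--eval_folder", eval_folder,
        f"--gpu_id={gpu_id}",
        f"--eval_model_path={model_path}",
        f"--eval_batch_size={batch_size}",
        f"--eval_caption_file=data/anet/captiondata/{caption_file}",
        "--eval_save_dir", "save",
    ]
    if task == "grounding":
        cmd += [
            "--eval_disable_captioning",
            "--eval_gt_file_for_grounding",
            "data/anet/captiondata/grounding/val2_for_grounding.json",
        ]
    return cmd


if __name__ == "__main__":
    command = build_eval_command("YOUR_EVAL_FOLDER", "grounding")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(part) for part in command))
```

To run the command rather than print it, pass the returned list to `subprocess.run(command, check=True)` from the repository root.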
We also provide several checkpoints for reproducing our experimental results. You can download them from Google Drive, put them under save/, and use the above scripts to evaluate them.
The result files are available on Google Drive.
If you find this repo helpful, please consider citing:
@article{wang2023learning,
title={Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos},
author={Wang, Teng and Zhang, Jinrui and Zheng, Feng and Jiang, Wenhao and Cheng, Ran and Luo, Ping},
journal={arXiv preprint arXiv:2303.06378},
year={2023}
}
This repo is mainly based on PDVC and Deformable DETR. We thank the authors for their efforts.