【AAAI'2024 🔥】DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval


The official implementation of the AAAI 2024 paper DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. By training only 0.83 MB of parameters, DGL surpasses full fine-tuning and parameter-efficient fine-tuning methods on text-to-video retrieval.

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@inproceedings{yang2024dgl,
  title={DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval},
  author={Yang, Xiangpeng and Zhu, Linchao and Wang, Xiaohan and Yang, Yi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6540--6548},
  year={2024}
}


📣 Updates

  • Oct 14, 2024: Updated the code for QB-Norm and visualization.
  • Feb 15, 2024: Released the code of DGL.

📕 Overview

Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully fine-tuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level video information; and (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully fine-tuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets.
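For intuition, here is a minimal PyTorch sketch of the two ideas above; it is not the official implementation. It shows a shared latent space that generates both local text prompts and frame prompts, and a learnable global prompt that attends over all frame features while the frame-level features stay local. Class and parameter names (SharedPromptGenerator, GlobalLocalVideoAttention, n_prompts) are illustrative assumptions.

# Minimal sketch of shared-latent-space prompt generation and global-local
# video attention; dimensions follow a CLIP ViT-B/32 setting (d_model=512).
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    """Generates local-level text prompts and frame prompts from one shared latent space."""
    def __init__(self, d_model=512, n_prompts=4):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.shared = nn.Linear(d_model, d_model)    # shared latent space (a linear variant)
        self.to_text = nn.Linear(d_model, d_model)   # text-branch head
        self.to_frame = nn.Linear(d_model, d_model)  # visual-branch head

    def forward(self):
        z = self.shared(self.latent)                 # both modalities' prompts share z
        return self.to_text(z), self.to_frame(z)

class GlobalLocalVideoAttention(nn.Module):
    """A learnable global prompt attends over all frame features (global view),
    while the per-frame features are kept as-is (local view)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.global_prompt = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):                  # frame_feats: (B, T, d_model)
        q = self.global_prompt.expand(frame_feats.size(0), -1, -1)
        video_global, _ = self.attn(q, frame_feats, frame_feats)
        return video_global.squeeze(1), frame_feats  # global video feature + local frame features

if __name__ == "__main__":
    text_prompts, frame_prompts = SharedPromptGenerator()()
    global_feat, local_feats = GlobalLocalVideoAttention()(torch.randn(2, 12, 512))
    print(text_prompts.shape, frame_prompts.shape, global_feat.shape, local_feats.shape)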

📚 Method

😍 Visualization

DGL can extract global information (bottom) and temporal dynamics (top)

More examples for global information and temporal dynamics

global information

temporal dynamics

Since the visualization code needs to cache the global prompt's attention weights over frames, we provide it as a separate project; the full code is available at visualization code.

# Unzip the code
# Then replace the pretrained weight path (model_dir in msrvtt.sh)
python main.py
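If you want to inspect the cached weights directly, the snippet below shows one way to plot them; the file name global_prompt_frame_weights.npy and its (num_frames,) shape are assumptions for illustration, not the project's actual cache format.

# Hypothetical plot of cached global-prompt-to-frame attention weights.
import numpy as np
import matplotlib.pyplot as plt

weights = np.load("global_prompt_frame_weights.npy")  # assumed cache file, shape (num_frames,)
weights = weights / weights.sum()                     # normalize for display

plt.bar(range(len(weights)), weights)
plt.xlabel("frame index")
plt.ylabel("global-prompt attention weight")
plt.tight_layout()
plt.savefig("global_prompt_attention.png")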

🚀 Quick Start

Setup

Set up the conda environment:

conda env create -f environment.yml

Download CLIP Model

Download CLIP pre-trained weights and place them in ${HOME}/models/pretrained.

wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
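As an optional sanity check (not part of the repo), the downloaded file is a TorchScript archive and should load with torch.jit.load; the path below assumes the ${HOME}/models/pretrained location mentioned above.

# Quick check that the downloaded CLIP weight loads.
import os
import torch

path = os.path.expanduser("~/models/pretrained/ViT-B-32.pt")
model = torch.jit.load(path, map_location="cpu")  # OpenAI CLIP weights are TorchScript archives
print(type(model).__name__, "loaded OK")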

Download Datasets

MSR-VTT

Download the splits and captions from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

Download the videos from Frozen-in-Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

Prepare data

Video preprocessing can be done by preprocess/compress_video.py.

python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]

This script compresses the videos to 3 fps with a width of 224 (or height of 224). Modify the variables in the script for your customization.
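For reference, the snippet below sketches the kind of ffmpeg call such a preprocessing script makes (resample to 3 fps, scale the shorter side to 224); the paths and the exact filter string are illustrative assumptions, not a copy of preprocess/compress_video.py.

# Sketch of per-video compression via ffmpeg: shorter side -> 224, frame rate -> 3 fps.
import os
import subprocess

def compress(input_path, output_path):
    # If width > height, fix height at 224 and let width follow the aspect ratio; otherwise fix width.
    vf = "scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)',fps=3"
    subprocess.run(["ffmpeg", "-y", "-i", input_path, "-vf", vf, output_path], check=True)

if __name__ == "__main__":
    os.makedirs("compressed", exist_ok=True)
    compress("raw/video0.mp4", "compressed/video0.mp4")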

Test

Model Zoo

Note that, due to hardware differences, the results may differ slightly. We have tested the performance on an A100 GPU, obtaining a T2V/V2T R@1 of 45.8/43.5 (log), and on an A6000 GPU, obtaining a T2V/V2T R@1 of 45.4/44.1 (log).

You can also adapt only the global-local video attention with BLIP; following the implementation of tokenmix, this yields a T2V/V2T R@1 of 48.9/49.0 (log).

| Checkpoint | CLIP | Shared Latent Space | Google Cloud |
| --- | --- | --- | --- |
| MSR-VTT | ViT-B/32 | Transformer | Download |
| MSR-VTT | ViT-B/16 | Transformer | Download |
| VATEX | ViT-B/32 | Linear | Download |
| LSMDC | ViT-B/32 | Linear | Download |
| ActivityNet | ViT-B/32 | Transformer | Download |

# Eval on MSR-VTT
# set:
do_train=0
do_eval=1
shared_latent_space=transformer   # or linear, matching the checkpoint
resume='path of ckpt.best.pth.tar'

bash scripts/msrvtt.sh

Search for best performance

Prepare the similarity matrix and the train-test t2v/v2t matrices, then search for your best T2V/V2T R@1!

# Search for the best performance using QB-Norm
# Place the prepared similarity matrices in the folder, i.e., msrvtt_vit16_sim_matrix.npy, msrvtt_vit16_train_test_t2v.npy, msrvtt_vit16_train_test_v2t.npy
python search_for_best_r1_with_qb_norm.py
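For context, QB-Norm rescales the test similarities with a querybank built from training queries (an inverted-softmax normalisation). The sketch below illustrates the idea, assuming the .npy files above hold the test similarity matrix and a train-query-to-test-gallery matrix; the actual search_for_best_r1_with_qb_norm.py may use the dynamic variant and sweep hyperparameters, so treat this as an illustration only.

# Minimal QB-Norm-style rescoring sketch (cosine similarities assumed).
import numpy as np

def qb_norm(test_sims, bank_sims, beta=20.0):
    # test_sims: (n_test_queries, n_gallery); bank_sims: (n_bank_queries, n_gallery)
    norm = np.exp(beta * bank_sims).sum(axis=0, keepdims=True)  # per-gallery-item normaliser
    return np.exp(beta * test_sims) / norm

def recall_at_1(sims):
    # ground-truth pairs are assumed to lie on the diagonal
    ranks = (sims >= np.diag(sims)[:, None]).sum(axis=1)
    return float((ranks == 1).mean())

sims = np.load("msrvtt_vit16_sim_matrix.npy")
bank = np.load("msrvtt_vit16_train_test_t2v.npy")
print("T2V R@1 with QB-Norm:", recall_at_1(qb_norm(sims, bank)))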

Train

# set:
shared_latent_space=transformer   # or linear

# For DGL-Linear, you only need to train 0.83 MB of parameters.

# MSR-VTT
bash scripts/msrvtt.sh

# VATEX
bash scripts/vatex.sh

# LSMDC
bash scripts/lsmdc.sh

# ActivityNet
bash scripts/activitynet.sh

Acknowledgements

This repo is built upon these previous works.