【AAAI'2024 🔥】DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval


The official implementation of the AAAI 2024 paper DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval. By training only 0.83 MB of parameters, DGL surpasses full fine-tuning and parameter-efficient fine-tuning methods on text-to-video retrieval.

📌 Citation

If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:

@inproceedings{yang2024dgl,
  title={DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval},
  author={Yang, Xiangpeng and Zhu, Linchao and Wang, Xiaohan and Yang, Yi},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={38},
  number={7},
  pages={6540--6548},
  year={2024}
}


📣 Updates

  • Oct 14, 2024: Updated the code for QB-Norm and visualization.
  • Feb 15, 2024: Released the code of DGL.

📕 Overview

Text-video retrieval is a critical multi-modal task that aims to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully fine-tuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level video information; and (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully fine-tuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets.
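For intuition, here is a minimal PyTorch sketch of the two ideas above; it is not the official implementation. It shows a shared latent space that generates both local text prompts and frame prompts, and a learnable global prompt that attends over all frame features while the frame-level features stay local. Class and parameter names (SharedPromptGenerator, GlobalLocalVideoAttention, n_prompts) are illustrative assumptions.

# Minimal sketch of shared-latent-space prompt generation and global-local
# video attention; dimensions follow a CLIP ViT-B/32 setting (d_model=512).
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    """Generates local-level text prompts and frame prompts from one shared latent space."""
    def __init__(self, d_model=512, n_prompts=4):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)
        self.shared = nn.Linear(d_model, d_model)    # shared latent space (a linear variant)
        self.to_text = nn.Linear(d_model, d_model)   # text-branch head
        self.to_frame = nn.Linear(d_model, d_model)  # visual-branch head

    def forward(self):
        z = self.shared(self.latent)                 # both modalities' prompts share z
        return self.to_text(z), self.to_frame(z)

class GlobalLocalVideoAttention(nn.Module):
    """A learnable global prompt attends over all frame features (global view),
    while the per-frame features are kept as-is (local view)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.global_prompt = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, frame_feats):                  # frame_feats: (B, T, d_model)
        q = self.global_prompt.expand(frame_feats.size(0), -1, -1)
        video_global, _ = self.attn(q, frame_feats, frame_feats)
        return video_global.squeeze(1), frame_feats  # global video feature + local frame features

if __name__ == "__main__":
    text_prompts, frame_prompts = SharedPromptGenerator()()
    global_feat, local_feats = GlobalLocalVideoAttention()(torch.randn(2, 12, 512))
    print(text_prompts.shape, frame_prompts.shape, global_feat.shape, local_feats.shape)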

📚 Method

😍 Visualization

DGL can extract global information (bottom) and temporal dynamics (top)

More examples for global information and temporal dynamics

global information

temporal dynamics

Since the visualization code needs to cache the global prompt's attention weights over frames, we provide it as a separate project; the full code is available at visualization code.

# Unzip the code
# Then replace the pretrained weight path (model_dir in msrvtt.sh)
python main.py
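If you want to inspect the cached weights directly, the snippet below shows one way to plot them; the file name global_prompt_frame_weights.npy and its (num_frames,) shape are assumptions for illustration, not the project's actual cache format.

# Hypothetical plot of cached global-prompt-to-frame attention weights.
import numpy as np
import matplotlib.pyplot as plt

weights = np.load("global_prompt_frame_weights.npy")  # assumed cache file, shape (num_frames,)
weights = weights / weights.sum()                     # normalize for display

plt.bar(range(len(weights)), weights)
plt.xlabel("frame index")
plt.ylabel("global-prompt attention weight")
plt.tight_layout()
plt.savefig("global_prompt_attention.png")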

🚀 Quick Start

Setup

Set up the conda environment:

conda env create -f environment.yml

Download CLIP Model

Download CLIP pre-trained weights and place them in ${HOME}/models/pretrained.

wget https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt
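As an optional sanity check (not part of the repo), the downloaded file is a TorchScript archive and should load with torch.jit.load; the path below assumes the ${HOME}/models/pretrained location mentioned above.

# Quick check that the downloaded CLIP weight loads.
import os
import torch

path = os.path.expanduser("~/models/pretrained/ViT-B-32.pt")
model = torch.jit.load(path, map_location="cpu")  # OpenAI CLIP weights are TorchScript archives
print(type(model).__name__, "loaded OK")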

Download Datasets

MSR-VTT

Download the splits and captions from CLIP4Clip:

wget https://github.com/ArrowLuo/CLIP4Clip/releases/download/v0.0/msrvtt_data.zip

Download the videos from Frozen-in-Time:

wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip

Prepare data

Video preprocessing can be done by preprocess/compress_video.py.

python preprocess/compress_video.py --input_root [raw_video_path] --output_root [compressed_video_path]

This script compresses the videos to 3 fps with a width of 224 (or height of 224). Modify the variables in the script for your customization.
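For reference, the snippet below sketches the kind of ffmpeg call such a preprocessing script makes (resample to 3 fps, scale the shorter side to 224); the paths and the exact filter string are illustrative assumptions, not a copy of preprocess/compress_video.py.

# Sketch of per-video compression via ffmpeg: shorter side -> 224, frame rate -> 3 fps.
import os
import subprocess

def compress(input_path, output_path):
    # If width > height, fix height at 224 and let width follow the aspect ratio; otherwise fix width.
    vf = "scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)',fps=3"
    subprocess.run(["ffmpeg", "-y", "-i", input_path, "-vf", vf, output_path], check=True)

if __name__ == "__main__":
    os.makedirs("compressed", exist_ok=True)
    compress("raw/video0.mp4", "compressed/video0.mp4")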

Test

Model Zoo

Note that, due to hardware differences, the results may differ slightly. We have tested the performance on an A100 GPU, obtaining a T2V/V2T R@1 of 45.8/43.5 (log), and on an A6000 GPU, obtaining a T2V/V2T R@1 of 45.4/44.1 (log).

You can also adapt only the global-local video attention with BLIP; following the implementation of tokenmix, this yields a T2V/V2T R@1 of 48.9/49.0 (log).

| Checkpoint | CLIP | Shared Latent Space | Google Cloud |
| --- | --- | --- | --- |
| MSR-VTT | ViT-B/32 | Transformer | Download |
| MSR-VTT | ViT-B/16 | Transformer | Download |
| VATEX | ViT-B/32 | Linear | Download |
| LSMDC | ViT-B/32 | Linear | Download |
| ActivityNet | ViT-B/32 | Transformer | Download |

# Eval on MSR-VTT
# set:
do_train=0
do_eval=1
shared_latent_space=transformer   # or linear, matching the checkpoint
resume='path of ckpt.best.pth.tar'

bash scripts/msrvtt.sh

Search for best performance

Prepare the similarity matrix and the train-test t2v/v2t matrices, then search for your best T2V/V2T R@1!

# Search for the best performance using QB-Norm
# Place the prepared similarity matrices in the folder, i.e., msrvtt_vit16_sim_matrix.npy, msrvtt_vit16_train_test_t2v.npy, msrvtt_vit16_train_test_v2t.npy
python search_for_best_r1_with_qb_norm.py
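For context, QB-Norm rescales the test similarities with a querybank built from training queries (an inverted-softmax normalisation). The sketch below illustrates the idea, assuming the .npy files above hold the test similarity matrix and a train-query-to-test-gallery matrix; the actual search_for_best_r1_with_qb_norm.py may use the dynamic variant and sweep hyperparameters, so treat this as an illustration only.

# Minimal QB-Norm-style rescoring sketch (cosine similarities assumed).
import numpy as np

def qb_norm(test_sims, bank_sims, beta=20.0):
    # test_sims: (n_test_queries, n_gallery); bank_sims: (n_bank_queries, n_gallery)
    norm = np.exp(beta * bank_sims).sum(axis=0, keepdims=True)  # per-gallery-item normaliser
    return np.exp(beta * test_sims) / norm

def recall_at_1(sims):
    # ground-truth pairs are assumed to lie on the diagonal
    ranks = (sims >= np.diag(sims)[:, None]).sum(axis=1)
    return float((ranks == 1).mean())

sims = np.load("msrvtt_vit16_sim_matrix.npy")
bank = np.load("msrvtt_vit16_train_test_t2v.npy")
print("T2V R@1 with QB-Norm:", recall_at_1(qb_norm(sims, bank)))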

Train

# set:
shared_latent_space=transformer   # or linear

# For DGL-Linear, you only need to train 0.83 MB of parameters.

# MSR-VTT
bash scripts/msrvtt.sh

# VATEX
bash scripts/vatex.sh

# LSMDC
bash scripts/lsmdc.sh

# ActivityNet
bash scripts/activitynet.sh

Acknowledgements

This repo is built upon these previous works.