Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah and Fahad Khan
Composed video retrieval (CoVR) is a challenging problem in computer vision that integrates modification text with visual queries to enable more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. However, this strategy struggles to fully preserve the rich query-specific context in the retrieved target videos and represents the target video using only a visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative embeddings of vision only, text only, and vision-text for better alignment, enabling accurate retrieval of matched target videos. The proposed framework can be flexibly employed for both composed video (CoVR) and composed image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, with gains of up to around 7% in recall@K=1.
To download the WebVid-CoVR videos, install `mpi4py` and run:
python tools/scripts/download_covr.py <split>
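For example, installing the dependency and fetching one split could look like the lines below; the split name shown is illustrative, so replace it with the split you actually need:
pip install mpi4py  # dependency of the download script
python tools/scripts/download_covr.py train  # 'train' is a hypothetical split name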
To download the annotations of WebVid-CoVR:
bash tools/scripts/download_annotation.sh covr
To generate the descriptions of the WebVid-CoVR videos, use the scripts `tools/scripts/generate_webvid_description_2m.py` and `tools/scripts/generate_webvid_description_8m.py` inside the main directory of MiniGPT-4.
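As a rough sketch (the scripts' exact arguments are not documented here and may differ in your setup), the generation step would be run from the MiniGPT-4 root after copying or linking the scripts there:
cd MiniGPT-4  # your MiniGPT-4 checkout
python tools/scripts/generate_webvid_description_2m.py  # 2M-variant script
python tools/scripts/generate_webvid_description_8m.py  # 8M-variant script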
Download the WebVid-CoVR annotation files with our generated descriptions from here: OneDrive Link
Download the model checkpoints from here: OneDrive Link.
Save the checkpoint in the following folder structure: `outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/`
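For reference, the expected directory can be created and the downloaded checkpoint moved into it as follows (the checkpoint filename shown is hypothetical):
mkdir -p outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/
mv ckpt_best.ckpt outputs/webvid-covr/blip-large/blip-l-coco/tv-False_loss-hnnce_lr-1e-05/  # replace with the actual checkpoint filename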
The final repository structure is as follows:
📦 composed-video-retrieval
┣ 📂 annotations
┣ 📂 configs
┣ 📂 datasets
┣ 📂 outputs
┣ 📂 src
┣ 📂 tools
┣ 📜 LICENSE
┣ 📜 README.md
┣ 📜 test.py
┗ 📜 train.py
conda create --name covr python=3.10
conda activate covr
Install the following packages inside the conda environment:
pip install -r requirements.txt
The code was tested on Python 3.10 and PyTorch >= 2.0.
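As a quick sanity check after installation (assuming PyTorch is pulled in through requirements.txt), you can verify the interpreter and library versions:
python --version  # expected: Python 3.10.x
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expected: >= 2.0, and True on a CUDA machine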
Before training, you will need to compute the BLIP embeddings for the videos/images. To do so, run:
python tools/embs/save_blip_embs_vids.py # This will compute the embeddings for the WebVid-CoVR videos.
python tools/embs/save_blip_embs_imgs.py # This will compute the embeddings for the CIRR or FashionIQ images.
The command to launch a training experiment is the following:
python train.py [OPTIONS]
The parsing is done using the powerful Hydra library. You can override anything in the configuration by passing arguments like `foo=value` or `foo.bar=value`.
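For example, using the configuration groups documented below, a training run can be launched with group overrides and, if needed, single-value overrides; the exact key path for `devices` depends on the configs, so treat it as an assumption:
python train.py data=webvid-covr model/ckpt=blip-l-coco trainer=gpu  # override whole config groups
python train.py data=webvid-covr trainer=gpu trainer.devices=2  # dot syntax for a nested value (key path assumed)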
The command to evaluate is the following:
python test.py test=<test> [OPTIONS]
- `data=webvid-covr`: WebVid-CoVR dataset.
- `data=cirr`: CIRR dataset.
- `data=fashioniq-split`: FashionIQ dataset; change `split` to `dress`, `shirt` or `toptee`.
- `test=all`: Test on WebVid-CoVR, CIRR and all three Fashion-IQ test sets.
- `test=webvid-covr`: Test on WebVid-CoVR.
- `test=cirr`: Test on CIRR.
- `test=fashioniq`: Test on all three Fashion-IQ test sets (`dress`, `shirt` and `toptee`).
- `model/ckpt=blip-l-coco`: Default checkpoint for BLIP-L finetuned on COCO.
- `model/ckpt=webvid-covr`: Default checkpoint for CoVR finetuned on WebVid-CoVR (see the example evaluation command below).
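Putting the data, test and checkpoint options together, an evaluation of the released CoVR checkpoint could look like the following sketch; the FashionIQ dataset name spelling follows the `data=fashioniq-split` pattern above and is an assumption:
python test.py test=webvid-covr data=webvid-covr model/ckpt=webvid-covr  # CoVR evaluation on WebVid-CoVR
python test.py test=fashioniq data=fashioniq-shirt model/ckpt=webvid-covr  # zero-shot CoIR on the shirt split (name assumed)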
- `trainer=gpu`: training with CUDA; change `devices` to the number of GPUs you want to use.
- `trainer=ddp`: training with Distributed Data Parallel (DDP); change `devices` and `num_nodes` to the number of GPUs and nodes you want to use (see the sketch after this list).
- `trainer=cpu`: training on the CPU (not recommended).
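For multi-GPU training, a DDP launch could look like this; whether `devices` and `num_nodes` are set at the top level or under `trainer.` depends on the Hydra configs, so the key paths below are assumptions:
python train.py trainer=ddp trainer.devices=4 trainer.num_nodes=1  # 4 GPUs on a single node (key paths assumed)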
- `trainer/logger=csv`: log the results in a CSV file. Very basic functionality.
- `trainer/logger=wandb`: log the results in wandb. This requires installing `wandb` and setting up your wandb account. This is what we used to log our experiments (see the example below).
- `trainer/logger=<other>`: other loggers (not tested).
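To reproduce the wandb logging setup, something along these lines should work; the `wandb login` step assumes you already have a wandb account:
pip install wandb
wandb login  # paste your API key when prompted
python train.py trainer/logger=wandb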
- `machine=server`: You can change the default path to the dataset folder and the batch size. You can create your own machine configuration by adding a new file in `configs/machine` (see the sketch below).
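A simple way to add your own machine configuration is to copy the existing one and edit the dataset path and batch size; the file name `server.yaml` is an assumption based on the `machine=server` option above:
cp configs/machine/server.yaml configs/machine/mymachine.yaml  # then edit the dataset folder path and batch size
python train.py machine=mymachine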
There are many pre-defined experiments from the paper in `configs/experiments`. Simply add `experiment=<experiment>` to the command line to use them.
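For example, list the available experiment configurations and pass one of the file names (without the .yaml extension) on the command line:
ls configs/experiments  # see which experiments are available
python train.py experiment=<experiment>  # replace <experiment> with one of the listed names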
Use `slurm_train.sh` and `slurm_test.sh` when running under SLURM.
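On a SLURM cluster, these wrappers would typically be submitted with sbatch; the partition, GPU and time settings inside the scripts may need to be adapted to your cluster:
sbatch slurm_train.sh
sbatch slurm_test.sh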
- Our approach is built on CoVR-BLIP and BLIP, with lightning-hydra-template used in the backend.
- To generate the video descriptions, we used MiniGPT-4.
@inproceedings{thawakar2024composed,
  title={Composed Video Retrieval via Enriched Context and Discriminative Embeddings},
  author={Omkar Thawakar and Muzammal Naseer and Rao Muhammad Anwer and Salman Khan and Michael Felsberg and Mubarak Shah and Fahad Shahbaz Khan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2024}
}