hbzhang/cvpr2020

This repo provides code for learning and evaluating joint video-text embeddings for the task of video retrieval. Our approach is described in the paper "Use What You Have: Video retrieval using representations from collaborative experts" (paper, project page)

CE diagram

High-level Overview: The Collaborative Experts framework aims to achieve robustness through two mechanisms:

  1. The use of information from a wide range of modalities, including those that are typically always available in video (such as RGB) as well as more "specific" clues which may only occasionally be present (such as overlaid text).
  2. A module that aims to combine these modalities into a fixed-size representation in a manner that is robust to noise.

Requirements: The code assumes PyTorch 1.4 and Python 3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.

Important: A note on the updated results: A previous version of the codebase (and paper) reported results on the retrieval benchmarks that were affected by a significant software bug, which led to an overestimate of performance. We are extremely grateful to Valentin Gabeur, who discovered this bug (it has been corrected in the current codebase).

CVPR 2020: Pentathlon challenge

logo

We are hosting a video retrieval challenge as part of the Video Pentathlon Workshop. Find out how to participate here!

Pretrained video embeddings

We provide pretrained models for each dataset to reproduce the results reported in the paper [1] (references follow at the end of this README). Each model is accompanied by training and evaluation logs. Performance is evaluated for retrieval in both directions (joint embeddings can be used for either of these two tasks):

  • t2v denotes that a text query is used to retrieve videos
  • v2t denotes that a video query is used to retrieve text descriptions of videos

In the results reported below, the same model is used for both the t2v and v2t evaluations. Each metric is reported as the mean and standard deviation (in parentheses) across three training runs.
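In the tables below, R@k is recall at rank k (higher is better), while MdR and MnR are the median and mean rank of the correct item (lower is better). The following is a minimal sketch of how such numbers could be computed, not the evaluation code used in this repo. It assumes a similarity matrix in which query i has exactly one correct match at candidate i, and it assumes the sample standard deviation is used when summarising runs. The names `retrieval_metrics` and `summarise_runs` are hypothetical.

```python
import numpy as np


def retrieval_metrics(sims):
    """Compute R@k, MdR and MnR from a square similarity matrix.

    Assumption: sims[i, i] is the score of the single correct match
    for query i (one caption per video).
    """
    sims = np.asarray(sims, dtype=float)
    correct = np.diag(sims)[:, None]
    # Rank of the correct item: 1 + number of candidates scored strictly higher.
    ranks = 1 + (sims > correct).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in (1, 5, 10, 50)}
    metrics["MdR"] = float(np.median(ranks))
    metrics["MnR"] = float(np.mean(ranks))
    return metrics


def summarise_runs(values):
    """Format one metric across several runs as mean(std), to one decimal place.

    Assumption: the sample standard deviation (ddof=1) is used.
    """
    return f"{np.mean(values):.1f}({np.std(values, ddof=1):.1f})"


# Toy example: rows are text queries, columns are videos (t2v).
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
toy_metrics = retrieval_metrics(toy_sims)
```

Transposing the similarity matrix before calling `retrieval_metrics` gives the metrics for the opposite retrieval direction under the same one-to-one assumption.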

MSRVTT Benchmark

Model Split Task R@1 R@5 R@10 R@50 MdR MnR Links
CE Full t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 71.4(0.1) 16.0(0.0) 86.8(0.3) config, model, log
CE 1k-A t2v 20.9(1.2) 48.8(0.6) 62.4(0.8) 89.1(0.4) 6.0(0.0) 28.2(0.8) config, model, log
CE 1k-B t2v 18.2(0.7) 46.0(0.4) 60.7(0.2) 86.6(0.5) 7.0(0.0) 35.3(1.1) config, model, log
MoEE* 1k-B t2v 15.0(0.7) 39.7(1.0) 54.5(1.1) 82.7(0.6) 8.3(0.6) 43.7(0.7) config, model, log
CE Full v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 84.0(0.1) 8.3(0.6) 38.1(1.8) config, model, log
CE 1k-A v2t 20.6(0.6) 50.3(0.5) 64.0(0.2) 89.9(0.3) 5.3(0.6) 25.1(0.8) config, model, log
CE 1k-B v2t 18.0(0.8) 46.0(0.5) 60.3(0.5) 86.4(0.3) 6.5(0.5) 30.6(1.2) config, model, log
MoEE* 1k-B v2t 14.5(0.8) 40.4(0.8) 54.9(1.0) 83.8(0.5) 8.8(0.4) 38.7(0.9) config, model, log

Models marked with * use the features made available with the MoEE model of [2] (without OCR, speech and scene features). Unstarred models on the 1k-B and Full splits make use of OCR, speech and scene features, as well as slightly stronger text encodings (GPT, rather than word2vec - see [1] for details). The MoEE model is implemented as a sanity check that our codebase approximately reproduces [2] (the MoEE paper).

See the MSRVTT README for links to the train/val/test lists of each split.

MSVD Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 19.8(0.3) 49.0(0.3) 63.8(0.1) 89.0(0.2) 6.0(0.0) 23.1(0.3) config, model, log
CE v2t 23.9(1.4) 50.2(0.8) 59.6(1.2) 82.3(0.7) 5.6(0.5) 41.2(3.4) config, model, log

See the MSVD README for descriptions of the train/test splits. Note that the videos in the MSVD dataset do not have soundtracks.

DiDeMo Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 16.1(1.4) 41.1(0.4) 54.4(0.8) 82.7(0.3) 8.3(0.6) 43.7(3.6) config, model, log
CE v2t 15.6(1.3) 40.9(0.4) 55.2(0.5) 82.2(1.3) 8.2(0.3) 42.4(3.3) config, model, log

See the DiDeMo README for descriptions of the train/val/test splits.

ActivityNet Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 18.2(0.3) 47.7(0.6) 63.9(0.5) 91.4(0.4) 6.0(0.0) 23.1(0.5) config, model, log
CE v2t 17.7(0.6) 46.6(0.7) 62.8(0.4) 90.9(0.2) 6.0(0.0) 24.4(0.5) config, model, log

See the ActivityNet README for descriptions of the train/test splits.

LSMDC Benchmark

Model Task R@1 R@5 R@10 R@50 MdR MnR Links
CE t2v 11.2(0.4) 26.9(1.1) 34.8(2.0) 62.1(1.5) 25.3(3.1) 96.8(5.0) config, model, log
CE v2t 11.7(0.5) 25.8(1.5) 34.4(1.7) 61.4(0.7) 28.0(2.6) 97.6(2.8) config, model, log

See the LSMDC README for descriptions of the train/test splits. Please note that to obtain the features and descriptions for this dataset, you must obtain permission from MPII to use the data (this process is described here). Once you have done so, please request that a member of the LSMDC team contacts us to confirm approval (via albanie at robots dot ox dot ac dot uk) - we can then provide you with a link to the features.

Ablation studies on MSRVTT

We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSRVTT dataset.

CE Design: First, we investigate the importance of the parts used by the CE model.

Model Task R@1 R@5 R@10 MdR Params Links
Concat t2v 0.0(0.0) 0.0(0.0) 0.0(0.0) 1495.5(0.0) 369.72k config, model, log
CE - MW,P,CG t2v 8.5(0.1) 25.9(0.3) 37.6(0.2) 19.0(0.0) 246.22M config, model, log
CE - P,CG t2v 9.6(0.1) 28.0(0.2) 39.7(0.2) 17.7(0.6) 400.41M config, model, log
CE - CG t2v 9.7(0.1) 28.1(0.2) 40.2(0.1) 17.0(0.0) 181.07M config, model, log
CE t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
Concat v2t 0.0(0.0) 0.0(0.0) 0.0(0.0) 29897.5(0.0) 369.72k config, model, log
CE - MW,P,CG v2t 13.7(0.4) 38.8(1.2) 53.1(1.1) 9.2(0.8) 246.22M config, model, log
CE - P,CG v2t 14.1(0.2) 39.5(1.0) 53.2(0.3) 9.0(0.0) 400.41M config, model, log
CE - CG v2t 15.1(0.3) 40.3(0.5) 54.3(0.7) 8.8(0.3) 181.07M config, model, log
CE v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Each row adds an additional component to the model. The names refer to the following model designs:

  • Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses a slightly greater number of VLAD clusters than the others to allow the concatenated embedding to match the dimensionality of the text.
  • CE - MW,P,CG - The CE model without MoE weights, projecting to a common dimension or Collaborative Gating.
  • CE - P,CG - The CE model without projecting to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
  • CE - CG - The CE model without Collaborative Gating (CG).
  • CE - The full CE model.

Note that in the table above some metrics have been removed to allow the number of parameters to be displayed---these additional metrics can be found in the linked logs.

Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.0(0.1) 14.1(0.1) 22.4(0.3) 50.0(1.0) 19.46M config, model, log
Scene + Inst. t2v 7.2(0.1) 22.3(0.3) 33.0(0.2) 25.3(0.6) 41.12M config, model, log
Scene + r2p1d t2v 6.8(0.1) 21.7(0.1) 32.4(0.1) 25.7(0.6) 39.95M config, model, log
Scene + RGB t2v 5.0(0.2) 16.6(0.7) 25.5(1.0) 40.7(2.1) 41.12M config, model, log
Scene + Flow t2v 5.3(0.3) 17.6(0.8) 27.1(0.9) 36.0(1.7) 40.34M config, model, log
Scene + Audio t2v 5.6(0.0) 18.7(0.1) 28.2(0.1) 33.7(0.6) 40.34M config, model, log
Scene + OCR t2v 4.1(0.1) 14.1(0.1) 22.2(0.2) 50.3(1.2) 49.49M config, model, log
Scene + Speech t2v 4.6(0.1) 15.5(0.2) 24.4(0.2) 44.7(1.2) 43.94M config, model, log
Scene + Face t2v 4.1(0.1) 14.2(0.3) 22.4(0.4) 49.7(0.6) 39.95M config, model, log
Scene v2t 5.6(0.6) 18.2(0.6) 27.7(0.3) 39.0(0.0) 19.46M config, model, log
Scene + Inst. v2t 10.1(0.3) 29.7(0.5) 41.9(0.7) 15.2(0.9) 41.12M config, model, log
Scene + r2p1d v2t 9.4(0.3) 27.8(0.6) 40.1(1.1) 17.2(1.1) 39.95M config, model, log
Scene + RGB v2t 6.9(0.5) 21.2(0.9) 31.1(1.9) 28.7(3.8) 41.12M config, model, log
Scene + Flow v2t 7.3(0.6) 22.3(1.4) 33.4(1.7) 25.2(2.0) 40.34M config, model, log
Scene + Audio v2t 8.2(0.4) 24.8(0.4) 36.0(0.1) 21.7(0.6) 40.34M config, model, log
Scene + OCR v2t 5.4(0.5) 18.6(1.2) 26.6(1.2) 40.0(1.0) 49.49M config, model, log
Scene + Speech v2t 6.0(0.2) 20.4(0.5) 30.3(1.0) 33.0(2.0) 43.94M config, model, log
Scene + Face v2t 5.6(1.0) 17.9(0.7) 26.7(0.8) 39.1(2.6) 39.95M config, model, log

We can also study their cumulative effect:

Experts Task R@1 R@5 R@10 MdR Params Links
Scene t2v 4.0(0.1) 14.1(0.1) 22.4(0.3) 50.0(1.0) 19.46M config, model, log
Prev. + Speech t2v 4.6(0.1) 15.5(0.2) 24.4(0.2) 44.7(1.2) 43.94M config, model, log
Prev. + Audio t2v 5.8(0.1) 19.0(0.3) 28.8(0.2) 32.3(0.6) 62.45M config, model, log
Prev. + Flow t2v 6.7(0.2) 21.8(0.4) 32.5(0.5) 25.3(0.6) 80.96M config, model, log
Prev. + RGB t2v 7.5(0.1) 23.4(0.0) 34.1(0.2) 23.7(0.6) 100.26M config, model, log
Prev. + Inst. t2v 9.5(0.2) 27.7(0.1) 39.4(0.1) 18.0(0.0) 119.56M config, model, log
Prev. + R2P1D t2v 9.9(0.1) 28.6(0.3) 40.7(0.1) 17.0(0.0) 137.67M config, model, log
Prev. + OCR t2v 10.0(0.1) 28.8(0.2) 40.9(0.2) 16.7(0.6) 165.33M config, model, log
Prev. + Face t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
Scene v2t 5.6(0.6) 18.2(0.6) 27.7(0.3) 39.0(0.0) 19.46M config, model, log
Prev. + Speech v2t 6.0(0.2) 20.4(0.5) 30.3(1.0) 33.0(2.0) 43.94M config, model, log
Prev. + Audio v2t 8.6(0.2) 26.1(0.6) 37.8(0.8) 19.8(0.8) 62.45M config, model, log
Prev. + Flow v2t 9.9(0.4) 28.6(0.7) 41.7(0.8) 15.7(0.6) 80.96M config, model, log
Prev. + RGB v2t 11.2(0.3) 32.1(0.8) 45.4(0.6) 13.7(0.6) 100.26M config, model, log
Prev. + Inst. v2t 14.7(0.6) 38.9(0.8) 53.1(1.0) 9.3(0.6) 119.56M config, model, log
Prev. + R2P1D v2t 15.5(0.6) 40.1(1.2) 54.4(1.3) 8.7(0.6) 137.67M config, model, log
Prev. + OCR v2t 15.2(0.1) 41.1(0.6) 54.6(0.7) 8.5(0.5) 165.33M config, model, log
Prev. + Face v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.

Dimension Task R@1 R@5 R@10 MdR Params Links
384 t2v 9.4(0.2) 27.8(0.4) 39.8(0.4) 17.7(0.6) 88.62M config, model, log
512 t2v 9.8(0.3) 28.6(0.4) 40.6(0.4) 17.0(0.0) 119.51M config, model, log
640 t2v 10.1(0.1) 28.8(0.1) 40.9(0.2) 16.7(0.6) 151.12M config, model, log
768 t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
1024 t2v 9.9(0.1) 28.6(0.3) 40.7(0.4) 17.0(0.0) 250.27M config, model, log
384 v2t 14.0(0.5) 38.7(0.5) 52.7(1.4) 9.3(0.6) 88.62M config, model, log
512 v2t 14.8(0.4) 40.4(0.6) 53.9(0.4) 8.8(0.3) 119.51M config, model, log
640 v2t 15.6(0.6) 41.3(0.7) 55.0(0.5) 8.3(0.6) 151.12M config, model, log
768 v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log
1024 v2t 14.7(0.4) 40.7(0.8) 54.4(0.3) 8.5(0.5) 250.27M config, model, log

Training with more captions: Rather than varying the number of experts, we can also investigate how performance changes as we change the number of training captions available per-video.

Experts Caps. Task R@1 R@5 R@10 MdR Params Links
RGB 1 t2v 2.6(0.1) 9.3(0.4) 15.0(0.7) 101.3(15.5) 56.7M config, model, log
RGB 20 t2v 4.9(0.1) 16.5(0.2) 25.3(0.4) 40.7(1.2) 58.05M config, model, log
All 1 t2v 4.8(0.2) 16.2(0.5) 25.0(0.7) 43.3(4.0) 183.45M config, model, log
All 20 t2v 10.0(0.1) 29.0(0.3) 41.2(0.2) 16.0(0.0) 183.45M config, model, log
RGB 1 v2t 3.7(0.3) 13.5(0.6) 20.8(0.4) 60.0(2.0) 56.7M config, model, log
RGB 20 v2t 6.9(0.6) 21.0(0.3) 31.3(0.3) 30.0(1.7) 58.05M config, model, log
All 1 v2t 8.4(0.5) 25.6(0.7) 37.1(0.2) 20.3(0.6) 183.45M config, model, log
All 20 v2t 15.6(0.3) 40.9(1.4) 55.2(1.0) 8.3(0.6) 183.45M config, model, log

Similar ablation studies for the remaining datasets can be found here.

Expert Zoo

For each dataset, the Collaborative Experts model makes use of a collection of pretrained "expert" feature extractors (see [1] for more precise descriptions). Some experts have been obtained from other sources (described where applicable), rather than extracted by us. To reproduce the experiments listed above, the experts for each dataset have been bundled into compressed tar files. These can be downloaded and unpacked with a utility script (recommended -- see example usage below), which will store them in the locations expected by the training code. Each set of experts has a brief README, which also provides a link from which they can be downloaded directly.

Dataset Experts Details and links Archive size sha1sum
MSRVTT audio, face, flow, ocr, rgb, scene, speech README 19.6 GiB 959bda588793ef05f348d16de26da84200c5a469
LSMDC audio, face, flow, ocr, rgb, scene README 6.1 GiB 7ce018e981752db9e793e449c2ba5bc88217373d
MSVD face, flow, ocr, rgb, scene README 2.1 GiB 6071827257c14de455b3a13fe1e885c2a7887c9e
DiDeMo audio, face, flow, ocr, rgb, scene, speech README 2.3 GiB 6fd4bcc68c1611052de2499fd8ab3f488c7c195b
ActivityNet audio, face, flow, ocr, rgb, scene, speech README 3.8 GiB b16685576c97cdec2783fb89ea30ca7d17abb021
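If you download an archive directly rather than via the utility script, the sha1sum column above can be used to confirm that the download is intact. The following is a minimal sketch using only the Python standard library. The helper names `sha1_of_file` and `archive_matches` are hypothetical and not part of this repo, and the archive path in the usage comment is a placeholder.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def archive_matches(path, expected_sha1):
    """Compare a file's SHA-1 digest with the value listed in the table."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Usage (placeholder path; the digest is the MSVD entry from the table):
# archive_matches("<path-to-msvd-archive>", "6071827257c14de455b3a13fe1e885c2a7887c9e")
```

Reading in chunks keeps memory use flat, which matters for the larger archives such as the 19.6 GiB MSRVTT bundle.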

Evaluating a pretrained model

Evaluating a pretrained model for a given dataset requires:

  1. The pretrained experts for the target dataset, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file.
  3. A trained_model.pth file.

Evaluation is then performed with the following command:

python3 test.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>

where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
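Before launching an evaluation, it can be useful to confirm that the three prerequisites listed above are in place. The following is a minimal sketch of such a check. The helper name `missing_eval_inputs` is hypothetical and not part of this repo, and the directory layout is taken from the description above.

```python
from pathlib import Path


def missing_eval_inputs(root, dataset, config_path, model_path):
    """Return a list of problems; the list is empty when all inputs exist."""
    problems = []
    # Expected location of the experts: <root>/data/<dataset-name>/symlinked-feats
    feats = Path(root) / "data" / dataset / "symlinked-feats"
    if not feats.is_dir():
        problems.append(f"expert features not found: {feats}")
    if not Path(config_path).is_file():
        problems.append(f"config file not found: {config_path}")
    if not Path(model_path).is_file():
        problems.append(f"trained model not found: {model_path}")
    return problems
```

`Path.is_dir()` follows symbolic links, so the check also passes when `symlinked-feats` is a link to a directory stored elsewhere.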

For example, to reproduce the MSVD results described above, run the following sequence of commands:

# fetch the pretrained experts for MSVD 
python3 misc/sync_experts.py --dataset MSVD

# find the name of a pretrained model using the links in the tables above 
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth

# create a local directory and download the model into it 
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"

# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config
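
The download step above (mkdir followed by wget) can also be done from Python, which may help on systems without wget. This is a minimal sketch using only the standard library. The function names `model_url` and `fetch_model` are hypothetical, and the base URL is the one used in the wget command above.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command in the example above.
BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/"


def model_url(model_rel_path):
    """Build the download URL for a model path relative to the repo root."""
    return BASE_URL + model_rel_path


def fetch_model(model_rel_path, root="."):
    """Create the parent directory and download the model into it."""
    dest = Path(root) / model_rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(model_url(model_rel_path), str(dest))
    return dest
```

Calling `fetch_model` with the same relative path as the `MODEL` variable above mirrors the shell commands, so the file ends up where `--resume ${MODEL}` expects it.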

Training a new model

Training a new video-text embedding requires:

  1. The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
  2. A config.json file. You can define your own, or use one of the provided configs in the configs directory.

Training is then performed with the following command:

python3 train.py --config <path-to-config.json> --device <gpu-id>

where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.

For example, to train a new embedding for the LSMDC dataset, run the following sequence of commands:

# fetch the pretrained experts for LSMDC 
python3 misc/sync_experts.py --dataset LSMDC

# Train the model
python3 train.py --config configs/lsmdc/train-full-ce.json --device 0

Visualising the retrieval ranking

Tensorboard lacks video support via HTML5 tags (at the time of writing) so after each evaluation of a retrieval model, a simple HTML file is generated to allow the predicted rankings of different videos to be visualised: an example screenshot is given below (this tool is inspired by the visualiser in the pix2pix codebase). To view the visualisation, navigate to the web directory (this is generated for each experiment, and will be printed in the log during training) and run python3 -m http.server 9999, then navigate to localhost:9999 in your web browser. You should see something like the following:

visualisation
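
As an alternative to changing into the web directory and running `python3 -m http.server 9999`, the same standard-library server can be started from a script. This is a minimal sketch; the function name `make_server` is hypothetical, and it relies on the `directory` argument of `SimpleHTTPRequestHandler`, which is available from Python 3.7.

```python
import functools
import http.server


def make_server(web_dir, port=9999, host="127.0.0.1"):
    """Create an HTTP server that serves files from web_dir."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=str(web_dir)
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage (placeholder path; blocks until interrupted with Ctrl+C):
# make_server("<path-to-web-dir>").serve_forever()
```

Binding to the loopback address keeps the visualisation private to the local machine.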

Note that visualising the results in this manner requires that you also download the source videos for each of the datasets to some directory <src-video-dir>. Then set the visualizer.args.src_video_dir attribute of the training config.json file to point to <src-video-dir>.
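
Setting this attribute can be done by hand in a text editor or with a small script. The following is a minimal sketch, assuming that `visualizer.args.src_video_dir` corresponds to nested JSON objects in config.json. The function name `set_src_video_dir` is hypothetical and not part of this repo.

```python
import json
from pathlib import Path


def set_src_video_dir(config_path, src_video_dir):
    """Set visualizer.args.src_video_dir in a JSON config and save it in place."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text())
    # Assumption: the dotted name maps to nested objects; create them if absent.
    args = config.setdefault("visualizer", {}).setdefault("args", {})
    args["src_video_dir"] = str(src_video_dir)
    config_path.write_text(json.dumps(config, indent=4) + "\n")
    return config
```

The file is rewritten with four-space indentation, so its formatting may differ from the original even though the other values are preserved.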

Dependencies

Dependencies can be installed via pip install -r requirements/pip-requirements.txt.

References

[1] If you find this code useful or use the extracted features, please consider citing:

@inproceedings{Liu2019a,
  author    = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
  booktitle = {arXiv preprint arxiv:1907.13487},
  title     = {Use What You Have: Video retrieval using representations from collaborative experts},
  date      = {2019},
}

[2] If you make use of the MSRVTT or LSMDC features provided by Miech et al. (details are given in their respective READMEs here and here), please cite:

@article{miech2018learning,
  title={Learning a text-video embedding from incomplete and heterogeneous data},
  author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
  journal={arXiv preprint arXiv:1804.02516},
  year={2018}
}

Acknowledgements

This work was inspired by a number of prior works for learning joint embeddings of text and video, but in particular the Mixture-of-Embedding-Experts method proposed by Antoine Miech, Ivan Laptev and Josef Sivic (paper, code). We would also like to thank Zak Stone and Susie Lim for their help with using Cloud TPUs. The code structure uses the pytorch-template by @victoresque.
