This repo provides code for learning and evaluating joint video-text embeddings for the task of video retrieval. Our approach is described in the paper "Use What You Have: Video retrieval using representations from collaborative experts" (paper, project page).
High-level Overview: The Collaborative Experts framework aims to achieve robustness through two mechanisms:
- The use of information from a wide range of modalities, including those that are typically always available in video (such as RGB) as well as more "specific" clues which may only occasionally be present (such as overlaid text).
- A module that aims to combine these modalities into a fixed-size representation in a manner that is robust to noise.
Requirements: The code assumes PyTorch 1.4 and Python 3.7 (other versions may work, but have not been tested). See the section on dependencies towards the end of this file for specific package requirements.
Important: A note on the updated results: A previous version of the codebase (and paper) reported results on the retrieval benchmarks that were affected by a significant software bug, leading to an overestimate of performance. We are extremely grateful to Valentin Gabeur, who discovered this bug (it has been corrected in the current codebase).
We are hosting a video retrieval challenge as part of the Video Pentathlon Workshop. Find out how to participate here!
We provide pretrained models for each dataset to reproduce the results reported in the paper [1] (references follow at the end of this README). Each model is accompanied by training and evaluation logs. Performance is evaluated for retrieval in both directions (joint embeddings can be used for either of these two tasks):
- t2v denotes that a text query is used to retrieve videos
- v2t denotes that a video query is used to retrieve text video descriptions
In the results reported below, the same model is used for both the t2v and v2t evaluations. Each metric is reported as the mean and standard deviation (in parentheses) across three training runs.
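To make the column headings in the tables below concrete, the following is a minimal pure-Python sketch of how R@k (recall at rank k), MdR (median rank) and MnR (mean rank) can be computed from a query-by-candidate similarity matrix, and how a mean and standard deviation across runs can be formatted. It is not the evaluation code used in this repo: the function names are hypothetical, it assumes exactly one correct candidate per query (query i matches candidate i), and the use of the population standard deviation is an assumption.

```python
import statistics


def ranks_from_similarity(sims):
    """Return the 1-indexed rank of the correct candidate for each query.

    sims[i][j] is the similarity between query i and candidate j.
    Assumption: candidate i is the single correct match for query i.
    """
    ranks = []
    for i, row in enumerate(sims):
        target = row[i]
        # Rank = 1 + number of candidates scoring strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def retrieval_metrics(ranks, ks=(1, 5, 10, 50)):
    """Compute recall at k (as a percentage), median rank and mean rank."""
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["MdR"] = statistics.median(ranks)
    metrics["MnR"] = statistics.mean(ranks)
    return metrics


def transpose(sims):
    """Swap queries and candidates, turning a t2v matrix into a v2t matrix."""
    return [list(col) for col in zip(*sims)]


def format_mean_std(values):
    """Format per-run values as mean(std), using the population std (an assumption)."""
    return f"{statistics.mean(values):.1f}({statistics.pstdev(values):.1f})"


# Toy text-to-video similarity matrix: rows are text queries, columns are videos.
t2v_sims = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]
# The same embeddings serve both directions: v2t simply transposes the matrix.
t2v = retrieval_metrics(ranks_from_similarity(t2v_sims), ks=(1, 2))
v2t = retrieval_metrics(ranks_from_similarity(transpose(t2v_sims)), ks=(1, 2))
print(t2v, v2t)
print(format_mean_std([10.0, 10.1, 9.9]))  # 10.0(0.1)
```

Higher is better for R@k, while lower is better for MdR and MnR.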
MSRVTT Benchmark
Model | Split | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
---|---|---|---|---|---|---|---|---|---|
CE | Full | t2v | 10.0(0.1) | 29.0(0.3) | 41.2(0.2) | 71.4(0.1) | 16.0(0.0) | 86.8(0.3) | config, model, log |
CE | 1k-A | t2v | 20.9(1.2) | 48.8(0.6) | 62.4(0.8) | 89.1(0.4) | 6.0(0.0) | 28.2(0.8) | config, model, log |
CE | 1k-B | t2v | 18.2(0.7) | 46.0(0.4) | 60.7(0.2) | 86.6(0.5) | 7.0(0.0) | 35.3(1.1) | config, model, log |
MoEE* | 1k-B | t2v | 15.0(0.7) | 39.7(1.0) | 54.5(1.1) | 82.7(0.6) | 8.3(0.6) | 43.7(0.7) | config, model, log |
CE | Full | v2t | 15.6(0.3) | 40.9(1.4) | 55.2(1.0) | 84.0(0.1) | 8.3(0.6) | 38.1(1.8) | config, model, log |
CE | 1k-A | v2t | 20.6(0.6) | 50.3(0.5) | 64.0(0.2) | 89.9(0.3) | 5.3(0.6) | 25.1(0.8) | config, model, log |
CE | 1k-B | v2t | 18.0(0.8) | 46.0(0.5) | 60.3(0.5) | 86.4(0.3) | 6.5(0.5) | 30.6(1.2) | config, model, log |
MoEE* | 1k-B | v2t | 14.5(0.8) | 40.4(0.8) | 54.9(1.0) | 83.8(0.5) | 8.8(0.4) | 38.7(0.9) | config, model, log |
Models marked with * use the features made available with the MoEE model of [2] (without OCR, speech and scene features). Unstarred models on the 1k-B and Full splits make use of OCR, speech and scene features, as well as slightly stronger text encodings (GPT rather than word2vec; see [1] for details). The MoEE model is implemented as a sanity check that our codebase approximately reproduces [2] (the MoEE paper).
See the MSRVTT README for links to the train/val/test lists of each split.
MSVD Benchmark
Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
---|---|---|---|---|---|---|---|---|
CE | t2v | 19.8(0.3) | 49.0(0.3) | 63.8(0.1) | 89.0(0.2) | 6.0(0.0) | 23.1(0.3) | config, model, log |
CE | v2t | 23.9(1.4) | 50.2(0.8) | 59.6(1.2) | 82.3(0.7) | 5.6(0.5) | 41.2(3.4) | config, model, log |
See the MSVD README for descriptions of the train/test splits. Note that the videos in the MSVD dataset do not have soundtracks.
DiDeMo Benchmark
Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
---|---|---|---|---|---|---|---|---|
CE | t2v | 16.1(1.4) | 41.1(0.4) | 54.4(0.8) | 82.7(0.3) | 8.3(0.6) | 43.7(3.6) | config, model, log |
CE | v2t | 15.6(1.3) | 40.9(0.4) | 55.2(0.5) | 82.2(1.3) | 8.2(0.3) | 42.4(3.3) | config, model, log |
See the DiDeMo README for descriptions of the train/val/test splits.
ActivityNet Benchmark
Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
---|---|---|---|---|---|---|---|---|
CE | t2v | 18.2(0.3) | 47.7(0.6) | 63.9(0.5) | 91.4(0.4) | 6.0(0.0) | 23.1(0.5) | config, model, log |
CE | v2t | 17.7(0.6) | 46.6(0.7) | 62.8(0.4) | 90.9(0.2) | 6.0(0.0) | 24.4(0.5) | config, model, log |
See the ActivityNet README for descriptions of the train/test splits.
LSMDC Benchmark
Model | Task | R@1 | R@5 | R@10 | R@50 | MdR | MnR | Links |
---|---|---|---|---|---|---|---|---|
CE | t2v | 11.2(0.4) | 26.9(1.1) | 34.8(2.0) | 62.1(1.5) | 25.3(3.1) | 96.8(5.0) | config, model, log |
CE | v2t | 11.7(0.5) | 25.8(1.5) | 34.4(1.7) | 61.4(0.7) | 28.0(2.6) | 97.6(2.8) | config, model, log |
See the LSMDC README for descriptions of the train/test splits. Please note that to obtain the features and descriptions for this dataset, you must obtain permission from MPII to use the data (this process is described here). Once you have done so, please request that a member of the LSMDC team contacts us to confirm approval (via albanie at robots dot ox dot ac dot uk) - we can then provide you with a link to the features.
We conduct several ablation studies to investigate the importance of different components in the Collaborative Experts design. Each ablation is conducted on the MSRVTT dataset.
CE Design: First, we investigate the importance of the parts used by the CE model.
Model | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Concat | t2v | 0.0(0.0) | 0.0(0.0) | 0.0(0.0) | 1495.5(0.0) | 369.72k | config, model, log |
CE - MW,P,CG | t2v | 8.5(0.1) | 25.9(0.3) | 37.6(0.2) | 19.0(0.0) | 246.22M | config, model, log |
CE - P,CG | t2v | 9.6(0.1) | 28.0(0.2) | 39.7(0.2) | 17.7(0.6) | 400.41M | config, model, log |
CE - CG | t2v | 9.7(0.1) | 28.1(0.2) | 40.2(0.1) | 17.0(0.0) | 181.07M | config, model, log |
CE | t2v | 10.0(0.1) | 29.0(0.3) | 41.2(0.2) | 16.0(0.0) | 183.45M | config, model, log |
Concat | v2t | 0.0(0.0) | 0.0(0.0) | 0.0(0.0) | 29897.5(0.0) | 369.72k | config, model, log |
CE - MW,P,CG | v2t | 13.7(0.4) | 38.8(1.2) | 53.1(1.1) | 9.2(0.8) | 246.22M | config, model, log |
CE - P,CG | v2t | 14.1(0.2) | 39.5(1.0) | 53.2(0.3) | 9.0(0.0) | 400.41M | config, model, log |
CE - CG | v2t | 15.1(0.3) | 40.3(0.5) | 54.3(0.7) | 8.8(0.3) | 181.07M | config, model, log |
CE | v2t | 15.6(0.3) | 40.9(1.4) | 55.2(1.0) | 8.3(0.6) | 183.45M | config, model, log |
Each row adds an additional component to the model. The names refer to the following model designs:
- Concat: A barebones concatenation model. After aggregating each expert across time (which still requires some parameters for the variable-length VLAD layers), the experts are concatenated and compared directly against the aggregated text embeddings. Note: this model uses a slightly greater number of VLAD clusters than the others to allow the concatenated embedding to match the dimensionality of the text.
- CE - MW,P,CG - The CE model without MoE weights, projecting to a common dimension or Collaborative Gating.
- CE - P,CG - The CE model without projecting to a common dimension or Collaborative Gating (note that this is equivalent to the MoEE model proposed in [2]).
- CE - CG - The CE model without Collaborative Gating (CG).
- CE - The full CE model.
Note that in the table above some metrics have been removed to allow the number of parameters to be displayed---these additional metrics can be found in the linked logs.
Importance of Different Experts: The next ablation investigates the value of each of the different experts towards the final embedding. Since not all experts are available in every video, we pair each expert with scene features, to give an approximation of their usefulness.
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 4.0(0.1) | 14.1(0.1) | 22.4(0.3) | 50.0(1.0) | 19.46M | config, model, log |
Scene + Inst. | t2v | 7.2(0.1) | 22.3(0.3) | 33.0(0.2) | 25.3(0.6) | 41.12M | config, model, log |
Scene + r2p1d | t2v | 6.8(0.1) | 21.7(0.1) | 32.4(0.1) | 25.7(0.6) | 39.95M | config, model, log |
Scene + RGB | t2v | 5.0(0.2) | 16.6(0.7) | 25.5(1.0) | 40.7(2.1) | 41.12M | config, model, log |
Scene + Flow | t2v | 5.3(0.3) | 17.6(0.8) | 27.1(0.9) | 36.0(1.7) | 40.34M | config, model, log |
Scene + Audio | t2v | 5.6(0.0) | 18.7(0.1) | 28.2(0.1) | 33.7(0.6) | 40.34M | config, model, log |
Scene + OCR | t2v | 4.1(0.1) | 14.1(0.1) | 22.2(0.2) | 50.3(1.2) | 49.49M | config, model, log |
Scene + Speech | t2v | 4.6(0.1) | 15.5(0.2) | 24.4(0.2) | 44.7(1.2) | 43.94M | config, model, log |
Scene + Face | t2v | 4.1(0.1) | 14.2(0.3) | 22.4(0.4) | 49.7(0.6) | 39.95M | config, model, log |
Scene | v2t | 5.6(0.6) | 18.2(0.6) | 27.7(0.3) | 39.0(0.0) | 19.46M | config, model, log |
Scene + Inst. | v2t | 10.1(0.3) | 29.7(0.5) | 41.9(0.7) | 15.2(0.9) | 41.12M | config, model, log |
Scene + r2p1d | v2t | 9.4(0.3) | 27.8(0.6) | 40.1(1.1) | 17.2(1.1) | 39.95M | config, model, log |
Scene + RGB | v2t | 6.9(0.5) | 21.2(0.9) | 31.1(1.9) | 28.7(3.8) | 41.12M | config, model, log |
Scene + Flow | v2t | 7.3(0.6) | 22.3(1.4) | 33.4(1.7) | 25.2(2.0) | 40.34M | config, model, log |
Scene + Audio | v2t | 8.2(0.4) | 24.8(0.4) | 36.0(0.1) | 21.7(0.6) | 40.34M | config, model, log |
Scene + OCR | v2t | 5.4(0.5) | 18.6(1.2) | 26.6(1.2) | 40.0(1.0) | 49.49M | config, model, log |
Scene + Speech | v2t | 6.0(0.2) | 20.4(0.5) | 30.3(1.0) | 33.0(2.0) | 43.94M | config, model, log |
Scene + Face | v2t | 5.6(1.0) | 17.9(0.7) | 26.7(0.8) | 39.1(2.6) | 39.95M | config, model, log |
We can also study their cumulative effect:
Experts | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
Scene | t2v | 4.0(0.1) | 14.1(0.1) | 22.4(0.3) | 50.0(1.0) | 19.46M | config, model, log |
Prev. + Speech | t2v | 4.6(0.1) | 15.5(0.2) | 24.4(0.2) | 44.7(1.2) | 43.94M | config, model, log |
Prev. + Audio | t2v | 5.8(0.1) | 19.0(0.3) | 28.8(0.2) | 32.3(0.6) | 62.45M | config, model, log |
Prev. + Flow | t2v | 6.7(0.2) | 21.8(0.4) | 32.5(0.5) | 25.3(0.6) | 80.96M | config, model, log |
Prev. + RGB | t2v | 7.5(0.1) | 23.4(0.0) | 34.1(0.2) | 23.7(0.6) | 100.26M | config, model, log |
Prev. + Inst. | t2v | 9.5(0.2) | 27.7(0.1) | 39.4(0.1) | 18.0(0.0) | 119.56M | config, model, log |
Prev. + R2P1D | t2v | 9.9(0.1) | 28.6(0.3) | 40.7(0.1) | 17.0(0.0) | 137.67M | config, model, log |
Prev. + OCR | t2v | 10.0(0.1) | 28.8(0.2) | 40.9(0.2) | 16.7(0.6) | 165.33M | config, model, log |
Prev. + Face | t2v | 10.0(0.1) | 29.0(0.3) | 41.2(0.2) | 16.0(0.0) | 183.45M | config, model, log |
Scene | v2t | 5.6(0.6) | 18.2(0.6) | 27.7(0.3) | 39.0(0.0) | 19.46M | config, model, log |
Prev. + Speech | v2t | 6.0(0.2) | 20.4(0.5) | 30.3(1.0) | 33.0(2.0) | 43.94M | config, model, log |
Prev. + Audio | v2t | 8.6(0.2) | 26.1(0.6) | 37.8(0.8) | 19.8(0.8) | 62.45M | config, model, log |
Prev. + Flow | v2t | 9.9(0.4) | 28.6(0.7) | 41.7(0.8) | 15.7(0.6) | 80.96M | config, model, log |
Prev. + RGB | v2t | 11.2(0.3) | 32.1(0.8) | 45.4(0.6) | 13.7(0.6) | 100.26M | config, model, log |
Prev. + Inst. | v2t | 14.7(0.6) | 38.9(0.8) | 53.1(1.0) | 9.3(0.6) | 119.56M | config, model, log |
Prev. + R2P1D | v2t | 15.5(0.6) | 40.1(1.2) | 54.4(1.3) | 8.7(0.6) | 137.67M | config, model, log |
Prev. + OCR | v2t | 15.2(0.1) | 41.1(0.6) | 54.6(0.7) | 8.5(0.5) | 165.33M | config, model, log |
Prev. + Face | v2t | 15.6(0.3) | 40.9(1.4) | 55.2(1.0) | 8.3(0.6) | 183.45M | config, model, log |
Importance of Model Capacity: The next ablation investigates the value of the shared embedding dimension used by CE.
Dimension | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|
384 | t2v | 9.4(0.2) | 27.8(0.4) | 39.8(0.4) | 17.7(0.6) | 88.62M | config, model, log |
512 | t2v | 9.8(0.3) | 28.6(0.4) | 40.6(0.4) | 17.0(0.0) | 119.51M | config, model, log |
640 | t2v | 10.1(0.1) | 28.8(0.1) | 40.9(0.2) | 16.7(0.6) | 151.12M | config, model, log |
768 | t2v | 10.0(0.1) | 29.0(0.3) | 41.2(0.2) | 16.0(0.0) | 183.45M | config, model, log |
1024 | t2v | 9.9(0.1) | 28.6(0.3) | 40.7(0.4) | 17.0(0.0) | 250.27M | config, model, log |
384 | v2t | 14.0(0.5) | 38.7(0.5) | 52.7(1.4) | 9.3(0.6) | 88.62M | config, model, log |
512 | v2t | 14.8(0.4) | 40.4(0.6) | 53.9(0.4) | 8.8(0.3) | 119.51M | config, model, log |
640 | v2t | 15.6(0.6) | 41.3(0.7) | 55.0(0.5) | 8.3(0.6) | 151.12M | config, model, log |
768 | v2t | 15.6(0.3) | 40.9(1.4) | 55.2(1.0) | 8.3(0.6) | 183.45M | config, model, log |
1024 | v2t | 14.7(0.4) | 40.7(0.8) | 54.4(0.3) | 8.5(0.5) | 250.27M | config, model, log |
Training with more captions: Rather than varying the number of experts, we can also investigate how performance changes as we change the number of training captions available per video.
Experts | Caps. | Task | R@1 | R@5 | R@10 | MdR | Params | Links |
---|---|---|---|---|---|---|---|---|
RGB | 1 | t2v | 2.6(0.1) | 9.3(0.4) | 15.0(0.7) | 101.3(15.5) | 56.7M | config, model, log |
RGB | 20 | t2v | 4.9(0.1) | 16.5(0.2) | 25.3(0.4) | 40.7(1.2) | 58.05M | config, model, log |
All | 1 | t2v | 4.8(0.2) | 16.2(0.5) | 25.0(0.7) | 43.3(4.0) | 183.45M | config, model, log |
All | 20 | t2v | 10.0(0.1) | 29.0(0.3) | 41.2(0.2) | 16.0(0.0) | 183.45M | config, model, log |
RGB | 1 | v2t | 3.7(0.3) | 13.5(0.6) | 20.8(0.4) | 60.0(2.0) | 56.7M | config, model, log |
RGB | 20 | v2t | 6.9(0.6) | 21.0(0.3) | 31.3(0.3) | 30.0(1.7) | 58.05M | config, model, log |
All | 1 | v2t | 8.4(0.5) | 25.6(0.7) | 37.1(0.2) | 20.3(0.6) | 183.45M | config, model, log |
All | 20 | v2t | 15.6(0.3) | 40.9(1.4) | 55.2(1.0) | 8.3(0.6) | 183.45M | config, model, log |
Similar ablation studies for the remaining datasets can be found here.
For each dataset, the Collaborative Experts model makes use of a collection of pretrained "expert" feature extractors (see [1] for more precise descriptions). Some experts have been obtained from other sources (described where applicable), rather than extracted by us. To reproduce the experiments listed above, the experts for each dataset have been bundled into compressed tar files. These can be downloaded and unpacked with a utility script (recommended -- see example usage below), which will store them in the locations expected by the training code. Each set of experts has a brief README, which also provides a link from which they can be downloaded directly.
Dataset | Experts | Details and links | Archive size | sha1sum |
---|---|---|---|---|
MSRVTT | audio, face, flow, ocr, rgb, scene, speech | README | 19.6 GiB | 959bda588793ef05f348d16de26da84200c5a469 |
LSMDC | audio, face, flow, ocr, rgb, scene | README | 6.1 GiB | 7ce018e981752db9e793e449c2ba5bc88217373d |
MSVD | face, flow, ocr, rgb, scene | README | 2.1 GiB | 6071827257c14de455b3a13fe1e885c2a7887c9e |
DiDeMo | audio, face, flow, ocr, rgb, scene, speech | README | 2.3 GiB | 6fd4bcc68c1611052de2499fd8ab3f488c7c195b |
ActivityNet | audio, face, flow, ocr, rgb, scene, speech | README | 3.8 GiB | b16685576c97cdec2783fb89ea30ca7d17abb021 |
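If an archive is downloaded directly rather than through the utility script, its integrity can be checked against the sha1sum column above. The following is a minimal sketch using only the Python standard library; the function names and the placeholder archive path are hypothetical and are not part of this repo.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte archives do not need to fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path, expected_sha1):
    """Return True if the file at `path` matches the expected checksum."""
    return sha1_of_file(path) == expected_sha1.strip().lower()


# Example usage (the archive path is a placeholder, the checksum is the MSVD
# entry from the table above):
# verify_archive("<path-to-msvd-archive>", "6071827257c14de455b3a13fe1e885c2a7887c9e")
```

A mismatch usually indicates an incomplete or corrupted download, in which case the archive should be fetched again.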
Evaluating a pretrained model for a given dataset requires:
- The pretrained experts for the target dataset, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
- A config.json file.
- A trained_model.pth file.
Evaluation is then performed with the following command:
python3 test.py --config <path-to-config.json> --resume <path-to-trained_model.pth> --device <gpu-id>
where <gpu-id> is the index of the GPU to evaluate on. This option can be omitted to run the evaluation on the CPU.
For example, to reproduce the MSVD results described above, run the following sequence of commands:
# fetch the pretrained experts for MSVD
python3 misc/sync_experts.py --dataset MSVD
# find the name of a pretrained model using the links in the tables above
export MODEL=data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth
# create a local directory and download the model into it
mkdir -p $(dirname "${MODEL}")
wget --output-document="${MODEL}" "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/${MODEL}"
# Evaluate the model
python3 test.py --config configs/msvd/train-full-ce.json --resume ${MODEL} --device 0 --eval_from_training_config
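The directory creation and download steps above can also be scripted in Python. The sketch below mirrors the mkdir and wget commands using only the standard library; the helper names are hypothetical, and the base URL is simply the one that appears in the wget command above.

```python
import urllib.request
from pathlib import Path

# Base URL taken from the wget command in this README.
BASE_URL = "http://www.robots.ox.ac.uk/~vgg/research/collaborative-experts/"


def model_url(rel_path):
    """Build the download URL for a model given its repo-relative path."""
    return BASE_URL + str(rel_path).lstrip("/")


def fetch_model(rel_path):
    """Download a pretrained model to the same relative path locally.

    Creates the parent directories if needed and skips the download
    when the file already exists.
    """
    dest = Path(rel_path)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(model_url(rel_path), str(dest))
    return dest


# Example usage, with the MSVD model path from the commands above:
# fetch_model("data/models/msvd-train-full-ce/5bb8dda1/seed-0/2020-01-30_12-29-56/trained_model.pth")
```

Keeping the local path identical to the remote relative path means the value passed to --resume matches the links in the tables.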
Training a new video-text embedding requires:
- The pretrained experts for the dataset used for training, which should be located in <root>/data/<dataset-name>/symlinked-feats (this will be done automatically by the utility script, or can be done manually).
- A config.json file. You can define your own, or use one of the provided configs in the configs directory.
Training is then performed with the following command:
python3 train.py --config <path-to-config.json> --device <gpu-id>
where <gpu-id> is the index of the GPU to train on. This option can be omitted to run the training on the CPU.
For example, to train a new embedding for the LSMDC dataset, run the following sequence of commands:
# fetch the pretrained experts for LSMDC
python3 misc/sync_experts.py --dataset LSMDC
# Train the model
python3 train.py --config configs/lsmdc/train-full-ce.json --device 0
Tensorboard lacks video support via HTML5 tags (at the time of writing), so after each evaluation of a retrieval model, a simple HTML file is generated to allow the predicted rankings of different videos to be visualised: an example screenshot is given below (this tool is inspired by the visualiser in the pix2pix codebase). To view the visualisation, navigate to the web directory (this is generated for each experiment, and will be printed in the log during training) and run python3 -m http.server 9999, then navigate to localhost:9999 in your web browser. You should see something like the following:
Note that visualising the results in this manner requires that you also download the source videos for each of the datasets to some directory <src-video-dir>. Then set the visualizer.args.src_video_dir attribute of the training config.json file to point to <src-video-dir>.
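The config can be edited by hand, or programmatically. The following is a minimal sketch that sets the attribute with the standard json module. It assumes that the dotted name visualizer.args.src_video_dir corresponds to nested JSON objects ("visualizer" containing "args" containing "src_video_dir"); the function names are hypothetical and not part of this repo.

```python
import json


def set_src_video_dir(config, src_video_dir):
    """Set visualizer.args.src_video_dir on a config dict.

    Assumption: the dotted attribute name maps onto nested dictionaries.
    Missing intermediate levels are created; other keys are left untouched.
    """
    visualizer = config.setdefault("visualizer", {})
    args = visualizer.setdefault("args", {})
    args["src_video_dir"] = src_video_dir
    return config


def update_config_file(config_path, src_video_dir):
    """Load a config.json file, update the source video directory, and save it."""
    with open(config_path) as f:
        config = json.load(f)
    set_src_video_dir(config, src_video_dir)
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)


# Example usage (both paths are placeholders):
# update_config_file("<path-to-config.json>", "<src-video-dir>")
```

Note that rewriting the file with json.dump may change its whitespace, so it is worth keeping a copy of the original config.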
Dependencies can be installed via pip install -r requirements/pip-requirements.txt.
[1] If you find this code useful or use the extracted features, please consider citing:
@inproceedings{Liu2019a,
author = {Liu, Y. and Albanie, S. and Nagrani, A. and Zisserman, A.},
booktitle = {arXiv preprint arxiv:1907.13487},
title = {Use What You Have: Video retrieval using representations from collaborative experts},
date = {2019},
}
[2] If you make use of the MSRVTT or LSMDC features provided by Miech et al. (details are given in their respective READMEs here and here), please cite:
@article{miech2018learning,
title={Learning a text-video embedding from incomplete and heterogeneous data},
author={Miech, Antoine and Laptev, Ivan and Sivic, Josef},
journal={arXiv preprint arXiv:1804.02516},
year={2018}
}
This work was inspired by a number of prior works for learning joint embeddings of text and video, but in particular the Mixture-of-Embedding-Experts method proposed by Antoine Miech, Ivan Laptev and Josef Sivic (paper, code). We would also like to thank Zak Stone and Susie Lim for their help with using Cloud TPUs. The code structure uses the pytorch-template by @victoresque.