- VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
  [Paper][Homepage]
  45,826 videos and their descriptions obtained by harvesting YouTube
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
  [Paper][Homepage]
  10K web video clips, 200K clip-sentence pairs
- VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019)
  [Paper][Homepage]
  41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs
- ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017)
  [Paper][Homepage]
  20k videos, 100k sentences
- ActivityNet Entities: Grounded Video Description
  [Paper][Homepage]
  14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment; augments the ActivityNet Captions dataset with 158k bounding boxes
- WebVid-2M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021)
  [Paper][Homepage]
  over two million videos with weak captions scraped from the internet
- VTW: Title Generation for User Generated Videos (ECCV 2016)
  [Paper][Homepage]
  18,100 video clips with an average duration of 1.5 minutes per clip
- TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR 2016)
  [Paper][Homepage]
  100K animated GIFs from Tumblr and 120K natural language descriptions
- Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
  [Paper][Homepage]
  9,848 annotated videos of 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes, and 41,104 labels for 46 object classes