- VideoStory: A New Multimedia Embedding for Few-Example Recognition and Translation of Events
  [Paper][Homepage]
  45,826 videos and their descriptions obtained by harvesting YouTube
- MSR-VTT: A Large Video Description Dataset for Bridging Video and Language (CVPR 2016)
  [Paper][Homepage]
  10K web video clips, 200K clip-sentence pairs
- VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research (ICCV 2019)
  [Paper][Homepage]
  41,250 videos, 825,000 captions in both English and Chinese, over 206,000 English-Chinese parallel translation pairs
- ActivityNet Captions: Dense-Captioning Events in Videos (ICCV 2017)
  [Paper][Homepage]
  20k videos, 100k sentences
- ActivityNet Entities: Grounded Video Description
  [Paper][Homepage]
  14,281 annotated videos, 52k video segments with at least one noun phrase annotated per segment; augments the ActivityNet Captions dataset with 158k bounding boxes
- WebVid-2M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval (2021)
  [Paper][Homepage]
  over two million videos with weak captions scraped from the internet
- VTW: Title Generation for User Generated Videos (ECCV 2016)
  [Paper][Homepage]
  18,100 video clips with an average duration of 1.5 minutes per clip
- TGIF: A New Dataset and Benchmark on Animated GIF Description (CVPR 2016)
  [Paper][Homepage]
  100K animated GIFs from Tumblr and 120K natural language descriptions
- Charades: Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding (ECCV 2016)
  [Paper][Homepage]
  9,848 annotated videos of 267 people, 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes, and 41,104 labels for 46 object classes