VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Kevin Qinghong Lin, Mike Zheng Shou
Show Lab @ National University of Singapore
| | VLog (CVPR'25) | VLog-Agent |
---|---|---|
TL;DR | Video Narration as Vocabulary | Video as Long Document |
Introduction | A novel, efficient GPT-2-based video narrator that generates narrations by generative retrieval over a narration vocabulary. | Given a video, we convert it into a text document that captures its visual and audio information. Sending this document to an LLM lets us chat about the video! |
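
The "video as long document" idea can be illustrated with a minimal sketch. It is not VLog-Agent's actual implementation: the `Segment` structure, the function names, and the document layout are all hypothetical, and the captions and speech transcripts are assumed to come from upstream models that are not shown here. The sketch only covers merging per-segment text into one timestamped document and wrapping it in a prompt for an LLM.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One time span of a video (hypothetical structure for this sketch)."""
    start: float      # segment start, in seconds
    end: float        # segment end, in seconds
    caption: str      # visual description, assumed to come from a captioner
    speech: str = ""  # transcript, assumed to come from a speech recognizer


def format_time(seconds: float) -> str:
    """Render a time in seconds as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def build_video_document(segments: list[Segment]) -> str:
    """Merge visual and audio text into one timestamped document."""
    lines = []
    # Sort by start time so the document reads chronologically.
    for seg in sorted(segments, key=lambda s: s.start):
        lines.append(f"[{format_time(seg.start)} - {format_time(seg.end)}]")
        lines.append(f"Visual: {seg.caption}")
        if seg.speech:  # skip the audio line for silent segments
            lines.append(f"Audio: {seg.speech}")
    return "\n".join(lines)


def build_chat_prompt(document: str, question: str) -> str:
    """Wrap the document and a user question into a single LLM prompt."""
    return (
        "The following document describes a video.\n\n"
        f"{document}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Toy input, deliberately out of order to show the chronological sort.
segments = [
    Segment(10.0, 20.0, "A person cracks two eggs into a bowl.",
            "First, crack the eggs."),
    Segment(0.0, 10.0, "A kitchen counter with eggs and flour."),
]
document = build_video_document(segments)
prompt = build_chat_prompt(document, "What does the person do with the eggs?")
print(prompt)
```

The resulting `prompt` string can be passed to any chat-capable LLM. Keeping timestamps in the document is a design choice that lets the model answer "when" questions as well as "what" questions.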