A curated list of research papers on video captioning (2015 to 2020), with links to code and project websites where available.

- LSTM-P: Translating Videos to Natural Language Using Deep Recurrent Neural Networks
  Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
  NAACL, 2015. [caffe-code]
- LRCN: Long-term Recurrent Convolutional Networks for Visual Recognition and Description
  Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell
  CVPR, 2015. [website]
- S2VT: Sequence to Sequence – Video to Text
  Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
  ICCV, 2015. [caffe-code]
- SA: Describing Videos by Exploiting Temporal Structure
  Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville
  ICCV, 2015. [theano-code] [tf-code]

- LSTM-E: Jointly Modeling Embedding and Translation to Bridge Video and Language
  Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui
  CVPR, 2016.
- HRNE: Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning
  Pingbo Pan, Zhongwen Xu, Yi Yang, Fei Wu, Yueting Zhuang
  CVPR, 2016.
- h-RNN: Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks
  Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, Wei Xu
  CVPR, 2016.
- MSR-VTT: MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
  Jun Xu, Tao Mei, Ting Yao, Yong Rui
  CVPR, 2016. [website]
- BiLSTM: Video Description using Bidirectional Recurrent Neural Networks
  Álvaro Peris, Marc Bolaños, Petia Radeva, Francisco Casacuberta
  ICANN, 2016.

- DenseVidCap: Weakly Supervised Dense Video Captioning
  Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, Xiangyang Xue
  CVPR, 2017. [tf-code]
- LSTM-TSA: Video Captioning with Transferred Semantic Attributes
  Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei
  CVPR, 2017.
- SCN: Semantic Compositional Networks for Visual Captioning
  Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, Li Deng
  CVPR, 2017. [theano-code]
- StyleNet: StyleNet: Generating Attractive Visual Captions with Styles
  Chuang Gan, Zhe Gan, Xiaodong He, Jianfeng Gao, Li Deng
  CVPR, 2017. [pytorch-code]
- CT-SAN: End-to-end Concept Word Detection for Video Captioning, Retrieval, and Question Answering
  Youngjae Yu, Hyungjin Ko, Jongwook Choi, Gunhee Kim
  CVPR, 2017. [tf-code]
- CGVS: Top-down Visual Saliency Guided by Captions
  Vasili Ramanishka, Abir Das, Jianming Zhang, Kate Saenko
  CVPR, 2017. [tf-code]
- HBA: Hierarchical Boundary-Aware Neural Encoder for Video Captioning
  Lorenzo Baraldi, Costantino Grana, Rita Cucchiara
  CVPR, 2017. [pytorch-code]
- TDDF: Task-Driven Dynamic Fusion: Reducing Ambiguity in Video Description
  Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li, Qi Tian
  CVPR, 2017.
- GEAN: Supervising Neural Attention Models for Video Captioning by Human Gaze Data
  Youngjae Yu, Jongwook Choi, Yeonhwa Kim, Kyung Yoo, Sang-Hun Lee, Gunhee Kim
  CVPR, 2017. [tf-code]
- MM-Att: Attention-Based Multimodal Fusion for Video Description
  Chiori Hori, Takaaki Hori, Teng-Yok Lee, Kazuhiro Sumi, John R. Hershey, Tim K. Marks
  ICCV, 2017.
- Tessellation: Temporal Tessellation: A Unified Approach for Video Analysis
  Dotan Kaufman, Gil Levi, Tal Hassner, Lior Wolf
  ICCV, 2017. [tf-code]
- MTEG: Multi-Task Video Captioning with Video and Entailment Generation
  Ramakanth Pasunuru, Mohit Bansal
  ACL, 2017.
- MAM-RNN: MAM-RNN: Multi-level Attention Model Based RNN for Video Captioning
  Xuelong Li, Bin Zhao, Xiaoqiang Lu
  IJCAI, 2017.
- hLSTMat: Hierarchical LSTM with Adjusted Temporal Attention for Video Captioning
  Jingkuan Song, Lianli Gao, Zhao Guo, Wu Liu, Dongxiang Zhang, Heng Tao Shen
  IJCAI, 2017. [theano-code]

- Survey: Study of Video Captioning Problem
  Jiaqi Su
  cos598B, 2018.
- Fine-grained Video Captioning for Sports Narrative
  Huanyu Yu, Shuo Cheng, Bingbing Ni, Minsi Wang, Jian Zhang, Xiaokang Yang
  CVPR, 2018.
- TSA-ED: Interpretable Video Captioning via Trajectory Structured Localization
  Xian Wu, Guanbin Li, Qingxing Cao, Qingge Ji, Liang Lin
  CVPR, 2018.
- RecNet: Reconstruction Network for Video Captioning
  Bairui Wang, Lin Ma, Wei Zhang, Wei Liu
  CVPR, 2018. [pytorch-code]
- M3: M3: Multimodal Memory Modelling for Video Captioning
  Junbo Wang, Wei Wang, Yan Huang, Liang Wang, Tieniu Tan
  CVPR, 2018.
- PickNet: Less Is More: Picking Informative Frames for Video Captioning
  Yangyu Chen, Shuhui Wang, Weigang Zhang, Qingming Huang
  ECCV, 2018.
- ECO-SCN: ECO: Efficient Convolutional Network for Online Video Understanding
  Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox
  ECCV, 2018. [caffe-code] [pytorch-code]
- SibNet: SibNet: Sibling Convolutional Encoder for Video Captioning
  Sheng Liu, Zhou Ren, Junsong Yuan
  ACM MM, 2018.
- TubeNet: Video Captioning with Tube Features
  Bin Zhao, Xuelong Li, Xiaoqiang Lu
  IJCAI, 2018.

- Survey: Video Description: A Survey of Methods, Datasets and Evaluation Metrics
  Nayyer Aafaq, Ajmal Mian, Wei Liu, Syed Zulqarnain Gilani, Mubarak Shah
  ACM Computing Surveys, 2019.
- GRU-EVE: Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning
  Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, Ajmal Mian
  CVPR, 2019.
- MARN: Memory-Attended Recurrent Network for Video Captioning
  Wenjie Pei, Jiyuan Zhang, Xiangrong Wang, Lei Ke, Xiaoyong Shen, Yu-Wing Tai
  CVPR, 2019.
- OA-BTG: Object-aware Aggregation with Bidirectional Temporal Graph for Video Captioning
  Junchao Zhang, Yuxin Peng
  CVPR, 2019.
- VATEX: VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research
  Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, William Yang Wang
  ICCV, 2019. [website]
- POS: Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning
  Jingyi Hou, Xinxiao Wu, Wentian Zhao, Jiebo Luo, Yunde Jia
  ICCV, 2019.
- POS-CG: Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network
  Bairui Wang, Lin Ma, Wei Zhang, Wenhao Jiang, Jingwen Wang, Wei Liu
  ICCV, 2019. [pytorch-code]
- WIT: Watch It Twice: Video Captioning with a Refocused Video Encoder
  Xiangxi Shi, Jianfei Cai, Shafiq Joty, Jiuxiang Gu
  ACM MM, 2019.
- MGSA: Motion Guided Spatial Attention for Video Captioning
  Shaoxiang Chen, Yu-Gang Jiang
  AAAI, 2019.
- TDConvED: Temporal Deformable Convolutional Encoder-Decoder Networks for Video Captioning
  Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei
  AAAI, 2019.
- FCVC-CF&IA: Fully Convolutional Video Captioning with Coarse-to-Fine and Inherited Attention
  Kuncheng Fang, Lian Zhou, Cheng Jin, Yuejie Zhang, Kangnian Weng, Tao Zhang, Weiguo Fan
  AAAI, 2019.
- TAMoE: Learning to Compose Topic-Aware Mixture of Experts for Zero-Shot Video Captioning
  Xin Wang, Jiawei Wu, Da Zhang, Yu Su, William Yang Wang
  AAAI, 2019. [code]
- VIC: Video Interactive Captioning with Human Prompts
  Aming Wu, Yahong Han, Yi Yang
  IJCAI, 2019. [code]

- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation
  Boxiao Pan, Haoye Cai, De-An Huang, Kuan-Hui Lee, Adrien Gaidon, Ehsan Adeli, Juan Carlos Niebles
  CVPR, 2020.
- SAAT: Syntax-Aware Action Targeting for Video Captioning
  Qi Zheng, Chaoyue Wang, Dacheng Tao
  CVPR, 2020. [pytorch-code]
- ORG-TRL: Object Relational Graph with Teacher-Recommended Learning for Video Captioning
  Ziqi Zhang, Yaya Shi, Chunfeng Yuan, Bing Li, Peijin Wang, Weiming Hu, Zhengjun Zha
  CVPR, 2020.
- PMI-CAP: Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos
  Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
  ECCV, 2020. [pytorch-code]
- RMN: Learning to Discretely Compose Reasoning Module Networks for Video Captioning
  Ganchao Tan, Daqing Liu, Meng Wang, Zheng-Jun Zha
  IJCAI, 2020. [pytorch-code]
- SBAT: SBAT: Video Captioning with Sparse Boundary-Aware Transformer
  Tao Jin, Siyu Huang, Yingming Li, Zhongfei Zhang, Ming Chen
  IJCAI, 2020.
- Joint Commonsense and Relation Reasoning for Image and Video Captioning
  Jingyi Hou, Xinxiao Wu, Xiaoxun Zhang, Yayun Qi, Yunde Jia, Jiebo Luo
  AAAI, 2020.
- SMCG: Controllable Video Captioning with an Exemplar Sentence
  Yitian Yuan, Lin Ma, Jingwen Wang, Wenwu Zhu
  ACM MM, 2020.
- Poet: Poet: Product-oriented Video Captioner for E-commerce
  Shengyu Zhang, Ziqi Tan, Jin Yu, Zhou Zhao, Kun Kuang, Jie Liu, Jingren Zhou, Hongxia Yang, Fei Wu
  ACM MM, 2020.
- Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning
  Botian Shi, Lei Ji, Zhendong Niu, Nan Duan, Ming Zhou, Xilin Chen
  ACM MM, 2020.

- Dense-Captioning Events in Videos
  Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, Juan Carlos Niebles
  ICCV, 2017. [code] [website]
- End-to-End Dense Video Captioning with Masked Transformer
  Luowei Zhou, Yingbo Zhou, Jason J. Corso, Richard Socher, Caiming Xiong
  CVPR, 2018. [pytorch-code]
- Attend and Interact: Higher-Order Object Interactions for Video Understanding
  Chih-Yao Ma, Asim Kadav, Iain Melvin, Zsolt Kira, Ghassan AlRegib, Hans Peter Graf
  CVPR, 2018.
- Jointly Localizing and Describing Events for Dense Video Captioning
  Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
  CVPR, 2018.
- Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning
  Jingwen Wang, Wenhao Jiang, Lin Ma, Wei Liu, Yong Xu
  CVPR, 2018. [tf-code]
- Move Forward and Tell: A Progressive Generator of Video Descriptions
  Yilei Xiong, Bo Dai, Dahua Lin
  ECCV, 2018.
- Adversarial Inference for Multi-sentence Video Description
  Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
  CVPR, 2019. [pytorch-code]
- Dense Relational Captioning: Triple-stream Networks for Relationship-based Captioning
  Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, In So Kweon
  CVPR, 2019. [torch-code]
- Streamlined Dense Video Captioning
  Jonghwan Mun, Linjie Yang, Zhou Ren, Ning Xu, Bohyung Han
  CVPR, 2019.
- Watch, Listen and Tell: Multi-Modal Weakly Supervised Dense Event Captioning
  Tanzila Rahman, Bicheng Xu, Leonid Sigal
  ICCV, 2019.
- An Efficient Framework for Dense Video Captioning
  Maitreya Suin, A. N. Rajagopalan
  AAAI, 2020.
- MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning
  Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara L. Berg, Mohit Bansal
  ACL, 2020. [pytorch-code]
- Identity-Aware Multi-Sentence Video Description
  Jae Sung Park, Trevor Darrell, Anna Rohrbach
  ECCV, 2020.

- GVD: Grounded Video Description
  Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach
  CVPR, 2019. [pytorch-code]
- Relational Graph Learning for Grounded Video Description Generation
  Wenqiao Zhang, Xin Eric Wang, Siliang Tang, Haizhou Shi, Haochen Shi, Jun Xiao, Yueting Zhuang, William Yang Wang
  ACM MM, 2020.