Deep Learning for Video Analysis

Spatiotemporal Features

Deep Learning for Video Classification and Captioning
https://arxiv.org/pdf/1609.06782.pdf

Large-scale Video Classification with Convolutional Neural Networks
https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/42455.pdf

Learning Spatiotemporal Features with 3D Convolutional Networks
http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Tran_Learning_Spatiotemporal_Features_ICCV_2015_paper.pdf

Two-Stream Convolutional Networks for Action Recognition in Videos
https://papers.nips.cc/paper/5353-two-stream-convolutional-networks-for-action-recognition-in-videos.pdf

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors
http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Wang_Action_Recognition_With_2015_CVPR_paper.pdf

Multimodal

AENet: Learning Deep Audio Features for Video Analysis
https://arxiv.org/pdf/1701.00599.pdf

Look, Listen and Learn
https://arxiv.org/pdf/1705.08168.pdf

Objects that Sound
https://arxiv.org/pdf/1712.06651

Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
https://arxiv.org/pdf/1804.02516.pdf

Learning to Separate Object Sounds by Watching Unlabeled Video
https://arxiv.org/pdf/1804.01665.pdf