Jinwoo's literature review on computer vision and machine learning papers
-
Human Action Localization with Sparse Spatial Supervision - P. Weinzaepfel et al., arXiv2017.
"Action Detection using Sparse Spatial Supervision"
- Only use 1/5 bounding box annotation(s) per tube for training
- Use human detector and tracking-by-detection method to obtain human tubes
- Human detector is Faster R-CNN trained on MPII Human Pose dataset
- Classify the human tubes afterwards
- Use IDT + ConvNet features
- Introduce a new untrimmed, weakly supervised action detection dataset, DALY
- Using all bounding box annotations and using only 1/5 bounding box annotation(s) yield similar video mAP
- However, even with all bounding box annotations, the performance is inferior to state-of-the-art methods
-
Unsupervised Action Discovery and Localization in Videos - K. Soomro and M. Shah, ICCV2017.
"Unsupervised Spatio-Temporal Action Detection"
- First paper on the unsupervised action detection problem
- Discriminative clustering to discover which labels are present in a dataset
- Use spectral clustering to get initial clusters (see the sketch below)
- Iteratively selects videos from the non-dominant set
- Obtain spatio-temporal annotations by
- Oversegmenting the video into supervoxels
- Constructing a DAG
- Solving a knapsack optimization with temporal constraints: determine whether to include a supervoxel in the current "action" or not
- Shows competitive performance (in terms of AUC) compared to supervised methods
- Might be applicable to the weakly supervised action detection problem
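A minimal sketch of the initial clustering step mentioned above, assuming each unlabeled video is already encoded as a fixed-length feature vector; the feature extraction and the iterative discriminative refinement from the paper are not shown.

```python
# Minimal sketch: group unlabeled videos into candidate "action" clusters with
# spectral clustering. Assumes each video is already a fixed-length feature vector.
import numpy as np
from sklearn.cluster import SpectralClustering

def initial_action_clusters(video_features, num_actions, random_state=0):
    """video_features: (num_videos, feat_dim) array; returns a cluster id per video."""
    clustering = SpectralClustering(
        n_clusters=num_actions,
        affinity="rbf",          # similarity graph built from an RBF kernel
        assign_labels="kmeans",
        random_state=random_state,
    )
    return clustering.fit_predict(video_features)

# Example with random features standing in for real video descriptors.
feats = np.random.rand(100, 512)
labels = initial_action_clusters(feats, num_actions=10)
print(np.bincount(labels))
```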
-
Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions - P. Mettes and C. G. M. Snoek, ICCV2017.
"Zero-Shot action detection/classification method using actor, object, actor-object relationship, and global context"
- Zero-Shot learning method: no training videos of action required
- Proposes spatial-aware object embeddings: at test time, on top of actor and object detectors, actors, objects, and their interactions are used to detect/classify actions in the given frame
- Use word2vec representations to narrow down the possible objects given an action class (see the sketch below)
- Global objects (objects far away from the actors) are also incorporated to boost the performance
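A small sketch of the word-embedding filtering step, with a hypothetical `embeddings` dictionary standing in for a pretrained word2vec model; the actual pipeline in the paper builds on real embeddings and detector scores.

```python
# Sketch: rank object classes by cosine similarity between their word embeddings
# and the action name's embedding. `embeddings` is a hypothetical stand-in here.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def top_objects_for_action(action, object_classes, embeddings, k=5):
    """Return the k object classes whose embeddings are closest to the action's."""
    a = embeddings[action]
    scored = [(obj, cosine(a, embeddings[obj])) for obj in object_classes]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:k]

# Usage (vectors are random placeholders for illustration):
rng = np.random.default_rng(0)
vocab = ["kayaking", "paddle", "kayak", "dog", "piano"]
embeddings = {w: rng.normal(size=300) for w in vocab}
print(top_objects_for_action("kayaking", ["paddle", "kayak", "dog", "piano"], embeddings))
```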
-
Action Tubelet Detector for Spatio-Temporal Action Localization - V. Kalogeiton et al., ICCV2017. [code] [project web]
-
Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos - R. Hou et al., ICCV2017. [project web]
-
Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection - M. Zolfaghari et al., ICCV2017. [project web]
-
TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal - H. Zhu et al., ICCV2017.
"Spatio-Temporal action proposal using Spatial and Temporal networks" In this paper, two networks are introduced to capture spatial and temporal contexts for spatio-temporal action proposal generation. Temporal context is captured by ConvLSTM and spatial context is captured by a plain ConvNet. Each network predicts frame-level bounding box proposals with confidence and actionness/backgroundness scores. They link the frame-level proposals temporally to generate tube proposals by dynamic programming with confidence scores and overlaps. Then the tube proposals are temporally trimmed by the peak actionness detection algorithm. They use both RGB and Flow as input modalities. Evaluation metrics are ABO, MABO and recall. UCF-101 and UCF-Sports are their testbeds.
-
Online Real-time Multiple Spatiotemporal Action Localisation and Prediction - G. Singh et al., ICCV2017. [code]
-
AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture - S. Saha et al., ICCV2017.
"Propose 3D RPN using two frames with an arbitrary interval: Only using RGB frames, no flow frames"
- Incorporating temporal dependencies by 3D RPN using two frames
- The 3D RPN is a straightforward extension of the 2D RPN (see the sketch below)
- The input to the 3D RPN is an element-wise summation of Conv5 features from two VGGNets for two frames with an arbitrary interval (an interval of 1 or 2 frames in practice)
- Output of the 3D RPN is two sets of bounding boxes corresponding to two input frames: one set of bounding boxes for the first frame, and the other set of bounding boxes for the second frame
- Instead of RoIPooling, bilinear interpolation is used to get a fixed size feature vector
- Only using RGB frames, no flow frames
- Does not show very strong performance (video mAP) on UCF-101 or J-HMDB
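A minimal PyTorch sketch of the two-frame RPN-head idea (fused Conv5 features, two sets of box offsets per anchor); the layer sizes here are illustrative, not the authors' exact configuration.

```python
# Sketch: fuse Conv5 features of the two frames by element-wise summation, then
# predict, per anchor, an objectness score and TWO sets of box offsets (one per frame).
import torch
import torch.nn as nn

class TwoFrameRPNHead(nn.Module):
    def __init__(self, in_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(512, num_anchors * 2, kernel_size=1)      # object / background
        self.reg = nn.Conv2d(512, num_anchors * 2 * 4, kernel_size=1)  # 4 offsets x 2 frames

    def forward(self, feat_t, feat_t_tau):
        fused = feat_t + feat_t_tau          # element-wise summation of the two Conv5 maps
        h = torch.relu(self.conv(fused))
        return self.cls(h), self.reg(h)

# Usage with dummy VGG-style Conv5 feature maps:
f1 = torch.randn(1, 512, 38, 50)
f2 = torch.randn(1, 512, 38, 50)
cls_logits, box_deltas = TwoFrameRPNHead()(f1, f2)
print(cls_logits.shape, box_deltas.shape)  # (1, 18, 38, 50) (1, 72, 38, 50)
```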
-
Am I Done? Predicting Action Progress in Videos - F. Becattini et al., BMVC2017.
-
Generic Tubelet Proposals for Action Localization - J. He et al., arXiv2017.
-
Incremental Tube Construction for Human Action Detection - H. S. Behl et al., arXiv2017.
-
Multi-region two-stream R-CNN for action detection - X. Peng and C. Schmid. ECCV2016. [code]
-
Spot On: Action Localization from Pointly-Supervised Proposals - P. Mettes et al., ECCV2016.
"Action localization using pointly-supervised proposals"
- Use APT (a trajectory-clustering-based method) to obtain tube proposals
- Incorporate an overlap measure between annotated points and proposals into the MIL mining process (see the sketch below)
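A minimal sketch of one plausible point-to-proposal overlap measure (the fraction of annotated points that fall inside the proposal's box on the corresponding frame); the paper's actual measure also takes the distance to the box center into account, so treat this as a simplified stand-in.

```python
# Sketch: fraction of annotated points that land inside the tube proposal's box
# on the corresponding frame.
def point_overlap(tube_boxes, point_annotations):
    """tube_boxes: dict frame_idx -> (x1, y1, x2, y2);
    point_annotations: dict frame_idx -> (x, y). Returns a score in [0, 1]."""
    hits, total = 0, 0
    for f, (px, py) in point_annotations.items():
        if f not in tube_boxes:
            continue
        x1, y1, x2, y2 = tube_boxes[f]
        total += 1
        if x1 <= px <= x2 and y1 <= py <= y2:
            hits += 1
    return hits / total if total else 0.0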
-
Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos - S. Saha et al., BMVC2016. [code] [project web]
-
Learning to track for spatio-temporal action localization - P. Weinzaepfel et al., ICCV2015.
-
Action detection by implicit intentional motion clustering - W. Chen and J. Corso, ICCV2015.
-
Finding Action Tubes - G. Gkioxari and J. Malik, CVPR2015. [code] [project web]
-
APT: Action localization proposals from dense trajectories - J. Gemert et al., BMVC2015. [code]
"Cluster trajectories and use the resulting tubes for action detection"
-
Spatio-Temporal Object Detection Proposals - D. Oneata et al., ECCV2014. [code] [project web]
-
Action localization with tubelets from motion - M. Jain et al., CVPR2014.
"Action localizationn by hierarchical merging supervoxels and use dense trajectory features for tube classification"
-
Spatiotemporal deformable part models for action detection - Y. Tian et al., CVPR2013. [code]
-
Action localization in videos through context walk - K. Soomro et al., ICCV2015.
-
Fast Action Proposals for Human Action Detection and Search - G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP.
-
Temporal Action Detection with Structured Segment Networks - Y. Zhao et al., ICCV2017. [code] [project web]
"Temporal Action Detection using temporal pyramid feature and completeness classifier"
- Works on top of temporal proposal method
- Temporal proposal method is also proposed: "Temporal Actionness Grouping (TAG)"
- On top of the temporal actionness scores, use a watershed algorithm to group them into temporal proposals
- Given an input proposal, generate an augmented proposal
- Augmented proposal has a longer temporal extent to both directions (before and after)
- Notion of "start", "end", and "course" stages: course means the initial proposal, "start" means the frames before the initial proposal starts, "end" means the frames after the initial proposal ends
- Incorporate the temporal context before and after the actions
- Divide the augmented proposal into 9 snippets
- Construct a temporal pyramid feature for the "course" stage
- Two classifiers: "Action classifier" and "Completeness classifier"
- Action classifier: a normal multi-class classifier
- Completeness classifier: a class-specific binary classifier that determines whether each action instance is complete or not
- State-of-the-art performance on temporal action detection
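A rough sketch of the structured feature described above: extend the proposal in both directions, split it into start/course/end stages, and build a two-level temporal pyramid over the course stage from per-snippet features; the exact extension ratio and pyramid configuration here are assumptions for illustration.

```python
# Sketch: structured proposal feature from per-snippet features.
import numpy as np

def structured_feature(snippet_feats, start_idx, end_idx, extend_ratio=0.5):
    """snippet_feats: (T, D) per-snippet features for the whole video.
    [start_idx, end_idx) is the original proposal in snippet indices."""
    T, D = snippet_feats.shape
    length = end_idx - start_idx
    ext = max(1, int(extend_ratio * length))
    s0, e0 = max(0, start_idx - ext), min(T, end_idx + ext)   # augmented proposal

    start_feat = snippet_feats[s0:start_idx].mean(axis=0) if start_idx > s0 else np.zeros(D)
    end_feat = snippet_feats[end_idx:e0].mean(axis=0) if e0 > end_idx else np.zeros(D)

    course = snippet_feats[start_idx:end_idx]
    level1 = course.mean(axis=0)                              # whole course stage
    halves = np.array_split(course, 2)                        # two sub-parts
    level2 = [h.mean(axis=0) if len(h) else np.zeros(D) for h in halves]

    return np.concatenate([start_feat, level1, *level2, end_feat])
```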
-
Temporal Context Network for Activity Localization in Videos - X. Dai et al., ICCV2017.
"Temporal Activity Detection method incorporating temporal context"
- Temporal context means a temporal proposal with a temporal extent larger than the actual action extent
- Propose temporal anchors with various scales for each temporal position
- When encoding a feature, concatenate features from two scales to incorporate temporal context (see the sketch below)
- Apply temporal convolution to further incorporate temporal context
- Temporal context does help, as shown by an ablation study
- State-of-the-art performance on ActivityNet and THUMOS14
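A small sketch of the multi-scale temporal anchors and the pair-scale feature encoding; the scale values, stride, and mean pooling are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: multi-scale temporal anchors; each anchor is encoded by concatenating
# features pooled at its own scale with features pooled at the next larger scale.
import numpy as np

SCALES = [16, 32, 64, 128]  # anchor lengths in frames (assumed values)

def temporal_anchors(num_frames, stride=8):
    """Yield (center, length, context_length) anchors over the video."""
    for center in range(0, num_frames, stride):
        for i, length in enumerate(SCALES):
            context = SCALES[min(i + 1, len(SCALES) - 1)]
            yield center, length, context

def anchor_feature(frame_feats, center, length, context):
    """frame_feats: (T, D). Concatenate mean-pooled features at the two scales."""
    def pool(span):
        lo, hi = max(0, center - span // 2), min(len(frame_feats), center + span // 2)
        return frame_feats[lo:hi].mean(axis=0) if hi > lo else np.zeros(frame_feats.shape[1])
    return np.concatenate([pool(length), pool(context)])
```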
-
Detecting the Moment of Completion: Temporal Models for Localising Action Completion - F. Heidarivincheh et al., arXiv2017.
"A trial for detecting action completions using ConvNet + HMM/LSTM." In this paper, we try to detect a moment of an action completion. We want to separate pre-completion and post-completion of an action frame-by-frame. We define the "completion" as the "goal" of an action is achieved. We use HMM and LSTM on top of ConvNet feature to detect a completion of an action. For HMM, we have 2 hidden states, pre and post. The parameters of HMM, initial and transition probs, covariance matrices and mean vectors are learnt from training data. For LSTM, we feed fc7 feature and per-frame labels (pre or post) to LSTM as an input. Experimental results are quite trivial. Both models can detect the completion of an action with a reasonable accuracy, 75% within 10 frames, under strong assumptions: temporally trimmed sequences (no multiple actions per sequence), momentary completion (completion should be detected using only one frame even for human), and uniform prior for completion (50:50 chance of complete vs. incomplete).
-
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos - Z. Shou et al., CVPR2017. [code]
-
SST: Single-Stream Temporal Action Proposals - S. Buch et al., CVPR2017. [code]
-
R-C3D: Region Convolutional 3D Network for Temporal Activity Detection - H. Xu et al., arXiv2017. [code] [project web]
-
DAPs: Deep Action Proposals for Action Understanding - V. Escorcia et al., ECCV2016. [code] [raw data]
-
Online Action Detection using Joint Classification-Regression Recurrent Neural Networks - Y. Li et al., ECCV2016. Note: RGB-D Action Detection
-
Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs - Z. Shou et al., CVPR2016. [code] Note: Aka S-CNN.
-
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos - F. Heilbron et al., CVPR2016. [code] Note: Depends on C3D, aka SparseProp.
-
Actionness Estimation Using Hybrid Fully Convolutional Networks - L. Wang et al., CVPR2016. [code] Note: The code is not a complete version. It only contains a demo, not training. [project web]
-
Learning Activity Progression in LSTMs for Activity Detection and Early Detection - S. Ma et al., CVPR2016.
-
End-to-end Learning of Action Detection from Frame Glimpses in Videos - S. Yeung et al., CVPR2016. [code] [project web] Note: This method uses reinforcement learning
-
Fast Action Proposals for Human Action Detection and Search - G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP.
-
Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting - P. Mettes et al., ICMR2015.
-
Action localization in videos through context walk - K. Soomro et al., ICCV2015.
- Deep Temporal Linear Encoding Networks - A. Diba et al., CVPR2017. [project web] [code]
- Temporal Convolutional Networks: A Unified Approach to Action Segmentation and Detection - C. Lea et al., CVPR 2017. [code]
- Long-term Temporal Convolutions - G. Varol et al., TPAMI2017. [project web] [code]
- Temporal Segment Networks: Towards Good Practices for Deep Action Recognition - L. Wang et al., arXiv 2016. [code]
-
Attentional Pooling for Action Recognition - R. Girdhar and D. Ramanan, NIPS2017.
"New pooling method with attention for action recognition" In this paper, an attention weighted pooling method is proposed. With a rank 1 approximation of second-order pooling and manipulating the order of matrix multiplications, attention pooling can be veiwed as a combination of class-agnostic bottom-up saliency and class-specific top-down attention. We can replace the average pooling operations in the ResNet architecture by the proposed attention pooling. With the attention pooling, we can get state-of-the-art performance on HMDB51 (video), HICO and MPII (image) dataset.
-
Fully Context-Aware Video Prediction - Byeon et al., arXiv2017.
-
Dynamic Image Networks for Action Recognition - H. Bilen et al., CVPR2016. [code] [project web]
-
Long-term Recurrent Convolutional Networks for Visual Recognition and Description - J. Donahue et al., CVPR2015. [code] [project web]
-
Describing Videos by Exploiting Temporal Structure - L. Yao et al., ICCV2015. [code] Note: from the same group as the RCN paper "Delving Deeper into Convolutional Networks for Learning Video Representations"
-
Two-Stream SR-CNNs for Action Recognition in Videos - L. Wang et al., BMVC2016.
-
Real-time Action Recognition with Enhanced Motion Vector CNNs - B. Zhang et al., CVPR2016. [code]
-
Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors - L. Wang et al., CVPR2015. [code]
-
Convolutional Two-Stream Network Fusion for Video Action Recognition - C. Feichtenhofer et al., CVPR2016. [code]
-
A Closer Look at Spatiotemporal Convolutions for Action Recognition - D. Tran et al., CVPR2018. [code] "2D+1D separate convolution is better than 3D convolution"
- 3D ConvNet architecture search on action classification task
- Baselines implemented using vanilla ResNet-like architecture (has a skip connection)
- fR2D: 2D convolutions over frames independently
- R2D: 2D convolutions over the entire clip. Reshape 4D input tensor x of shape LxHxWx3 to 3LxHxW
- R3D: Use 3D convolutions
- MCx: Use 3D convolutions in the first x layers, use 2D convolutions in the remaining layers
- rMCx: Use 2D convolutions in the first x layers, use 3D convolutions in the remaining layers
- R(2+1)D: Use 2D convolutions + 1D convolutions throughout the entire network. Note that R(2+1)D and R3D have roughly the same number of parameters and the same computational complexity (see the sketch below)
- For all the baselines, they sample a bunch of clips per video and average-pool the clip-level predictions to obtain a video-level classification
- In contrast, I3D uses just a single randomly sampled clip of L=64 frames (for both training and testing)
- Datasets used:
- Training from scratch: Sports 1M, Kinetics
- Transfer learning: UCF101, HMDB51
- Observations
- 2D + 1D convolution is better than 3D convolution, 2D convolution, and mixed 3D/2D convolutions
- Mixed 3D and 2D models: MCx (3D conv early) is better than rMCx (3D conv in deeper layers)
- Motion patterns are important in the earlier layers
- This is the opposite of the observation in Xie et al.
- Performance
- For RGB only and flow only models, R(2+1)D is better than I3D
- R(2+1)D two-stream model shows slightly worse performance than I3D two-stream model on Kinetics
- Note that I3D is pretrained on ImageNet, while R(2+1)D is trained from scratch
- Why is R(2+1)D better than R3D (single RGB/Flow models)?
- Double the number of non-linearities
- Easier to optimize (note that R(2+1)D shows lower training error than R3D)
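A minimal PyTorch sketch of one (2+1)D block: a spatial 1xdxd convolution followed by a temporal tx1x1 convolution, with the intermediate width chosen so the parameter count roughly matches a full txdxd 3D convolution, as described in the paper; block wiring and layer choices beyond that are illustrative.

```python
# Sketch of an R(2+1)D block: factorize a t x d x d 3D convolution into a spatial
# 1 x d x d convolution followed by a temporal t x 1 x 1 convolution.
import torch
import torch.nn as nn

class R2Plus1DBlock(nn.Module):
    def __init__(self, in_ch, out_ch, t=3, d=3):
        super().__init__()
        # intermediate width chosen to roughly match the 3D conv's parameter count
        mid = (t * d * d * in_ch * out_ch) // (d * d * in_ch + t * out_ch)
        self.spatial = nn.Conv3d(in_ch, mid, kernel_size=(1, d, d), padding=(0, d // 2, d // 2))
        self.bn1 = nn.BatchNorm3d(mid)
        self.temporal = nn.Conv3d(mid, out_ch, kernel_size=(t, 1, 1), padding=(t // 2, 0, 0))
        self.bn2 = nn.BatchNorm3d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                           # x: (batch, C, L, H, W)
        x = self.relu(self.bn1(self.spatial(x)))    # extra non-linearity vs. plain 3D conv
        return self.relu(self.bn2(self.temporal(x)))

# Usage on a dummy clip of 8 RGB frames at 112x112:
clip = torch.randn(1, 3, 8, 112, 112)
print(R2Plus1DBlock(3, 64)(clip).shape)  # torch.Size([1, 64, 8, 112, 112])
```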
-
Rethinking Spatiotemporal Feature Learning For Video Understanding - S. Xie et al., arXiv2017.
"Improving I3D, called S3D-G" In this paper, I3D, which inflates all the 2D filters of the InceptionNet to 3D, is enhanced. First, we replace 3D convolutions in a bottom layers to 2D and get higher accuracy and computation efficiency and more compact model. Second, we separate temporal convolution from spatial convolution in every 3D convolution layer. This also makes higher accuracy, more compact model, and faster speed. Finally, spatiotemporal gating is introduced to further boost the accuracy. We show their model performance on the large scale Kinetics dataset for an ablation study. Also we show the proposed model, S3D-G, is generalizable to other tasks such as action classification and detection.
- Action classification performance: 96.8% on UCF-101, 75.9% on HMDB-51 (pretrained on Kinetics)
- Action detection performance: 80.1% on UCF-101, 72.1% on JHMDB (pretrained on Kinetics)
- Maybe most gains come from the Kinetics dataset pretraining.
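A schematic sketch of the feature-gating idea (squeeze the feature map over space and time, predict per-channel sigmoid weights, re-scale the channels); treat it as an illustration of the mechanism, not the exact S3D-G module.

```python
# Sketch of spatiotemporal (self-)gating on a 5D video feature map.
import torch
import torch.nn as nn

class SpatioTemporalGating(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, channels)

    def forward(self, x):                      # x: (batch, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))            # global spatiotemporal average pooling
        gate = torch.sigmoid(self.fc(ctx))     # per-channel gate in (0, 1)
        return x * gate[:, :, None, None, None]

feat = torch.randn(2, 192, 8, 28, 28)
print(SpatioTemporalGating(192)(feat).shape)  # torch.Size([2, 192, 8, 28, 28])
```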
-
ConvNet Architecture Search for Spatiotemporal Feature Learning - D. Tran et al., arXiv2017. Note: Aka Res3D. [code]: In the repository, C3D-v1.1 is the Res3D implementation.
"3D version of ResNet" In this paper, a 3D version of Residual Network is introduced to better encode spatio-temporal information in a video by extensive experimental search. We fix the number of parameters to 33M and conduct extensive experiments to find an optimal architecture. The Res3D contains 1) skip connections, 2) using frame sampling rate of 2 or 4 (optimal on UCF-101), 3) spatial resolution 112x112, 4) layer depth 18. We also find that using 3D conv is better than using 2D conv or 2.5D conv (spatial and temporal conv separated). Shows higher accuracy than C3D on UCF101 and HMDB51. 85.8 vs. 82.3 and 54.9 vs. 51.6 respectively. 2 times faster speed and 2 times smaller model size.
-
Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks - Z. Qui et al., ICCV2017. [code]
-
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset - J. Carreira et al., CVPR2017. Note: Aka I3D. [code]: training code is not provided [unofficial code]: training code is provided but not official
-
Spatiotemporal Residual Networks for Video Action Recognition - C. Feichtenhofer et al., NIPS2016. [code]
-
Learning Spatiotemporal Features with 3D Convolutional Networks - D. Tran et al., ICCV2015. [the official Caffe code] [project web] Note: Aka C3D. [Python Wrapper] Note that the official Caffe does not support a Python wrapper. [TensorFlow], [TensorFlow + Keras], [Another TensorFlow Implementation], [Keras C3D Project web]: [Keras code], [Pretrained weights].
-
PathTrack: Fast Trajectory Annotation with Path Supervision - S. Manen et al., ICCV2017.
"Fast bounding box annotation generation method using path supervision. We may apply this kind of technique to solve weakly supervised detection tasks." In this paper, the goal is generate a large scale multiple-object tracking (MOT) dataset using a path-level supervision. With Amazon Mechanical Turk, they get inputs from users to annotation bounding boxes of objects in various videos. The input annotations are point-wise paths. Using an off-the-shelf object detector and the path annotations, they can automatically generate the full bounding box trajectory annotations. They link and label the detections by optimizing an energy function consists of a unary term and a pairwise term. The unary term penalizes the label outside the bounding box and the pairwise term penalizes the affine detections being assigned to diffrent clusters. By using the proposed method, they can generate a large scale dataset for MOT a with minimum supervision envolved.
-
CortexNet: a Generic Network Family for Robust Visual Temporal Representations - A. Canziani and E. Culurciello, arXiv2017. [code] [project web]
-
Slicing Convolutional Neural Network for Crowd Video Understanding - J. Shao et al., CVPR2016. [code]
- Moments in Time, paper
- AVA, paper, [INRIA web] for missing videos
- Kinetics, paper
- DALY - Daily Action Localization in YouTube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action and one bounding box per action per video.
- 20BN-JESTER, 20BN-SOMETHING-SOMETHING
- ActivityNet Note: They provide a download script and evaluation code here.
- Charades
- Sports-1M - Large scale action recognition dataset.
- THUMOS14 Note: It overlaps with UCF-101 dataset.
- THUMOS15 Note: It overlaps with UCF-101 dataset.
- HOLLYWOOD2: Spatio-Temporal annotations
- UCF-101, annotation provided by THUMOS-14, a corrupted annotation list, UCF-101 corrected annotations, and different versions of annotations. There are also some pre-computed spatiotemporal action detection results
- UCF-50.
- UCF-Sports, note: the train/test split link in the official website is broken. Instead, you can download it from here.
- HMDB
- J-HMDB
- LIRIS-HARL
- KTH
- MSR Action Note: It overlaps with the KTH dataset.
- Efficiently scaling up crowdsourced video annotation - C. Vondrick et al., IJCV2013. [code]
- The Design and Implementation of ViPER - D. Mihalcik and D. Doermann, Technical report.
- Detectron - Open source object detection framework from Facebook AI Research. Includes Mask R-CNN, FPN, etc. Caffe2 implementation.
- Faster R-CNN - S. Ren et al., NIPS2015. [official MatCaffe code], [PyCaffe], [TensorFlow], [Another TF implementation] [Keras] - State-of-the-art object detector.
- YOLO - J. Redmon et al., CVPR2016. [official code], [TensorFlow] - Fast object detector.
- YOLO9000 - J. Redmon and A. Farhadi, CVPR2017. [official code] - State-of-the-art object detector which can detect 9000 objects in realtime.
- SSD - W. Liu et al., ECCV2016. [official PyCaffe code], [TensorFlow], [Keras] - State-of-the-art object detector with realtime processing speed.
- Mask R-CNN - K. He et al., ICCV2017. [TensorFlow + Keras], [MXNet], [TensorFlow], [PyTorch] - State-of-the-art object detection/instance segmentation algorithm.
-
Detect to Track and Track to Detect - C. Feichtenhofer et al., ICCV2017. [code], [project web]
"Video Object Detection and Tracking using R-FCN"
- On top of two frame-level ConvNets: one for frame t and the other for frame t + $\tau$
- Propose a multi-task objective consisting of 1) classification loss, 2) bbox regression loss, and 3) tracking loss
- The tracking loss is a smooth L1 loss between the ground truth and the "tracking regression value" for frame t + $\tau$
- A correlation feature map between the detection at frame t and search candidates at frame t + $\tau$ is computed (see the sketch below)
- An RoI pooling operation is applied to the correlation feature map
- Evaluation on the ImageNet VID dataset
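A small sketch of a local correlation feature map between the two frames' feature maps; the displacement range and the loop-based implementation are for clarity only and are not the authors' code.

```python
# Sketch: for each spatial position, dot-product the frame-t feature with
# frame-(t+tau) features at displacements within +/- max_disp.
import torch
import torch.nn.functional as F

def correlation_map(feat_t, feat_tau, max_disp=4):
    """feat_t, feat_tau: (batch, C, H, W). Returns (batch, (2*max_disp+1)^2, H, W)."""
    b, c, h, w = feat_t.shape
    padded = F.pad(feat_tau, [max_disp] * 4)
    maps = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = padded[:, :, dy:dy + h, dx:dx + w]
            maps.append((feat_t * shifted).sum(dim=1, keepdim=True))  # dot product over channels
    return torch.cat(maps, dim=1)

f1, f2 = torch.randn(1, 256, 38, 50), torch.randn(1, 256, 38, 50)
print(correlation_map(f1, f2).shape)  # torch.Size([1, 81, 38, 50])
```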
-
Flow-Guided Feature Aggregation for Video Object Detection - X. Zhu et al., ICCV2017. [code], aka FGFA
"Using optical flow to guide the temporal feature aggregation for frame-level detection"
- Temporally aggregating the frame-level features
- Use FlowNet to estimate the motion between reference frame and nearby frames
- Warp the nearby frames' feature map by a bilinear warping function to the reference frame
- Temporally aggregate the feature map of the reference frame, and feature maps of the warped nearby frame
- Use element-wise summation with adaptive weights for the aggregation
- Adaptive weights are computed by a cosine similarity measure between the reference frame feature and the nearby frame feature (see the sketch below)
- Apply temporal dropout during training
- Randomly drop nearby frames, e.g., drop 3 frames when the testing frame range is 5 and the training frame range is 2
- This means long-term temporal context can be incorporated by using a long frame range at test time, while training with a frame range of only 2 to reduce computation/memory requirements
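A rough PyTorch sketch of the aggregation step: warp a nearby frame's features to the reference frame with its estimated flow, weight each warped map by cosine similarity to the reference features, and sum. The flow would come from a FlowNet, and the paper computes the similarity on embedded features; plain features are used here for brevity.

```python
# Sketch of flow-guided feature warping and adaptive-weight aggregation.
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """feat: (B, C, H, W), flow: (B, 2, H, W) in pixels (dx, dy)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs[None].float() + flow[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys[None].float() + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)                 # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def aggregate(ref_feat, nearby_feats, flows):
    """nearby_feats, flows: lists of (B, C, H, W) / (B, 2, H, W) tensors."""
    warped = [warp(f, fl) for f, fl in zip(nearby_feats, flows)]
    weights = [F.cosine_similarity(ref_feat, wf, dim=1, eps=1e-6).unsqueeze(1) for wf in warped]
    weights = torch.softmax(torch.cat(weights, dim=1), dim=1)    # normalize over frames
    return sum(weights[:, i:i + 1] * warped[i] for i in range(len(warped)))
```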
-
Detect-and-Track: Efficient Pose Estimation in Videos - R. Girdhar et al., arXiv2017.
"Pose tracking by 3D Mask R-CNN"
- Two-stage approach: 1) dense prediction, 2) linking (tracking) afterwards
- Use 3D Mask R-CNN to detect body keypoints every frame
- Convert the 2D convolutions of ResNet to 3D convolutions
- First show that using the 2D Mask R-CNN achieves the state-of-the-art performance
- Then show that the proposed "inflated" 3D Mask R-CNN performs better than its 2D counterpart when using the same backbone architecture
- Propose a tube proposal network which regresses tube anchors
- Tube anchors are simply spatial anchors replicated in time
- Use bipartite matching to link the predictions over time (see the sketch below)
- Evaluate on PoseTrack dataset
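A minimal sketch of the linking step using bipartite matching over box IoU between consecutive frames; the paper's actual matching cost may also include keypoint or appearance similarity, so treat the cost definition here as an assumption.

```python
# Sketch: Hungarian matching of detections between frame t and frame t+1.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def match_frames(boxes_t, boxes_t1, min_iou=0.3):
    """Return (i, j) index pairs linking boxes at frame t to boxes at frame t+1."""
    cost = np.array([[1.0 - iou(a, b) for b in boxes_t1] for a in boxes_t])
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if 1.0 - cost[i, j] >= min_iou]
```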
-
OpenPose Library - Caffe based realtime pose estimation library from CMU.
-
Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields - Z. Cao et al., CVPR2017. [code] depends on the [caffe RT pose] - Earlier version of OpenPose from CMU
License
To the extent possible under law, Jinwoo Choi has waived all copyright and related or neighboring rights to this work.
Please read the contribution guidelines. Then please feel free to send me pull requests or email (jinchoi@vt.edu) to add links.