Classification experiments

This folder contains classification models that are trained on learned representations generated by PredNet and other pre-trained models.

Relevant files

  • convnet_extract.py: script to extract VGG features from video frames.
  • models.py: definition of the computation graph of the neural network classifiers.
  • train.py: script to train neural network action classifiers.
  • train_linear.py: script to train linear SVM classifiers.
  • data.py: input generator implementation that can handle sequences of images or pickle files.
  • settings.py: parameters for each experiment.

Example: train an LSTM classifier to solve the 10-class task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.

> python train.py prednet_kitti_finetuned_moments_full --task 10c -m lstm

Example: train a linear SVM classifier to solve the "binary temporal" (2-class hard) task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.

> python train_linear.py prednet_kitti_finetuned_moments_full --task 2c_hard
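
To queue several runs, the example commands above can be assembled programmatically. The following is only a sketch: `build_command` is a hypothetical helper that is not part of this folder, and it relies solely on the arguments visible in the examples (the positional configuration name, `--task`, and `-m`). It prints the commands rather than executing them.

```python
import shlex
import sys


def build_command(script, config, task, model=None):
    """Assemble an argument list for one of the training scripts.

    Hypothetical helper: only the arguments shown in the README examples
    are used (positional config name, --task, and optionally -m).
    """
    cmd = [sys.executable, script, config, "--task", task]
    if model is not None:
        cmd += ["-m", model]
    return cmd


CONFIG = "prednet_kitti_finetuned_moments_full"

commands = [
    build_command("train.py", CONFIG, task="10c", model="lstm"),
    build_command("train_linear.py", CONFIG, task="2c_hard"),
]

for cmd in commands:
    # Print the script and its arguments (interpreter path omitted).
    # To launch a run, pass `cmd` to subprocess.run(cmd, check=True)
    # from inside this folder.
    print(shlex.join(cmd[1:]))
```

Building the command as a list instead of a single string avoids shell quoting problems if a configuration name ever contains special characters.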

Action recognition with small datasets

Objectives:

  • Investigate the limitations of ConvNet representations on video understanding tasks
  • Test the influence of the amount of unsupervised training data on the supervised task performance

Tasks:

  • Easy (spatial): "cooking" vs. "walking"
  • Hard (temporal): "running" vs. "walking"

Results: Test set

| Features + Classifier | 2-class easy | 2-class hard | 5-class | 10-class |
| --- | --- | --- | --- | --- |
| VGG random + SVM | 67.0 | 56.0 | 27.6 | 18.7 |
| VGG ImageNet + SVM | 85.5 | 67.0 | 44.6 | 52.8 |
| VGG ImageNet + LSTM | 87.4 | 58.4 | 54.9 | 43.2 |
| PredNet random + SVM | 67.6 | 62.6 | 37.2 | 30.1 |
| PredNet KITTI + SVM | 73.2 | 70.7 | 50.7 | 39.8 |
| PredNet Moments 3h + SVM | 73.2 | 66.1 | 49.5 | 39.5 |
| PredNet Moments 67h + SVM | 74.2 | 65.1 | 50.9 | 41.4 |
| PredNet Moments 67h + LSTM | 78.6 | 55.8 | 50.1 | 42.9 |

Insights

  • VGG features pre-trained on ImageNet work very well when spatial information is decisive for action classification. However, they fall short of capturing the fine-grained temporal patterns needed to distinguish running from walking.
  • PredNet random features perform better than VGG random features, especially in the 2-class temporal (hard) task. This indicates that PredNet has better inductive biases for capturing temporal patterns.
  • As we add more data, PredNet features improve performance on the action classification tasks. However, as more out-of-domain data (unrelated classes) is added, performance drops in the 2-class temporal task. Still, the results are competitive with ImageNet-derived features.
  • Using an LSTM instead of an SVM improves the results in the 2-class spatial (easy) task and worsens them in the 2-class temporal (hard) task. It is not clear why.
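
The last point can be checked against the test-set table by subtracting the SVM score from the LSTM score for the two feature sets that were evaluated with both classifiers. The snippet below is a small illustrative calculation, not part of the repository; the numbers are copied from the table and `lstm_minus_svm` is a name introduced here.

```python
# Test-set scores copied from the results table (same units as the table).
TASKS = ["2-class easy", "2-class hard", "5-class", "10-class"]
RESULTS = {
    ("VGG ImageNet", "SVM"): [85.5, 67.0, 44.6, 52.8],
    ("VGG ImageNet", "LSTM"): [87.4, 58.4, 54.9, 43.2],
    ("PredNet Moments 67h", "SVM"): [74.2, 65.1, 50.9, 41.4],
    ("PredNet Moments 67h", "LSTM"): [78.6, 55.8, 50.1, 42.9],
}


def lstm_minus_svm(features):
    """Return the per-task score change when the SVM is replaced by an LSTM."""
    svm = RESULTS[(features, "SVM")]
    lstm = RESULTS[(features, "LSTM")]
    return {task: round(l - s, 1) for task, s, l in zip(TASKS, svm, lstm)}


for features in ("VGG ImageNet", "PredNet Moments 67h"):
    deltas = lstm_minus_svm(features)
    # Positive values mean the LSTM scored higher than the SVM.
    print(features, deltas)
```

For both feature sets the difference is positive on the easy task and negative on the hard task, which is the pattern described above.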

Results: Validation set

| Model | Easy (loss) | Hard (loss) | Easy (acc) | Hard (acc) |
| --- | --- | --- | --- | --- |
| VGG ImageNet | 0.274 | 0.688 | 0.867 | 0.578 |
| VGG ImageNet LSTM | 0.224 | 0.634 | 0.933 | 0.689 |
| PredNet random | 0.690 | 0.694 | 0.544 | 0.556 |
| PredNet KITTI | 0.582 | 0.685 | 0.722 | 0.622 |
| PredNet KITTI + Moments 1h | 0.470 | 0.649 | 0.778 | 0.611 |
| PredNet KITTI + Moments 1.25h* | 0.513 | 0.668 | 0.768 | 0.595 |
| PredNet KITTI + Moments 3h | 0.583 | 0.676 | 0.778 | 0.500 |
| PredNet KITTI + Moments 67h | 0.592 | 0.666 | 0.744 | 0.533 |

* In this run we let the model "see" the held-out labelled data

Out-of-domain action recognition: UCF-101

Objectives:

  • Assess the performance of model variants on an out-of-domain task
  • Compare with baselines from the literature (focus on unsupervised approaches)

| Model | UCF-101 RGB (%) | Pre-training dataset | Pre-training size (frames) |
| --- | --- | --- | --- |
| CNN tuple verification [1] | 50.2 | UCF-101 | 2.7M |
| ConvNet + LSTM | xx.x | - | 0 |
| PredNet Video random | 1.64 | - | 0 |
| PredNet Video 67h | 51.9 | Moments in Time | 2.4M |
| PredNet Video random | 22.7 | - | 0 |
| PredNet Audio 37h | 24.8 | Moments in Time | 2.4M |

[1] Misra, I., Zitnick, C. L., & Hebert, M. (2016, October). Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision (pp. 527-544). Springer, Cham.

Exploring the audio modality

Objectives:

  • Check if our unsupervised learning approach can learn useful representations from audio
  • Check if the audio modality provides complementary information

Results: Test set

| Features + Classifier | 2-class easy | 2-class hard | 10-class |
| --- | --- | --- | --- |
| PredNet Video random + SVM | 67.6 | 62.6 | 30.1 |
| PredNet Video 66.6h + SVM | 74.2 | 65.1 | 41.4 |
| PredNet Audio random + SVM | 63.6 | 56.8 | 30.3 |
| PredNet Audio 2h + SVM | 66.9 | 56.8 | 29.1 |
| PredNet Audio 37h + SVM | 67.8 | 58.3 | 30.0 |
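
One way to read this table against the first objective is to compare each trained model with its randomly initialised counterpart. The snippet below is an illustrative calculation only, using the numbers from the table; `gain_over_random` is a name introduced here and not part of the repository.

```python
# Test-set scores copied from the table above (same units as the table).
TASKS = ["2-class easy", "2-class hard", "10-class"]
SCORES = {
    "PredNet Video random": [67.6, 62.6, 30.1],
    "PredNet Video 66.6h": [74.2, 65.1, 41.4],
    "PredNet Audio random": [63.6, 56.8, 30.3],
    "PredNet Audio 2h": [66.9, 56.8, 29.1],
    "PredNet Audio 37h": [67.8, 58.3, 30.0],
}


def gain_over_random(trained, random_init):
    """Return the per-task score difference between a trained model and its random baseline."""
    return {
        task: round(t - r, 1)
        for task, t, r in zip(TASKS, SCORES[trained], SCORES[random_init])
    }


video_gain = gain_over_random("PredNet Video 66.6h", "PredNet Video random")
audio_gain = gain_over_random("PredNet Audio 37h", "PredNet Audio random")

print("video:", video_gain)
print("audio:", audio_gain)
```

On these numbers, unsupervised training on video yields a larger gain over the random baseline than training on audio in every task, and the audio model does not improve on its random baseline in the 10-class task.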