This folder contains classification models that are trained on learned representations generated by PredNet and other pre-trained models.
- convnet_extract.py: script to extract VGG features from video frames.
- models.py: definition of the computation graph of the neural network classifiers.
- train.py: script to train neural network action classifiers.
Example: train an LSTM classifier to solve the 10-class task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.
> python train.py prednet_kitti_finetuned_moments_full --task 10c -m lstm
- train_linear.py: script to train linear (SVM) action classifiers.
Example: train a linear SVM classifier to solve the "binary temporal" task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.
> python train_linear.py prednet_kitti_finetuned_moments_full --task 2c_hard
- data.py: input generator implementation that can handle sequences of images or pickle files.
- settings.py: parameters for each experiment.
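
As an illustration of the kind of input generator that data.py is described as providing, the sketch below batches samples stored in pickle files. It is not the code in data.py: the function names `feature_batches` and `write_demo_files`, the one-tuple-per-file layout, and the file names are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def feature_batches(paths, batch_size):
    """Yield lists of (features, label) pairs loaded from pickle files.

    Assumed layout (hypothetical): each file stores one (features, label) tuple.
    """
    batch = []
    for path in paths:
        with open(path, "rb") as f:
            batch.append(pickle.load(f))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


def write_demo_files(directory, n):
    """Write n toy pickle files and return their paths (demo data only)."""
    paths = []
    for i in range(n):
        path = os.path.join(directory, f"clip_{i}.pkl")
        with open(path, "wb") as f:
            pickle.dump(([float(i), float(i) + 0.5], i % 2), f)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    demo_paths = write_demo_files(tmp, 5)
    batch_sizes = [len(b) for b in feature_batches(demo_paths, batch_size=2)]

print(batch_sizes)
```

Writing the loader as a generator keeps only one batch in memory at a time, which matters when the extracted features are large.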
- Investigate the limitations of Convnet representations on video understanding tasks
- Test the influence of the amount of unsupervised training data on the supervised task performance
- Easy: "cooking" vs. "walking"
- Hard: "running" vs. "walking"
Features + Classifier | 2-class easy | 2-class hard | 5-class | 10-class |
---|---|---|---|---|
VGG random + SVM | 67.0 | 56.0 | 27.6 | 18.7 |
VGG ImageNet + SVM | 85.5 | 67.0 | 44.6 | 52.8 |
VGG ImageNet + LSTM | 87.4 | 58.4 | 54.9 | 43.2 |
PredNet random + SVM | 67.6 | 62.6 | 37.2 | 30.1 |
PredNet KITTI + SVM | 73.2 | 70.7 | 50.7 | 39.8 |
PredNet Moments 3h + SVM | 73.2 | 66.1 | 49.5 | 39.5 |
PredNet Moments 67h + SVM | 74.2 | 65.1 | 50.9 | 41.4 |
PredNet Moments 67h + LSTM | 78.6 | 55.8 | 50.1 | 42.9 |
- VGG features pre-trained on ImageNet work very well when spatial information is decisive for action classification. However, they fall short of capturing the fine-grained temporal patterns needed to distinguish running from walking.
- PredNet random features perform better than VGG random features, especially in the 2-class temporal task. This indicates that PredNet has better inductive biases to capture temporal patterns.
- As we add more data, PredNet features improve action classification performance. However, as more out-of-domain data (unrelated classes) is added, performance drops on the 2-class temporal task. Still, the results are competitive with ImageNet-derived features.
- Using an LSTM instead of an SVM improves the results on the 2-class spatial task and worsens them on the 2-class temporal task. It is not clear why.
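
The LSTM versus SVM comparison in the last observation can be checked directly against the table. The snippet below is a small sketch that copies the four paired rows from the table and prints the difference per task; the names `TASKS`, `RESULTS` and `lstm_minus_svm` are introduced here for illustration only.

```python
# Scores copied from the results table above, in column order.
TASKS = ("2-class easy", "2-class hard", "5-class", "10-class")
RESULTS = {
    ("VGG ImageNet", "SVM"): (85.5, 67.0, 44.6, 52.8),
    ("VGG ImageNet", "LSTM"): (87.4, 58.4, 54.9, 43.2),
    ("PredNet Moments 67h", "SVM"): (74.2, 65.1, 50.9, 41.4),
    ("PredNet Moments 67h", "LSTM"): (78.6, 55.8, 50.1, 42.9),
}


def lstm_minus_svm(features):
    """Return the per-task score difference (LSTM minus SVM) for one feature set."""
    svm = RESULTS[(features, "SVM")]
    lstm = RESULTS[(features, "LSTM")]
    return {task: round(l - s, 1) for task, s, l in zip(TASKS, svm, lstm)}


for features in ("VGG ImageNet", "PredNet Moments 67h"):
    print(features, lstm_minus_svm(features))
```

For both feature sets the difference is positive on the 2-class easy task and negative on the 2-class hard task, which matches the observation above.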
Model | Easy (loss) | Hard (loss) | Easy (acc) | Hard (acc) |
---|---|---|---|---|
VGG ImageNet | 0.274 | 0.688 | 0.867 | 0.578 |
VGG ImageNet LSTM | 0.224 | 0.634 | 0.933 | 0.689 |
PredNet random | 0.690 | 0.694 | 0.544 | 0.556 |
PredNet KITTI | 0.582 | 0.685 | 0.722 | 0.622 |
PredNet KITTI + Moments 1h | 0.470 | 0.649 | 0.778 | 0.611 |
PredNet KITTI + Moments 1.25h* | 0.513 | 0.668 | 0.768 | 0.595 |
PredNet KITTI + Moments 3h | 0.583 | 0.676 | 0.778 | 0.500 |
PredNet KITTI + Moments 67h | 0.592 | 0.666 | 0.744 | 0.533 |
\* In this run, we let the model "see" the held-out labelled data.
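
One way to read the loss columns, assuming the reported loss is a mean binary cross-entropy in nats (the table does not state this), is to compare it with the loss of a classifier that always assigns probability 0.5 to each class. The sketch below computes that chance-level value; `binary_cross_entropy` is a helper defined here, not a function from this repository.

```python
import math


def binary_cross_entropy(p_true):
    """Cross-entropy in nats when the true class receives probability p_true."""
    return -math.log(p_true)


# A classifier that always outputs 0.5 gives the chance-level loss ln(2).
chance_loss = binary_cross_entropy(0.5)
print(f"chance-level loss: {chance_loss:.3f}")
```

Under that assumption, losses close to this value, such as those of PredNet random (0.690 and 0.694), would indicate predictions barely better than chance.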
- Assess the performance of model variants on an out-of-domain task
- Compare with baselines from the literature (focus on unsupervised approaches)
Model | UCF-101 RGB (%) | Pre-training dataset | Pre-training size (frames) |
---|---|---|---|
CNN tuple verification [1] | 50.2 | UCF-101 | 2.7M |
ConvNet + LSTM | xx.x | - | 0 |
PredNet Video random | 1.64 | - | 0 |
PredNet Video 67h | 51.9 | Moments in Time | 2.4M |
PredNet Audio random | 22.7 | - | 0 |
PredNet Audio 37h | 24.8 | Moments in Time | 2.4M |
[1] Misra, I., Zitnick, C. L., & Hebert, M. (2016, October). Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision (pp. 527-544). Springer, Cham.
- Check if our unsupervised learning approach can learn useful representations from audio
- Check if the audio modality provides complementary information
Features + Classifier | 2-class easy | 2-class hard | 10-class |
---|---|---|---|
PredNet Video random + SVM | 67.6 | 62.6 | 30.1 |
PredNet Video 66.6h + SVM | 74.2 | 65.1 | 41.4 |
PredNet Audio random + SVM | 63.6 | 56.8 | 30.3 |
PredNet Audio 2h + SVM | 66.9 | 56.8 | 29.1 |
PredNet Audio 37h + SVM | 67.8 | 58.3 | 30.0 |