Classification experiments

This folder contains classification models that are trained on learned representations generated by PredNet and other pre-trained models.

Relevant files

convnet_extract.py: script to extract VGG features from video frames.
models.py: definition of the computation graph of the neural network classifiers.
train.py: script to train neural network action classifiers.

Example: train an LSTM classifier to solve the 10-class task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.

> python train.py prednet_kitti_finetuned_moments_full --task 10c -m lstm

train_linear.py: script to train linear (SVM) action classifiers.

Example: train a linear SVM classifier to solve the "binary temporal" task using features generated by a PredNet trained on 67h of data from the Moments in Time dataset.

> python train_linear.py prednet_kitti_finetuned_moments_full --task 2c_hard

data.py: input generator implementation that can handle sequences of images or pickle files.
settings.py: parameters for each experiment.
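
The two example commands above share one shape: a positional experiment name, a --task option and, for train.py, a -m option selecting the model. The sketch below shows one way such a command line could be parsed with argparse. It is an illustration only: the argument names config, task and model, the long option --model, and the defaults are assumptions inferred from the examples, not the actual interface of train.py or train_linear.py.

```python
import argparse


def build_parser():
    """Build a parser shaped like the example commands (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Train an action classifier (sketch).")
    # Positional experiment name, e.g. prednet_kitti_finetuned_moments_full.
    parser.add_argument("config", help="experiment name (assumed to map to an entry in settings.py)")
    # Task identifiers seen in the examples are 10c and 2c_hard. No choices= list
    # is enforced here because the full set of valid tasks is not documented.
    parser.add_argument("--task", default="10c", help="classification task, e.g. 10c or 2c_hard")
    # Only lstm appears in the examples; the default is an assumption.
    parser.add_argument("-m", "--model", default="lstm", help="classifier architecture, e.g. lstm")
    return parser


# Parse the first example command from an explicit argument list.
args = build_parser().parse_args(
    ["prednet_kitti_finetuned_moments_full", "--task", "10c", "-m", "lstm"]
)
print(args.config, args.task, args.model)
```

Passing an explicit list to parse_args keeps the sketch runnable in a notebook or test; a real script would call parse_args() with no arguments to read the command line.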

Action recognition with small datasets

Objectives:

Investigate the limitations of ConvNet representations on video understanding tasks
Test the influence of the amount of unsupervised training data on the supervised task performance

Tasks:

Easy (2-class spatial task): "cooking" vs. "walking"
Hard (2-class temporal task): "running" vs. "walking"

Results: Test set (accuracy, %)

Features + Classifier	2-class easy	2-class hard	5-class	10-class
VGG random + SVM	67.0	56.0	27.6	18.7
VGG ImageNet + SVM	85.5	67.0	44.6	52.8
VGG ImageNet + LSTM	87.4	58.4	54.9	43.2
PredNet random + SVM	67.6	62.6	37.2	30.1
PredNet KITTI + SVM	73.2	70.7	50.7	39.8
PredNet Moments 3h + SVM	73.2	66.1	49.5	39.5
PredNet Moments 67h + SVM	74.2	65.1	50.9	41.4
PredNet Moments 67h + LSTM	78.6	55.8	50.1	42.9

Insights

VGG features pre-trained on ImageNet work very well when spatial information determines the action class (85.5 on the 2-class easy task with an SVM). However, they fall short of capturing the fine-grained temporal patterns needed to distinguish running from walking.
PredNet random features perform better than VGG random features, notably in the 2-class temporal task (62.6 vs. 56.0). This suggests that PredNet has better inductive biases for capturing temporal patterns.
As we add more data, PredNet features improve action classification performance. However, as more out-of-domain data is added (unrelated classes), performance drops in the 2-class temporal task (from 70.7 for PredNet KITTI to 65.1 for PredNet Moments 67h). Still, the results are competitive with ImageNet-derived features.
Using an LSTM instead of an SVM improves the results in the 2-class spatial task and worsens them in the 2-class temporal task. The reason is not clear.
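
The comparisons behind these insights can be recomputed from the test-set table. The snippet below copies the relevant rows and prints differences in accuracy points. The dictionary layout, the short task keys and the helper name delta are illustrative choices, and the values are assumed to be accuracies in percent.

```python
# Test-set values copied from the table above, in column order:
# 2-class easy, 2-class hard, 5-class, 10-class.
TASKS = ["2c_easy", "2c_hard", "5c", "10c"]
RESULTS = {
    "VGG random + SVM": [67.0, 56.0, 27.6, 18.7],
    "VGG ImageNet + SVM": [85.5, 67.0, 44.6, 52.8],
    "VGG ImageNet + LSTM": [87.4, 58.4, 54.9, 43.2],
    "PredNet random + SVM": [67.6, 62.6, 37.2, 30.1],
    "PredNet Moments 67h + SVM": [74.2, 65.1, 50.9, 41.4],
    "PredNet Moments 67h + LSTM": [78.6, 55.8, 50.1, 42.9],
}


def delta(row_a, row_b):
    """Return row_a minus row_b for each task, rounded to one decimal."""
    return {
        task: round(a - b, 1)
        for task, a, b in zip(TASKS, RESULTS[row_a], RESULTS[row_b])
    }


# Random-weight features: PredNet vs. VGG.
print(delta("PredNet random + SVM", "VGG random + SVM"))
# Effect of replacing the SVM with an LSTM, for each feature extractor.
print(delta("VGG ImageNet + LSTM", "VGG ImageNet + SVM"))
print(delta("PredNet Moments 67h + LSTM", "PredNet Moments 67h + SVM"))
```

A positive value means the first row scores higher. On the 2-class tasks, the LSTM rows gain on the easy task and lose on the hard task for both feature extractors, which is the pattern the last insight describes.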

Results: Validation set

Model	Easy (loss)	Hard (loss)	Easy (acc)	Hard (acc)
VGG ImageNet	0.274	0.688	0.867	0.578
VGG ImageNet LSTM	0.224	0.634	0.933	0.689
PredNet random	0.690	0.694	0.544	0.556
PredNet KITTI	0.582	0.685	0.722	0.622
PredNet KITTI + Moments 1h	0.470	0.649	0.778	0.611
PredNet KITTI + Moments 1.25h*	0.513	0.668	0.768	0.595
PredNet KITTI + Moments 3h	0.583	0.676	0.778	0.500
PredNet KITTI + Moments 67h	0.592	0.666	0.744	0.533

* In this run we let the model "see" the held-out labelled data
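
When reading the loss columns, it helps to know what an uninformed classifier would score. Assuming the reported loss is the mean binary cross-entropy with natural logarithm on a balanced two-class problem (the README does not state this), a model that always predicts probability 0.5 gets ln 2, about 0.693. The sketch below computes that reference value and compares it with two rows copied from the table; the helper name is hypothetical.

```python
import math


def binary_cross_entropy(p_true_class):
    """Cross-entropy (natural log) for one example, given the probability assigned to its true class."""
    return -math.log(p_true_class)


# A classifier that always outputs 0.5 assigns probability 0.5 to the true class.
chance_loss = binary_cross_entropy(0.5)
print(f"chance-level loss: {chance_loss:.3f}")

# Validation losses copied from the table above, as (Easy, Hard).
rows = {
    "PredNet random": (0.690, 0.694),
    "VGG ImageNet LSTM": (0.224, 0.634),
}
for name, losses in rows.items():
    # Positive margin: the model's loss is below the chance-level reference.
    margins = [round(chance_loss - loss, 3) for loss in losses]
    print(name, "margin below chance:", margins)
```

Under that assumption, the PredNet random losses (0.690 and 0.694) sit essentially at the chance level, and most of the Hard-task losses are only slightly below it.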

Out-of-domain action recognition: UCF-101

Objectives:

Assess the performance of model variants on an out-of-domain task
Compare with baselines from the literature (focus on unsupervised approaches)

Model	UCF-101 RGB (%)	Pre-training dataset	Pre-training size (frames)
CNN tuple verification [1]	50.2	UCF-101	2.7M
ConvNet + LSTM	xx.x	-	0
PredNet Video random	1.64	-	0
PredNet Video 67h	51.9	Moments in Time	2.4M
PredNet Video random	22.7	-	0
PredNet Audio 37h	24.8	Moments in Time	2.4M

[1] Misra, I., Zitnick, C. L., & Hebert, M. (2016, October). Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision (pp. 527-544). Springer, Cham.
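
To put the UCF-101 numbers in context, it is useful to compare them with uniform random guessing. UCF-101 has 101 action classes, so chance accuracy is about 0.99% if classes are assumed to be balanced (an assumption made here for simplicity). The sketch below expresses a few rows of the table as multiples of that baseline; the function name is hypothetical and only rows with unambiguous labels are included.

```python
def chance_accuracy(num_classes):
    """Expected accuracy (%) of uniform random guessing, assuming balanced classes."""
    return 100.0 / num_classes


ucf101_chance = chance_accuracy(101)
print(f"UCF-101 chance level: {ucf101_chance:.2f}%")

# Accuracies (%) copied from the table above.
rows = {
    "CNN tuple verification [1]": 50.2,
    "PredNet Video 67h": 51.9,
    "PredNet Audio 37h": 24.8,
}
for name, acc in rows.items():
    print(f"{name}: {acc:.1f}% ({acc / ucf101_chance:.1f}x chance)")
```

The same helper gives 50%, 20% and 10% for the 2-class, 5-class and 10-class tasks in the other tables, under the same balanced-class assumption.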

Exploring the audio modality

Objectives:

Check if our unsupervised learning approach can learn useful representations from audio
Check if the audio modality provides complementary information

Results: Test set (accuracy, %)

Features + Classifier	2-class easy	2-class hard	10-class
PredNet Video random + SVM	67.6	62.6	30.1
PredNet Video 66.6h + SVM	74.2	65.1	41.4
PredNet Audio random + SVM	63.6	56.8	30.3
PredNet Audio 2h + SVM	66.9	56.8	29.1
PredNet Audio 37h + SVM	67.8	58.3	30.0
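
One way to read this table against the first objective is to measure how much unsupervised pre-training adds over random weights in each modality. The sketch below copies the random rows and the longest-trained rows (Video 66.6h, Audio 37h) and prints the gain in accuracy points. The data layout and the function name pretraining_gain are illustrative, and the values are assumed to be accuracies in percent.

```python
# Values copied from the table above, in column order.
TASKS = ["2-class easy", "2-class hard", "10-class"]
RANDOM = {
    "video": [67.6, 62.6, 30.1],  # PredNet Video random + SVM
    "audio": [63.6, 56.8, 30.3],  # PredNet Audio random + SVM
}
PRETRAINED = {
    "video": [74.2, 65.1, 41.4],  # PredNet Video 66.6h + SVM
    "audio": [67.8, 58.3, 30.0],  # PredNet Audio 37h + SVM
}


def pretraining_gain(modality):
    """Return pre-trained minus random accuracy per task, rounded to one decimal."""
    return {
        task: round(trained - rand, 1)
        for task, trained, rand in zip(TASKS, PRETRAINED[modality], RANDOM[modality])
    }


for modality in ("video", "audio"):
    print(modality, pretraining_gain(modality))
```

In these numbers the video features gain on every task, while the audio features gain on the two 2-class tasks and stay roughly flat on the 10-class task. The two modalities were trained for different durations, so the gains are not directly comparable.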
