must watch: bit.ly/1pEOuYV
Bengio google tech talk video
read: sparse matrix factorisation
http://nuit-blanche.blogspot.fr/2014/01/sparse-matrix-factorization-simple.html
Tinder chatbot
==============
dataset: JRo / tinder subreddit
pretraining dataset: reddit + word2vec?
model: RNN word by word?
pretraining algo: RNN SGD
finetuning algo: RNN SGD
online learning: many fake tinder accounts
hackathon product: twitch for tinder? encourage dialogue act annotation?
1. re-ranking response model via information retrieval (bayesian sets?); see the rough sketch at the end of this section
2. re-rank JRo responses given context, using LSTM.
* initialise with big word embedding (instead of one-hot in 1st layer)
* preprocess to ensure no out of pretraining domain tokens/words
* finetune LSTM on JRo dialogues / tinder subReddit dialogues
* potential answers are (a subset of) the JRo / tinder subReddit responses
3. full generative response models
* is open source code from relevant papers available?
* how does transfer learning fit in?
* is available tinder data likely sufficient?
* do we want to privilege research rather than hackathon results?
or is there a way to split roles across team
* if satisfactory answers to all of the above, Tinder is interesting for improving
on recent leading papers, by having more context than twitter (and
possibly movie subtitles?): "longer convos, task specific, user profiles".
* extended ideas:
* incorporate task completion metric in learning objective
* user profile info as context, for diverse dialogue
* get more diversity in answers via dialogue act conditioning
{'prudent', 'aggressive'} x {'answer', 'ask'} x {'same topic',
'different topic', 'date topic'}
* learn dialogue act strategy with RL
* issues to expect:
* insuff
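
A minimal sketch of the retrieval-style re-ranking in ideas 1-2, assuming a pretrained word-embedding lookup (word2vec-style dict, here a random stand-in) is already available: context and candidate responses are encoded as mean word vectors and candidates are ranked by cosine similarity. The finetuned LSTM encoder of idea 2 would simply replace the mean-pooling encoder here.

    import numpy as np

    def mean_embedding(tokens, embed, dim=300):
        """Average the word vectors of in-vocabulary tokens (zeros if none are known)."""
        vecs = [embed[t] for t in tokens if t in embed]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def rank_responses(context_tokens, candidates, embed):
        """Rank candidate token lists by cosine similarity of mean embeddings to the context."""
        c = mean_embedding(context_tokens, embed)

        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

        scores = [cosine(c, mean_embedding(r, embed)) for r in candidates]
        return sorted(zip(scores, candidates), reverse=True)

    # toy usage with a random embedding table standing in for word2vec
    rng = np.random.default_rng(0)
    embed = {w: rng.normal(size=300) for w in "hi how are you doing tonight pizza".split()}
    print(rank_responses("hi how are you".split(),
                         ["doing fine you".split(), "pizza tonight".split()],
                         embed)[0])
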
High level rule learning
========================
motivation: a human can be taught a class from few examples because the teacher
& learner can discuss which features of an example generalise and which don't.
eg back-of-car mistakes: if a human can spot what the discriminative features
could be (red lights vs white lights), and these features are likely
to be learned in the model,
why are they not picked up? how many images are generally required
for them to be picked up? how can we speed this up?
is it correct to visualise this as: find the hyperplane that separates
along the feature, pinned down by as few images (support vectors) as
possible? the angle span of candidate hyperplanes could be halved
with each new example. maybe we don't need to do this for all dims because
the data lies on a low-dim manifold.
then, once the hyperplane is pinned down, impose it onto the
classifier somehow (reinforce it during training?)
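
A rough illustration of the "hyperplane pinned down by a few support vectors" picture, assuming the discriminative feature is already (close to) linear in some representation: scikit-learn's linear SVM on toy 2-D data, reporting how few points actually constrain the boundary. Data and parameters are made up.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # toy 2-D data: dim 0 is the discriminative feature (red vs white lights), dim 1 is noise
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] > 0).astype(int)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print("separating normal:", clf.coef_[0])          # should point mostly along dim 0
    print("num support vectors:", len(clf.support_))   # only a handful pin down the hyperplane
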
ImageNet of NLP - sufficiently meaningful labels
================================================
ImageNet labels allow a model to cast out semantically meaningless
patterns and learn invariances that matter.
Need to find out what patterns word-embedding models miss, evaluate
whether the best way to overcome this is via specific labels, then how to get
those labels (industry? MechTurk? multi-modal, e.g. film?)
How approximation of error surface works
========================================
How does the training set provide an approximation of the true error surface? Since it can introduce dangerous minima, does that mean that, formally, it does not always provide an extrema-conserving approximation of the true error surface? Theoretically, what are the conditions for obtaining an extrema-conserving approximation (i.e. that doesn't introduce fake minima)? Practically, can we perform transformations on the cost function, or do stuff to our data (sampling), to limit the introduction of fake minima?
Could we answer these questions if, instead of facing the usual problem of having a training set sampled from an unknown distribution, we ran a completely artificial experiment where we start with a known distribution? That way, we know the true error surface, and as we draw samples from the distribution, we can look at how each one approximates the true error surface, how it introduces bad minima?
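
A minimal version of the artificial experiment: a known 1-D Gaussian data distribution and a one-parameter linear model, so the true risk has a closed form and can be compared against the empirical risk surface for growing sample sizes. This convex case introduces no fake minima, so it only demonstrates the machinery; the interesting variant swaps in a small nonconvex model.

    import numpy as np

    rng = np.random.default_rng(0)
    w_true, sigma = 2.0, 0.5
    ws = np.linspace(0.0, 4.0, 81)

    # true risk of y_hat = w*x under x ~ N(0,1), y = w_true*x + N(0, sigma^2):
    # R(w) = (w - w_true)^2 + sigma^2   (using E[x^2] = 1)
    true_risk = (ws - w_true) ** 2 + sigma ** 2

    for n in (5, 50, 5000):
        x = rng.normal(size=n)
        y = w_true * x + sigma * rng.normal(size=n)
        emp_risk = np.array([np.mean((w * x - y) ** 2) for w in ws])
        print(n, "samples | argmin gap:", abs(ws[np.argmin(emp_risk)] - w_true),
              "| max surface gap:", np.max(np.abs(emp_risk - true_risk)))
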
Nonconvex optimisation: visualisations and priors for the error surface
========================================================================
Error surface for eg convnets in compVision is nonConvex and maybe not even very smooth
so global optimisation with eg random jumps is currently worthless
but if we knew the probability distribution of error surfaces in image classif, we could
exploit it for faster training and perhaps even global optimisation?
How do we find good priors for the error surface?
Well, note it depends just as much on the model as the data.
how do we know that error surface properties of a 2 layer 500k param NN carry over to
that of a 12 layer 100M param NN?
Apparently saddle points are central to the optimisation prob
arxiv.org/pdf/1406.2572v1.pdf
"a deeper understanding of the statistical properties of high dimensional error surfaces
will guide the design of novel non-convex optimization algorithms"
-> Razvan Pascanu is in London at DeepMind
idea 1: know the manifold, derive error surface analytically
-> well studied case eg face pose?
-> artificial dataset?
idea 2: brute force sample many points
-> MNIST
idea 3: look for usual analytical properties looked for in optimisation literature
-> in real life datasets
-> which can lead to useful theorems
-> some properties: http://www.mit.edu/~9.520/spring08/Classes/optlecture.pdf
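
A sketch of the brute-force sampling in idea 2, on a tiny random-data stand-in rather than MNIST: evaluate the loss of a one-hidden-layer net along a random line in parameter space and count local minima along it. All sizes here are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))                       # stand-in data (swap in MNIST)
    y = (X[:, 0] * X[:, 1] > 0).astype(float)
    d, n_hidden = X.shape[1], 16
    n_params = d * n_hidden + n_hidden + 1

    def loss(params):
        """Squared error of a 1-hidden-layer tanh net, parameters unpacked from a flat vector."""
        W1 = params[:d * n_hidden].reshape(d, n_hidden)
        w2 = params[d * n_hidden:d * n_hidden + n_hidden]
        b = params[-1]
        pred = np.tanh(X @ W1) @ w2 + b
        return np.mean((pred - y) ** 2)

    # sample the surface along a random line through a random base point
    base = rng.normal(size=n_params)
    direction = rng.normal(size=n_params)
    direction /= np.linalg.norm(direction)
    ts = np.linspace(-3, 3, 61)
    line = np.array([loss(base + t * direction) for t in ts])
    n_local_min = int(np.sum((line[1:-1] < line[:-2]) & (line[1:-1] < line[2:])))
    print("loss along line: min %.3f max %.3f local minima %d"
          % (line.min(), line.max(), n_local_min))
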
Make imagenet penultimate layer much smaller!
=============================================
find papers that experiment with this
must be little loss in accuracy when removing lots of dims
will significantly reduce computation
Improve on best imagenet model as initial image embedder
========================================================
there might be some dims that don't matter as much for our labelling tasks
Natural manifolds are discontinuous in pixel space
==================================================
Not sure whether this is already a known and accepted fact; assuming it isn't.
We're always talking about how manifolds are non-linear.
But in pixel space, aren't they even discontinuous? Change the viewpoint of a
scene marginally and you end up completely elsewhere in pixel space.
Have we ever tried visualising this stuff? Take a smooth video of a scene,
varying viewpoint slightly, plot each frame in pixel space, visualise dim
reduction of set that is spanned.
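
No real video to hand, so a stand-in sketch assuming a synthetic "video" of a bright square translating one pixel per frame: the distance between consecutive frames in pixel space is already a large fraction of the frame norm, and a 2-D PCA of the frame set (via SVD) shows the curve the trajectory traces.

    import numpy as np

    # synthetic "video": an 8x8 bright square sliding one pixel per frame on a 64x64 canvas
    frames = []
    for t in range(40):
        img = np.zeros((64, 64))
        img[28:36, t:t + 8] = 1.0
        frames.append(img.ravel())
    frames = np.array(frames)

    # how far does a one-pixel viewpoint change move you in pixel space?
    step = np.linalg.norm(frames[1:] - frames[:-1], axis=1)
    print("mean consecutive-frame distance:", step.mean(),
          "vs single-frame norm:", np.linalg.norm(frames[0]))

    # 2-D linear projection (PCA via SVD) of the whole frame set
    centred = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:2].T
    print("first few projected frames:\n", proj[:5])
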
Finding mis-labelled data
=========================
as information retrieval: find images most similar to these ones but of
different label.
as labelling tool: paste product idea email.
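
A small sketch of the information-retrieval view, assuming images are already embedded as feature vectors (e.g. penultimate-layer activations of a trained net): for each item, look at its nearest neighbours and flag items whose closest neighbours mostly carry a different label. Names and thresholds are illustrative.

    import numpy as np

    def suspicious_labels(features, labels, k=5):
        """Flag items whose k nearest neighbours (excluding self) mostly disagree with their label."""
        d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        flags = []
        for i in range(len(labels)):
            nn = np.argsort(d[i])[:k]
            disagree = np.mean(labels[nn] != labels[i])
            if disagree > 0.5:
                flags.append((i, disagree))
        return flags

    # toy usage: two feature clusters, one point deliberately mislabelled
    rng = np.random.default_rng(0)
    feats = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
    labs = np.array([0] * 50 + [1] * 50)
    labs[3] = 1                                   # plant a mislabelled example
    print(suspicious_labels(feats, labs))         # should flag index 3
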
Midway Feedback
===============
If you have additional labels that could help with discrimination, train an off-shoot
classifier on a midway layer that feeds off a subset of the output units.
Can do this with the CP dataset using bounding boxes, fitting type, and eg in the case
of scraping, hatch markings.
For the scrape case, detecting hatch markings early on using
a strict subset of the units might be a way of telling the model "if you can't see
hatch markings around the fitting, use alternative means to determine scrape." it would
be even better to enforce a kind of if-statement in the architecture for this.
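
A rough PyTorch sketch of the off-shoot classifier, assuming the extra labels (e.g. hatch markings present / absent) are available per image: the auxiliary head reads only a strict subset of the mid-layer channels and its loss is added to the main loss. Architecture, sizes and the 0.3 weight are made up; the hard if-statement version would need something beyond this soft multi-task setup.

    import torch
    import torch.nn as nn

    class AuxHeadNet(nn.Module):
        """Main classifier plus an off-shoot head fed by a subset of midway channels."""
        def __init__(self, n_classes=4, n_aux_classes=2, aux_channels=16):
            super().__init__()
            self.lower = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.upper = nn.Sequential(
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))
            self.aux_channels = aux_channels
            self.aux_head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(aux_channels, n_aux_classes))

        def forward(self, x):
            mid = self.lower(x)                                       # midway features
            return self.upper(mid), self.aux_head(mid[:, :self.aux_channels])

    net = AuxHeadNet()
    x = torch.randn(8, 3, 64, 64)
    y_main = torch.randint(0, 4, (8,))
    y_aux = torch.randint(0, 2, (8,))             # e.g. hatch markings present / absent
    main_logits, aux_logits = net(x)
    ce = nn.CrossEntropyLoss()
    loss = ce(main_logits, y_main) + 0.3 * ce(aux_logits, y_aux)      # joint objective
    loss.backward()
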
Understanding Deep Models
=========================
Nearest neighbour clustering of inputs at any hidden layer and manual inspection of
clusters to see whether they have semantic similarities.
Stochastic Dropout
==================
Should we be dropping units randomly? Or should we drop with a
probability based on how inactive they have been when training on the
given class?
Idea is to get parts of the network to specialise in subtasks.
At test time, will the net know which guys to switch off? Since
dropout is at fc layers only, we could help it use info from highest
conv layer?
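
A numpy sketch of the idea, assuming we keep a running estimate of each unit's mean activity for the current class: units that have been inactive on that class are dropped with higher probability, so parts of the network can specialise on subtasks. The rate schedule is invented, purely to illustrate the mechanism.

    import numpy as np

    rng = np.random.default_rng(0)

    def activity_dropout_mask(mean_activity, base_rate=0.5):
        """Drop probability shrinks for units that are usually active on this class.

        mean_activity: running mean activation of each unit for the current class.
        A fully active unit is dropped with prob base_rate/2, a fully inactive one
        with prob up to 1.5*base_rate (clipped to [0, 1]).
        """
        a = mean_activity / (mean_activity.max() + 1e-8)       # rescale to [0, 1]
        p_drop = np.clip(base_rate * (1.5 - a), 0.0, 1.0)
        keep = rng.random(a.shape) >= p_drop
        return keep / np.maximum(1.0 - p_drop, 1e-8)           # inverted-dropout scaling

    # usage on one fc layer's activations for a batch drawn from a single class
    h = rng.random((32, 100))                     # activations: batch of 32, 100 units
    running_mean = h.mean(axis=0)                 # stand-in for the per-class running average
    h_dropped = h * activity_dropout_mask(running_mean)
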
Incorporate more domain knowledge into network architecture
===========================================================
Need to copy my handwritten notes.
Early stopping
==============
For early stopping, it is assumed that the net learns patterns that generalise before those that don't.
Seeing as early stopping works well, I guess it means that this is true. But why does it work like that? What is the mechanism? Are there any papers about this?
The reason for why I wonder is because: what if this is true only to a certain extent? Perhaps the way it works, very crudely speaking, is that the next pattern to be learned is the biggest or simplest (BorS) one. While the next-BorS one generalises, we're good. As soon as the next-BorS one does not, and is merely local to the sample, we get overfit, and that worsens performance. So maybe we miss out on all of the smaller generalising patterns.
There is evidence for this intuition: a bigger dataset generalises
better. A bigger dataset also has sample-specific patterns which are
smaller or more complex. Maybe the latter causes the former, by pushing
down the net's threshold for the minimum size / maximum complexity of
generalising pattern it can pick up.
Why does the network always pick up the next biggest or simplest
pattern? Because of gradient descent: pick the steepest slope. Would
be fantastic to try to show that the two are closely linked, and then
to identify the mechanism that links them.
Even if all this is true, so what? How can we use this knowledge to
improve learning algos?
Training layer by layer, adding an extra only if need be
========================================================
Oh that actually already exists!
Is learning a convnet component in isolation a convex prob?
============================================================
From the ILSVRC 2013 tutorial (cf. the Microsoft Research deep learning
textbook) and Boureau et al's paper "A Theoretical Analysis of Feature
Pooling in Visual Recognition", it seems that convnets are learning
components that computer vision people were formerly hand-crafting and
de facto manually optimising.
Might be interesting to study each component (pixel feature, spatial
feature, normalisation, bias), and see if possible to analytically
determine whether the optimisation problem for each, taken
independently, is convex.
Detect tight bowls and plateaus and deal with them ANALYTICALLY
===============================================================
Unsupervised Learning: remove variation known to not belong to classes
======================================================================
Problem with unsupervised learning is that the 'best' clusters do not
correspond to the classes of interest, e.g. colour or brightness
clusters for image classification.
So why not remove this variation in the data?
There are 3 ways of doing this:
- modify training data (eg subtract the mean)
- embed in model (eg translation equivariance with convolution)
- data augmentation
The latter seems feasible: crops, brightness, saturation, hue etc.
modifications. But this is an unsupervised task, so how can we tell
the model that the class is not modified?
Hey, maybe we could have a first supervised learning stage,
with one class per {case U augmentations}. Then we do unsupervised
learning in the feature space obtained from the augmentations.
Hopefully the features will get rid of all the class-invariant
patterns.
We wouldn't need to do the first stage with all training data.
That wouldn't be a good idea because top layer would be too big.
We should select one case per class. (Not 2 cases from same class,
because don't want to create a separate class for each, that would
really confuse the net).
The biological motivation for this is that we learn transformation
invariances by watching a single object move around space
continuously. Because we see video and can segment, we know it's the
same object, so we learn to become invariant to all the
transformations that occur on it as it moves around.
As an experiment, do just that: take 24 frames per second, recognise
the moving person, watch it for a certain amount of time (ie take
maybe 100,000s of images of it) and then do the same when that person
is not present, and learn the supervised binary classifier task.
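
A sketch of the first supervised stage, assuming one seed image per class and a few cheap label-preserving augmentations: each {case U augmentations} set becomes one pseudo-class (exemplar-style pretraining), and a later unsupervised step would then cluster in the feature space this classifier learns. The augmentations and sizes are toy stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(img):
        """Cheap label-preserving jitter: horizontal flip, brightness, small translation."""
        out = img.copy()
        if rng.random() < 0.5:
            out = out[:, ::-1]                                    # horizontal flip
        out = np.clip(out * rng.uniform(0.7, 1.3), 0.0, 1.0)      # brightness jitter
        out = np.roll(out, rng.integers(-2, 3), axis=1)           # small shift (crop stand-in)
        return out

    def build_pseudoclass_set(seed_images, n_aug=50):
        """One pseudo-class per seed image: the seed plus its augmentations share a label."""
        xs, ys = [], []
        for label, img in enumerate(seed_images):
            for _ in range(n_aug):
                xs.append(augment(img))
                ys.append(label)
        return np.array(xs), np.array(ys)

    seeds = [rng.random((32, 32)) for _ in range(10)]    # one case per class, as argued above
    X, y = build_pseudoclass_set(seeds)
    print(X.shape, y.shape)                              # (500, 32, 32) (500,)
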
Distributed Representations are a great prior for data in nature
=================================================================
Y. Bengio:
"Distributed because of compositionality, which results from
hierarchies of layers.
Compositionality of the parameters, i.e., the same parameters can
be re-used in many contexts, so O(N) parameters can allow to
distinguish O(2^N) regions in input space, whereas with nearest-
neighbor-like things, you need O(N) parameters (i.e. O(N) examples)
to characterize a function that can distinguish between O(N) regions.
This "gain" is only for some target functions, or more generally,
we can think of it like a prior. If the prior is applicable to our
target distribution, we can gain a lot. As usual in ML, there is no
real free lunch. The good news is that this prior is very broad and
seems to cover most of the things that humans learn about. What it
basically says is that the data we observe are explained by a bunch
of underlying factors, and that you can learn about each of these
factors WITHOUT requiring to see all of the configurations of the
other factors. This is how you are able to generalize to new
configurations and get this exponential statistical gain."
The learned discriminative frontier in the input space is assumed to
have lots of "symmetries", so once a few training examples in one
region give it the right shape, a "similar" shape is correctly
reproduced in another region where no training examples exist?
IDEA: for a specific dataset and architecture:
1) analytically determine (two) regions in input space where shape
of discriminative frontier is governed by same parameters
2) deliberately remove all training cases from one region
3) see whether test cases in this region are correctly classified?
background papers:
- 'symmetries': ICLR paper with Pascanu et al on theoretical
analysis of number of regions.
- non-local estimation of manifold structure: bit.ly/1pELZG5
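
A quick empirical companion to the region-counting analysis cited above: sample a 2-D input grid, push it through a random one-hidden-layer ReLU net, and count the distinct activation patterns (linear regions the grid lands in). Even a single layer carves out far more regions than it has units; the cited papers analyse how depth compounds this.

    import numpy as np

    rng = np.random.default_rng(0)

    def count_regions_hit(n_hidden, grid=200):
        """Count distinct ReLU on/off patterns over a 2-D input grid for one random layer."""
        W = rng.normal(size=(2, n_hidden))
        b = rng.normal(size=n_hidden)
        xs = np.linspace(-3, 3, grid)
        X = np.array(np.meshgrid(xs, xs)).reshape(2, -1).T     # all grid points, (grid^2, 2)
        patterns = (X @ W + b > 0)                             # which units fire where
        return len(np.unique(patterns, axis=0))

    for n in (4, 8, 16, 32):
        print(n, "hidden units ->", count_regions_hit(n), "regions hit by the grid")
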
Sparsity vs Distributed vs Dropout; visualise training inside
=============================================================
We're told sparsity is great and distributed representations are
great and dropout is great. Yet sparsity and dropout reduce the
extent of distributiveness. What's really going on? Why is 0.5
dropout optimal? Is there an optimum between sparsity and density
that 0.5 dropout corresponds to? Can we develop a conceptual framework
that leads to an analytical solution? Or can we train nets more or
less well and visualise the activities in the entire network to
get a detailed and clear understanding of what's going on?
Better error surface at each layer with analytical bias update
==============================================================
the error surface should be nice and bowl-like
this is achieved when the incoming data (or weights?) is symmetric and zero-mean
this is done by demeaning the data, but that only applies to 1st layer neurons
what about subsequent layers? the data can be demeaned again with the biases
so why not update the biases with the mean of the projected data?
maybe this mean can be computed dynamically?
---
data should also have unit std dev, find out why
can something similar be done for that too?
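
A numpy sketch of the bias-update note above: after projecting the data through a layer's weights, nudge the biases toward the value that makes the pre-activations zero-mean, with a momentum term as the dynamically computed version. Just an illustration of the arithmetic, not a claim about how it interacts with gradient training.

    import numpy as np

    rng = np.random.default_rng(0)

    def demeaning_bias_update(X, W, b, momentum=0.9):
        """Nudge biases toward the value that zero-means this layer's pre-activations.

        X: (batch, d_in) layer input, W: (d_in, d_out), b: (d_out,).
        Returns the updated biases and the re-centred pre-activations.
        """
        z = X @ W + b
        target_b = b - z.mean(axis=0)                        # bias that exactly demeans this batch
        b_new = momentum * b + (1.0 - momentum) * target_b   # running / dynamic version
        return b_new, X @ W + b_new

    X = rng.normal(loc=3.0, size=(128, 20))            # deliberately off-centre layer input
    W = rng.normal(size=(20, 50))
    b = np.zeros(50)
    for _ in range(20):                                # repeated batches pull the mean toward zero
        b, z = demeaning_bias_update(X, W, b)
    print("mean |pre-activation mean| after updates:", np.abs(z.mean(axis=0)).mean())
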
Yann LeCun convnet global optimisation
======================================
https://www.youtube.com/watch?v=zPVHH7ZJi9Q&t=1h12m55s
Andrew Davison talk at royal society
====================================
1. learning from remote-controlled Stanford video?
-> highly supervised and rich labels, literally says which
2. prior knowledge of what's in the scene rather than brute-force reconstruction
-> instead of being super granular at the object level, how about more generalisable knowledge, such as: surfaces of human-constructed objects tend to be composed of planes or curves, so linear or quadratic equations, so not many params to estimate?
3. learning optimal features for SLAM?