must watch: bit.ly/1pEOuYV
Bengio google tech talk video
read: sparse matrix factorisation
http://nuit-blanche.blogspot.fr/2014/01/sparse-matrix-factorization-simple.html
Tinder chatbot
==============
dataset: JRo / tinder subreddit
pretraining dataset: reddit + word2vec?
model: RNN word by word?
pretraining algo: RNN SGD
finetuning algo: RNN SGD
online learning: many fake tinder accounts
hackathon product: twitch for tinder? encourage dialogue act annotation?
1. re-ranking response model via information retrieval (bayesian sets?); see the rough sketch at the end of this section
2. re-rank JRo responses given context, using LSTM.
* initialise with big word embedding (instead of one-hot in 1st layer)
* preprocess to ensure no out of pretraining domain tokens/words
* finetune LSTM on JRo dialogues / tinder subReddit dialogues
* potential answers are (a subset of) the JRo / tinder subReddit responses
3. full generative response models
* is open source code from relevant papers available?
* how does transfer learning fit in?
* is available tinder data likely sufficient?
* do we want to privilege research rather than hackathon results?
or is there a way to split roles across team
* if satisfactory answers to all of the above, Tinder is interesting for improving
on recent leading papers, by having more context than twitter (and
possibly movie subtitles?): "longer convos, task specific, user profiles".
* extended ideas:
* incorporate task completion metric in learning objective
* user profile info as context, for diverse dialogue
* get more diversity in answers via dialogue act conditioning
{'prudent', 'aggressive'} x {'answer', 'ask'} x {'same topic',
'different topic', 'date topic'}
* learn dialogue act strategy with RL
* issues to expect:
* insuff
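
A minimal sketch of the retrieval-style re-ranking in ideas 1-2, assuming a pretrained word-embedding lookup (word2vec-style dict, here a random stand-in) is already available: context and candidate responses are encoded as mean word vectors and candidates are ranked by cosine similarity. The finetuned LSTM encoder of idea 2 would simply replace the mean-pooling encoder here.

    import numpy as np

    def mean_embedding(tokens, embed, dim=300):
        """Average the word vectors of in-vocabulary tokens (zeros if none are known)."""
        vecs = [embed[t] for t in tokens if t in embed]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    def rank_responses(context_tokens, candidates, embed):
        """Rank candidate token lists by cosine similarity of mean embeddings to the context."""
        c = mean_embedding(context_tokens, embed)

        def cosine(a, b):
            return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

        scores = [cosine(c, mean_embedding(r, embed)) for r in candidates]
        return sorted(zip(scores, candidates), reverse=True)

    # toy usage with a random embedding table standing in for word2vec
    rng = np.random.default_rng(0)
    embed = {w: rng.normal(size=300) for w in "hi how are you doing tonight pizza".split()}
    print(rank_responses("hi how are you".split(),
                         ["doing fine you".split(), "pizza tonight".split()],
                         embed)[0])
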
High level rule learning
========================
motivation: a human can be taught a class from few examples because the teacher
& learner can discuss which features of an example generalise and which don't.
eg back-of-car mistakes: if a human can spot what the discriminative features
could be (red lights vs white lights), and these features are likely
to be learned in the model,
why are they not picked up? how many images are generally required
for them to be picked up? how can we speed this up?
is it correct to visualise this as: find the hyperplane that separates
along the feature, pinned down by as few images (support vectors) as
possible? the angle span of candidate hyperplanes could be halved
with each new example. maybe we don't need to do this for all dims because
the data lies on a low-dim manifold.
then, once the hyperplane is pinned down, impose it onto the
classifier somehow (reinforce it during training?)
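
A rough illustration of the "hyperplane pinned down by a few support vectors" picture, assuming the discriminative feature is already (close to) linear in some representation: scikit-learn's linear SVM on toy 2-D data, reporting how few points actually constrain the boundary. Data and parameters are made up.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # toy 2-D data: dim 0 is the discriminative feature (red vs white lights), dim 1 is noise
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] > 0).astype(int)

    clf = SVC(kernel="linear", C=10.0).fit(X, y)
    print("separating normal:", clf.coef_[0])          # should point mostly along dim 0
    print("num support vectors:", len(clf.support_))   # only a handful pin down the hyperplane
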
ImageNet of NLP - sufficiently meaningful labels
================================================
ImageNet labels allow a model to cast out semantically meaningless
patterns and learn invariances that matter.
Need to find out what patterns word-embedding models miss, evaluate
whether the best way to overcome this is via specific labels, then how to get
those labels (industry? MechTurk? multi-modal, e.g. film?)
How approximation of error surface works
========================================
How does the training set provide an approximation of the true error surface? Since it can introduce dangerous minima, does that mean that, formally, it does not always provide an extrema-conserving approximation of the true error surface? Theoretically, what are the conditions for obtaining an extrema-conserving approximation (i.e. that doesn't introduce fake minima)? Practically, can we perform transformations on the cost function, or do stuff to our data (sampling), to limit the introduction of fake minima?
Could we answer these questions if, instead of facing the usual problem of having a training set sampled from an unknown distribution, we ran a completely artificial experiment where we start with a known distribution? That way, we know the true error surface, and as we draw samples from the distribution, we can look at how each one approximates the true error surface, how it introduces bad minima?
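
A minimal version of the artificial experiment: a known 1-D Gaussian data distribution and a one-parameter linear model, so the true risk has a closed form and can be compared against the empirical risk surface for growing sample sizes. This convex case introduces no fake minima, so it only demonstrates the machinery; the interesting variant swaps in a small nonconvex model.

    import numpy as np

    rng = np.random.default_rng(0)
    w_true, sigma = 2.0, 0.5
    ws = np.linspace(0.0, 4.0, 81)

    # true risk of y_hat = w*x under x ~ N(0,1), y = w_true*x + N(0, sigma^2):
    # R(w) = (w - w_true)^2 + sigma^2   (using E[x^2] = 1)
    true_risk = (ws - w_true) ** 2 + sigma ** 2

    for n in (5, 50, 5000):
        x = rng.normal(size=n)
        y = w_true * x + sigma * rng.normal(size=n)
        emp_risk = np.array([np.mean((w * x - y) ** 2) for w in ws])
        print(n, "samples | argmin gap:", abs(ws[np.argmin(emp_risk)] - w_true),
              "| max surface gap:", np.max(np.abs(emp_risk - true_risk)))
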
Nonconvex optimisation: visualisations and priors for the error surface
========================================================================
Error surface for eg convnets in compVision is nonConvex and maybe not even very smooth
so global optimisation with eg random jumps is currently worthless
but if we knew the probability distribution of error surfaces in image classif, we could
exploit it for faster training and perhaps even global optimisation?
How do we find good priors for the error surface?
Well, note it depends just as much on the model as the data.
how do we know that error surface properties of a 2 layer 500k param NN carry over to
that of a 12 layer 100M param NN?
Apparently saddle points are central to the optimisation prob
arxiv.org/pdf/1406.2572v1.pdf
"a deeper understanding of the statistical properties of high dimensional error surfaces
will guide the design of novel non-convex optimization algorithms"
-> Razvan Pascanu is in London at DeepMind
idea 1: know the manifold, derive error surface analytically
-> well studied case eg face pose?
-> artificial dataset?
idea 2: brute force sample many points
-> MNIST
idea 3: look for usual analytical properties looked for in optimisation literature
-> in real life datasets
-> which can lead to useful theorems
-> some properties: http://www.mit.edu/~9.520/spring08/Classes/optlecture.pdf
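
A sketch of the brute-force sampling in idea 2, on a tiny random-data stand-in rather than MNIST: evaluate the loss of a one-hidden-layer net along a random line in parameter space and count local minima along it. All sizes here are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))                       # stand-in data (swap in MNIST)
    y = (X[:, 0] * X[:, 1] > 0).astype(float)
    d, n_hidden = X.shape[1], 16
    n_params = d * n_hidden + n_hidden + 1

    def loss(params):
        """Squared error of a 1-hidden-layer tanh net, parameters unpacked from a flat vector."""
        W1 = params[:d * n_hidden].reshape(d, n_hidden)
        w2 = params[d * n_hidden:d * n_hidden + n_hidden]
        b = params[-1]
        pred = np.tanh(X @ W1) @ w2 + b
        return np.mean((pred - y) ** 2)

    # sample the surface along a random line through a random base point
    base = rng.normal(size=n_params)
    direction = rng.normal(size=n_params)
    direction /= np.linalg.norm(direction)
    ts = np.linspace(-3, 3, 61)
    line = np.array([loss(base + t * direction) for t in ts])
    n_local_min = int(np.sum((line[1:-1] < line[:-2]) & (line[1:-1] < line[2:])))
    print("loss along line: min %.3f max %.3f local minima %d"
          % (line.min(), line.max(), n_local_min))
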
Make imagenet penultimate layer much smaller!
=============================================
find papers that experiment with this
must be little loss in accuracy when removing lots of dims
will significantly reduce computation
Improve on best imagenet model as initial image embedder
========================================================
there might be some dims that don't matter as much for our labelling tasks
Natural manifolds are discontinuous in pixel space
==================================================
Not sure whether this is already a known and accepted fact; assuming it isn't.
We're always talking about how manifolds are non-linear.
But in pixel space, aren't they even discontinuous? Change the viewpoint of a
scene marginally and you end up completely elsewhere in pixel space.
Have we ever tried visualising this stuff? Take a smooth video of a scene,
varying viewpoint slightly, plot each frame in pixel space, visualise dim
reduction of set that is spanned.
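
No real video to hand, so a stand-in sketch assuming a synthetic "video" of a bright square translating one pixel per frame: the distance between consecutive frames in pixel space is already a large fraction of the frame norm, and a 2-D PCA of the frame set (via SVD) shows the curve the trajectory traces.

    import numpy as np

    # synthetic "video": an 8x8 bright square sliding one pixel per frame on a 64x64 canvas
    frames = []
    for t in range(40):
        img = np.zeros((64, 64))
        img[28:36, t:t + 8] = 1.0
        frames.append(img.ravel())
    frames = np.array(frames)

    # how far does a one-pixel viewpoint change move you in pixel space?
    step = np.linalg.norm(frames[1:] - frames[:-1], axis=1)
    print("mean consecutive-frame distance:", step.mean(),
          "vs single-frame norm:", np.linalg.norm(frames[0]))

    # 2-D linear projection (PCA via SVD) of the whole frame set
    centred = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    proj = centred @ vt[:2].T
    print("first few projected frames:\n", proj[:5])
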
Finding mis-labelled data
=========================
as information retrieval: find images most similar to these ones but of
different label.
as labelling tool: paste product idea email.
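
A small sketch of the information-retrieval view, assuming images are already embedded as feature vectors (e.g. penultimate-layer activations of a trained net): for each item, look at its nearest neighbours and flag items whose closest neighbours mostly carry a different label. Names and thresholds are illustrative.

    import numpy as np

    def suspicious_labels(features, labels, k=5):
        """Flag items whose k nearest neighbours (excluding self) mostly disagree with their label."""
        d = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        flags = []
        for i in range(len(labels)):
            nn = np.argsort(d[i])[:k]
            disagree = np.mean(labels[nn] != labels[i])
            if disagree > 0.5:
                flags.append((i, disagree))
        return flags

    # toy usage: two feature clusters, one point deliberately mislabelled
    rng = np.random.default_rng(0)
    feats = np.vstack([rng.normal(0, 1, (50, 8)), rng.normal(5, 1, (50, 8))])
    labs = np.array([0] * 50 + [1] * 50)
    labs[3] = 1                                   # plant a mislabelled example
    print(suspicious_labels(feats, labs))         # should flag index 3
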
Midway Feedback
===============
If you have additional labels that could help with discrimination, train an off-shoot
classifier on a midway layer that feeds off a subset of the output units.
Can do this with the CP dataset using bounding boxes, fitting type, and eg in the case
of scraping, hatch markings.
For the scrape case, detecting hatch markings early on using
a strict subset of the units might be a way of telling the model "if you can't see
hatch markings around the fitting, use alternative means to determine scrape." it would
be even better to enforce a kind of if-statement in the architecture for this.
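
A rough PyTorch sketch of the off-shoot classifier, assuming the extra labels (e.g. hatch markings present / absent) are available per image: the auxiliary head reads only a strict subset of the mid-layer channels and its loss is added to the main loss. Architecture, sizes and the 0.3 weight are made up; the hard if-statement version would need something beyond this soft multi-task setup.

    import torch
    import torch.nn as nn

    class AuxHeadNet(nn.Module):
        """Main classifier plus an off-shoot head fed by a subset of midway channels."""
        def __init__(self, n_classes=4, n_aux_classes=2, aux_channels=16):
            super().__init__()
            self.lower = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
            self.upper = nn.Sequential(
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, n_classes))
            self.aux_channels = aux_channels
            self.aux_head = nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(aux_channels, n_aux_classes))

        def forward(self, x):
            mid = self.lower(x)                                       # midway features
            return self.upper(mid), self.aux_head(mid[:, :self.aux_channels])

    net = AuxHeadNet()
    x = torch.randn(8, 3, 64, 64)
    y_main = torch.randint(0, 4, (8,))
    y_aux = torch.randint(0, 2, (8,))             # e.g. hatch markings present / absent
    main_logits, aux_logits = net(x)
    ce = nn.CrossEntropyLoss()
    loss = ce(main_logits, y_main) + 0.3 * ce(aux_logits, y_aux)      # joint objective
    loss.backward()
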
Understanding Deep Models
=========================
Nearest neighbour clustering of inputs at any hidden layer and manual inspection of
clusters to see whether they have semantic similarities.
Stochastic Dropout
==================
Should we be dropping units randomly? Or should we drop with a
probability based on how inactive they have been when training on the
given class?
Idea is to get parts of the network to specialise in subtasks.
At test time, will the net know which guys to switch off? Since
dropout is at fc layers only, we could help it use info from highest
conv layer?
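
A numpy sketch of the idea, assuming we keep a running estimate of each unit's mean activity for the current class: units that have been inactive on that class are dropped with higher probability, so parts of the network can specialise on subtasks. The rate schedule is invented, purely to illustrate the mechanism.

    import numpy as np

    rng = np.random.default_rng(0)

    def activity_dropout_mask(mean_activity, base_rate=0.5):
        """Drop probability shrinks for units that are usually active on this class.

        mean_activity: running mean activation of each unit for the current class.
        A fully active unit is dropped with prob base_rate/2, a fully inactive one
        with prob up to 1.5*base_rate (clipped to [0, 1]).
        """
        a = mean_activity / (mean_activity.max() + 1e-8)       # rescale to [0, 1]
        p_drop = np.clip(base_rate * (1.5 - a), 0.0, 1.0)
        keep = rng.random(a.shape) >= p_drop
        return keep / np.maximum(1.0 - p_drop, 1e-8)           # inverted-dropout scaling

    # usage on one fc layer's activations for a batch drawn from a single class
    h = rng.random((32, 100))                     # activations: batch of 32, 100 units
    running_mean = h.mean(axis=0)                 # stand-in for the per-class running average
    h_dropped = h * activity_dropout_mask(running_mean)
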
Incorporate more domain knowledge into network architecture
===========================================================
Need to copy my handwritten notes.
Early stopping
==============
For early stopping, it is assumed that the net learns patterns that generalise before those that don't.
Seeing as early stopping works well, I guess it means that this is true. But why does it work like that? What is the mechanism? Are there any papers about this?
The reason for why I wonder is because: what if this is true only to a certain extent? Perhaps the way it works, very crudely speaking, is that the next pattern to be learned is the biggest or simplest (BorS) one. While the next-BorS one generalises, we're good. As soon as the next-BorS one does not, and is merely local to the sample, we get overfit, and that worsens performance. So maybe we miss out on all of the smaller generalising patterns.
There is evidence for this intuition: a bigger dataset generalises
better. A bigger dataset also has sample-specific patterns which are
smaller or more complex. Maybe the latter causes the former, by pushing
down the net's threshold for the minimum size / maximum complexity of
generalising pattern it can pick up.
Why does the network always pick up the next biggest or simplest
pattern? Because of gradient descent: pick the steepest slope. Would
be fantastic to try to show that the two are closely linked, and then
to identify the mechanism that links them.
Even if all this is true, so what? How can we use this knowledge to
improve learning algos?
Training layer by layer, adding an extra only if need be
========================================================
Oh that actually already exists!
Is learning a convnet component in isolation a convex prob?
============================================================
From the ILSVRC 2013 tutorial (cf. the Microsoft Research deep learning
textbook) and Boureau et al's paper "A Theoretical Analysis of Feature
Pooling in Visual Recognition", it seems that convnets are learning
components that computer vision people were formerly hand-crafting and
de facto manually optimising.
Might be interesting to study each component (pixel feature, spatial
feature, normalisation, bias), and see if possible to analytically
determine whether the optimisation problem for each, taken
independently, is convex.
Detect tight bowls and plateaus and deal with them ANALYTICALLY
===============================================================
Unsupervised Learning: remove variation known to not belong to classes
======================================================================
Problem with unsupervised learning is that the 'best' clusters do not
correspond to the classes of interest, e.g. colour or brightness
clusters for image classification.
So why not remove this variation in the data?
There are 3 ways of doing this:
- modify training data (eg subtract the mean)
- embed in model (eg translation equivariance with convolution)
- data augmentation
The latter seems feasible: crops, brightness, saturation, hue etc.
modifications. But this is an unsupervised task, so how can we tell
the model that the class is not modified?
Hey, maybe we could have a first supervised learning stage,
with one class per {case U augmentations}. Then we do unsupervised
learning in the feature space obtained from the augmentations.
Hopefully the features will get rid of all the class-invariant
patterns.
We wouldn't need to do the first stage with all training data.
That wouldn't be a good idea because top layer would be too big.
We should select one case per class. (Not 2 cases from same class,
because don't want to create a separate class for each, that would
really confuse the net).
The biological motivation for this is that we learn transformation
invariances by watching a single object move around space
continuously. Because we see video and can segment, we know it's the
same object, so we learn to become invariant to all the
transformations that occur on it as it moves around.
As an experiment, do just that: take 24 frames per second, recognise
the moving person, watch it for a certain amount of time (ie take
maybe 100,000s of images of it) and then do the same when that person
is not present, and learn the supervised binary classifier task.
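
A sketch of the first supervised stage, assuming one seed image per class and a few cheap label-preserving augmentations: each {case U augmentations} set becomes one pseudo-class (exemplar-style pretraining), and a later unsupervised step would then cluster in the feature space this classifier learns. The augmentations and sizes are toy stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)

    def augment(img):
        """Cheap label-preserving jitter: horizontal flip, brightness, small translation."""
        out = img.copy()
        if rng.random() < 0.5:
            out = out[:, ::-1]                                    # horizontal flip
        out = np.clip(out * rng.uniform(0.7, 1.3), 0.0, 1.0)      # brightness jitter
        out = np.roll(out, rng.integers(-2, 3), axis=1)           # small shift (crop stand-in)
        return out

    def build_pseudoclass_set(seed_images, n_aug=50):
        """One pseudo-class per seed image: the seed plus its augmentations share a label."""
        xs, ys = [], []
        for label, img in enumerate(seed_images):
            for _ in range(n_aug):
                xs.append(augment(img))
                ys.append(label)
        return np.array(xs), np.array(ys)

    seeds = [rng.random((32, 32)) for _ in range(10)]    # one case per class, as argued above
    X, y = build_pseudoclass_set(seeds)
    print(X.shape, y.shape)                              # (500, 32, 32) (500,)
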
Distributed Representations are a great prior for data in nature
=================================================================
Y. Bengio:
"Distributed because of compositionality, which results from
hierarchies of layers.
Compositionality of the parameters, i.e., the same parameters can
be re-used in many contexts, so O(N) parameters can allow to
distinguish O(2^N) regions in input space, whereas with nearest-
neighbor-like things, you need O(N) parameters (i.e. O(N) examples)
to characterize a function that can distinguish between O(N) regions.
This "gain" is only for some target functions, or more generally,
we can think of it like a prior. If the prior is applicable to our
target distribution, we can gain a lot. As usual in ML, there is no
real free lunch. The good news is that this prior is very broad and
seems to cover most of the things that humans learn about. What it
basically says is that the data we observe are explained by a bunch
of underlying factors, and that you can learn about each of these
factors WITHOUT requiring to see all of the configurations of the
other factors. This is how you are able to generalize to new
configurations and get this exponential statistical gain."
The learned discriminative frontier in the input space is assumed to
have lots of "symmetries", so once a few training examples in one
region give it the right shape, a "similar" shape is correctly
reproduced in another region where no training examples exist?
IDEA: for a specific dataset and architecture:
1) analytically determine (two) regions in input space where shape
of discriminative frontier is governed by same parameters
2) deliberately remove all training cases from one region
3) see whether test cases in this region are correctly classified?
background papers:
- 'symmetries': ICLR paper with Pascanu et al on theoretical
analysis of number of regions.
- non-local estimation of manifold structure: bit.ly/1pELZG5
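
A quick empirical companion to the region-counting analysis cited above: sample a 2-D input grid, push it through a random one-hidden-layer ReLU net, and count the distinct activation patterns (linear regions the grid lands in). Even a single layer carves out far more regions than it has units; the cited papers analyse how depth compounds this.

    import numpy as np

    rng = np.random.default_rng(0)

    def count_regions_hit(n_hidden, grid=200):
        """Count distinct ReLU on/off patterns over a 2-D input grid for one random layer."""
        W = rng.normal(size=(2, n_hidden))
        b = rng.normal(size=n_hidden)
        xs = np.linspace(-3, 3, grid)
        X = np.array(np.meshgrid(xs, xs)).reshape(2, -1).T     # all grid points, (grid^2, 2)
        patterns = (X @ W + b > 0)                             # which units fire where
        return len(np.unique(patterns, axis=0))

    for n in (4, 8, 16, 32):
        print(n, "hidden units ->", count_regions_hit(n), "regions hit by the grid")
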
Sparsity vs Distributed vs Dropout; visualise training inside
=============================================================
We're told sparsity is great and distributed representations are
great and dropout is great. Yet sparsity and dropout reduce the
extent of distributiveness. What's really going on? Why is 0.5
dropout optimal? Is there an optimum between sparsity and density
that 0.5 dropout corresponds to? Can we develop a conceptual framework
that leads to an analytical solution? Or can we train nets more or
less well and visualise the activities in the entire network to
get a detailed and clear understanding of what's going on?
Better error surface at each layer with analytical bias update
==============================================================
the error surface should be nice and bowl-like
this is achieved when the incoming data (or weights?) is symmetric and zero-mean
this is done by demeaning the data, but that only applies to 1st layer neurons
what about subsequent layers? the data can be demeaned again with the biases
so why not update the biases with the mean of the projected data?
maybe this mean can be computed dynamically?
---
data should also have unit std dev, find out why
can something similar be done for that too?
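
A numpy sketch of the bias-update note above: after projecting the data through a layer's weights, nudge the biases toward the value that makes the pre-activations zero-mean, with a momentum term as the dynamically computed version. Just an illustration of the arithmetic, not a claim about how it interacts with gradient training.

    import numpy as np

    rng = np.random.default_rng(0)

    def demeaning_bias_update(X, W, b, momentum=0.9):
        """Nudge biases toward the value that zero-means this layer's pre-activations.

        X: (batch, d_in) layer input, W: (d_in, d_out), b: (d_out,).
        Returns the updated biases and the re-centred pre-activations.
        """
        z = X @ W + b
        target_b = b - z.mean(axis=0)                        # bias that exactly demeans this batch
        b_new = momentum * b + (1.0 - momentum) * target_b   # running / dynamic version
        return b_new, X @ W + b_new

    X = rng.normal(loc=3.0, size=(128, 20))            # deliberately off-centre layer input
    W = rng.normal(size=(20, 50))
    b = np.zeros(50)
    for _ in range(20):                                # repeated batches pull the mean toward zero
        b, z = demeaning_bias_update(X, W, b)
    print("mean |pre-activation mean| after updates:", np.abs(z.mean(axis=0)).mean())
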
Yann LeCun convnet global optimisation
======================================
https://www.youtube.com/watch?v=zPVHH7ZJi9Q&t=1h12m55s
Andrew Davison talk at royal society
====================================
1. learning from remote-controlled Stanford video?
-> highly supervised and rich labels, literally says which
2. prior knowledge of what's in the scene rather than brute-force reconstruction
-> instead of being super granular at the object level, how about more generalisable knowledge, such as: surfaces of human-constructed objects tend to be composed of planes or curves, so linear or quadratic equations, so not many params to estimate?
3. learning optimal features for SLAM?