---
marp: true
math: true
---
- Know the possibilities, architecture and key components of a feedforward neural network.
- Understand how neural networks are trained and tuned.
An attempt to explain synaptic plasticity, i.e. the adaptation of brain neurons during the learning process.
"The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated' so that activity in one facilitates activity in the other."
- Initialize the connection weights $\pmb{\omega}$ randomly.
- For each training sample $\pmb{x}^{(i)}$:
  - Compute the perceptron output $y'^{(i)}$.
  - Adjust the weights: $\pmb{\omega_{t+1}} = \pmb{\omega_t} + \eta (y^{(i)} - y'^{(i)}) \pmb{x}^{(i)}$
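Below is a minimal NumPy sketch of this update rule (the step activation, learning rate, and toy AND dataset are illustrative choices, not from the original):

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=10):
    """Train a perceptron with the update rule above.

    X: (n_samples, n_features) inputs, y: binary targets (0 or 1).
    A constant 1 is appended to each input to act as the bias term.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias input
    w = np.random.uniform(-0.5, 0.5, X.shape[1])  # random init
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(w, x_i) >= 0 else 0  # step activation
            w += eta * (y_i - y_pred) * x_i           # weight update
    return w

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_train(X, y)
```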
A single perceptron cannot learn functions that are not linearly separable (XOR being the classic example).
At the time, no learning algorithm existed for training the hidden layers of an MLP.
- 1974: backpropagation theory (P. Werbos).
- 1986: learning through backpropagation (Rumelhart, Hinton, Williams).
- 1989: first research on deep neural networks (LeCun, Bengio).
- 1991: Universal approximation theorem. Given appropriate complexity and appropriate learning, a network can theoretically approximate any continuous function.
- The hidden layers of a neural network transform their input space.
- A network can be seen as a series of non-linear compositions applied to the input data.
- Given appropriate complexity and appropriate learning, a network can theoretically approximate any continuous function.
- One of the most important theoretical results for neural networks.
They are applied to the weighted sum of a neuron's inputs to produce its output.
They must be:
- non-linear, so that the network has access to a richer representation space and not only linear transformations;
- differentiable, so that gradients can be computed during learning.
This function "squashes" its input between 0 and 1, outputting a value that can be interpreted as the probability of the positive class. It is often used in the final layer of the network for binary classification tasks.
The hyperbolic tangent function has a similar shape to sigmoid, but outputs values in the $[-1,1]$ interval.
The Rectified Linear Unit function has replaced sigmoid and tanh as the default activation function in most contexts.
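For concreteness, here is a small NumPy sketch of these three activation functions:

```python
import numpy as np

def sigmoid(z):
    """Squashes z between 0 and 1: usable as a probability."""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    """Similar shape to sigmoid, but outputs in [-1,1]."""
    return np.tanh(z)

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return np.maximum(0, z)
```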
To facilitate training, initial weights must be:
- non-zero
- random
- small.
Several techniques exist. A commonly used one is Xavier initialization.
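A minimal sketch of Xavier (Glorot) initialization for one layer's weight matrix, assuming the uniform variant:

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot uniform initialization for a layer's weight matrix.

    Draws from U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)),
    keeping activation variance roughly constant across layers.
    """
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))
```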
For binary classification tasks, the standard choice is the binary cross-entropy loss. For each sample of the batch, it compares the output of the model (a value between 0 and 1, interpreted as the probability of the positive class) to the expected binary label.
The standard choice for multiclass classification tasks is the cross-entropy loss, a.k.a. negative log-likelihood loss.
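As a rough illustration, both losses can be sketched in NumPy (the function names and the epsilon clipping for numerical stability are choices of this sketch):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over a batch. y_pred: predicted probabilities in [0,1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cross_entropy(y_true_onehot, y_probas, eps=1e-12):
    """Mean multiclass cross-entropy. y_probas: one row of class probabilities per sample."""
    y_probas = np.clip(y_probas, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_probas), axis=1))
```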
The softmax function turns a vector of raw values into a probability distribution:

$$\sigma(\pmb{v})_k = \frac{e^{v_k}}{\sum_{k=1}^K e^{v_k}} \;\;\;\; \sum_{k=1}^K \sigma(\pmb{v})_k = 1$$

- $K$: number of labels.
- $\pmb{v}$: logits vector, i.e. raw predictions for each class.
- $\sigma(\pmb{v})_k \in [0,1]$: probability associated to label $k \in [1,K]$.
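A minimal NumPy sketch; the max-subtraction trick for numerical stability is an addition, not part of the formula above:

```python
import numpy as np

def softmax(v):
    """Turn a logits vector into a probability distribution."""
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / np.sum(e)

probas = softmax(np.array([2.0, 1.0, 0.1]))
print(probas, probas.sum())  # probabilities summing to 1
```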
The training algorithm is as follows:
- On each iteration over the whole dataset (known as an epoch), and for each data batch inside the epoch, the model output is computed on the current batch.
- This output is used alongside expected results by the loss function to obtain the mean loss for the current batch.
- The gradient of the loss w.r.t. each model parameter is computed (backpropagation).
- The model parameters are updated in the opposite direction of their gradient (one GD step).
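A sketch of this training loop in PyTorch, assuming a toy model and random data (all choices below are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup: a small MLP on random data (all values here are illustrative)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.rand(64, 4)
y = (X.sum(dim=1, keepdim=True) > 2).float()
dataloader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

for epoch in range(5):  # each pass over the dataset is an epoch
    for inputs, targets in dataloader:
        outputs = model(inputs)             # forward pass on the current batch
        loss = criterion(outputs, targets)  # mean loss for the batch
        optimizer.zero_grad()               # reset accumulated gradients
        loss.backward()                     # backpropagation
        optimizer.step()                    # one gradient descent step
```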
Hyperparameters (parameters set before training, as opposed to the weights learned by the model) include:
- Number of epochs: an epoch is finished when all data samples have been presented to the model during training.
- Learning rate: rate of parameter change during gradient descent.
- Batch size: number of samples used for one gradient descent step during training.
Objective: minimize the loss function. Method: gradient descent.
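For reference, one gradient descent step updates the parameters $\pmb{\theta}$ in the opposite direction of the gradient, scaled by the learning rate $\eta$:

$$\pmb{\theta}_{t+1} = \pmb{\theta}_t - \eta \nabla_{\pmb{\theta}} \mathcal{L}(\pmb{\theta}_t)$$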
Objective: compute $\nabla_{\pmb{\theta}} \mathcal{L}$, the gradient of the loss w.r.t. every model parameter.
Method: apply the chain rule to compute partial derivatives backwards, starting from the current output.
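For example, for a single neuron $y' = \sigma(z)$ with $z = \pmb{\omega} \cdot \pmb{x}$ (a minimal illustrative case), the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial \omega_j} = \frac{\partial \mathcal{L}}{\partial y'} \cdot \frac{\partial y'}{\partial z} \cdot \frac{\partial z}{\partial \omega_j}$$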
In general order of importance:
- Number of layers
- Number of neurons on hidden layers
- Activation functions
- Learning rate
- Batch size
- ...
Limit weight values by adding a penalty to the loss function.
$$\mathcal{l1} = \frac{\lambda}{m} \sum |\theta_{ij}| \;\;\;\; \mathcal{l2} = \frac{2\lambda}{m} \sum \theta_{ij}^2$$
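A rough NumPy sketch of these penalties for a single weight matrix (function names are illustrative; coefficients match the formulas above):

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """L1 penalty: sum of absolute weight values, scaled by lambda/m."""
    return lam / m * np.sum(np.abs(weights))

def l2_penalty(weights, lam, m):
    """L2 penalty: sum of squared weight values, scaled by 2*lambda/m."""
    return 2 * lam / m * np.sum(weights ** 2)
```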
During training, some input units are randomly set to 0. The network must adapt and become more general. The more units are dropped out, the stronger the regularization.
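A minimal sketch of inverted dropout on an activation matrix; the rate parameter and inverted-scaling convention are assumptions of this sketch:

```python
import numpy as np

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out units with probability `rate` during training.

    Scaling by 1/(1-rate) keeps the expected activation unchanged,
    so no rescaling is needed at inference time.
    """
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1 - rate)
```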