---
marp: true
math: true
---
- Know the possibilities, architecture and key components of a feedforward neural network.
- Understand how neural networks are trained and tuned.
An attempt to explain synaptic plasticity, i.e. the adaptation of brain neurons during the learning process.
"The general idea is an old one, that any two cells or systems of cells that are repeatedly active at the same time will tend to become 'associated' so that activity in one facilitates activity in the other."
- Initialize the connection weights $\pmb{\omega}$ randomly.
- For each training sample $\pmb{x}^{(i)}$:
  - Compute the perceptron output $y'^{(i)}$.
  - Adjust the weights: $\pmb{\omega_{t+1}} = \pmb{\omega_t} + \eta (y^{(i)} - y'^{(i)}) \pmb{x}^{(i)}$
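Below is a minimal NumPy sketch of this update rule (the step activation, learning rate, and toy AND dataset are illustrative choices, not from the original):

```python
import numpy as np

def perceptron_train(X, y, eta=0.1, epochs=10):
    """Train a perceptron with the update rule above.

    X: (n_samples, n_features) inputs, y: binary targets (0 or 1).
    A constant 1 is appended to each input to act as the bias term.
    """
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # add bias input
    w = np.random.uniform(-0.5, 0.5, X.shape[1])  # random init
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(w, x_i) >= 0 else 0  # step activation
            w += eta * (y_i - y_pred) * x_i           # weight update
    return w

# Example: learn the (linearly separable) AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_train(X, y)
```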
A single perceptron cannot learn functions that are not linearly separable (XOR being the classic example).
At the time, no learning algorithm existed for training the hidden layers of an MLP.
- 1974: backpropagation theory (P. Werbos).
- 1986: learning through backpropagation (Rumelhart, Hinton, Williams).
- 1989: first research on deep neural networks (LeCun, Bengio).
- 1991: Universal approximation theorem. Given appropriate complexity and appropriate learning, a network can theoretically approximate any continuous function.
- The hidden layers of a neural network transform their input space.
- A network can be seen as a series of non-linear compositions applied to the input data.
- Given appropriate complexity and appropriate learning, a network can theoretically approximate any continuous function.
- One of the most important theoretical results for neural networks.
They are applied to the weighted sum of a neuron's inputs to produce its output.
They must be:
- non-linear, so that the network has access to a richer representation space and not only linear transformations;
- differentiable, so that gradients can be computed during learning.
This function "squashes" its input between 0 and 1, outputting a value that can be interpreted as the probability of the positive class. It is often used in the final layer of the network for binary classification tasks.
The hyperbolic tangent function has a similar shape to sigmoid, but outputs values in the $[-1,1]$ interval.
The Rectified Linear Unit function has replaced sigmoid and tanh as the default activation function in most contexts.
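For concreteness, here is a small NumPy sketch of these three activation functions:

```python
import numpy as np

def sigmoid(z):
    """Squashes z between 0 and 1: usable as a probability."""
    return 1 / (1 + np.exp(-z))

def tanh(z):
    """Similar shape to sigmoid, but outputs in [-1,1]."""
    return np.tanh(z)

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return np.maximum(0, z)
```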
To facilitate training, initial weights must be:
- non-zero
- random
- small.
Several techniques exist. A commonly used one is Xavier initialization.
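A minimal sketch of Xavier (Glorot) initialization for one layer's weight matrix, assuming the uniform variant:

```python
import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot uniform initialization for a layer's weight matrix.

    Draws from U(-limit, limit) with limit = sqrt(6 / (n_in + n_out)),
    keeping activation variance roughly constant across layers.
    """
    limit = np.sqrt(6 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))
```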
For binary classification tasks, the standard choice is the binary cross-entropy loss. For each sample of the batch, it compares the output of the model (a value between 0 and 1, interpreted as the probability of the positive class) to the expected binary label.
The standard choice for multiclass classification tasks is the cross-entropy loss, a.k.a. negative log-likelihood loss.
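As a rough illustration, both losses can be sketched in NumPy (the function names and the epsilon clipping for numerical stability are choices of this sketch):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean BCE over a batch. y_pred: predicted probabilities in [0,1]."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def cross_entropy(y_true_onehot, y_probas, eps=1e-12):
    """Mean multiclass cross-entropy. y_probas: one row of class probabilities per sample."""
    y_probas = np.clip(y_probas, eps, 1.0)
    return -np.mean(np.sum(y_true_onehot * np.log(y_probas), axis=1))
```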
The softmax function turns a vector of raw values into a probability distribution:

$$\sigma(\pmb{v})_k = \frac{e^{v_k}}{\sum_{k=1}^K e^{v_k}} \;\;\;\; \sum_{k=1}^K \sigma(\pmb{v})_k = 1$$

- $K$: number of labels.
- $\pmb{v}$: logits vector, i.e. raw predictions for each class.
- $\sigma(\pmb{v})_k \in [0,1]$: probability associated to label $k \in [1,K]$.
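A minimal NumPy sketch; the max-subtraction trick for numerical stability is an addition, not part of the formula above:

```python
import numpy as np

def softmax(v):
    """Turn a logits vector into a probability distribution."""
    e = np.exp(v - np.max(v))  # shift for numerical stability
    return e / np.sum(e)

probas = softmax(np.array([2.0, 1.0, 0.1]))
print(probas, probas.sum())  # probabilities summing to 1
```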
The training algorithm is as follows:
- On each iteration over the whole dataset (known as an epoch), and for each data batch inside the epoch, the model output is computed on the current batch.
- This output is used alongside expected results by the loss function to obtain the mean loss for the current batch.
- The gradient of the loss w.r.t. each model parameter is computed (backpropagation).
- The model parameters are updated in the opposite direction of their gradient (one GD step).
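A sketch of this training loop in PyTorch, assuming a toy model and random data (all choices below are illustrative):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup: a small MLP on random data (all values here are illustrative)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.rand(64, 4)
y = (X.sum(dim=1, keepdim=True) > 2).float()
dataloader = DataLoader(TensorDataset(X, y), batch_size=16, shuffle=True)

for epoch in range(5):  # each pass over the dataset is an epoch
    for inputs, targets in dataloader:
        outputs = model(inputs)             # forward pass on the current batch
        loss = criterion(outputs, targets)  # mean loss for the batch
        optimizer.zero_grad()               # reset accumulated gradients
        loss.backward()                     # backpropagation
        optimizer.step()                    # one gradient descent step
```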
Hyperparameters (parameters set before training, as opposed to the weights learned by the model) include:
- Number of epochs: an epoch is finished when all data samples have been presented to the model during training.
- Learning rate: rate of parameter change during gradient descent.
- Batch size: number of samples used for one gradient descent step during training.
Objective: minimize the loss function. Method: gradient descent.
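For reference, one gradient descent step updates the parameters $\pmb{\theta}$ in the opposite direction of the gradient, scaled by the learning rate $\eta$:

$$\pmb{\theta}_{t+1} = \pmb{\theta}_t - \eta \nabla_{\pmb{\theta}} \mathcal{L}(\pmb{\theta}_t)$$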
Objective: compute $\nabla_{\pmb{\theta}} \mathcal{L}$, the gradient of the loss w.r.t. every model parameter.
Method: apply the chain rule to compute partial derivatives backwards, starting from the current output.
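For example, for a single neuron $y' = \sigma(z)$ with $z = \pmb{\omega} \cdot \pmb{x}$ (a minimal illustrative case), the chain rule gives:

$$\frac{\partial \mathcal{L}}{\partial \omega_j} = \frac{\partial \mathcal{L}}{\partial y'} \cdot \frac{\partial y'}{\partial z} \cdot \frac{\partial z}{\partial \omega_j}$$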
In general order of importance:
- Number of layers
- Number of neurons on hidden layers
- Activation functions
- Learning rate
- Batch size
- ...
Limit weight values by adding a penalty to the loss function.
$$\mathcal{l1} = \frac{\lambda}{m} \sum |\theta_{ij}| \;\;\;\; \mathcal{l2} = \frac{2\lambda}{m} \sum \theta_{ij}^2$$
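A rough NumPy sketch of these penalties for a single weight matrix (function names are illustrative; coefficients match the formulas above):

```python
import numpy as np

def l1_penalty(weights, lam, m):
    """L1 penalty: sum of absolute weight values, scaled by lambda/m."""
    return lam / m * np.sum(np.abs(weights))

def l2_penalty(weights, lam, m):
    """L2 penalty: sum of squared weight values, scaled by 2*lambda/m."""
    return 2 * lam / m * np.sum(weights ** 2)
```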
During training, some input units are randomly set to 0. The network must adapt and become more general. The more units are dropped out, the stronger the regularization.
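A minimal sketch of inverted dropout on an activation matrix; the rate parameter and inverted-scaling convention are assumptions of this sketch:

```python
import numpy as np

def dropout(activations, rate=0.5):
    """Inverted dropout: zero out units with probability `rate` during training.

    Scaling by 1/(1-rate) keeps the expected activation unchanged,
    so no rescaling is needed at inference time.
    """
    mask = np.random.rand(*activations.shape) >= rate
    return activations * mask / (1 - rate)
```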