C2-DeepNeuralNetworks.qmd

---
output: html_document
editor_options:
  chunk_output_type: console
---

# Deep Neural Networks (DNN)

```{r}
#| echo: false
#| include: false
#| results: false
reticulate::use_condaenv("r-reticulate")
library(tensorflow)
tf
tf$abs(3.)
```

We can use TensorFlow directly from R (see @sec-tensorflowintro for an introduction to TensorFlow), and we could use this knowledge to implement a neural network in TensorFlow directly in R. However, this can be quite cumbersome. For simple problems, it is usually faster to use a higher-level API that helps us implement the machine learning models in TensorFlow. The most common of these is Keras.

Keras is a powerful framework for building and training neural networks with just a few lines of code. As of the end of 2018, Keras and TensorFlow are fully interoperable, allowing us to take advantage of the best of both.

The goal of this lesson is to familiarize you with Keras. If you have TensorFlow installed, you can find Keras inside TensorFlow: tf.keras. However, the RStudio team has built an R package on top of tf.keras that is more convenient to use. To load the Keras package, type

```{r chunk_chapter3_61}
library(keras)
# or library(torch)
```

## Example workflow in Keras / Torch

We build a small classifier to predict the three species of the iris data set. Load the necessary packages and data sets:

```{r chunk_chapter3_62}
library(keras)
library(tensorflow)
library(torch)
set_random_seed(321L, disable_gpu = FALSE)	# Already sets R's random seed.

data(iris)
head(iris)
```

For neural networks, it is beneficial to scale the predictors (scaling = centering and standardization, see ?scale). We also split our data into predictors (X) and response (Y = the three species).

```{r chunk_chapter3_63, cache=TRUE}
X = scale(iris[,1:4])
Y = iris[,5]
```

Additionally, Keras/TensorFlow cannot handle factors and we have to create contrasts (one-hot encoding). To do so, we have to specify the number of categories. This can be tricky for a beginner, because in other programming languages like Python and C++, arrays start at zero. Thus, when we would specify 3 as number of classes for our three species, we would have the classes 0,1,2,3. Keep this in mind.

```{r chunk_chapter3_64, cache=TRUE}
Y = to_categorical(as.integer(Y) - 1L, 3)
head(Y) # 3 columns, one for each level of the response.
```

After having prepared the data, we start now with the typical workflow in keras.

**1. Initialize a sequential model in Keras:**

::: panel-tabset
## Keras

```{r chunk_chapter3_65, cache=TRUE}
model = keras_model_sequential()
```

## Torch

Torch users can skip this step.
:::

A sequential Keras model is a higher order type of model within Keras and consists of one input and one output model.

**2. Add hidden layers to the model (we will learn more about hidden layers during the next days).**

When specifying the hidden layers, we also have to specify the shape and a so called *activation function*. You can think of the activation function as decision for what is forwarded to the next neuron (but we will learn more about it later). If you want to know this topic in even more depth, consider watching the videos presented in section \@ref(basicMath).

The shape of the input is the number of predictors (here 4) and the shape of the output is the number of classes (here 3).

::: panel-tabset
## Keras

```{r chunk_chapter3_66, cache=TRUE}
model %>%
  layer_dense(units = 20L, activation = "relu", input_shape = list(4L)) %>%
  layer_dense(units = 20L) %>%
  layer_dense(units = 20L) %>%
  layer_dense(units = 3L, activation = "softmax") 
```

## Torch

The Torch syntax is very similar, we will give a list of layers to the "nn_sequential" function. Here, we have to specify the softmax activation function as an extra layer:

```{r chunk_chapter3_67}
model_torch = 
  nn_sequential(
    nn_linear(4L, 20L),
    nn_linear(20L, 20L),
    nn_linear(20L, 20L),
    nn_linear(20L, 3L),
    nn_softmax(2)
  )
```
:::

-   softmax scales a potential multidimensional vector to the interval $(0, 1]$ for each component. The sum of all components equals 1. This might be very useful for example for handling probabilities. **Ensure ther the labels start at 0! Otherwise the softmax function does not work well!**

**3. Compile the model with a loss function (here: cross entropy) and an optimizer (here: Adamax).**

We will learn about other options later, so for now, do not worry about the "**learning_rate**" ("**lr**" in Torch or earlier in TensorFlow) argument, cross entropy or the optimizer.

::: panel-tabset
## Keras

```{r chunk_chapter3_68, cache=TRUE}
model %>%
  compile(loss = loss_categorical_crossentropy,
          keras::optimizer_adamax(learning_rate = 0.001))
summary(model)
```

## Torch

Specify optimizer and the parameters which will be trained (in our case the parameters of the network):

```{r chunk_chapter3_69}
optimizer_torch = optim_adam(params = model_torch$parameters, lr = 0.001)
```
:::

**4. Fit model in 30 iterations (epochs)**

::: panel-tabset
## Keras

```{r chunk_chapter3_70, cache=TRUE}
library(tensorflow)
library(keras)
set_random_seed(321L, disable_gpu = FALSE)	# Already sets R's random seed.

model_history =
  model %>%
    fit(x = X, y = apply(Y, 2, as.integer), epochs = 30L,
        batch_size = 20L, shuffle = TRUE)
```

## Torch

In Torch, we jump directly to the training loop, however, here we have to write our own training loop:

1.  Get a batch of data.
2.  Predict on batch.
3.  Ccalculate loss between predictions and true labels.
4.  Backpropagate error.
5.  Update weights.
6.  Go to step 1 and repeat.

```{r chunk_chapter3_71}
library(torch)
torch_manual_seed(321L)
set.seed(123)

# Calculate number of training steps.
epochs = 30
batch_size = 20
steps = round(nrow(X)/batch_size * epochs)

X_torch = torch_tensor(X)
Y_torch = torch_tensor(apply(Y, 1, which.max)) 

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps){
  # Get batch.
  indices = sample.int(nrow(X), batch_size)
  
  # Reset backpropagation.
  optimizer_torch$zero_grad()
  
  # Predict and calculate loss.
  pred = model_torch(X_torch[indices, ])
  loss = nnf_cross_entropy(pred, Y_torch[indices])
  
  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()
  
  log_losses[i] = as.numeric(loss)
}
```
:::

**5. Plot training history:**

::: panel-tabset
## Keras

```{r chunk_chapter3_72, cache=TRUE, message=FALSE, warning=FALSE}
plot(model_history)
```

## Torch

```{r chunk_chapter3_73}
plot(log_losses, xlab = "steps", ylab = "loss", las = 1)
```
:::

**6. Create predictions:**

::: panel-tabset
## Keras

```{r chunk_chapter3_74, cache=TRUE}
predictions = predict(model, X) # Probabilities for each class.
```

Get probabilities:

```{r chunk_chapter3_75, cache=TRUE}
head(predictions) # Quasi-probabilities for each species.
```

For each plant, we want to know for which species we got the highest probability:

```{r chunk_chapter3_76, cache=TRUE}
preds = apply(predictions, 1, which.max) 
print(preds)
```

## Torch

```{r chunk_chapter3_77, cache=TRUE}
model_torch$eval()
preds_torch = model_torch(torch_tensor(X))
preds_torch = apply(preds_torch, 1, which.max) 
print(preds_torch)
```
:::

**7. Calculate Accuracy (how often we have been correct):**

::: panel-tabset
## Keras

```{r chunk_chapter3_78, cache=TRUE}
mean(preds == as.integer(iris$Species))
```

## Torch

```{r chunk_chapter3_78_torch, cache=TRUE}
mean(preds_torch == as.integer(iris$Species))
```
:::

**8. Plot predictions, to see if we have done a good job:**

```{r chunk_chapter3_79, cache=TRUE}
oldpar = par(mfrow = c(1, 2))
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species,
     main = "Observed")
plot(iris$Sepal.Length, iris$Petal.Length, col = preds,
     main = "Predicted")
par(oldpar)   # Reset par.
```

So you see, building a neural network is very easy with Keras or Torch and you can already do it on your own.

## Exercise

::: {.callout-caution icon="false"}
#### Task: Regression with keras

We now build a regression for the airquality data set with Keras/Torch. We want to predict the variable "Ozone" (continuous).

Tasks:

-   **Fill in the missing steps**

::: panel-tabset
## Keras

Before we start, load and prepare the data set:

```{r chunk_chapter3_task_26, eval=TRUE}
library(tensorflow)
library(keras)
set_random_seed(321L, disable_gpu = FALSE)	# Already sets R's random seed.

data = airquality

summary(data)
```

1.  **There are NAs in the data, which we have to remove because Keras cannot handle NAs. If you don't know how to remove NAs from a data.frame, use Google (e.g. with the query: "remove-rows-with-all-or-some-nas-missing-values-in-data-frame").**

`r hide("Solution")`

```{r}
data = data[complete.cases(data),]  # Remove NAs.
summary(data)
```

`r unhide()`

2.  **Split the data in features ($\boldsymbol{X}$) and response ($\boldsymbol{y}$, Ozone) and scale the $\boldsymbol{X}$ matrix.**

`r hide("Solution")`

```{r}
x = scale(data[,2:6])
y = data[,1]
```

`r unhide()`

3.  **Create a sequential Keras model.**

`r hide("Solution")`

```{r}
library(tensorflow)
library(keras)
model = keras_model_sequential()
```

`r unhide()`

4.  **Add hidden layers (input and output layer are already specified, you have to add hidden layers between them):**

```{r chunk_chapter3_task_28, eval=FALSE, purl=FALSE}
model %>%
  layer_dense(units = 20L, activation = "relu", input_shape = list(5L)) %>%
   ....
  layer_dense(units = 1L, activation = "linear")
```

-   Why do we use 5L as input shape?
-   Why only one output node and "linear" activation layer?

`r hide("Solution")`

```{r}
model %>%
  layer_dense(units = 20L, activation = "relu", input_shape = list(5L)) %>%
  layer_dense(units = 20L) %>%
  layer_dense(units = 20L) %>%
  layer_dense(units = 1L, activation = "linear")
```

`r unhide()`

5.  **Compile the model**

```{r chunk_chapter3_task_29, eval=TRUE, purl=FALSE}
model %>%
  compile(loss = loss_mean_squared_error, optimizer_adamax(learning_rate = 0.05))
```

What is the "mean_squared_error" loss?

6.  **Fit model:**

`r hide("Solution")`

```{r}
model_history =
  model %>%
  fit(x = x, y = as.numeric(y), epochs = 100L,
      batch_size = 20L, shuffle = TRUE)
```

`r unhide()`

7.  **Plot training history.**

`r hide("Solution")`

```{r}
plot(model_history)
```

`r unhide()`

8.  **Create predictions.**

`r hide("Solution")`

```{r}
pred_keras = predict(model, x)
```

`r unhide()`

9.  **Compare your Keras model with a linear model:**

```{r chunk_chapter3_task_30, eval=TRUE, purl=FALSE}
fit = lm(Ozone ~ ., data = data)
pred_lm = predict(fit, data)
rmse_lm = mean(sqrt((y - pred_lm)^2))
rmse_keras = mean(sqrt((y - pred_keras)^2))
print(rmse_lm)
print(rmse_keras)
```

## Torch

Before we start, load and prepare the data set:

```{r chunk_chapter3_task_26_torch, eval=TRUE}
library(torch)

data = airquality
summary(data)
plot(data)
```

1.  **There are NAs in the data, which we have to remove because Keras cannot handle NAs. If you don't know how to remove NAs from a data.frame, use Google (e.g. with the query: "remove-rows-with-all-or-some-nas-missing-values-in-data-frame").**

`r hide("Solution")`

```{r}
data = data[complete.cases(data),]  # Remove NAs.
summary(data)
```

`r unhide()`

2.  **Split the data in features ($\boldsymbol{X}$) and response ($\boldsymbol{y}$, Ozone) and scale the $\boldsymbol{X}$ matrix.**

`r hide("Solution")`

```{r}
x = scale(data[,2:6])
y = data[,1]
```

`r unhide()`

3.  **Pass a list of layer objects to a sequential network class of torch (input and output layer are already specified, you have to add hidden layers between them):**

```{r chunk_chapter3_task_28_torch, eval=FALSE, purl=FALSE}

model_torch = 
  nn_sequential(
    nn_linear(5L, 20L),
    ...
    nn_linear(20L, 1L),
  )
```

-   Why do we use 5L as input shape?
-   Why only one output node and no activation layer?

`r hide("Solution")`

```{r}
library(torch)

model_torch = 
  nn_sequential(
    nn_linear(5L, 20L),
    nn_relu(),
    nn_linear(20L, 20L),
    nn_relu(),
    nn_linear(20L, 20L),
    nn_relu(),
    nn_linear(20L, 1L),
  )
```

`r unhide()`

4.  **Create optimizer**

We have to pass the network's parameters to the optimizer (how is this different to keras?)

```{r chunk_chapter3_task_29_torch, eval=TRUE, purl=FALSE}
optimizer_torch = optim_adam(params = model_torch$parameters, lr = 0.05)
```

5.  **Fit model**

In torch we write the trainings loop on our own. Complete the trainings loop:

```{r chunk_chapter3_task_300_torch, eval=FALSE, purl=FALSE}
# Calculate number of training steps.
epochs = ...
batch_size = 32
steps = ...

X_torch = torch_tensor(x)
Y_torch = torch_tensor(y, ...) 

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps){
  # Get batch indices.
  indices = sample.int(nrow(x), batch_size)
  X_batch = ...
  Y_batch = ...
  
  # Reset backpropagation.
  optimizer_torch$zero_grad()
  
  # Predict and calculate loss.
  pred = model_torch(X_batch)
  loss = ...
  
  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()
  
  log_losses[i] = as.numeric(loss)
}
```

`r hide("Solution")`

```{r}
# Calculate number of training steps.
epochs = 100
batch_size = 32
steps = round(nrow(x)/batch_size*epochs)

X_torch = torch_tensor(x)
Y_torch = torch_tensor(y, dtype = torch_float32())$view(list(-1, 1)) 

# Set model into training status.
model_torch$train()

log_losses = NULL

# Training loop.
for(i in 1:steps){
  # Get batch indices.
  indices = sample.int(nrow(x), batch_size)
  X_batch = X_torch[indices,]
  Y_batch = Y_torch[indices,]
  
  # Reset backpropagation.
  optimizer_torch$zero_grad()
  
  # Predict and calculate loss.
  pred = model_torch(X_batch)
  loss = nnf_mse_loss(pred, Y_batch)
  
  # Backpropagation and weight update.
  loss$backward()
  optimizer_torch$step()
  
  log_losses[i] = as.numeric(loss)
}
```

`r unhide()`

Tips:

-   Number of training \$ steps = Number of rows / batchsize \* Epochs \$
-   Search torch::nnf\_... for the correct loss function (mse...)
-   Make sure that X_torch and Y_torch have the same data type! (you can set the dtype via torch_tensor(..., dtype = ...)) \_ Check the dimension of Y_torch, we need a matrix!

6.  **Plot training history.**

`r hide("Solution")`

```{r}
plot(y = log_losses, x = 1:steps, xlab = "Epoch", ylab = "MSE")
```

`r unhide()`

7.  **Create predictions.**

`r hide("Solution")`

```{r}
pred_torch = model_torch(X_torch)
pred_torch = as.numeric(pred_torch) # cast torch to R object 
```

`r unhide()`

8.  **Compare your Torch model with a linear model:**

```{r chunk_chapter3_task_311_torch, eval=TRUE, purl=FALSE}
fit = lm(Ozone ~ ., data = data)
pred_lm = predict(fit, data)
rmse_lm = mean(sqrt((y - pred_lm)^2))
rmse_torch = mean(sqrt((y - pred_torch)^2))
print(rmse_lm)
print(rmse_torch)
```
:::
:::


::: {.callout-caution icon="false"}
#### Task: Titanic dataset

Build a Keras DNN for the titanic dataset

:::


::: {.callout-caution icon="false"}
#### Bonus Task: More details on the inner working of Keras

The next task differs for Torch and Keras users. Keras users will learn more about the inner working of training while Torch users will learn how to simplify and generalize the training loop.

Go through the code and try to understand it.

::: panel-tabset
## Keras

Similar to Torch, here we will write the training loop ourselves in the following. The training loop consists of several steps:

1.  Sample batches of X and Y data
2.  Open the gradientTape to create a computational graph (autodiff)
3.  Make predictions and calculate loss
4.  Update parameters based on the gradients at the loss (go back to 1. and repeat)

```{r}
library(tensorflow)
library(keras)

data = airquality
data = data[complete.cases(data),]  # Remove NAs.
x = scale(data[,2:6])
y = data[,1]

layers = tf$keras$layers
model = tf$keras$models$Sequential(
  c(
    layers$InputLayer(input_shape = list(5L)),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 1L, activation = NULL) # No activation == "linear".
  )
)

epochs = 200L
optimizer = tf$keras$optimizers$Adamax(0.01)

# Stochastic gradient optimization is more efficient
# in each optimization step, we use a random subset of the data.
get_batch = function(batch_size = 32L){
  indices = sample.int(nrow(x), size = batch_size)
  return(list(bX = x[indices,], bY = y[indices]))
}
get_batch() # Try out this function.

steps = floor(nrow(x)/32) * epochs  # We need nrow(x)/32 steps for each epoch.
for(i in 1:steps){
  # Get data.
  batch = get_batch()

  # Transform it into tensors.
  bX = tf$constant(batch$bX)
  bY = tf$constant(matrix(batch$bY, ncol = 1L))
  
  # Automatic differentiation:
  # Record computations with respect to our model variables.
  with(tf$GradientTape() %as% tape,
    {
      pred = model(bX) # We record the operation for our model weights.
      loss = tf$reduce_mean(tf$keras$losses$mean_squared_error(bY, pred))
    }
  )
  
  # Calculate the gradients for our model$weights at the loss / backpropagation.
  gradients = tape$gradient(loss, model$weights) 

  # Update our model weights with the learning rate specified above.
  optimizer$apply_gradients(purrr::transpose(list(gradients, model$weights))) 
  if(! i%%30){
    cat("Loss: ", loss$numpy(), "\n") # Print loss every 30 steps (not epochs!).
  }
}
```

## Torch

Keras and Torch use dataloaders to generate the data batches. Dataloaders are objects that return batches of data infinetly. Keras create the dataloader object automatically in the fit function, in Torch we have to write them ourselves:

1.  Define a dataset object. This object informs the dataloader function about the inputs, outputs, length (nrow), and how to sample from it.
2.  Create an instance of the dataset object by calling it and passing the actual data to it
3.  Pass the initiated dataset to the dataloader function

```{r}
library(torch)

data = airquality
data = data[complete.cases(data),]  # Remove NAs.
x = scale(data[,2:6])
y = matrix(data[,1], ncol = 1L)


torch_dataset = torch::dataset(
    name = "airquality",
    initialize = function(X,Y) {
      self$X = torch::torch_tensor(as.matrix(X), dtype = torch_float32())
      self$Y = torch::torch_tensor(as.matrix(Y), dtype = torch_float32())
    },
    .getitem = function(index) {
      x = self$X[index,]
      y = self$Y[index,]
      list(x, y)
    },
    .length = function() {
      self$Y$size()[[1]]
    }
  )
dataset = torch_dataset(x,y)
dataloader = torch::dataloader(dataset, batch_size = 30L, shuffle = TRUE)
```

Our dataloader is again an object which has to be initiated. The initiated object returns a list of two elements, batch x and batch y. The initated object stops returning batches when the dataset was completly transversed (no worries, we don't have to all of this ourselves).

Our training loop has changed:

```{r}
model_torch = nn_sequential(
  nn_linear(5L, 50L),
  nn_relu(),
  nn_linear(50L, 50L),
  nn_relu(), 
  nn_linear(50L, 50L),
  nn_relu(), 
  nn_linear(50L, 1L)
)
epochs = 50L
opt = optim_adam(model_torch$parameters, 0.01)
train_losses = c()
for(epoch in 1:epochs){
  train_loss = c()
  coro::loop(
    for(batch in dataloader) { 
      opt$zero_grad()
      pred = model_torch(batch[[1]])
      loss = nnf_mse_loss(pred, batch[[2]])
      loss$backward()
      opt$step()
      train_loss = c(train_loss, loss$item())
    }
  )
  train_losses = c(train_losses, mean(train_loss))
  if(!epoch%%10) cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(train_loss)))
}
```

```{r}
plot(train_losses, type = "o", pch = 15,
        col = "darkblue", lty = 1, xlab = "Epoch",
        ylab = "Loss", las = 1)
```
:::

Now change the code from above for the iris data set. Tip: In tf$keras$losses\$... you can find various loss functions.

::: panel-tabset
## Keras

`r hide("Click here to see the solution")`

```{r}
library(tensorflow)
library(keras)
set_random_seed(321L, disable_gpu = FALSE)  # Already sets R's random seed.

x = scale(iris[,1:4])
y = iris[,5]
y = keras::to_categorical(as.integer(Y)-1L, 3)

layers = tf$keras$layers
model = tf$keras$models$Sequential(
  c(
    layers$InputLayer(input_shape = list(4L)),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 20L, activation = tf$nn$relu),
    layers$Dense(units = 3L, activation = tf$nn$softmax)
  )
)

epochs = 200L
optimizer = tf$keras$optimizers$Adamax(0.01)

# Stochastic gradient optimization is more efficient.
get_batch = function(batch_size = 32L){
  indices = sample.int(nrow(x), size = batch_size)
  return(list(bX = x[indices,], bY = y[indices,]))
}

steps = floor(nrow(x)/32) * epochs # We need nrow(x)/32 steps for each epoch.

for(i in 1:steps){
  batch = get_batch()
  bX = tf$constant(batch$bX)
  bY = tf$constant(batch$bY)
  
  # Automatic differentiation.
  with(tf$GradientTape() %as% tape,
    {
      pred = model(bX) # we record the operation for our model weights
      loss = tf$reduce_mean(tf$keras$losses$categorical_crossentropy(bY, pred))
    }
  )
  
  # Calculate the gradients for the loss at our model$weights / backpropagation.
  gradients = tape$gradient(loss, model$weights)
  
  # Update our model weights with the learning rate specified above.
  optimizer$apply_gradients(purrr::transpose(list(gradients, model$weights)))
  
  if(! i%%30){
    cat("Loss: ", loss$numpy(), "\n") # Print loss every 30 steps (not epochs!).
  }
}
```

`r unhide()`

## Torch

`r hide("Click here to see the solution")`

```{r, echo = TRUE}
library(torch)

x = scale(iris[,1:4])
y = iris[,5]
y = as.integer(iris$Species)


torch_dataset = torch::dataset(
    name = "iris",
    initialize = function(X,Y) {
      self$X = torch::torch_tensor(as.matrix(X), dtype = torch_float32())
      self$Y = torch::torch_tensor(Y, dtype = torch_long())
    },
    .getitem = function(index) {
      x = self$X[index,]
      y = self$Y[index]
      list(x, y)
    },
    .length = function() {
      self$Y$size()[[1]]
    }
  )
dataset = torch_dataset(x,y)
dataloader = torch::dataloader(dataset, batch_size = 30L, shuffle = TRUE)


model_torch = nn_sequential(
  nn_linear(4L, 50L),
  nn_relu(),
  nn_linear(50L, 50L),
  nn_relu(), 
  nn_linear(50L, 50L),
  nn_relu(), 
  nn_linear(50L, 3L)
)
epochs = 50L
opt = optim_adam(model_torch$parameters, 0.01)
train_losses = c()
for(epoch in 1:epochs){
  train_loss
  coro::loop(
    for(batch in dataloader) { 
      opt$zero_grad()
      pred = model_torch(batch[[1]])
      loss = nnf_cross_entropy(pred, batch[[2]])
      loss$backward()
      opt$step()
      train_loss = c(train_loss, loss$item())
    }
  )
  train_losses = c(train_losses, mean(train_loss))
  if(!epoch%%10) cat(sprintf("Loss at epoch %d: %3f\n", epoch, mean(train_loss)))
}
```

`r unhide()`
:::

Remarks:

-   Mind the different input and output layer numbers.
-   The loss function increases randomly, because different subsets of the data were drawn. This is a downside of stochastic gradient descent.
-   A positive thing about stochastic gradient descent is, that local valleys or hills may be left and global ones can be found instead.
:::

## Underlying mathematical concepts - optional {#basicMath}

If are not yet familiar with the underlying concepts of neural networks and want to know more about that, it is suggested to read / view the following videos / sites. Consider the Links and videos with descriptions in parentheses as optional bonus.

***This might be useful to understand the further concepts in more depth.***

-   (<a href="https://en.wikipedia.org/wiki/Newton%27s_method#Description" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Newton%27s_method#Description</a> (Especially the animated graphic is interesting).)

-   <a href="https://en.wikipedia.org/wiki/Gradient_descent#Description" target="_blank" rel="noopener">https://en.wikipedia.org/wiki/Gradient_descent#Description</a>

-   <a href="https://mlfromscratch.com/neural-networks-explained/#/" target="_blank" rel="noopener">Neural networks (Backpropagation, etc.)</a>.

-   <a href="https://mlfromscratch.com/activation-functions-explained/#/" target="_blank" rel="noopener">Activation functions in detail</a> (requires the above as prerequisite).

***Videos about the topic***:

-   **Gradient descent explained**

```{r chunk_chapter3_35, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/sDv4f4s2SB8"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   (Stochastic gradient descent explained)

```{r chunk_chapter3_36, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/vMh0zPT0tLI"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   (Entropy explained)

```{r chunk_chapter3_37, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/YtebGVx-Fxw"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   **Short explanation of entropy, cross entropy and Kullback--Leibler divergence**

```{r chunk_chapter3_38, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/ErfnhcEV1O8"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   **Deep Learning (chapter 1)**

```{r chunk_chapter3_39, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/aircAruvnKk"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   **How neural networks learn - Deep Learning (chapter 2)**

```{r chunk_chapter3_40, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/IHZwWFHWa-w"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   **Backpropagation - Deep Learning (chapter 3)**

```{r chunk_chapter3_41, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/Ilg3gGewQ5U"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

-   **Another video about backpropagation (extends the previous one) - Deep Learning (chapter 4)**

```{r chunk_chapter3_42, eval=knitr::is_html_output(excludes = "epub"), results = 'asis', echo = F}
cat(
  '<iframe width="560" height="315"
  src="https://www.youtube.com/embed/tIeHLnjs5U8"
  frameborder="0" allow="accelerometer; autoplay; clipboard-write;
  encrypted-media; gyroscope; picture-in-picture" allowfullscreen>
  </iframe>'
)
```

### Caveats of neural network optimization

Depending on activation functions, it might occur that the network won't get updated, even with high learning rates (called *vanishing gradient*, especially for "sigmoid" functions). Furthermore, updates might overshoot (called *exploding gradients*) or activation functions will result in many zeros (especially for "relu", *dying relu*).

In general, the first layers of a network tend to learn (much) more slowly than subsequent ones.