- Many machine learning tasks are optimization problems, such as minimizing the error in a linear regression or maximizing the likelihood in a linear classification model
- Those optimization problems revolve around some parameter $\theta$ which has to be learned
- Determining this parameter can rarely be done using closed-form solutions
- Instead, different optimization methods are used
- If $\mathcal{X}$ is the domain, i.e. the possible values for $\theta$, and $f(\theta)$ is the function which needs to be optimized, then any optimization problem can be formulated like so: $$\theta^* = \arg \min_{\theta \in \mathcal{X}} f(\theta)$$
- The optimized parameter $\theta^*$ is therefore the global minimum of $f(\theta)$
- Determining this global minimum is the objective in an optimization
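- For example (a concrete instance added here, using standard notation not introduced above), least squares linear regression fits this template with $\mathcal{X} = \mathbb{R}^d$ and $f(\theta) = \|X\theta - y\|_2^2$, where $X$ is the data matrix and $y$ the target vector: $$\theta^* = \arg \min_{\theta \in \mathbb{R}^d} \|X\theta - y\|_2^2$$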
- Finding this minimum is often not as simple as computing the gradient and solving for $0$, since the number of parameters can be very large
- A domain is convex if and only if the line segment drawn between any two points in the domain is contained in the domain as well
- More formally, for all $x, y \in \mathcal{X}$ it must hold that $\lambda x + (1 - \lambda) y \in \mathcal{X}$ for all $\lambda \in [0, 1]$
- A function is convex if the set of all points on and above its graph (its epigraph) is convex
- This can also be expressed like so: $$\forall \, x, y \in \mathcal{X}: \ \lambda f(x) + (1 - \lambda)f(y) \geq f(\lambda x + (1 - \lambda)y), \ \text{for} \ \lambda \in [0, 1]$$
- If a function is convex, it has no local minima, only global ones, which makes optimization a lot simpler
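- A minimal numerical sketch of this definition (illustrative only; `looks_convex`, its tolerance, and its trial count are choices made here, and passing the test does not prove convexity):

```python
import numpy as np

def looks_convex(f, xs, n_trials=1000, seed=0):
    """Heuristically test the convexity inequality
    lambda*f(x) + (1-lambda)*f(y) >= f(lambda*x + (1-lambda)*y)
    on random pairs of sample points from xs."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.choice(xs, size=2)          # random pair of points
        lam = rng.uniform(0.0, 1.0)            # random lambda in [0, 1]
        chord = lam * f(x) + (1 - lam) * f(y)  # value on the connecting line
        if chord < f(lam * x + (1 - lam) * y) - 1e-12:
            return False                       # inequality violated
    return True

print(looks_convex(np.square, np.linspace(-5, 5, 100)))  # True:  x^2 is convex
print(looks_convex(np.sin, np.linspace(-5, 5, 100)))     # False: sin is not
```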
- For a differentiable convex function, the difference between two values $f(y)$ and $f(x)$ is bounded from below by the linear approximation at $x$ (first-order convexity)
- This fact can also be described like this:
$$f(y) - f(x) \geq (y - x)^T \nabla f(x)$$
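- As a quick sanity check (an example added here, not part of the notes above), for $f(x) = x^2$ with $\nabla f(x) = 2x$ the bound holds because $$f(y) - f(x) = y^2 - x^2 = (y - x) \cdot 2x + (y - x)^2 \geq (y - x) \cdot 2x = (y - x)^T \nabla f(x)$$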
- To prove that a function is convex, one of the following approaches can be used:
- It can be shown that the definition of convex functions holds
- A fact about convex functions, such as first order convexity, can be exploited
- It can be shown that a given function can be obtained by combining simple convex functions with operations that preserve convexity
- If $f_1$ and $f_2$ are convex functions and $g$ is a concave function, then a convex function $h$ can be created like this:
$$h(x) = f_1(x) + f_2(x)$$
$$h(x) = \max(f_1(x), f_2(x))$$
$$h(x) = c \cdot f_1(x), \ \text{if } c \geq 0$$
$$h(x) = c \cdot g(x), \ \text{if } c \leq 0$$
$$h(x) = f_1(Ax + b)$$
$$h(x) = m(f_1(x)), \ \text{if } m \ \text{is convex and non-decreasing}$$
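- As an illustration of combining these rules (an example added here, not from the list above): the hinge loss $h(\theta) = \max(0, 1 - y \cdot x^T\theta)$ is convex, because the constant $0$ and the affine function $1 - y \cdot x^T\theta$ are both convex, and by the $\max$ rule the maximum of two convex functions is convex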
![[Pasted image 20241125135355.png]]
- Even if a function is convex, there may still be problems that arise when trying to compute the minimum
- For example, the gradient may not be computable, a descent step may lead outside the domain, or the function may not be differentiable on the whole domain
![[Pasted image 20241125135343.png]]
- Gradient descent is an approach which iteratively computes gradients to approximate the minimum of a function
- Given a starting point $\theta$, the following three steps are repeated until a stopping criterion is satisfied (see the sketch after this list):
	- The gradient at $\theta$ is computed: $\Delta \theta = - \nabla f(\theta)$
	- The length $t$ to descend along the gradient is chosen in a line search: $t^* = \arg \min_{t > 0} f(\theta + t \cdot \Delta \theta)$
	- The reached point is chosen as the starting point for the next iteration: $\theta = \theta + t^* \cdot \Delta \theta$
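- A minimal sketch of this loop (assumptions made here: `f` and `grad_f` are user-supplied callables, and the exact line search is replaced by a crude grid search over candidate step lengths):

```python
import numpy as np

def gradient_descent_line_search(f, grad_f, theta0, n_iters=100, tol=1e-8):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        delta = -grad_f(theta)               # step 1: negative gradient
        if np.linalg.norm(delta) < tol:      # stopping criterion
            break
        ts = np.logspace(-4, 1, 50)          # step 2: approximate line search
        t_star = min(ts, key=lambda t: f(theta + t * delta))
        theta = theta + t_star * delta       # step 3: move to the new point
    return theta

# usage: minimize f(theta) = (theta - 3)^2, whose minimum lies at theta = 3
theta_min = gradient_descent_line_search(
    lambda th: np.sum((th - 3.0) ** 2),
    lambda th: 2.0 * (th - 3.0),
    theta0=np.array([0.0]),
)
```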
- Often, parallel computations can be used to accelerate the computation of the gradient, but the line search remains expensive
- To speed up the line search, one picks a learning rate $\tau$ before executing the gradient descent
- This $\tau$ will then be used as the length $t$ in the iteration process above
- Now the line search can be skipped, which speeds up the descent
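- The fixed-rate variant then reduces to the following sketch (`tau=0.1` is an arbitrary illustrative value, not a recommendation):

```python
import numpy as np

def gradient_descent_fixed(grad_f, theta0, tau=0.1, n_iters=1000):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - tau * grad_f(theta)  # fixed step length, no line search
    return theta
```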
- A few considerations need to be made when picking the learning rate
- If $\tau$ is too small, the convergence will be very slow and the descent may terminate in a local minimum
- If $\tau$ is too large, the descent oscillates too much and may never terminate at all
- The learning rate can be changed dynamically, which supports different strategies (a sketch combining two of them follows after this list):
	- $\tau$ may decrease as the descent continues in order to achieve fine-tuning in the later stages of the iteration
	- A momentum variable can be used so that the search accelerates as long as the gradients point in the same direction
	- $\tau$ may be different for different parameters and smaller if the gradient happens to be very large
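- A sketch combining a decaying $\tau$ with a momentum variable (the schedule $\tau_0 / (1 + k)$ and the values of `tau0` and `beta` are illustrative choices made here, not prescribed above):

```python
import numpy as np

def gradient_descent_momentum(grad_f, theta0, tau0=0.1, beta=0.9, n_iters=1000):
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)                 # momentum accumulator
    for k in range(n_iters):
        tau = tau0 / (1.0 + k)               # learning rate decays over time
        v = beta * v - tau * grad_f(theta)   # accelerates while gradients agree
        theta = theta + v
    return theta
```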