diff --git a/404.html b/404.html index 6ed77593..afd7434b 100644 --- a/404.html +++ b/404.html @@ -1 +1 @@ - 404 - MIT Parallel Computing and Scientific Machine Learning (SciML)

404


The requested page was not found



Click here to go back to the homepage.
\ No newline at end of file + 404 - MIT Parallel Computing and Scientific Machine Learning (SciML)

404


The requested page was not found



Click here to go back to the homepage.
\ No newline at end of file diff --git a/Project.toml b/Project.toml index 46929531..4a359e48 100644 --- a/Project.toml +++ b/Project.toml @@ -55,6 +55,7 @@ Plots = "1" SIMD = "3" Sobol = "1" StaticArrays = "1" +Statistics = "1" StatsPlots = "0.15" Sundials = "4" Traceur = "0.3" diff --git a/_weave/homework01/hw1/index.html b/_weave/homework01/hw1/index.html index 29384c9a..4764ee50 100644 --- a/_weave/homework01/hw1/index.html +++ b/_weave/homework01/hw1/index.html @@ -1 +1 @@ -

Homework 1, Parallelized Dynamics

Chris Rackauckas
September 15th, 2020

Due October 1st, 2020 at midnight EST.

Homework 1 is a chance to get some experience implementing discrete dynamical systems techniques in a way that is parallelized, and a time to understand the fundamental behavior of the bottleneck algorithms in scientific computing.

Problem 1: A Ton of New Facts on Newton

In lecture 4 we looked at the properties of discrete dynamical systems to see that running such a system for infinitely many steps converges to a steady state. This process is used as a numerical method known as fixed point iteration to solve for the steady state of systems $x_{n+1} = f(x_{n})$. Under a transformation (which we will do in this homework), it can be used to solve rootfinding problems $f(x) = 0$ for $x$.

In this problem we will look into Newton's method. Newton's method is the dynamical system defined by the update process:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_n)\right)^{-1} g(x_n) \]

For these problems, assume that $\frac{dg}{dx}$ is non-singular. We will prove a few properties to show why, in practice, Newton methods are preferred for quickly calculating the steady state.
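To fix intuition before the proofs, here is a minimal sketch of this update in Julia for a scalar problem (illustrative only; the names g and dg are placeholders for $g$ and $\frac{dg}{dx}$, not part of the assignment):

 # Newton iteration x_{n+1} = x_n - (dg/dx(x_n))^(-1) g(x_n), scalar case
 function newton(g, dg, x0; iters = 100, tol = 1e-12)
     x = x0
     for _ in 1:iters
         step = g(x) / dg(x)        # (dg/dx)^(-1) g(x) for scalars
         x -= step
         abs(step) < tol && break   # stop once the update is negligible
     end
     return x
 end

 newton(x -> x^2 - 2, x -> 2x, 1.0)  # converges to sqrt(2) ≈ 1.41421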

Part 1

Show that if $x^\ast$ is a steady state of the equation, then $g(x^\ast) = 0$.

Part 2

Take a look at the Quasi-Newton approximation:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

for some fixed $x_0$. Derive the stability of the Quasi-Newton approximation in the form of a matrix whose eigenvalues need to be constrained. Use this to argue that if $x_0$ is sufficiently close to $x^\ast$ then the steady state is a stable (attracting) steady state.

Part 3

Relaxed Quasi-Newton is the method:

\[ x_{n+1} = x_n - \alpha \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

Argue that, for some sufficiently small $\alpha$, the Relaxed Quasi-Newton iterations will be stable if the eigenvalues of $\left(\left(\frac{dg}{dx}(x_0)\right)^{-1} g(x)\right)^\prime$ are all positive for every $x$.

(Technically, these assumptions can be greatly relaxed, but weird cases arise. When $x \in \mathbb{C}$, this holds except on some set of Lebesgue measure zero. Feel free to explore this.)

Part 4

Fixed point iteration is the dynamical system

\[ x_{n+1} = g(x_n) \]

which converges to $g(x)=x$.

  1. What is a small change to the dynamical system that could be done such that $g(x)=0$ is the steady state?

  2. How can you change the $\left(\frac{dg}{dx}(x_0)\right)^{-1}$ term from the Quasi-Newton iteration to get a method equivalent to fixed point iteration? What does this imply about the difference in stability between Quasi-Newton and fixed point iteration if $\frac{dg}{dx}$ has large eigenvalues?

Problem 2: The Root of all Problems

In this problem we will practice writing fast and type-generic Julia code by producing an algorithm that will compute the quantile of any probability distribution.

Part 1

Many problems can be interpreted as a rootfinding problem. For example, let's take a look at a problem in statistics. Let $X$ be a random variable with a cumulative distribution function (CDF) of $cdf(x)$. Recall that the CDF is a monotonically increasing function with values in $[0,1]$ which gives the total probability that $X < x$. The $y$th quantile of $X$ is the value $x$ at which $X$ has probability $y$ of being less than $x$. Interpret the problem of computing an arbitrary quantile $y$ as a rootfinding problem, and use Newton's method to write an algorithm for computing $x$.

(Hint: Recall that $cdf^{\prime}(x) = pdf(x)$, the probability density function.)
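Concretely, one natural formulation (a sketch of the setup, not the only valid one) is to look for the root of $g(x) = cdf(x) - y$, for which the Newton update reads

\[ x_{n+1} = x_n - \frac{cdf(x_n) - y}{pdf(x_n)} \]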

Part 2

Use the types from Distributions.jl to write a function my_quantile(y,d) which uses multiple dispatch to compute the $y$th quantile for any UnivariateDistribution d from Distributions.jl. Test your function on Gamma(5, 1), Normal(0, 1), and Beta(2, 4) against the Distributions.quantile function built into the library.

(Hint: Have a keyword argument for $x_0$, and let its default be the mean or median of the distribution.)

Problem 3: Bifurcating Data for Parallelism

In this problem we will write code for efficient generation of the bifurcation diagram of the logistic equation.

Part 1

The logistic equation is the dynamical system given by the update relation:

\[ x_{n+1} = rx_n (1-x_n) \]

where $r$ is some parameter. Write a function which iterates the equation from $x_0 = 0.25$ enough times to be sufficiently close to its long-term behavior (400 iterations) and samples 150 points from the steady state attractor (i.e. output iterations 401:550) as a function of $r$, and mutates some vector as a solution, i.e. calc_attractor!(out,f,p,num_attract=150;warmup=400).

Test your function with $r = 2.9$. Double check that your function computes the correct result by calculating the analytical steady state value.
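For reference, the nonzero steady state can be computed by hand: setting $x^\ast = r x^\ast (1 - x^\ast)$ and dividing by $x^\ast$ gives

\[ x^\ast = 1 - \frac{1}{r} \]

which for $r = 2.9$ is approximately $0.655$.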

Part 2

The bifurcation plot shows how a steady state changes as a parameter changes. Compute the long-term result of the logistic equation at the values of r = 2.9:0.001:4, and plot the steady state values for each $r$ as an r x steady_attractor scatter plot. You should get a very bizarrely awesome picture, the bifurcation graph of the logistic equation.

(Hint: Generate a single matrix for the attractor values, and use calc_attractor! on views of columns for calculating the output, or inline the calc_attractor! computation directly onto the matrix, or even give calc_attractor! an input for what column to modify.)

Part 3

Multithread your bifurcation graph generator by performing different steady state calculations on different threads. Does your timing improve? Why? Be careful and check to make sure you have more than 1 thread!

Part 4

Multiprocess your bifurcation graph generator first by using pmap, and then by using @distributed. Does your timing improve? Why? Be careful to add processes before doing the distributed call.

(Note: You may need to change your implementation around to be allocating differently in order for it to be compatible with multiprocessing!)

Part 5

Which method is the fastest? Why?

\ No newline at end of file +

Homework 1, Parallelized Dynamics

Chris Rackauckas
September 15th, 2020

Due October 1st, 2020 at midnight EST.

Homework 1 is a chance to get some experience implementing discrete dynamical systems techniques in a way that is parallelized, and a time to understand the fundamental behavior of the bottleneck algorithms in scientific computing.

Problem 1: A Ton of New Facts on Newton

In lecture 4 we looked at the properties of discrete dynamical systems to see that running such a system for infinitely many steps converges to a steady state. This process is used as a numerical method known as fixed point iteration to solve for the steady state of systems $x_{n+1} = f(x_{n})$. Under a transformation (which we will do in this homework), it can be used to solve rootfinding problems $f(x) = 0$ for $x$.

In this problem we will look into Newton's method. Newton's method is the dynamical system defined by the update process:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_n)\right)^{-1} g(x_n) \]

For these problems, assume that $\frac{dg}{dx}$ is non-singular. We will prove a few properties to show why, in practice, Newton methods are preferred for quickly calculating the steady state.

Part 1

Show that if $x^\ast$ is a steady state of the equation, then $g(x^\ast) = 0$.

Part 2

Take a look at the Quasi-Newton approximation:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

for some fixed $x_0$. Derive the stability of the Quasi-Newton approximation in the form of a matrix whose eigenvalues need to be constrained. Use this to argue that if $x_0$ is sufficiently close to $x^\ast$ then the steady state is a stable (attracting) steady state.

Part 3

Relaxed Quasi-Newton is the method:

\[ x_{n+1} = x_n - \alpha \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

Argue that, for some sufficiently small $\alpha$, the Relaxed Quasi-Newton iterations will be stable if the eigenvalues of $\left(\left(\frac{dg}{dx}(x_0)\right)^{-1} g(x)\right)^\prime$ are all positive for every $x$.

(Technically, these assumptions can be greatly relaxed, but weird cases arise. When $x \in \mathbb{C}$, this holds except on some set of Lebesgue measure zero. Feel free to explore this.)

Part 4

Fixed point iteration is the dynamical system

\[ x_{n+1} = g(x_n) \]

which converges to $g(x)=x$.

  1. What is a small change to the dynamical system that could be done such that $g(x)=0$ is the steady state?

  2. How can you change the $\left(\frac{dg}{dx}(x_0)\right)^{-1}$ term from the Quasi-Newton iteration to get a method equivalent to fixed point iteration? What does this imply about the difference in stability between Quasi-Newton and fixed point iteration if $\frac{dg}{dx}$ has large eigenvalues?

Problem 2: The Root of all Problems

In this problem we will practice writing fast and type-generic Julia code by producing an algorithm that will compute the quantile of any probability distribution.

Part 1

Many problems can be interpreted as a rootfinding problem. For example, let's take a look at a problem in statistics. Let $X$ be a random variable with a cumulative distribution function (CDF) of $cdf(x)$. Recall that the CDF is a monotonically increasing function with values in $[0,1]$ which gives the total probability that $X < x$. The $y$th quantile of $X$ is the value $x$ at which $X$ has probability $y$ of being less than $x$. Interpret the problem of computing an arbitrary quantile $y$ as a rootfinding problem, and use Newton's method to write an algorithm for computing $x$.

(Hint: Recall that $cdf^{\prime}(x) = pdf(x)$, the probability density function.)

Part 2

Use the types from Distributions.jl to write a function my_quantile(y,d) which uses multiple dispatch to compute the $y$th quantile for any UnivariateDistribution d from Distributions.jl. Test your function on Gamma(5, 1), Normal(0, 1), and Beta(2, 4) against the Distributions.quantile function built into the library.

(Hint: Have a keyword argument for $x_0$, and let its default be the mean or median of the distribution.)

Problem 3: Bifurcating Data for Parallelism

In this problem we will write code for efficient generation of the bifurcation diagram of the logistic equation.

Part 1

The logistic equation is the dynamical system given by the update relation:

\[ x_{n+1} = rx_n (1-x_n) \]

where $r$ is some parameter. Write a function which iterates the equation from $x_0 = 0.25$ enough times to be sufficiently close to its long-term behavior (400 iterations) and samples 150 points from the steady state attractor (i.e. output iterations 401:550) as a function of $r$, and mutates some vector as a solution, i.e. calc_attractor!(out,f,p,num_attract=150;warmup=400).

Test your function with $r = 2.9$. Double check that your function computes the correct result by calculating the analytical steady state value.

Part 2

The bifurcation plot shows how a steady state changes as a parameter changes. Compute the long-term result of the logistic equation at the values of r = 2.9:0.001:4, and plot the steady state values for each $r$ as an r x steady_attractor scatter plot. You should get a very bizarrely awesome picture, the bifurcation graph of the logistic equation.

(Hint: Generate a single matrix for the attractor values, and use calc_attractor! on views of columns for calculating the output, or inline the calc_attractor! computation directly onto the matrix, or even give calc_attractor! an input for what column to modify.)

Part 3

Multithread your bifurcation graph generator by performing different steady state calculations on different threads. Does your timing improve? Why? Be careful and check to make sure you have more than 1 thread!

Part 4

Multiprocess your bifurcation graph generator first by using pmap, and then by using @distributed. Does your timing improve? Why? Be careful to add processes before doing the distributed call.

(Note: You may need to change your implementation around to be allocating differently in order for it to be compatible with multiprocessing!)

Part 5

Which method is the fastest? Why?

\ No newline at end of file diff --git a/_weave/homework02/hw2/index.html b/_weave/homework02/hw2/index.html index 5e421156..21512bc6 100644 --- a/_weave/homework02/hw2/index.html +++ b/_weave/homework02/hw2/index.html @@ -11,4 +11,4 @@

Homework 2

Chris Rackauckas
October 8th, 2020 module load julia-latest module load mpi/mpich-x86_64 -mpirun julia mycode.jl

to receive two cores on two nodes. Recreate the bandwidth vs message plots and the interpretation. Does the fact that the nodes are physically disconnected cause a substantial difference?

\ No newline at end of file +mpirun julia mycode.jl

to receive two cores on two nodes. Recreate the bandwidth vs message plots and the interpretation. Does the fact that the nodes are physically disconnected cause a substantial difference?

\ No newline at end of file diff --git a/_weave/homework03/hw3/index.html b/_weave/homework03/hw3/index.html index bf37b16e..ec4e60de 100644 --- a/_weave/homework03/hw3/index.html +++ b/_weave/homework03/hw3/index.html @@ -1 +1 @@ -

Neural Ordinary Differential Equation Adjoints

Chris Rackauckas
November 20th, 2020

In this homework, we will write an implementation of neural ordinary differential equations from scratch. You may use the DifferentialEquations.jl ODE solver, but not the adjoint sensitivities functionality. Optionally, a second problem is to add GPU support to your implementation.

Due December 9th, 2020 at midnight.

Please email the results to 18337.mit.psets@gmail.com.

Problem 1: Neural ODE from Scratch

In this problem we will work through the development of a neural ODE.

Part 1: Gradients as vjps

Use the definition of the pullback as a vector-Jacobian product (vjp) to show that $B_f^x(1) = \left( \nabla f(x) \right)^{T}$ for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$.

(Hint: if you put 1 into the pullback, what kind of function is it? What does the Jacobian look like?)

Part 2: Backpropagation of a neural network

Implement a simple $NN: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ neural network

\[ NN(u;W_i,b_i) = W_2 tanh.(W_1 u + b_1) + b_2 \]

where $W_1$ is $50 \times 2$, $b_1$ is length 50, $W_2$ is $2 \times 50$, and $b_2$ is length 2. Implement the pullback of the neural network: $B_{NN}^{u,W_i,b_i}(y)$ to calculate the derivative of the neural network with respect to each of these inputs. Check for correctness by using ForwardDiff.jl to calculate the gradient.
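As a starting point, a minimal sketch of just the forward pass with the stated sizes (randomly initialized weights; the variable names here are illustrative, not prescribed by the assignment):

 W1, b1 = randn(50, 2), zeros(50)
 W2, b2 = randn(2, 50), zeros(2)
 NN(u)  = W2 * tanh.(W1 * u + b1) + b2
 NN(randn(2))   # returns a length-2 vector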

Part 3: Implementing an ODE adjoint

The adjoint of an ODE can be described as the set of vector equations:

\[ \begin{align} u' &= f(u,p,t)\\ \end{align} \]

forward, and then

\[ \begin{align} \lambda' &= -\lambda^\ast \frac{\partial f}{\partial u}\\ \mu' &= -\lambda^\ast \frac{\partial f}{\partial p}\\ \end{align} \]

solved in reverse time from $T$ to $0$ for some cost function $C(p)$. For this problem, we will use the L2 loss function.

Note that $\mu(T) = 0$ and $\lambda(T) = \frac{\partial C}{\partial u(T)}$. This is written in the form where the only data point is at time $T$. If that is not the case, the reverse solve needs to add the jump $\frac{\partial C}{\partial u(t_i)}$ to $\lambda$ at each data point $u(t_i)$. Use this example for how to add these jumps to the equation.

Using this formulation of the adjoint, it holds that $\mu(0) = \frac{\partial C}{\partial p}$, and thus solving these ODEs in reverse gives the solution for the gradient as a part of the system at time zero.

Notice that $B_f^u(\lambda) = \lambda^\ast \frac{\partial f}{\partial u}$ and similarly for $\mu$. Implement an adjoint calculation for a neural ordinary differential equation where

\[ u' = NN(u) \]

from above. Solve the ODE forwards using OrdinaryDiffEq.jl's Tsit5() integrator, then use the interpolation from the forward pass for the u values of the backpass and solve.
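For the forward pass, the solution object returned by the solver carries a dense interpolation that can be evaluated at arbitrary times. A minimal sketch (u0 and T here are placeholders for your initial condition and final time, and NN is the network from Part 2):

 using OrdinaryDiffEq
 f(u, p, t) = NN(u)                   # neural ODE right-hand side
 prob = ODEProblem(f, u0, (0.0, T))
 sol  = solve(prob, Tsit5())          # dense output by default
 sol(0.5)                             # interpolated value of u at t = 0.5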

(Note: you will want to double check this gradient by using something like ForwardDiff! Start with only measuring the datapoint at the end, then try multiple data points.)

Part 4: Training the neural ODE

Generate data from the ODE $u' = Au$ where A = [-0.1 2.0; -2.0 -0.1] at t=0.0:0.1:1.0 (use saveat) with $u(0) = [2,0]$. Define the cost function C(θ) to be the Euclidean distance between the neural ODE's solution and the data. Optimize this cost function by using gradient descent where the gradient is your adjoint method's output.
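A sketch of just this data-generation step (the training loop itself is the exercise):

 using OrdinaryDiffEq
 A = [-0.1 2.0; -2.0 -0.1]
 datasol = solve(ODEProblem((u, p, t) -> A * u, [2.0, 0.0], (0.0, 1.0)), Tsit5(), saveat = 0.1)
 data = Array(datasol)   # 2×11 matrix of snapshots at t = 0.0:0.1:1.0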

(Note: calculate the cost and the gradient at the same time by using the forward pass to calculate the cost, and then use it in the adjoint for the interpolation. Note that you should not use saveat in the forward pass then, because otherwise the interpolation is linear. Instead, post-interpolate the data points.)

(Optional) Problem 2: Array-Based GPU Computing

If you have access to a GPU, you may wish to try the following.

Part 1: GPU Neural Network

Change your neural network to be GPU-accelerated by using CuArrays.jl for the underlying array types.

Part 2: GPU Neural ODE

Change the initial condition of the ODE solves to a CuArray to make your neural ODE GPU-accelerated.

\ No newline at end of file +

Neural Ordinary Differential Equation Adjoints

Chris Rackauckas
November 20th, 2020

In this homework, we will write an implementation of neural ordinary differential equations from scratch. You may use the DifferentialEquations.jl ODE solver, but not the adjoint sensitivities functionality. Optionally, a second problem is to add GPU support to your implementation.

Due December 9th, 2020 at midnight.

Please email the results to 18337.mit.psets@gmail.com.

Problem 1: Neural ODE from Scratch

In this problem we will work through the development of a neural ODE.

Part 1: Gradients as vjps

Use the definition of the pullback as a vector-Jacobian product (vjp) to show that $B_f^x(1) = \left( \nabla f(x) \right)^{T}$ for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$.

(Hint: if you put 1 into the pullback, what kind of function is it? What does the Jacobian look like?)

Part 2: Backpropagation of a neural network

Implement a simple $NN: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ neural network

\[ NN(u;W_i,b_i) = W_2 tanh.(W_1 u + b_1) + b_2 \]

where $W_1$ is $50 \times 2$, $b_1$ is length 50, $W_2$ is $2 \times 50$, and $b_2$ is length 2. Implement the pullback of the neural network: $B_{NN}^{u,W_i,b_i}(y)$ to calculate the derivative of the neural network with respect to each of these inputs. Check for correctness by using ForwardDiff.jl to calculate the gradient.

Part 3: Implementing an ODE adjoint

The adjoint of an ODE can be described as the set of vector equations:

\[ \begin{align} u' &= f(u,p,t)\\ \end{align} \]

forward, and then

\[ \begin{align} \lambda' &= -\lambda^\ast \frac{\partial f}{\partial u}\\ \mu' &= -\lambda^\ast \frac{\partial f}{\partial p}\\ \end{align} \]

solved in reverse time from $T$ to $0$ for some cost function $C(p)$. For this problem, we will use the L2 loss function.

Note that $\mu(T) = 0$ and $\lambda(T) = \frac{\partial C}{\partial u(T)}$. This is written in the form where the only data point is at time $T$. If that is not the case, the reverse solve needs to add the jump $\frac{\partial C}{\partial u(t_i)}$ to $\lambda$ at each data point $u(t_i)$. Use this example for how to add these jumps to the equation.

Using this formulation of the adjoint, it holds that $\mu(0) = \frac{\partial C}{\partial p}$, and thus solving these ODEs in reverse gives the solution for the gradient as a part of the system at time zero.

Notice that $B_f^u(\lambda) = \lambda^\ast \frac{\partial f}{\partial u}$ and similarly for $\mu$. Implement an adjoint calculation for a neural ordinary differential equation where

\[ u' = NN(u) \]

from above. Solve the ODE forwards using OrdinaryDiffEq.jl's Tsit5() integrator, then use the interpolation from the forward pass for the u values of the backpass and solve.

(Note: you will want to double check this gradient by using something like ForwardDiff! Start with only measuring the datapoint at the end, then try multiple data points.)

Part 4: Training the neural ODE

Generate data from the ODE $u' = Au$ where A = [-0.1 2.0; -2.0 -0.1] at t=0.0:0.1:1.0 (use saveat) with $u(0) = [2,0]$. Define the cost function C(θ) to be the Euclidean distance between the neural ODE's solution and the data. Optimize this cost function by using gradient descent where the gradient is your adjoint method's output.

(Note: calculate the cost and the gradient at the same time by using the forward pass to calculate the cost, and then use it in the adjoint for the interpolation. Note that you should not use saveat in the forward pass then, because otherwise the interpolation is linear. Instead, post-interpolate the data points.)

(Optional) Problem 2: Array-Based GPU Computing

If you have access to a GPU, you may wish to try the following.

Part 1: GPU Neural Network

Change your neural network to be GPU-accelerated by using CuArrays.jl for the underlying array types.

Part 2: GPU Neural ODE

Change the initial condition of the ODE solves to a CuArray to make your neural ODE GPU-accelerated.

\ No newline at end of file diff --git a/_weave/lecture02/jl_UdliHl/optimizing_16_1.png b/_weave/lecture02/jl_UdliHl/optimizing_16_1.png new file mode 100644 index 00000000..c214ca62 Binary files /dev/null and b/_weave/lecture02/jl_UdliHl/optimizing_16_1.png differ diff --git a/_weave/lecture02/jl_UdliHl/optimizing_17_1.png b/_weave/lecture02/jl_UdliHl/optimizing_17_1.png new file mode 100644 index 00000000..af3f7bcd Binary files /dev/null and b/_weave/lecture02/jl_UdliHl/optimizing_17_1.png differ diff --git a/_weave/lecture02/jl_h3cAOL/optimizing_16_1.png b/_weave/lecture02/jl_h3cAOL/optimizing_16_1.png deleted file mode 100644 index 6b4b8995..00000000 Binary files a/_weave/lecture02/jl_h3cAOL/optimizing_16_1.png and /dev/null differ diff --git a/_weave/lecture02/jl_h3cAOL/optimizing_17_1.png b/_weave/lecture02/jl_h3cAOL/optimizing_17_1.png deleted file mode 100644 index 08166f19..00000000 Binary files a/_weave/lecture02/jl_h3cAOL/optimizing_17_1.png and /dev/null differ diff --git a/_weave/lecture02/optimizing/index.html b/_weave/lecture02/optimizing/index.html index 7b178761..a95523e4 100644 --- a/_weave/lecture02/optimizing/index.html +++ b/_weave/lecture02/optimizing/index.html @@ -10,7 +10,7 @@

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_rows!(C,A,B)
-16.500 μs (0 allocations: 0 bytes)
+21.600 μs (0 allocations: 0 bytes)
 
 function inner_cols!(C,A,B)
   for j in 1:100, i in 1:100
@@ -19,7 +19,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_cols!(C,A,B)
-7.933 μs (0 allocations: 0 bytes)
+9.500 μs (0 allocations: 0 bytes)
 

Lower Level View: The Stack and the Heap

Locally, memory is composed of a stack and a heap. The stack requires a static allocation: it is ordered. Because it's ordered, it is very clear where things are in the stack, and therefore accesses are very quick (think instantaneous). However, because this is static, it requires that the size of the variables is known at compile time (to determine all of the variable locations). Since that is not possible for all variables, there exists the heap. The heap is essentially a stack of pointers to objects in memory. When heap variables are needed, their values are pulled up the cache chain and accessed.

Heap Allocations and Speed

Heap allocations are costly because they involve this pointer indirection, so stack allocation should be done when sensible (it's not helpful for really large arrays, but for small values like scalars it's essential!)

 function inner_alloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -29,7 +29,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_alloc!(C,A,B)
-314.598 μs (10000 allocations: 625.00 KiB)
+363.501 μs (10000 allocations: 625.00 KiB)
 
 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -39,7 +39,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_noalloc!(C,A,B)
-7.533 μs (0 allocations: 0 bytes)
+8.800 μs (0 allocations: 0 bytes)
 

Why does the array here get heap-allocated? The compiler isn't able to prove/guarantee at compile-time that the array's size will always be a given value, and thus it allocates it to the heap. @btime tells us this allocation occurred and shows us the total heap memory that was taken. Meanwhile, the size of a Float64 number is known at compile-time (64-bits), and so this is stored onto the stack and given a specific location that the compiler will be able to directly address.

Note that one can use the StaticArrays.jl library to get statically-sized arrays and thus arrays which are stack-allocated:

 using StaticArrays
 function static_inner_alloc!(C,A,B)
@@ -50,7 +50,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime static_inner_alloc!(C,A,B)
-8.166 μs (0 allocations: 0 bytes)
+9.200 μs (0 allocations: 0 bytes)
 

Mutation to Avoid Heap Allocations

Many times you do need to write into an array, so how can you write into an array without performing a heap allocation? The answer is mutation. Mutation is changing the values of an already existing array. In that case, no free memory has to be found to put the array (and no memory has to be freed by the garbage collector).

In Julia, functions which mutate the first value are conventionally noted by a !. See the difference between these two equivalent functions:

 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -60,7 +60,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_noalloc!(C,A,B)
-7.733 μs (0 allocations: 0 bytes)
+9.600 μs (0 allocations: 0 bytes)
 
 function inner_alloc(A,B)
   C = similar(A)
@@ -71,7 +71,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_alloc(A,B)
-15.100 μs (2 allocations: 78.17 KiB)
+16.200 μs (2 allocations: 78.17 KiB)
 

To use this algorithm effectively, the ! algorithm assumes that the caller already has allocated the output array to put as the output argument. If that is not true, then one would need to manually allocate. The goal of that interface is to give the caller control over the allocations to allow them to manually reduce the total number of heap allocations and thus increase the speed.

Julia's Broadcasting Mechanism

Wouldn't it be nice to not have to write the loop there? In many high level languages this is simply called vectorization. In Julia, we will call it array vectorization to distinguish it from the SIMD vectorization which is common in lower level languages like C, Fortran, and Julia.

In Julia, if you use . on an operator it will transform it to the broadcasted form. Broadcast is lazy: it will build up an entire .'d expression and then call broadcast! on the composed expression. This is customizable and documented in detail. However, to a first approximation we can think of the broadcast mechanism as a mechanism for building fused expressions. For example, the Julia code:

 A .+ B .+ C;
 

under the hood lowers to something like:
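Something like the following fused element-wise kernel gives the right mental model (a sketch, not the exact lowered code):

 map((a, b, c) -> a + b + c, A, B, C);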

@@ -86,29 +86,29 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime unfused(A,B,C);
-31.800 μs (4 allocations: 156.34 KiB)
+31.200 μs (4 allocations: 156.34 KiB)
 
 fused(A,B,C) = A .+ B .+ C
 @btime fused(A,B,C);
 
-18.400 μs (2 allocations: 78.17 KiB)
+18.600 μs (2 allocations: 78.17 KiB)
 

Note that we can also fuse the output by using .=. This is essentially the vectorized version of a ! function:

 D = similar(A)
 fused!(D,A,B,C) = (D .= A .+ B .+ C)
 @btime fused!(D,A,B,C);
 
-10.800 μs (0 allocations: 0 bytes)
+10.500 μs (0 allocations: 0 bytes)
 

Note on Broadcasting Function Calls

Julia allows for broadcasting the call () operator as well. .() will call the function element-wise on all arguments, so sin.(A) will be the elementwise sine function. This will fuse with the rest of the broadcasted expression like the other operators.

Note on Vectorization and Speed

In articles on MATLAB, Python, R, etc., this is where you will be told to vectorize your code. Notice from above that, in Julia, there is essentially no performance difference between writing loops and using vectorized broadcasts. This is not abnormal! The reason why you are told to vectorize code in these other languages is because they have a high per-operation overhead (which will be discussed further down). This means that every call, like +, is costly in these languages. To get around this issue and make the language usable, someone wrote and compiled the loop for the C/Fortran function that does the broadcasted form (see numpy's Github repo). Thus A .+ B's MATLAB/Python/R equivalents are calling a single C function to generally avoid the cost of function calls and thus are faster.

But this is not an intrinsic property of vectorization. Vectorization isn't "fast" in these languages, it's just close to the correct speed. The reason vectorization is recommended is because looping is slow in these languages. Because looping isn't slow in Julia (or C, C++, Fortran, etc.), loops and vectorization generally have the same speed. So use the one that works best for your code without a care about performance.

(As a small side effect, these high level languages tend to allocate a lot of temporary variables since the individual C kernels are written for specific numbers of inputs and thus don't naturally fuse. Julia's broadcast mechanism is just generating and JIT compiling Julia functions on the fly, and thus it can accommodate the combinatorial explosion in the number of possible expressions by compiling only the combinations that are actually needed by a specific code.)

Heap Allocations from Slicing

It's important to note that slices in Julia produce copies instead of views. Thus for example:

 A[50,50]
 
-0.6987883598884902
+0.1874936774122129
 

allocates a new output. This is for safety, since if it pointed to the same array then writing to it would change the original array. We can demonstrate this by asking for a view instead of a copy.

 @show A[1]
 E = @view A[1:5,1:5]
 E[1] = 2.0
 @show A[1]
 
-A[1] = 0.49711339210357286
+A[1] = 0.6237197114754515
 A[1] = 2.0
 2.0
 

However, this means that @view A[1:5,1:5] did not allocate an array (it does allocate a pointer if the escape analysis is unable to prove that the allocation can be elided; this means that in small loops there will be no allocation, while if the view is returned from a function, for example, it will allocate the pointer, ~80 bytes, but not the memory of the array, so it is O(1) in cost with a relatively small constant).

Asymptotic Cost of Heap Allocations

Heap allocations have to locate and prepare a space in RAM that is proportional to the amount of memory being allocated, which means that the cost of a heap allocation for an array is O(n), with a large constant. As RAM begins to fill up, this cost dramatically increases. If you run out of RAM, your computer may begin to use swap, which is essentially RAM simulated on your hard drive. Generally when you hit swap your performance is so dead that you may think that your computation froze, but if you check your resource use you will notice that it's actually just filled the RAM and started to use the swap.

But think of it as O(n) with a large constant factor. This means that for operations which only touch the data once, heap allocations can dominate the computational cost:

@@ -130,7 +130,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe plot(ns,alloc,label="=",xscale=:log10,yscale=:log10,legend=:bottomright, title="Micro-optimizations matter for BLAS1") plot!(ns,noalloc,label=".=") -

However, when the computation takes O(n^3), like in matrix multiplications, the high constant factor only comes into play when the matrices are sufficiently small:

+

However, when the computation takes O(n^3), like in matrix multiplications, the high constant factor only comes into play when the matrices are sufficiently small:

 using LinearAlgebra, BenchmarkTools
 function alloc_timer(n)
     A = rand(n,n)
@@ -149,7 +149,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe plot(ns,alloc,label="*",xscale=:log10,yscale=:log10,legend=:bottomright, title="Micro-optimizations only matter for small matmuls") plot!(ns,noalloc,label="mul!") -

Though using a mutating form is never bad and always is a little bit better.

Optimizing Memory Use Summary

Julia's Type Inference and the Compiler

Many people think Julia is fast because it is JIT compiled. That is simply not true (we've already shown examples where Julia code isn't fast, but it's always JIT compiled!). Instead, the reason why Julia is fast is because of the combination of two ideas:

These two features naturally give rise to Julia's core design feature: multiple dispatch. Let's break down these pieces.

Type Inference

At the core level of the computer, everything has a type. Some languages are more explicit about said types, while others try to hide the types from the user. A type tells the compiler how to store and interpret the memory of a value. For example, if the compiled code knows that the value in the register is supposed to be interpreted as a 64-bit floating point number, then it understands that slab of memory like:

Importantly, it will know what to do for function calls. If the code tells it to add two floating point numbers, it will send them as inputs to the Floating Point Unit (FPU) which will give the output.

If the types are not known, then... ? So one cannot actually compute until the types are known, since otherwise it's impossible to interpret the memory. In languages like C, the programmer has to declare the types of variables in the program:

void add(double *a, double *b, double *c, size_t n){
+

Though using a mutating form is never bad and always is a little bit better.

Optimizing Memory Use Summary

Julia's Type Inference and the Compiler

Many people think Julia is fast because it is JIT compiled. That is simply not true (we've already shown examples where Julia code isn't fast, but it's always JIT compiled!). Instead, the reason why Julia is fast is because of the combination of two ideas:

These two features naturally give rise to Julia's core design feature: multiple dispatch. Let's break down these pieces.

Type Inference

At the core level of the computer, everything has a type. Some languages are more explicit about said types, while others try to hide the types from the user. A type tells the compiler how to store and interpret the memory of a value. For example, if the compiled code knows that the value in the register is supposed to be interpreted as a 64-bit floating point number, then it understands that slab of memory like:

Importantly, it will know what to do for function calls. If the code tells it to add two floating point numbers, it will send them as inputs to the Floating Point Unit (FPU) which will give the output.

If the types are not known, then... ? So one cannot actually compute until the types are known, since otherwise it's impossible to interpret the memory. In languages like C, the programmer has to declare the types of variables in the program:

void add(double *a, double *b, double *c, size_t n){
   size_t i;
   for(i = 0; i < n; ++i) {
     c[i] = a[i] + b[i];
@@ -172,7 +172,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define i64 @julia_f_2897(i64 signext %0, i64 signext %1) #0 {
+define i64 @julia_f_2893(i64 signext %0, i64 signext %1) #0 {
 top:
 ; ┌ @ int.jl:87 within `+`
    %2 = add i64 %1, %0
@@ -184,7 +184,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define double @julia_f_2899(double %0, double %1) #0 {
+define double @julia_f_2895(double %0, double %1) #0 {
 top:
 ; ┌ @ float.jl:408 within `+`
    %2 = fadd double %0, %1
@@ -204,7 +204,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define i64 @julia_g_2901(i64 signext %0, i64 signext %1) #0 {
+define i64 @julia_g_2897(i64 signext %0, i64 signext %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 6 within `g`
@@ -249,7 +249,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define double @julia_f_3427(double %0, i64 signext %1) #0 {
+define double @julia_f_3423(double %0, i64 signext %1) #0 {
 top:
 ; ┌ @ promotion.jl:410 within `+`
 ; │┌ @ promotion.jl:381 within `promote`
@@ -290,7 +290,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define double @julia_g_3430(double %0, i64 signext %1) #0 {
+define double @julia_g_3426(double %0, i64 signext %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 5 within `g`
@@ -360,7 +360,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe 0.4

The + function in Julia is just defined as +(a,b), and we can actually point to that code in the Julia distribution:

 @which +(2.0,5)
-
+(x::Number, y::Number) in Base at promotion.jl:410

To control at a higher level, Julia uses abstract types. For example, Float64 <: AbstractFloat, meaning Float64s are a subtype of AbstractFloat. We also have that Int <: Integer, while both AbstractFloat <: Number and Integer <: Number.

Julia allows the user to define dispatches at a higher level, and the version that is called is the most strict version that is correct. For example, right now with ff we will get a MethodError if we call it between an Int and a Float64 because no such method exists:

+
+(x::Number, y::Number) in Base at promotion.jl:410

To control at a higher level, Julia uses abstract types. For example, Float64 <: AbstractFloat, meaning Float64s are a subtype of AbstractFloat. We also have that Int <: Integer, while both AbstractFloat <: Number and Integer <: Number.

Julia allows the user to define dispatches at a higher level, and the version that is called is the most strict version that is correct. For example, right now with ff we will get a MethodError if we call it between an Int and a Float64 because no such method exists:

 ff(2.0,5)
 
 ERROR: MethodError: no method matching ff(::Float64, ::Int64)
@@ -381,7 +381,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `ff`
-define double @julia_ff_3551(double %0, i64 signext %1) #0 {
+define double @julia_ff_3547(double %0, i64 signext %1) #0 {
 top:
 ; ┌ @ promotion.jl:410 within `+`
 ; │┌ @ promotion.jl:381 within `promote`
@@ -537,7 +537,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define void @julia_g_3690([2 x double]* noalias nocapture noundef nonnull s
+define void @julia_g_3686([2 x double]* noalias nocapture noundef nonnull s
 ret([2 x double]) align 8 dereferenceable(16) %0, [2 x double]* nocapture n
 oundef nonnull readonly align 8 dereferenceable(16) %1, [2 x double]* nocap
 ture noundef nonnull readonly align 8 dereferenceable(16) %2) #0 {
@@ -653,7 +653,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define [2 x float] @julia_g_3702([2 x float]* nocapture noundef nonnull rea
+define [2 x float] @julia_g_3698([2 x float]* nocapture noundef nonnull rea
 donly align 4 dereferenceable(8) %0, [2 x float]* nocapture noundef nonnull
  readonly align 4 dereferenceable(8) %1) #0 {
 top:
@@ -754,7 +754,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define void @julia_g_3721([2 x {}*]* noalias nocapture noundef nonnull sret
+define void @julia_g_3717([2 x {}*]* noalias nocapture noundef nonnull sret
 ([2 x {}*]) align 8 dereferenceable(16) %0, [2 x {}*]* nocapture noundef no
 nnull readonly align 8 dereferenceable(16) %1, [2 x {}*]* nocapture noundef
  nonnull readonly align 8 dereferenceable(16) %2) #0 {
@@ -788,22 +788,22 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe store {}** %13, {}*** %12, align 8 %14 = bitcast {}*** %pgcstack to {}*** store {}** %gcframe2.sub, {}*** %14, align 8 - call void @"j_+_3723"([2 x {}*]* noalias nocapture noundef nonnull sret( -[2 x {}*]) %9, [2 x {}*]* nocapture nonnull readonly %1, i64 signext 4) #0 + call void @"j_+_3719"([2 x {}*]* noalias nocapture noundef nonnull sret( +[2 x {}*]) %5, [2 x {}*]* nocapture nonnull readonly %1, i64 signext 4) #0 ; └ ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd: 6 within `g` ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd :2 within `f` - call void @"j_+_3724"([2 x {}*]* noalias nocapture noundef nonnull sret( -[2 x {}*]) %5, i64 signext 2, [2 x {}*]* nocapture readonly %9) #0 + call void @"j_+_3720"([2 x {}*]* noalias nocapture noundef nonnull sret( +[2 x {}*]) %9, i64 signext 2, [2 x {}*]* nocapture readonly %5) #0 ; └ ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd: 7 within `g` ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd :2 within `f` - call void @"j_+_3725"([2 x {}*]* noalias nocapture noundef nonnull sret( -[2 x {}*]) %7, [2 x {}*]* nocapture readonly %5, [2 x {}*]* nocapture nonnu + call void @"j_+_3721"([2 x {}*]* noalias nocapture noundef nonnull sret( +[2 x {}*]) %7, [2 x {}*]* nocapture readonly %9, [2 x {}*]* nocapture nonnu ll readonly %2) #0 ; └ %15 = bitcast [2 x {}*]* %0 to i8* @@ -836,28 +836,28 @@

Optimizing Serial Code

Chris Rackauckas
Septe b = MyComplex(2.0,1.0) @btime g(a,b)
-22.189 ns (1 allocation: 32 bytes)
+29.548 ns (1 allocation: 32 bytes)
 MyComplex(9.0, 2.0)
 
 a = MyParameterizedComplex(1.0,1.0)
 b = MyParameterizedComplex(2.0,1.0)
 @btime g(a,b)
 
-22.088 ns (1 allocation: 32 bytes)
+26.835 ns (1 allocation: 32 bytes)
 MyParameterizedComplex{Float64}(9.0, 2.0)
 
 a = MySlowComplex(1.0,1.0)
 b = MySlowComplex(2.0,1.0)
 @btime g(a,b)
 
-130.643 ns (5 allocations: 96 bytes)
+141.829 ns (5 allocations: 96 bytes)
 MySlowComplex(9.0, 2.0)
 
 a = MySlowComplex2(1.0,1.0)
 b = MySlowComplex2(2.0,1.0)
 @btime g(a,b)
 
-871.875 ns (14 allocations: 288 bytes)
+931.034 ns (14 allocations: 288 bytes)
 MySlowComplex2(9.0, 2.0)
 

Note on Julia

Note that, because of these properties (type specialization, value types, etc.), the number types, even ones such as Int, Float64, and Complex, are all themselves implemented in pure Julia! Thus even basic pieces can be implemented in Julia with full performance, given one uses the features correctly.

Note on isbits

Note that a type which is mutable struct will not be isbits. This means that mutable structs will be a pointer to a heap allocated object, unless it's short-lived and the compiler can erase its construction. Also, note that isbits types compile down to bit operations from pure Julia, which means that these types can directly compile to GPU kernels through CUDAnative without modification.
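For example, isbitstype can be used to check this property (a small illustrative check, not from the lecture code):

 struct ImmutablePoint; x::Float64; y::Float64; end
 mutable struct MutablePoint; x::Float64; y::Float64; end
 isbitstype(ImmutablePoint)   # true: plain bits, can live on the stack or in registers
 isbitstype(MutablePoint)     # false: lives behind a pointer to a heap-allocated object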

Function Barriers

Since functions automatically specialize on their input types in Julia, we can use this to our advantage in order to make an inner loop fully inferred. For example, take the code from above but with a loop:

 function r(x)
@@ -872,7 +872,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime r(x)
-6.140 μs (300 allocations: 4.69 KiB)
+6.725 μs (300 allocations: 4.69 KiB)
 604.0
 

In here, the loop variables are not inferred and thus this is really slow. However, we can force a function call in the middle to end up with specialization and in the inner loop be stable:

 s(x) = _s(x[1],x[2])
@@ -888,7 +888,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe end @btime s(x)
-309.829 ns (1 allocation: 16 bytes)
+332.200 ns (1 allocation: 16 bytes)
 604.0
 

Notice that this algorithm still doesn't infer:

 @code_warntype s(x)
@@ -920,7 +920,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `fff`
-define i64 @julia_fff_3875(i64 signext %0) #0 {
+define i64 @julia_fff_3871(i64 signext %0) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 8 within `fff`
@@ -934,7 +934,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `fff`
-define double @julia_fff_3877(double %0) #0 {
+define double @julia_fff_3873(double %0) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 8 within `fff`
@@ -949,7 +949,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe C[i,j] = A[i,j] + B[i,j] end
-804.197 μs (30000 allocations: 468.75 KiB)
+1.011 ms (30000 allocations: 468.75 KiB)
 

This is very slow because the types of A, B, and C cannot be inferred. Why can't they be inferred? Well, at any time in the dynamic REPL scope I can do something like C = "haha now a string!", and thus it cannot specialize on the types currently existing in the REPL (since asynchronous changes could also occur), and therefore it defaults back to doing a type check at every single function call, which slows it down. Moral of the story: Julia functions are fast, but its global scope is too dynamic to be optimized.
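The standard remedy is to put the loop behind a function so that the argument types become inferable (a sketch with an assumed name; A, B, and C are the same arrays as above):

 function add_loop!(C, A, B)           # inside a function, argument types are inferred
     for j in 1:100, i in 1:100
         C[i,j] = A[i,j] + B[i,j]
     end
 end
 @btime add_loop!(C, A, B)             # no per-iteration type checks or allocations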

Summary

Overheads of Individual Operations

Now let's dig even a little deeper. Everything the processor does has a cost. A great chart to keep in mind is this classic one. A few things should immediately jump out to you:

Bounds Checking

Let's check the LLVM IR on one of our earlier loops:

 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -961,7 +961,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `inner_noalloc!`
-define nonnull {}* @"japi1_inner_noalloc!_3886"({}* %0, {}** noalias nocapt
+define nonnull {}* @"japi1_inner_noalloc!_3882"({}* %0, {}** noalias nocapt
 ure noundef readonly %1, i32 %2) #0 {
 top:
   %3 = alloca {}**, align 8
@@ -1104,7 +1104,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe br i1 %.not18, label %L36, label %L2 L36: ; preds = %L25 - ret {}* inttoptr (i64 139639611547656 to {}*) + ret {}* inttoptr (i64 139979581812744 to {}*) oob: ; preds = %L5.us.us.postl oop, %L2.split.us.L2.split.us.split_crit_edge, %L2 @@ -1229,17 +1229,17 @@

Optimizing Serial Code

Chris Rackauckas
Septe end @btime inner_noalloc!(C,A,B)
-7.600 μs (0 allocations: 0 bytes)
+8.500 μs (0 allocations: 0 bytes)
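A sketch of what an @inbounds variant like inner_noalloc_ib! presumably looks like (the same loop with bounds checking disabled):

 function inner_noalloc_ib!(C,A,B)
   @inbounds for j in 1:100, i in 1:100
     C[i,j] = A[i,j] + B[i,j]
   end
 end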
 
 @btime inner_noalloc_ib!(C,A,B)
 
-5.016 μs (0 allocations: 0 bytes)
+5.450 μs (0 allocations: 0 bytes)
 

SIMD

Now let's inspect the LLVM IR again:

 @code_llvm inner_noalloc_ib!(C,A,B)
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `inner_noalloc_ib!`
-define nonnull {}* @"japi1_inner_noalloc_ib!_3922"({}* %0, {}** noalias noc
+define nonnull {}* @"japi1_inner_noalloc_ib!_3918"({}* %0, {}** noalias noc
 apture noundef readonly %1, i32 %2) #0 {
 top:
   %3 = alloca {}**, align 8
@@ -1374,7 +1374,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe br i1 %.not.not10, label %L36, label %L2 L36: ; preds = %L25 - ret {}* inttoptr (i64 139639611547656 to {}*) + ret {}* inttoptr (i64 139979581812744 to {}*) }

If you look closely, you will see things like:

%wide.load24 = load <4 x double>, <4 x double> addrspac(13)* %46, align 8
 ; └
@@ -1383,7 +1383,7 @@ 

Optimizing Serial Code

Chris Rackauckas
Septe @code_llvm fma(2.0,5.0,3.0)
 ;  @ floatfuncs.jl:426 within `fma`
-define double @julia_fma_3923(double %0, double %1, double %2) #0 {
+define double @julia_fma_3919(double %0, double %1, double %2) #0 {
 common.ret:
 ; ┌ @ floatfuncs.jl:421 within `fma_llvm`
    %3 = call double @llvm.fma.f64(double %0, double %1, double %2)
@@ -1430,7 +1430,7 @@ 

Inlining

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 4 within `qinline`
-define double @julia_qinline_3926(double %0, double %1) #0 {
+define double @julia_qinline_3922(double %0, double %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 7 within `qinline`
@@ -1467,17 +1467,17 @@ 

Inlining

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 11 within `qnoinline`
-define double @julia_qnoinline_3928(double %0, double %1) #0 {
+define double @julia_qnoinline_3924(double %0, double %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 14 within `qnoinline`
-  %2 = call double @j_fnoinline_3930(double %0, i64 signext 4) #0
+  %2 = call double @j_fnoinline_3926(double %0, i64 signext 4) #0
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 15 within `qnoinline`
-  %3 = call double @j_fnoinline_3931(i64 signext 2, double %2) #0
+  %3 = call double @j_fnoinline_3927(i64 signext 2, double %2) #0
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 16 within `qnoinline`
-  %4 = call double @j_fnoinline_3932(double %3, double %1) #0
+  %4 = call double @j_fnoinline_3928(double %3, double %1) #0
   ret double %4
 }
 
@@ -1496,7 +1496,7 @@

Inlining

-22.390 ns (1 allocation: 16 bytes)
+27.839 ns (1 allocation: 16 bytes)
 9.0
 
@@ -1508,7 +1508,7 @@

Inlining

-26.004 ns (1 allocation: 16 bytes)
+31.690 ns (1 allocation: 16 bytes)
 9.0
 
@@ -1536,7 +1536,7 @@

Note on Benchmarking

-1.699 ns (0 allocations: 0 bytes)
+1.900 ns (0 allocations: 0 bytes)
 9.0
 
@@ -1553,7 +1553,7 @@

Note on Benchmarking

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `cheat`
-define double @julia_cheat_3960() #0 {
+define double @julia_cheat_3956() #0 {
 top:
   ret double 9.000000e+00
 }
@@ -1578,6 +1578,6 @@ 

Discussion Questions

\ No newline at end of file diff --git a/_weave/lecture03/jl_7PHY3B/sciml_35_1.png b/_weave/lecture03/jl_7PHY3B/sciml_35_1.png new file mode 100644 index 00000000..6f6be881 Binary files /dev/null and b/_weave/lecture03/jl_7PHY3B/sciml_35_1.png differ diff --git a/_weave/lecture03/jl_A7oobX/sciml_36_1.png b/_weave/lecture03/jl_7PHY3B/sciml_36_1.png similarity index 100% rename from _weave/lecture03/jl_A7oobX/sciml_36_1.png rename to _weave/lecture03/jl_7PHY3B/sciml_36_1.png diff --git a/_weave/lecture03/jl_A7oobX/sciml_37_1.png b/_weave/lecture03/jl_7PHY3B/sciml_37_1.png similarity index 100% rename from _weave/lecture03/jl_A7oobX/sciml_37_1.png rename to _weave/lecture03/jl_7PHY3B/sciml_37_1.png diff --git a/_weave/lecture03/jl_7PHY3B/sciml_41_1.png b/_weave/lecture03/jl_7PHY3B/sciml_41_1.png new file mode 100644 index 00000000..d6df7e6b Binary files /dev/null and b/_weave/lecture03/jl_7PHY3B/sciml_41_1.png differ diff --git a/_weave/lecture03/jl_A7oobX/sciml_42_1.png b/_weave/lecture03/jl_7PHY3B/sciml_42_1.png similarity index 100% rename from _weave/lecture03/jl_A7oobX/sciml_42_1.png rename to _weave/lecture03/jl_7PHY3B/sciml_42_1.png diff --git a/_weave/lecture03/jl_7PHY3B/sciml_45_1.png b/_weave/lecture03/jl_7PHY3B/sciml_45_1.png new file mode 100644 index 00000000..289c9827 Binary files /dev/null and b/_weave/lecture03/jl_7PHY3B/sciml_45_1.png differ diff --git a/_weave/lecture03/jl_A7oobX/sciml_35_1.png b/_weave/lecture03/jl_A7oobX/sciml_35_1.png deleted file mode 100644 index b73d3e2b..00000000 Binary files a/_weave/lecture03/jl_A7oobX/sciml_35_1.png and /dev/null differ diff --git a/_weave/lecture03/jl_A7oobX/sciml_41_1.png b/_weave/lecture03/jl_A7oobX/sciml_41_1.png deleted file mode 100644 index 7c82fa34..00000000 Binary files a/_weave/lecture03/jl_A7oobX/sciml_41_1.png and /dev/null differ diff --git a/_weave/lecture03/jl_A7oobX/sciml_45_1.png b/_weave/lecture03/jl_A7oobX/sciml_45_1.png deleted file mode 100644 index a6af4682..00000000 Binary files a/_weave/lecture03/jl_A7oobX/sciml_45_1.png and /dev/null differ diff --git a/_weave/lecture03/sciml/index.html b/_weave/lecture03/sciml/index.html index 97bef98a..af00466e 100644 --- a/_weave/lecture03/sciml/index.html +++ b/_weave/lecture03/sciml/index.html @@ -19,11 +19,11 @@

Introduction to Scientific Machine Learning through Physics-Inf simpleNN(rand(10))

 5-element Vector{Float64}:
- -2.4350849734604516
-  4.348579464751774
- -0.430223629075539
-  1.7699710631897965
- -8.170048057983601
+ -1.3383331584519713
+ -2.39034673226155
+  5.476769960460295
+ -4.383289115950564
+ -8.639151272106952
 

This is our direct definition of a neural network. Notice that we choose to use tanh as our activation function between the layers.
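A hedged reconstruction of that direct definition (randomly initialized weights, sized to match the 10-input, 5-output network used here) is roughly:

 W = [randn(32,10), randn(32,32), randn(5,32)]
 b = [zeros(32), zeros(32), zeros(5)]
 simpleNN(x) = W[3]*tanh.(W[2]*tanh.(W[1]*x + b[1]) + b[2]) + b[3]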

Defining Neural Networks with Flux.jl

One of the main deep learning libraries in Julia is Flux.jl. Flux is an interesting library for scientific machine learning because it is built on top of language-wide automatic differentiation libraries, giving rise to a programming paradigm known as differentiable programming, which means that one can write a program in a manner such that it has easily accessible fast derivatives. However, due to being built on a differentiable programming base, the underlying functionality is simply standard Julia code.

To learn how to use the library, consult the documentation. A Google search will bring up the Flux.jl Github repository. From there, the blue link on the README brings you to the package documentation. This is common through Julia so it's a good habit to learn!

In the documentation you will find that the way a neural network is defined is through a Chain of layers. A Dense layer is the kind we defined above, which is given by an input size, an output size, and an activation function. For example, the following recreates the neural network that we had above:

 using Flux
 NN2 = Chain(Dense(10 => 32,tanh),
@@ -32,11 +32,11 @@ 

Introduction to Scientific Machine Learning through Physics-Inf NN2(rand(10))

 5-element Vector{Float64}:
- -0.29078475578534135
- -0.36515252241118434
-  0.18666332556964638
-  0.27894106921878215
- -0.07838097075616664
+ -0.367383275136521
+  0.09404428112599642
+  0.24593293916044112
+ -0.19206701177467855
+ -0.1616706778424914
 

Notice that Flux.jl as a library is written in pure Julia, which means that every piece of this syntax is just sugar over some Julia code that we can specialize ourselves (this is the advantage of having a language fast enough for the implementation of the library and the use of the library!)

For example, the activation function is just a scalar Julia function. If we wanted to replace it by something like the quadratic function, we can just use an anonymous function to define the scalar function we would like to use:

 NN3 = Chain(Dense(10 => 32,x->x^2),
             Dense(32 => 32,x->max(0,x)),
@@ -44,11 +44,11 @@ 

Introduction to Scientific Machine Learning through Physics-Inf NN3(rand(10))

 5-element Vector{Float64}:
- -0.09374245725516782
-  0.36412359730517746
- -0.04188060346619738
- -0.0063866127237920105
- -0.07180727833605427
+ -0.0635693658681912
+  0.10343361412920299
+  0.23396297818125808
+ -0.0655890950970476
+  0.05933084866332009
 

The second activation function there is what's known as a relu. A relu can be good to use because it's an exceptionally fast operation and satisfies a form of the universal approximation theorem (UAT). However, a downside is that its derivative is not continuous, which could impact the numerical properties of some algorithms, and thus it's widely used throughout standard machine learning but we'll see reasons why it may be disadvantageous in some cases in scientific machine learning.

Digging into the Construction of a Neural Network Library

Again, as mentioned before, this neural network NN2 is simply a function:

 simpleNN(x) = W[3]*tanh.(W[2]*tanh.(W[1]*x + b[1]) + b[2]) + b[3]
 
@@ -96,26 +96,26 @@ 

Introduction to Scientific Machine Learning through Physics-Inf denselayer_f(rand(32))

 32-element Vector{Float64}:
-  0.6502745532344867
- -0.2764078547672906
- -0.33756397356649603
-  0.2905378129827325
-  0.4829203037778629
-  0.49259637695301317
- -0.7244575524152954
-  0.6201698112871467
-  0.43852802909767524
- -0.29660901905762604
+  0.5742544502155338
+ -0.33380176214476615
+  0.542810141740359
+  0.14469442202480834
+ -0.29914058859543896
+  0.6716839940944329
+  0.2881809336509002
+  0.5692583753772852
+ -0.5698291775734202
+ -0.6771694953190442
   ⋮
- -0.2999804882299587
-  0.6165336309867477
-  0.29528760817159105
- -0.4957092213910054
-  0.7795838428322517
- -0.4247540893917004
- -0.42115911678801005
- -0.5614876585119257
- -0.48468637337209475
+  0.06555175730264437
+ -0.7822997657577385
+ -0.2509639764291457
+ -0.14439865793581114
+ -0.057541402830386
+  0.5126940597261566
+ -0.6583375064922741
+  0.19901747735535336
+ -0.11878186145036329
 

So okay, Dense objects are just functions that have weight and bias matrices inside of them. Now what does Chain do?

 @which Chain(1,2,3)
 
Chain(xs...) in Flux at /home/runner/.julia/packages/Flux/ZdbJr/src/layers/basic.jl:39

Again, for our explanations here we will look at the slightly simpler code from an earlier version of the Flux package:
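That earlier source is not reproduced here, but the core pattern is a callable struct: a layer is a struct holding its weights, and calling the struct applies the layer. A minimal sketch of that pattern (the names MyDense and MyChain are ours, not Flux's) looks like:

 struct MyDense{M,B,F}
   W::M
   b::B
   σ::F
 end
 (d::MyDense)(x) = d.σ.(d.W*x .+ d.b)   # calling the layer applies σ.(W*x .+ b)

 struct MyChain{T<:Tuple}
   layers::T
 end
 MyChain(layers...) = MyChain(layers)
 (c::MyChain)(x) = foldl((y,layer) -> layer(y), c.layers; init = x)   # apply each layer in order

 layer1 = MyDense(randn(32,10), zeros(32), tanh)
 layer2 = MyDense(randn(5,32), zeros(5), identity)
 MyChain(layer1, layer2)(rand(10))   # a 5-element output, like the networks above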

@@ -169,57 +169,53 @@ 

Introduction to Scientific Machine Learning through Physics-Inf loss() = sum(abs2,sum(abs2,NN(rand(10)).-1) for i in 1:100) loss()

-4550.070342185175
+4762.879890603975
 

This loss function takes 100 random points in $[0,1]^{10}$, computes the output of the neural network minus 1 on each of those values, and sums up the squared values (abs2). Why the squared values? Squaring makes every term in the loss positive, so decreasing the loss means that, on average, our neural network outputs are closer to 1. What are the weights? Since we're using the Flux callable struct style from above, the weights are those inside of the NN chain object, which we can inspect:

 NN[1].weight # The W matrix of the first layer
 
 32×10 Matrix{Float32}:
-  0.206907   -0.293996     0.246735   …  -0.316413    -0.344967    0.140159
- -0.234722    0.0792494   -0.255654       0.304856     0.301727   -0.15234
- -0.131264   -0.300535     0.34379        0.137062    -0.355142    0.292703
- -0.165884   -0.184028    -0.244061      -0.0888335   -0.0501927  -0.364835
-  0.273653   -0.27581     -0.165154       0.107254    -0.0985882  -0.131832
- -0.0837413   0.0814385    0.193289   …   0.114349    -0.0310933   0.343392
- -0.108227   -0.110772     0.155364       0.177525     0.160025   -0.0177611
-  0.131718    0.00724105  -0.223872      -0.00875761   0.112146    0.245469
- -0.173363    0.105232    -0.331788       0.224498     0.0817328   0.163695
- -0.129662    0.0193645    0.225084      -0.131568     0.124624    0.106667
-  ⋮                                   ⋱                           
-  0.366625    0.215874    -0.284587       0.257149     0.181714   -0.244675
-  0.134637   -0.280037     0.24618       -0.276576     0.0496992   0.0262466
-  0.0804886  -0.0138646    0.056448   …  -0.336282    -0.244829   -0.33495
- -0.0437822  -0.260398    -0.190927       0.287319    -0.192932    0.053829
- -0.0672412   0.283508     0.192685      -0.105615    -0.115523    0.0374398
-  0.0563317  -0.317537     0.356511       0.136938     0.349309   -0.187046
-  0.266178    0.125742     0.0387179     -0.322464     0.10805     0.268939
- -0.14143     0.315573     0.308718   …   0.357034    -0.30481     0.0634837
- -0.293272    0.181026    -0.101116      -0.135126    -0.249626    0.0221629
+ -0.042315   -0.152674    0.248828   -0.00480644  …  -0.165152   -0.0683321
+  0.0196488  -0.355363   -0.0511883  -0.275463       -0.345056    0.247659
+  0.0696147   0.198262   -0.0505652   0.208592        0.349459    0.160529
+  0.229841    0.181453   -0.206514   -0.165194       -0.262046   -0.123437
+ -0.298059   -0.320777    0.161661   -0.0647406       0.293637   -0.325253
+ -0.28259    -0.0159905  -0.227372   -0.26533     …  -0.159701    0.215338
+ -0.0417327   0.0246552   0.282349   -0.0145864      -0.277505    0.0595062
+  0.106574    0.0655952   0.11508    -0.0328105      -0.185263    0.12242
+  0.316739   -0.147705    0.088275   -0.0220919      -0.180979   -0.24828
+  0.223117   -0.0728504   0.0867307  -0.349231        0.223369   -0.147801
+  ⋮                                               ⋱              
+  0.156      -0.302055    0.14692    -0.189734        0.363814    0.366614
+  0.31773     0.32263    -0.375842   -0.306479        0.26718     0.257999
+  0.0447154   0.0959551  -0.0754782   0.32307     …   0.0750746   0.320953
+ -0.120086   -0.0858004  -0.29501    -0.0595151       0.278634    0.225435
+ -0.0773527   0.0215229   0.276879   -0.297569        0.179297    0.00241095
+ -0.242057    0.116857   -0.0286312   0.12309         0.342328   -0.0437305
+ -0.0715863  -0.0341185   0.0922671  -0.0341884      -0.0671414  -0.0955032
+ -0.118132    0.0959598   0.148726    0.331655    …  -0.0611473   0.377659
+ -0.36258     0.312327   -0.368223    0.118171       -0.28799    -0.102674
 

Now let's grab all of the parameters together:

 p = Flux.params(NN)
 
-Params([Float32[0.20690748 -0.2939962 … -0.34496742 0.14015937; -0.23472181
- 0.07924941 … 0.30172688 -0.15234008; … ; -0.14143002 0.31557292 … -0.30481
-017 0.06348374; -0.2932724 0.18102595 … -0.24962649 0.022162892], Float32[0
-.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0
-, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.03256267 -0.13882013 … 0.19279614 0.
-05722884; -0.0030363456 0.07537417 … 0.022959018 -0.2238231; … ; 0.29075414
- 0.27948645 … -0.20741504 0.052605283; 0.21510808 -0.21531041 … -0.28893065
- -0.087447755], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …
-  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.03591951 0.0
-95983386 … 0.3207747 -0.08298867; -0.10454989 0.12266854 … 0.25205672 -0.02
-537506; … ; 0.12059488 0.0905982 … 0.059288114 0.2102544; 0.07917695 -0.169
-73555 … -0.2598032 -0.35143456], Float32[0.0, 0.0, 0.0, 0.0, 0.0]])
+Params([Float32[-0.042314984 -0.15267445 … -0.16515188 -0.06833213; 0.01964
+8807 -0.35536322 … -0.34505555 0.2476594; … ; -0.11813201 0.09595979 … -0.0
+61147317 0.37765864; -0.36258012 0.31232706 … -0.28799024 -0.10267412], Flo
+at32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.
+0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.20425878 -0.076422386 … 0.08237
+911 -0.27405605; -0.28111216 0.16626641 … -0.29174885 -0.16367172; … ; 0.27
+852747 -0.23803714 … 0.2357961 0.14744177; 0.26311168 -0.2878293 … -0.03472
+6076 0.2109272], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  
+…  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.104982175 
+-0.12251709 … -0.17302983 -0.039281193; 0.3006238 0.08044029 … 0.017479276 
+0.10239558; … ; 0.1948438 -0.26296428 … -0.2930175 0.0440955; 0.02304553 -0
+.34583634 … 0.19846451 0.38075408], Float32[0.0, 0.0, 0.0, 0.0, 0.0]])
 

That's a helper function on Chain which recursively gathers all of the defining parameters. Let's now find the optimal values p which cause the neural network to be the constant 1 function:

 Flux.train!(loss, p, Iterators.repeated((), 10000), ADAM(0.1))
 

Now let's check the loss:

 loss()
 
-6.637833380399612e-5
+5.824328915289363e-9
 

This means that NN(x) is now a very good function approximator to f(x) = ones(5)!

So Why Machine Learning? Why Neural Networks?

All we did was find parameters that made NN(x) act like a function f(x). How does that relate to machine learning? Well, in any case where one is acting on data (x,y), the idea is to assume that there exists some underlying mathematical model f(x) = y. If we had perfect knowledge of what f is, then from only the information of x we can then predict what y would be. The inference problem is to then figure out what function f should be. Therefore, machine learning on data is simply this problem of finding an approximator to some unknown function!

So why neural networks? Neural networks satisfy two properties. The first of which is known as the Universal Approximation Theorem (UAT), which in simple non-mathematical language means that, for any ϵ of accuracy, if your neural network is large enough (has enough layers, the weight matrices are large enough), then it can approximate any (nice) function f within that ϵ. Therefore, we can reduce the problem of finding missing functions, the problem of machine learning, to a problem of finding the weights of neural networks, which is a well-defined mathematical optimization problem.

Why neural networks specifically? That's a fairly good question, since there are many other functions with this property. For example, you will have learned from analysis that polynomials $a_0 + a_1 x + a_2 x^2 + \ldots$ can be used to approximate any analytic function (this is the Taylor series). Similarly, a Fourier series

\[ f(x) = a_0 + \sum_k b_k \cos(kx) + c_k \sin(kx) \]

can approximate any continuous function f (and discontinuous functions can also have convergence, etc.; these are the details of a harmonic analysis course).

That's all for one dimension. How about two dimensional functions? It turns out it's not difficult to prove that tensor products of universal approximators will give higher dimensional universal approximators. So for example, tensoring together two polynomials:

\[ a_0 + a_1 x + a_2 y + a_3 x y + a_4 x^2 y + a_5 x y^2 + a_6 x^2 y^2 + \ldots \]

will give a two-dimensional function approximator. But notice how we have to resolve every combination of terms. This means that if we use n coefficients in each of the d dimensions, then building a d-dimensional universal approximator out of one-dimensional objects requires $n^d$ coefficients in total. This exponential growth is known as the curse of dimensionality.
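As a quick arithmetic check of that growth (n = 10 basis functions per dimension is an arbitrary choice for illustration):

 n = 10
 [n^d for d in (1, 2, 3, 5, 10)]   # 10, 100, 1000, 100000, 10000000000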

The second property of neural networks that makes them applicable to machine learning is that they overcome the curse of dimensionality. The proofs in this area can be a little difficult to parse, but what they boil down to is proving, in many cases, that the size of the neural network needed to sufficiently approximate a d-dimensional function grows polynomially in d, rather than exponentially. This means that there's some dimensional cutoff beyond which it is more efficient to use a neural network. The cutoff can be problem-specific, but generally it tends to be reached by around 8 or 10 dimensions.

Neural networks have a few other properties to consider as well:

  1. The assumptions of the neural network can be encoded into the neural architecture. A neural network whose last layer has the activation function x->x^2 is a neural network whose outputs are all nonnegative (a small sketch is given after this list). This means that if you want to find a positive function, you can make the optimization easier by enforcing this constraint. Many other constraints can be enforced as well: for example, tanh activation functions make the neural network a smooth (all derivatives finite) function, while other activations can give a finite number of learnable discontinuities.

  2. Generating higher dimensional forms from one dimensional forms does not have good symmetry. For example, the two-dimensional tensor Fourier basis does not have a good way to represent $\sin(xy)$. This property of the approximator is called (non)isotropy and more detail can be found in this wonderful talk about function approximation for multidimensional integration (cubature). Neural networks are naturally not aligned to a basis.

  3. Neural networks are "easy" to compute. There's good software for them, GPU-acceleration, and all other kinds of tooling that make them particularly simple to use.

  4. There are proofs that in many scenarios for neural networks the local minima are the global minima, meaning that local optimization is sufficient for training a neural network. Global optimization (which we will cover later in the course) is much more expensive than local methods like gradient descent, and thus this can be a good property to abuse for faster computation.
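As a small sketch of point 1 (the layer sizes are chosen to match the networks above; this is our example, not lecture code), ending the chain with an x -> x^2 activation guarantees nonnegative outputs:

 using Flux
 NNpos = Chain(Dense(10 => 32, tanh), Dense(32 => 5, x -> x^2))
 all(>=(0), NNpos(rand(10)))   # true: every output is a square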

From Machine Learning to Scientific Machine Learning: Structure and Science

This understanding of a neural network and their libraries directly bridges to the understanding of scientific machine learning and the computation done in the field. In scientific machine learning, neural networks and machine learning are used as the basis to solve problems in scientific computing. Scientific computing, as a discipline also known as Computational Science, is a field of study which focuses on scientific simulation, using tools such as differential equations to investigate physical, biological, and other phenomena.

What we wish to do in scientific machine learning is use these properties of neural networks to improve the way that we investigate our scientific models.

Aside: Why Differential Equations?

Why do differential equations come up so often as the model in the scientific context? This is a deep question with quite a simple answer. Essentially, scientific experiments always have to test how things change. For example, you take a system now, you change it, and your measurement is how the changes you made caused changes in the system. This boils down to gathering information about how, for some arbitrary system $y = f(x)$, $\Delta x$ is related to $\Delta y$. Thus what you learn from scientific experiments, what is codified as scientific laws, is not "the answer", but the answer to how things change. Writing down equations that precisely describe how things change gives differential equations.

Solving ODEs with Neural Networks: The Physics-Informed Neural Network

Now let's get to our first true SciML application: solving ordinary differential equations with neural networks. A neural network that is trained to solve a differential equation, or trained with a differential equation as a regularizer in the loss function, is known as a physics-informed neural network, since this allows physical equations to guide the training of the neural network in circumstances where data might be lacking.

Background: A Method for Solving Ordinary Differential Equations with Neural Networks

This is a result first due to Lagaris et al. from 1998. The idea is to solve differential equations using neural networks by representing the solution by a neural network and training the resulting network to satisfy the conditions required by the differential equation.

Let's say we want to solve a system of ordinary differential equations

\[ u' = f(u,t) \]

with $t \in [0,1]$ and a known initial condition $u(0)=u_0$. To solve this, we approximate the solution by a neural network:

\[ NN(t) \approx u(t) \]

If $NN(t)$ were the true solution, then it would hold that $NN'(t) = f(NN(t),t)$ for all $t$. We can therefore turn this condition into a training objective, which motivates the loss function:

\[ L(p) = \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

The choice of $t_i$ could be done in many ways: it can be random, it can be a grid, etc. Anyways, when this loss function is minimized (gradients computed with standard reverse-mode automatic differentiation), then we have that $\frac{dNN(t_i)}{dt} \approx f(NN(t_i),t_i)$ and thus $NN(t)$ approximately solves the differential equation.

Note that we still have to handle the initial condition. One simple way to do this is to add an initial condition term to the cost function. This would look like:

\[ L(p) = (NN(0) - u_0)^2 + \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

While that would work, it can be more efficient to encode the initial condition into the function itself so that it's trivially satisfied for any possible set of parameters. For example, instead of directly using a neural network, we can use:

\[ g(t) = u_0 + tNN(t) \]

as our solution. Notice that $g(t)$ is thus a universal approximator for all continuous functions such that $g(0)=u_0$ (this is a property one should prove!). Since $g(t)$ always satisfies the initial condition, if we train $g(t)$ to satisfy the differential equation then it is automatically a solution of the full problem. In this sense, we can use the loss function:

\[ L(p) = \sum_i \left(\frac{dg(t_i)}{dt} - f(g(t_i),t_i) \right)^2 \]

where $p$ are the parameters that define $g$, which in turn are the parameters which define the neural network $NN$ that define $g$. Thus this reduces down, once again, to simply finding weights which minimize a loss function!

Coding Up the Method

Now let's implement this method with Flux. Let's define a neural network to be the NN(t) above. To make the problem easier, let's look at the ODE:

\[ u' = \cos 2\pi t \]

and approximate it with a neural network that maps a scalar to a scalar:

 using Flux
 NNODE = Chain(x -> [x], # Take in a scalar and transform it into an array
@@ -228,7 +224,7 @@ 

Introduction to Scientific Machine Learning through Physics-Inf first) # Take first value, i.e. return a scalar NNODE(1.0)

-0.12358128803653629
+-0.07905010348616885
 

Instead of using the neural network directly as the approximate solution, we will use the transformed function that is forced to satisfy the initial condition. Using u0=1.0, we have the function:

 g(t) = t*NNODE(t) + 1f0
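The training setup that follows is elided by this diff; one common way to define the loss (an assumption on our part, using a simple finite difference for the derivative of g over a grid of t values) is:

 ϵ = sqrt(eps(Float32))
 loss() = sum(abs2, (g(t + ϵ) - g(t)) / ϵ - cos(2π * t) for t in 0:1f-2:1f0)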
 
@@ -252,23 +248,23 @@ 

Introduction to Scientific Machine Learning through Physics-Inf display(loss()) Flux.train!(loss, Flux.params(NNODE), data, opt; cb=cb)

-0.5292029324358418
-0.4932957105006129
-0.4516506740086385
-0.3061094653058115
-0.07083152746013988
-0.011665014535397188
-0.006347690360824062
-0.005509980046802497
-0.005214355730566553
-0.0049850620896091675
-0.004787041823734979
+0.5178025866113664
+0.5003380507674955
+0.4817250368610114
+0.4072179526120902
+0.1709448358222078
+0.020230873113184646
+0.005706788834646396
+0.004160672851654196
+0.0038591849808713866
+0.003736976081002845
+0.003611986041281208
 

How well did this do? Well if we take the integral of both sides of our differential equation, we see it's fairly trivial:

\[ \int g' = g = \int \cos 2\pi t = C + \frac{\sin 2\pi t}{2\pi} \]

where $C = 1$ from the initial condition. Let's take a bunch of (input,output) pairs from the neural network and plot it against the analytical solution to the differential equation:

 using Plots
 t = 0:0.001:1.0
 plot(t,g.(t),label="NN")
 plot!(t,1.0 .+ sin.(2π.*t)/2π, label = "True Solution")
-

We see that it matches very well, and we can keep improving this fit by increasing the size of the neural network, using more training points, and training for more iterations.

Example: Harmonic Oscillator Informed Training

Using this idea, differential equations encoding physical laws can be utilized inside of loss functions for terms which we have some basis to believe should approximately follow some physical system. Let's investigate this last step by looking at how to inform the training of a neural network using the harmonic oscillator.

Let's assume that we are taking measurements of (position,force) in some real one-dimensional spring pushing and pulling against a wall.

But instead of the simple spring, let's assume we have a more complex spring; for example, let's say $F(x) = -kx + 0.1\sin(x)$, where this extra term is due to some deformities in the metal (assume mass = 1). Then by Newton's law of motion we have a second order ordinary differential equation:

\[ x'' = -kx + 0.1 \sin(x) \]

We can use the DifferentialEquations.jl package to solve this differential equation and see what this system looks like:

+

We see that it matches very well, and we can keep improving this fit by increasing the size of the neural network, using more training points, and training for more iterations.

Example: Harmonic Oscillator Informed Training

Using this idea, differential equations encoding physical laws can be utilized inside of loss functions for terms which we have some basis to believe should approximately follow some physical system. Let's investigate this last step by looking at how to inform the training of a neural network using the harmonic oscillator.

Let's assume that we are taking measurements of (position,force) in some real one-dimensional spring pushing and pulling against a wall.

But instead of the simple spring, let's assume we have a more complex spring; for example, let's say $F(x) = -kx + 0.1\sin(x)$, where this extra term is due to some deformities in the metal (assume mass = 1). Then by Newton's law of motion we have a second order ordinary differential equation:

\[ x'' = -kx + 0.1 \sin(x) \]

We can use the DifferentialEquations.jl package to solve this differential equation and see what this system looks like:

 using DifferentialEquations
 k = 1.0
 force(dx,x,k,t) = -k*x + 0.1sin(x)
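 # The remainder of this block (building and solving the ODE problem, then plotting)
 # is cut off by the diff; a plausible completion, mirroring the simplified-spring
 # call shown further below, would be:
 prob = SecondOrderODEProblem(force, 1.0, 0.0, (0.0, 10.0), k)
 sol = solve(prob)
 plot(sol)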
@@ -305,7 +301,7 @@ 

Introduction to Scientific Machine Learning through Physics-Inf loss() = sum(abs2,NNForce(position_data[i]) - force_data[i] for i in 1:length(position_data)) loss()

-0.004090519383696386
+0.0010967192704571457
 

Our random parameters do not do so well, so let's train!

 opt = Flux.Descent(0.01)
 data = Iterators.repeated((), 5000)
@@ -319,24 +315,24 @@ 

Introduction to Scientific Machine Learning through Physics-Inf display(loss()) Flux.train!(loss, Flux.params(NNForce), data, opt; cb=cb)

-0.004090519383696386
-0.003165046027128867
-0.002641747350681186
-0.0022070107178882607
-0.0018447603635215674
-0.0015422621040588331
-0.0012892905773686783
-0.001077546330758725
-0.0009002280237856551
-0.0007517186517138613
-0.000627355617761891
+0.0010967192704571457
+0.0008617590463802304
+0.0007290146954419487
+0.0006163647557874953
+0.0005208026932904165
+0.00043977568374691406
+0.00037111415967897563
+0.0003129703126082548
+0.0002637671607014535
+0.00022216044374859072
+0.00018700348642827335
 

The neural network almost exactly matched the dataset, but how well did it actually learn the real force function? Let's plot it to see:

 learned_force_plot = NNForce.(positions_plot)
 
 plot(plot_t,force_plot,xlabel="t",label="True Force")
 plot!(plot_t,learned_force_plot,label="Predicted Force")
 scatter!(t,force_data,label="Force Measurements")
-

Ouch. The problem is that a neural network can approximate any function, so it approximated a function that fits the data, but not the correct function. We somehow need to have more data... but where can we get more data?

Well, even a first-year undergrad in physics will know Hooke's law, which says that an idealized spring satisfies $F(x) = -kx$. This is a decent assumption for the evolution of the system:

+

Ouch. The problem is that a neural network can approximate any function, so it approximated a function that fits the data, but not the correct function. We somehow need to have more data... but where can we get more data?

Well, even a first-year undergrad in physics will know Hooke's law, which says that an idealized spring satisfies $F(x) = -kx$. This is a decent assumption for the evolution of the system:

 force2(dx,x,k,t) = -k*x
 prob_simplified = SecondOrderODEProblem(force2,1.0,0.0,(0.0,10.0),k)
 sol_simplified = solve(prob_simplified)
@@ -347,7 +343,7 @@ 

Introduction to Scientific Machine Learning through Physics-Inf loss_ode() = sum(abs2,NNForce(x) - (-k*x) for x in random_positions) loss_ode()

-14.500465286856848
+6.600899833173732
 

If this term is zero, then $F(x) = -kx$, which is approximately true. So now let's put these together:

 λ = 0.1
 composed_loss() = loss() + λ*loss_ode()
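 # Training on the composed loss then proceeds exactly as before (a sketch: the
 # callback and iteration count actually used in the lecture are cut off by the diff)
 opt = Flux.Descent(0.01)
 data = Iterators.repeated((), 5000)
 Flux.train!(composed_loss, Flux.params(NNForce), data, opt)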
@@ -372,15 +368,15 @@ 

Introduction to Scientific Machine Learning through Physics-Inf plot!(plot_t,learned_force_plot,label="Predicted Force") scatter!(t,force_data,label="Force Measurements")

-1.4506738843034468
-0.0006670716572459478
-0.000629829423817155
-0.000596875962324285
-0.000567402212010099
-0.0005408107593652083
-0.0005166549010322651
-0.0004945863481240524
-0.00047432515093941667
-0.00045564658413633173
-0.0004383653530088632
-

And there we go: we have used knowledge of physics to help inform our neural network training process!

Conclusion

In this lecture we motivated machine learning not as a process of predicting from data but as a process for learning arbitrary nonlinear functions. Neural networks were just one choice of possible function. We then demonstrated how differential equations could be solved using this function approximation technique and then put together these two domains, solving differential equations and approximating data, into a single process to allow for physical knowledge to be embedded into the training process of a neural network, thus arriving at a physics-informed neural network. This is just one method in scientific machine learning which we will be exploring in more detail, demonstrating how we can utilize scientific knowledge to improve fits and allow for data-efficient machine learning.

\ No newline at end of file
+0.6602769868038015
+0.0008962173483139916
+0.0008395519776902406
+0.0007893063297842377
+0.0007444563419638081
+0.0007041903913019822
+0.0006678489876113494
+0.000634894748184695
+0.0006048844663758801
+0.0005774476005671365
+0.0005522700329561084
+

And there we go: we have used knowledge of physics to help inform our neural network training process!

Conclusion

In this lecture we motivated machine learning not as a process of predicting from data but as a process for learning arbitrary nonlinear functions. Neural networks were just one choice of possible function. We then demonstrated how differential equations could be solved using this function approximation technique and then put together these two domains, solving differential equations and approximating data, into a single process to allow for physical knowledge to be embedded into the training process of a neural network, thus arriving at a physics-informed neural network. This is just one method in scientific machine learning which we will be exploring in more detail, demonstrating how we can utilize scientific knowledge to improve fits and allow for data-efficient machine learning.

\ No newline at end of file diff --git a/_weave/lecture04/dynamical_systems/index.html b/_weave/lecture04/dynamical_systems/index.html index eafa8a23..8a79ad02 100644 --- a/_weave/lecture04/dynamical_systems/index.html +++ b/_weave/lecture04/dynamical_systems/index.html @@ -113,7 +113,7 @@

How Iteration Works, An Introduction to Discrete Dynamics

end @time solve_system_save(lorenz,[1.0,0.0,0.0],p,1000)
-0.000072 seconds (1.00 k allocations: 86.062 KiB)
+0.000075 seconds (1.00 k allocations: 86.062 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -138,7 +138,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

 @time solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-0.021052 seconds (4.63 k allocations: 344.371 KiB, 99.51% compilation tim
+0.019902 seconds (4.63 k allocations: 344.371 KiB, 99.49% compilation tim
 e)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
@@ -164,7 +164,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

The first time, Julia compiles the function; the second is a straight call.

 @time solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-0.000108 seconds (1.01 k allocations: 99.984 KiB)
+0.000122 seconds (1.01 k allocations: 99.984 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -190,7 +190,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

using BenchmarkTools @btime solve_system_save(lorenz,[1.0,0.0,0.0],p,1000)
-39.100 μs (1001 allocations: 86.06 KiB)
+43.100 μs (1001 allocations: 86.06 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -215,7 +215,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

 @btime solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-51.200 μs (1006 allocations: 99.98 KiB)
+52.500 μs (1006 allocations: 99.98 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -248,7 +248,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

end @btime solve_system_save_matrix(lorenz,[1.0,0.0,0.0],p,1000)
-78.799 μs (2001 allocations: 179.66 KiB)
+98.700 μs (2001 allocations: 179.66 KiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -265,7 +265,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

end @btime solve_system_save_matrix_view(lorenz,[1.0,0.0,0.0],p,1000)
-49.800 μs (1002 allocations: 101.61 KiB)
+58.601 μs (1002 allocations: 101.61 KiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -282,7 +282,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

end @btime solve_system_save_matrix_resize(lorenz,[1.0,0.0,0.0],p,1000)
-2.913 ms (2318 allocations: 11.65 MiB)
+2.857 ms (2318 allocations: 11.65 MiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -388,7 +388,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

which would compute f and then take the values of du and update u with them, but that's three more operations than required, whereas u,du = du,u simply changes u to be a pointer to the updated memory, and now du is an "empty" cache array that we can refill (this decreases the computational cost by ~33%). A minimal sketch of this swap pattern is shown below; after that, let's see what the cost is with this newest version:
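Here is a minimal sketch of that double-buffered swap loop (our own illustrative function, assuming an in-place update f!(du,u,p) that writes the next state into du):

 function iterate_swap(f!, u0, p, n)
   u  = copy(u0)
   du = similar(u0)
   for i in 1:n-1
     f!(du, u, p)
     u, du = du, u   # swap: u now holds the new state, du is reused as scratch
   end
   u
 end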

 @btime solve_system(lorenz,[1.0,0.0,0.0],p,1000)
 
-37.600 μs (1000 allocations: 78.12 KiB)
+42.501 μs (1000 allocations: 78.12 KiB)
 3-element Vector{Float64}:
   1.4744010677851374
   0.8530017039412324
@@ -396,7 +396,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

 @btime solve_system_mutate(lorenz,[1.0,0.0,0.0],p,1000)
 
-6.775 μs (3 allocations: 240 bytes)
+8.067 μs (3 allocations: 240 bytes)
 3-element Vector{Float64}:
   1.4744010677851374
   0.8530017039412324
@@ -445,7 +445,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

 @btime solve_system_save(lorenz,@SVector[1.0,0.0,0.0],p,1000)
 
-6.825 μs (2 allocations: 23.48 KiB)
+8.200 μs (2 allocations: 23.48 KiB)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -511,7 +511,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

 @btime solve_system_save(lorenz,@SVector[1.0,0.0,0.0],p,1000)
 
-6.020 μs (2 allocations: 23.48 KiB)
+6.600 μs (2 allocations: 23.48 KiB)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -536,7 +536,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

And we can get down to non-allocating for the loop:

 @btime solve_system(lorenz,@SVector([1.0,0.0,0.0]),p,1000)
 
-5.167 μs (1 allocation: 32 bytes)
+5.700 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
   1.4744010677851374
   0.8530017039412324
@@ -552,7 +552,7 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

u = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000) @btime solve_system_save!(u,lorenz,@SVector([1.0,0.0,0.0]),p,1000)
-5.350 μs (0 allocations: 0 bytes)
+6.360 μs (0 allocations: 0 bytes)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -574,4 +574,4 @@ 

How Iteration Works, An Introduction to Discrete Dynamics

[1.9820054139405763, 0.46628657468365653, 22.964748583050085] [1.6788616460891923, 0.6565587545689172, 21.758445642263496] [1.4744010677851374, 0.8530017039412324, 20.62004063423844] -

It is important to note that this single allocation does not seem to affect the timing of the result in this case, when run serially. However, when parallelism or embedded applications get involved, this can be a significant effect.

Discussion Questions

  1. What are some ways to compute steady states? Periodic orbits?

  2. When using the mutating algorithms, what are the data dependencies between different solves if they were to happen simultaneously?

  3. We saw that there is a connection between delayed systems and multivariable systems. How deep does that go? Is every delayed system also a multivariable system and vice versa? Is this a useful idea to explore?

\ No newline at end of file +

It is important to note that this single allocation does not seem to affect the timing of the result in this case, when run serially. However, when parallelism or embedded applications get involved, this can be a significant effect.

Discussion Questions

  1. What are some ways to compute steady states? Periodic orbits?

  2. When using the mutating algorithms, what are the data dependencies between different solves if they were to happen simultaneously?

  3. We saw that there is a connection between delayed systems and multivariable systems. How deep does that go? Is every delayed system also a multivariable system and vice versa? Is this a useful idea to explore?

\ No newline at end of file diff --git a/_weave/lecture04/jl_RfkYO9/dynamical_systems_11_1.png b/_weave/lecture04/jl_o4gXmS/dynamical_systems_11_1.png similarity index 100% rename from _weave/lecture04/jl_RfkYO9/dynamical_systems_11_1.png rename to _weave/lecture04/jl_o4gXmS/dynamical_systems_11_1.png diff --git a/_weave/lecture05/parallelism_overview/index.html b/_weave/lecture05/parallelism_overview/index.html index 2165d4f8..73e0030e 100644 --- a/_weave/lecture05/parallelism_overview/index.html +++ b/_weave/lecture05/parallelism_overview/index.html @@ -27,7 +27,7 @@

The Basics of Single Node Parallel Computing

Chris Rac u = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000) @btime solve_system_save!(u,lorenz,@SVector([1.0,0.0,0.0]),p,1000)
-4.750 μs (0 allocations: 0 bytes)
+6.580 μs (0 allocations: 0 bytes)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -74,7 +74,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac u = [Vector{Float64}(undef,3) for i in 1:1000] @btime solve_system_save_iip!(u,lorenz!,[1.0,0.0,0.0],p,1000)
-6.460 μs (1 allocation: 80 bytes)
+8.400 μs (1 allocation: 80 bytes)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -127,7 +127,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac u = [Vector{Float64}(undef,3) for i in 1:1000] @btime solve_system_save_iip!(u,lorenz_mt!,[1.0,0.0,0.0],p,1000);
-1.690 ms (6994 allocations: 671.28 KiB)
+2.001 ms (6994 allocations: 671.28 KiB)
 

Parallelism doesn't always make things faster. There are two costs associated with this code. For one, we had to go to the slower heap+mutation version, so its implementation starting point is slower. But secondly, and more importantly, the cost of spinning up a new thread is non-negligible. In fact, here we can see that it even needs to make a small allocation for the new context. The total cost is on the order of 50ns: not huge, but something to take note of. So what we've done is taken almost-free calculations and made each of them cost ~50ns by running each one in a different thread, instead of just having one thread with one call stack.

The moral of the story is that you need to make sure that there's enough work per thread in order to effectively accelerate a program with parallelism.

Data-Parallel Problems

So not every setup is amenable to parallelism. Dynamical systems are notorious for being quite difficult to parallelize because the dependency of the future time step on the previous time step is clear, meaning that one cannot easily "parallelize through time" (though it is possible, which we will study later).

However, one common way that these systems are generally parallelized is in their inputs. The following questions allow for independent simulations:

  • What steady state does an input u0 go to for some list/region of initial conditions?

  • How does the solution vary when I use a different p?

The problem has a few descriptions. For one, it's called an embarrassingly parallel problem since the serial code can remain largely intact while solving the parallelism problem. To solve this, we can use the exact same solve_system_save_iip!, and just change how we are calling it. Secondly, this is called a data-parallel problem, since it is parallelized by splitting up the input data (here, the possible u0 or ps) and acting on each piece independently.

Multithreaded Parameter Searches

Now let's multithread our parameter search. Let's say we wanted to compute the mean of the values in the trajectory. For a single input pair, we can compute that like:

 using Statistics
 function compute_trajectory_mean(u0,p)
@@ -137,7 +137,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac end @btime compute_trajectory_mean(@SVector([1.0,0.0,0.0]),p)
-7.933 μs (3 allocations: 23.52 KiB)
+8.400 μs (3 allocations: 23.52 KiB)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -151,7 +151,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac end @btime compute_trajectory_mean2(@SVector([1.0,0.0,0.0]),p)
-6.850 μs (3 allocations: 112 bytes)
+7.950 μs (3 allocations: 112 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -165,7 +165,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac end @btime compute_trajectory_mean3(@SVector([1.0,0.0,0.0]),p)
-7.300 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -178,7 +178,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac compute_trajectory_mean4(u0,p) = _compute_trajectory_mean4(_u_cache,u0,p) @btime compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p)
-6.775 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -187,50 +187,50 @@ 

The Basics of Single Node Parallel Computing

Chris Rac ps = [(0.02,10.0,28.0,8/3) .* (1.0,rand(3)...) for i in 1:1000]
 1000-element Vector{NTuple{4, Float64}}:
- (0.02, 3.5506238069563842, 25.156388865912405, 1.2944671907763334)
- (0.02, 4.905102015647083, 20.798912517739257, 1.42455210067359)
- (0.02, 4.049777415039721, 15.634106342204387, 1.0689250599984783)
- (0.02, 0.617703562486901, 6.650265362134062, 0.9001192510844916)
- (0.02, 7.728479101055119, 4.283032277473355, 0.3814247162007132)
- (0.02, 8.69335001681721, 10.46660511452242, 0.009482281358365558)
- (0.02, 1.3817524075453613, 21.49817664634096, 2.228092866290097)
- (0.02, 3.6730466076349777, 13.370353460873327, 2.6306870107737916)
- (0.02, 4.757259184310949, 3.3848400213645222, 1.155402147677568)
- (0.02, 3.426685349820837, 0.40559752571178276, 2.1334711357331244)
+ (0.02, 3.037620156858795, 14.552719997603571, 1.5510247426254855)
+ (0.02, 4.936696155756138, 20.507022795818497, 2.183378859929042)
+ (0.02, 8.131977339125193, 7.7700882850491215, 0.48969639712480506)
+ (0.02, 1.755851672936397, 6.216686394026905, 1.7054311072661656)
+ (0.02, 8.13099723484067, 0.7588409559739153, 1.085714992546022)
+ (0.02, 1.7427316470389997, 12.0532270800079, 1.8453944444264936)
+ (0.02, 0.7583476249041621, 0.38895059916641284, 1.342153386465824)
+ (0.02, 7.723642173433803, 24.870938762602417, 0.6149070924838658)
+ (0.02, 4.354945546654296, 17.371545500497078, 0.44694951653095283)
+ (0.02, 1.5358900244982832, 9.084952881006156, 0.015530076994568986)
  ⋮
- (0.02, 8.332190592332246, 0.6269286088808745, 0.5604011320047804)
- (0.02, 5.476210596389, 10.370091055241664, 2.630030085241478)
- (0.02, 9.585550213838584, 18.44496697411735, 2.3643328335046734)
- (0.02, 4.343650227443543, 8.928297401416833, 2.1487711036406507)
- (0.02, 4.197305238186817, 10.260410200606863, 0.7413293451348505)
- (0.02, 2.0801528934988553, 9.917167676210104, 2.3442057783156334)
- (0.02, 3.7927870826229384, 24.445671325659124, 1.3108338858226898)
- (0.02, 7.342872394586884, 19.618883234188363, 1.5924503611012586)
- (0.02, 0.5667773787942509, 14.12384188972344, 0.506252200199163)
+ (0.02, 9.4927953946179, 19.43338395266777, 2.417035074069966)
+ (0.02, 1.2325741898934928, 3.235845837538059, 1.9310880212250672)
+ (0.02, 2.283079380936549, 15.943802804296585, 1.046984974419047)
+ (0.02, 3.9404703099197147, 1.1138370645453657, 0.5821002324991641)
+ (0.02, 0.2390881608605666, 16.14427395660405, 2.220052350252045)
+ (0.02, 1.3196303788806696, 24.830822721813846, 0.21431493909461025)
+ (0.02, 3.9863134178487747, 21.062045162769856, 1.8628923280400778)
+ (0.02, 4.425030472036934, 17.732299157775685, 1.0584427777088066)
+ (0.02, 5.923649789057791, 26.526499645471556, 1.5436207861637943)
 

And let's get the mean of the trajectory for each of the parameters.

 serial_out = map(p -> compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p),ps)
 
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Now let's do this with multithreading:

 function tmap(f,ps)
   out = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000)
@@ -243,26 +243,26 @@ 

The Basics of Single Node Parallel Computing

Chris Rac threaded_out = tmap(p -> compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p),ps)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Let's check the output:

 serial_out - threaded_out
 
@@ -296,7 +296,7 @@ 

The Basics of Single Node Parallel Computing

Chris Rac end @btime compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p)
-7.300 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -330,53 +330,53 @@ 

The Basics of Single Node Parallel Computing

Chris Rac
 @btime serial_out = map(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-7.912 ms (3 allocations: 23.50 KiB)
+7.906 ms (3 allocations: 23.50 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 
 @btime threaded_out = tmap(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-7.875 ms (9 allocations: 24.12 KiB)
+7.906 ms (9 allocations: 24.12 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Hierarchical Task-Based Multithreading and Dynamic Scheduling

The major change in Julia v1.3 is that Julia's Tasks, which are traditionally its green threads interface, are now the basis of its multithreading infrastructure. This means that independent tasks can be run in parallel, and the new interface for multithreading works by spawning tasks.

This implementation follows Go's goroutines and the classic multithreading interface of Cilk. There is a Julia-level scheduler that handles the multithreading and puts different tasks on different vCPU threads. A benefit of this is hierarchical multithreading. Since Julia's tasks can spawn tasks, a task can create tasks which themselves create tasks, and so on. In Julia (/Go/Cilk), these are all seen as a single pool of tasks to schedule, and thus it will still make sure only N are running at a time (as opposed to the naive implementation where the number of running threads multiplies at each level). This is essential for numerical performance because running multiple compute threads on a single CPU thread requires constant context switching between the threads, which slows down the computations.
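As a small illustration of tasks spawning tasks (our own example, a recursive parallel sum; the cutoff of 1000 elements is arbitrary), note that each spawned task may itself spawn more tasks, and the scheduler sees them all as one pool:

 function nested_sum(xs)
   length(xs) <= 1000 && return sum(xs)
   mid = length(xs) ÷ 2
   left = Threads.@spawn nested_sum(@view xs[1:mid])   # this task may spawn further tasks
   right = nested_sum(@view xs[mid+1:end])
   return fetch(left) + right
 end
 nested_sum(rand(10^6))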

To directly use the task-based interface, simply use Threads.@spawn to spawn new tasks. For example:

 function tmap2(f,ps)
   tasks = [Threads.@spawn f(ps[i]) for i in 1:1000]
@@ -385,51 +385,51 @@ 

The Basics of Single Node Parallel Computing

Chris Rac threaded_out = tmap2(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

However, if we check the timing we see:

 @btime tmap2(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-8.648 ms (6005 allocations: 562.70 KiB)
+8.594 ms (6005 allocations: 562.70 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Threads.@threads is built on the same multithreading infrastructure, so why is this so much slower? The reason is that Threads.@threads employs static scheduling while Threads.@spawn uses dynamic scheduling. Dynamic scheduling is the model of allowing the runtime to determine the ordering and scheduling of processes, i.e. what tasks will run where and when. Julia's task-based multithreading system has a thread scheduler which will automatically do this for you in the background, but because this is done at runtime it has overhead. Static scheduling is the model of pre-determining where and when tasks will run, instead of allowing this to be determined at runtime. Threads.@threads is "quasi-static" in the sense that it cuts the loop so that it spawns only as many tasks as there are threads, essentially assigning one even chunk of the input data to each thread.

Does this lack of runtime overhead mean that static scheduling is "better"? No, it simply has trade-offs. Static scheduling assumes that the runtime of each block is the same. For this specific case, where there is a fixed number of loop iterations for the dynamical systems, we know that every compute_trajectory_mean5 call costs exactly the same, and thus this will be more efficient. However, there are many cases where this might not be efficient. For example:

 function sleepmap_static()
   out = Vector{Int}(undef,24)
@@ -489,24 +489,24 @@ 

 A*B
 10000×10000 Matrix{Float64}:
- 2497.67  2491.1   2492.67  2489.33  …  2518.19  2454.25  2494.02  2505.91
- 2500.48  2487.5   2502.05  2504.62     2521.67  2485.92  2484.91  2500.74
- 2505.35  2492.7   2509.1   2517.35     2537.91  2482.31  2503.15  2530.22
- 2511.74  2497.84  2516.63  2507.37     2535.95  2493.89  2496.25  2518.27
- 2503.29  2496.67  2500.11  2503.93     2528.92  2469.13  2502.02  2521.17
- 2505.7   2494.62  2493.32  2504.15  …  2535.61  2482.82  2494.33  2511.56
- 2488.05  2475.19  2502.63  2499.82     2529.41  2471.18  2483.94  2508.2
- 2467.27  2471.25  2466.7   2474.89     2518.44  2446.91  2480.0   2485.43
- 2525.91  2513.54  2537.32  2523.61     2538.24  2503.12  2506.08  2528.71
- 2497.25  2500.94  2520.61  2517.3      2553.96  2483.94  2494.92  2547.87
+ 2495.47  2520.71  2530.22  2524.49  …  2503.61  2535.82  2511.56  2499.67
+ 2464.58  2466.29  2491.42  2449.62     2469.64  2499.86  2479.97  2463.95
+ 2452.32  2465.73  2482.12  2455.61     2460.24  2486.44  2472.71  2476.1
+ 2483.8   2485.67  2517.83  2485.45     2482.03  2513.04  2507.42  2497.89
+ 2485.62  2511.57  2516.88  2490.67     2482.53  2525.14  2514.93  2489.36
+ 2471.15  2479.17  2489.15  2474.51  …  2473.88  2505.56  2483.08  2466.69
+ 2469.96  2478.31  2491.2   2476.9      2480.98  2502.52  2488.43  2471.05
+ 2487.8   2501.18  2517.1   2493.89     2499.57  2517.53  2521.67  2506.1
+ 2512.49  2484.45  2511.12  2500.0      2498.04  2549.49  2534.7   2506.65
+ 2469.53  2475.42  2489.72  2443.06     2462.05  2506.42  2495.04  2473.58
     ⋮                                ⋱                             
- 2495.89  2485.45  2499.62  2504.25     2528.85  2476.86  2493.4   2492.0
- 2525.35  2499.89  2519.73  2515.7      2531.09  2495.28  2519.64  2534.39
- 2506.35  2505.91  2515.78  2527.56     2530.31  2481.43  2504.66  2515.94
- 2504.93  2490.24  2495.28  2521.8      2545.88  2468.61  2483.92  2509.99
- 2500.07  2478.29  2488.61  2500.22  …  2527.18  2488.51  2489.7   2500.94
- 2510.45  2498.04  2516.35  2512.16     2552.08  2484.41  2493.94  2530.27
- 2484.07  2482.82  2482.02  2505.73     2534.99  2455.2   2488.88  2498.46
- 2484.22  2466.34  2483.2   2490.26     2504.3   2463.77  2473.1   2493.58
- 2507.89  2483.82  2477.71  2499.6      2520.03  2463.15  2482.67  2500.59
-

If you are using a computer that has N cores, then this will use N cores. Try it and look at your resource usage!

Array-Based Parallelism

The simplest form of parallelism is array-based parallelism. The idea is that you use some construction of an array whose operations are already designed to be parallel under the hood. In Julia, some examples of this are:

  • DistributedArrays (Distributed Computing)

  • Elemental

  • MPIArrays

  • CuArrays (GPUs)

This is not a Julia specific idea either.

BLAS and Standard Libraries

The basic linear algebra calls are all handled by a set of libraries which follow the same interface known as BLAS (Basic Linear Algebra Subroutines). It's divided into 3 portions:

  • BLAS1: Element-wise operations (O(n))

  • BLAS2: Matrix-vector operations (O(n^2))

  • BLAS3: Matrix-matrix operations (O(n^3))

BLAS implementations are highly optimized, like OpenBLAS and Intel MKL, so every numerical language and library essentially builds on the same underlying BLAS implementations. Extensions to these, known as LAPACK, include operations like factorizations, and are included in these standard libraries. These are all multithreaded. The reason why this is a good level to target is that the operation count is high enough that parallelism can be made efficient even when only targeting this level: a matrix multiplication can take on the order of seconds, minutes, hours, or even days, and these are all highly parallel operations. This means you can get away with a lot just by parallelizing at this level, which happens to be a bottleneck for a lot of scientific computing codes.
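
To see this level of parallelism directly, one can vary the number of BLAS threads and time a matrix multiplication. A minimal sketch, assuming the standard LinearAlgebra and BenchmarkTools packages (the matrix size is arbitrary):

 using LinearAlgebra, BenchmarkTools
 A = rand(2000,2000); B = rand(2000,2000)
 BLAS.set_num_threads(1)                 # force a serial BLAS3 kernel
 @btime $A * $B
 BLAS.set_num_threads(Sys.CPU_THREADS)   # let BLAS use all cores
 @btime $A * $B

Any speedup between the two timings comes entirely from the multithreading inside the BLAS3 kernel; no user-level code changed.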

This is also commonly the level at which GPU computing occurs in machine learning libraries for reasons which we will explain later.

MPI

Well, this is a big topic and we'll address this one later!

Conclusion

The easiest forms of parallelism are:

  • Embarrassingly parallel

  • Array-level parallelism (built into linear algebra)

Exploit these when possible.

\ No newline at end of file + 2517.45 2501.76 2535.21 2516.96 2501.37 2534.94 2526.12 2499.41 + 2501.8 2519.93 2519.24 2477.3 2481.52 2531.88 2522.3 2507.0 + 2490.03 2511.51 2519.28 2503.13 2480.05 2524.13 2531.29 2504.18 + 2469.77 2469.62 2506.22 2454.79 2458.06 2492.25 2478.63 2469.99 + 2486.49 2471.17 2492.96 2473.14 … 2465.24 2514.3 2502.17 2485.15 + 2452.35 2472.2 2484.65 2468.31 2463.58 2504.26 2496.91 2463.18 + 2504.85 2501.1 2517.51 2487.68 2487.3 2515.24 2513.68 2502.75 + 2470.61 2495.72 2498.27 2473.8 2469.82 2512.21 2500.38 2480.41 + 2477.44 2484.09 2505.52 2468.3 2479.17 2498.62 2492.54 2463.26 +


\ No newline at end of file diff --git a/_weave/lecture06/styles_of_parallelism/index.html b/_weave/lecture06/styles_of_parallelism/index.html index 89536c93..f824b38d 100644 --- a/_weave/lecture06/styles_of_parallelism/index.html +++ b/_weave/lecture06/styles_of_parallelism/index.html @@ -9,26 +9,26 @@

 arr = [MyComplex(rand(),rand()) for i in 1:100]
 100-element Vector{MyComplex}:
- MyComplex(0.6878206973001487, 0.5944376681381183)
- MyComplex(0.876185978521746, 0.8834270064512657)
- MyComplex(0.7274511955405121, 0.9776333218870947)
- MyComplex(0.7860152352196623, 0.2246266273480778)
- MyComplex(0.42890688640506736, 0.15924501742288877)
- MyComplex(0.01060237947774434, 0.2262353380168034)
- MyComplex(0.7935771560827849, 0.1525403932328463)
- MyComplex(0.16117727903914125, 0.8243946095589878)
- MyComplex(0.5550542564876422, 0.3914851972778649)
- MyComplex(0.1343545153578216, 0.30599273978359387)
+ MyComplex(0.32516364337338777, 0.3550304843026272)
+ MyComplex(0.003144120188337096, 0.163179107928081)
+ MyComplex(0.07235316202378828, 0.5895002453826597)
+ MyComplex(0.2737074271530896, 0.3854369147402019)
+ MyComplex(0.14651058280465834, 0.25703921696007137)
+ MyComplex(0.9246961046559387, 0.3320769495992342)
+ MyComplex(0.7929175766889385, 0.8321792812407953)
+ MyComplex(0.8748941992144769, 0.3582692781165362)
+ MyComplex(0.94807059958605, 0.3693183290045081)
+ MyComplex(0.5393529457015298, 0.7326361499924592)
  ⋮
- MyComplex(0.42776724920957265, 0.7237887225861321)
- MyComplex(0.6342639668615949, 0.2563619040165326)
- MyComplex(0.07057459953946532, 0.3258356216180154)
- MyComplex(0.6917262339464759, 0.1363145249683415)
- MyComplex(0.23408343153685507, 0.8505351210651642)
- MyComplex(0.11196456763380669, 0.4193970173512319)
- MyComplex(0.6501836427783281, 0.058275727876870964)
- MyComplex(0.9806260355655791, 0.7003595452846337)
- MyComplex(0.01632409219317188, 0.9588968220373235)
+ MyComplex(0.5782671062296417, 0.4548938009032666)
+ MyComplex(0.8920956658422236, 0.028109929218517404)
+ MyComplex(0.27794806113432613, 0.9658640245583793)
+ MyComplex(0.6111486171406361, 0.7344804686656914)
+ MyComplex(0.09662923368940446, 0.5548190454939068)
+ MyComplex(0.1827962566614879, 0.41595303387734917)
+ MyComplex(0.5059171681027851, 0.7038191273302745)
+ MyComplex(0.7481225257229924, 0.6801002251820268)
+ MyComplex(0.1179925265666455, 0.30061080588876155)
 

is represented in memory as

[real1,imag1,real2,imag2,...]

while the struct of array format stores each component contiguously:

[real1,real2,...,realN], [imag1,imag2,...,imagN]
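
A minimal sketch of the two layouts, consistent with the usage above and the MyComplexes output below (this is an assumed reconstruction; the original definitions are not shown here):

 struct MyComplex           # array-of-structs element: real/imag pairs interleaved in memory
     real::Float64
     imag::Float64
 end

 struct MyComplexes         # struct-of-arrays: all reals together, all imags together
     real::Vector{Float64}
     imag::Vector{Float64}
 end

 arr2 = MyComplexes(rand(100), rand(100))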

@@ -43,18 +43,18 @@

-MyComplexes([0.5273266334521056, 0.6758175680585644, 0.166526088766354, 0.5
-535370802900049, 0.3706105595339403, 0.41554002467170703, 0.592171034860571
-3, 0.687445572159449, 0.4587923428284365, 0.2896374897304236  …  0.34479599
-43539092, 0.4601670959875128, 0.6729984392350403, 0.11784099516786106, 0.51
-75904664901904, 0.17071237273923245, 0.27028579620359694, 0.129621218441456
-35, 0.8376248105911108, 0.7568834682904622], [0.11843099277515678, 0.964674
-7111296081, 0.7932654641659607, 0.04037040139768633, 0.9438159545817943, 0.
-6272736660119286, 0.37233000892415713, 0.9339530622496862, 0.07322910120678
-683, 0.17352841715253697  …  0.9228774206720473, 0.28315860186168174, 0.788
-710859023613, 0.4871373335900553, 0.3812150269155198, 0.577326908234619, 0.
-7607815057072469, 0.6564758612650347, 0.3478709665140167, 0.327831535892906
-03])
+MyComplexes([0.7164779312690199, 0.5865544146333738, 0.90320198698556, 0.64
+28203752547009, 0.72203667868656, 0.9034505920162977, 0.8682221101684356, 0
+.7993643426972368, 0.10606753677087344, 0.6961824507525881  …  0.0390623120
+0873134, 0.3981890396653336, 0.8166985144117405, 0.7501680127601921, 0.8700
+763078059573, 0.8917956139913009, 0.21038830324164248, 0.26642517407150745,
+ 0.2841261157899654, 0.852858764291031], [0.30436828324609155, 0.3114792567
+638629, 0.32397797517215754, 0.30423234353914086, 0.5030364597671834, 0.284
+27331086202323, 0.38011862192567225, 0.9695695499725508, 0.0536101205990224
+8, 0.7453650930055286  …  0.7174599146800597, 0.3096056720083835, 0.4395626
+8897495865, 0.49109120743236956, 0.6499699516275391, 0.3152078820742673, 0.
+37054301749019425, 0.2592646036335612, 0.5410006630195909, 0.00057106714998
+11193])
 
@@ -73,7 +73,7 @@

The Different Flavors of Parallelism

Chris Rackauckas<
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:5 within `average`
-define void @julia_average_13534([2 x double]* noalias nocapture noundef no
+define void @julia_average_13513([2 x double]* noalias nocapture noundef no
 nnull sret([2 x double]) align 8 dereferenceable(16) %0, {}* noundef nonnul
 l align 16 dereferenceable(40) %1) #0 {
 top:
@@ -107,21 +107,21 @@ 

The Different Flavors of Parallelism

Chris Rackauckas< L8: ; preds = %top ; ││││││││││ @ reduce.jl:427 within `_mapreduce` - store {}* inttoptr (i64 139639388780752 to {}*), {}** %.sub, al + store {}* inttoptr (i64 139979357674032 to {}*), {}** %.sub, al ign 8 %7 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6 4 1 - store {}* inttoptr (i64 139639384143408 to {}*), {}** %7, align + store {}* inttoptr (i64 139979354801408 to {}*), {}** %7, align 8 %8 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6 4 2 store {}* %1, {}** %8, align 8 %9 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6 4 3 - store {}* inttoptr (i64 139639425355568 to {}*), {}** %9, align + store {}* inttoptr (i64 139979385976816 to {}*), {}** %9, align 8 - %10 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 1396393902 -75360 to {}*), {}** nonnull %.sub, i32 4, {}* inttoptr (i64 139637665489760 + %10 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 1399793752 +24496 to {}*), {}** nonnull %.sub, i32 4, {}* inttoptr (i64 139977605545056 to {}*)) call void @llvm.trap() unreachable @@ -197,7 +197,7 @@

The Different Flavors of Parallelism

Chris Rackauckas< L42: ; preds = %L14 ; ││││││││││ @ reduce.jl:442 within `_mapreduce` ; ││││││││││┌ @ reduce.jl:272 within `mapreduce_impl` - call void @j_mapreduce_impl_13536([2 x double]* noalias nocapt + call void @j_mapreduce_impl_13515([2 x double]* noalias nocapt ure noundef nonnull sret([2 x double]) %tmpcast, {}* nonnull %1, i64 signex t 1, i64 signext %6, i64 signext 1024) #0 ; └└└└└└└└└└└ @@ -364,7 +364,7 @@

Next Level Up: Multithreading

-178.399 μs (7 allocations: 640 bytes)
+246.699 μs (7 allocations: 640 bytes)
 
@@ -377,7 +377,7 @@

Next Level Up: Multithreading

-180.099 μs (7 allocations: 640 bytes)
+213.501 μs (7 allocations: 640 bytes)
 
@@ -390,7 +390,7 @@

Next Level Up: Multithreading

-57.799 μs (7 allocations: 640 bytes)
+73.001 μs (7 allocations: 640 bytes)
 
@@ -403,7 +403,7 @@

Next Level Up: Multithreading

-2.499 ns (0 allocations: 0 bytes)
+3.500 ns (0 allocations: 0 bytes)
 
@@ -418,17 +418,17 @@

Next Level Up: Multithreading

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:26 within `h`
-define void @julia_h_13637() #0 {
+define void @julia_h_13616() #0 {
 top:
-  %.promoted = load i64, i64* inttoptr (i64 139638901468496 to i64*), align
- 16
+  %.promoted = load i64, i64* inttoptr (i64 139977816649728 to i64*), align
+ 4096
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:28 within `h`
   %0 = add i64 %.promoted, 10000
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:29 within `h`
 ; ┌ @ Base.jl within `setproperty!`
-   store i64 %0, i64* inttoptr (i64 139638901468496 to i64*), align 16
+   store i64 %0, i64* inttoptr (i64 139977816649728 to i64*), align 4096
 ; └
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:30 within `h`
@@ -454,7 +454,7 @@ 

Next Level Up: Multithreading

-2.799 ns (0 allocations: 0 bytes)
+3.200 ns (0 allocations: 0 bytes)
 
@@ -467,13 +467,13 @@

Next Level Up: Multithreading

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:3 within `h2`
-define void @julia_h2_13644() #0 {
+define void @julia_h2_13623() #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:6 within `h2`
 ; ┌ @ refvalue.jl:56 within `getindex`
 ; │┌ @ Base.jl:37 within `getproperty`
-    %0 = load i64, i64* inttoptr (i64 139638902028752 to i64*), align 16
+    %0 = load i64, i64* inttoptr (i64 139977816937680 to i64*), align 16
 ; └└
 ; ┌ @ range.jl:5 within `Colon`
 ; │┌ @ range.jl:397 within `UnitRange`
@@ -483,15 +483,15 @@ 

Next Level Up: Multithreading

br i1 %.inv, label %L18.preheader, label %L34 L18.preheader: ; preds = %top - %.promoted = load i64, i64* inttoptr (i64 139638901468496 to i64*), align - 16 + %.promoted = load i64, i64* inttoptr (i64 139977816649728 to i64*), align + 4096 ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:8 within `h2` %1 = add i64 %.promoted, %0 ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:7 within `h2` ; ┌ @ Base.jl within `setproperty!` - store i64 %1, i64* inttoptr (i64 139638901468496 to i64*), align 16 + store i64 %1, i64* inttoptr (i64 139977816649728 to i64*), align 4096 ; └ ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:8 within `h2` @@ -522,7 +522,7 @@

Next Level Up: Multithreading

-114.188 ns (0 allocations: 0 bytes)
+160.279 ns (0 allocations: 0 bytes)
 
@@ -905,6 +905,6 @@

The Bait-and-switch: Parallelism is about Programming Models

\ No newline at end of file diff --git a/_weave/lecture07/discretizing_odes/index.html b/_weave/lecture07/discretizing_odes/index.html index b266e7f8..09401967 100644 --- a/_weave/lecture07/discretizing_odes/index.html +++ b/_weave/lecture07/discretizing_odes/index.html @@ -405,4 +405,4 @@

 plot(sol)

 plot(sol, xscale=:log10, tspan=(1e-6, 60), layout=(3,1))
-

Geometric Properties

Linear Ordinary Differential Equations

The simplest ordinary differential equation is the scalar linear ODE, which is given in the form

\[ u' = \alpha u \]

We can solve this by noticing that $(e^{\alpha t})^\prime = \alpha e^{\alpha t}$ satisfies the differential equation and thus the general solution is:

\[ u(t) = u(0)e^{\alpha t} \]

From the analytical solution we have that:

  • If $Re(\alpha) > 0$ then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$

  • If $Re(\alpha) < 0$ then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If $Re(\alpha) = 0$ then $u(t)$ has a constant or periodic solution.
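
As a concrete check of these cases, one can solve the scalar linear ODE numerically and compare against the analytical solution. A sketch assuming DifferentialEquations.jl (the value $\alpha = -2$ is arbitrary):

 using DifferentialEquations
 f(u,p,t) = p*u                       # u' = αu with p = α
 prob = ODEProblem(f, 1.0, (0.0, 5.0), -2.0)
 sol  = solve(prob, Tsit5())
 sol(5.0), exp(-2.0*5.0)              # numerical vs. analytical u(0)e^{αt}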

This theory can then be extended to multivariable systems in the same way as the discrete dynamics case. Let $u$ be a vector and have

\[ u' = Au \]

be a linear ordinary differential equation. Assuming $A$ is diagonalizable, we diagonalize $A = P^{-1}DP$ to get

\[ Pu' = DPu \]

and change coordinates $z = Pu$ so that we have

\[ z' = Dz \]

which decouples the equation into a system of linear ordinary differential equations which we solve individually. Thus we see that, similarly to the discrete dynamical system, we have that:

  • If all of the eigenvalues have negative real part, then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If any eigenvalue has positive real part, then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$

Nonlinear Ordinary Differential Equations

As with discrete dynamical systems, the geometric properties extend locally to the linearization of the continuous dynamical system as defined by:

\[ u' = \frac{df}{du} u \]

where $\frac{df}{du}$ is the Jacobian of the system. This is a consequence of the Hartman–Grobman Theorem.

Numerically Solving Ordinary Differential Equations

Euler's Method

To numerically solve an ordinary differential equation, one turns the continuous equation into a discrete equation by discretizing it. The simplest discretization is the Euler method. The Euler method can be thought of as a simple approximation replacing $dt$ with a small non-infinitesimal $\Delta t$. Thus we can approximate

\[ f(u,p,t) = u' = \frac{du}{dt} \approx \frac{\Delta u}{\Delta t} \]

and now since $\Delta u = u_{n+1} - u_n$ we have that

\[ \Delta t f(u,p,t) = u_{n+1} - u_n \]

We need to make a choice as to where we evaluate $f$. The simplest approximation is to evaluate it at $t_n$ with $u_n$, where we already have the data, and thus we re-arrange to get

\[ u_{n+1} = u_n + \Delta t f(u_n,p,t_n) \]

This is the Euler method.
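
A direct implementation of this update takes only a few lines (an illustrative sketch, not the optimized solvers discussed later):

 function euler(f, u0, p, tspan, Δt)
     u, t = u0, tspan[1]
     us, ts = [u0], [tspan[1]]
     while t < tspan[2]
         u = u + Δt*f(u,p,t)          # u_{n+1} = u_n + Δt f(u_n,p,t_n)
         t += Δt
         push!(us, u); push!(ts, t)
     end
     ts, us
 end

 ts, us = euler((u,p,t) -> p*u, 1.0, -2.0, (0.0, 5.0), 0.01)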

We can interpret it more rigorously by looking at the Taylor series expansion. First write out the Taylor series for the ODE's solution in the near future:

\[ u(t+\Delta t) = u(t) + \Delta t u'(t) + \frac{\Delta t^2}{2} u''(t) + \ldots \]

Recall that $u' = f(u,p,t)$ by the definition of the ODE system, and thus we have that

\[ u(t+\Delta t) = u(t) + \Delta t f(u,p,t) + \mathcal{O}(\Delta t^2) \]

This is a first order approximation because the error in our step can be expressed as an error in the derivative, i.e.

\[ \frac{u(t + \Delta t) - u(t)}{\Delta t} = f(u,p,t) + \mathcal{O}(\Delta t) \]

Higher Order Methods

We can use this analysis to extend our methods to higher order approximation by simply matching the Taylor series to a higher order. Intuitively, when we developed the Euler method we had to make a choice:

\[ u_{n+1} = u_n + \Delta t f(u,p,t) \]

where do we evaluate $f$? One may think that the best derivative approximation may come from the middle of the interval, in which case we might want to evaluate it at $t + \frac{\Delta t}{2}$. To do so, we can use the Euler method to approximate the value at $t + \frac{\Delta t}{2}$ and then use that value to approximate the derivative at $t + \frac{\Delta t}{2}$. This looks like:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ u_{n+1} = u_n + \Delta t k_2 \]

which we can also write as:

\[ u_{n+1} = u_n + \Delta t f(u_n + \frac{\Delta t}{2} f_n,p,t + \frac{\Delta t}{2}) \]

where $f_n = f(u_n,p,t)$. If we do the two-dimensional Taylor expansion we get:

\[ u_{n+1} = u_n + \Delta t f_n + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t)\\ + \frac{\Delta t^3}{6} (f_{tt} + 2f_{tu}f + f_{uu}f^2)(u_n,p,t) \]

which when we compare against the true Taylor series:

\[ u(t+\Delta t) = u_n + \Delta t f(u_n,p,t) + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t) + \frac{\Delta t^3}{6}(f_{tt} + 2f_{tu} + f_{uu}f^2 + f_t f_u + f_u^2 f)(u_n,p,t) \]

and thus we see that

\[ u(t + \Delta t) - u_n = \mathcal{O}(\Delta t^3) \]
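
In code, this midpoint (explicit second-order Runge-Kutta) step is a small extension of the Euler step, shown here as an illustrative sketch:

 function midpoint_step(f, u, p, t, Δt)
     k1 = f(u, p, t)
     k2 = f(u + Δt/2*k1, p, t + Δt/2)
     u + Δt*k2
 end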

Runge-Kutta Methods

More generally, Runge-Kutta methods are of the form:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \Delta t (a_{21} k_1),p,t + \Delta t c_2)\\ k_3 = f(u_n + \Delta t (a_{31} k_1 + a_{32} k_2),p,t + \Delta t c_3)\\ \vdots \\ u_{n+1} = u_n + \Delta t (b_1 k_1 + \ldots + b_s k_s) \]

where $s$ is the number of stages. These can be expressed as a tableau:

The order of the Runge-Kutta method is simply the number of terms in the Taylor series that ends up being matched by the resulting expansion. For example, for the 4th order you can expand out and see that the following equations need to be satisfied:

The classic Runge-Kutta method is also known as RK4 and is the following 4th order method:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ k_3 = f(u_n + \frac{\Delta t}{2} k_2,p,t + \frac{\Delta t}{2})\\ k_4 = f(u_n + \Delta t k_3,p,t + \Delta t)\\ u_{n+1} = u_n + \frac{\Delta t}{6}(k_1 + 2 k_2 + 2 k_3 + k_4)\\ \]
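
Written as a stepping function mirroring the equations above (a sketch):

 function rk4_step(f, u, p, t, Δt)
     k1 = f(u, p, t)
     k2 = f(u + Δt/2*k1, p, t + Δt/2)
     k3 = f(u + Δt/2*k2, p, t + Δt/2)
     k4 = f(u + Δt*k3,   p, t + Δt)
     u + Δt/6*(k1 + 2k2 + 2k3 + k4)
 end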

While it's widely known and simple to remember, it's not necessarily good. The way to judge a Runge-Kutta method is by looking at the size of the coefficient of the next term in the Taylor series: if it's large then the true error can be larger, even if it matches another one asymptotically.

What Makes a Good Method?

Leading Truncation Coefficients

For given orders of explicit Runge-Kutta methods, lower bounds for the number of f evaluations (stages) required to achieve a given order are known:

While unintuitive, the best method to use is not necessarily the one that reduces the leading coefficient the most. The reason is that what is attempted in ODE solving is precisely the opposite of the analysis: in the analysis, we're looking at behavior as $\Delta t \rightarrow 0$, whereas when efficiently solving ODEs, we want to use the largest $\Delta t$ which satisfies the error tolerances.

The most widely used method is the Dormand-Prince 5th order Runge-Kutta method, whose tableau is represented as:

Notice that this method takes 7 calls to f for 5th order. The key to this method is that it has optimized leading truncation error coefficients, under some extra assumptions which allow for the analysis to be simplified.

Looking at the Effects of RK Method Choices and Code Optimizations

Pulling from the SciML Benchmarks, we can see the general effect of these different properties on a given set of Runge-Kutta methods:

Here, the order of the method is given in the name. We can see one immediate factor is that, as the requested error in the calculation decreases, the higher order methods become more efficient. This is because to decrease error, you decrease $\Delta t$, and thus the exponent difference with respect to $\Delta t$ has more of a chance to pay off for the extra calls to f. Additionally, we can see that order is not the only determining factor for efficiency: the Vern8 method seems to have a clear approximate 2.5x performance advantage over the whole span of the benchmark compared to the DP8 method, even though both are 8th order methods. This is because of the leading truncation terms: with a small enough $\Delta t$, the more optimized method (Vern8) will generally have low error in a step for the same $\Delta t$ because the coefficients in the expansion are generally smaller.

This is a factor which is generally ignored in high level discussions of numerical differential equations, but can lead to orders of magnitude differences! This is highlighted in the following plot:

Here we see ODEInterface.jl's ODEInterfaceDiffEq.jl wrapper into the SciML common interface for the standard dopri method from Fortran, and ODE.jl, the original ODE solvers in Julia, have a performance disadvantage compared to the DifferentialEquations.jl methods due in part to some of the coding performance pieces that we discussed in the first few lectures.

Specifically, a large part of this can be attributed to inlining of the higher order functions, i.e. ODEs are defined by a user function and then have to be called from the solver. If the solver code is compiled as a shared library ahead of time, like is commonly done in C++ or Fortran, then there can be a function call overhead that is eliminated by JIT compilation optimizing across the function call barriers (known as interprocedural optimization). This is one way which a JIT system can outperform an AOT (ahead of time) compiled system in real-world code (for completeness, two other ways are by doing full function specialization, which is something that is not generally possible in AOT languages given that you cannot know all types ahead of time for a fully generic function, and calling C itself, i.e. c-ffi (foreign function interface), can be optimized using the runtime information of the JIT compiler to outperform C!).

The other performance difference being shown here is due to optimization of the method. While a slightly different order, we can see a clear difference in the performance of RK4 vs the coefficient optimized methods. It's about the same order of magnitude as "highly optimized code differences", showing that both the Runge-Kutta coefficients and the code implementation can have a significant impact on performance.

Taking a look at what happens when interpreted languages get involved highlights some of the code challenges in this domain. Let's take a look at for example the results when simulating 3 ODE systems with the various RK methods:

We see that using interpreted languages introduces around a 50x-100x performance penalty. If you recall from the previous lecture, the discrete dynamical system that was being simulated was the 3-dimensional Lorenz equation discretized by Euler's method, meaning that the performance of that implementation is a good proxy for understanding the performance differences in this graph. Recall that in previous lectures we saw an approximately 5x performance advantage when specializing on the system function and size and around 10x by reducing allocations: these features account for the performance differences noticed between library implementations, which are then compounded by the use of different RK methods (note that R uses "call by copy" which further increases the memory usage and makes standard usage of the language incompatible with mutating function calls!).

Stability of a Method

Simply having an order on the truncation error does not imply convergence of the method. The disconnect is that the errors at a given time point may not dissipate. What also needs to be checked is the asymptotic behavior of a disturbance. To see this, one can utilize the linear test problem:

\[ u' = \alpha u \]

and ask the question, does the discrete dynamical system defined by the discretized ODE end up going to zero? You would hope that the discretized dynamical system and the continuous dynamical system have the same properties in this simple case, and this is known as linear stability analysis of the method.

As an example, take a look at the Euler method. Recall that the Euler method was given by:

\[ u_{n+1} = u_n + \Delta t f(u_n,p,t) \]

When we plug in the linear test equation, we get that

\[ u_{n+1} = u_n + \Delta t \alpha u_n \]

If we let $z = \Delta t \alpha$, then we get the following:

\[ u_{n+1} = u_n + z u_n = (1+z)u_n \]

which is stable when $z$ is in the shifted unit circle. This means that, as a necessary condition, the step size $\Delta t$ needs to be small enough that $z$ satisfies this condition, placing a stepsize limit on the method.
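
With the amplification factor $1+z$ one can check directly whether a given $\Delta t$ is stable for a given $\alpha$ (a small illustrative check; the numbers are arbitrary):

 g(z) = 1 + z                  # Euler amplification factor
 α = -10.0
 abs(g(0.1*α))  < 1            # Δt = 0.1  → |1+z| = 0,   stable
 abs(g(0.25*α)) < 1            # Δt = 0.25 → |1+z| = 1.5, unstable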

If $\Delta t$ is ever too large, it will cause the equation to overshoot zero, which then causes oscillations that spiral out to infinity.

Thus the stability condition places a hard constraint on the allowed $\Delta t$ which will result in a realistic simulation.

For reference, the stability regions of the 2nd and 4th order Runge-Kutta methods that we discussed are as follows:

Interpretation of the Linear Stability Condition

To interpret the linear stability condition, recall that the linearization of a system interprets the dynamics as locally being due to the Jacobian of the system. Thus

\[ u' = f(u,p,t) \]

is locally equivalent to

\[ u' = \frac{df}{du}u \]

You can understand the local behavior by diagonalizing this matrix. Therefore, the scalar linear stability analysis is effectively an analysis on the eigenvalues of the Jacobian: the method will be stable if $\Delta t$ times the eigenvalues of df/du all lie within the stability region. This means that stability effects are different throughout the solution of a nonlinear equation and are generally understood locally (though different, more comprehensive stability conditions exist!).

Implicit Methods

If instead of the Euler method we defined $f$ to be evaluated at the future point, we would receive a method like:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t+\Delta t) \]

in which case, for the stability calculation we would have that

\[ u_{n+1} = u_n + \Delta t \alpha u_n \]

or

\[ (1-z) u_{n+1} = u_n \]

which means that

\[ u_{n+1} = \frac{1}{1-z} u_n \]

which is stable for all $Re(z) < 0$, a property known as A-stability. It is also stable as $z \rightarrow \infty$, a property known as L-stability. This means that for equations with very ill-conditioned Jacobians, this method is still able to use reasonably large stepsizes and can thus be efficient.
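
On the linear test equation the implicit update can be solved in closed form, which makes the unconditional stability easy to see (an illustrative sketch):

 backward_euler_step(u, α, Δt) = u / (1 - Δt*α)   # solves u_{n+1} = u_n + Δt*α*u_{n+1}

 let u = 1.0
     for _ in 1:10
         u = backward_euler_step(u, -1.0e6, 0.1)  # Δt*|α| = 1e5, far beyond the explicit limit
     end
     u                                            # decays toward zero instead of blowing up
 end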

Stiffness and Timescale Separation

From this we see that there is a maximal stepsize whenever the eigenvalues of the Jacobian are sufficiently large. It turns out that's not an issue if the phenomena we see are fast, since then the total integration time tends to be small. However, if we have some equations with both fast modes and slow modes, like the Robertson equation, then it is very difficult because in order to resolve the slow dynamics over a long timespan, one needs to ensure that the fast dynamics do not diverge. This is a property known as stiffness. Stiffness can thus be approximated in some sense by the condition number of the Jacobian, which here can be taken as the ratio of its largest to smallest eigenvalue magnitude and gives a rough measure of the local timescale separation. If this value is large and one wants to resolve the slow dynamics, then explicit integrators, like the explicit Runge-Kutta methods described before, have issues with stability. In this case implicit integrators (or other forms of stabilized stepping) are required in order to efficiently reach the final time.
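
A tiny illustration of why timescale separation forces small explicit steps (the eigenvalues here are made up for the example):

 λs = (-1.0, -1.0e6)      # slow mode and fast mode of the (diagonalized) Jacobian
 # Explicit Euler needs |1 + Δt*λ| < 1 for every eigenvalue, so the fast mode
 # forces Δt < 2e-6 even though the slow dynamics evolve on a timescale of 1.
 Δt_max = minimum(2/abs(λ) for λ in λs)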

Exploiting Continuity

So far, we have looked at ordinary differential equations as a $\Delta t \rightarrow 0$ formulation of a discrete dynamical system. However, continuous dynamics and discrete dynamics have very different characteristics which can be utilized in order to arrive at simpler models and faster computations.

Geometric Properties: No Jumping and the Poincaré–Bendixson theorem

In terms of geometric properties, continuity places a large constraint on the possible dynamics. This is because of the physical constraint on "jumping", i.e. flows of differential equations cannot jump over each other. If you are ever at some point in phase space and $f$ is not explicitly time-dependent, then the direction of $u'$ is uniquely determined (given reasonable assumptions on $f$), meaning that flow lines (solutions to the differential equation) can never cross.

A result from this is the Poincaré–Bendixson theorem, which states that, with any arbitrary (but nice) two dimensional continuous system, you can only have 3 behaviors:

  • Steady state behavior

  • Divergence

  • Periodic orbits

A simple proof by picture shows this.

\ No newline at end of file +


\ No newline at end of file diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_12_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_12_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_12_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_12_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_13_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_13_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_13_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_13_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_14_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_14_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_14_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_14_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_15_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_15_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_15_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_15_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_16_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_16_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_16_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_16_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_19_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_19_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_19_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_19_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_20_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_20_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_20_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_20_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_7_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_7_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_7_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_7_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_8_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_8_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_8_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_8_1.png diff --git a/_weave/lecture07/jl_twkCVE/discretizing_odes_9_1.png b/_weave/lecture07/jl_jJrxnO/discretizing_odes_9_1.png similarity index 100% rename from _weave/lecture07/jl_twkCVE/discretizing_odes_9_1.png rename to _weave/lecture07/jl_jJrxnO/discretizing_odes_9_1.png diff --git a/_weave/lecture08/automatic_differentiation/index.html b/_weave/lecture08/automatic_differentiation/index.html index d85fb47b..31198afe 100644 --- a/_weave/lecture08/automatic_differentiation/index.html +++ b/_weave/lecture08/automatic_differentiation/index.html @@ -18,9 +18,9 @@

 ϵ2 = (1+ϵ) - 1
 (ϵ - ϵ2)

-ϵ = 9.831067687145973e-11
-1 + ϵ = 1.0000000000983107
--1.6048307731825555e-17
+ϵ = 7.470048420814885e-11
+1 + ϵ = 1.0000000000747005
+1.6174664895368224e-17
 

See how $\epsilon$ is only rebuilt at accuracy around $10^{-16}$ and thus we only keep around 6 digits of accuracy when it's generated at the size of around $10^{-10}$!

Finite Differencing and Numerical Stability

To start understanding how to compute derivatives on a computer, we start with finite differencing. For finite differencing, recall that the definition of the derivative is:

\[ f'(x) = \lim_{\epsilon \rightarrow 0} \frac{f(x+\epsilon)-f(x)}{\epsilon} \]

Finite differencing directly follows from this definition by choosing a small $\epsilon$. However, choosing a good $\epsilon$ is very difficult. If $\epsilon$ is too large then there is error since this definition is asymptotic. However, if $\epsilon$ is too small, you receive roundoff error. To understand why you would get roundoff error, recall that floating point accuracy is relative, essentially storing 16 digits. So let's say we choose $\epsilon = 10^{-6}$. Then $f(x+\epsilon)$ and $f(x)$ agree in roughly the first 6 digits, meaning that after the subtraction there are only 10 digits of accuracy, and then dividing by $10^{-6}$ simply brings those 10 digits back up to the correct relative size.

This means that we want to choose $\epsilon$ small enough that the $\mathcal{O}(\epsilon)$ truncation error is balanced by the $\mathcal{O}(E/\epsilon)$ roundoff error. Under some minor assumptions, one can argue that the best choice is approximately $\sqrt{E}$, where $E$ is the machine epsilon.

 @show eps(Float64)
 @show sqrt(eps(Float64))
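
For instance, here is a quick experiment (a sketch, not part of the lecture code) differentiating $\sin$ at $x=1$ with forward differences: the error shrinks as $\epsilon$ decreases until roundoff takes over near $\sqrt{\text{eps}}$.

 g(x) = sin(x)
 for ϵ in (1e-4, 1e-6, sqrt(eps(Float64)), 1e-10, 1e-13)
     fd = (g(1.0 + ϵ) - g(1.0)) / ϵ
     println("ϵ = ", ϵ, "   error = ", abs(fd - cos(1.0)))
 end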
@@ -85,7 +85,7 @@ 

Forward-Mode Automatic Differentiation (AD) via High Dimensiona a, b, c, d = 1, 2, 3, 4 @btime add($(Ref(a))[], $(Ref(b))[], $(Ref(c))[], $(Ref(d))[])

-3.699 ns (0 allocations: 0 bytes)
+4.200 ns (0 allocations: 0 bytes)
 (4, 6)
 
 a = Dual(1, 2)
@@ -95,17 +95,17 @@ 

Forward-Mode Automatic Differentiation (AD) via High Dimensiona add(a, b) @btime add($(Ref(a))[], $(Ref(b))[])

-3.499 ns (0 allocations: 0 bytes)
+3.900 ns (0 allocations: 0 bytes)
 Dual{Int64}(4, 6)
 

It seems like we have lost no performance.

 @code_native add(1, 2, 3, 4)
 
 .text
 	.file	"add"
-	.globl	julia_add_16015                 # -- Begin function julia_add_16015
+	.globl	julia_add_15994                 # -- Begin function julia_add_15994
 	.p2align	4, 0x90
-	.type	julia_add_16015,@function
-julia_add_16015:                        # @julia_add_16015
+	.type	julia_add_15994,@function
+julia_add_15994:                        # @julia_add_15994
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture08/automatic_diff
 erentiation.jmd:2 within `add`
 	.cfi_startproc
@@ -126,7 +126,7 @@ 

Forward-Mode Automatic Differentiation (AD) via High Dimensiona .cfi_def_cfa %rsp, 8 retq .Lfunc_end0: - .size julia_add_16015, .Lfunc_end0-julia_add_16015 + .size julia_add_15994, .Lfunc_end0-julia_add_15994 .cfi_endproc ; └ # -- End function @@ -136,10 +136,10 @@

Forward-Mode Automatic Differentiation (AD) via High Dimensiona

 .text
 	.file	"add"
-	.globl	julia_add_16017                 # -- Begin function julia_add_16017
+	.globl	julia_add_15996                 # -- Begin function julia_add_15996
 	.p2align	4, 0x90
-	.type	julia_add_16017,@function
-julia_add_16017:                        # @julia_add_16017
+	.type	julia_add_15996,@function
+julia_add_15996:                        # @julia_add_15996
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture08/automatic_diff
 erentiation.jmd:5 within `add`
 	.cfi_startproc
@@ -160,7 +160,7 @@ 

Forward-Mode Automatic Differentiation (AD) via High Dimensiona .cfi_def_cfa %rsp, 8 retq .Lfunc_end0: - .size julia_add_16017, .Lfunc_end0-julia_add_16017 + .size julia_add_15996, .Lfunc_end0-julia_add_15996 .cfi_endproc ; └ # -- End function @@ -323,4 +323,4 @@

Forward-Mode Automatic Differentiation (AD) via High Dimensiona 2-element SVector{2, Float64} with indices SOneTo(2): 0.7071067811865476 0.7071067811865476 -

Conclusion

To make derivative calculations efficient and correct, we can move to higher dimensional numbers. In multiple dimensions, these then allow for multiple directional derivatives to be computed simultaneously, giving a method for computing the Jacobian of a function $f$ on a single input. This is a direct application of using the compiler as part of a mathematical framework.

References

  • John L. Bell, An Invitation to Smooth Infinitesimal Analysis, http://publish.uwo.ca/~jbell/invitation%20to%20SIA.pdf

  • Bell, John L. A Primer of Infinitesimal Analysis

  • Nocedal & Wright, Numerical Optimization, Chapter 8

  • Griewank & Walther, Evaluating Derivatives

Many thanks to David Sanders for helping make these lecture notes.

\ No newline at end of file +

Conclusion

To make derivative calculations efficient and correct, we can move to higher dimensional numbers. In multiple dimensions, these then allow for multiple directional derivatives to be computed simultaneously, giving a method for computing the Jacobian of a function $f$ on a single input. This is a direct application of using the compiler as part of a mathematical framework.

References

  • John L. Bell, An Invitation to Smooth Infinitesimal Analysis, http://publish.uwo.ca/~jbell/invitation%20to%20SIA.pdf

  • Bell, John L. A Primer of Infinitesimal Analysis

  • Nocedal & Wright, Numerical Optimization, Chapter 8

  • Griewank & Walther, Evaluating Derivatives

Many thanks to David Sanders for helping make these lecture notes.

\ No newline at end of file diff --git a/_weave/lecture09/stiff_odes/index.html b/_weave/lecture09/stiff_odes/index.html index c5b48a28..0808f7a9 100644 --- a/_weave/lecture09/stiff_odes/index.html +++ b/_weave/lecture09/stiff_odes/index.html @@ -1 +1 @@ -

Solving Stiff Ordinary Differential Equations

Chris Rackauckas
October 14th, 2020

Youtube Video Link

We have previously shown how to solve non-stiff ODEs via optimized Runge-Kutta methods, but we ended by showing that there is a fundamental limitation of these methods when attempting to solve stiff ordinary differential equations. However, we can get around these limitations by using different types of methods, like implicit Euler. Let's now go down the path of understanding how to efficiently implement stiff ordinary differential equation solvers, and its interaction with other domains like automatic differentiation.

When one is solving a large-scale scientific computing problem with MPI, this is almost always the piece of code where all of the time is spent, so let's understand what it's doing.

Newton's Method and Jacobians

Recall that the implicit Euler method is the following:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t + \Delta t) \]

If we wanted to use this method, we would need to find out how to get the value $u_{n+1}$ when only knowing the value $u_n$. To do so, we can move everything to one side:

\[ u_{n+1} - \Delta t f(u_{n+1},p,t + \Delta t) - u_n = 0 \]

and now we have a problem

\[ g(u_{n+1}) = 0 \]

This is the classic rootfinding problem $g(x)=0$: find $x$. The way that we solve the rootfinding problem is, once again, by replacing this problem about a continuous function $g$ with a discrete dynamical system whose steady state is the solution to $g(x)=0$. There are many methods for this, but some choices of the rootfinding method affect the stability of the ODE solver itself, since we need to make sure that the steady state solution is a stable steady state of the iteration process; otherwise the rootfinding method will diverge (this will be explored in the homework).

Thus for example, fixed point iteration is not appropriate for stiff differential equations. Methods which are used in the stiff case are either Anderson Acceleration or Newton's method. Newton's is by far the most common (and generally performs the best), so we can go down this route.

Let's use the syntax $g(x)=0$. Here we need some starting value $x_0$ as our first guess for $u_{n+1}$. The easiest guess is $u_{n}$, though additional information about the equation can be used to compute a better starting value (known as a step predictor). Once we have a starting value, we run the iteration:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

where $J(x_k)$ is the Jacobian of $g$ at the point $x_k$. However, the mathematical formulation is never the syntax that you should use for the actual application! Instead, numerically this is two stages:

  • Solve $Ja=g(x_k)$ for $a$

  • Update $x_{k+1} = x_k - a$

By doing this, we can turn the matrix inversion into a problem of a linear solve and then an update. The reason this is done is manyfold, but one major reason is that the inverse of a sparse matrix can be dense, and this Jacobian is in many cases (PDEs) a large and sparse matrix, so its inverse may not even fit into memory.
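
As a minimal sketch of this two-stage process (illustrative only; the right-hand side, step size, and tolerances below are made up for the example), one implicit Euler step can be solved with Newton iterations where each step is a linear solve followed by an update:

 using LinearAlgebra, ForwardDiff
 f(u, p, t) = [-10.0 * u[1] + u[2], -u[2]]               # illustrative stiff-ish right-hand side
 function implicit_euler_step(f, un, p, t, dt; tol = 1e-10, maxiters = 10)
     g(x) = x .- dt .* f(x, p, t + dt) .- un             # residual g(u_{n+1}) = 0
     x = copy(un)                                        # easiest predictor: start from u_n
     for _ in 1:maxiters
         J = ForwardDiff.jacobian(g, x)                  # Jacobian of g at the current iterate
         a = J \ g(x)                                    # stage 1: solve J a = g(x_k)
         x -= a                                          # stage 2: update x_{k+1} = x_k - a
         norm(a) < tol && break
     end
     x
 end
 implicit_euler_step(f, [1.0, 1.0], nothing, 0.0, 0.1)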

Now let's break this down step by step.

Some Quick Notes

The Jacobian of $g$ can also be written as $J = I - \gamma \frac{df}{du}$ for the ODE $u' = f(u,p,t)$, where $\gamma = \Delta t$ for the implicit Euler method. This general form holds for all other (SDIRK) implicit methods, changing the value of $\gamma$. Additionally, the class of Rosenbrock methods solves a linear system with exactly the same $J$, meaning that essentially all implicit and semi-implicit ODE solvers have to do the same Newton iteration process on the same structure. This is the portion of the code that is generally the bottleneck.

Additionally, if one is solving a mass matrix ODE: $Mu' = f(u,p,t)$, exactly the same treatment can be had with $J = M - \gamma \frac{df}{du}$. This works even if $M$ is singular, a case known as a differential-algebraic equation or a DAE. A DAE for example can be an ODE with constraint equations, and these structures can be represented as an ODE where these constraints lead to a singularity in the mass matrix (a row of all zeros in the mass matrix means that equation has no derivative term: it is purely the algebraic constraint that the right-hand side equals zero!).

Generation of the Jacobian

Dense Finite Differences and Forward-Mode AD

Recall that the Jacobian is the matrix of $\frac{df_i}{dx_j}$ for $f$ a vector-valued function. The simplest way to generate the Jacobian is through finite differences. For each $h_j = h e_j$ for $e_j$ the basis vector of the $j$th axis and some sufficiently small $h$, then we can compute column $j$ of the Jacobian by:

\[ \frac{f(x+h_j)-f(x)}{h} \]

Thus $m+1$ applications of $f$ are required to compute the full Jacobian.

This can be improved by using forward-mode automatic differentiation. Recall that we can formulate a multidimensional dual number of the form

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

We can then seed the vectors $v_j = h_j$ so that the differentiation directions are along the basis vectors, and then the output dual is the result:

\[ f(d) = f(x) + J_1 \epsilon_1 + \ldots + J_m \epsilon_m \]

where $J_j$ is the $j$th column of the Jacobian. And thus with one calculation of the primal (f(x)) we have calculated the entire Jacobian.
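
For instance (a sketch using ForwardDiff.jl on a small illustrative function, the same one used in the sparse example below), one forward-mode call produces every column of the Jacobian:

 using ForwardDiff
 f(x) = [x[1] + x[3], x[2] * x[3], x[1]]
 ForwardDiff.jacobian(f, [1.0, 2.0, 3.0])    # 3×3 Jacobian; each column is one seeded direction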

Sparse Differentiation and Matrix Coloring

However, when the Jacobian is sparse we can compute it much faster. We can understand this by looking at the following system:

\[ f(x)=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}\\ x_{1} \end{array}\right] \]

Notice that in 3 differencing steps we can calculate:

\[ f(x+\epsilon e_{1})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}\\ x_{1}+\epsilon \end{array}\right] \]

\[ f(x+\epsilon e_{2})=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}+\epsilon x_{3}\\ x_{1} \end{array}\right] \]

\[ f(x+\epsilon e_{3})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}+\epsilon x_{2}\\ x_{1} \end{array}\right] \]

and thus:

\[ \frac{f(x+\epsilon e_{1})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ 0\\ 1 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 0\\ x_{3}\\ 0 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{3})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{2}\\ 0 \end{array}\right] \]

But notice that the calculation of $e_1$ and $e_2$ do not interact. If we had done:

\[ \frac{f(x+\epsilon e_{1}+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{3}\\ 1 \end{array}\right] \]

we would still get the correct value for every row because the $\epsilon$ terms do not collide (if they did collide, we would have the situation known as perturbation confusion). If we knew the sparsity pattern of the Jacobian included a 0 at (2,1), (1,2), and (3,2), then we would know that the vectors would have to be $[1 0 1]$ and $[0 x_3 0]$, meaning that columns 1 and 2 can be computed simultaneously and decompressed. This is the key to sparse differentiation.

With forward-mode automatic differentiation, recall that we calculate multiple dimensions simultaneously by using a multidimensional dual number seeded by the vectors of the differentiation directions, that is:

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

Instead of using the primitive differentiation directions $e_j$, we can instead replace this with the mixed values. For example, the Jacobian of the example function can be computed in one function call to $f$ with the dual number input:

\[ d = x + (e_1 + e_2) \epsilon_1 + e_3 \epsilon_2 \]

and performing the decompression via the sparsity pattern. Thus the sparsity pattern gives a direct way to optimize the construction of the Jacobian.

This idea of independent directions can be formalized as a matrix coloring. Take $S_{ij}$ to be the sparsity pattern of some Jacobian matrix $J_{ij}$. Define a graph on the nodes $1$ through $m$ where there is an edge between $i$ and $j$ if there is a row in which both columns $i$ and $j$ are non-zero. This graph is the column connectivity graph of the Jacobian. What we wish to do is find the smallest set of differentiation directions such that differentiating in the direction of $e_i$ does not collide with differentiation in the direction of $e_j$. The connectivity graph is set up so that this cannot be done if the two nodes are adjacent. If we let the subset of nodes differentiated together be a color, the question is: what is the smallest number of colors such that no adjacent nodes share the same color? This is the classic distance-1 coloring problem from graph theory. It is well-known that the problem of finding the chromatic number, the minimal number of colors for a graph, is generally NP-complete. However, there are heuristic methods for performing a distance-1 coloring quite quickly. For example, a greedy algorithm is as follows:

  • Pick a node at random to be color 1.

  • Make all nodes adjacent to that be the lowest color that they can be (in this step that will be 2).

  • Now look at all nodes adjacent to that. Make all nodes be the lowest color that they can be (either 1 or 3).

  • Repeat by looking at the next set of adjacent nodes and color as conservatively as possible.

This can be visualized as follows:

The result will color the entire connected component. While not giving an optimal result, it will still give a result that is a sufficient reduction in the number of differentiation directions (without solving an NP-complete problem) and thus can lead to a large computational saving.
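
A rough sketch of this procedure (not the lecture's implementation; it processes the columns in index order rather than walking outward through the adjacencies) on the sparsity pattern of the example function above:

 # Build the column connectivity graph from a sparsity pattern S (nonzero marks a structural entry).
 function column_connectivity(S)
     adj = [Set{Int}() for _ in 1:size(S, 2)]
     for row in eachrow(S)
         cols = findall(!iszero, row)
         for i in cols, j in cols
             i != j && push!(adj[i], j)
         end
     end
     adj
 end
 # Greedy distance-1 coloring: give each column the lowest color not used by its neighbors.
 function greedy_coloring(adj)
     colors = zeros(Int, length(adj))
     for v in 1:length(adj)
         taken = Set(colors[u] for u in adj[v] if colors[u] != 0)
         c = 1
         while c in taken
             c += 1
         end
         colors[v] = c
     end
     colors
 end
 S = [1 0 1; 0 1 1; 1 0 0]                      # sparsity pattern of the example f above
 greedy_coloring(column_connectivity(S))        # returns [1, 1, 2]: columns 1 and 2 share a color

(In practice one would use an optimized library implementation, e.g. the coloring routines in SparseDiffTools.jl, but the idea is the same.)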

At the end, let $c_i$ be the vector of 1's and 0's, where it's 1 for every node that is color $i$ and 0 otherwise. Sparse automatic differentiation of the Jacobian is then computed with:

\[ d = x + c_1 \epsilon_1 + \ldots + c_k \epsilon_k \]

that is, the full Jacobian is computed with one dual number which consists of the primal calculation along with $k$ dual dimensions, where $k$ is the computed chromatic number of the connectivity graph on the Jacobian. Once this calculation is complete, the colored columns can be decompressed into the full Jacobian using the sparsity information, generating the original quantity that we wanted to compute.

For more information on the graph coloring aspects, find the paper titled "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives" by Gebremedhin.

Note on Sparse Reverse-Mode AD

Reverse-mode automatic differentiation can be thought of as a method for computing one row of a Jacobian per seed, as opposed to the one column per seed given by forward-mode AD. Thus sparse reverse-mode automatic differentiation can be done by building the analogous connectivity graph over the rows and using the resulting color vectors to seed the reverse accumulation process.

Linear Solving

After the Jacobian has been computed, we need to solve a linear equation $Ja=b$. While mathematically you can solve this by computing the inverse $J^{-1}$, this is not a good way to perform the calculation because even if $J$ is sparse, then $J^{-1}$ is in general dense and thus may not fit into memory (remember, this is $N^2$ as many terms, where $N$ is the size of the ordinary differential equation that is being solved, so if it's a large equation it is very feasible and common that the ODE is representable but its full Jacobian is not able to fit into RAM). Note that some may say that this is done for numerical stability reasons: that is incorrect. In fact, under reasonable assumptions for how the inverse is computed, it will be as numerically stable as other techniques we will mention.

Thus instead of generating the inverse, we can instead perform a matrix factorization. A matrix factorization is a transformation of the matrix into a form that is more amenable to certain analyses. For our purposes, a general Jacobian within a Newton iteration can be transformed via the LU-factorization or (LU-decomposition), i.e.

\[ J = LU \]

where $L$ is lower triangular and $U$ is upper triangular. If we write the linear equation in this form:

\[ LUa = b \]

then we see that we can solve it by first solving $L(Ua) = b$ for the intermediate vector $Ua$. Since $L$ is lower triangular, this is done by forward substitution, working from the first row downward. That is, in a lower triangular form, we can solve for the first value since we have:

\[ L_{11} a_1 = b_1 \]

and thus by dividing we solve. For the next term, we have that

\[ L_{21} a_1 + L_{22} a_2 = b_2 \]

and thus we plug in the solution to $a_1$ and solve to get $a_2$. The lower triangular form allows this to continue. This occurs in $1+2+3+\ldots+n$ operations, and is thus $O(n^2)$. Next, we solve $Ua$ equal to the intermediate vector we just computed, which is done by the same substitution process but run in the reverse direction (back substitution). Together those two triangular solves are $O(n^2)$ and complete the solve with the $LU$ factors.
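
As an illustrative sketch (not production code), the two triangular solves look like this, with the factorization done once beforehand:

 using LinearAlgebra
 function solve_with_lu(L, U, b)
     n = length(b)
     z = zeros(n)
     for i in 1:n                                # forward substitution with the lower triangle
         z[i] = (b[i] - dot(L[i, 1:i-1], z[1:i-1])) / L[i, i]
     end
     a = zeros(n)
     for i in n:-1:1                             # back substitution with the upper triangle
         a[i] = (z[i] - dot(U[i, i+1:n], a[i+1:n])) / U[i, i]
     end
     a
 end
 J = [4.0 1.0; 1.0 3.0]                          # illustrative matrix
 F = lu(J)                                       # the O(n^3) factorization
 solve_with_lu(F.L, F.U, F.P * [1.0, 2.0])       # two O(n^2) solves; F.P accounts for row pivoting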

So is this an O(n^2) algorithm for computing the solution of a linear system? No, because the computation of $LU$ itself is an O(n^3) calculation, and thus the true complexity of solving a linear system is still O(n^3). However, if we have already factorized $J$, then we can repeatedly use the same $LU$ factors to solve additional linear problems $Jv = u$ with different vectors. We can exploit this to accelerate the Newton method. Instead of doing the calculation:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

we can instead do:

\[ x_{k+1} = x_k - J(x_0)^{-1}g(x_k) \]

so that all of the Jacobians are the same. This means that a single O(n^3) factorization can be done, with multiple O(n^2) calculations using the same factorization. This is known as a Quasi-Newton method. While this makes the Newton method no longer quadratically convergent, it minimizes the large constant factor on the computational cost while retaining the same dynamical properties, i.e. the same steady state and thus the same overall solution. This makes sense for sufficiently large $n$, but requires sufficiently large $n$ because the loss of quadratic convergence means that it will take more steps to converge than before, and thus more $O(n^2)$ backsolves are required, meaning that the difference between factorizations and backsolves needs to be large enough in order to offset the cost of extra steps.
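
A small sketch of this reuse (the function `g`, the hand-written Jacobian, and the starting point are illustrative): the `lu` object is computed once and every later iteration only performs backsolves with it.

 using LinearAlgebra
 g(x) = [x[1]^2 + x[2]^2 - 2, x[1] - x[2]]        # illustrative nonlinear system with root [1, 1]
 function quasi_newton(g, J0, x0; iters = 20)
     F = lu(J0)                                   # single O(n^3) factorization of the frozen Jacobian
     x = copy(x0)
     for _ in 1:iters
         x -= F \ g(x)                            # each step reuses the stored LU factors: O(n^2)
     end
     x
 end
 x0 = [2.0, 2.0]
 quasi_newton(g, [2*x0[1] 2*x0[2]; 1.0 -1.0], x0)   # ≈ [1.0, 1.0]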

Note on Sparse Factorization

Note that LU-factorization, and other factorizations, have generalizations to sparse matrices where a symbolic factorization is utilized to compute a sparse storage of the values which then allow for a fast backsubstitution. More details are outside the scope of this course, but note that Julia and MATLAB will both use the library SuiteSparse in the background when lu is called on a sparse matrix.

Jacobian-Free Newton Krylov (JFNK)

An alternative method for solving the linear system is the Jacobian-Free Newton Krylov technique. This technique is broken into two pieces: the jvp calculation and the Krylov subspace iterative linear solver.

Jacobian-Vector Products as Directional Derivatives

We don't actually need to compute $J$ itself, since all that we actually need is the product of $J$ with a given vector. Is it possible to compute the Jacobian-Vector Product, or the jvp, without producing the Jacobian?

To see how this is done let's take a look at what is actually calculated. Written out in the standard basis, we have that:

\[ w_i = \sum_{j}^{m} J_{ij} v_{j} \]

Now write out what $J$ means and we see that:

\[ w_i = \sum_j^{m} \frac{df_i}{dx_j} v_j = \nabla f_i(x) \cdot v \]

that is, the $i$th component of $Jv$ is the directional derivative of $f_i$ in the direction $v$. This means that in general, the jvp $Jv$ is actually just the directional derivative in the direction of $v$, that is:

\[ Jv = \nabla f \cdot v \]

and therefore it has another mathematical representation, that is:

\[ Jv = \lim_{\epsilon \rightarrow 0} \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

From this alternative form it is clear that we can always compute a jvp with a single computation. Using finite differences, a simple approximation is the following:

\[ Jv \approx \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

for non-zero $\epsilon$. Similarly, recall that in forward-mode automatic differentiation we can choose directions by seeding the dual part. Therefore, using the dual number with one partial component:

\[ d = x + v \epsilon \]

we get that

\[ f(d) = f(x) + Jv \epsilon \]

and thus a single application with a single partial gives the jvp.
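
For example (a sketch on a small illustrative function and direction), both approaches give the same directional derivative without ever forming $J$:

 using ForwardDiff
 f(x) = [x[1] + x[3], x[2] * x[3], x[1]]
 x = [1.0, 2.0, 3.0]
 v = [1.0, 0.0, 1.0]
 ϵ = sqrt(eps(Float64))
 jvp_fd = (f(x .+ ϵ .* v) .- f(x)) ./ ϵ                      # finite-difference directional derivative
 jvp_ad = ForwardDiff.derivative(t -> f(x .+ t .* v), 0.0)   # one-partial dual number version
 jvp_ad ≈ ForwardDiff.jacobian(f, x) * v                     # true: exact agreement for the AD version
 isapprox(jvp_fd, jvp_ad; rtol = 1e-6)                       # true: finite differences lose about half the digits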

Note on Reverse-Mode Automatic Differentiation

As noted earlier, reverse-mode automatic differentiation has its primitives compute rows of the Jacobian in the seeded direction. This means that the seeded reverse-mode call with the vector $v$ computes $v^T J$, that is the vector (transpose) Jacobian transpose, or vjp for short. When discussing parameter estimation and adjoints, this shorthand will be introduced as a way for using a traditionally machine learning tool to accelerate traditionally scientific computing tasks.

Krylov Subspace Methods For Solving Linear Systems

Basic Iterative Solver Methods

Now that we have direct access to quick calculations of $Jv$, how would we use this to solve the linear system $Jw = v$ quickly? This is done through iterative linear solvers. These methods replace the process of solving for a factorization with, you may have guessed it, a discrete dynamical system whose solution is $w$. To do this, what we want is some iterative process so that

\[ Jw - b = 0 \]

So now let's split $J = A - B$, then if we are iterating the vectors $w_k$ such that $w_k \rightarrow w$, then if we plug this into the previous (residual) equation we get

\[ A w_{k+1} = Bw_k + b \]

since when we plug in $w$ we get zero (the sequence must be Cauchy so the difference $w_{k+1} - w_k \rightarrow 0$). Thus if we can split our matrix $J$ into a component $A$ which is easy to invert and a part $B$ that is just everything else, then we would have a bunch of easy linear systems to solve. There are many different choices that we can make. If we let $J = L + D + U$, where $L$ is the lower portion of $J$, $D$ is the diagonal, and $U$ is the upper portion, then the following are well-known methods:

  • Richardson: $A = \omega I$ for some $\omega$

  • Jacobi: $A = D$

  • Damped Jacobi: $A = \omega D$

  • Gauss-Seidel: $A = D-L$

  • Successive Over Relaxation: $A = \omega D - L$

  • Symmetric Successive Over Relaxation: $A = \frac{1}{\omega (2 - \omega)}(D-\omega L)D^{-1}(D-\omega U)$

These decompositions are chosen since a diagonal matrix is easy to invert (it's just the inversion of the scalars of the diagonal) and it's easy to solve an upper or lower triangular linear system (once again, it's backsubstitution).

Since these methods give a linear dynamical system, we know that there is a unique steady state solution, which satisfies $Aw - Bw = Jw = b$. Thus we will converge to it as long as the steady state is stable. To see if it's stable, take the update equation

\[ w_{k+1} = A^{-1}(Bw_k + b) \]

and check the eigenvalues of the system: if they are within the unit circle then you have stability. Notice that this can always occur by bringing the eigenvalues of $A^{-1}$ closer to zero, which can be done by multiplying $A$ by a significantly large value, hence the $\omega$ quantities. While that always works, this essentially amounts to decreasing the stepsize of the iterative process and thus requiring more steps, thus making it take more computations. Thus the game is to pick the largest stepsize ($\omega$) for which the steady state is stable. We will leave that as outside the topic of this course.
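
Here is a tiny sketch of the Jacobi splitting ($A = D$) applied to an illustrative, diagonally dominant $J$; each iteration only inverts the diagonal:

 using LinearAlgebra
 function jacobi_solve(J, b; iters = 50)
     D = Diagonal(J)                    # A = D
     B = D - J                          # so that J = A - B
     w = zero(b)
     for _ in 1:iters
         w = D \ (B * w + b)            # iterate A w_{k+1} = B w_k + b
     end
     w
 end
 J = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]
 jacobi_solve(J, b) ≈ J \ b             # true: the iteration's steady state solves J w = b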

Krylov Subspace Methods

While the classical iterative solver methods give the background for understanding an alternative to direct inversion or factorization of a matrix, the problem with that approach is that it requires the ability to split the matrix $J$, which we would like to avoid computing. Instead, we would like to develop an iterative solver technique which instead just uses the solution to $Jv$. Indeed there are such methods, and these are the Krylov subspace methods. A Krylov subspace is the space spanned by:

\[ \mathcal{K}_k = \text{span} \{v,Jv,J^2 v, \ldots, J^k v\} \]

There are a few nice properties about Krylov subspaces that can be exploited. For one, it is known that there is a finite maximum dimension of the Krylov subspace, that is there is a value $r$ such that $J^{r+1} v \in \mathcal{K}_r$, which means that the complete Krylov subspace can be computed in finitely many jvps, since $J^2 v$ is just the jvp applied to the previous jvp. Indeed, the vectors $J^i v$ stay linearly independent until that maximal dimension is reached, and the maximal value is at most $m$, the dimension of the Jacobian. Therefore in at most $m$ jvps the solution is guaranteed to live in the Krylov subspace, giving a maximal computational cost and a proof of convergence as long as the vector chosen from the subspace is the "optimal in the space".

The most common method in the Krylov subspace family of methods is the GMRES method. Essentially, in step $i$ one computes $\mathcal{K}_i$, and finds the $x$ that is the closest to the Krylov subspace, i.e. finds the $x \in \mathcal{K}_i$ such that $\Vert Jx-v \Vert$ is minimized. At each step, it adds the new vector to the Krylov subspace after orthogonalizing it against the other vectors via Arnoldi iterations, leading to an orthogonal basis of $\mathcal{K}_i$ which makes it easy to express $x$.

While one has a guaranteed bound on the number of possible jvps in GMRES which is simply the number of ODEs (since that is what determines the size of the Jacobian and thus the total dimension of the problem), that bound is not necessarily a good one. For a large sparse matrix, it may be computationally impractical to ever compute 100,000 jvps. Thus one does not typically run the algorithm to conclusion, and instead stops when $\Vert Jx-v \Vert$ is sufficiently below some user-defined error tolerance.
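
Below is a rough, unpreconditioned GMRES sketch built only from a jvp closure (zero initial guess, no restarts; a didactic illustration rather than a production implementation): it runs the Arnoldi orthogonalization and solves the small least-squares problem in the Krylov basis, stopping once the residual is below a tolerance.

 using LinearAlgebra
 function gmres_sketch(jvp, b; tol = 1e-8, maxiter = length(b))
     β = norm(b)
     Q = zeros(length(b), maxiter + 1)       # orthonormal Krylov basis vectors
     H = zeros(maxiter + 1, maxiter)         # Hessenberg matrix from the Arnoldi process
     Q[:, 1] = b / β
     for k in 1:maxiter
         w = jvp(Q[:, k])                    # expand the Krylov subspace with one jvp
         for i in 1:k                        # Arnoldi: orthogonalize against the current basis
             H[i, k] = dot(Q[:, i], w)
             w -= H[i, k] * Q[:, i]
         end
         H[k+1, k] = norm(w)
         e1 = [β; zeros(k)]
         y = H[1:k+1, 1:k] \ e1              # small least-squares problem in the subspace
         if norm(H[1:k+1, 1:k] * y - e1) < tol || k == maxiter || H[k+1, k] == 0
             return Q[:, 1:k] * y            # expand the coefficients back out of the basis
         end
         Q[:, k+1] = w / H[k+1, k]
     end
 end
 J = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]
 gmres_sketch(w -> J * w, b) ≈ J \ b         # true, and only J*w (the jvp) was ever used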

Intermediate Conclusion

Let's take a step back and see what our intermediate conclusion is. In order to solve for the implicit step, it just boils down to doing Newton's method on some $g(x)=0$. If the Jacobian is small enough, one factorizes the Jacobian and uses Quasi-Newton iterations in order to utilize the stored LU-decomposition in multiple steps to reduce the computation cost. If the Jacobian is sparse, sparse automatic differentiation through matrix coloring is employed to directly fill the sparse matrix with fewer applications of $g$, and then this sparse matrix is factorized using a sparse LU factorization.

When the matrix is too large, then one resorts to using a Krylov subspace method, since this only requires being able to do $Jv$ calculations. In general, $Jv$ can be done matrix-free because it is simply the directional derivative in the direction of the vector $v$, which can be computed through either numerical or forward-mode automatic differentiation. This is then used in the GMRES iterative process to find the element of the Krylov subspace which is closest to the solution, exiting early when the residual error is small enough. If this is converging too slowly, then preconditioning is used.

That's the basic algorithm, but what are the other important details for getting this right?

The Need for Speed

Preconditioning

The speed at which GMRES converges depends on the correlations between the Krylov vectors, which can be shown to be related to the condition number of the Jacobian matrix. A high condition number makes convergence slower (this is the case for the traditional iterative methods as well), which in turn is an issue because it is the high condition number on the Jacobian which leads to stiffness and causes one to have to use an implicit integrator in the first place!

To help speed up the convergence, a common technique is known as preconditioning. Preconditioning is the process of using a semi-inverse to the matrix in order to split the matrix so that the iterative problem that is being solved is one that has a smaller condition number. Mathematically, it involves decomposing $J = P_l A P_r$ where $P_l$ and $P_r$ are the left and right preconditioners which have simple inverses, and thus instead of solving $Jx=v$, we would solve:

\[ P_l A P_r x = v \]

or

\[ A P_r x = P_l^{-1}v \]

which then means that the Krylov subspace that needs to be solved for is that defined by $A$: $\mathcal{K} = \text{span}\{v,Av,A^2 v, \ldots\}$. There are many possible choices for these preconditioners, but they are usually problem dependent. For example, for ODEs which come from parabolic and elliptic PDE discretizations, the multigrid method, such as a geometric multigrid or an algebraic multigrid, is a preconditioner that can accelerate the iterative solving process. One generic preconditioner that can generally be used is to divide by the norm of the vector $v$, which is a scaling employed by both SUNDIALS CVODE and by DifferentialEquations.jl and can be shown to be almost always advantageous.
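
As a tiny numerical illustration (with made-up numbers) of why this helps, even a simple diagonal (Jacobi) preconditioner can sharply reduce the condition number of the operator handed to the Krylov solver:

 using LinearAlgebra
 J = [1.0e6 2.0e6; 1.0 3.0]       # badly scaled illustrative matrix
 P = Diagonal(J)                  # simple left preconditioner: the diagonal of J
 cond(J)                          # ≈ 5e6
 cond(P \ J)                      # ≈ 20: the preconditioned system is far better conditioned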

Jacobian Re-use

If the problem is small enough such that the factorization is used and a Quasi-Newton technique is employed, it then holds that for most steps $J$ is only approximate since it can be using an old LU-factorization. To push it even further, high performance codes allow for Jacobian reuse, which allows the same Jacobian to be reused between different timesteps. If the Jacobian is too incorrect, it can cause the Newton iterations to diverge, which is then when one would calculate a new Jacobian and compute a new LU-factorization.

Adaptive Timestepping

In simple cases, like partial differential equation discretizations of physical problems, the resulting ODEs are not too stiff and thus Newton's iteration generally works. However, in cases like stiff biological models, Newton's iteration can itself not always be stable enough to allow convergence. In fact, with many of the stiff biological models commonly used in benchmarks, no method is stable enough to pass without using adaptive timestepping! Thus one may need to adapt the timestep in order to improve the ability for the Newton method to converge (smaller timesteps increase the stability of the Newton stepping, see the homework).

This needs to be mixed with the Jacobian re-use strategy, since $J = I - \gamma \frac{df}{du}$ where $\gamma$ is dependent on $\Delta t$ (and $\gamma = \Delta t$ for implicit Euler) means that the Jacobian of the Newton method changes as $\Delta t$ changes. Thus one usually has a tiered algorithm for determining when to update the factorizations of $J$ vs when to compute a new $\frac{df}{du}$ and then refactorize. This is generally dependent on estimates of convergence rates to heuristically guess how far off $\frac{df}{du}$ is from the current true value.

So how does one perform adaptivity? This is generally done through a rejection sampling technique. First one needs some estimate of the error in a step. This is calculated through an embedded method, which is a method that can be calculated without any extra $f$ evaluations and that is (usually) one order different from the true method. The difference between the true and the embedded method is then an error estimate. If this is greater than a user-chosen tolerance, the step is rejected and re-run with a smaller $\Delta t$ (possibly refactorizing, etc.). If this is less than the user tolerance, the step is accepted and $\Delta t$ is changed.

There are many schemes for how one can change $\Delta t$. One of the most common is known as the P-control, which stands for the proportional controller which is used throughout control theory. In this case, the control is to change $\Delta t$ in proportion to the current error ratio from the desired tolerance. If we let

\[ q = \frac{\text{E}}{\max(u_k,u_{k+1}) \tau_r + \tau_a} \]

where $\tau_r$ is the relative tolerance and $\tau_a$ is the absolute tolerance, then $q$ is the ratio of the current error to the current tolerance. If $q<1$, then the error is less than the tolerance and the step is accepted, and vice versa for $q>1$. In either case, we let $\Delta t_{new} = q \Delta t$ be the proportional update.
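
A rough sketch of such a controller is below. Note the convention: this sketch uses the form most production codes implement, with a safety factor and the step scaled by the inverse error ratio raised to $1/(\text{order}+1)$; the error estimate, tolerances, and order here are all illustrative values.

 # Illustrative proportional step-size controller (common implementation convention).
 function propose_dt(E, u, dt; reltol = 1e-3, abstol = 1e-6, order = 1, safety = 0.9)
     q = E / (maximum(abs, u) * reltol + abstol)       # error measured against the tolerance
     accept = q <= 1                                   # q ≤ 1: error within tolerance, accept the step
     dtnew = safety * dt * (1 / q)^(1 / (order + 1))   # grow the step when q < 1, shrink when q > 1
     accept, dtnew
 end
 propose_dt(1e-4, [1.0, 2.0], 0.1)                     # (true, ≈0.40): accept and grow the step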

However, proportional error control has many known features that are undesirable. For example, it happens to work in a "bang bang" manner, meaning that it can drastically change its behavior from step to step. One step may multiply the step size by 10x, then the next by 2x. This is an issue because it affects the stability of the ODE solver method (since the stability is not a property of a single step, but rather it's a property of the global behavior over time)! Thus to smooth it out, one can use a PI-control, which modifies the control factor by a history value, i.e. the error in one step in the past. This of course also means that one can utilize a PID-controller for time stepping. And there are many other techniques that can be used, but many of the most optimized codes tend to use a PI-control mechanism.

Methodological Summary

Here's a quick summary of the methodologies in a hierarchical sense:

  • At the lowest level is the linear solve, either done by JFNK or (sparse) factorization. For large enough systems, this is the brunt of the work. This is thus the piece to computationally optimize as much as possible, and parallelize. For sparse factorizations, this can be done with a distributed sparse library implementation. For JFNK, the efficiency is simply due to the efficiency of your ODE function f.

  • An optional level for JFNK is the preconditioning level, where preconditioners can be used to decrease the total number of iterations required for Krylov subspace methods like GMRES to converge, and thus reduce the total number of f calls.

  • At the nonlinear solver level, different Newton-like techniques are utilized to minimize the number of factorizations/linear solves required, and maximize the stability of the Newton method.

  • At the ODE solver level, more efficient integrators and adaptive methods for stiff ODEs are used to reduce the cost by affecting the linear solves. Most of these calculations are dominated by the linear solve portion when it's in the regime of large stiff systems. Jacobian reuse techniques, partial factorizations, and IMEX methods come into play as ways to reduce the cost per factorization and reduce the total number of factorizations.

\ No newline at end of file +

Solving Stiff Ordinary Differential Equations

Chris Rackauckas
October 14th, 2020

Youtube Video Link

We have previously shown how to solve non-stiff ODEs via optimized Runge-Kutta methods, but we ended by showing that there is a fundamental limitation of these methods when attempting to solve stiff ordinary differential equations. However, we can get around these limitations by using different types of methods, like implicit Euler. Let's now go down the path of understanding how to efficiently implement stiff ordinary differential equation solvers, and its interaction with other domains like automatic differentiation.

When one is solving a large-scale scientific computing problem with MPI, this is almost always the piece of code where all of the time is spent, so let's understand what it's doing.

Newton's Method and Jacobians

Recall that the implicit Euler method is the following:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t + \Delta t) \]

If we wanted to use this method, we would need to find out how to get the value $u_{n+1}$ when only knowing the value $u_n$. To do so, we can move everything to one side:

\[ u_{n+1} - \Delta t f(u_{n+1},p,t + \Delta t) - u_n = 0 \]

and now we have a problem

\[ g(u_{n+1}) = 0 \]

This is the classic rootfinding problem $g(x)=0$: find $x$. The way that we solve the rootfinding problem is, once again, by replacing this problem about a continuous function $g$ with a discrete dynamical system whose steady state is the solution to $g(x)=0$. There are many methods for this, but some choices of the rootfinding method affect the stability of the ODE solver itself, since we need to make sure that the steady state solution is a stable steady state of the iteration process; otherwise the rootfinding method will diverge (this will be explored in the homework).

Thus for example, fixed point iteration is not appropriate for stiff differential equations. Methods which are used in the stiff case are either Anderson Acceleration or Newton's method. Newton's is by far the most common (and generally performs the best), so we can go down this route.

Let's use the syntax $g(x)=0$. Here we need some starting value $x_0$ as our first guess for $u_{n+1}$. The easiest guess is $u_{n}$, though additional information about the equation can be used to compute a better starting value (known as a step predictor). Once we have a starting value, we run the iteration:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

where $J(x_k)$ is the Jacobian of $g$ at the point $x_k$. However, the mathematical formulation is never the syntax that you should use for the actual application! Instead, numerically this is two stages:

  • Solve $Ja=g(x_k)$ for $a$

  • Update $x_{k+1} = x_k - a$

By doing this, we can turn the matrix inversion into a problem of a linear solve and then an update. The reason this is done is manyfold, but one major reason is that the inverse of a sparse matrix can be dense, and this Jacobian is in many cases (PDEs) a large and sparse matrix, so its inverse may not even fit into memory.

Now let's break this down step by step.

Some Quick Notes

The Jacobian of $g$ can also be written as $J = I - \gamma \frac{df}{du}$ for the ODE $u' = f(u,p,t)$, where $\gamma = \Delta t$ for the implicit Euler method. This general form holds for all other (SDIRK) implicit methods, changing the value of $\gamma$. Additionally, the class of Rosenbrock methods solves a linear system with exactly the same $J$, meaning that essentially all implicit and semi-implicit ODE solvers have to do the same Newton iteration process on the same structure. This is the portion of the code that is generally the bottleneck.

Additionally, if one is solving a mass matrix ODE: $Mu' = f(u,p,t)$, exactly the same treatment can be had with $J = M - \gamma \frac{df}{du}$. This works even if $M$ is singular, a case known as a differential-algebraic equation or a DAE. A DAE for example can be an ODE with constraint equations, and these structures can be represented as an ODE where these constraints lead to a singularity in the mass matrix (a row of all zeros in the mass matrix means that equation has no derivative term: it is purely the algebraic constraint that the right-hand side equals zero!).

Generation of the Jacobian

Dense Finite Differences and Forward-Mode AD

Recall that the Jacobian is the matrix of $\frac{df_i}{dx_j}$ for $f$ a vector-valued function. The simplest way to generate the Jacobian is through finite differences. For each $h_j = h e_j$ for $e_j$ the basis vector of the $j$th axis and some sufficiently small $h$, then we can compute column $j$ of the Jacobian by:

\[ \frac{f(x+h_j)-f(x)}{h} \]

Thus $m+1$ applications of $f$ are required to compute the full Jacobian.

This can be improved by using forward-mode automatic differentiation. Recall that we can formulate a multidimensional dual number of the form

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

We can then seed the vectors $v_j = h_j$ so that the differentiation directions are along the basis vectors, and then the output dual is the result:

\[ f(d) = f(x) + J_1 \epsilon_1 + \ldots + J_m \epsilon_m \]

where $J_j$ is the $j$th column of the Jacobian. And thus with one calculation of the primal (f(x)) we have calculated the entire Jacobian.

Sparse Differentiation and Matrix Coloring

However, when the Jacobian is sparse we can compute it much faster. We can understand this by looking at the following system:

\[ f(x)=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}\\ x_{1} \end{array}\right] \]

Notice that in 3 differencing steps we can calculate:

\[ f(x+\epsilon e_{1})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}\\ x_{1}+\epsilon \end{array}\right] \]

\[ f(x+\epsilon e_{2})=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}+\epsilon x_{3}\\ x_{1} \end{array}\right] \]

\[ f(x+\epsilon e_{3})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}+\epsilon x_{2}\\ x_{1} \end{array}\right] \]

and thus:

\[ \frac{f(x+\epsilon e_{1})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ 0\\ 1 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 0\\ x_{3}\\ 0 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{3})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{2}\\ 0 \end{array}\right] \]

But notice that the calculation of $e_1$ and $e_2$ do not interact. If we had done:

\[ \frac{f(x+\epsilon e_{1}+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{3}\\ 1 \end{array}\right] \]

we would still get the correct value for every row because the $\epsilon$ terms do not collide (if they did collide, we would have the situation known as perturbation confusion). If we knew the sparsity pattern of the Jacobian included a 0 at (2,1), (1,2), and (3,2), then we would know that the vectors would have to be $[1 0 1]$ and $[0 x_3 0]$, meaning that columns 1 and 2 can be computed simultaneously and decompressed. This is the key to sparse differentiation.

With forward-mode automatic differentiation, recall that we calculate multiple dimensions simultaneously by using a multidimensional dual number seeded by the vectors of the differentiation directions, that is:

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

Instead of using the primitive differentiation directions $e_j$, we can instead replace this with the mixed values. For example, the Jacobian of the example function can be computed in one function call to $f$ with the dual number input:

\[ d = x + (e_1 + e_2) \epsilon_1 + e_3 \epsilon_2 \]

and performing the decompression via the sparsity pattern. Thus the sparsity pattern gives a direct way to optimize the construction of the Jacobian.

This idea of independent directions can be formalized as a matrix coloring. Take $S_{ij}$ to be the sparsity pattern of some Jacobian matrix $J_{ij}$. Define a graph on the nodes $1$ through $m$ where there is an edge between $i$ and $j$ if there is a row in which both columns $i$ and $j$ are non-zero. This graph is the column connectivity graph of the Jacobian. What we wish to do is find the smallest set of differentiation directions such that differentiating in the direction of $e_i$ does not collide with differentiation in the direction of $e_j$. The connectivity graph is set up so that this cannot be done if the two nodes are adjacent. If we let the subset of nodes differentiated together be a color, the question is: what is the smallest number of colors such that no adjacent nodes share the same color? This is the classic distance-1 coloring problem from graph theory. It is well-known that the problem of finding the chromatic number, the minimal number of colors for a graph, is generally NP-complete. However, there are heuristic methods for performing a distance-1 coloring quite quickly. For example, a greedy algorithm is as follows:

  • Pick a node at random to be color 1.

  • Make all nodes adjacent to that be the lowest color that they can be (in this step that will be 2).

  • Now look at all nodes adjacent to that. Make all nodes be the lowest color that they can be (either 1 or 3).

  • Repeat by looking at the next set of adjacent nodes and color as conservatively as possible.

This can be visualized as follows:

The result will color the entire connected component. While not giving an optimal result, it will still give a result that is a sufficient reduction in the number of differentiation directions (without solving an NP-complete problem) and thus can lead to a large computational saving.

At the end, let $c_i$ be the vector of 1's and 0's, where it's 1 for every node that is color $i$ and 0 otherwise. Sparse automatic differentiation of the Jacobian is then computed with:

\[ d = x + c_1 \epsilon_1 + \ldots + c_k \epsilon_k \]

that is, the full Jacobian is computed with one dual number which consists of the primal calculation along with $k$ dual dimensions, where $k$ is the computed chromatic number of the connectivity graph on the Jacobian. Once this calculation is complete, the colored columns can be decompressed into the full Jacobian using the sparsity information, generating the original quantity that we wanted to compute.

For more information on the graph coloring aspects, find the paper titled "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives" by Gebremedhin.

Note on Sparse Reverse-Mode AD

Reverse-mode automatic differentiation can be thought of as a method for computing one row of a Jacobian per seed, as opposed to the one column per seed given by forward-mode AD. Thus sparse reverse-mode automatic differentiation can be done by building the analogous connectivity graph over the rows and using the resulting color vectors to seed the reverse accumulation process.

Linear Solving

After the Jacobian has been computed, we need to solve a linear equation $Ja=b$. While mathematically you can solve this by computing the inverse $J^{-1}$, this is not a good way to perform the calculation because even if $J$ is sparse, then $J^{-1}$ is in general dense and thus may not fit into memory (remember, this is $N^2$ as many terms, where $N$ is the size of the ordinary differential equation that is being solved, so if it's a large equation it is very feasible and common that the ODE is representable but its full Jacobian is not able to fit into RAM). Note that some may say that this is done for numerical stability reasons: that is incorrect. In fact, under reasonable assumptions for how the inverse is computed, it will be as numerically stable as other techniques we will mention.

Thus instead of generating the inverse, we can instead perform a matrix factorization. A matrix factorization is a transformation of the matrix into a form that is more amenable to certain analyses. For our purposes, a general Jacobian within a Newton iteration can be transformed via the LU-factorization or (LU-decomposition), i.e.

\[ J = LU \]

where $L$ is lower triangular and $U$ is upper triangular. If we write the linear equation in this form:

\[ LUa = b \]

then we see that we can solve it by first solving $L(Ua) = b$ for the intermediate vector $Ua$. Since $L$ is lower triangular, this is done by forward substitution, working from the first row downward. That is, in a lower triangular form, we can solve for the first value since we have:

\[ L_{11} a_1 = b_1 \]

and thus by dividing we solve. For the next term, we have that

\[ L_{21} a_1 + L_{22} a_2 = b_2 \]

and thus we plug in the solution to $a_1$ and solve to get $a_2$. The lower triangular form allows this to continue. This occurs in $1+2+3+\ldots+n$ operations, and is thus $O(n^2)$. Next, we solve $Ua$ equal to the intermediate vector we just computed, which is done by the same substitution process but run in the reverse direction (back substitution). Together those two triangular solves are $O(n^2)$ and complete the solve with the $LU$ factors.

So is this an O(n^2) algorithm for computing the solution of a linear system? No, because the computation of $LU$ itself is an O(n^3) calculation, and thus the true complexity of solving a linear system is still O(n^3). However, if we have already factorized $J$, then we can repeatedly use the same $LU$ factors to solve additional linear problems $Jv = u$ with different vectors. We can exploit this to accelerate the Newton method. Instead of doing the calculation:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

we can instead do:

\[ x_{k+1} = x_k - J(x_0)^{-1}g(x_k) \]

so that all of the Jacobians are the same. This means that a single O(n^3) factorization can be done, with multiple O(n^2) calculations using the same factorization. This is known as a Quasi-Newton method. While this makes the Newton method no longer quadratically convergent, it minimizes the large constant factor on the computational cost while retaining the same dynamical properties, i.e. the same steady state and thus the same overall solution. This makes sense for sufficiently large $n$, but requires sufficiently large $n$ because the loss of quadratic convergence means that it will take more steps to converge than before, and thus more $O(n^2)$ backsolves are required, meaning that the difference between factorizations and backsolves needs to be large enough in order to offset the cost of extra steps.

Note on Sparse Factorization

Note that LU-factorization, and other factorizations, have generalizations to sparse matrices where a symbolic factorization is utilized to compute a sparse storage of the values which then allow for a fast backsubstitution. More details are outside the scope of this course, but note that Julia and MATLAB will both use the library SuiteSparse in the background when lu is called on a sparse matrix.

Jacobian-Free Newton Krylov (JFNK)

An alternative method for solving the linear system is the Jacobian-Free Newton Krylov technique. This technique is broken into two pieces: the jvp calculation and the Krylov subspace iterative linear solver.

Jacobian-Vector Products as Directional Derivatives

We don't actually need to compute $J$ itself, since all that we actually need is the product of $J$ with a given vector. Is it possible to compute the Jacobian-Vector Product, or the jvp, without producing the Jacobian?

To see how this is done let's take a look at what is actually calculated. Written out in the standard basis, we have that:

\[ w_i = \sum_{j}^{m} J_{ij} v_{j} \]

Now write out what $J$ means and we see that:

\[ w_i = \sum_j^{m} \frac{df_i}{dx_j} v_j = \nabla f_i(x) \cdot v \]

that is, the $i$th component of $Jv$ is the directional derivative of $f_i$ in the direction $v$. This means that in general, the jvp $Jv$ is actually just the directional derivative in the direction of $v$, that is:

\[ Jv = \nabla f \cdot v \]

and therefore it has another mathematical representation, that is:

\[ Jv = \lim_{\epsilon \rightarrow 0} \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

From this alternative form it is clear that we can always compute a jvp with a single computation. Using finite differences, a simple approximation is the following:

\[ Jv \approx \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

for non-zero $\epsilon$. Similarly, recall that in forward-mode automatic differentiation we can choose directions by seeding the dual part. Therefore, using the dual number with one partial component:

\[ d = x + v \epsilon \]

we get that

\[ f(d) = f(x) + Jv \epsilon \]

and thus a single application with a single partial gives the jvp.

Note on Reverse-Mode Automatic Differentiation

As noted earlier, reverse-mode automatic differentiation has its primitives compute rows of the Jacobian in the seeded direction. This means that the seeded reverse-mode call with the vector $v$ computes $v^T J$, that is the vector (transpose) Jacobian transpose, or vjp for short. When discussing parameter estimation and adjoints, this shorthand will be introduced as a way for using a traditionally machine learning tool to accelerate traditionally scientific computing tasks.

Krylov Subspace Methods For Solving Linear Systems

Basic Iterative Solver Methods

Now that we have direct access to quick calculations of $Jv$, how would we use this to solve the linear system $Jw = v$ quickly? This is done through iterative linear solvers. These methods replace the process of solving for a factorization with, you may have guessed it, a discrete dynamical system whose solution is $w$. To do this, what we want is some iterative process so that

\[ Jw - b = 0 \]

So now let's split $J = A - B$, then if we are iterating the vectors $w_k$ such that $w_k \rightarrow w$, then if we plug this into the previous (residual) equation we get

\[ A w_{k+1} = Bw_k + b \]

since when we plug in $w$ we get zero (the sequence must be Cauchy so the difference $w_{k+1} - w_k \rightarrow 0$). Thus if we can split our matrix $J$ into a component $A$ which is easy to invert and a part $B$ that is just everything else, then we would have a bunch of easy linear systems to solve. There are many different choices that we can make. If we let $J = L + D + U$, where $L$ is the lower portion of $J$, $D$ is the diagonal, and $U$ is the upper portion, then the following are well-known methods:

  • Richardson: $A = \omega I$ for some $\omega$

  • Jacobi: $A = D$

  • Damped Jacobi: $A = \omega D$

  • Gauss-Seidel: $A = D-L$

  • Successive Over Relaxation: $A = \omega D - L$

  • Symmetric Successive Over Relaxation: $A = \frac{1}{\omega (2 - \omega)}(D-\omega L)D^{-1}(D-\omega U)$

These decompositions are chosen since a diagonal matrix is easy to invert (it's just the inversion of the scalars of the diagonal) and it's easy to solve an upper or lower triangular linear system (once again, it's backsubstitution).

Since these methods give a linear dynamical system, we know that there is a unique steady state solution, which satisfies $Aw - Bw = Jw = b$. Thus we will converge to it as long as the steady state is stable. To see if it's stable, take the update equation

\[ w_{k+1} = A^{-1}(Bw_k + b) \]

and check the eigenvalues of the system: if they are within the unit circle then you have stability. Notice that this can always occur by bringing the eigenvalues of $A^{-1}$ closer to zero, which can be done by multiplying $A$ by a significantly large value, hence the $\omega$ quantities. While that always works, this essentially amounts to decreasing the stepsize of the iterative process and thus requiring more steps, thus making it take more computations. Thus the game is to pick the largest stepsize ($\omega$) for which the steady state is stable. We will leave that as outside the topic of this course.

Krylov Subspace Methods

While the classical iterative solver methods give the background for understanding an alternative to direct inversion or factorization of a matrix, the problem with that approach is that it requires the ability to split the matrix $J$, which we would like to avoid computing. Instead, we would like to develop an iterative solver technique which instead just uses the solution to $Jv$. Indeed there are such methods, and these are the Krylov subspace methods. A Krylov subspace is the space spanned by:

\[ \mathcal{K}_k = \text{span} \{v,Jv,J^2 v, \ldots, J^k v\} \]

There are a few nice properties about Krylov subspaces that can be exploited. For one, it is known that there is a finite maximum dimension of the Krylov subspace, that is there is a value $r$ such that $J^{r+1} v \in \mathcal{K}_r$, which means that the complete Krylov subspace can be computed in finitely many jvps, since $J^2 v$ is just the jvp applied to the previous jvp. Indeed, the vectors $J^i v$ stay linearly independent until that maximal dimension is reached, and the maximal value is at most $m$, the dimension of the Jacobian. Therefore in at most $m$ jvps the solution is guaranteed to live in the Krylov subspace, giving a maximal computational cost and a proof of convergence as long as the vector chosen from the subspace is the "optimal in the space".

The most common method in the Krylov subspace family of methods is the GMRES method. Essentially, in step $i$ one computes $\mathcal{K}_i$, and finds the $x$ that is the closest to the Krylov subspace, i.e. finds the $x \in \mathcal{K}_i$ such that $\Vert Jx-v \Vert$ is minimized. At each step, it adds the new vector to the Krylov subspace after orthogonalizing it against the other vectors via Arnoldi iterations, leading to an orthogonal basis of $\mathcal{K}_i$ which makes it easy to express $x$.

While one has a guaranteed bound on the number of possible jvps in GMRES which is simply the number of ODEs (since that is what determines the size of the Jacobian and thus the total dimension of the problem), that bound is not necessarily a good one. For a large sparse matrix, it may be computationally impractical to ever compute 100,000 jvps. Thus one does not typically run the algorithm to conclusion, and instead stops when $\Vert Jx-v \Vert$ is sufficiently below some user-defined error tolerance.

Intermediate Conclusion

Let's take a step back and see what our intermediate conclusion is. In order to solve for the implicit step, it just boils down to doing Newton's method on some $g(x)=0$. If the Jacobian is small enough, one factorizes the Jacobian and uses Quasi-Newton iterations in order to utilize the stored LU-decomposition in multiple steps to reduce the computation cost. If the Jacobian is sparse, sparse automatic differentiation through matrix coloring is employed to directly fill the sparse matrix with fewer applications of $g$, and then this sparse matrix is factorized using a sparse LU factorization.

When the matrix is too large, then one resorts to using a Krylov subspace method, since this only requires being able to do $Jv$ calculations. In general, $Jv$ can be done matrix-free because it is simply the directional derivative in the direction of the vector $v$, which can be computed through either numerical or forward-mode automatic differentiation. This is then used in the GMRES iterative process to find the element of the Krylov subspace which is closest to the solution, exiting early when the residual error is small enough. If this is converging too slowly, then preconditioning is used.

That's the basic algorithm, but what are the other important details for getting this right?

The Need for Speed

Preconditioning

The speed at which GMRES converges is dependent on the correlations between the Krylov vectors, which can be shown to be related to the condition number of the Jacobian matrix. A high condition number makes convergence slower (this is the case for the traditional iterative methods as well), which in turn is an issue because it is the high condition number of the Jacobian which leads to stiffness and causes one to have to use an implicit integrator in the first place!

To help speed up the convergence, a common technique is known as preconditioning. Preconditioning is the process of using a semi-inverse to the matrix in order to split the matrix so that the iterative problem that is being solved is one that has a smaller condition number. Mathematically, it involves decomposing $J = P_l A P_r$ where $P_l$ and $P_r$ are the left and right preconditioners which have simple inverses, and thus instead of solving $Jx=v$, we would solve:

\[ P_l A P_r x = v \]

or

\[ A P_r x = P_l^{-1}v \]

which then means that the Krylov subspace that needs to be solved for is the one defined by $A$: $\mathcal{K} = \text{span}\{v,Av,A^2 v, \ldots\}$. There are many possible choices for these preconditioners, but they are usually problem dependent. For example, for ODEs which come from parabolic and elliptic PDE discretizations, a multigrid method, such as a geometric multigrid or an algebraic multigrid, is a preconditioner that can accelerate the iterative solving process. One generic preconditioner that can generally be used is to divide by the norm of the vector $v$, which is a scaling employed by both SUNDIALS CVODE and by DifferentialEquations.jl and can be shown to be almost always advantageous.

Jacobian Re-use

If the problem is small enough such that the factorization is used and a Quasi-Newton technique is employed, then for most steps $J$ is only approximate since it can be using an old LU-factorization. To push it even further, high performance codes allow for Jacobian reuse, which is allowing the same Jacobian to be reused between different timesteps. If the Jacobian is too incorrect, it can cause the Newton iterations to diverge, which is when one would calculate a new Jacobian and compute a new LU-factorization.

Adaptive Timestepping

In simple cases, like partial differential equation discretizations of physical problems, the resulting ODEs are not too stiff and thus Newton's iteration generally works. However, in cases like stiff biological models, Newton's iteration is itself not always stable enough to allow convergence. In fact, with many of the stiff biological models commonly used in benchmarks, no method is stable enough to pass without using adaptive timestepping! Thus one may need to adapt the timestep in order to improve the ability of the Newton method to converge (smaller timesteps increase the stability of the Newton stepping, see the homework).

This needs to be mixed with the Jacobian re-use strategy, since $J = I - \gamma \frac{df}{du}$, where $\gamma$ depends on $\Delta t$ (and $\gamma = \Delta t$ for implicit Euler), means that the Jacobian of the Newton iteration changes whenever $\Delta t$ changes. Thus one usually has a tiered algorithm for determining when to update the factorizations of $J$ vs when to compute a new $\frac{df}{du}$ and then refactorize. This is generally dependent on estimates of convergence rates to heuristically guess how far off $\frac{df}{du}$ is from the current true value.

So how does one perform adaptivity? This is generally done through a rejection sampling technique. First one needs some estimate of the error in a step. This is calculated through an embedded method, a method that can be calculated without any extra $f$ evaluations and that is (usually) one order different from the true method. The difference between the true and the embedded method is then an error estimate. If this is greater than a user-chosen tolerance, the step is rejected and re-run with a smaller $\Delta t$ (possibly refactorizing, etc.). If this is less than the user tolerance, the step is accepted and $\Delta t$ is changed.

There are many schemes for how one can change $\Delta t$. One of the most common is known as P-control, the proportional controller used throughout control theory. In this case, the control is to change $\Delta t$ in proportion to the current error ratio from the desired tolerance. If we let

\[ q = \frac{\text{E}}{\max(u_k,u_{k+1}) \tau_r + \tau_a} \]

where $\tau_r$ is the relative tolerance and $\tau_a$ is the absolute tolerance, then $q$ is the ratio of the current error to the current tolerance. If $q<1$, then the error is less than the tolerance and the step is accepted, and vice versa for $q>1$. In either case, we take the proportional update $\Delta t_{new} = \Delta t / q$, shrinking the step when the error is above tolerance and growing it when the error is well below tolerance.
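A sketch of such a proportional controller, with `err` standing in for the embedded error estimate of a step from `u` to `unew`, might look like the following:

```julia
# A sketch of the proportional controller described above; `err` stands in for
# the embedded error estimate of a step from `u` to `unew`.
function p_control(dt, err, u, unew; reltol = 1e-3, abstol = 1e-6)
    q = err / (max(abs(u), abs(unew)) * reltol + abstol)   # error / tolerance ratio
    accept = q <= 1                 # accept the step when the error is below tolerance
    dtnew = dt / q                  # shrink Δt when q > 1, grow it when q < 1
    return accept, dtnew
end

p_control(0.1, 2e-4, 1.0, 1.05)     # -> (true, larger Δt)
```

In practice a safety factor and a clamp on the allowed step-size ratio are layered on top of this.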

However, proportional error control has many known features that are undesirable. For example, it happens to work in a "bang-bang" manner, meaning that it can drastically change its behavior from step to step. One step may multiply the step size by 10x, then the next by 2x. This is an issue because it affects the stability of the ODE solver method (since the stability is not a property of a single step, but rather a property of the global behavior over time)! Thus to smooth it out, one can use a PI-control, which modifies the control factor by a history value, i.e. the error in one step in the past. This of course also means that one can utilize a PID-controller for time stepping. There are many other techniques that can be used, but many of the most optimized codes tend to use a PI-control mechanism.

Methodological Summary

Here's a quick summary of the methodologies in a hierarchical sense:

  • At the lowest level is the linear solve, either done by JFNK or (sparse) factorization. For large enough systems, this is the brunt of the work. This is thus the piece to computationally optimize as much as possible, and parallelize. For sparse factorizations, this can be done with a distributed sparse library implementation. For JFNK, the efficiency is simply due to the efficiency of your ODE function $f$.

  • An optional level for JFNK is the preconditioning level, where preconditioners can be used to decrease the total number of iterations required for Krylov subspace methods like GMRES to converge, and thus reduce the total number of f calls.

  • At the nonlinear solver level, different Newton-like techniques are utilized to minimize the number of factorizations/linear solves required, and maximize the stability of the Newton method.

  • At the ODE solver level, more efficient integrators and adaptive methods for stiff ODEs are used to reduce the cost by affecting the linear solves. Most of these calculations are dominated by the linear solve portion when it's in the regime of large stiff systems. Jacobian reuse techniques, partial factorizations, and IMEX methods come into play as ways to reduce the cost per factorization and reduce the total number of factorizations.

\ No newline at end of file diff --git a/_weave/lecture10/estimation_identification/index.html b/_weave/lecture10/estimation_identification/index.html index 0dcb617d..4e62fb1c 100644 --- a/_weave/lecture10/estimation_identification/index.html +++ b/_weave/lecture10/estimation_identification/index.html @@ -1 +1 @@ -

Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Chris Rackauckas
October 22nd, 2020

Youtube Video Link

Have a model. Have data. Fit model to data.

This is a problem that goes under many different names: parameter estimation, inverse problems, training, etc. In this lecture we will go through the methods for how that's done, starting with the basics and bringing in the recent techniques from machine learning that can be used to improve the basic implementations.

The Shooting Method for Parameter Fitting

Assume that we have some model $u = f(p)$, where $p$ is our parameters: we put in some parameters and receive our simulated data $u$. How should you choose $p$ such that $u$ best fits that data? The shooting method directly uses this high level definition of the model by putting a cost function on the output $C(p)$. The choice of cost function is up to the user and is model-dependent. However, a common one is the L2-loss. If $y$ is our expected data, then the L2-loss function against the data is simply:

\[ C(p) = \Vert f(p) - y \Vert \]

where $C(p): \mathbb{R}^n \rightarrow \mathbb{R}$ is a function that returns a scalar. The shooting method then directly optimizes this cost function by having the optimizer generate new simulated data given new choices of $p$.
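As a concrete (made-up) example, the following sketch defines a toy model `simulate(p)`, synthetic data `y`, and the L2 shooting cost that an optimizer would repeatedly evaluate:

```julia
# A made-up example of the shooting setup: `simulate` stands in for the model
# u = f(p), `y` is synthetic "observed" data, and C is the L2 shooting cost
# that the optimizer evaluates over and over for new choices of p.
using LinearAlgebra

simulate(p) = [p[1] + p[2] * t for t in 0:0.1:1]    # hypothetical model f(p)
y = simulate([1.0, 2.0]) .+ 0.01 .* randn(11)       # synthetic data

C(p) = norm(simulate(p) .- y)                       # L2 loss ‖f(p) − y‖
C([0.5, 0.5])
```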

Methods for Optimization

There are many different nonlinear optimization methods which can be used for this purpose, and for a full survey one should look at packages like JuMP, Optim.jl, and NLopt.jl.

There are generally two sets of methods: global and local optimization methods. Local optimization methods attempt to find the best nearby extrema by finding a point where the gradient $\frac{dC}{dp} = 0$. Global optimization methods attempt to explore the whole space and find the best of the extrema. Global methods tend to employ a lot more heuristics and are extremely computationally difficult, and thus many studies focus on local optimization. We will focus strictly on local optimization, but one may want to look into global optimization for many applications of parameter estimation.

Most local optimizers make use of derivative information in order to accelerate the solver. The simplest of these is the method of gradient descent. In this method, given a set of parameters $p_i$, the next set of parameters one will try is:

\[ p_{i+1} = p_i - \alpha \frac{dC(p_i)}{dp} \]

that is, update $p_i$ by walking in the downward direction of the gradient. Instead of using just first order information, one may want to directly solve the rootfinding problem $\frac{dC}{dp} = 0$ using Newton's method. Newton's method in this case looks like:

\[ p_{i+1} = p_i - \left(\frac{d}{dp}\frac{dC}{dp}\right)^{-1} \frac{dC}{dp} \]

But notice that the Jacobian of the gradient is the Hessian, and thus we can rewrite this as:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

where $H(p)$ is the Hessian matrix $H_{ij} = \frac{\partial^2 C}{\partial p_i \partial p_j}$. However, solving a system of equations which involves the Hessian can be difficult (just like the Jacobian, but now with another layer of differentiation!), and thus many optimization techniques attempt to avoid the Hessian. A commonly used technique that is somewhat in the middle is the BFGS technique, which is a gradient-based optimization method that attempts to approximate the Hessian along the way to modify its stepping behavior. It uses the history of previously calculated points in order to build this cheap Hessian approximation. If one keeps only a constant-length history, say the last 5 points, then one arrives at the l-BFGS technique, which is one of the most common large-scale optimization techniques.
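Continuing the sketch above, local optimizers like these are available in Optim.jl (assuming that package); the calls below use finite-difference gradients since we have not supplied $\frac{dC}{dp}$:

```julia
# A hedged sketch (assuming the Optim.jl package) of minimizing the shooting
# cost C from the sketch above with local, derivative-based optimizers.
using Optim

p0 = [0.5, 0.5]
res_gd    = optimize(C, p0, GradientDescent())
res_bfgs  = optimize(C, p0, BFGS())      # builds a Hessian approximation as it goes
res_lbfgs = optimize(C, p0, LBFGS())     # keeps only a short history of updates

Optim.minimizer(res_bfgs)                # fitted parameters (close to [1.0, 2.0] here)
```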

Connection Between Optimization and Differential Equations

There is actually a strong connection between optimization and differential equations. Let's say we wanted to follow the gradient of the cost downhill towards a local minimum. The flow that we would wish to follow is then given by an ODE, specifically the ODE:

\[ p' = -\frac{dC}{dp} \]

If we apply the Euler method with stepsize $\alpha$ to this ODE, then we receive

\[ p_{n+1} = p_n - \alpha \frac{dC(p_n)}{dp} \]

and we thus recover the gradient descent method. Now assume that you want to use implicit Euler. Then we would have the system

\[ p_{n+1} = p_n - \alpha \frac{dC(p_{n+1})}{dp} \]

which we would then move to one side:

\[ p_{n+1} - p_n + \alpha \frac{dC(p_{n+1})}{dp} = 0 \]

and solve each step via a Newton method. For this Newton method, we need to take the Jacobian of this gradient function, and once again the Hessian arrives as the fundamental quantity.

Neural Network Training as a Shooting Method for Functions

A one layer dense neuron is traditionally written as the function:

\[ layer(x) = \sigma.(Wx + b) \]

where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$ and $\sigma$ is some choice of $\mathbb{R}\rightarrow\mathbb{R}$ nonlinear function, where the . is the Julia dot to signify element-wise operation.
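In Julia this is a one-liner with plain arrays; the sizes below are arbitrary and tanh stands in for $\sigma$:

```julia
# A plain-array sketch of a single dense layer; sizes are arbitrary.
W = randn(4, 3)                  # m = 4 outputs, n = 3 inputs
b = randn(4)
layer(x) = tanh.(W * x .+ b)

layer(randn(3))                  # length-4 output vector
```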

A traditional neural network, feed-forward network, or multi-layer perceptron is a 3 layer function, i.e.

\[ NN(x) = W_3 \sigma_2.(W_2\sigma_1.(W_1x + b_1) + b_2) + b_3 \]

where the first layer is called the input layer, the second is called the hidden layer, and the final is called the output layer. This specific function was seen as desirable because of the Universal Approximation Theorem, which is formally stated as follows:

Let $\sigma$ be a nonconstant, bounded, and continuous function. Let $I_m = [0,1]^m$. The space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$. For any $\epsilon >0$ and any $f\in C(I_m)$, there exists an integer $N$, real constants $W_i$ and $b_i$ s.t.

\[ \Vert NN(x) - f(x) \Vert < \epsilon \]

for all $x \in I_m$. Equivalently, $NN$ given parameters is dense in $C(I_m)$.

However, it turns out that using only one hidden layer can require exponential growth in the size of said hidden layer, where the size is given by the number of columns in $W_1$. To counteract this, deep neural networks were developed to be in the form of the recurrence relation:

\[ v_{i+1} = \sigma_i.(W_i v_{i} + b_i) \]

\[ v_1 = x \]

\[ DNN(x) = v_{n} \]

for some $n$ where $n$ is the number of layers. Given sufficiently large hidden layers, this kind of function is a universal approximator (shown in 2017). Although the full theory is not yet settled, some results have shown that this kind of function is able to fit high dimensional functions without the curse of dimensionality, i.e. the number of parameters does not grow exponentially with the input size. More mathematical results in this direction are still being investigated.

However, this theory gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function $f$ one wishes to fit, one can place the cost function

\[ C(p) = \Vert DNN(x;p) - f(x) \Vert \]

where $DNN(x;p)$ signifies the deep neural network given by the parameters $p$, where the full set of parameters is the $W_i$ and $b_i$. To make the evaluation of that function be practical, we can instead say we wish to evaluate the difference at finitely many points:

\[ C(p) = \sum_{k=1}^N \Vert DNN(x_k;p) - f(x_k) \Vert \]

Training a neural network is machine learning speak for finding the $p$ which minimizes this cost function. Notice that this is then a shooting method problem, where a cost function is defined by direct evaluations of the model with some choice of parameters.
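Here is a self-contained sketch of this idea: a tiny one-hidden-layer network, written directly with arrays and a packed parameter vector, is fit to $f(x) = \sin(x)$ at finitely many points by gradient descent on a squared L2 cost. ForwardDiff.jl supplies the gradient purely for brevity; the sizes and step size are arbitrary choices:

```julia
# A self-contained sketch of training-as-shooting on made-up data.
using ForwardDiff

xs = range(-π, π; length = 30)
target(x) = sin(x)

nh = 16                                         # hidden layer width
# p = [W1 (nh); b1 (nh); W2 (nh); b2 (1)] for a scalar-input, scalar-output net
function nn(x, p)
    h = tanh.(p[1:nh] .* x .+ p[nh+1:2nh])      # hidden layer: σ.(W1*x + b1)
    sum(p[2nh+1:3nh] .* h) + p[3nh+1]           # output layer: W2*h + b2
end

C(p) = sum(abs2(nn(x, p) - target(x)) for x in xs)

p = 0.1 .* randn(3nh + 1)
for _ in 1:2000
    p .-= 0.01 .* ForwardDiff.gradient(C, p)    # plain gradient descent
end
C(p)                                             # cost after "training"
```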

Recurrent Neural Networks

Recurrent neural networks are networks which are given by the recurrence relation:

\[ x_{k+1} = x_k + DNN(x_k,k;p) \]

Given our machinery, we can see this is equivalent to the Euler discretization with $\Delta t = 1$ on the neural ordinary differential equation defined by:

\[ x' = DNN(x,t;p) \]

Thus a recurrent neural network is a sequence of applications of a neural network (or possibly a neural network indexed by integer time).

Computing Gradients

This shows that many different problems, from training neural networks to fitting differential equations, all have the same underlying mathematical structure which requires the ability to compute the gradient of a cost function given model evaluations. However, this simply reduces to computing the gradient of the model's output given the parameters. To see this, let's take for example the L2 loss function, i.e.

\[ C(p) = \sum_{i=1}^N \Vert f(x_i;p) - y_i \Vert^2 \]

for some finite data points $y_i$. In the ODE model, $y_i$ are time series points. In the general neural network, $y_i = d(x_i)$ for the function we wish to fit $d$. In data science applications of machine learning, $y_i = d_i$ the discrete data points we wish to fit. In any of these cases, we see that by the chain rule we have

\[ \frac{dC}{dp} = \sum_{i=1}^N 2 \left(f(x_i;p) - y_i \right) \frac{df(x_i;p)}{dp} \]

and therefore, knowing how to efficiently compute $\frac{df(x_i)}{dp}$ is the essential question for shooting-based parameter fitting.

Forward-Mode Automatic Differentiation for Gradients

Let's recall the forward-mode method for computing gradients. For an arbitrary nonlinear function $f$ with scalar output, we can compute derivatives by putting a dual number in. For example, with

\[ d = d_0 + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

we have that

\[ f(d) = f(d_0) + f'(d_0)v_1 \epsilon_1 + \ldots + f'(d_0)v_m \epsilon_m \]

where $f'(d_0)v_i$ is the directional derivative in the direction of $v_i$. To compute the gradient with respect to the input, we thus need to make $v_i = e_i$, the standard basis vectors.

However, in this case we now do not want to compute the derivative with respect to the input! Instead, now we have $f(x;p)$ and want to compute the derivatives with respect to $p$. This simply means that we want to take derivatives in the directions of the parameters. To do this, let:

\[ x = x_0 + 0 \epsilon_1 + \ldots + 0 \epsilon_k \]

\[ P = p + e_1 \epsilon_1 + \ldots + e_k \epsilon_k \]

where there are $k$ parameters. We then have that

\[ f(x;P) = f(x;p) + \frac{df}{dp_1} \epsilon_1 + \ldots + \frac{df}{dp_k} \epsilon_k \]

as the output, and thus a $(k+1)$-dimensional dual number computes the gradient of the function with respect to $k$ parameters.
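With ForwardDiff.jl this is just differentiating the closure $p \mapsto f(x;p)$, so that the dual perturbations are seeded on the parameters rather than the inputs (the model below is a made-up example):

```julia
# A sketch of parameter gradients via forward mode: differentiate the closure
# q -> f(x, q) so the dual perturbations are seeded on the parameters, not on x.
using ForwardDiff

f(x, p) = p[1] * sin(p[2] * x) + p[3]       # toy model (assumed example)
x = 0.7
p = [1.0, 2.0, 3.0]

ForwardDiff.gradient(q -> f(x, q), p)       # [∂f/∂p₁, ∂f/∂p₂, ∂f/∂p₃] at (x, p)
```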

Can we do better?

The Adjoint Technique and Reverse Accumulation

The fast method for computing gradients goes by many names. The adjoint technique, backpropagation, and reverse-mode automatic differentiation are in some sense all equivalent phrases given to this method from different disciplines. To understand the adjoint technique, we will look at the multivariate chain rule on a computation graph. Recall that for $f(x(t),y(t))$ we have:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

We can visualize our direct dependences as the computation graph:

i.e. $t$ directly determines $x$ and $y$, which then determine $f$. To calculate these derivatives, assume you've already evaluated $f(t)$. If this has been done, then you've already had to calculate $x$ and $y$. Thus given the function $f$, we can now calculate $\frac{df}{dx}$ and $\frac{df}{dy}$, and then calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$.

Now let's put another layer in the computation. Let's make $f(x(v(t),w(t)),y(v(t),w(t)))$. We can write out the full expression for the derivative. Notice that even with this additional layer, the statement we wrote above still holds:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

So given an evaluation of $f$, we can (still) directly calculate $\frac{df}{dx}$ and $\frac{df}{dy}$. But now, to calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$, we do the next step of the chain rule:

\[ \frac{dx}{dt} = \frac{dx}{dv}\frac{dv}{dt} + \frac{dx}{dw}\frac{dw}{dt} \]

and similarly for $y$. The full expression grows wild if we actually try to plug everything in! But it's clear that, to calculate $\frac{df}{dt}$, we can first calculate $\frac{df}{dx}$, and then multiply that by $\frac{dx}{dt}$. If we had more layers, we could calculate the sensitivity (the derivative) of the output to the last layer, then the sensitivity to the second layer back is the sensitivity of the last layer multiplied by that, and the third layer back has the sensitivity of the second layer multiplied to it!

Logistic Regression Example

To better see this structure, let's write out a simple example. Let our forward pass through our function be:

\[ \begin{align} z &= wx + b\\ y &= \sigma(z)\\ \mathcal{L} &= \frac{1}{2}(y-t)^2\\ \mathcal{R} &= \frac{1}{2}w^2\\ \mathcal{L}_{reg} &= \mathcal{L} + \lambda \mathcal{R}\end{align} \]

The formulation of the program here is called a Wengert list, tape, or graph. In this, $x$ and $t$ are inputs, $b$ and $w$ are parameters, $z$, $y$, $\mathcal{L}$, and $\mathcal{R}$ are intermediates, and $\mathcal{L}_{reg}$ is our output.

This is a simple univariate logistic regression model. To do logistic regression, we wish to find the parameters $w$ and $b$ which minimize $\mathcal{L}_{reg}$, which is done by computing derivatives.

Let's calculate the derivatives with respect to each quantity in reverse order. If our program is $f(x) = \mathcal{L}_{reg}$, then we have that

\[ \frac{df}{d\mathcal{L}_{reg}} = 1 \]

as the derivatives of the last layer. To computerize our notation, let's write

\[ \overline{\mathcal{L}_{reg}} = \frac{df}{d\mathcal{L}_{reg}} \]

for our computed values. For the derivatives of the second to last layer, we have that:

\[ \begin{align} \overline{\mathcal{R}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{R}}\\ &= \overline{\mathcal{L}_{reg}} \lambda \end{align} \]

\[ \begin{align} \overline{\mathcal{L}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{L}}\\ &= \overline{\mathcal{L}_{reg}} \end{align} \]

This was our observation from before that the derivative of the second layer is the partial derivative of the current values times the sensitivity of the final layer. And then we keep multiplying, so now for our next layer we have that:

\[ \begin{align} \overline{y} &= \overline{\mathcal{L}} \frac{d\mathcal{L}}{dy}\\ &= \overline{\mathcal{L}} (y-t) \end{align} \]

And notice that the chain rule holds since $\overline{\mathcal{L}}$ implicitly already has the multiplication by $\overline{\mathcal{L}_{reg}}$ inside of it. Then the next layer is:

\[ \begin{align} \overline{z} &= \overline{y} \frac{dy}{dz}\\ &= \overline{y} \sigma^\prime(z) \end{align} \]

Then the next layer. Notice that here, by the chain rule on $w$ we have that:

\[ \begin{align} \overline{w} &= \overline{z} \frac{\partial z}{\partial w} + \overline{\mathcal{R}} \frac{d \mathcal{R}}{dw}\\ &= \overline{z} x + \overline{\mathcal{R}} w\end{align} \]

\[ \begin{align} \overline{b} &= \overline{z} \frac{\partial z}{\partial b}\\ &= \overline{z} \end{align} \]

This completely calculates all derivatives. In conclusion, the rule is:

  • You sum terms from each outward arrow

  • Each arrow has the derivative term of the end times the partial of the current term.

  • Recurse backwards to build simple linear combination expressions.

You can thus think of the relations as a message passing relation in reverse to the forward pass:

Note that the reverse-pass has the values of the forward pass, like $x$ and $t$, embedded within it.
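A hand-rolled sketch of this forward and reverse pass, mirroring the adjoint equations term by term (with made-up values for the inputs and parameters), looks like:

```julia
# A hand-rolled sketch of the forward and reverse pass above; the `d` prefix
# stands for the bar (sensitivity) quantities.
σ(z)  = 1 / (1 + exp(-z))
dσ(z) = σ(z) * (1 - σ(z))

x, t, λ = 2.0, 1.0, 0.1          # input, target, regularization strength
w, b    = 0.3, -0.2              # parameters

# forward pass (the Wengert list)
z    = w * x + b
y    = σ(z)
L    = (y - t)^2 / 2
R    = w^2 / 2
Lreg = L + λ * R

# reverse pass
dLreg = 1.0
dR    = dLreg * λ
dL    = dLreg
dy    = dL * (y - t)
dz    = dy * dσ(z)
dw    = dz * x + dR * w
db    = dz
(dw, db)
```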

Backpropagation of a Neural Network

Now let's look at backpropagation of a deep neural network. Before getting to it in the linear algebraic sense, let's write everything in terms of scalars. This means we can write a simple neural network as:

\[ \begin{align} z_i &= \sum_j W_{ij}^1 x_j + b_i^1\\ h_i &= \sigma(z_i)\\ y_i &= \sum_j W_{ij}^2 h_j + b_i^2\\ \mathcal{L} &= \frac{1}{2} \sum_k \left(y_k - t_k \right)^2 \end{align} \]

where I have chosen the L2 loss function. This is visualized by the computational graph:

Then we can do the same process as before to get:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y_i} &= \overline{\mathcal{L}} (y_i - t_i)\\ \overline{w_{ij}^2} &= \overline{y_i} h_j\\ \overline{b_i^2} &= \overline{y_i}\\ \overline{h_i} &= \sum_k (\overline{y_k}w_{ki}^2)\\ \overline{z_i} &= \overline{h_i}\sigma^\prime(z_i)\\ \overline{w_{ij}^1} &= \overline{z_i} x_j\\ \overline{b_i^1} &= \overline{z_i}\end{align} \]

just by examining the computation graph. Now let's write this in linear algebraic form.

The forward pass for this simple neural network was:

\[ \begin{align} z &= W_1 x + b_1\\ h &= \sigma(z)\\ y &= W_2 h + b_2\\ \mathcal{L} &= \frac{1}{2} \Vert y-t \Vert^2 \end{align} \]

If we carefully decode our scalar expression, we see that we get the following:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y} &= \overline{\mathcal{L}}(y-t)\\ \overline{W_2} &= \overline{y}h^{T}\\ \overline{b_2} &= \overline{y}\\ \overline{h} &= W_2^T \overline{y}\\ \overline{z} &= \overline{h} .* \sigma^\prime(z)\\ \overline{W_1} &= \overline{z} x^T\\ \overline{b_1} &= \overline{z} \end{align} \]

We can thus decode the rules as:

  • Multiplying by the matrix going forwards means multiplying by the transpose going backwards. A term on the left stays on the left, and a term on the right stays on the right.

  • Element-wise operations give element-wise multiplication

Notice that the summation is then easily encoded into this rule by the transpose operation.
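The matrix form of the backward pass can likewise be written out directly; the sketch below uses arbitrary sizes and tanh as the activation:

```julia
# A sketch of the matrix-form backward pass just derived, for one input/target
# pair with arbitrary sizes.
act(z)  = tanh(z)
dact(z) = 1 - tanh(z)^2

W1, b1 = randn(5, 3), randn(5)
W2, b2 = randn(2, 5), randn(2)
x, t   = randn(3), randn(2)

# forward pass
z = W1 * x + b1
h = act.(z)
y = W2 * h + b2
L = sum(abs2, y - t) / 2

# backward pass (`d` prefix = the bar quantities)
dy  = y - t                # with the seed L̄ = 1 folded in
dW2 = dy * h'              # outer product
db2 = dy
dh  = W2' * dy             # multiply by the transpose going backwards
dz  = dh .* dact.(z)       # element-wise op gives element-wise multiply
dW1 = dz * x'
db1 = dz
```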

We can write it in the general DNN form of:

\[ r_i = W_i v_{i} + b_i \]

\[ v_{i+1} = \sigma_i.(r_i) \]

\[ v_1 = x \]

\[ \mathcal{L} = \frac{1}{2} \Vert v_{n} - t \Vert^2 \]

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{v_n} &= \overline{\mathcal{L}}(v_n-t)\\ \overline{r_i} &= \overline{v_{i+1}} .* \sigma_i^\prime (r_i)\\ \overline{W_i} &= \overline{r_i} v_{i}^{T}\\ \overline{b_i} &= \overline{r_i}\\ \overline{v_{i}} &= W_{i}^{T} \overline{r_i} \end{align} \]

Reverse-Mode Automatic Differentiation and vjps

Backpropagation of a neural network is thus a different way of accumulating derivatives. If $f$ is a composition of $L$ functions:

\[ f = f^L \circ f^{L-1} \circ \ldots \circ f^1 \]

Then the Jacobian matrix satisfies:

\[ J = J_L J_{L-1} \ldots J_1 \]

A program is essentially a nice way of writing a function in composition form. Forward-mode automatic differentiation worked by propagating forward the actions of the Jacobians at every step of the program:

\[ Jv = J_L (J_{L-1} (\ldots (J_1 v) \ldots )) \]

effectively calculating the Jacobian of the program by multiplying by the Jacobians from left to right at each step of the way. This means doing primitive $Jv$ calculations on each underlying problem, and pushing that calculation through.

But what about reverse accumulation? This can be isolated to the simple expression graph:

In backpropagation, we just showed that when doing reverse accumulation, the rule is that multiplication forwards is multiplication by the transpose backwards. So the way to compute the Jacobian in reverse is to replace each matrix by its transpose:

We can either look at it as $J^T v$, or, transposing the equation, as $v^T J$. Right there we have a vector-transpose Jacobian product, or a vjp.

We can thus think of this as a different direction for the Jacobian accumulation. Reverse-mode automatic differentiation moves backwards through our composed Jacobian. For a value $v$ at the end, we can push it backwards:

\[ v^T J = (\ldots ((v^T J_L) J_{L-1}) \ldots ) J_1 \]

doing a vjp at every step of the way, which is simply doing reverse-mode AD of that function (and if it's linear, then simply doing the matrix multiplication). Thus reverse-mode AD is just a grouping of vjps into a single larger expression, instead of linearizing every single step.

Primitives of Reverse Mode

For forward-mode AD, we saw that we could define primitives in order to accelerate the calculation. For example, knowing that

\[ \exp(x+\epsilon) = \exp(x) + \exp(x)\epsilon \]

allows the program to skip autodifferentiating through the code for exp. This was simple with forward-mode since we could represent the operation on a Dual number. What's the equivalent for reverse-mode AD? The answer is the pullback function. If $y = [y_1,y_2,\ldots] = f(x_1,x_2, \ldots)$, then $[\overline{x_1},\overline{x_2},\ldots]=\mathcal{B}_f^x(\overline{y})$ is the pullback of $f$ at the point $x$, defined for a scalar loss function $L(y)$ as:

\[ \overline{x_i} = \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i} \]

Using the notation from earlier, $\overline{y} = \frac{\partial L}{\partial y}$ is the derivative of the cost function with respect to some intermediate value $y$, and thus

\[ \overline{x_i} = \sum_j \overline{y_j} \frac{\partial y_j}{\partial x_i} = \mathcal{B}_f^x(\overline{y}) \]

Note that $\mathcal{B}_f^x(\overline{y})$ is a function of $x$ because the reverse pass that is used embeds values from the forward pass, namely those calculated during the evaluation of $f(x)$.

By the chain rule, if we don't have a primitive defined for $y_i(x)$, we can compute it by $\mathcal{B}_{y_i}(\overline{y})$, and recursively apply this process until we hit rules that we know. The rules to start with are the scalar derivative rules, which follow quite simply, and the multivariate rules which we derived above. For example, if $y=f(x)=Ax$, then

\[ \mathcal{B}_{f}^x(\overline{y}) = \overline{y}^T A \]

which is simply saying that the Jacobian of $f$ at $x$ is $A$, and so the vjp is to multiply the vector transpose by $A$.

Likewise, for element-wise operations, the Jacobian is diagonal, and thus the vjp is multiplying once again by a diagonal matrix against the derivative, deriving the same pullback as we had for backpropagation in a neural network. This then is a quicker encoding and derivation of backpropagation.
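As a sketch, a pullback for $f(x) = Ax$ can be written as a closure over the forward-pass values, returning a function that maps $\overline{y}$ to $\overline{x} = A^T \overline{y}$ (the names here are made up; real AD systems structure their rules similarly):

```julia
# A sketch of a pullback primitive for f(x) = A*x.
function pullback_mul(A, x)
    y = A * x                     # forward value
    back(ybar) = A' * ybar        # the vjp: x̄ = Aᵀȳ, equivalently ȳᵀA
    return y, back
end

A = randn(3, 4); x = randn(4)
y, back = pullback_mul(A, x)
xbar = back(ones(3))              # gradient of sum(y) with respect to x
```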

Multivariate Derivatives from Reverse Mode

Since the primitive of reverse mode is the vjp, we can understand its behavior by looking at a large primitive. In our simplest case, the function $f(x)=Ax$ outputs a vector value, to which we apply the loss function $L(y) = \Vert y-t \Vert^2$ to get a scalar. Thus we seed the scalar output $v=1$, and in the first step backwards we have a vector-to-scalar function, so the first pullback transforms $1$ into the vector $v_2 = 2(y-t)$. Then we take that vector and multiply it like $v_2^T A$ to get the derivatives w.r.t. $x$.

Now let $L(y)$ be a vector function, i.e. we output a vector instead of a scalar from our loss function. Then $v$ is the seed to this process. Let's assume that $v = e_i$, one of the basis vectors. Then

\[ v_i^T J = e_i^T J \]

computes a row of the Jacobian. Therefore, if we had a vector function $y=f(x)$, the pullback $\mathcal{B}_f^x(e_i)$ is the $i$th row of the Jacobian $f'(x)$. Concatenating these is thus a way to build a full Jacobian. The gradient is thus a special case where $y$ is scalar, and thus the resulting Jacobian is just a single row, and therefore we set the seed equal to $1$ to compute the unscaled gradient.

Multi-Seeding

Similarly to forward-mode having a dual number with multiple simultaneous derivatives through partials $d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m$, one can see that multi-seeding is an option in reverse-mode AD by pulling back a matrix instead of a row vector, where each row is a direction. Thus the matrix $A = [v_1 v_2 \ldots v_n]^T$ evaluated as $\mathcal{B}_f^x(A)$ is the equivalent operation to the forward-mode $f(d)$ for generalized multivariate multiseeded reverse-mode automatic differentiation. One should take care to recognize the Jacobian as a generalized linear operator in this case and ensure that the shapes in the program correctly handle this storage of the reverse seed. When linear, this will automatically make use of BLAS3 operations, making it an efficient form for neural networks.

Sparse Reverse Mode AD

Since the Jacobian is built row-by-row with reverse mode AD, the sparse differentiation discussion from forward-mode AD applies similarly but to the transpose. Therefore, in order to perform sparse reverse mode automatic differentiation, one would build up a connectivity graph of the columns, and perform a coloring algorithm on this graph. The seeds of the reverse call, $v_i$, would then be the color vectors, which would compute compressed rows, that are then decompressed similarly to the forward-mode case.

Forward Mode vs Reverse Mode

Notice that a pullback of a single scalar gives the gradient of a function, while the pushforward using forward-mode of a dual gives a directional derivative. Forward mode computes columns of a Jacobian, while reverse mode computes gradients (rows of a Jacobian). Therefore, the relative efficiency of the two approaches is based on the size of the Jacobian. If $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$, then the Jacobian is of size $m \times n$. If $m$ is much smaller than $n$, then computing by each row will be faster, and thus use reverse mode. In the case of a gradient, $m=1$ while $n$ can be large, leading to this phenomenon. Likewise, if $n$ is much smaller than $m$, then computing by each column will be faster. We will see shortly that reverse-mode AD has a high overhead with respect to forward mode, and thus if the values are relatively equal (or $n$ and $m$ are small), forward mode is more efficient.

However, since optimization needs gradients, reverse-mode definitely has a place in the standard toolchain which is why backpropagation is so central to machine learning.

Side Note on Mixed Mode

Interestingly, one can find cases where mixing the forward and reverse mode results would give an asymptotically better result. For example, if a Jacobian was non-zero in only the first 3 rows and first 3 columns, then sparse forward mode would still require N partials and reverse mode would require M seeds. However, one forward mode call of 3 partials and one reverse mode call of 3 seeds would calculate all three rows and columns with $\mathcal{O}(1)$ work, as opposed to $\mathcal{O}(N)$ or $\mathcal{O}(M)$. Exactly how to make use of this insight in an automated manner is an open research question.

Forward-Over-Reverse and Hessian-Free Products

Using this knowledge, we can also develop quick ways for computing the Hessian. Recall from earlier in the discussion that Hessians are the Jacobian of the gradient. So let's say for a scalar function $f$ we want to compute the Hessian. To compute the gradient, we use the reverse-mode AD pullback $\nabla f(x) = \mathcal{B}_f^x(1)$. Recall that the pullback is a function of $x$ since that is the value at which the values from the forward pass are taken. Then since the Jacobian of the gradient vector is $n \times n$ (as many terms in the gradient as there are inputs!), it holds that we want to use forward-mode AD for this Jacobian. Therefore, evaluating the reverse-mode gradient function at the dual number $x = x_0 + e_1 \epsilon_1 + \ldots + e_n \epsilon_n$ computes the full Hessian in one forward-over-reverse pass. What this amounts to is pushing the dual-number forward sensitivities through when building the pullback, and then, when doing the pullback, the dual portions will be holding vectors for the columns of the Hessian.

Similarly, Hessian-vector products without computing the Hessian can be computed using the Jacobian-vector product trick on the function defined by the gradient. Here, $Hv$ is equivalent to the dual part of

\[ \nabla f(x+v\epsilon) = \mathcal{B}_f^{x+v\epsilon}(1) \]

This means that our Newton method for optimization:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

can be treated similarly to that for the nonlinear solving problem, where the linear system can be solved using Hessian-free vector products to build a Krylov subspace, giving rise to the Hessian-free Newton Krylov method for optimization.
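A hedged sketch of such a Hessian-free product, using ForwardDiff.jl over Zygote.jl (assuming those packages) to push a dual number through the reverse-mode gradient of a toy function:

```julia
# A hedged sketch of a Hessian-vector product by forward-over-reverse: push a
# forward-mode dual in the direction v through the reverse-mode gradient.
using ForwardDiff, Zygote

f(x) = x[1]^2 * x[2] + sin(x[2])            # toy scalar function (assumed example)
x = [1.0, 2.0]
v = [0.0, 1.0]

grad(z) = Zygote.gradient(f, z)[1]          # reverse-mode gradient ∇f(z)
Hv = ForwardDiff.derivative(t -> grad(x .+ t .* v), 0.0)   # H(x)*v, no Hessian formed
```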

References

We thank Roger Grosse's lecture notes for the amazing tikz graphs.

\ No newline at end of file +

Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Chris Rackauckas
October 22nd, 2020

Youtube Video Link

Have a model. Have data. Fit model to data.

This is a problem that goes under many different names: parameter estimation, inverse problems, training, etc. In this lecture we will go through the methods for how that's done, starting with the basics and bringing in the recent techniques from machine learning that can be used to improve the basic implementations.

The Shooting Method for Parameter Fitting

Assume that we have some model $u = f(p)$, where $p$ is our parameters: we put in some parameters and receive our simulated data $u$. How should you choose $p$ such that $u$ best fits that data? The shooting method directly uses this high level definition of the model by putting a cost function on the output $C(p)$. The choice of cost function is up to the user and is model-dependent. However, a common one is the L2-loss. If $y$ is our expected data, then the L2-loss function against the data is simply:

\[ C(p) = \Vert f(p) - y \Vert \]

where $C(p): \mathbb{R}^n \rightarrow \mathbb{R}$ is a function that returns a scalar. The shooting method then directly optimizes this cost function by having the optimizer generate new simulated data given new choices of $p$.

Methods for Optimization

There are many different nonlinear optimization methods which can be used for this purpose, and for a full survey one should look at packages like JuMP, Optim.jl, and NLopt.jl.

There are generally two sets of methods: global and local optimization methods. Local optimization methods attempt to find the best nearby extrema by finding a point where the gradient $\frac{dC}{dp} = 0$. Global optimization methods attempt to explore the whole space and find the best of the extrema. Global methods tend to employ a lot more heuristics and are extremely computationally difficult, and thus many studies focus on local optimization. We will focus strictly on local optimization, but one may want to look into global optimization for many applications of parameter estimation.

Most local optimizers make use of derivative information in order to accelerate the solver. The simplest of which is the method of gradient descent. In this method, given a set of parameters $p_i$, the next step of parameters one will try is:

\[ p_{i+1} = p_i - \alpha \frac{dC(p_i)}{dp} \]

that is, update $p_i$ by walking in the downward direction of the gradient. Instead of using just first order information, one may want to directly solve the rootfinding problem $\frac{dC}{dp} = 0$ using Newton's method. Newton's method in this case looks like:

\[ p_{i+1} = p_i - (\frac{d}{dp}\frac{dC}{dp})^{-1} \frac{dC}{dp} \]

But notice that the Jacobian of the gradient is the Hessian, and thus we can rewrite this as:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

where $H(p)$ is the Hessian matrix $H_{ij} = \frac{\partial^2 C}{\partial p_i \partial p_j}$. However, solving a system of equations which involves the Hessian can be difficult (just like the Jacobian, but now with another layer of differentiation!), and thus many optimization techniques attempt to avoid the Hessian. A commonly used technique that is somewhat in the middle is the BFGS technique, which is a gradient-based optimization method that attempts to approximate the Hessian along the way to modify its stepping behavior. It uses the history of previously calculated points in order to build this cheap Hessian approximation. If one keeps only a constant-length history, say the last 5 points, then one arrives at the l-BFGS technique, which is one of the most common large-scale optimization techniques.

Connection Between Optimization and Differential Equations

There is actually a strong connection between optimization and differential equations. Let's say we wanted to follow the gradient of the solution towards a local minimum. That would mean that the flow that we would wish to follow is given by an ODE, specifically the ODE:

\[ p' = -\frac{dC}{dp} \]

If we apply the Euler method to this ODE, then we receive

\[ p_{n+1} = p_n - \alpha \frac{dC(p_n)}{dp} \]

and we thus recover the gradient descent method. Now assume that you want to use implicit Euler. Then we would have the system

\[ p_{n+1} = p_n - \alpha \frac{dC(p_{n+1})}{dp} \]

which we would then move to one side:

\[ p_{n+1} - p_n + \alpha \frac{dC(p_{n+1})}{dp} = 0 \]

and solve each step via a Newton method. For this Newton method, we need to take the Jacobian of this gradient function, and once again the Hessian arrives as the fundamental quantity.

Neural Network Training as a Shooting Method for Functions

A one layer dense neuron is traditionally written as the function:

\[ layer(x) = \sigma.(Wx + b) \]

where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$ and $\sigma$ is some choice of $\mathbb{R}\rightarrow\mathbb{R}$ nonlinear function, where the . is the Julia dot to signify element-wise operation.

A traditional neural network, feed-forward network, or multi-layer perceptron is a 3 layer function, i.e.

\[ NN(x) = W_3 \sigma_2.(W_2\sigma_1.(W_1x + b_1) + b_2) + b_3 \]

where the first layer is called the input layer, the second is called the hidden layer, and the final is called the output layer. This specific function was seen as desirable because of the Universal Approximation Theorem, which is formally stated as follows:

Let $\sigma$ be a nonconstant, bounded, and continuous function. Let $I_m = [0,1]^m$. The space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$. For any $\epsilon >0$ and any $f\in C(I_m)$, there exists an integer $N$, real constants $W_i$ and $b_i$ s.t.

\[ \Vert NN(x) - f(x) \Vert < \epsilon \]

for all $x \in I_m$. Equivalently, $NN$ given parameters is dense in $C(I_m)$.

However, it turns out that using only one hidden layer can require exponential growth in the size of said hidden layer, where the size is given by the number of columns in $W_1$. To counteract this, deep neural networks were developed to be in the form of the recurrence relation:

\[ v_{i+1} = \sigma_i.(W_i v_{i} + b_i) \]

\[ v_1 = x \]

\[ DNN(x) = v_{n} \]

for some $n$ where $n$ is the number of layers. Given a sufficient size of the hidden layers, this kind of function is a universal approximator (2017). Although it's not quite known yet, some results have shown that this kind of function is able to fit high dimensional functions without the curse of dimensionality, i.e. the number of parameters does not grow exponentially with the input size. More mathematical results in this direction are still being investigated.

However, this theory gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function $f$ one wishes to fit, one can place the cost function

\[ C(p) = \Vert DNN(x;p) - f(x) \Vert \]

where $DNN(x;p)$ signifies the deep neural network given by the parameters $p$, where the full set of parameters is the $W_i$ and $b_i$. To make the evaluation of that function be practical, we can instead say we wish to evaluate the difference at finitely many points:

\[ C(p) = \sum_k^N \Vert DNN(x_k;p) - f(x_k) \Vert \]

Training a neural network is machine learning speak for finding the $p$ which minimizes this cost function. Notice that this is then a shooting method problem, where a cost function is defined by direct evaluations of the model with some choice of parameters.

Recurrent Neural Networks

Recurrent neural networks are networks which are given by the recurrence relation:

\[ x_{k+1} = x_k + DNN(x_k,k;p) \]

Given our machinery, we can see this is equivalent to the Euler discretization with $\Delta t = 1$ on the neural ordinary differential equation defined by:

\[ x' = DNN(x,t;p) \]

Thus a recurrent neural network is a sequence of applications of a neural network (or possibly a neural network indexed by integer time).

Computing Gradients

This shows that many different problems, from training neural networks to fitting differential equations, all have the same underlying mathematical structure which requires the ability to compute the gradient of a cost function given model evaluations. However, this simply reduces to computing the gradient of the model's output given the parameters. To see this, let's take for example the L2 loss function, i.e.

\[ C(p) = \sum_i^N \Vert f(x_i;p) - y_i \Vert \]

for some finite data points $y_i$. In the ODE model, $y_i$ are time series points. In the general neural network, $y_i = d(x_i)$ for the function we wish to fit $d$. In data science applications of machine learning, $y_i = d_i$ the discrete data points we wish to fit. In any of these cases, we see that by the chain rule we have

\[ \frac{dC}{dp} = \sum_i^N 2 \left(f(x_i;p) - y_i \right) \frac{df(x_i)}{dp} \]

and therefore, knowing how to efficiently compute $\frac{df(x_i)}{dp}$ is the essential question for shooting-based parameter fitting.

Forward-Mode Automatic Differentiation for Gradients

Let's recall the forward-mode method for computing gradients. For an arbitrary nonlinear function $f$ with scalar output, we can compute derivatives by putting a dual number in. For example, with

\[ d = d_0 + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

we have that

\[ f(d) = f(d_0) + f'(d_0)v_1 \epsilon_1 + \ldots + f'(d_0)v_m \epsilon_m \]

where $f'(d_0)v_i$ is the directional derivative in the direction of $v_i$. To compute the gradient with respect to the input, we thus need to make $v_i = e_i$, the standard basis vectors.

However, in this case we now do not want to compute the derivative with respect to the input! Instead, now we have $f(x;p)$ and want to compute the derivatives with respect to $p$. This simply means that we want to take derivatives in the directions of the parameters. To do this, let:

\[ x = x_0 + 0 \epsilon_1 + \ldots + 0 \epsilon_k \]

\[ P = p + e_1 \epsilon_1 + \ldots + e_k \epsilon_k \]

where there are $k$ parameters. We then have that

\[ f(x;P) = f(x;p) + \frac{df}{dp_1} \epsilon_1 + \ldots + \frac{df}{dp_k} \epsilon_k \]

as the output, and thus a $k+1$-dimensional number computes the gradient of the function with respect to $k$ parameters.

Can we do better?

The Adjoint Technique and Reverse Accumulation

The fast method for computing gradients goes by many names. The adjoint technique, backpropagation, and reverse-mode automatic differentiation are in some sense all equivalent phrases given to this method from different disciplines. To understand the adjoint technique, we will look at the multivariate chain rule on a computation graph. Recall that for $f(x(t),y(t))$ we have:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

We can visualize our direct dependences as the computation graph:

i.e. $t$ directly determines $x$ and $y$, which then determine $f$. To calculate these derivatives, assume you've already evaluated $f(t)$. If this has been done, then you've already had to calculate $x$ and $y$. Thus given the function $f$, we can now calculate $\frac{df}{dx}$ and $\frac{df}{dy}$, and then calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$.

Now let's put another layer in the computation. Let's make $f(x(v(t),w(t)),y(v(t),w(t)))$. We can write out the full expression for the derivative. Notice that even with this additional layer, the statement we wrote above still holds:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

So given an evaluation of $f$, we can (still) directly calculate $\frac{df}{dx}$ and $\frac{df}{dy}$. But now, to calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$, we do the next step of the chain rule:

\[ \frac{dx}{dt} = \frac{dx}{dv}\frac{dv}{dt} + \frac{dx}{dw}\frac{dw}{dt} \]

and similarly for $y$. The full expression grows wild if we actually try to plug everything in! But it's clear that, to calculate $\frac{df}{dt}$, we can first calculate $\frac{df}{dx}$, and then multiply that by $\frac{dx}{dt}$. If we had more layers, we could calculate the sensitivity (the derivative) of the output to the last layer, then the sensitivity to the second layer back is the sensitivity of the last layer multiplied by that, and the third layer back has the sensitivity of the second layer multiplied to it!

Logistic Regression Example

To better see this structure, let's write out a simple example. Let our forward pass through our function be:

\[ \begin{align} z &= wx + b\\ y &= \sigma(z)\\ \mathcal{L} &= \frac{1}{2}(y-t)^2\\ \mathcal{R} &= \frac{1}{2}w^2\\ \mathcal{L}_{reg} &= \mathcal{L} + \lambda \mathcal{R}\end{align} \]

The formulation of the program here is called a Wengert list, tape, or graph. In this, $x$ and $t$ are inputs, $b$ and $w$ are parameters, $z$, $y$, $\mathcal{L}$, and $\mathcal{R}$ are intermediates, and $\mathcal{L}_{reg}$ is our output.

This is a simple univariate logistic regression model. To do logistic regression, we wish to find the parameters $w$ and $b$ which minimize the distance of $\mathcal{L}_{reg}$ from a desired output, which is done by computing derivatives.

Let's calculate the derivatives with respect to each quantity in reverse order. If our program is $f(x) = \mathcal{L}_{reg}$, then we have that

\[ \frac{df}{d\mathcal{L}_{reg}} = 1 \]

as the derivatives of the last layer. To computerize our notation, let's write

\[ \overline{\mathcal{L}_{reg}} = \frac{df}{d\mathcal{L}_{reg}} \]

for our computed values. For the derivatives of the second to last layer, we have that:

\[ \begin{align} \overline{\mathcal{R}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{R}}\\ &= \overline{\mathcal{L}_{reg}} \lambda \end{align} \]

\[ \begin{align} \overline{\mathcal{L}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{L}}\\ &= \overline{\mathcal{L}_{reg}} \end{align} \]

This was our observation from before that the derivative of the second layer is the partial derivative of the current values times the sensitivity of the final layer. And then we keep multiplying, so now for our next layer we have that:

\[ \begin{align} \overline{y} &= \overline{\mathcal{L}} \frac{d\mathcal{L}}{dy}\\ &= \overline{\mathcal{L}} (y-t) \end{align} \]

And notice that the chain rule holds since $\overline{\mathcal{L}}$ implicitly already has the multiplication by $\overline{\mathcal{L}_{reg}}$ inside of it. Then the next layer is:

\[ \begin{align} \overline{z} &= \overline{y} \frac{dy}{dz}\\ &= \overline{y} \sigma^\prime(z) \end{align} \]

Then the next layer. Notice that here, by the chain rule on $w$ we have that:

\[ \begin{align} \overline{w} &= \overline{z} \frac{\partial z}{\partial w} + \overline{\mathcal{R}} \frac{d \mathcal{R}}{dw}\\ &= \overline{z} x + \overline{\mathcal{R}} w\end{align} \]

\[ \begin{align} \overline{b} &= \overline{z} \frac{\partial z}{\partial b}\\ &= \overline{z} \end{align} \]

This completely calculates all derivatives. In conclusion, the rule is:

  • You sum terms from each outward arrow

  • Each arrow has the derivative term of the end times the partial of the current term.

  • Recurse backwards to build simple linear combination expressions.

You can thus think of the relations as a message passing relation in reverse to the forward pass:

Note that the reverse-pass has the values of the forward pass, like $x$ and $t$, embedded within it.

Backpropagation of a Neural Network

Now let's look at backpropagation of a deep neural network. Before getting to it in the linear algebraic sense, let's write everything in terms of scalars. This means we can write a simple neural network as:

\[ \begin{align} z_i &= \sum_j W_{ij}^1 x_j + b_i^1\\ h_i &= \sigma(z_i)\\ y_i &= \sum_j W_{ij}^2 h_j + b_i^2\\ \mathcal{L} &= \frac{1}{2} \sum_k \left(y_k - t_k \right)^2 \end{align} \]

where I have chosen the L2 loss function. This is visualized by the computational graph:

Then we can do the same process as before to get:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y_i} &= \overline{\mathcal{L}} (y_i - t_i)\\ \overline{w_{ij}^2} &= \overline{y_i} h_j\\ \overline{b_i^2} &= \overline{y_i}\\ \overline{h_i} &= \sum_k (\overline{y_k}w_{ki}^2)\\ \overline{z_i} &= \overline{h_i}\sigma^\prime(z_i)\\ \overline{w_{ij}^1} &= \overline{z_i} x_j\\ \overline{b_i^1} &= \overline{z_i}\end{align} \]

just by examining the computation graph. Now let's write this in linear algebraic form.

The forward pass for this simple neural network was:

\[ \begin{align} z &= W_1 x + b_1\\ h &= \sigma(z)\\ y &= W_2 h + b_2\\ \mathcal{L} &= \frac{1}{2} \Vert y-t \Vert^2 \end{align} \]

If we carefully decode our scalar expression, we see that we get the following:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y} &= \overline{\mathcal{L}}(y-t)\\ \overline{W_2} &= \overline{y}h^{T}\\ \overline{b_2} &= \overline{y}\\ \overline{h} &= W_2^T \overline{y}\\ \overline{z} &= \overline{h} .* \sigma^\prime(z)\\ \overline{W_1} &= \overline{z} x^T\\ \overline{b_1} &= \overline{z} \end{align} \]

We can thus decode the rules as:

  • Multiplying by the matrix going forwards means multiplying by the transpose going backwards. A term on the left stays on the left, and a term on the right stays on the right.

  • Element-wise operations give element-wise multiplication

Notice that the summation is then easily encoded into this rule by the transpose operation.

We can write it in the general DNN form of:

\[ r_i = W_i v_{i} + b_i \]

\[ v_{i+1} = \sigma_i.(r_i) \]

\[ v_1 = x \]

\[ \mathcal{L} = \frac{1}{2} \Vert v_{n} - t \Vert \]

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{v_n} &= \overline{\mathcal{L}}(v_n-t)\\ \overline{r_i} &= \overline{v_{i+1}} .* \sigma_i^\prime (r_i)\\ \overline{W_i} &= \overline{r_i} v_{i}^{T}\\ \overline{b_i} &= \overline{r_i}\\ \overline{v_{i}} &= W_{i}^{T} \overline{r_i} \end{align} \]

Reverse-Mode Automatic Differentiation and vjps

Backpropagation of a neural network is thus a different way of accumulating derivatives. If $f$ is a composition of $L$ functions:

\[ f = f^L \circ f^{L-1} \circ \ldots \circ f^1 \]

Then the Jacobian matrix satisfies:

\[ J = J_L J_{L-1} \ldots J_1 \]

A program is essentially a nice way of writing a function in composition form. Forward-mode automatic differentiation worked by propagating forward the actions of the Jacobians at every step of the program:

\[ Jv = J_L (J_{L-1} (\ldots (J_1 v) \ldots )) \]

effectively calculating the Jacobian of the program by multiplying by the Jacobians from left to right at each step of the way. This means doing primitive $Jv$ calculations on each underlying problem, and pushing that calculation through.

But what about reverse accumulation? This can be isolated to the simple expression graph:

In backpropagation, we just showed that when doing reverse accumulation, the rule is that multiplication forwards is multiplication by the transpose backwards. So if the forward way to compute the Jacobian in reverse is to replace the matrix by its transpose:

We can either look at it as $J^T v$, or by transposing the equation $v^T J$. It's right there that we have a vector-transpose Jacobian product, or a vjp.

We can thus think of this as a different direction for the Jacobian accumulation. Reverse-mode automatic differentiation moves backwards through our composed Jacobian. For a value $v$ at the end, we can push it backwards:

\[ v^T J = (\ldots ((v^T J_L) J_{L-1}) \ldots ) J_1 \]

doing a vjp at every step of the way, which is simply doing reverse-mode AD of that function (and if it's linear, then simply doing the matrix multiplication). Thus reverse-mode AD is just a grouping of vjps into a single larger expression, instead of linearizing every single step.

Primitives of Reverse Mode

For forward-mode AD, we saw that we could define primitives in order to accelerate the calculation. For example, knowing that

\[ exp(x+\epsilon) = exp(x) + exp(x)\epsilon \]

allows the program to skip autodifferentiating through the code for exp. This was simple with forward-mode since we could represent the operation on a Dual number. What's the equivalent for reverse-mode AD? The answer is the pullback function. If $y = [y_1,y_2,\ldots] = f(x_1,x_2, \ldots)$, then $[\overline{x_1},\overline{x_2},\ldots]=\mathcal{B}_f^x(\overline{y})$ is the pullback of $f$ at the point $x$, defined for a scalar loss function $L(y)$ as:

\[ \overline{x_i} = \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i} \]

Using the notation from earlier, $\overline{y} = \frac{\partial L}{\partial y}$ is the derivative of the cost function with respect to some intermediate value $y$, and thus

\[ \overline{x_i} = \sum_j \overline{y_j} \frac{\partial y_j}{\partial x_i} = \mathcal{B}_f^x(\overline{y}) \]

Note that $\mathcal{B}_f^x(\overline{y})$ is a function of $x$ because the reverse pass that is used embeds values from the forward pass, namely those calculated during the evaluation of $f(x)$.

By the chain rule, if we don't have a primitive defined for $y_i(x)$, we can compute it by $\mathcal{B}_{y_i}(\overline{y})$, and recursively apply this process until we hit rules that we know. The rules to start with are the scalar derivative rules, which follow quite simply, and the multivariate rules which we derived above. For example, if $y=f(x)=Ax$, then

\[ \mathcal{B}_{f}^x(\overline{y}) = \overline{y}^T A \]

which is simply saying that the Jacobian of $f$ at $x$ is $A$, and so the vjp is to multiply the vector transpose by $A$.

Likewise, for element-wise operations, the Jacobian is diagonal, and thus the vjp is multiplying once again by a diagonal matrix against the derivative, deriving the same pullback as we had for backpropagation in a neural network. This then is a quicker encoding and derivation of backpropagation.
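
To make these primitives concrete, here is a minimal sketch of the pullback style in Julia, with illustrative helper names (pullback_linear, pullback_tanh) that are not from any particular AD package: each forward call returns the value together with a closure that computes the vjp.

using LinearAlgebra

# Minimal sketch of reverse-mode primitives as pullbacks (illustrative helpers,
# not a real AD package): the forward pass returns the value plus a vjp closure.
function pullback_linear(A, x)        # f(x) = A*x
    y = A * x
    back(Δ) = A' * Δ                  # vjp: Δᵀ J = Δᵀ A, returned here as Aᵀ Δ
    return y, back
end

function pullback_tanh(x)             # element-wise op, so the Jacobian is diagonal
    y = tanh.(x)
    back(Δ) = Δ .* (1 .- y .^ 2)      # vjp is an element-wise product; no matrix is built
    return y, back
end

A, x = randn(3, 2), randn(2)
y1, back1 = pullback_linear(A, x)     # forward pass stores what the reverse pass needs
y2, back2 = pullback_tanh(y1)
Δx = back1(back2(ones(3)))            # compose the vjps backwards through the program

Composing the closures from the last operation back to the first is exactly the $v^T J = (\ldots((v^T J_L) J_{L-1})\ldots)J_1$ accumulation from above.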

Multivariate Derivatives from Reverse Mode

Since the primitive of reverse mode is the vjp, we can understand its behavior by looking at a large primitive. In our simplest case, the function $f(x)=Ax$ outputs a vector value, to which we apply our loss function $L(y) = \Vert y-t \Vert^2$ to get a scalar. Thus we seed the scalar output $v=1$, and in the first step backwards we have a vector-to-scalar function, so the first pullback transforms from $1$ to the vector $v_2 = 2(y-t)$. Then we take that vector and multiply it like $v_2^T A$ to get the derivatives w.r.t. $x$.

Now let $L(y)$ be a vector function, i.e. we output a vector instead of a scalar from our loss function. Then $v$ is the seed to this process. Let's assume that $v = e_i$, one of the basis vectors. Then

\[ v^T J = e_i^T J \]

computes a row of the Jacobian. Therefore, if we had a vector function $y=f(x)$, the pullback $\mathcal{B}_f^x(e_i)$ is the $i$th row of the Jacobian $f'(x)$. Concatenating these rows is thus a way to build the full Jacobian. The gradient is a special case where $y$ is a scalar: the resulting Jacobian is just a single row, so we set the seed equal to $1$ to compute the gradient.
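
As a sketch of this row-by-row construction, assuming Zygote.jl's pullback API, each basis-vector seed $e_i$ pulled back through the program gives one row of the Jacobian:

using Zygote

f(x) = [x[1]^2 + x[2], 3x[1] * x[2], sin(x[2])]     # f: R^2 -> R^3, an illustrative toy
basis(i, m) = (e = zeros(m); e[i] = 1.0; e)

x = [1.0, 2.0]
y, back = Zygote.pullback(f, x)
rows = [back(basis(i, length(y)))[1] for i in 1:length(y)]   # one vjp per output
J = permutedims(hcat(rows...))                               # 3 × 2 Jacobian built row-by-row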

Multi-Seeding

Similarly to forward-mode having a dual number with multiple simultaneous derivatives through partials $d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m$, multi-seeding is an option in reverse-mode AD by pulling back a matrix instead of a row vector, where each row is a direction. Thus the matrix $A = [v_1 v_2 \ldots v_n]^T$ evaluated as $\mathcal{B}_f^x(A)$ is the equivalent operation to the forward-mode $f(d)$ for generalized multivariate multiseeded reverse-mode automatic differentiation. One should take care to recognize the Jacobian as a generalized linear operator in this case and ensure that the shapes in the program correctly handle this storage of the reverse seed. When linear, this will automatically make use of BLAS3 operations, making it an efficient form for neural networks.

Sparse Reverse Mode AD

Since the Jacobian is built row-by-row with reverse mode AD, the sparse differentiation discussion from forward-mode AD applies similarly but to the transpose. Therefore, in order to perform sparse reverse mode automatic differentiation, one would build up a connectivity graph of the rows, and perform a coloring algorithm on this graph. The seeds of the reverse call, $v_i$, would then be the color vectors, which compute compressed rows that are then decompressed similarly to the forward-mode case.

Forward Mode vs Reverse Mode

Notice that a pullback of a single scalar gives the gradient of a function, while the pushforward of a dual using forward-mode gives a directional derivative. Forward mode computes columns of a Jacobian, while reverse mode computes gradients (rows of a Jacobian). Therefore, the relative efficiency of the two approaches is based on the size of the Jacobian. If $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$, then the Jacobian is of size $m \times n$. If $m$ is much smaller than $n$, then computing by each row will be faster, and thus use reverse mode. In the case of a gradient, $m=1$ while $n$ can be large, leading to this phenomenon. Likewise, if $n$ is much smaller than $m$, then computing by each column will be faster. We will see shortly that reverse-mode AD has a high overhead with respect to forward mode, and thus if the two dimensions are comparable (or $n$ and $m$ are small), forward mode is more efficient.

However, since optimization needs gradients, reverse-mode definitely has a place in the standard toolchain which is why backpropagation is so central to machine learning.

Side Note on Mixed Mode

Interestingly, one can find cases where mixing the forward and reverse mode results would give an asymptotically better result. For example, if a Jacobian was non-zero in only the first 3 rows and first 3 columns, then sparse forward mode would still require N partials and reverse mode would require M seeds. However, one forward mode call of 3 partials and one reverse mode call of 3 seeds would calculate all three rows and columns with $\mathcal{O}(1)$ work, as opposed to $\mathcal{O}(N)$ or $\mathcal{O}(M)$. Exactly how to make use of this insight in an automated manner is an open research question.

Forward-Over-Reverse and Hessian-Free Products

Using this knowledge, we can also develop quick ways for computing the Hessian. Recall from earlier in the discussion that Hessians are the Jacobian of the gradient. So let's say for a scalar function $f$ we want to compute the Hessian. To compute the gradient, we use the reverse-mode AD pullback $\nabla f(x) = \mathcal{B}_f^x(1)$. Recall that the pullback is a function of $x$ since that is the value at which the values from the forward pass are taken. Then since the Jacobian of the gradient vector is $n \times n$ (as many terms in the gradient as there are inputs!), it holds that we want to use forward-mode AD for this Jacobian. Therefore, using the dual number $x = x_0 + e_1 \epsilon_1 + \ldots + e_n \epsilon_n$, the reverse-mode gradient function computes the full Hessian in one forward pass. What this amounts to is pushing the dual-number forward sensitivities through when building the pullback, and then when doing the pullback, the dual portions hold the vectors for the columns of the Hessian.

Similarly, Hessian-vector products without computing the Hessian can be computed using the Jacobian-vector product trick on the function defined by the gradient. Here, $Hv$ is equivalent to the dual part of

\[ \nabla f(x+v\epsilon) = \mathcal{B}_f^{x+v\epsilon}(1) \]

This means that our Newton method for optimization:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

can be treated similarly to that for the nonlinear solving problem, where the linear system can be solved using Hessian-free vector products to build a Krylov subspace, giving rise to the Hessian-free Newton Krylov method for optimization.
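
A minimal forward-over-reverse sketch of the Hessian-vector product above, assuming ForwardDiff.jl for the outer directional derivative and Zygote.jl for the inner gradient (whether these particular packages compose this way can depend on versions):

using ForwardDiff, Zygote

# Hv = d/dt ∇f(x + t*v) at t = 0: push a dual perturbation through the reverse-mode gradient.
hvp(f, x, v) = ForwardDiff.derivative(t -> Zygote.gradient(f, x .+ t .* v)[1], 0.0)

f(x) = x[1]^2 * x[2] + exp(x[2])
x, v = [1.0, 2.0], [0.0, 1.0]
hvp(f, x, v)          # equals H(x) * v without ever forming the full Hessian

Feeding such products into a Krylov method on the Newton linear system is the Hessian-free Newton Krylov approach described above.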

References

We thank Roger Grosse's lecture notes for the amazing tikz graphs.

\ No newline at end of file diff --git a/_weave/lecture11/adjoints/index.html b/_weave/lecture11/adjoints/index.html index 98f1b622..2977edbe 100644 --- a/_weave/lecture11/adjoints/index.html +++ b/_weave/lecture11/adjoints/index.html @@ -33,4 +33,4 @@

Differentiable Programming and Neural Differential Equations

y = meanpool(data(x), pdims; kw...)
y, Δ -> (nobacksies(:meanpool, NNlib.∇meanpool(data.((Δ, y, x))..., pdims; kw...)), nothing)
end

where the derivative makes use of not only x, but also y so that the meanpool does not need to be re-calculated.

Using this style, Tracker.jl moves forward, building up the value and closures for the backpass and then recursively pulls back the input Δ to receive the derivative.
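
A heavily simplified sketch of this closure-based tracing style (not Tracker.jl's actual implementation) is below: every operation records its value together with a closure that knows how to pull a sensitivity Δ back to its parents.

# Not Tracker.jl's implementation, just a sketch of value-plus-closure tracing.
mutable struct Tracked
    value::Float64
    grad::Float64
    back::Function                    # closure that propagates Δ to the parents
end
Tracked(x::Real) = Tracked(float(x), 0.0, Δ -> nothing)

function Base.:*(a::Tracked, b::Tracked)
    out = Tracked(a.value * b.value)
    out.back = Δ -> (accum!(a, Δ * b.value); accum!(b, Δ * a.value))
    return out
end
function Base.:+(a::Tracked, b::Tracked)
    out = Tracked(a.value + b.value)
    out.back = Δ -> (accum!(a, Δ); accum!(b, Δ))
    return out
end
accum!(t::Tracked, Δ) = (t.grad += Δ; t.back(Δ))   # recursively pull Δ back through the tape

x, y = Tracked(3.0), Tracked(2.0)
z = x * y + x                   # forward pass builds the closures
accum!(z, 1.0)                  # backward pass, seeded with 1
x.grad, y.grad                  # (3.0, 3.0): dz/dx = y + 1, dz/dy = x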

Source-to-Source AD

Given our previous discussions on performance, you should be horrified with how this approach handles scalar values. Each TrackedReal holds a Tracked{T} which holds a Call, not a Call{F,As<:Tuple}, and thus it's not strictly typed. Because it's not strictly typed, this implies that every single operation is going to cause heap allocations. If you measure this in PyTorch, TensorFlow Eager, Tracker, etc. you get around 500ns-2ms of overhead. This means that a 2ns + operation becomes... >500ns! Oh my!

This is not the only issue with tracing. Another issue is that the trace is value-dependent, meaning that every new value can build a new trace. Thus one cannot easily JIT compile a trace because it'll be different for every gradient calculation (you can compile it, but you better make sure the compile times are short!). Lastly, the Wengert list can be much larger than the code itself. For example, if you trace through a loop that is for i in 1:100000, then the trace will be huge, even if the function is relatively simple. This is directly demonstrated in the JAX "how it works" slide:

To avoid these issues, another version of reverse-mode automatic differentiation is source-to-source transformations. In order to do source code transformations, you need to know how to transform all language constructs via the reverse pass. This can be quite difficult (what is the "adjoint" of lock?), but when worked out this has a few benefits. First of all, you do not have to track values, meaning stack-allocated values can stay on the stack. Additionally, you can JIT compile one backpass because you have a single function used for all backpasses. Lastly, you don't need to unroll your loops! Instead, with each branch you'd need to insert some data structure to recall the values used from the forward pass (in order to invert in the right directions). However, that can be much more lightweight than a tracking pass.

This can be a difficult problem to do on a general programming language. In general it needs a strong programmatic representation to use as a compute graph. Google's engineers did an analysis when choosing Swift for TensorFlow and narrowed it down to either Swift or Julia due to their internal graph structures. Thus, it should be no surprise that the modern source-to-source AD systems are Zygote.jl for Julia, and Swift for TensorFlow in Swift. Additionally, older AD systems, like Tapenade, ADIFOR, and TAF, all for Fortran, were source-to-source AD systems.

Derivation of Reverse Mode Rules: Adjoints and Implicit Function Theorem

In order to require the least amount of work from our AD system, we need to be able to derive the adjoint rules at the highest level possible. Here are a few well-known cases to start understanding. These next examples are from Steven Johnson's resource.

Adjoint of Linear Solve

Let's say we have the function $A(p)x=b(p)$, i.e. this is the function that is given by the linear solving process, and we want to calculate the gradients of a cost function $g(x,p)$. To evaluate the gradient directly, we'd calculate:

\[ \frac{dg}{dp} = g_p + g_x x_p \]

where $x_p$ is the derivative of each value of $x$ with respect to each parameter $p$, and thus it's an $M \times P$ matrix (a Jacobian). Since $g$ is a small cost function, $g_p$ and $g_x$ are easy to compute, but $x_p$ is given by:

\[ x_{p_i} = A^{-1}(b_{p_i}-A_{p_i}x) \]

and so this is $P$ $M \times M$ linear solves, which is expensive! However, if we multiply by

\[ \lambda^{T} = g_x A^{-1} \]

then we obtain

\[ \frac{dg}{dp}\vert_{f=0} = g_p - \lambda^T f_p = g_p - \lambda^T (A_p x - b_p) \]

which is an alternative formulation of the derivative at the solution value, where $f(x,p) = A(p)x - b(p)$. The benefit is that $\lambda$ requires only the single linear solve $A^{T}\lambda = g_x^{T}$, so the $P$ linear solves needed to form $x_p$ are replaced by one extra solve of the same size.
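
A small numerical sketch of this trade-off on a hypothetical toy problem, where $A(p)$ and $b(p)$ are affine in $p$ and the cost is $g(x,p)=\sum_i x_i$ (so $g_x$ is a row of ones and $g_p = 0$):

using LinearAlgebra, Random

Random.seed!(1)
M, P = 4, 3
A0, As = randn(M, M) + 5I, [randn(M, M) for _ in 1:P]      # A(p) = A0 + Σ p[i]*As[i]
b0, bs = randn(M), [randn(M) for _ in 1:P]                  # b(p) = b0 + Σ p[i]*bs[i]
p = randn(P)

A = A0 + sum(p[i] * As[i] for i in 1:P)
b = b0 + sum(p[i] * bs[i] for i in 1:P)
x = A \ b                                      # forward solve

# Direct approach: P linear solves to form x_p, then dg/dp = g_x * x_p
dgdp_direct = [sum(A \ (bs[i] - As[i] * x)) for i in 1:P]

# Adjoint approach: one solve Aᵀλ = g_xᵀ, then dg/dp = -λᵀ (A_p x - b_p)
λ = A' \ ones(M)
dgdp_adjoint = [-dot(λ, As[i] * x - bs[i]) for i in 1:P]

dgdp_direct ≈ dgdp_adjoint                     # the two formulations agree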

Adjoint of Nonlinear Solve

Now let's look at some $f(x,p)=0$ nonlinear solving. Differentiating by $p$ gives us:

\[ f_x x_p + f_p = 0 \]

and thus $x_p = -f_x^{-1}f_p$. Therefore, using our cost function we write:

\[ \frac{dg}{dp} = g_p + g_x x_p = g_p - g_x \left(f_x^{-1} f_p \right) \]

or

\[ \frac{dg}{dp} = g_p - \left(g_x f_x^{-1} \right) f_p \]

Since $g_x$ is $1 \times M$, $f_x^{-1}$ is $M \times M$, and $f_p$ is $M \times P$, this grouping changes the order of operations and avoids ever forming the $M \times P$ matrix $x_p$.

As is normal with backpasses, we solve for $x$ through the forward pass however we like, and then for the backpass solve for

\[ f_x^T \lambda = g_x^T \]

to obtain

\[ \frac{dg}{dp}\vert_{f=0} = g_p - \lambda^T f_p \]

which does the calculation without ever building the $M \times P$ Jacobian $x_p$.
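
The same recipe can be sketched numerically on a hypothetical toy rootfinding problem, with ForwardDiff.jl supplying the small Jacobians and a hand-written Newton iteration standing in for whatever forward-pass solver is used:

using LinearAlgebra, ForwardDiff

f(x, p) = x .^ 3 .+ p .- 1           # solve f(x, p) = 0, so x = (1 - p)^(1/3) componentwise
g(x, p) = sum(x)                     # scalar cost on the solution

function newton_solve(p; iters = 50)
    x = ones(length(p))
    for _ in 1:iters
        x -= ForwardDiff.jacobian(x -> f(x, p), x) \ f(x, p)
    end
    return x
end

p = [0.3, 0.7]
x = newton_solve(p)                  # forward pass

fx = ForwardDiff.jacobian(x -> f(x, p), x)
fp = ForwardDiff.jacobian(p -> f(x, p), p)
gx = ForwardDiff.gradient(x -> g(x, p), x)
gp = ForwardDiff.gradient(p -> g(x, p), p)

λ    = fx' \ gx                      # the one extra linear solve f_xᵀ λ = g_xᵀ
dgdp = gp' .- λ' * fp                # 1 × P row of sensitivities, no M × P x_p ever formed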

Adjoint of Ordinary Differential Equations

We wish to solve for some cost function $G(u,p)$ evaluated throughout the differential equation, i.e.:

\[ G(u,p) = G(u(p)) = \int_{t_0}^T g(u(t,p))dt \]

To derive this adjoint, introduce the Lagrange multiplier $\lambda$ to form:

\[ I(p) = G(p) - \int_{t_0}^T \lambda^\ast (u^\prime - f(u,p,t))dt \]

Since $u^\prime = f(u,p,t)$, this is the mathematician's trick of adding zero, so then we have that

\[ \frac{dG}{dp} = \frac{dI}{dp} = \int_{t_0}^T (g_p + g_u s)dt - \int_{t_0}^T \lambda^\ast (s^\prime - f_u s - f_p)dt \]

for $s$ being the sensitivity, $s = \frac{du}{dp}$. After applying integration by parts to $\lambda^\ast s^\prime$, we get that:

\[ \int_{t_{0}}^{T}\lambda^{\ast}\left(s^{\prime}-f_{u}s-f_{p}\right)dt =\int_{t_{0}}^{T}\lambda^{\ast}s^{\prime}dt-\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

\[ =|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}-\int_{t_{0}}^{T}\lambda^{\ast\prime}sdt-\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

To see where we ended up, let's re-arrange the full expression now:

\[ \frac{dG}{dp} =\int_{t_{0}}^{T}(g_{p}+g_{u}s)dt-|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}+\int_{t_{0}}^{T}\lambda^{\ast\prime}sdt+\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

\[ =\int_{t_{0}}^{T}(g_{p}+\lambda^{\ast}f_{p})dt-|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}+\int_{t_{0}}^{T}\left(\lambda^{\ast\prime}+\lambda^\ast f_{u}+g_{u}\right)sdt \]

That was just a re-arrangement. Now, let's require that

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda - \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

This makes the integral term in $s$ vanish identically, and since $\lambda(T)=0$ the boundary term from the integration by parts reduces to $\lambda^\ast(t_0)s(t_0)$. Thus, if $\lambda$ satisfies that equation, then we get:

\[ \frac{dG}{dp} = \lambda^\ast(t_0)s(t_0) + \int_{t_0}^T \left(g_p + \lambda^\ast f_p \right)dt \]

which gives us our adjoint derivative relation. Here $s(t_0) = \frac{du}{dp}(t_0)$, which is zero whenever the initial condition does not depend on $p$.

If $G$ is discrete, then it can be represented via the Dirac delta:

\[ G(u,p) = \int_{t_0}^T \sum_{i=1}^N \Vert d_i - u(t_i,p)\Vert^2 \delta(t_i - t)dt \]

in which case

\[ g_u(t_i) = 2(u(t_i,p) - d_i) \]

at the data points $(t_i,d_i)$. Therefore, the derivative of a cost function of the ODE solution with respect to the parameters is given by solving for $\lambda^\ast$ using an ODE for $\lambda$ in reverse time, and then using that to calculate $\frac{dG}{dp}$. Note that $\frac{dG}{dp}$ can be calculated simultaneously by appending the quadrature to the reverse ODE, i.e. defining extra ODE terms as $g_p + \lambda^\ast f_p$, which then compute the integral on the fly (ODE integration is just... integration!).

Complexities of Implementing ODE Adjoints

The image below explains the dilemma:

Essentially, the whole problem is that we need to solve the ODE

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda - \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

in reverse, but $\frac{df}{du}$ is defined by $u(t)$, which is a value only computed in the forward pass (the forward pass is embedded within the backpass!). Thus we need to be able to retrieve the value of $u(t)$ to get the Jacobian on-demand. There are three ways in which this can be done:

  1. If you solve the reverse ODE $u^\prime = f(u,p,t)$ backwards in time, mathematically it'll give equivalent values. Computation-wise, this means that you can append $u(t)$ to $\lambda(t)$ (and to the $\frac{dG}{dp}$ quadrature) to calculate all terms at the same time with a single reverse pass ODE. However, numerically this is unstable and thus not always recommended (ODEs are reversible, but ODE solver methods are not necessarily going to generate the same exact values or trajectories in reverse!)

  2. If you solve the forward ODE and receive a continuous solution $u(t)$, you can interpolate it to retrieve the values at any time the reverse pass needs the $\frac{df}{du}$ Jacobian. This is fast but memory-intensive.

  3. Every time you need a value $u(t)$ during the backpass, you re-solve the forward ODE to $u(t)$. This is expensive! Thus one can instead use checkpoints, i.e. save at finitely many time points during the forward pass, and use those as starting points for the $u(t)$ calculation.

Alternative strategies can be investigated, such as an interpolation which stores values in a compressed form.
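
In the SciML ecosystem these strategies are exposed through the sensealg keyword. A minimal sketch, assuming the OrdinaryDiffEq / SciMLSensitivity (formerly DiffEqSensitivity) / Zygote stack, where BacksolveAdjoint, InterpolatingAdjoint, and the checkpointed variants correspond roughly to the three options above (exact APIs vary across package versions):

using OrdinaryDiffEq, SciMLSensitivity, Zygote

function lotka!(du, u, p, t)
    du[1] =  p[1] * u[1] - p[2] * u[1] * u[2]
    du[2] = -p[3] * u[2] + p[4] * u[1] * u[2]
end

u0, p = [1.0, 1.0], [1.5, 1.0, 3.0, 1.0]
prob = ODEProblem(lotka!, u0, (0.0, 10.0), p)

# G(p): a discrete cost summed over the save points of the solution
function G(p; sensealg = InterpolatingAdjoint())     # or BacksolveAdjoint(), QuadratureAdjoint()
    sol = solve(prob, Tsit5(), p = p, saveat = 0.1, sensealg = sensealg)
    return sum(Array(sol))
end

dGdp = Zygote.gradient(G, p)[1]      # the reverse pass solves the λ ODE backwards in time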

The vjp and Neural Ordinary Differential Equations

It is here that we can note that, if $f$ is a function defined by a neural network, we arrive at the neural ordinary differential equation. This adjoint method is thus the backpropagation method for the neural ODE. However, the backpass

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda - \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

can be improved by noticing $\frac{df}{du}^\ast \lambda$ is a vjp, and thus it can be calculated using $\mathcal{B}_f^{u(t)}(\lambda^\ast)$, i.e. reverse-mode AD on the function $f$. If $f$ is a neural network, this means that the reverse ODE is defined through successive backpropagation passes of that neural network. The result is the derivative of the cost function with respect to the parameters defining $f$ (either a model or a neural network), which can then be used to fit the data ("train").

Alternative "Training" Strategies

Those are the "brute force" training methods which simply use $u(t,p)$ evaluations to calculate the cost. However, it is worth noting that there are a few better strategies that one can employ in the case of dynamical models.

Multiple Shooting Techniques

Instead of shooting just from the beginning, one can instead shoot from multiple points in time:

Of course, one won't know what the "initial condition in the future" is, but one can instead make that a parameter. By doing so, each interval can be solved independently, and one can then add to the cost function that the end of one interval must match up with the beginning of the other. This can make the integration more robust, since shooting with incorrect parameters over long time spans can give massive gradients which makes it hard to hone in on the correct values.

Collocation Methods

If the data is dense enough, one can fit a curve through the points, such as a spline:

If that's the case, one can use the fit spline in order to estimate the derivative at each point. Since the ODE is defined as $u^\prime = f(u,p,t)$, one can then use the cost function

\[ C(p) = \sum_{i=1}^N \Vert\tilde{u}^{\prime}(t_i) - f(\tilde{u}(t_i),p,t_i)\Vert \]

where $\tilde{u}^{\prime}(t_i)$ is the estimated derivative at the time point $t_i$. Then one can fit the parameters to ensure this holds. This method can be extremely fast since the ODE doesn't ever have to be solved! However, note that this is not able to compensate for error accumulation, and thus early errors are not accounted for in the later parts of the data. This means that if the data points are too far apart, the integration won't necessarily match the data even if this fit is "good", a problem that the shooting approaches do not have. Thus, this is usually done as part of a two-stage method, where the starting stage uses collocation to get initial parameters which is then completed with a shooting method.
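
A dependency-free sketch of the collocation idea on a hypothetical toy model, using central differences in place of a spline for the derivative estimate:

# Toy model u' = -p*u with true p = 1.3; "data" comes from the exact solution.
f(u, p, t) = -p * u
ts = collect(0.0:0.1:2.0)
us = exp.(-1.3 .* ts)

# central-difference estimate of ũ'(tᵢ) at the interior points (spline stand-in)
dus = [(us[i+1] - us[i-1]) / (ts[i+1] - ts[i-1]) for i in 2:length(ts)-1]
uin = us[2:end-1]

# C(p) = Σᵢ (ũ'(tᵢ) - f(ũ(tᵢ), p, tᵢ))² = Σᵢ (dusᵢ + p*uinᵢ)² is quadratic in p,
# so the collocation fit is a one-line least-squares solution with no ODE solves:
p_est = -sum(uin .* dus) / sum(abs2, uin)      # ≈ 1.3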

\ No newline at end of file +

\ No newline at end of file diff --git a/_weave/lecture13/gpus/index.html b/_weave/lecture13/gpus/index.html index d7047909..7a60d107 100644 --- a/_weave/lecture13/gpus/index.html +++ b/_weave/lecture13/gpus/index.html @@ -127,4 +127,4 @@

GPU programming

Valentin Churavy
October 30th

c = vifelse(mask, b, a) # merge results
vstore(c, A, i)
end

GPU (implicit vectorized)

Instead of using explicit vectorization, GPUs change the programming model so that the programmer writes a kernel which operates over each element of the data. In effect the programmer is writing a program that is executed for each vector lane. It is important to remember that the hardware itself still operates on vectors (CUDA calls this warp-size and it is 32 elements).
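
A minimal kernel in this style, assuming CUDA.jl (the kernel and array names are illustrative): each thread handles one element, while the hardware still executes threads in 32-wide warps.

using CUDA

function vadd_kernel!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # this thread's element index
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a, b = CUDA.rand(1024), CUDA.rand(1024)
c = similar(a)
@cuda threads=256 blocks=cld(length(c), 256) vadd_kernel!(c, a, b)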

At this point please refer to the lecture slides

\ No newline at end of file +

\ No newline at end of file diff --git a/_weave/lecture14/pdes_and_convolutions/index.html b/_weave/lecture14/pdes_and_convolutions/index.html index 94d27c8a..b65b2243 100644 --- a/_weave/lecture14/pdes_and_convolutions/index.html +++ b/_weave/lecture14/pdes_and_convolutions/index.html @@ -125,6 +125,6 @@

What SciComp can learn from ML: Moderate Generalizations to Partial Differen \ No newline at end of file diff --git a/_weave/lecture15/diffeq_machine_learning/index.html b/_weave/lecture15/diffeq_machine_learning/index.html index ffae8fd9..931438e9 100644 --- a/_weave/lecture15/diffeq_machine_learning/index.html +++ b/_weave/lecture15/diffeq_machine_learning/index.html @@ -44,4 +44,4 @@

Mixing Differential Equations and Neural Networks for Physics-I cb() Flux.train!(loss_adjoint, ps, data, opt, cb = cb) -

DiffEqFlux.jl supports the wide gamut of possible universal differential equations with combinations of stiffness, delays, stochasticity, etc. It does so by using Julia's language-wide AD tooling, such as ReverseDiff.jl, Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

Many of the methods below can be encapsulated as a choice of a universal differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl.

Deep BSDE Methods for High Dimensional Partial Differential Equations

The key paper on deep BSDE methods is this article from PNAS by Jiequn Han, Arnulf Jentzen, and Weinan E. Follow up papers like this one have identified a larger context in the sense of forward-backwards SDEs for a large class of partial differential equations.

Understanding the Setup for Terminal PDEs

While this setup may seem a bit contrived given the "very specific" partial differential equation form (you know the end value? You have some parabolic form?), it turns out that there is a large class of problems in economics and finance that satisfy this form. The reason is that in these problems you may know the value of something at the end, when you're going to sell it, and you want to evaluate it right now. The classic example is in options pricing. An option is a contract to be able to sell a stock at a given value. The simplest case is a contract that can only be executed at a pre-determined time in the future. Let's say we have an option to sell a stock at 100 no matter what. This means that, if the stock at the strike time (the time the option can be sold) is 70, we will make 30 from this option, and thus the option itself is worth 30. The question is, if I have this option today, the strike time is 3 months in the future, and the stock price is currently 70, how much should I value the option today?

To solve this, we need to put a model on how we think the stock price will evolve. One simple version is a linear stochastic differential equation, i.e. the stock price will evolve with a constant interest rate $r$ with some volatility (randomness) $\sigma$, in which case:

\[ dX_t = r X_t dt + \sigma X_t dW_t. \]

From this model, we can evaluate the probability that the stock is going to be at given values, which then gives us the probability that the option is worth a given value, which then gives us the expected (or average) value of the option. This is the Black-Scholes problem. However, a more direct way of calculating this result is writing down a partial differential equation for the evolution of the value of the option $V$ as a function of time $t$ and the current stock price $x$. At the final time point, if we know the stock price then we know the value of the option, and thus we have a terminal condition $V(T,x) = g(x)$ for some known value function $g(x)$. The question is, given this value at time $T$, what is the value of the option at time $t=0$ given that the stock currently has a value $x = \zeta$. Why is this interesting? This will tell you what you think the option is currently valued at, and thus if it's cheaper than that, you can gain money by buying the option right now! This means that the "solution" to the PDE is the value $V(0,\zeta)$, where we know the final points $V(T,x) = g(x)$. This is precisely the type of problem that is solved by the deep BSDE method.
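
The expectation view in the paragraph above can be sketched directly by Monte Carlo simulation of the linear SDE, using the numbers from the example (a put struck at 100, stock at 70, 3 months to expiry; the rate and volatility values are illustrative):

using Random, Statistics

Random.seed!(0)
r, σ, K, X0, T = 0.02, 0.3, 100.0, 70.0, 0.25      # rate, volatility, strike, price, 3 months
npaths, nsteps = 10_000, 100
dt = T / nsteps

function terminal_price(X0, r, σ, dt, nsteps)
    X = X0
    for _ in 1:nsteps
        X += r * X * dt + σ * X * sqrt(dt) * randn()   # Euler–Maruyama step of dX = rX dt + σX dW
    end
    return X
end

payoffs = [max(K - terminal_price(X0, r, σ, dt, nsteps), 0.0) for _ in 1:npaths]
value = exp(-r * T) * mean(payoffs)     # discounted expected payoff ≈ option value today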

The Deep BSDE Method

Consider the class of semilinear parabolic PDEs, in finite time $t\in[0, T]$ and $d$-dimensional space $x\in\mathbb R^d$, that have the form

\[ \begin{align} \frac{\partial u}{\partial t}(t,x) &+\frac{1}{2}\text{trace}\left(\sigma\sigma^{T}(t,x)\left(\text{Hess}_{x}u\right)(t,x)\right)\\ &+\nabla u(t,x)\cdot\mu(t,x) \\ &+f\left(t,x,u(t,x),\sigma^{T}(t,x)\nabla u(t,x)\right)=0,\end{align} \]

with a terminal condition $u(T,x)=g(x)$. In this equation, $\text{trace}$ is the trace of a matrix, $\sigma^T$ is the transpose of $\sigma$, $\nabla u$ is the gradient of $u$, and $\text{Hess}_x u$ is the Hessian of $u$ with respect to $x$. Furthermore, $\mu$ is a vector-valued function, $\sigma$ is a $d \times d$ matrix-valued function and $f$ is a nonlinear function. We assume that $\mu$, $\sigma$, and $f$ are known. We wish to find the solution at initial time, $t=0$, at some starting point, $x = \zeta$.

Let $W_{t}$ be a Brownian motion and take $X_t$ to be the solution to the stochastic differential equation

\[ dX_t = \mu(t,X_t) dt + \sigma (t,X_t) dW_t \]

with initial condition $X(0)=\zeta$. Previous work has shown that the solution satisfies the following BSDE:

\[ \begin{align} u(t, &X_t) - u(0,\zeta) = \\ & -\int_0^t f(s,X_s,u(s,X_s),\sigma^T(s,X_s)\nabla u(s,X_s)) ds \\ & + \int_0^t \left[\nabla u(s,X_s) \right]^T \sigma (s,X_s) dW_s,\end{align} \]

with terminating condition $u(T, X_T) = g(X_T)$.

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochastic differential equation system, one arrives at a recurrent neural network:

Deep BSDE

Julia Implementation

A Julia implementation for the deep BSDE method can be found at NeuralPDE.jl. The examples considered below are part of the standard test suite.

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that are solved, given our setup, are Black-Scholes types of equations. Unlike a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional PDE solvers would need around $N^{1000}$ points evolving over time in order to arrive at the solution, which is completely impractical.

One example of a nonlinear Black-Scholes equation in this form is the Black-Scholes equation with default risk. Here we are adding to the standard model the idea that the companies whose stocks we are buying can default, and thus our valuation has to take this default probability into account, since the option would then become worthless. The PDE that is arrived at is:

\[ \frac{\partial u}{\partial t}(t,x) + \bar{\mu}\cdot \nabla u(t, x) + \frac{\bar{\sigma}^{2}}{2} \sum_{i=1}^{d} \left |x_{i} \right |^{2} \frac{\partial^2 u}{\partial {x_{i}}^2}(t,x) \\ - (1 -\delta )Q(u(t,x))u(t,x) - Ru(t,x) = 0 \]

with terminating condition $g(x) = \min_{i} x_i$ for $x = (x_{1}, \ldots, x_{100}) \in \mathbb{R}^{100}$, where $\delta \in [0, 1)$, $R$ is the interest rate of the risk-free asset, and $Q$ is a piecewise linear function of the current value with three regions $(v^{h} < v^{l}, \gamma^{h} > \gamma^{l})$,

\[ \begin{align} Q(y) &= \mathbb{1}_{(-\infty,\upsilon^{h})}(y)\gamma ^{h} + \mathbb{1}_{[\upsilon^{l},\infty)}(y)\gamma ^{l} \\ &+ \mathbb{1}_{[\upsilon^{h},\upsilon^{l}]}(y) \left[ \frac{(\gamma ^{h} - \gamma ^{l})}{(\upsilon ^{h}- \upsilon ^{l})} (y - \upsilon ^{h}) + \gamma ^{h} \right ]. \end{align} \]

This PDE can be cast into the form of the deep BSDE method by setting:

\[ \begin{align} \mu &= \overline{\mu} X_{t} \\ \sigma &= \overline{\sigma} \text{diag}(X_{t}) \\ f &= -(1 -\delta )Q(u(t,x))u(t,x) - R u(t,x) \end{align} \]

The Julia code for this exact problem in 100 dimensions can be found here

Stochastic Optimal Control as a Deep BSDE Application

Another type of problem that fits into this terminal PDE form is the stochastic optimal control problem. The problem is a generalized context to what motivated us before. In this case, there are a set of agents which undergo some known stochastic model. What we want to do is apply some control (push them in some direction) at every single timepoint towards some goal. For example, we have the physics for the dynamics of drone flight, but there's randomness in the wind condition, and so we want to control the engine speeds to move in a certain direction. However, there is a cost associated with controlling, and thus the question is how to best balance the use of controls with the natural stochastic evolution.

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try and propagate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how should we control right now given the current system is in the state $x = \zeta$. It turns out that the solution of $u(t,x)$ where $u(T,x)=g(x)$ and we want to find $u(0,\zeta)$ is given by a partial differential equation which is known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs that is representable by the deep BSDE method.

Take the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions

\[ dX_t = 2\sqrt{\lambda} c_t dt + \sqrt{2} dW_t \]

with $t\in [0,T]$, $X_0 = x$, and with a cost function

\[ C(c_t) = \mathbb{E}\left[\int_0^T \Vert c_t \Vert^2 dt + g(X_T) \right] \]

where $X_t$ is the state we wish to control, $\lambda$ is the strength of the control, and $c_t$ is the control process. To minimize the cost, one solves the Hamilton–Jacobi–Bellman equation:

\[ \frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda \Vert \nabla u(t,x) \Vert^2 = 0 \]

has a solution $u(t,x)$ which at $t=0$ represents the optimal cost of starting from $x$.

This PDE can be rewritten into the canonical form of the deep BSDE method by setting:

\[ \begin{align} \mu &= 0, \\ \sigma &= \overline{\sigma} I, \\ f &= -\lambda \left \| \sigma^T(s,X_s)\nabla u(s,X_s) \right \|^{2}, \end{align} \]

where $\overline{\sigma} = \sqrt{2}$, $T = 1$, and $X_0 = (0,\ldots,0) \in \mathbb{R}^{100}$.

The Julia code for solving this exact problem in 100 dimensions can be found here

Connections of Reservoir Computing to Scientific Machine Learning

Reservoir computing techniques are an alternative to the "full" neural network techniques we have previously discussed. However, the process of training neural networks has a few caveats which can cause difficulties in real systems:

  1. The tangent space diverges exponentially fast when the system is chaotic, meaning that results of both forward and reverse automatic differentiation techniques (and the related adjoints) are divergent on these kinds of systems.

  2. It is hard for neural networks to represent stiff systems. There are many reasons for this, one being that neural networks tend to drop high frequency behavior.

There are ways being investigated to alleviate these issues. For example, shadow adjoints can give a non-divergent average sense of a derivative on ergodic chaotic systems, but this is significantly more expensive than the traditional adjoint.

To get around these caveats, some research teams have investigated alternatives which do not require gradient-based optimization. The clear frontrunner in this field is a type of architecture called echo state networks. A simplified formulation of an echo state network essentially fixes a neural network that defines a reservoir, i.e.

\[ x_{n+1} = \sigma(W x_n + W_{fb} y_n) \]

\[ y_n = g(W_{out} x_n) \]

where $W$ and $W_{fb}$ are fixed random matrices that are chosen before the training process, $x_n$ is called the reservoir state, and $y_n$ is the output state for the observables. The idea is to find a projection $W_{out}$ from the high dimensional random reservoir $x$ to model the timeseries by $y$. If the reservoir is big enough and nonlinear enough, there should in theory exist a projection from that random system that matches any potential timeseries. Indeed, one can prove that echo state networks are universal adaptive filters under certain conditions.

If $g$ is invertible (and in many cases $g$ is taken to be the identity), then one can directly apply the inversion of $g$ to the data. This turns the training of $W_{out}$, the only non-fixed portion, into a standard least squares regression between the reservoir and the observation series. This is then solved by classical means like SVD factorizations which can be stable in ill-conditioned cases.
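
A minimal echo state network sketch (the sizes, scalings, and toy timeseries are illustrative, not tuned values): drive a fixed random reservoir with the data and solve a single least-squares problem for the readout.

using LinearAlgebra, Random

Random.seed!(0)
N, T = 200, 500                        # reservoir size, series length
W    = 0.05 * randn(N, N)              # fixed random reservoir (scaled so the spectral radius < 1)
W_fb = 0.5  * randn(N, 1)              # fixed random feedback matrix

y = [sin(0.1n) for n in 1:T]           # toy timeseries to fit

function drive_reservoir(W, W_fb, y, N)
    X = zeros(N, length(y))
    x = zeros(N)
    for n in 1:length(y)-1
        x = tanh.(W * x + W_fb * [y[n]])    # teacher-forced reservoir update
        X[:, n+1] = x
    end
    return X
end

X = drive_reservoir(W, W_fb, y, N)
W_out = (X' \ y)'                      # training = one least-squares solve for the readout
ŷ = vec(W_out * X)                     # readout reproduces the driven series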

Echo state networks have been shown to accurately reproduce chaotic attractors that RNNs are notoriously hard to train against. A demonstration via ReservoirComputing.jl clearly highlights this prediction ability:

However, this methodology still is not tailored to the continuous nature of dynamical systems found in scientific computing. Recent work has extended this methodology to allow for a continuous reservoir, i.e. a continuous-time echo state network. It is shown that using the adaptive points of a stiff ODE integrator gives a non-uniform sampling in time that makes it easier to learn stiff equations from less training points, and demonstrates the ability to learn equations where standard physics-informed neural network (PINN) training techniques fail.

This area of research is still far less developed than PINNs and neural differential equations but shows promise to more easily learn highly stiff and chaotic systems which are seemingly out of reach for these other methods.

Automated Equation Discovery: Outputting LaTeX for Dynamical Systems from Data

The SINDy algorithm enables data-driven discovery of governing equations from data. It leverages the fact that most physical systems have only a few relevant terms that define the dynamics, making the governing equations sparse in a high-dimensional nonlinear function space. Given a set of observations

\[ \begin{array}{c} \mathbf{X}=\left[\begin{array}{c} \mathbf{x}^{T}\left(t_{1}\right) \\ \mathbf{x}^{T}\left(t_{2}\right) \\ \vdots \\ \mathbf{x}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} x_{1}\left(t_{1}\right) & x_{2}\left(t_{1}\right) & \cdots & x_{n}\left(t_{1}\right) \\ x_{1}\left(t_{2}\right) & x_{2}\left(t_{2}\right) & \cdots & x_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}\left(t_{m}\right) & x_{2}\left(t_{m}\right) & \cdots & x_{n}\left(t_{m}\right) \end{array}\right] \\ \end{array} \]

and a set of derivative observations

\[ \begin{array}{c} \dot{\mathbf{X}}=\left[\begin{array}{c} \dot{\mathbf{x}}^{T}\left(t_{1}\right) \\ \dot{\mathbf{x}}^{T}\left(t_{2}\right) \\ \vdots \\ \dot{\mathbf{x}}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} \dot{x}_{1}\left(t_{1}\right) & \dot{x}_{2}\left(t_{1}\right) & \cdots & \dot{x}_{n}\left(t_{1}\right) \\ \dot{x}_{1}\left(t_{2}\right) & \dot{x}_{2}\left(t_{2}\right) & \cdots & \dot{x}_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \dot{x}_{1}\left(t_{m}\right) & \dot{x}_{2}\left(t_{m}\right) & \cdots & \dot{x}_{n}\left(t_{m}\right) \end{array}\right] \end{array} \]

we can evaluate the observations in a basis $\Theta(X)$:

\[ \Theta(\mathbf{X})=\left[\begin{array}{llllllll} 1 & \mathbf{X} & \mathbf{X}^{P_{2}} & \mathbf{X}^{P_{3}} & \cdots & \sin (\mathbf{X}) & \cos (\mathbf{X}) & \cdots \end{array}\right] \]

where $X^{P_i}$ stands for all $P_i$th order polynomial terms. For example,

\[ \mathbf{X}^{P_{2}}=\left[\begin{array}{cccccc} x_{1}^{2}\left(t_{1}\right) & x_{1}\left(t_{1}\right) x_{2}\left(t_{1}\right) & \cdots & x_{2}^{2}\left(t_{1}\right) & \cdots & x_{n}^{2}\left(t_{1}\right) \\ x_{1}^{2}\left(t_{2}\right) & x_{1}\left(t_{2}\right) x_{2}\left(t_{2}\right) & \cdots & x_{2}^{2}\left(t_{2}\right) & \cdots & x_{n}^{2}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{1}^{2}\left(t_{m}\right) & x_{1}\left(t_{m}\right) x_{2}\left(t_{m}\right) & \cdots & x_{2}^{2}\left(t_{m}\right) & \cdots & x_{n}^{2}\left(t_{m}\right) \end{array}\right] \]

Using these matrices, SINDy finds this sparse basis $\mathbf{\Xi}$ over a given candidate library $\mathbf{\Theta}$ by solving the sparse regression problem $\dot{X} =\mathbf{\Theta}\mathbf{\Xi}$ with $L_1$ regularization, i.e. minimizing the objective function $\left\Vert \mathbf{\dot{X}} - \mathbf{\Theta}\mathbf{\Xi} \right\Vert_2 + \lambda \left\Vert \mathbf{\Xi}\right\Vert_1$. This method and other variants of SINDy, along with specialized optimizers for the LASSO $L_1$ optimization problem, have been implemented in packages like DataDrivenDiffEq.jl and pysindy. The result of these methods is LaTeX for the missing dynamical system.
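
A minimal sketch of the regression step on a hypothetical toy system, using sequential thresholded least squares (the original SINDy optimizer) in place of a full $L_1$/LASSO solver; the derivative observations are taken as known here, as discussed below:

using LinearAlgebra, Random

# True dynamics (to be rediscovered): ẋ₁ = -2x₁,  ẋ₂ = x₁² - x₂
Random.seed!(0)
m = 200
X  = randn(m, 2)                                     # sampled states
DX = hcat(-2 .* X[:, 1], X[:, 1] .^ 2 .- X[:, 2])    # derivative observations (assumed known)

# Candidate library Θ(X) = [1, x₁, x₂, x₁², x₁x₂, x₂²]
Θ = hcat(ones(m), X[:, 1], X[:, 2], X[:, 1] .^ 2, X[:, 1] .* X[:, 2], X[:, 2] .^ 2)

function stlsq(Θ, DX; λ = 0.1, iters = 10)
    Ξ = Θ \ DX                                       # initial least-squares fit
    for _ in 1:iters
        Ξ[abs.(Ξ) .< λ] .= 0                         # threshold small coefficients to zero
        for k in 1:size(DX, 2)                       # refit each equation on the surviving terms
            keep = abs.(Ξ[:, k]) .>= λ
            Ξ[keep, k] = Θ[:, keep] \ DX[:, k]
        end
    end
    return Ξ
end

Ξ = stlsq(Θ, DX)     # ≈ [0 0; -2 0; 0 -1; 0 1; 0 0; 0 0], i.e. the sparse true dynamics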

Notice that to use this method, derivative data $\dot{X}$ is required, and in most publications on the subject this information is assumed to be available. In practice, $\dot{X}$ is calculated directly from the time series $X$ by fitting a cubic spline and taking the approximated derivatives at the observation points. However, for this estimation to be stable one needs a fairly dense timeseries for the interpolation. To alleviate this issue, the universal differential equations work estimates terms of partially described models and then uses the neural network as an oracle for the derivative values to learn from subsets of the dynamical system. This allows for the neural network's training to smooth out the derivative estimate between points while incorporating extra scientific information.

Other ways are being investigated for incorporating deep learning into the model discovery process. For example, extensions have been investigated where elements are defined by neural networks representing a basis of the Koopman operator. Additionally, much work is going on in improving the efficiency of the symbolic regression methods themselves, and making the methods implicit and parallel.

Surrogate Acceleration Methods

Another approach for mixing neural networks with differential equations is as a surrogate method. These methods are mathematically simpler than the previous ideas, but can still achieve interesting results. A full example is explained in this video.

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally how we evaluate this function is we do the following:

  • Solve the differential equation with parameters $p$

  • Evaluate $g$ on the numerical solution for $u$

However, this process is computationally expensive since it requires the numerical solution of $u$ for every evaluation. Thus, one can look at this setup and see $g(p)$ itself is a nonlinear function. The idea is to train a neural network to be the function $g(p)$, i.e. directly put in $p$ and return the appropriate value without ever solving the differential equation.

The video highlights an important fact about this method: it can be computationally expensive to train this kind of surrogate since many data points $(p,g(p))$ are required, in fact many more than a single fitting run would normally use. However, after training, the surrogate network for $g(p)$ can be a lot faster than the original simulation-based approach. This means that this is a method for accelerating real-time solutions by doing upfront computations. The total compute time will always be more, but in some sense the cost is amortized or shifted to be done beforehand, so that the model does not need to be simulated on the fly. This can allow for things like computationally expensive models of drone flight to be used in a real-time controller.
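
A minimal sketch of the workflow, assuming Flux.jl's implicit-parameter training loop (this API has shifted between Flux versions) and an illustrative toy $g(p)$ defined by a forward-Euler ODE solve:

using Flux

# Toy g(p): final value of u' = -p*u after time 1 via forward Euler, so g(p) ≈ exp(-p)
function g(p)
    u, dt = 1.0, 0.01
    for _ in 1:100
        u += dt * (-p * u)
    end
    return u
end

# Upfront (expensive) data generation: many (p, g(p)) pairs
p_samples = collect(0.0:0.01:3.0)
data = [([p], [g(p)]) for p in p_samples]

# Train a small network to stand in for g
model = Chain(Dense(1, 32, tanh), Dense(32, 1))
loss(x, y) = Flux.mse(model(x), y)
opt = ADAM(1e-3)
for epoch in 1:100
    Flux.train!(loss, Flux.params(model), data, opt)
end

model([1.5])     # fast surrogate evaluation in place of re-solving the ODE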

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsulated as a $g(p)$ and learned with sufficient data.

\ No newline at end of file +

DiffEqFlux.jl supports the wide gambit of possible universal differential equations with combinations of stiffness, delays, stochasticity, etc. It does so by using Julia's language-wide AD tooling, such as ReverseDiff.jl, Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

Many of the methods below can be encapsulated as a choice of a universal differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl.

Deep BSDE Methods for High Dimensional Partial Differential Equations

The key paper on deep BSDE methods is this article from PNAS by Jiequn Han, Arnulf Jentzen, and Weinan E. Follow up papers like this one have identified a larger context in the sense of forward-backwards SDEs for a large class of partial differential equations.

Understanding the Setup for Terminal PDEs

While this setup may seem a bit contrived given the "very specific" partial differential equation form (you know the end value? You have some parabolic form?), it turns out that there is a large class of problems in economics and finance that satisfy this form. The reason is because in these problems you may know the value of something at the end, when you're going to sell it, and you want to evaluate it right now. The classic example is in options pricing. An option is a contract to be able to solve a stock at a given value. The simplest case is a contract that can only be executed at a pre-determined time in the future. Let's say we have an option to sell a stock at 100 no matter what. This means that, if the stock at the strike time (the time the option can be sold) is 70, we will make 30 from this option, and thus the option itself is worth 30. The question is, if I have this option today, the strike time is 3 months in the future, and the stock price is currently 70, how much should I value the option today?

To solve this, we need to put a model on how we think the stock price will evolve. One simple version is a linear stochastic differential equation, i.e. the stock price will evolve with a constant interest rate $r$ with some volatility (randomness) $\sigma$, in which case:

\[ dX_t = r X_t dt + \sigma X_t dW_t. \]

From this model, we can evaluate the probability that the stock is going to be at given values, which then gives us the probability that the option is worth a given value, which then gives us the expected (or average) value of the option. This is the Black-Scholes problem. However, a more direct way of calculating this result is writing down a partial differential equation for the evolution of the value of the option $V$ as a function of time $t$ and the current stock price $x$. At the final time point, if we know the stock price then we know the value of the option, and thus we have a terminal condition $V(T,x) = g(x)$ for some known value function $g(x)$. The question is, given this value at time $T$, what is the value of the option at time $t=0$ given that the stock currently has a value $x = \zeta$. Why is this interesting? This will tell you what you think the option is currently valued at, and thus if it's cheaper than that, you can gain money by buying the option right now! This means that the "solution" to the PDE is the value $V(0,\zeta)$, where we know the final points $V(T,x) = g(x)$. This is precisely the type of problem that is solved by the deep BSDE method.

The Deep BSDE Method

Consider the class of semilinear parabolic PDEs, in finite time $t\in[0, T]$ and $d$-dimensional space $x\in\mathbb R^d$, that have the form

\[ \begin{align} \frac{\partial u}{\partial t}(t,x) &+\frac{1}{2}\text{trace}\left(\sigma\sigma^{T}(t,x)\left(\text{Hess}_{x}u\right)(t,x)\right)\\ &+\nabla u(t,x)\cdot\mu(t,x) \\ &+f\left(t,x,u(t,x),\sigma^{T}(t,x)\nabla u(t,x)\right)=0,\end{align} \]

with a terminal condition $u(T,x)=g(x)$. In this equation, $\text{trace}$ is the trace of a matrix, $\sigma^T$ is the transpose of $\sigma$, $\nabla u$ is the gradient of $u$, and $\text{Hess}_x u$ is the Hessian of $u$ with respect to $x$. Furthermore, $\mu$ is a vector-valued function, $\sigma$ is a $d \times d$ matrix-valued function and $f$ is a nonlinear function. We assume that $\mu$, $\sigma$, and $f$ are known. We wish to find the solution at initial time, $t=0$, at some starting point, $x = \zeta$.

Let $W_{t}$ be a Brownian motion and take $X_t$ to be the solution to the stochastic differential equation

\[ dX_t = \mu(t,X_t) dt + \sigma (t,X_t) dW_t \]

with initial condition $X(0)=\zeta$. Previous work has shown that the solution satisfies the following BSDE:

\[ \begin{align} u(t, &X_t) - u(0,\zeta) = \\ & -\int_0^t f(s,X_s,u(s,X_s),\sigma^T(s,X_s)\nabla u(s,X_s)) ds \\ & + \int_0^t \left[\nabla u(s,X_s) \right]^T \sigma (s,X_s) dW_s,\end{align} \]

with terminal condition $u(T,X_T) = g(X_T)$.

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochastic differential equation system, one arrives at a recurrent neural network:

Deep BSDE
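Concretely, on a time grid $0 = t_0 < t_1 < \cdots < t_N = T$ with $\Delta W_n = W_{t_{n+1}} - W_{t_n}$, the Euler-Maruyama discretization of the forward SDE together with the BSDE above gives the recursion

\[ \begin{align} X_{n+1} &= X_n + \mu(t_n,X_n)\Delta t + \sigma(t_n,X_n)\Delta W_n,\\ u_{n+1} &= u_n - f\left(t_n,X_n,u_n,(\sigma^T\nabla u)(t_n,X_n)\right)\Delta t + (\sigma^T\nabla u)(t_n,X_n)^T \Delta W_n, \end{align} \]

where $u_0 = u(0,\zeta)$ and each $(\sigma^T\nabla u)(t_n,\cdot)$ is replaced by a neural network; the networks are trained so that the terminal value $u_N$ matches $g(X_N)$, which is what makes the unrolled recursion look like a recurrent neural network.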

Julia Implementation

A Julia implementation for the deep BSDE method can be found at NeuralPDE.jl. The examples considered below are part of the standard test suite.

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that are solved, given our setup, are Black-Scholes types of equations. Unlike a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional grid-based PDE solvers would need around $N^{1000}$ points (for $N$ points per spatial dimension) evolving over time in order to arrive at the solution, which is completely impractical.

One example of a nonlinear Black-Scholes equation in this form is the Black-Scholes equation with default risk. Here we are adding to the standard model the idea that the companies whose stock we are buying can default, and thus our valuation has to take this default probability into account since the option would then become worthless. The PDE that is arrived at is:

\[ \frac{\partial u}{\partial t}(t,x) + \bar{\mu}\cdot \nabla u(t, x) + \frac{\bar{\sigma}^{2}}{2} \sum_{i=1}^{d} \left |x_{i} \right |^{2} \frac{\partial^2 u}{\partial {x_{i}}^2}(t,x) \\ - (1 -\delta )Q(u(t,x))u(t,x) - Ru(t,x) = 0 \]

with terminal condition $g(x) = \min_{i} x_i$ for $x = (x_{1}, \ldots, x_{100}) \in \mathbb{R}^{100}$, where $\delta \in [0, 1)$, $R$ is the interest rate of the risk-free asset, and $Q$ is a piecewise linear function of the current value with three regions ($v^{h} < v^{l}$, $\gamma^{h} > \gamma^{l}$),

\[ \begin{align} Q(y) &= \mathbb{1}_{(-\infty,\upsilon^{h})}(y)\gamma ^{h} + \mathbb{1}_{[\upsilon^{l},\infty)}(y)\gamma ^{l} \\ &+ \mathbb{1}_{[\upsilon^{h},\upsilon^{l}]}(y) \left[ \frac{(\gamma ^{h} - \gamma ^{l})}{(\upsilon ^{h}- \upsilon ^{l})} (y - \upsilon ^{h}) + \gamma ^{h} \right ]. \end{align} \]

This PDE can be cast into the form of the deep BSDE method by setting:

\[ \begin{align} \mu &= \overline{\mu} X_{t} \\ \sigma &= \overline{\sigma} \text{diag}(X_{t}) \\ f &= -(1 -\delta )Q(u(t,x))u(t,x) - R u(t,x) \end{align} \]

The Julia code for this exact problem in 100 dimensions can be found here.
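To make the casting concrete, a minimal sketch of these pieces as plain Julia functions is below; the dimension and the terminal condition come from the text, while the numerical values of $\bar{\mu}$, $\bar{\sigma}$, $\delta$, $R$, and the $Q$ thresholds are assumptions for illustration (the linked example defines the actual problem):

 using LinearAlgebra

 d = 100                                   # number of stocks / PDE dimension
 mu_bar, sigma_bar = 0.02, 0.2             # drift and volatility (assumed values)
 delta, R = 2/3, 0.02                      # recovery fraction and risk-free rate (assumed)
 gamma_h, gamma_l = 0.2, 0.02              # default intensities (assumed)
 v_h, v_l = 50.0, 70.0                     # intensity thresholds, v_h < v_l (assumed)

 # Piecewise-linear default intensity Q(y) as defined above
 Q(y) = y < v_h  ? gamma_h :
        y >= v_l ? gamma_l :
        ((gamma_h - gamma_l)/(v_h - v_l))*(y - v_h) + gamma_h

 g(x)          = minimum(x)                       # terminal condition g(x) = min_i x_i
 mu(t, x)      = mu_bar .* x                      # drift μ = μ̄ X_t
 sigma(t, x)   = sigma_bar .* Diagonal(x)         # diffusion σ = σ̄ diag(X_t)
 f(t, x, u, z) = -(1 - delta)*Q(u)*u - R*u        # nonlinear term f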

Stochastic Optimal Control as a Deep BSDE Application

Another type of problem that fits into this terminal PDE form is the stochastic optimal control problem, which generalizes the context that motivated us before. In this case, there is a set of agents which undergo some known stochastic model. What we want to do is apply some control (push them in some direction) at every single timepoint towards some goal. For example, we have the physics for the dynamics of drone flight, but there's randomness in the wind condition, and so we want to control the engine speeds to move in a certain direction. However, there is a cost associated with controlling, and thus the question is how to best balance the use of controls with the natural stochastic evolution.

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try to propagate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how we should control right now given that the current system is in the state $x = \zeta$. It turns out that $u(t,x)$, with terminal condition $u(T,x)=g(x)$ and with $u(0,\zeta)$ as the quantity we want, is given by a partial differential equation known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs representable by the deep BSDE method.

Take the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions

\[ dX_t = 2\sqrt{\lambda} c_t dt + \sqrt{2} dW_t \]

with $t\in [0,T]$, $X_0 = x$, and with a cost function

\[ C(c_t) = \mathbb{E}\left[\int_0^T \Vert c_t \Vert^2 dt + g(X_T) \right] \]

where $X_t$ is the state we wish to control, $\lambda$ is the strength of the control, and $c_t$ is the control process. Minimizing this cost leads to the Hamilton–Jacobi–Bellman equation:

\[ \frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda \Vert \nabla u(t,x) \Vert^2 = 0 \]

has a solution $u(t,x)$ which at $t=0$ represents the optimal cost of starting from $x$.

This PDE can be rewritten into the canonical form of the deep BSDE method by setting:

\[ \begin{align} \mu &= 0, \\ \sigma &= \overline{\sigma} I, \\ f &= -\alpha \left\| \sigma^T(s,X_s)\nabla u(s,X_s) \right\|^{2}, \end{align} \]

where $\overline{\sigma} = \sqrt{2}$, $T = 1$, and $X_0 = (0,\ldots,0) \in \mathbb{R}^{100}$.

The Julia code for solving this exact problem in 100 dimensions can be found here.

Connections of Reservoir Computing to Scientific Machine Learning

Reservoir computing techniques are an alternative to the "full" neural network techniques we have previously discussed. Their motivation is that the process of training neural networks has a few caveats which can cause difficulties in real systems:

  1. The tangent space diverges exponentially fast when the system is chaotic, meaning that results of both forward and reverse automatic differentiation techniques (and the related adjoints) are divergent on these kinds of systems.

  2. It is hard for neural networks to represent stiff systems. There are many reasons for this, one being that neural networks tend to drop high frequency behavior.

There are ways being investigated to alleviate these issues. For example, shadow adjoints can give a non-divergent average sense of a derivative on ergodic chaotic systems, but they are significantly more expensive than the traditional adjoint.

To get around these caveats, some research teams have investigated alternatives which do not require gradient-based optimization. The clear frontrunner in this field is a type of architecture called echo state networks. A simplified formulation of an echo state network essentially fixes a neural network that defines a reservoir, i.e.

\[ x_{n+1} = \sigma(W x_n + W_{fb} y_n) \]

\[ y_n = g(W_{out} x_n) \]

where $W$ and $W_{fb}$ are fixed random matrices that are chosen before the training process, $x_n$ is called the reservoir state, and $y_n$ is the output state for the observables. The idea is to find a projection $W_{out}$ from the high dimensional random reservoir $x$ to model the timeseries by $y$. If the reservoir is a big enough and nonlinear enough random system, there should in theory exist a projection from that random system that matches any potential timeseries. Indeed, one can prove that echo state networks are universal adaptive filters under certain conditions.

If $g$ is invertible (and in many cases $g$ is taken to be the identity), then one can directly apply the inversion of $g$ to the data. This turns the training of $W_{out}$, the only non-fixed portion, into a standard least squares regression between the reservoir and the observation series. This is then solved by classical means like SVD factorizations which can be stable in ill-conditioned cases.
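A minimal hand-rolled sketch of this training procedure is below (ReservoirComputing.jl provides the real implementation; the reservoir size, the scaling of the random matrices, and the ridge penalty are assumptions):

 using LinearAlgebra, Random

 Y = randn(3, 1000)                    # stand-in observable timeseries, one column per step
 n_res, n_obs = 300, size(Y, 1)
 Random.seed!(1)
 W   = 0.1 .* randn(n_res, n_res)      # fixed random reservoir matrix
 Wfb = 0.1 .* randn(n_res, n_obs)      # fixed random feedback matrix

 # Drive the reservoir with the observed series (teacher forcing)
 X = zeros(n_res, size(Y, 2))
 for n in 1:size(Y, 2)-1
     X[:, n+1] = tanh.(W*X[:, n] + Wfb*Y[:, n])
 end

 # Train only W_out via ridge-regularized least squares: Y ≈ W_out X
 β = 1e-6
 Wout = (Y*X')/(X*X' + β*I)
 Yhat = Wout*X                         # reservoir-based reconstruction of the series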

Echo state networks have been shown to accurately reproduce chaotic attractors which standard RNNs are hard to train against. A demonstration via ReservoirComputing.jl clearly highlights this prediction ability.

However, this methodology is still not tailored to the continuous nature of dynamical systems found in scientific computing. Recent work has extended this methodology to allow for a continuous reservoir, i.e. a continuous-time echo state network. It is shown that using the adaptive points of a stiff ODE integrator gives a non-uniform sampling in time that makes it easier to learn stiff equations from fewer training points, and demonstrates the ability to learn equations where standard physics-informed neural network (PINN) training techniques fail.

This area of research is still far less developed than PINNs and neural differential equations but shows promise to more easily learn highly stiff and chaotic systems which are seemingly out of reach for these other methods.

Automated Equation Discovery: Outputting LaTeX for Dynamical Systems from Data

The SINDy algorithm enables data-driven discovery of governing equations from data. It leverages the fact that most physical systems have only a few relevant terms that define the dynamics, making the governing equations sparse in a high-dimensional nonlinear function space. Given a set of observations

\[ \begin{array}{c} \mathbf{X}=\left[\begin{array}{c} \mathbf{x}^{T}\left(t_{1}\right) \\ \mathbf{x}^{T}\left(t_{2}\right) \\ \vdots \\ \mathbf{x}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} x_{1}\left(t_{1}\right) & x_{2}\left(t_{1}\right) & \cdots & x_{n}\left(t_{1}\right) \\ x_{1}\left(t_{2}\right) & x_{2}\left(t_{2}\right) & \cdots & x_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}\left(t_{m}\right) & x_{2}\left(t_{m}\right) & \cdots & x_{n}\left(t_{m}\right) \end{array}\right] \\ \end{array} \]

and a set of derivative observations

\[ \begin{array}{c} \dot{\mathbf{X}}=\left[\begin{array}{c} \dot{\mathbf{x}}^{T}\left(t_{1}\right) \\ \dot{\mathbf{x}}^{T}\left(t_{2}\right) \\ \vdots \\ \dot{\mathbf{x}}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} \dot{x}_{1}\left(t_{1}\right) & \dot{x}_{2}\left(t_{1}\right) & \cdots & \dot{x}_{n}\left(t_{1}\right) \\ \dot{x}_{1}\left(t_{2}\right) & \dot{x}_{2}\left(t_{2}\right) & \cdots & \dot{x}_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \dot{x}_{1}\left(t_{m}\right) & \dot{x}_{2}\left(t_{m}\right) & \cdots & \dot{x}_{n}\left(t_{m}\right) \end{array}\right] \end{array} \]

we can evaluate the observations in a basis $\Theta(X)$:

\[ \Theta(\mathbf{X})=\left[\begin{array}{llllllll} 1 & \mathbf{X} & \mathbf{X}^{P_{2}} & \mathbf{X}^{P_{3}} & \cdots & \sin (\mathbf{X}) & \cos (\mathbf{X}) & \cdots \end{array}\right] \]

where $X^{P_i}$ stands for all $P_i$th order polynomial terms. For example,

\[ \mathbf{X}^{P_{2}}=\left[\begin{array}{cccccc} x_{1}^{2}\left(t_{1}\right) & x_{1}\left(t_{1}\right) x_{2}\left(t_{1}\right) & \cdots & x_{2}^{2}\left(t_{1}\right) & \cdots & x_{n}^{2}\left(t_{1}\right) \\ x_{1}^{2}\left(t_{2}\right) & x_{1}\left(t_{2}\right) x_{2}\left(t_{2}\right) & \cdots & x_{2}^{2}\left(t_{2}\right) & \cdots & x_{n}^{2}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{1}^{2}\left(t_{m}\right) & x_{1}\left(t_{m}\right) x_{2}\left(t_{m}\right) & \cdots & x_{2}^{2}\left(t_{m}\right) & \cdots & x_{n}^{2}\left(t_{m}\right) \end{array}\right] \]

Using these matrices, SINDy finds the sparse coefficient matrix $\mathbf{\Xi}$ over a given candidate library $\mathbf{\Theta}$ by solving the sparse regression problem $\dot{X} =\mathbf{\Theta}\mathbf{\Xi}$ with $L_1$ regularization, i.e. minimizing the objective function $\left\Vert \mathbf{\dot{X}} - \mathbf{\Theta}\mathbf{\Xi} \right\Vert_2 + \lambda \left\Vert \mathbf{\Xi}\right\Vert_1$. This method and other variants of SINDy, along with specialized optimizers for the LASSO $L_1$ optimization problem, have been implemented in packages like DataDrivenDiffEq.jl and pysindy. The result of these methods is a symbolic (LaTeX-ready) form of the missing dynamical system.
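As a sketch of the core computation, the following implements the sequentially thresholded least-squares variant of SINDy from the original paper (rather than the LASSO form written above); the candidate library, threshold, and iteration count are assumptions, and DataDrivenDiffEq.jl or pysindy should be used in practice:

 # Candidate library Θ(X) with constant, linear, and quadratic terms (2-state system assumed)
 library(X) = hcat(ones(size(X, 1)), X, X[:, 1] .* X[:, 2], X .^ 2)

 # Sequentially thresholded least squares for Ẋ ≈ Θ Ξ
 function sindy_stlsq(Xdot, Θ; λ = 0.1, iters = 10)
     Ξ = Θ \ Xdot                      # initial least-squares fit
     for _ in 1:iters
         small = abs.(Ξ) .< λ          # zero out small coefficients (sparsity)
         Ξ[small] .= 0.0
         for k in 1:size(Xdot, 2)      # refit each equation on its surviving terms
             active = .!small[:, k]
             Ξ[active, k] = Θ[:, active] \ Xdot[:, k]
         end
     end
     return Ξ
 end

 # Usage: Ξ = sindy_stlsq(Xdot, library(X)); the nonzero rows of Ξ name the recovered terms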

Notice that to use this method, derivative data $\dot{X}$ is required, and in most publications on the subject this information is assumed to be given. When it is not, $\dot{X}$ is calculated directly from the time series $X$ by fitting a cubic spline and taking the approximated derivatives at the observation points. However, for this estimation to be stable one needs a fairly dense timeseries for the interpolation. To alleviate this issue, the universal differential equations work estimates terms of partially described models and then uses the neural network as an oracle for the derivative values to learn from subsets of the dynamical system. This allows for the neural network's training to smooth out the derivative estimate between points while incorporating extra scientific information.

Other ways are being investigated for incorporating deep learning into the model discovery process. For example, extensions have been investigated where elements are defined by neural networks representing a basis of the Koopman operator. Additionally, much work is going on in improving the efficiency of the symbolic regression methods themselves, and making the methods implicit and parallel.

Surrogate Acceleration Methods

Another approach for mixing neural networks with differential equations is as a surrogate method. These methods are mathematically simpler than the previous ideas, but can still achieve interesting results. A full example is explained in this video.

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally, we evaluate this function as follows:

  • Solve the differential equation with parameters $p$

  • Evaluate $g$ on the numerical solution for $u$

However, this process is computationally expensive since it requires the numerical solution of $u$ for every evaluation. Thus, one can look at this setup and see that $g(p)$ itself is just a nonlinear function of $p$. The idea is to train a neural network to be the function $g(p)$, i.e. directly put in $p$ and return the appropriate value without ever solving the differential equation.
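A minimal sketch of that workflow is below, assuming a Lotka-Volterra model, a scalar quantity of interest (the prey population at the final time), and a small Flux network as the surrogate; all sizes and hyperparameters are assumptions:

 using OrdinaryDiffEq, Flux

 function lotka!(du, u, p, t)
     du[1] =  p[1]*u[1] - p[2]*u[1]*u[2]
     du[2] = -p[3]*u[2] + p[4]*u[1]*u[2]
 end

 # The expensive map g(p): solve the ODE, then evaluate a quantity of interest
 function g(p)
     prob = ODEProblem(lotka!, [1.0, 1.0], (0.0, 10.0), p)
     sol  = solve(prob, Tsit5())
     return sol[1, end]
 end

 # Generate the (expensive) dataset of (p, g(p)) pairs up front
 ps   = [rand(4) .* 2.0 for _ in 1:1000]
 data = [(Float32.(p), Float32[g(p)]) for p in ps]

 # Train a small surrogate; evaluating it later requires no ODE solves
 surrogate = Chain(Dense(4, 64, tanh), Dense(64, 64, tanh), Dense(64, 1))
 loss(x, y) = Flux.mse(surrogate(x), y)
 opt = ADAM(1e-3)
 for epoch in 1:200
     Flux.train!(loss, Flux.params(surrogate), data, opt)
 end

 surrogate(Float32[1.5, 1.0, 3.0, 1.0])   # fast approximate g(p)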

The video highlights an important fact about this method: it can be computationally expensive to train this kind of surrogate since many data points $(p,g(p))$ are required, in fact many more simulator runs than a typical direct use of the model would ever need. However, after training, the surrogate network for $g(p)$ can be a lot faster than the original simulation-based approach. This means that this is a method for accelerating real-time solutions by doing upfront computations. The total compute time will always be more, but in some sense the cost is amortized or shifted to be done beforehand, so that the model does not need to be simulated on the fly. This can allow for things like computationally expensive models of drone flight to be used in a real-time controller.

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsulated as a $g(p)$ and learned with sufficient data.


From Optimization to Probabilistic Programming

 prob1 = ODEProblem(lotka_volterra,u0,tspan,)
 sol = solve(prob1,Tsit5())
 plot(sol)

and from which we can get an ensemble of solutions:


 prob_func = function (prob,i,repeat)
   remake(prob,p=rand.(θ))
 end
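A typical way such a `prob_func` is then used to build the ensemble referred to below (the trajectory count is an assumption):

 ensemble_prob = EnsembleProblem(prob1,prob_func=prob_func)
 sol = solve(ensemble_prob,Tsit5(),EnsembleThreads(),trajectories=100)
 plot(sol)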

 using DiffEqBase.EnsembleAnalysis
 plot(EnsembleSummary(sol))

From just a few variables having probabilities, every variable has an induced probability: there is a probability distribution on the integrator states, the output at time t_i, etc.

Bayesian Estimation with Point Estimates: Bayes' Rule, Maximum Likelihood, and MAP

Recall from our previous studies that the difficult part of modeling is not necessarily the forward modeling approach, but rather the incorporation of data, i.e. the estimation problem. When your variables are now random distributions, how do you "fit" them?

The answer comes from Bayes' rule, which is the following. Assume you had a prior distribution $p(\theta)$ for the probability that $X$ is a given value $\theta$. Then the posterior probability distribution, $p(\theta|D)$, or the distribution which is updated to include data, is given by:

\[ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int_\Omega p(D|\theta)p(\theta)d\theta} \]

The scaling factor in the denominator is simply a constant that makes the distribution integrate to 1 (so that the resulting function is a probability distribution!). The numerator is simply the prior distribution multiplied by the likelihood of seeing the data given the value of the random variable. The prior distribution must be given, but notice that the likelihood has another name: the likelihood is the model.

The reason they are the same thing is that the model is what tells you the expected outcomes given a value of the random variable, and your data is an observation of such an outcome! However, the likelihood encodes a little bit more information in that it again is a distribution and not a point estimate. We need to make a choice for our measurement distribution on our model's results.

Quick Question: Why is this referred to as measurement noise? Why is it not process noise?

A common choice for the measurement distribution is the Normal distribution. This comes from the Central Limit Theorem (CLT) which essentially states that, given enough interacting mechanisms, the average values of things "tend to become normally distributed". The true statement of the CLT is much more complex, but that is a decent working definition for practical use. The normal distribution is defined by two parameters, $\mu$ and $\sigma$, and is given by the following function:

\[ f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(x-\mu)^2}{2\sigma^2}) \]

This is a bell curve centered at $\mu$ with standard deviation $\sigma$ (variance $\sigma^2$). Our best guess for the output, i.e. the model's prediction, should be the average measurement, meaning that $\mu$ is the result from the simulator. $\sigma$ is a parameter for how much measurement error we expect (some intuition on $\sigma$ will come soon).

Let's return to thinking about the ODE example. In this case we have $\theta$ as a vector of random variables. This means that $u(t;\theta)$ is a random variable for the ODE $u'= ...$'s solution at a given point in time $t$. If we have a measurement at a time $t_i$ and assume our measurement noise is normally distributed with some constant measurement noise $\sigma$, then the likelihood of our data would be $f(x_i;u(t_i;\theta),\sigma)$ at each data point $(t_i,x_i)$. From probability we know that seeing the composition of events is given by the multiplication of probabilities, so the probability of seeing the full dataset given observations $D = (t_i,x_i)$ along the timeseries is:

\[ p(D|\theta) = \prod_i f(x_i;u(t_i;\theta),\sigma) \]

This can be read as: solve the model with the given parameters, and the probability of having seen the measurement is thus given by a product of normal distribution calculations. Note that in many cases the product is not numerically stable (and grows exponentially), and so the likelihood is transformed to the log-likelihood. To get this expression, we take the log of both sides and notice that the product becomes a summation, and thus:

\[ \begin{align} \log p(D|\theta) &= \sum_i \log f(x_i;u(t_i;\theta),\sigma)\\ &= -N\log\left(\sqrt{2\pi}\sigma\right) - \frac{1}{2\sigma^2} \sum_i \left(x_i-u(t_i;\theta)\right)^2 \end{align} \]
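In code, this log-likelihood is just a sum of normal logpdf terms over the data; a minimal sketch using Distributions.jl, assuming data arrays `ts` and `xs` measuring the first state, a fixed `σ`, and the `prob1` problem from above:

 using Distributions

 function log_likelihood(p, ts, xs, σ)
     prob = remake(prob1, p = p)              # re-solve the model with these parameters
     sol  = solve(prob, Tsit5(), saveat = ts)
     return sum(logpdf(Normal(sol(t)[1], σ), x) for (t, x) in zip(ts, xs))
 end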

Notice that maximizing this log-likelihood is equivalent to minimizing the L2 norm of the solution against the data! Thus we can see a few things:

  1. Previous parameter estimation by minimizing a norm against data can be seen as maximum likelihood with some measurement distribution. L2 norm corresponds to assuming measurement noise is normally distributed and all of the measurements have the same error variance.

  2. By the same derivation, having different error variances with normally distributed errors is equivalent to doing weighted L2 estimation.

This reformulation (generalization?) to likelihoods of probability distributions is known as maximum likelihood estimation (MLE), but is equivalent to our previous forms of parameter estimation using point estimates against data. However, this calculation is ignoring Bayes' rule, and is thus not finding the parameters which have the highest probability. To do that, we need to go back to Bayes' rule which states that:

\[ \log p(\theta|D) = \log p(D|\theta) + \log p(\theta) - C \]

Thus, maximizing the log-likelihood is "almost" the same as finding the most probable parameters, except that we need to add the weight $\log p(\theta)$ given by our prior distribution! If we assume our prior distribution is flat, like a uniform distribution, then we have a non-informative prior and the maximum posterior point matches that of the maximum likelihood estimation. However, this formulation allows us to get point estimates in a way that takes into account prior knowledge, and is called maximum a posteriori estimation (MAP).

Bayesian Estimation of Posterior Distributions with Monte Carlo

The previous discussion still solely focused on getting point estimates for the most probable parameters. However, what if we wanted to find the distributions of the parameters, i.e. the full posterior $p(\theta|D)$? Outside of very few small models, this cannot be done analytically and is thus the basic problem of probabilistic programming. There are two general approaches:

  1. Sampling-based approaches. Sample parameters $\theta_i$ in such a manner that the array $[\theta_i]$ converges to an array sampled from the true distribution, and thus with enough samples one can capture the distribution numerically.

  2. Variational inference. Find some way to represent the probability distribution and push forward the distributions at every step of the program.

Recovering Distributions from Sampled Points

It's clear from above that if you have a distribution, like Normal(5,1), that you can sample from the distribution to get an array of values which follow the distribution. However, in order for the following sampling approaches to make sense, we need to see how to recover a distribution from discrete samples. So let's say you had a bunch of normally distributed points:


 X = Normal(5,1)
 x = [rand(X) for i in 1:100]
 scatter(x,[1 for i in 1:100])

Notice that there are more points in the areas of higher probability. Thus the density of sampled points gives us an estimate for the probability of having points in a given area. We can then count the number of points in a bin and divide by the total number of points in order to get the probability of being in a specific region. This is depicted by a histogram:


 histogram(x)

and we see this converges when we get more points:


 histogram([rand(X) for i in 1:10000],normed=true)
 using StatsPlots
 plot!(X,lw=5)

A continuous form of this is the kernel density estimate, which is essentially a smoothed binning approach.


 using KernelDensity
 plot(kde([rand(X) for i in 1:10000]),lw=5)
 plot!(X,lw=5)

Thus, for the sampling-based approaches, we simply need to arrive at an array which is sampled according to the distribution that we want to estimate, and from that array we can recover the distribution.

Sampling Distributions with the Metropolis Hastings Algorithm

The Metropolis-Hastings algorithm is the simplest form of Markov Chain Monte Carlo (MCMC) which gives a way of sampling the $\theta$ distribution. To see how this algorithm works, let's understand the ratio between two points in the posterior probability. If we have $x_i$ and $x_j$, the ratio of the two probabilities would be given by:

\[ \frac{p(x_i|D)}{p(x_j|D)} = \frac{p(D|x_i)p(x_i)}{p(D|x_j)p(x_j)} \]

(notice that the integration constant cancels). This motivates the idea that all we have to do is ensure we only go to a point $x_j$ from $x_i$ with probability difference that matches that ratio, and over time if we do this between "all points" we will have the right number of "each point" in the distribution (quotes because it's continuous). With a bit more rigour we arrive at the following algorithm:

  1. Starting at $x_i$, take $x_{i+1}$ from a sampling algorithm $g(x_{i+1}|x_i)$.

  2. Calculate $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})g(x_i|x_{i+1})}{p(D|x_i)p(x_i)g(x_{i+1}|x_i)}\right)$. Notice that if we require $g$ to be symmetric, then this simplifies to the probability ratio $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})}{p(D|x_i)p(x_i)}\right)$

  3. Use a random number to accept the step with a probability $A$. Go back to step 1, incrementing $i$ if accepted, otherwise just repeat.

I.e, we just walk around the space biasing the acceptance of a step by the factor $\frac{p(x_i|D)}{p(x_j|D)}$ and sooner or later we will have spent the right amount of time in each area, giving the correct distribution.

(This can be rigorously proven, and those details are left out.)
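A minimal random-walk Metropolis-Hastings sketch for the symmetric-proposal case follows; it works with log densities for numerical stability, and the proposal width and the `logpost` function (log prior plus log likelihood) are assumptions:

 function metropolis_hastings(logpost, x0; iters = 10_000, step = 0.1)
     x, lp = copy(x0), logpost(x0)
     samples = [copy(x0)]
     for _ in 1:iters
         xp  = x .+ step .* randn(length(x))   # symmetric Gaussian proposal g
         lpp = logpost(xp)                     # every call here is a simulator run!
         if log(rand()) < lpp - lp             # accept with probability A
             x, lp = xp, lpp
         end
         push!(samples, copy(x))
     end
     return samples
 end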

The Cost of Bayesian Estimation

Let's take a quick moment to understand the high cost of Bayesian posterior estimations. While before we were getting point estimates, now we are trying to recover a full probability distribution, and each accept/reject probability calculation requires evaluating the likelihood at some point. Remember, the likelihood is generated by our simulator, and thus every evaluation here is an ODE solver call or a neural network forward pass! This means that to get good distributions, we are solving the ODE hundreds of thousands of times, i.e. even more than when doing parameter estimation! This is something to keep in mind.

However, notice that this process is trivially parallelizable. We can just have many chains running in parallel, i.e. start 16 processes all doing Metropolis-Hastings, and in the end they are all sampling from the same distribution, so the final array can simply be the pooled results of each chain.

Hamiltonian Monte Carlo

Metropolis-Hastings is easy to motivate and implement. However, it does not do well in high dimensional spaces because it searches in all directions. For example, it's common for the sampling distribution $g$ to be a multivariable distribution (i.e. normal in all directions). However, high dimensional objects commonly sit on low dimensional manifolds (known as the manifold hypothesis). If that's the case, the most probable set of parameters is something that is low dimensional. For example, parameters may compensate for one another, and so $\theta_1^2 + \theta_2^2 + \theta_3^2 = 1$ might be the manifold on which all of the most probable choices for $\theta$ lie, in which case we need to sample on the sphere instead of all of $\mathbb{R}^3$.

However, it's quick to see that this will give Metropolis-Hastings some trouble, since it will use a normal distribution around the current point, and thus even if we start on the sphere, it will have a high chance of trying a point not on the sphere in the next round! This can be depicted as:

Recall that every single rejection is still evaluating the likelihood (since it's calculating an acceptance probability, finding it near zero, rejecting and starting again), and every likelihood call is calling our simulator, and so this is sllllllooooooooooooooowwwwwwwww in high dimensions!

What we need to do instead is ensure that we walk along the path of high probability. What we want to do is thus build a vector field that matches our high probability regions

and follow said vector field (following a vector field is solving what kind of equation?). The first idea one might have is to use the gradient. However, while this idea has the right intentions, the issue is that the gradient of the probability will average out all of the possible probabilities, and will thus flow towards the mode of the distribution:

To overcome this issue, we look to physical systems and see that a satellite orbiting a planet always nicely stays on some manifold instead of following the gradient:

The reason why it does is because it has momentum. Recall from basic physics that one way to describe a physical system is through Hamiltonian mechanics, where $H(x,p)$ is the energy associated with the state $(x,p)$ (normally $x$ is location and $p$ is momentum). Due to conservation of energy, the solution of the dynamical equations leads to $H(x,p)$ being constant, and thus the dynamics follow the level sets of $H$. From the Hamiltonian the dynamics of the system are:

\[ \begin{align} \frac{dx}{dt} &= \frac{dH}{dp},\\ \frac{dp}{dt} &= -\frac{dH}{dx}. \end{align} \]

Here we want our Hamiltonian to be our posterior probability, so that way we stay on the manifold of high probability. This means:

\[ H(x,p) = - \log \pi(x,p) \]

where $\pi(x,p) = \pi(p|x)\pi(x)$ (where I am now using $\pi$ for probability since $p$ is momentum!). So to lift from a probability over parameters to one that includes momentum, we simply need to choose a conditional distribution $\pi(p|x)$. This would mean that

\[ \begin{align} H(x,p) &= -\log \pi(p|x) - \log \pi(x)\\ &= K(p,x) + V(x) \end{align} \]

where $K$ is the kinetic energy and $V$ is the potential. Thus the potential energy is directly given by the posterior calculation, and the kinetic energy is thus a choice that is used to build the correct Hamiltonian. Hamiltonian Monte Carlo methods then dig into good ways to choose the kinetic energy function. This is done at the start (along with the choice of ODE solver time step) in such a way that it maximizes acceptance probabilities.
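As a sketch of what following these dynamics looks like with the common Gaussian kinetic-energy choice $K(p) = \frac{1}{2}\Vert p \Vert^2$, here is a single leapfrog step; the step size and the log-posterior gradient function are assumptions, and that gradient is exactly where the differentiable programming discussion below enters:

 # One leapfrog step of dx/dt = ∂H/∂p = p, dp/dt = -∂H/∂x = ∇ log π(x)
 function leapfrog(x, p, ∇logπ, ϵ)
     p_half = p .+ (ϵ/2) .* ∇logπ(x)
     x_new  = x .+ ϵ .* p_half
     p_new  = p_half .+ (ϵ/2) .* ∇logπ(x_new)
     return x_new, p_new
 end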

Connections to Differentiable Programming

\[ -\frac{dH}{dx} \]

requires calculating the gradient of the likelihood function with respect to the parameters, so we are once again using the gradient of our simulator! This means that all of our previous discussion on automatic differentiation and differentiable programming applies to the Hamiltonian Monte Carlo context.

There's another thread to follow that transformations of probability distributions are pushforwards of the Jacobian transformations (given the transformation of an integral formula), and this is used when doing variational inference.

Symplectic and Geometric Integration

One way to integrate the system of ODEs which result from the Hamiltonian system is to convert it to a system of first order ODEs and solve it directly. However, this loses information and can result in drift. This is demonstrated by looking at the long time solution of the pendulum:


 using ParameterizedFunctions
 u0 = [1.,0.]
 harmonic! = @ode_def HarmonicOscillator begin
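 # A sketch of how this setup typically continues, written with a plain in-place ODE
 # function in place of the @ode_def macro; the time span and solver are assumptions.
 function harmonic_rhs!(du,u,p,t)
     du[1] = u[2]       # dx/dt = p
     du[2] = -u[1]      # dp/dt = -x
 end
 prob = ODEProblem(harmonic_rhs!,u0,(0.0,500.0))
 sol = solve(prob,Tsit5())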

 plot(sol,vars=(1,2))
 plot(sol)

Notice that what should be a purely oscillatory system slowly loses energy and falls inward towards the center. To avoid this issue, we can do a few things:

  1. Project back to the manifold after steps. That can be costly (but it might only need to happen every once in a while!)

  2. Use a symplectic integrator.

A symplectic integrator is an integrator whose solution lives on a symplectic manifold, i.e. it preserves the area of the $(x,p)$ ellipses as it numerically approximates the flow. This means that:

  • Long-time integrations are truly cyclic with only floating point drift.

  • Steps preserve area. In the sense of Hamiltonian Monte Carlo, this means they preserve probability and thus increase the acceptance rate.

These properties are demonstrated in the Kepler problem demo. However, note that while the solution lives on a symplectic manifold, it isn't necessarily the correct symplectic manifold. The shift in the manifold is $\mathcal{O}(\Delta t^k)$ where $k$ is the order of the method. For more information on symplectic integration, consult this StackOverflow response which goes into depth.
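As a minimal hand-rolled sketch of this behavior (a harmonic oscillator with $H(x,p) = x^2/2 + p^2/2$ rather than the lecture's Kepler demo), the Störmer-Verlet/leapfrog method keeps the energy error bounded for all time instead of letting the trajectory spiral inward:

function leapfrog(x, p, dVdx, Δt, nsteps)
    xs = [x]; ps = [p]
    for _ in 1:nsteps
        p -= Δt/2 * dVdx(x)          # half kick
        x += Δt * p                  # drift
        p -= Δt/2 * dVdx(x)          # half kick
        push!(xs, x); push!(ps, p)
    end
    return xs, ps
end

xs, ps = leapfrog(1.0, 0.0, x -> x, 0.1, 100_000)    # V(x) = x^2/2, so dV/dx = x
E = @. xs^2/2 + ps^2/2                               # energy along the trajectory
maximum(abs, E .- 0.5)                               # stays O(Δt^2) for all time; no secular drift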

Application: Bayesian Estimation of Differential Equation Parameters

For a full demo of probabilistic programming on a differential equation system, see this tutorial on Bayesian inference of pendulum parameters utilizing DifferentialEquations.jl and DiffEqBayes.jl.

Bayesian Estimation of Posterior Distributions with Variational Inference

Instead of using sampling, one can use variational inference to push through probability distributions. There are many ways to do variational inference, but a lot of the methods can be very model-specific. However, a recent change to probabilistic programming has been the development of Automatic Differentiation Variational Inference (ADVI): a general variational inference method which is not model-specific and instead uses AD. This has allowed for large expensive models to get effective distributional estimation, something that wasn't previously possible with HMC. In this section we will build up this methodology and understand its performance characteristics.

ADVI as Optimization

In this form of variational inference, we wish to directly estimate the posterior distribution. To do so, we pick a functional form to represent the solution $q(\theta; \phi)$ where $\phi$ are latent variables. We want our resulting distribution to fit the posterior, and thus we enforce that:

\[ \phi^\ast = \text{argmin}_{\phi} \text{KL} \left(q(\theta; \phi) \Vert p(\theta | D)\right) \]

where KL is the Kullback-Leibler divergence. The KL divergence measures the discrepancy between probability distributions, and so this is simply a cost function on the distance between a chosen distribution and a desired distribution: when $\phi$ is chosen well, $q$ is a good approximation to the posterior.

However, the KL divergence lacks an analytical form because it requires knowing the posterior, the very quantity we are trying to estimate. It turns out that we can instead maximize the Evidence Lower Bound (ELBO):

\[ \mathcal{L}(\phi) = \mathbb{E}_{q}[\log p(x,\theta)] - \mathbb{E}_q [\log q(\theta; \phi)] \]

The ELBO is equivalent to the negative KL divergence up to a constant $\log p(x)$, which means that maximizing this is equivalent to minimizing the KL divergence.

One last detail is necessary in order for this problem to be tractable. To know the set of possible values to optimize over, we assume that the support of $q$ is a subset of the support of the prior. This means that our prior has to cover the support of the posterior, which makes sense and matches Cromwell's rule for MCMC.

At this point we assume that $q$ is Gaussian. When we rewrite the ELBO in terms of the standard Gaussian (the reparameterization), we obtain an expectation that is automatically differentiable, so calculating gradients is done with AD. Using only one or a few solves per step gives a noisy gradient estimate, which is used to optimize the latent variables $\phi$ and home in on the best approximation.

A Note on Implementation of Optimization for Probabilistic Programming

Variable domains can be constrained; for example, a parameter may be required to be positive. This can be handled by a transformation: if the parameter must be positive, one can optimize over an unconstrained $y$ and use $\exp(y)$ wherever the parameter appears, allowing $y$ to take any real value while $\exp(y)$ stays positive. This turns the problem into an unconstrained optimization over the real numbers, and similar transformations can be constructed for the support of any of the standard probability distributions.

Citation

For Hamiltonian Monte Carlo, the images were taken from A Conceptual Introduction to Hamiltonian Monte Carlo by Michael Betancourt.

\ No newline at end of file diff --git a/_weave/lecture17/global_sensitivity/index.html b/_weave/lecture17/global_sensitivity/index.html index 45cc19a2..a9bf1f6e 100644 --- a/_weave/lecture17/global_sensitivity/index.html +++ b/_weave/lecture17/global_sensitivity/index.html @@ -9,4 +9,4 @@

Global Sensitivity Analysis

Chris Rackauckas
using LatinHypercubeSampling
p = LHCoptim(120,2,1000)
scatter(p[1][:,1],p[1][:,2])
-

For a reference library with many different quasi-Monte Carlo samplers, check out QuasiMonteCarlo.jl.

Fourier Amplitude Sensitivity Testing (FAST) and eFAST

The FAST method is a change to the Sobol method to allow for faster convergence. First transform the variables $x_i$ onto the space $[0,1]$. Then, instead of the linear decomposition, one decomposes into a Fourier basis:

\[ f(x_1,x_2,\ldots,x_n) = \sum_{m_1 = -\infty}^{\infty} \ldots \sum_{m_n = -\infty}^{\infty} C_{m_1m_2\ldots m_n}\exp\left(2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) \]

where

\[ C_{m_1m_2\ldots m_n} = \int_0^1 \ldots \int_0^1 f(x_1,x_2,\ldots,x_n) \exp\left(-2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) dx_1 \ldots dx_n \]

The ANOVA-like decomposition is thus

\[ f_0 = C_{0\ldots 0} \]

\[ f_j = \sum_{m_j \neq 0} C_{0\ldots 0 m_j 0 \ldots 0} \exp (2\pi i m_j x_j) \]

\[ f_{jk} = \sum_{m_j \neq 0} \sum_{m_k \neq 0} C_{0\ldots 0 m_j 0 \ldots m_k 0 \ldots 0} \exp \left(2\pi i (m_j x_j + m_k x_k)\right) \]

The first order conditional variance is thus:

\[ V_j = \int_0^1 f_j^2 (x_j) dx_j = \sum_{m_j \neq 0} |C_{0\ldots 0 m_j 0 \ldots 0}|^2 \]

or

\[ V_j = 2\sum_{m_j = 1}^\infty \left(A_{m_j}^2 + B_{m_j}^2 \right) \]

where $C_{0\ldots 0 m_j 0 \ldots 0} = A_{m_j} + i B_{m_j}$. By Fourier series we know this to be:

\[ A_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\cos(2\pi m_j x_j)dx \]

\[ B_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\sin(2\pi m_j x_j)dx \]

Implementation via the Ergodic Theorem

Define

\[ X_j(s) = \frac{1}{2\pi} (\omega_j s \mod 2\pi) \]

By the ergodic theorem, if the $\omega_j$ are irrational numbers, then the dynamical system will never repeat values and thus it will create a solution that is dense in the plane (we will prove this a bit later). As an animation:

(here, $\omega_1 = \pi$ and $\omega_2 = 7$)

This means that:

\[ A_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\sin(m_j \omega_j s)ds \]

i.e. the multidimensional integral can be approximated by the integral over a single line.

One can satisfy this approximately to get a simpler form for the integral. Using integer $\omega_j$ makes the integrand periodic, so only integrating over a period of $2\pi$ is required. This would mean that:

\[ A_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\sin(m_j \omega_j s)ds \]

It's only approximate since the sequence cannot be dense. For example, with $\omega_1 = 11$ and $\omega_2 = 7$:

A higher period gives a better fill of the space and thus a better approximation, but may require more points. However, this transformation makes the true integrals simple one-dimensional quadratures which can be efficiently computed.
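As a hedged numerical sketch of these one-dimensional quadratures (a toy additive model and illustrative frequency/sample choices, not the lecture's example), the first-order indices can be computed directly from samples along the search curve:

using Statistics
N  = 10_001
s  = range(-π, π; length = N)
ω  = (11, 7)                                   # integer driver frequencies
X(t, ωj) = mod(ωj * t, 2π) / (2π)              # the search curve X_j(s)
x1 = X.(s, ω[1]);  x2 = X.(s, ω[2])
y  = @. 4*(x1 - 0.5)^2 + 8*(x2 - 0.5)^2        # toy model with analytic S_1 = 0.2, S_2 = 0.8
# A_m, B_m ≈ (1/2π) ∫ f cos/sin(m ω_j s) ds, estimated by sample means
Vj(ωj; M = 6) = 2 * sum(mean(@. y*cos(m*ωj*s))^2 + mean(@. y*sin(m*ωj*s))^2 for m in 1:M)
D  = var(y)                                    # total variance along the search curve
S1, S2 = Vj(ω[1])/D, Vj(ω[2])/D                # ≈ 0.2 and 0.8, matching the analytic values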

To get the total index from this method, one can calculate the total contribution of the complementary set, i.e. $V_{c_i} = \sum_{j \neq i} V_j$ and then

\[ S_{T_i} = 1 - S_{c_i} \]

Note that this then is a fast measure for the total contribution of variable $i$, including all higher-order nonlinear interactions, all from one-dimensional integrals! (This extension is called extended FAST or eFAST)

Proof of the Ergodic Theorem

Look at the map $x_{n+1} = x_n + \alpha \pmod{1}$, where $\alpha$ is irrational. This is the irrational rotation map that corresponds to our problem. We wish to prove that in any interval $I$, there is a point of our orbit in this interval.

First let's prove a useful result: points of the orbit get arbitrarily close. Assume that for some finite $\epsilon$ no two points are within $\epsilon$ of each other. This means the spacings between the points are at least $\epsilon$, and thus there are at most $\frac{1}{\epsilon}$ points (rounded up). Since the orbit takes only finitely many values, it must be periodic, which means that there is a $p$ such that

\[ x_{n+p} = x_n \]

which means that $p \alpha$ is an integer, i.e. $\alpha = \frac{q}{p}$ for some integer $q$, which is a contradiction since $\alpha$ is irrational.

Thus for every $\epsilon$ there are two points which are less than $\epsilon$ apart. Now take an arbitrary interval $I$ and let $\epsilon < d/2$ where $d$ is the length of the interval. We have just shown that there are two points $x_{n+m}$ and $x_{n+k}$ which are $<\epsilon$ apart. Assuming WLOG $m>k$, this means that $m-k$ applications of the map take $x_{n+k}$ to $x_{n+m}$, and so $m-k$ applications amount to a rotation by less than $\epsilon$. If we apply this composite rotation $\frac{1}{\epsilon}$ (rounded up) times, we cover the space with intervals of length less than $\epsilon$, each containing a point of the orbit. Since $\epsilon < d/2$, one of those intervals is completely contained in $I$, which means there is at least one point of our orbit in $I$.

Thus for every interval we have at least one point of our orbit in it, proving that the orbit of the rotation map with irrational $\alpha$ is dense. Note that during the proof we essentially showed as well that if $\alpha$ is rational, then the map is periodic with period given by the denominator of $\alpha$ in its reduced form.
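A quick numerical illustration of this dichotomy (illustrative only, not part of the proof):

orbit(α, n) = [mod(k*α, 1) for k in 1:n]
length(unique(round.(orbit(√2, 10_000); digits = 3)))   # ≈ 1000: the irrational orbit fills [0,1)
length(unique(round.(orbit(3/7, 10_000); digits = 3)))  # 7: the rational orbit is periodic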

A Quick Note on Parallelism

Very quick note: all of these methods are hyper-parallel since they perform the same calculation per parameter or trajectory, and each calculation is long. For quasi-Monte Carlo, after generating "good enough" trajectories, one can evaluate the model at all points in parallel, and then simply do the GSA index measurement. For FAST, one can do each quadrature in parallel.
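As a schematic sketch of that parallel evaluation step (the model and sample matrix below are placeholders, not from the lecture):

using Base.Threads
model(p) = sum(abs2, p)                          # placeholder for an expensive simulation
X = rand(3, 1000)                                # placeholder: 3 parameters × 1000 sample points
ys = Vector{Float64}(undef, size(X, 2))
@threads for i in 1:size(X, 2)
    ys[i] = model(@view X[:, i])                 # independent evaluations run on separate threads
end
# the GSA indices are then computed from ys in a cheap serial post-processing step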

\ No newline at end of file +

For a reference library with many different quasi-Monte Carlo samplers, check out QuasiMonteCarlo.jl.

Fourier Amplitude Sensitivity Testing (FAST) and eFAST

The FAST method is a change to the Sobol method to allow for faster convergence. First transform the variables $x_i$ onto the space $[0,1]$. Then, instead of the linear decomposition, one decomposes into a Fourier basis:

\[ f(x_1,x_2,\ldots,x_n) = \sum_{m_1 = -\infty}^{\infty} \ldots \sum_{m_n = -\infty}^{\infty} C_{m_1m_2\ldots m_n}\exp\left(2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) \]

where

\[ C_{m_1m_2\ldots m_n} = \int_0^1 \ldots \int_0^1 f(x_1,x_2,\ldots,x_n) \exp\left(-2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) dx_1 \ldots dx_n \]

The ANOVA-like decomposition is thus

\[ f_0 = C_{0\ldots 0} \]

\[ f_j = \sum_{m_j \neq 0} C_{0\ldots 0 m_j 0 \ldots 0} \exp (2\pi i m_j x_j) \]

\[ f_{jk} = \sum_{m_j \neq 0} \sum_{m_k \neq 0} C_{0\ldots 0 m_j 0 \ldots m_k 0 \ldots 0} \exp \left(2\pi i (m_j x_j + m_k x_k)\right) \]

The first order conditional variance is thus:

\[ V_j = \int_0^1 f_j^2 (x_j) dx_j = \sum_{m_j \neq 0} |C_{0\ldots 0 m_j 0 \ldots 0}|^2 \]

or

\[ V_j = 2\sum_{m_j = 1}^\infty \left(A_{m_j}^2 + B_{m_j}^2 \right) \]

where $C_{0\ldots 0 m_j 0 \ldots 0} = A_{m_j} + i B_{m_j}$. By Fourier series we know this to be:

\[ A_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\cos(2\pi m_j x_j)dx \]

\[ B_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\sin(2\pi m_j x_j)dx \]

Implementation via the Ergodic Theorem

Define

\[ X_j(s) = \frac{1}{2\pi} (\omega_j s \mod 2\pi) \]

By the ergodic theorem, if the $\omega_j$ are irrational numbers, then the dynamical system will never repeat values and thus it will create a solution that is dense in the plane (we will prove this a bit later). As an animation:

(here, $\omega_1 = \pi$ and $\omega_2 = 7$)

This means that:

\[ A_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\sin(m_j \omega_j s)ds \]

i.e. the multidimensional integral can be approximated by the integral over a single line.

One can satisfy this approximately to get a simpler form for the integral. Using integer $\omega_j$ makes the integrand periodic, so only integrating over a period of $2\pi$ is required. This would mean that:

\[ A_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\sin(m_j \omega_j s)ds \]

It's only approximate since the sequence cannot be dense. For example, with $\omega_1 = 11$ and $\omega_2 = 7$:

A higher period gives a better fill of the space and thus a better approximation, but may require more points. However, this transformation makes the true integrals simple one-dimensional quadratures which can be efficiently computed.

To get the total index from this method, one can calculate the total contribution of the complementary set, i.e. $V_{c_i} = \sum_{j \neq i} V_j$ and then

\[ S_{T_i} = 1 - S_{c_i} \]

Note that this then is a fast measure for the total contribution of variable $i$, including all higher-order nonlinear interactions, all from one-dimensional integrals! (This extension is called extended FAST or eFAST)

Proof of the Ergodic Theorem

Look at the map $x_{n+1} = x_n + \alpha \pmod{1}$, where $\alpha$ is irrational. This is the irrational rotation map that corresponds to our problem. We wish to prove that in any interval $I$, there is a point of our orbit in this interval.

First let's prove a useful result: points of the orbit get arbitrarily close. Assume that for some finite $\epsilon$ no two points are within $\epsilon$ of each other. This means the spacings between the points are at least $\epsilon$, and thus there are at most $\frac{1}{\epsilon}$ points (rounded up). Since the orbit takes only finitely many values, it must be periodic, which means that there is a $p$ such that

\[ x_{n+p} = x_n \]

which means that $p \alpha$ is an integer, i.e. $\alpha = \frac{q}{p}$ for some integer $q$, which is a contradiction since $\alpha$ is irrational.

Thus for every $\epsilon$ there are two points which are less than $\epsilon$ apart. Now take an arbitrary interval $I$ and let $\epsilon < d/2$ where $d$ is the length of the interval. We have just shown that there are two points $x_{n+m}$ and $x_{n+k}$ which are $<\epsilon$ apart. Assuming WLOG $m>k$, this means that $m-k$ applications of the map take $x_{n+k}$ to $x_{n+m}$, and so $m-k$ applications amount to a rotation by less than $\epsilon$. If we apply this composite rotation $\frac{1}{\epsilon}$ (rounded up) times, we cover the space with intervals of length less than $\epsilon$, each containing a point of the orbit. Since $\epsilon < d/2$, one of those intervals is completely contained in $I$, which means there is at least one point of our orbit in $I$.

Thus for every interval we have at least one point of our orbit in it, proving that the orbit of the rotation map with irrational $\alpha$ is dense. Note that during the proof we essentially showed as well that if $\alpha$ is rational, then the map is periodic with period given by the denominator of $\alpha$ in its reduced form.

A Quick Note on Parallelism

Very quick note: all of these methods are hyper-parallel since they perform the same calculation per parameter or trajectory, and each calculation is long. For quasi-Monte Carlo, after generating "good enough" trajectories, one can evaluate the model at all points in parallel, and then simply do the GSA index measurement. For FAST, one can do each quadrature in parallel.

\ No newline at end of file diff --git a/_weave/lecture17/jl_4XQZis/global_sensitivity_2_1.png b/_weave/lecture17/jl_4XQZis/global_sensitivity_2_1.png deleted file mode 100644 index 213efbee..00000000 Binary files a/_weave/lecture17/jl_4XQZis/global_sensitivity_2_1.png and /dev/null differ diff --git a/_weave/lecture17/jl_CZCo1s/global_sensitivity_2_1.png b/_weave/lecture17/jl_CZCo1s/global_sensitivity_2_1.png new file mode 100644 index 00000000..441b95fe Binary files /dev/null and b/_weave/lecture17/jl_CZCo1s/global_sensitivity_2_1.png differ diff --git a/_weave/lecture18/code_profiling/index.html b/_weave/lecture18/code_profiling/index.html index 18e37942..ecdaa609 100644 --- a/_weave/lecture18/code_profiling/index.html +++ b/_weave/lecture18/code_profiling/index.html @@ -100,4 +100,4 @@

Code Profiling and Optimization

Chris Rackauckas
@profile for i in 1:10000 sol = solve(prob,Tsit5(),save_everystep=false) end
 Juno.profiler()
-

Now that this looks like a fairly good profile, we can use this to dig in and find out what lines need to be optimized!
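As an aside, if the Juno profiler pane is not available, the standard library's text printer shows the same information (this assumes `using Profile`, which is also where the `@profile` macro comes from):

Profile.print(format = :tree, mincount = 10)   # call-tree view, hiding rarely sampled frames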

\ No newline at end of file +

Now that this looks like a fairly good profile, we can use this to dig in and find out what lines need to be optimized!

\ No newline at end of file diff --git a/_weave/lecture18/jl_ZpmxjZ/code_profiling_5_1.png b/_weave/lecture18/jl_V5rucR/code_profiling_5_1.png similarity index 100% rename from _weave/lecture18/jl_ZpmxjZ/code_profiling_5_1.png rename to _weave/lecture18/jl_V5rucR/code_profiling_5_1.png diff --git a/_weave/lecture19/jl_LA0wN9/uncertainty_programming_1_1.png b/_weave/lecture19/jl_ueBBKe/uncertainty_programming_1_1.png similarity index 100% rename from _weave/lecture19/jl_LA0wN9/uncertainty_programming_1_1.png rename to _weave/lecture19/jl_ueBBKe/uncertainty_programming_1_1.png diff --git a/_weave/lecture19/uncertainty_programming/index.html b/_weave/lecture19/uncertainty_programming/index.html index 73732e0e..7a4a48c2 100644 --- a/_weave/lecture19/uncertainty_programming/index.html +++ b/_weave/lecture19/uncertainty_programming/index.html @@ -198,4 +198,4 @@

Uncertainty Programming, Generalized Uncertainty Quantification
plot(sim,vars=(0,1),linealpha=0.4)
 ERROR: UndefVarError: `AdaptiveProbIntsUncertainty` not defined
-

Notice that while an interval estimate would have grown to allow all extremes together, this form keeps the trajectories alive, allowing them to fall back to the mode, which decreases the true uncertainty. This is thus a good explanation as to why general methods will overestimate uncertainty.

Adjoints of Uncertainty and the Koopman Operator

Everything that we've demonstrated here so far can be thought of as "forward mode uncertainty quantification". For every example we have constructed a method such that, for a known probability distribution in $x$, we build the probability distribution of the output of the program, and then compute quantities from that. On a dynamical system this pushforward of a measure is given by the Frobenius-Perron operator. With a pushforward operator $P$ and an initial uncertainty density $f$, we can represent calculating the expected value of some cost function on the solution via:

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_{S(A)} P f(x) g(x) dx \]

where $S$ is the program, i.e. $S(A)$ is the set of points obtained by pushing every value of $A$ through our program, and $P f(x)$ is the pushforward operator applied to the probability distribution. What this means is that, to calculate an expectation on the output of our program, such as the mean value of the ODE's solution given uncertainty in the parameters, we can push forward the probability distribution to construct $Pf$ and then, on this distribution, calculate the expected value of some cost function $g$ on the solution.

The problem, as seen earlier, is that pushing forward entire probability distributions is a fairly expensive process. We can instead think about doing the adjoint to this cost function, i.e. pulling back the cost function and computing it on the initial density. In terms of inner product notation, this would be doing:

\[ \langle Pf,g \rangle = \langle f, Ug \rangle \]

meaning $U$ is the adjoint operator to the pushforward $P$. This operator is known as the Koopman operator. There are many properties one can use about the Koopman operator, one special property being that it is a linear operator on the space of observables, but it also gives a nice expression for computing uncertainty expectations. Using the Koopman operator, we can rewrite the expectation as:

\[ \mathbb{E}[g(x)|X \sim Pf] = \mathbb{E}[Ug(x)|X \sim f] \]

or perform the integral on the pullback of the cost function, i.e.

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_A Ug(x) f(x) dx \]
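As a small illustration (a stand-in program and observable, both hypothetical), the pullback view simply evaluates $Ug = g \circ S$ on samples of the input density, without ever constructing the output distribution:

using Statistics
S(x) = sin(3x) + x^2               # stand-in "program"
g(u) = u^2                         # cost/observable on the program's output
Ug(x) = g(S(x))                    # the Koopman pullback of g through S
xs = randn(100_000)                # samples of X ~ f = N(0,1)
mean(Ug.(xs))                      # estimates E[g(x) | X ~ Pf] without forming Pf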

In images it looks like:

Koopman vs FP

This expression gives us a fast way to compute expectations on the program output without having to compute the full uncertainty distribution on the output. This can thus be used for optimization under uncertainty, i.e. the optimization of loss functions with respect to expectations of the program's output under the assumption of given input uncertainty distributions. For more information, see The Koopman Expectation: An Operator Theoretic Method for Efficient Analysis and Optimization of Uncertain Hybrid Dynamical Systems.

\ No newline at end of file +

Notice that while an interval estimate would have grown to allow all extremes together, this form keeps the trajectories alive, allowing them to fall back to the mode, which decreases the true uncertainty. This is thus a good explanation as to why general methods will overestimate uncertainty.

Adjoints of Uncertainty and the Koopman Operator

Everything that we've demonstrated here so far can be thought of as "forward mode uncertainty quantification". For every example we have constructed a method such that, for a known probability distribution in $x$, we build the probability distribution of the output of the program, and then compute quantities from that. On a dynamical system this pushforward of a measure is given by the Frobenius-Perron operator. With a pushforward operator $P$ and an initial uncertainty density $f$, we can represent calculating the expected value of some cost function on the solution via:

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_{S(A)} P f(x) g(x) dx \]

where $S$ is the program, i.e. $S(A)$ is the set of points obtained by pushing every value of $A$ through our program, and $P f(x)$ is the pushforward operator applied to the probability distribution. What this means is that, to calculate an expectation on the output of our program, such as the mean value of the ODE's solution given uncertainty in the parameters, we can push forward the probability distribution to construct $Pf$ and then, on this distribution, calculate the expected value of some cost function $g$ on the solution.

The problem, as seen earlier, is that pushing forward entire probability distributions is a fairly expensive process. We can instead think about doing the adjoint to this cost function, i.e. pulling back the cost function and computing it on the initial density. In terms of inner product notation, this would be doing:

\[ \langle Pf,g \rangle = \langle f, Ug \rangle \]

meaning $U$ is the adjoint operator to the pushforward $P$. This operator is known as the Koopman operator. There are many properties one can use about the Koopman operator, one special property being that it is a linear operator on the space of observables, but it also gives a nice expression for computing uncertainty expectations. Using the Koopman operator, we can rewrite the expectation as:

\[ \mathbb{E}[g(x)|X \sim Pf] = \mathbb{E}[Ug(x)|X \sim f] \]

or perform the integral on the pullback of the cost function, i.e.

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_A Ug(x) f(x) dx \]

In images it looks like:

Koopman vs FP

This expression gives us a fast way to compute expectations on the program output without having to compute the full uncertainty distribution on the output. This can thus be used for optimization under uncertainty, i.e. the optimization of loss functions with respect to expectations of the program's output under the assumption of given input uncertainty distributions. For more information, see The Koopman Expectation: An Operator Theoretic Method for Efficient Analysis and Optimization of Uncertain Hybrid Dynamical Systems.

\ No newline at end of file diff --git a/course/index.html b/course/index.html index dc9842a1..dc53d8f7 100644 --- a/course/index.html +++ b/course/index.html @@ -1 +1 @@ - Course Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Course Overview

Syllabus

Pre-recorded online lectures are available to complement the lecture notes

Prerequisites: While this course will be mixing ideas from high performance computing, numerical analysis, and machine learning, no one in the course is expected to have covered all of these topics before. Understanding of calculus, linear algebra, and programming is essential. 18.337 is a graduate-level subject so mathematical maturity and the ability to learn from primary literature is necessary. Problem sets will involve use of Julia, a Matlab-like environment (little or no prior experience required; you will learn as you go).

Textbook & Other Reading: There is no textbook for this course or the field of scientific machine learning. Some helpful resources are Hairer and Wanner's Solving Ordinary Differential Equations I & II and Gilbert Strang's Computational Science and Engineering. Much of the reading will come in the form of primary literature from journal articles posted here.

Schedule of Topics

Each topic is a group of three pieces: a numerical method, a performance-engineering technique, and a scientific application. These three together form a complete usable program that is demonstrated.

  • The basics of scientific simulators (Week 1-2)

    • What is Scientific Machine Learning?

    • Optimization of serial code.

    • Introduction to discrete and continuous dynamical systems.

  • Introduction to Parallel Computing (Week 2-3)

    • Forms of parallelism and applications

    • Parallelizing differential equation solvers

    • Optimal local parallelism via multithreading

    • Linear Algebra libraries you should know

Homework 1: Parallelized dynamical system simulations and ODE integrators

  • Continuous Dynamics (Week 4)

    • Ordinary differential equations as the language for ecology, Newtonian mechanics, and beyond.

    • Numerical methods for non-stiff ordinary differential equations

    • Definition of stiffness

    • Efficiently solving stiff ordinary differential equations

    • Stiff differential equations arising from biochemical interactions in developmental biology and ecology

    • Utilizing type systems and generic algorithms as a mathematical tool

    • Forward-mode automatic differentiation for solving f(x)=0

    • Matrix coloring and sparse differentiation

Homework 2: Parameter estimation in dynamical systems and overhead of parallelism

  • Inverse problems and Differentiable Programming (Week 6)

    • Definition of inverse problems with applications to clinical pharmacology and smartgrid optimization

    • Adjoint methods for fast gradients

    • Automated adjoints through reverse-mode automatic differentiation (backpropagation)

    • Adjoints of differential equations

    • Using neural ordinary differential equations as a memory-efficient RNN for deep learning

  • Neural networks, and array-based parallelism (Week 8)

    • Cache optimization in numerical linear algebra

    • Parallelism through array operations

    • How to optimize algorithms for GPUs

  • Distributed parallel computing (Jeremy Kepner: Weeks 7-8)

    • Forms of parallelism

    • Using distributed computing vs multithreading

    • Message passing and deadlock

    • Map-Reduce as a framework for distributed parallelism

    • Implementing distributed parallel algorithms with MPI

Homework 3: Training neural ordinary differential equations (with GPUs)

  • Physics-Informed Neural Networks and Neural Differential Equations (Week 9-10)

    • Automatic discovery of differential equations

    • Solving differential equations with neural networks

    • Discretizations of PDEs

    • Basics of neural networks and definitions

    • The relationship between convolutional neural networks and PDEs

  • Probabilistic Programming, AKA Bayesian Estimation on Programs (Week 10-11)

    • The connection between optimization and Bayesian methods: Bayesian posteriors vs MAP optimization

    • Introduction to Markov-Chain Monte Carlo methods

    • Hamiltonian Monte Carlo is just a symplectic ODE solver

    • Uncertainty quantification of parameter estimates through posteriors

  • Globalizing the understanding of models (Week 11-12)

    • Global sensitivity analysis

    • Global optimization

    • Surrogate Modeling

    • Uncertainty Quantification

Homeworks

Lecture Summaries and Handouts

Note that lectures are broken down by topic, not by day. Some lectures span more than one class day, others less.

\ No newline at end of file + Course Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Course Overview

Syllabus

Pre-recorded online lectures are available to complement the lecture notes

Prerequisites: While this course will be mixing ideas from high performance computing, numerical analysis, and machine learning, no one in the course is expected to have covered all of these topics before. Understanding of calculus, linear algebra, and programming is essential. 18.337 is a graduate-level subject so mathematical maturity and the ability to learn from primary literature is necessary. Problem sets will involve use of Julia, a Matlab-like environment (little or no prior experience required; you will learn as you go).

Textbook & Other Reading: There is no textbook for this course or the field of scientific machine learning. Some helpful resources are Hairer and Wanner's Solving Ordinary Differential Equations I & II and Gilbert Strang's Computational Science and Engineering. Much of the reading will come in the form of primary literature from journal articles posted here.

Schedule of Topics

Each topic is a group of three pieces: a numerical method, a performance-engineering technique, and a scientific application. These three together form a complete usable program that is demonstrated.

  • The basics of scientific simulators (Week 1-2)

    • What is Scientific Machine Learning?

    • Optimization of serial code.

    • Introduction to discrete and continuous dynamical systems.

  • Introduction to Parallel Computing (Week 2-3)

    • Forms of parallelism and applications

    • Parallelizing differential equation solvers

    • Optimal local parallelism via multithreading

    • Linear Algebra libraries you should know

Homework 1: Parallelized dynamical system simulations and ODE integrators

  • Continuous Dynamics (Week 4)

    • Ordinary differential equations as the language for ecology, Newtonian mechanics, and beyond.

    • Numerical methods for non-stiff ordinary differential equations

    • Definition of stiffness

    • Efficiently solving stiff ordinary differential equations

    • Stiff differential equations arising from biochemical interactions in developmental biology and ecology

    • Utilizing type systems and generic algorithms as a mathematical tool

    • Forward-mode automatic differentiation for solving f(x)=0

    • Matrix coloring and sparse differentiation

Homework 2: Parameter estimation in dynamical systems and overhead of parallelism

  • Inverse problems and Differentiable Programming (Week 6)

    • Definition of inverse problems with applications to clinical pharmacology and smartgrid optimization

    • Adjoint methods for fast gradients

    • Automated adjoints through reverse-mode automatic differentiation (backpropagation)

    • Adjoints of differential equations

    • Using neural ordinary differential equations as a memory-efficient RNN for deep learning

  • Neural networks, and array-based parallelism (Week 8)

    • Cache optimization in numerical linear algebra

    • Parallelism through array operations

    • How to optimize algorithms for GPUs

  • Distributed parallel computing (Jeremy Kepner: Weeks 7-8)

    • Forms of parallelism

    • Using distributed computing vs multithreading

    • Message passing and deadlock

    • Map-Reduce as a framework for distributed parallelism

    • Implementing distributed parallel algorithms with MPI

Homework 3: Training neural ordinary differential equations (with GPUs)

  • Physics-Informed Neural Networks and Neural Differential Equations (Week 9-10)

    • Automatic discovery of differential equations

    • Solving differential equations with neural networks

    • Discretizations of PDEs

    • Basics of neural networks and definitions

    • The relationship between convolutional neural networks and PDEs

  • Probabilistic Programming, AKA Bayesian Estimation on Programs (Week 10-11)

    • The connection between optimization and Bayesian methods: Bayesian posteriors vs MAP optimization

    • Introduction to Markov-Chain Monte Carlo methods

    • Hamiltonian Monte Carlo is just a symplectic ODE solver

    • Uncertainty quantification of parameter estimates through posteriors

  • Globalizing the understanding of models (Week 11-12)

    • Global sensitivity analysis

    • Global optimization

    • Surrogate Modeling

    • Uncertainty Quantification

Homeworks

Lecture Summaries and Handouts

Note that lectures are broken down by topic, not by day. Some lectures span more than one class day, others less.

\ No newline at end of file diff --git a/feed.xml b/feed.xml index 67cfa692..2e2d06e2 100644 --- a/feed.xml +++ b/feed.xml @@ -31,7 +31,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -50,7 +50,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -69,7 +69,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -88,7 +88,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -107,7 +107,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -126,7 +126,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -145,7 +145,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -164,7 +164,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -183,7 +183,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -202,7 +202,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -221,7 +221,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -240,7 +240,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -259,7 +259,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -278,7 +278,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -297,7 +297,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -316,7 +316,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -335,7 +335,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -354,7 +354,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -373,7 +373,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -392,7 +392,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -411,7 +411,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 @@ -430,7 +430,7 @@ - Sat, 15 Jul 2023 00:00:00 +0000 + Sun, 29 Oct 2023 00:00:00 +0000 diff --git a/homework/01/index.html b/homework/01/index.html index 682edb6a..67ce720b 100644 --- a/homework/01/index.html +++ b/homework/01/index.html @@ -1 +1 @@ - Homework 1 - MIT Parallel Computing and Scientific Machine Learning (SciML)

Homework 1, Parallelized Dynamics

Chris Rackauckas
September 15th, 2020

Due October 1st, 2020 at midnight EST.

Homework 1 is a chance to get some experience implementing discrete dynamical systems techniques in a way that is parallelized, and a time to understand the fundamental behavior of the bottleneck algorithms in scientific computing.

Problem 1: A Ton of New Facts on Newton

In lecture 4 we looked at the properties of discrete dynamical systems to see that running many systems for infinitely many steps would go to a steady state. This process is used as a numerical method known as fixed point iteration to solve for the steady state of systems $x_{n+1} = f(x_{n})$. Under a transformation (which we will do in this homework), it can be used to solve rootfinding problems $f(x) = 0$ to solve for $x$.

In this problem we will look into Newton's method. Newton's method is the dynamical system defined by the update process:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_n)\right)^{-1} g(x_n) \]

For these problems, assume that $\frac{dg}{dx}$ is non-singular. We will prove a few properties to show why, in practice, Newton methods are preferred for quickly calculating the steady state.

Part 1

Show that if $x^\ast$ is a steady state of the equation, then $g(x^\ast) = 0$.

Part 2

Take a look at the Quasi-Newton approximation:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

for some fixed $x_0$. Derive the stability of the Quasi-Newton approximation in the form of a matrix whose eigenvalues need to be constrained. Use this to argue that if $x_0$ is sufficiently close to $x^\ast$ then the steady state is a stable (attracting) steady state.

Part 3

Relaxed Quasi-Newton is the method:

\[ x_{n+1} = x_n - \alpha \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

Argue that for some sufficiently small $\alpha$ that the Quasi-Newton iterations will be stable if the eigenvalues of $(\left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n))^\prime$ are all positive for every $x$.

(Technically, these assumptions can be greatly relaxed, but weird cases arise. When $x \in \mathbb{C}$, this holds except on some set of Lebesgue measure zero. Feel free to explore this.)

Part 4

Fixed point iteration is the dynamical system

\[ x_{n+1} = g(x_n) \]

which converges to $g(x)=x$.

  1. What is a small change to the dynamical system that could be done such that $g(x)=0$ is the steady state?

  2. How can you change the $\left(\frac{dg}{dx}(x_0)\right)^{-1}$ term from the Quasi-Newton iteration to get a method equivalent to fixed point iteration? What does this imply about the difference in stability between Quasi-Newton and fixed point iteration if $\frac{dg}{dx}$ has large eigenvalues?

Problem 2: The Root of all Problems

In this problem we will practice writing fast and type-generic Julia code by producing an algorithm that will compute the quantile of any probability distribution.

Part 1

Many problems can be interpreted as a rootfinding problem. For example, let's take a look at a problem in statistics. Let $X$ be a random variable with a cumulative distribution function (CDF) of $cdf(x)$. Recall that the CDF is a monotonically increasing function in $[0,1]$ which is the total probability of $X < x$. The $y$th quantile of $X$ is the value $x$ at which $X$ has probability $y$ of being less than $x$. Interpret the problem of computing an arbitrary quantile $y$ as a rootfinding problem, and use Newton's method to write an algorithm for computing $x$.

(Hint: Recall that $cdf^{\prime}(x) = pdf(x)$, the probability density function.)

Part 2

Use the types from Distributions.jl to write a function my_quantile(y,d) which uses multiple dispatch to compute the $y$th quantile for any UnivariateDistribution d from Distributions.jl. Test your function on Gamma(5, 1), Normal(0, 1), and Beta(2, 4) against the Distributions.quantile function built into the library.

(Hint: Have a keyword argument for $x_0$, and let its default be the mean or median of the distribution.)

Problem 3: Bifurcating Data for Parallelism

In this problem we will write code for efficient generation of the bifurcation diagram of the logistic equation.

Part 1

The logistic equation is the dynamical system given by the update relation:

\[ x_{n+1} = rx_n (1-x_n) \]

where $r$ is some parameter. Write a function which iterates the equation from $x_0 = 0.25$ enough times to be sufficiently close to its long-term behavior (400 iterations) and samples 150 points from the steady state attractor (i.e. output iterations 401:550) as a function of $r$, and mutates some vector as a solution, i.e. calc_attractor!(out,f,p,num_attract=150;warmup=400).

Test your function with $r = 2.9$. Double check that your function computes the correct result by calculating the analytical steady state value.

Part 2

The bifurcation plot shows how a steady state changes as a parameter changes. Compute the long-term result of the logistic equation at the values of r = 2.9:0.001:4, and plot the steady state values for each $r$ as an r x steady_attractor scatter plot. You should get a very bizarrely awesome picture, the bifurcation graph of the logistic equation.

(Hint: Generate a single matrix for the attractor values, and use calc_attractor! on views of columns for calculating the output, or inline the calc_attractor! computation directly onto the matrix, or even give calc_attractor! an input for what column to modify.)

Part 3

Multithread your bifurcation graph generator by performing different steady state calculations on different threads. Does your timing improve? Why? Be careful and check to make sure you have more than 1 thread!

Part 4

Multiprocess your bifurcation graph generator first by using pmap, and then by using @distributed. Does your timing improve? Why? Be careful to add processes before doing the distributed call.

(Note: You may need to change your implementation around to be allocating differently in order for it to be compatible with multiprocessing!)

Part 5

Which method is the fastest? Why?

\ No newline at end of file + Homework 1 - MIT Parallel Computing and Scientific Machine Learning (SciML)

Homework 1, Parallelized Dynamics

Chris Rackauckas
September 15th, 2020

Due October 1st, 2020 at midnight EST.

Homework 1 is a chance to get some experience implementing discrete dynamical systems techniques in a way that is parallelized, and a time to understand the fundamental behavior of the bottleneck algorithms in scientific computing.

Problem 1: A Ton of New Facts on Newton

In lecture 4 we looked at the properties of discrete dynamical systems to see that running many systems for infinitely many steps would go to a steady state. This process is used as a numerical method known as fixed point iteration to solve for the steady state of systems $x_{n+1} = f(x_{n})$. Under a transformation (which we will do in this homework), it can be used to solve rootfinding problems $f(x) = 0$ to solve for $x$.

In this problem we will look into Newton's method. Newton's method is the dynamical system defined by the update process:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_n)\right)^{-1} g(x_n) \]

For these problems, assume that $\frac{dg}{dx}$ is non-singular. We will prove a few properties to show why, in practice, Newton methods are preferred for quickly calculating the steady state.

Part 1

Show that if $x^\ast$ is a steady state of the equation, then $g(x^\ast) = 0$.

Part 2

Take a look at the Quasi-Newton approximation:

\[ x_{n+1} = x_n - \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

for some fixed $x_0$. Derive the stability of the Quasi-Newton approximation in the form of a matrix whose eigenvalues need to be constrained. Use this to argue that if $x_0$ is sufficiently close to $x^\ast$ then the steady state is a stable (attracting) steady state.

Part 3

Relaxed Quasi-Newton is the method:

\[ x_{n+1} = x_n - \alpha \left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n) \]

Argue that for some sufficiently small $\alpha$ that the Quasi-Newton iterations will be stable if the eigenvalues of $(\left(\frac{dg}{dx}(x_0)\right)^{-1} g(x_n))^\prime$ are all positive for every $x$.

(Technically, these assumptions can be greatly relaxed, but weird cases arise. When $x \in \mathbb{C}$, this holds except on some set of Lebesgue measure zero. Feel free to explore this.)

Part 4

Fixed point iteration is the dynamical system

\[ x_{n+1} = g(x_n) \]

which converges to $g(x)=x$.

  1. What is a small change to the dynamical system that could be done such that $g(x)=0$ is the steady state?

  2. How can you change the $\left(\frac{dg}{dx}(x_0)\right)^{-1}$ term from the Quasi-Newton iteration to get a method equivalent to fixed point iteration? What does this imply about the difference in stability between Quasi-Newton and fixed point iteration if $\frac{dg}{dx}$ has large eigenvalues?

Problem 2: The Root of all Problems

In this problem we will practice writing fast and type-generic Julia code by producing an algorithm that will compute the quantile of any probability distribution.

Part 1

Many problems can be interpreted as a rootfinding problem. For example, let's take a look at a problem in statistics. Let $X$ be a random variable with a cumulative distribution function (CDF) of $cdf(x)$. Recall that the CDF is a monotonically increasing function in $[0,1]$ which is the total probability of $X < x$. The $y$th quantile of $X$ is the value $x$ at which $X$ has probability $y$ of being less than $x$. Interpret the problem of computing an arbitrary quantile $y$ as a rootfinding problem, and use Newton's method to write an algorithm for computing $x$.

(Hint: Recall that $cdf^{\prime}(x) = pdf(x)$, the probability density function.)

Part 2

Use the types from Distributions.jl to write a function my_quantile(y,d) which uses multiple dispatch to compute the $y$th quantile for any UnivariateDistribution d from Distributions.jl. Test your function on Gamma(5, 1), Normal(0, 1), and Beta(2, 4) against the Distributions.quantile function built into the library.

(Hint: Have a keyword argument for $x_0$, and let its default be the mean or median of the distribution.)

Problem 3: Bifurcating Data for Parallelism

In this problem we will write code for efficient generation of the bifurcation diagram of the logistic equation.

Part 1

The logistic equation is the dynamical system given by the update relation:

\[ x_{n+1} = rx_n (1-x_n) \]

where $r$ is some parameter. Write a function which iterates the equation from $x_0 = 0.25$ enough times to be sufficiently close to its long-term behavior (400 iterations) and samples 150 points from the steady state attractor (i.e. output iterations 401:550) as a function of $r$, and mutates some vector as a solution, i.e. calc_attractor!(out,f,p,num_attract=150;warmup=400).

Test your function with $r = 2.9$. Double check that your function computes the correct result by calculating the analytical steady state value.

Part 2

The bifurcation plot shows how a steady state changes as a parameter changes. Compute the long-term result of the logistic equation at the values of r = 2.9:0.001:4, and plot the steady state values for each $r$ as an r x steady_attractor scatter plot. You should get a very bizarrely awesome picture, the bifurcation graph of the logistic equation.

(Hint: Generate a single matrix for the attractor values, and use calc_attractor! on views of columns for calculating the output, or inline the calc_attractor! computation directly onto the matrix, or even give calc_attractor! an input for what column to modify.)

Part 3

Multithread your bifurcation graph generator by performing different steady state calculations on different threads. Does your timing improve? Why? Be careful and check to make sure you have more than 1 thread!

Part 4

Multiprocess your bifurcation graph generator first by using pmap, and then by using @distributed. Does your timing improve? Why? Be careful to add processes before doing the distributed call.

(Note: You may need to change your implementation around to be allocating differently in order for it to be compatible with multiprocessing!)

Part 5

Which method is the fastest? Why?

\ No newline at end of file diff --git a/homework/02/index.html b/homework/02/index.html index 45e35294..0cf0bf35 100644 --- a/homework/02/index.html +++ b/homework/02/index.html @@ -11,4 +11,4 @@ module load julia-latest module load mpi/mpich-x86_64 -mpirun julia mycode.jl

to receive two cores on two nodes. Recreate the bandwidth vs message plots and the interpretation. Does the fact that the nodes are physically disconnected cause a substantial difference?

\ No newline at end of file +mpirun julia mycode.jl

to receive two cores on two nodes. Recreate the bandwidth vs message plots and the interpretation. Does the fact that the nodes are physically disconnected cause a substantial difference?

\ No newline at end of file diff --git a/homework/03/index.html b/homework/03/index.html index 24cb6703..1a640c46 100644 --- a/homework/03/index.html +++ b/homework/03/index.html @@ -1 +1 @@ - Homework 3 - MIT Parallel Computing and Scientific Machine Learning (SciML)

Neural Ordinary Differential Equation Adjoints

Chris Rackauckas
November 20th, 2020

In this homework, we will write an implementation of neural ordinary differential equations from scratch. You may use the DifferentialEquations.jl ODE solver, but not the adjoint sensitivities functionality. Optionally, a second problem is to add GPU support to your implementation.

Due December 9th, 2020 at midnight.

Please email the results to 18337.mit.psets@gmail.com.

Problem 1: Neural ODE from Scratch

In this problem we will work through the development of a neural ODE.

Part 1: Gradients as vjps

Use the definition of the pullback as a vector-Jacobian product (vjp) to show that $B_f^x(1) = \left( \nabla f(x) \right)^{T}$ for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$.

(Hint: if you put 1 into the pullback, what kind of function is it? What does the Jacobian look like?)

Part 2: Backpropagation of a neural network

Implement a simple $NN: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ neural network

\[ NN(u;W_i,b_i) = W_2 \tanh.(W_1 u + b_1) + b_2 \]

where $W_1$ is $50 \times 2$, $b_1$ is length 50, $W_2$ is $2 \times 50$, and $b_2$ is length 2. Implement the pullback of the neural network: $B_{NN}^{u,W_i,b_i}(y)$ to calculate the derivative of the neural network with respect to each of these inputs. Check for correctness by using ForwardDiff.jl to calculate the gradient.

Part 3: Implementing an ODE adjoint

The adjoint of an ODE can be described as the set of vector equations:

\[ \begin{align} u' &= f(u,p,t)\\ \end{align} \]

forward, and then

\[ \begin{align} \lambda' &= -\lambda^\ast \frac{\partial f}{\partial u}\\ \mu' &= -\lambda^\ast \frac{\partial f}{\partial p}\\ \end{align} \]

solved in reverse time from $T$ to $0$ for some cost function $C(p)$. For this problem, we will use the L2 loss function.

Note that $\mu(T) = 0$ and $\lambda(T) = \frac{\partial C}{\partial u(T)}$. This is written in the form where the only data point is at time $T$. If that is not the case, the reverse solve needs to add the jump $\frac{\partial C}{\partial u(t_i)}$ to $\lambda$ at each data point $u(t_i)$. Use this example for how to add these jumps to the equation.

Using this formulation of the adjoint, it holds that $\mu(0) = \frac{\partial C}{\partial p}$, and thus solving these ODEs in reverse gives the solution for the gradient as a part of the system at time zero.

Notice that $B_f^u(\lambda) = \lambda^\ast \frac{\partial f}{\partial u}$ and similarly for $\mu$. Implement an adjoint calculation for a neural ordinary differential equation where

\[ u' = NN(u) \]

from above. Solve the ODE forwards using OrdinaryDiffEq.jl's Tsit5() integrator, then use the interpolation from the forward pass for the u values of the backpass and solve.

(Note: you will want to double check this gradient by using something like ForwardDiff! Start with only measuring the datapoint at the end, then try multiple data points.)

Part 4: Training the neural ODE

Generate data from the ODE $u' = Au$ where A = [-0.1 2.0; -2.0 -0.1] at t=0.0:0.1:1.0 (use saveat) with $u(0) = [2,0]$. Define the cost function C(θ) to be the Euclidean distance between the neural ODE's solution and the data. Optimize this cost function by using gradient descent where the gradient is your adjoint method's output.

(Note: calculate the cost and the gradient at the same time by using the forward pass to calculate the cost, and then use it in the adjoint for the interpolation. Note that you should not use saveat in the forward pass then, because otherwise the interpolation is linear. Instead, post-interpolate the data points.)

(Optional) Problem 2: Array-Based GPU Computing

If you have access to a GPU, you may wish to try the following.

Part 1: GPU Neural Network

Change your neural network to be GPU-accelerated by using CuArrays.jl for the underlying array types.

Part 2: GPU Neural ODE

Change the initial condition of the ODE solves to a CuArray to make your neural ODE GPU-accelerated.

\ No newline at end of file + Homework 3 - MIT Parallel Computing and Scientific Machine Learning (SciML)

Neural Ordinary Differential Equation Adjoints

Chris Rackauckas
November 20th, 2020

In this homework, we will write an implementation of neural ordinary differential equations from scratch. You may use the DifferentialEquations.jl ODE solver, but not the adjoint sensitivities functionality. Optionally, a second problem is to add GPU support to your implementation.

Due December 9th, 2020 at midnight.

Please email the results to 18337.mit.psets@gmail.com.

Problem 1: Neural ODE from Scratch

In this problem we will work through the development of a neural ODE.

Part 1: Gradients as vjps

Use the definition of the pullback as a vector-Jacobian product (vjp) to show that $B_f^x(1) = \left( \nabla f(x) \right)^{T}$ for a function $f: \mathbb{R}^n \rightarrow \mathbb{R}$.

(Hint: if you put 1 into the pullback, what kind of function is it? What does the Jacobian look like?)

Part 2: Backpropagation of a neural network

Implement a simple $NN: \mathbb{R}^2 \rightarrow \mathbb{R}^2$ neural network

\[ NN(u;W_i,b_i) = W_2 \tanh.(W_1 u + b_1) + b_2 \]

where $W_1$ is $50 \times 2$, $b_1$ is length 50, $W_2$ is $2 \times 50$, and $b_2$ is length 2. Implement the pullback of the neural network: $B_{NN}^{u,W_i,b_i}(y)$ to calculate the derivative of the neural network with respect to each of these inputs. Check for correctness by using ForwardDiff.jl to calculate the gradient.

Part 3: Implementing an ODE adjoint

The adjoint of an ODE can be described as the set of vector equations:

\[ \begin{align} u' &= f(u,p,t)\\ \end{align} \]

forward, and then

\[ \begin{align} \lambda' &= -\lambda^\ast \frac{\partial f}{\partial u}\\ \mu' &= -\lambda^\ast \frac{\partial f}{\partial p}\\ \end{align} \]

solved in reverse time from $T$ to $0$ for some cost function $C(p)$. For this problem, we will use the L2 loss function.

Note that $\mu(T) = 0$ and $\lambda(T) = \frac{\partial C}{\partial u(T)}$. This is written in the form where the only data point is at time $T$. If that is not the case, the reverse solve needs to add the jump $\frac{\partial C}{\partial u(t_i)}$ to $\lambda$ at each data point $u(t_i)$. Use this example for how to add these jumps to the equation.

Using this formulation of the adjoint, it holds that $\mu(0) = \frac{\partial C}{\partial p}$, and thus solving these ODEs in reverse gives the solution for the gradient as a part of the system at time zero.

Notice that $B_f^u(\lambda) = \lambda^\ast \frac{\partial f}{\partial u}$ and similarly for $\mu$. Implement an adjoint calculation for a neural ordinary differential equation where

\[ u' = NN(u) \]

from above. Solve the ODE forwards using OrdinaryDiffEq.jl's Tsit5() integrator, then use the interpolation from the forward pass for the u values of the backpass and solve.

(Note: you will want to double check this gradient by using something like ForwardDiff! Start with only measuring the datapoint at the end, then try multiple data points.)

Part 4: Training the neural ODE

Generate data from the ODE $u' = Au$ where A = [-0.1 2.0; -2.0 -0.1] at t=0.0:0.1:1.0 (use saveat) with $u(0) = [2,0]$. Define the cost function C(θ) to be the Euclidean distance between the neural ODE's solution and the data. Optimize this cost function by using gradient descent where the gradient is your adjoint method's output.

(Note: calculate the cost and the gradient at the same time by using the forward pass to calculate the cost, and then use it in the adjoint for the interpolation. Note that you should not use saveat in the forward pass then, because otherwise the interpolation is linear. Instead, post-interpolate the data points.)

(Optional) Problem 2: Array-Based GPU Computing

If you have access to a GPU, you may wish to try the following.

Part 1: GPU Neural Network

Change your neural network to be GPU-accelerated by using CuArrays.jl for the underlying array types.

Part 2: GPU Neural ODE

Change the initial condition of the ODE solves to a CuArray to make your neural ODE GPU-accelerated.
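
A minimal sketch of that change, assuming CuArrays.jl (the CUDA array package referenced by this course) and Float32 data, which GPUs strongly prefer:

  using CuArrays, OrdinaryDiffEq

  W1, b1 = cu(rand(Float32, 50, 2)), cu(rand(Float32, 50))
  W2, b2 = cu(rand(Float32, 2, 50)), cu(rand(Float32, 2))
  NN(u) = W2 * tanh.(W1 * u .+ b1) .+ b2

  u0  = cu(Float32[2.0, 0.0])   # a CuArray initial condition moves the whole solve to the GPU
  sol = solve(ODEProblem((u, p, t) -> NN(u), u0, (0.0f0, 1.0f0)), Tsit5())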

\ No newline at end of file diff --git a/homework/index.html b/homework/index.html index 36c96a56..011d04c5 100644 --- a/homework/index.html +++ b/homework/index.html @@ -1 +1 @@ - Homework Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Homework Overview

\ No newline at end of file + Homework Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Homework Overview

\ No newline at end of file diff --git a/index.html b/index.html index e51a68c3..053e56af 100644 --- a/index.html +++ b/index.html @@ -1 +1 @@ - Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications - MIT Parallel Computing and Scientific Machine Learning (SciML)

Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications

This book is a compilation of lecture notes from the MIT Course 18.337J/6.338J: Parallel Computing and Scientific Machine Learning. Links to the old notes at https://mitmath.github.io/18337 will redirect here.

This repository is meant to be a live document, continuously updated to add the latest details on methods from the field of scientific machine learning and the latest techniques for high-performance computing.

Note

You can help improve this course!
Please report mistakes you find in the content.
Similarly, suggest improvements to the organization and navigation of this site.

Introduction to Parallel Computing and Scientific Machine Learning

There are two main branches of technical computing: machine learning and scientific computing. Machine learning has received a lot of hype over the last decade, with techniques such as convolutional neural networks and t-SNE nonlinear dimensionality reductions powering a new generation of data-driven analytics. On the other hand, many scientific disciplines carry on with large-scale modeling through differential equations, looking at stochastic differential equations and partial differential equations describing scientific laws.

However, there has been a recent convergence of the two disciplines. This field, scientific machine learning, has been showcasing results like how partial differential equation simulations can be accelerated with neural networks. New methods, such as probabilistic and differentiable programming, have started to be developed specifically for enhancing the tools of this domain. However, the techniques in this field combine two huge areas of computational and numerical practice, meaning that the methods are sufficiently complex. How do you backpropagate an ODE defined by neural networks? How do you perform unsupervised learning of a scientific simulator?

In this class we will dig into the methods and understand what they do, why they were made, and thus how to integrate numerical methods across fields to accentuate their pros while mitigating their cons. This class will be a survey of the numerical techniques, showcasing how many disciplines are doing the same thing under different names, and using a common mathematical language to derive efficient routines which capture both data-driven and mechanistic-based modeling.

However, these methods will quickly run into a scaling issue if naively coded. To handle this problem, everything will have a focus on performance-engineering. We will start by focusing on algorithms that are inherently serial and learn to optimize serial code. Then we will showcase how logic-heavy code can be parallelized through multithreading and distributed computing techniques like MPI, while direct mathematical descriptions can be parallelized through GPU computing.

The final part of the course will be a unique project which pulls together these techniques. Since this is a new field, students will be exposed to the "low hanging fruit" and will be directed towards an area in which they can make a quick impact. For the final project, students will team up to solve a new problem in the field of scientific machine learning, and receive help in writing up a publication-quality analysis about their work.

\ No newline at end of file + Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications - MIT Parallel Computing and Scientific Machine Learning (SciML)

Parallel Computing and Scientific Machine Learning (SciML): Methods and Applications

This book is a compilation of lecture notes from the MIT Course 18.337J/6.338J: Parallel Computing and Scientific Machine Learning. Links to the old notes at https://mitmath.github.io/18337 will redirect here.

This repository is meant to be a live document, continuously updated to add the latest details on methods from the field of scientific machine learning and the latest techniques for high-performance computing.

Note

You can help improve this course!
Please report mistakes you find in the content.
Similarly, suggest improvements to the organization and navigation of this site.

Introduction to Parallel Computing and Scientific Machine Learning

There are two main branches of technical computing: machine learning and scientific computing. Machine learning has received a lot of hype over the last decade, with techniques such as convolutional neural networks and t-SNE nonlinear dimensionality reductions powering a new generation of data-driven analytics. On the other hand, many scientific disciplines carry on with large-scale modeling through differential equations, looking at stochastic differential equations and partial differential equations describing scientific laws.

However, there has been a recent convergence of the two disciplines. This field, scientific machine learning, has been showcasing results like how partial differential equation simulations can be accelerated with neural networks. New methods, such as probabilistic and differentiable programming, have started to be developed specifically for enhancing the tools of this domain. However, the techniques in this field combine two huge areas of computational and numerical practice, meaning that the methods are sufficiently complex. How do you backpropagate an ODE defined by neural networks? How do you perform unsupervised learning of a scientific simulator?

In this class we will dig into the methods and understand what they do, why they were made, and thus how to integrate numerical methods across fields to accentuate their pros while mitigating their cons. This class will be a survey of the numerical techniques, showcasing how many disciplines are doing the same thing under different names, and using a common mathematical language to derive efficient routines which capture both data-driven and mechanistic-based modeling.

However, these methods will quickly run into a scaling issue if naively coded. To handle this problem, everything will have a focus on performance-engineering. We will start by focusing on algorithms that are inherently serial and learn to optimize serial code. Then we will showcase how logic-heavy code can be parallelized through multithreading and distributed computing techniques like MPI, while direct mathematical descriptions can be parallelized through GPU computing.

The final part of the course will be a unique project which pulls together these techniques. Since this is a new field, students will be exposed to the "low hanging fruit" and will be directed towards an area in which they can make a quick impact. For the final project, students will team up to solve a new problem in the field of scientific machine learning, and receive help in writing up a publication-quality analysis about their work.

\ No newline at end of file diff --git a/lectures/index.html b/lectures/index.html index 4b523ad8..2dab76a1 100644 --- a/lectures/index.html +++ b/lectures/index.html @@ -1 +1 @@ - Lecture Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Lecture Overview

  1. Lecture 1: Introduction and Syllabus
    1. Lecture 1.0: Introduction
      1. Lecture and Notes
    2. Lecture 1.1: Getting Started with Julia
      1. Lecture and Notes
      2. Optional Extra Resources
  2. Lecture 2: Optimizing Serial Code
    1. Lecture and Notes
    2. Optional Extra Resources
  3. Lecture 3: Introduction to Scientific Machine Learning Through Physics-Informed Neural Networks
    1. Optional Extra Resources
  4. Lecture 4: Introduction to Discrete Dynamical Systems
    1. Optional Extra Resources
  5. Lecture 5: Array-Based Parallelism, Embarrassingly Parallel Problems, and Data-Parallelism: The Basics of Single Node Parallel Computing
    1. Optional Extra Resources
  6. Lecture 6: Styles of Parallelism
  7. Lecture 7: Ordinary Differential Equations: Applications and Discretizations
  8. Lecture 8: Forward-Mode Automatic Differentiation
  9. Lecture 9: Solving Stiff Ordinary Differential Equations
    1. Lecture Notes
    2. Additional Readings on Convergence of Newton's Method
  10. Lecture 10: Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems
  11. Lecture 11: Differentiable Programming and Neural Differential Equations
    1. Additional Readings on AD Implementations
  12. Lecture 12: SciML in Practice
    1. Lecture 12.1: MPI for Distributed Computing
    2. Lecture 12.2: Mathematics of Machine Learning and High Performance Computing
  13. Lecture 13: GPU Computing
  14. Lecture 14: Partial Differential Equations and Convolutional Neural Networks
    1. Additional Readings
  15. Lecture 15: More Algorithms which Connect Differential Equations and Machine Learning
  16. Lecture 16: Probabilistic Programming
  17. Lecture 17: Global Sensitivity Analysis
  18. Lecture 18: Code Profiling and Optimization
  19. Lecture 19: Uncertainty Programming and Generalized Uncertainty Quantification
  20. Final Project

Lecture 1: Introduction and Syllabus

Lecture 1.0: Introduction

Lecture and Notes

This is to make sure we're all on the same page. It goes over the syllabus and what will be expected of you throughout the course. If you have not joined the Slack, please use the link from the introduction email (or email me if you need the link!).

Lecture 1.1: Getting Started with Julia

Lecture and Notes

Optional Extra Resources

If you are not comfortable with Julia yet, here are a few resources as a sort of "crash course" to get you up and running:

Some deeper materials:

Steven Johnson will be running a Julia workshop on 9/8/2020 for people who are interested. More details TBA.

Lecture 2: Optimizing Serial Code

Lecture and Notes

Optional Extra Resources

Before we start to parallelize code, build huge models, and automatically learn physics, we need to make sure our code is "good". How do you know you're writing "good" code? That's what this lecture seeks to answer. In this lecture we'll go through the techniques for writing good serial code and checking that your code is efficient.

Lecture 3: Introduction to Scientific Machine Learning Through Physics-Informed Neural Networks

Optional Extra Resources

Now let's take our first stab at the application: scientific machine learning. What is scientific machine learning? We will define the field by looking at a few approaches people are taking and what kinds of problems are being solved using scientific machine learning. The field of scientific machine learning and its span across computational science to applications in climate modeling and aerospace will be introduced. The methodologies that will be studied, under their various names, will be introduced, and the general formula that is arising in the discipline will be laid out: a mixture of scientific simulation tools like differential equations with machine learning primitives like neural networks, tied together through differentiable programming to achieve results that were previously not possible. After doing a survey, we will dive straight into developing a physics-informed neural network solver which solves an ordinary differential equation.

Lecture 4: Introduction to Discrete Dynamical Systems

Optional Extra Resources

Now that the stage is set, we see that to go deeper we will need a good grasp on how both discrete and continuous dynamical systems work. We will start by developing the basics of our scientific simulators: differential and difference equations. A quick overview of geometric results in the study of differential and difference equations will set the stage for understanding nonlinear dynamics, which we will quickly turn to numerical methods to visualize. Even if there is no analytical solution to the dynamical system, overarching behavior such as convergence to zero can be determined through asymptotic means and linearization. We will see later that these same techniques form the basis for the analysis of numerical methods for differential equations, such as the Runge-Kutta and Adams-Bashforth methods.

Since the discretization of differential equations is indeed a discrete dynamical system, we will use this as a case study to see how serial scalar-heavy codes should be optimized. SIMD, in-place operations, broadcasting, heap allocations, and static arrays will be used to get fast codes for dynamical system simulation. These simulations will then be used to reveal some intriguing properties of dynamical systems which will be further explored through the rest of the course.

Lecture 5: Array-Based Parallelism, Embarrassingly Parallel Problems, and Data-Parallelism: The Basics of Single Node Parallel Computing

Optional Extra Resources

Now that we have a concrete problem, let's start investigating ways to parallelize its solution. We will first see that many systems have an almost automatic way of parallelizing through array operations, which we will call array-based parallelism. The ability to easily parallelize large blocked linear algebra will be discussed, along with libraries like OpenBLAS, Intel MKL, CuBLAS (GPU parallelism) and Elemental.jl. This gives a form of Within-Method Parallelism which we can use to optimize specific algorithms which utilize linearity. Another form of parallelism is to parallelize over the inputs. We will describe how this is a form of data parallelism, and use this as a framework to introduce shared memory and distributed parallelism. The interactions between these parallelization methods and application considerations will be discussed.

Lecture 6: Styles of Parallelism

Here we continue down the line of describing methods of parallelism by giving a high level overview of the types of parallelism. SIMD and multithreading are reviewed as the basic forms of parallelism where message passing is not a concern. Then accelerators, such as GPUs and TPUs, are introduced. Moving further, distributed parallel computing and its models are showcased. What we will see is that the kind of parallelism we are doing is actually not the main determiner of how we need to think about parallelism. Instead, the determining factor is the parallel programming model, where just a handful of models, like task-based parallelism or SPMD models, are seen across all of the different hardware abstractions.

Lecture 7: Ordinary Differential Equations: Applications and Discretizations

In this lecture we will describe ordinary differential equations, where they arise in scientific contexts, and how they are solved. We will see that understanding the properties of the numerical methods requires understanding the dynamics of the discrete system generated from the approximation to the continuous system, and thus stability of a numerical method is directly tied to the stability properties of the dynamics. This gives the idea of stiffness, which is a larger computational idea about ill-conditioned systems.

Lecture 8: Forward-Mode Automatic Differentiation

As we will soon see, the ability to calculate derivatives underpins a lot of problems in both scientific computing and machine learning. We will specifically see it show up in later lectures on solving implicit equations f(x)=0 for stiff ordinary differential equation solvers, and in fitting neural networks. The common high performance way that this is done is called automatic differentiation. This lecture introduces the methods of forward and reverse mode automatic differentiation to set up future uses of the technique.

Lecture 9: Solving Stiff Ordinary Differential Equations

Lecture Notes

Additional Readings on Convergence of Newton's Method

Solving stiff ordinary differential equations, especially those which arise from partial differential equations, is the common bottleneck of scientific computing. The largest-scale scientific computing models are generally using heavy compute power in order to tackle some implicitly timestepped PDE solve! Thus we will take a deep dive into the different methods which are combined to create a stiff ordinary differential equation solver, looking at different aspects of Jacobian computations and linear solving and the effects that they have.

Lecture 10: Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Now that we have models, how do you fit the models to data? This lecture goes through the basic shooting method for parameter estimation, showcases how it's equivalent to training neural networks, and gives an in-depth discussion of how reverse-mode automatic differentiation is utilized in the training process for the efficient calculation of gradients.

Lecture 11: Differentiable Programming and Neural Differential Equations

Additional Readings on AD Implementations

Given the efficiency of reverse-mode automatic differentiation, we want to see how far we can push this idea. How could one implement reverse-mode AD without computational graphs, and include problems like nonlinear solving and ordinary differential equations? Are there methods other than shooting methods that can be utilized for parameter fitting? This lecture will explore where reverse-mode AD intersects with scientific modeling, and where machine learning begins to enter scientific computing.

Lecture 12: SciML in Practice

Lecture 12.1: MPI for Distributed Computing

Guest Lecturer: Lauren E. Milechin, MIT Lincoln Lab and the MIT Supercloud. Guest Writer: Jeremy Kepner, MIT Lincoln Lab and the MIT Supercloud.

In this lecture we went over the basics of MPI (Message Passing Interface) for distributed computing and examples on how to use MPI.jl to write parallel programs that work efficiently over multiple computers (or "compute nodes"). The MPI programming model and the job scripts required for using MPI on the MIT Supercloud HPC were demonstrated.

Lecture 12.2: Mathematics of Machine Learning and High Performance Computing

Guest Lecturer: Jeremy Kepner, MIT Lincoln Lab and the MIT Supercloud

In this lecture we went over the mathematics behind big data, machine learning, and high performance computing. Pieces like Amdahl's law for describing maximal parallel compute efficiency were described and demonstrated to showcase the hard ceilings on the capabilities of parallel computing, and these laws were described in the context of big data computations in order to assess the viability of distributed computing within that domain's context.

Lecture 13: GPU Computing

Guest Lecturer: Valentin Churavy, MIT Julia Lab

In this lecture we take a deeper dive into the architectural differences of GPUs and how that changes the parallel computing mindset that's required to arrive at efficient code. Valentin walks through the compilation process and how the resulting behaviors are due to core trade-offs in GPU-based programming and direct compilation for such hardware.

Lecture 14: Partial Differential Equations and Convolutional Neural Networks

Additional Readings

In this lecture we will continue to relate the methods of machine learning to those in scientific computing by looking at the relationship between convolutional neural networks and partial differential equations. It turns out they are more than just similar: the two are both stencil computations on spatial data!

Lecture 15: More Algorithms which Connect Differential Equations and Machine Learning

Neural ordinary differential equations and physics-informed neural networks are only the tip of the iceberg. In this lecture we will look into other algorithms which are utilizing the connection between differential equations and machine learning. We will generalize to augmented neural ordinary differential equations and universal differential equations with DiffEqFlux.jl, which now allows for stiff equations, stochasticity, delays, constraint equations, event handling, etc. to all take place in a neural differential equation format. Then we will dig into the methods for solving high dimensional partial differential equations through transformations to backwards stochastic differential equations (BSDEs), and the applications to mathematical finance through Black-Scholes along with stochastic optimal control through Hamilton-Jacobi-Bellman equations. We then look into alternative training techniques using reservoir computing, such as continuous-time echo state networks, which alleviate some of the gradient issues associated with training neural networks on stiff and chaotic dynamical systems. We showcase a few of the methods which are being used to automatically discover equations in their symbolic form, such as SINDy. To end it, we look into methods for accelerating differential equation solving through neural surrogate models, and uncover the true idea of what's going on, along with understanding when these applications can be used effectively.

Lecture 16: Probabilistic Programming

All of our previous discussions lived in a deterministic world. Not this one. Here we turn to a probabilistic view and allow programs to have random variables. Forward simulation of a random program is seen to be simple through Monte Carlo sampling. However, parameter estimation is now much more involved, since in this case we need to estimate not just values but probability distributions. It turns out that Bayes' rule gives a framework for performing such estimations. We see that classical parameter estimation falls out as a maximization of probability with the "simplest" form of distributions, and thus this gives a nice generalization even to standard parameter estimation and justifies the use of L2 loss functions and regularization (as a perturbation by a prior). Next, we turn to estimating the distributions, which we see is possible for small problems using Metropolis Hastings, but for larger problems we develop Hamiltonian Monte Carlo. It turns out that Hamiltonian Monte Carlo has strong ties to both ODEs and differentiable programming: it is defined as solving ODEs which arise from a Hamiltonian, and derivatives of the likelihood are required, which is essentially the same idea as derivatives of cost functions! We then describe an alternative approach: Automatic Differentiation Variational Inference (ADVI), which once again is using the tools of differentiable programming to estimate distributions of probabilistic programs.

Lecture 17: Global Sensitivity Analysis

Our previous analysis of sensitivities was all local. What does it mean to examine the sensitivities of a model globally? It turns out the probabilistic programming viewpoint gives us a solid way of describing how we expect values to be changing over larger sets of parameters via the random variables that describe the program's inputs. This means we can decompose the output variance into indices which can be calculated via various quadrature approximations, which then give a tractable measurement of statements like "variable x has no effect on the mean solution".

Lecture 18: Code Profiling and Optimization

How do you put everything together in this course? Let's take a look at a PDE solver code given in a method of lines form. In this lecture I walk through the code and demonstrate how to serially optimize it, and showcase the interaction between variable caching and automatic differentiation.

Lecture 19: Uncertainty Programming and Generalized Uncertainty Quantification

We end the course by taking a look at another mathematical topic to see whether it can be addressed in a similar manner: uncertainty quantification (UQ). There are ways in which it can be handled similarly to automatic differentiation. Measurements.jl gives a forward-propagation approach, somewhat like ForwardDiff's dual numbers, through a number type which is representative of normal distributions and pushes these values through a program. This has many advantages, since it allows for uncertainty quantification without sampling, but turns the number types into a value that is heap allocated. Other approaches are investigated, like interval arithmetic which is rigorous but limited in scope. But on the entirely other end, a non-general method for ODEs is shown which utilizes the trajectory structure of the differential equation solution and doesn't give the blow up that the other methods see. This showcases that using higher level information can be helpful in UQ, and that less local approaches may be necessary. We end by showcasing the Koopman operator as the adjoint of the pushforward of the uncertainty measure, and as an adjoint method it can give accelerated computations of uncertainty against cost functions.

Final Project

The final project is a 10-20 page paper using the style template from the SIAM Journal on Numerical Analysis (or similar). The final project must include code for a high performance (or parallelized) implementation of the algorithm in a form that is usable by others. A thorough performance analysis is expected. Model your paper on academic review articles (e.g. read SIAM Review and similar journals for examples).

One possibility is to review an interesting algorithm not covered in the course and develop a high performance implementation. Some examples include:

  • High performance PDE solvers for specific PDEs like Navier-Stokes

  • Common high performance algorithms (Ex: Jacobian-Free Newton Krylov for PDEs)

  • Recreation of a parameter sensitivity study in a field like biology, pharmacology, or climate science

  • Augmented Neural Ordinary Differential Equations

  • Neural Jump Stochastic Differential Equations

  • Parallelized stencil calculations

  • Distributed linear algebra kernels

  • Parallel implementations of statistical libraries, such as survival statistics or linear models for big data. Here's one example parallel library and a second example.

  • Parallelization of data analysis methods

  • Type-generic implementations of sparse linear algebra methods

  • A fast regex library

  • Math library primitives (exp, log, etc.)

Another possibility is to work on state-of-the-art performance engineering. This would be implementing a new auto-parallelization or performance enhancement. For these types of projects, implementing an application for benchmarking is not required, and one can instead benchmark the effects on already existing code to find cases where it is beneficial (or leads to performance regressions). Possible examples are:

Additionally, Scientific Machine Learning is a wide open field with lots of low hanging fruit. Instead of a review, a suitable research project can be chosen for the final project. Possibilities include:

  • Acceleration methods for adjoints of differential equations

  • Improved methods for Physics-Informed Neural Networks

  • New applications of neural differential equations

  • Parallelized implicit ODE solvers for large ODE systems

  • GPU-parallelized ODE/SDE solvers for small systems

Final project topics must be declared by October 30th with a 1 page extended abstract.

\ No newline at end of file + Lecture Overview - MIT Parallel Computing and Scientific Machine Learning (SciML)

Lecture Overview

  1. Lecture 1: Introduction and Syllabus
    1. Lecture 1.0: Introduction
      1. Lecture and Notes
    2. Lecture 1.1: Getting Started with Julia
      1. Lecture and Notes
      2. Optional Extra Resources
  2. Lecture 2: Optimizing Serial Code
    1. Lecture and Notes
    2. Optional Extra Resources
  3. Lecture 3: Introduction to Scientific Machine Learning Through Physics-Informed Neural Networks
    1. Optional Extra Resources
  4. Lecture 4: Introduction to Discrete Dynamical Systems
    1. Optional Extra Resources
  5. Lecture 5: Array-Based Parallelism, Embarrassingly Parallel Problems, and Data-Parallelism: The Basics of Single Node Parallel Computing
    1. Optional Extra Resources
  6. Lecture 6: Styles of Parallelism
  7. Lecture 7: Ordinary Differential Equations: Applications and Discretizations
  8. Lecture 8: Forward-Mode Automatic Differentiation
  9. Lecture 9: Solving Stiff Ordinary Differential Equations
    1. Lecture Notes
    2. Additional Readings on Convergence of Newton's Method
  10. Lecture 10: Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems
  11. Lecture 11: Differentiable Programming and Neural Differential Equations
    1. Additional Readings on AD Implementations
  12. Lecture 12: SciML in Practice
    1. Lecture 12.1: MPI for Distributed Computing
    2. Lecture 12.2: Mathematics of Machine Learning and High Performance Computing
  13. Lecture 13: GPU Computing
  14. Lecture 14: Partial Differential Equations and Convolutional Neural Networks
    1. Additional Readings
  15. Lecture 15: More Algorithms which Connect Differential Equations and Machine Learning
  16. Lecture 16: Probabilistic Programming
  17. Lecture 17: Global Sensitivity Analysis
  18. Lecture 18: Code Profiling and Optimization
  19. Lecture 19: Uncertainty Programming and Generalized Uncertainty Quantification
  20. Final Project

Lecture 1: Introduction and Syllabus

Lecture 1.0: Introduction

Lecture and Notes

This is to make sure we're all on the same page. It goes over the syllabus and what will be expected of you throughout the course. If you have not joined the Slack, please use the link from the introduction email (or email me if you need the link!).

Lecture 1.1: Getting Started with Julia

Lecture and Notes

Optional Extra Resources

If you are not comfortable with Julia yet, here are a few resources as a sort of "crash course" to get you up and running:

Some deeper materials:

Steven Johnson will be running a Julia workshop on 9/8/2020 for people who are interested. More details TBA.

Lecture 2: Optimizing Serial Code

Lecture and Notes

Optional Extra Resources

Before we start to parallelize code, build huge models, and automatically learn physics, we need to make sure our code is "good". How do you know you're writing "good" code? That's what this lecture seeks to answer. In this lecture we'll go through the techniques for writing good serial code and checking that your code is efficient.

Lecture 3: Introduction to Scientific Machine Learning Through Physics-Informed Neural Networks

Optional Extra Resources

Now let's take our first stab at the application: scientific machine learning. What is scientific machine learning? We will define the field by looking at a few approaches people are taking and what kinds of problems are being solved using scientific machine learning. The field of scientific machine learning and its span across computational science to applications in climate modeling and aerospace will be introduced. The methodologies that will be studied, under their various names, will be introduced, and the general formula that is arising in the discipline will be laid out: a mixture of scientific simulation tools like differential equations with machine learning primitives like neural networks, tied together through differentiable programming to achieve results that were previously not possible. After doing a survey, we will dive straight into developing a physics-informed neural network solver which solves an ordinary differential equation.

Lecture 4: Introduction to Discrete Dynamical Systems

Optional Extra Resources

Now that the stage is set, we see that to go deeper we will need a good grasp on how both discrete and continuous dynamical systems work. We will start by developing the basics of our scientific simulators: differential and difference equations. A quick overview of geometric results in the study of differential and difference equations will set the stage for understanding nonlinear dynamics, which we will quickly turn to numerical methods to visualize. Even if there is no analytical solution to the dynamical system, overarching behavior such as convergence to zero can be determined through asymptotic means and linearization. We will see later that these same techniques form the basis for the analysis of numerical methods for differential equations, such as the Runge-Kutta and Adams-Bashforth methods.

Since the discretization of differential equations is indeed a discrete dynamical system, we will use this as a case study to see how serial scalar-heavy codes should be optimized. SIMD, in-place operations, broadcasting, heap allocations, and static arrays will be used to get fast codes for dynamical system simulation. These simulations will then be used to reveal some intriguing properties of dynamical systems which will be further explored through the rest of the course.

Lecture 5: Array-Based Parallelism, Embarrassingly Parallel Problems, and Data-Parallelism: The Basics of Single Node Parallel Computing

Optional Extra Resources

Now that we have a concrete problem, let's start investigating ways to parallelize its solution. We will first see that many systems have an almost automatic way of parallelizing through array operations, which we will call array-based parallelism. The ability to easily parallelize large blocked linear algebra will be discussed, along with libraries like OpenBLAS, Intel MKL, CuBLAS (GPU parallelism) and Elemental.jl. This gives a form of Within-Method Parallelism which we can use to optimize specific algorithms which utilize linearity. Another form of parallelism is to parallelize over the inputs. We will describe how this is a form of data parallelism, and use this as a framework to introduce shared memory and distributed parallelism. The interactions between these parallelization methods and application considerations will be discussed.

Lecture 6: Styles of Parallelism

Here we continue down the line of describing methods of parallelism by giving a high level overview of the types of parallelism. SIMD and multithreading are reviewed as the basic forms of parallelism where message passing is not a concern. Then accelerators, such as GPUs and TPUs, are introduced. Moving further, distributed parallel computing and its models are showcased. What we will see is that the kind of parallelism we are doing is actually not the main determiner of how we need to think about parallelism. Instead, the determining factor is the parallel programming model, where just a handful of models, like task-based parallelism or SPMD models, are seen across all of the different hardware abstractions.

Lecture 7: Ordinary Differential Equations: Applications and Discretizations

In this lecture we will describe ordinary differential equations, where they arise in scientific contexts, and how they are solved. We will see that understanding the properties of the numerical methods requires understanding the dynamics of the discrete system generated from the approximation to the continuous system, and thus stability of a numerical method is directly tied to the stability properties of the dynamics. This gives the idea of stiffness, which is a larger computational idea about ill-conditioned systems.

Lecture 8: Forward-Mode Automatic Differentiation

As we will soon see, the ability to calculate derivatives underpins a lot of problems in both scientific computing and machine learning. We will specifically see it show up in later lectures on solving implicit equations f(x)=0 for stiff ordinary differential equation solvers, and in fitting neural networks. The common high performance way that this is done is called automatic differentiation. This lecture introduces the methods of forward and reverse mode automatic differentiation to set up future uses of the technique.

Lecture 9: Solving Stiff Ordinary Differential Equations

Lecture Notes

Additional Readings on Convergence of Newton's Method

Solving stiff ordinary differential equations, especially those which arise from partial differential equations, is the common bottleneck of scientific computing. The largest-scale scientific computing models are generally using heavy compute power in order to tackle some implicitly timestepped PDE solve! Thus we will take a deep dive into the different methods which are combined to create a stiff ordinary differential equation solver, looking at different aspects of Jacobian computations and linear solving and the effects that they have.

Lecture 10: Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Now that we have models, how do you fit the models to data? This lecture goes through the basic shooting method for parameter estimation, showcases how it's equivalent to training neural networks, and gives an in-depth discussion of how reverse-mode automatic differentiation is utilized in the training process for the efficient calculation of gradients.

Lecture 11: Differentiable Programming and Neural Differential Equations

Additional Readings on AD Implementations

Given the efficiency of reverse-mode automatic differentiation, we want to see how far we can push this idea. How could one implement reverse-mode AD without computational graphs, and include problems like nonlinear solving and ordinary differential equations? Are there methods other than shooting methods that can be utilized for parameter fitting? This lecture will explore where reverse-mode AD intersects with scientific modeling, and where machine learning begins to enter scientific computing.

Lecture 12: SciML in Practice

Lecture 12.1: MPI for Distributed Computing

Guest Lecturer: Lauren E. Milechin, MIT Lincoln Lab and the MIT Supercloud. Guest Writer: Jeremy Kepner, MIT Lincoln Lab and the MIT Supercloud.

In this lecture we went over the basics of MPI (Message Passing Interface) for distributed computing and examples on how to use MPI.jl to write parallel programs that work efficiently over multiple computers (or "compute nodes"). The MPI programming model and the job scripts required for using MPI on the MIT Supercloud HPC were demonstrated.

Lecture 12.2: Mathematics of Machine Learning and High Performance Computing

Guest Lecturer: Jeremy Kepner, MIT Lincoln Lab and the MIT Supercloud

In this lecture we went over the mathematics behind big data, machine learning, and high performance computing. Pieces like Amdahl's law for describing maximal parallel compute efficiency were described and demonstrated to showcase the hard ceilings on the capabilities of parallel computing, and these laws were described in the context of big data computations in order to assess the viability of distributed computing within that domain's context.

Lecture 13: GPU Computing

Guest Lecturer: Valentin Churavy, MIT Julia Lab

In this lecture we take a deeper dive into the architectural differences of GPUs and how that changes the parallel computing mindset that's required to arrive at efficient code. Valentin walks through the compilation process and how the resulting behaviors are due to core trade-offs in GPU-based programming and direct compilation for such hardware.

Lecture 14: Partial Differential Equations and Convolutional Neural Networks

Additional Readings

In this lecture we will continue to relate the methods of machine learning to those in scientific computing by looking at the relationship between convolutional neural networks and partial differential equations. It turns out they are more than just similar: the two are both stencil computations on spatial data!

Lecture 15: More Algorithms which Connect Differential Equations and Machine Learning

Neural ordinary differential equations and physics-informed neural networks are only the tip of the iceberg. In this lecture we will look into other algorithms which are utilizing the connection between differential equations and machine learning. We will generalize to augmented neural ordinary differential equations and universal differential equations with DiffEqFlux.jl, which now allows for stiff equations, stochasticity, delays, constraint equations, event handling, etc. to all take place in a neural differential equation format. Then we will dig into the methods for solving high dimensional partial differential equations through transformations to backwards stochastic differential equations (BSDEs), and the applications to mathematical finance through Black-Scholes along with stochastic optimal control through Hamilton-Jacobi-Bellman equations. We then look into alternative training techniques using reservoir computing, such as continuous-time echo state networks, which alleviate some of the gradient issues associated with training neural networks on stiff and chaotic dynamical systems. We showcase a few of the methods which are being used to automatically discover equations in their symbolic form, such as SINDy. To end it, we look into methods for accelerating differential equation solving through neural surrogate models, and uncover the true idea of what's going on, along with understanding when these applications can be used effectively.

Lecture 16: Probabilistic Programming

All of our previous discussions lived in a deterministic world. Not this one. Here we turn to a probabilistic view and allow programs to have random variables. Forward simulation of a random program is seen to be simple through Monte Carlo sampling. However, parameter estimation is now much more involved, since in this case we need to estimate not just values but probability distributions. It turns out that Bayes' rule gives a framework for performing such estimations. We see that classical parameter estimation falls out as a maximization of probability with the "simplest" form of distributions, and thus this gives a nice generalization even to standard parameter estimation and justifies the use of L2 loss functions and regularization (as a perturbation by a prior). Next, we turn to estimating the distributions, which we see is possible for small problems using Metropolis Hastings, but for larger problems we develop Hamiltonian Monte Carlo. It turns out that Hamiltonian Monte Carlo has strong ties to both ODEs and differentiable programming: it is defined as solving ODEs which arise from a Hamiltonian, and derivatives of the likelihood are required, which is essentially the same idea as derivatives of cost functions! We then describe an alternative approach: Automatic Differentiation Variational Inference (ADVI), which once again is using the tools of differentiable programming to estimate distributions of probabilistic programs.

Lecture 17: Global Sensitivity Analysis

Our previous analysis of sensitivities was all local. What does it mean to examine the sensitivities of a model globally? It turns out the probabilistic programming viewpoint gives us a solid way of describing how we expect values to be changing over larger sets of parameters via the random variables that describe the program's inputs. This means we can decompose the output variance into indices which can be calculated via various quadrature approximations, which then give a tractable measurement of statements like "variable x has no effect on the mean solution".

Lecture 18: Code Profiling and Optimization

How do you put everything together in this course? Let's take a look at a PDE solver code given in a method of lines form. In this lecture I walk through the code and demonstrate how to serially optimize it, and showcase the interaction between variable caching and automatic differentiation.

Lecture 19: Uncertainty Programming and Generalized Uncertainty Quantification

We end the course by taking a look at another mathematical topic to see whether it can be addressed in a similar manner: uncertainty quantification (UQ). There are ways in which it can be handled similarly to automatic differentiation. Measurements.jl gives a forward-propagation approach, somewhat like ForwardDiff's dual numbers, through a number type which is representative of normal distributions and pushes these values through a program. This has many advantages, since it allows for uncertainty quantification without sampling, but turns the number types into a value that is heap allocated. Other approaches are investigated, like interval arithmetic which is rigorous but limited in scope. But on the entirely other end, a non-general method for ODEs is shown which utilizes the trajectory structure of the differential equation solution and doesn't give the blow up that the other methods see. This showcases that using higher level information can be helpful in UQ, and that less local approaches may be necessary. We end by showcasing the Koopman operator as the adjoint of the pushforward of the uncertainty measure, and as an adjoint method it can give accelerated computations of uncertainty against cost functions.

Final Project

The final project is a 10-20 page paper using the style template from the SIAM Journal on Numerical Analysis (or similar). The final project must include code for a high performance (or parallelized) implementation of the algorithm in a form that is usable by others. A thorough performance analysis is expected. Model your paper on academic review articles (e.g. read SIAM Review and similar journals for examples).

One possibility is to review an interesting algorithm not covered in the course and develop a high performance implementation. Some examples include:

  • High performance PDE solvers for specific PDEs like Navier-Stokes

  • Common high performance algorithms (Ex: Jacobian-Free Newton Krylov for PDEs)

  • Recreation of a parameter sensitivity study in a field like biology, pharmacology, or climate science

  • Augmented Neural Ordinary Differential Equations

  • Neural Jump Stochastic Differential Equations

  • Parallelized stencil calculations

  • Distributed linear algebra kernels

  • Parallel implementations of statistical libraries, such as survival statistics or linear models for big data. Here's one example parallel library and a second example.

  • Parallelization of data analysis methods

  • Type-generic implementations of sparse linear algebra methods

  • A fast regex library

  • Math library primitives (exp, log, etc.)

Another possibility is to work on state-of-the-art performance engineering. This would be implementing a new auto-parallelization or performance enhancement. For these types of projects, implementing an application for benchmarking is not required, and one can instead benchmark the effects on already existing code to find cases where it is beneficial (or leads to performance regressions). Possible examples are:

Additionally, Scientific Machine Learning is a wide open field with lots of low hanging fruit. Instead of a review, a suitable research project can be chosen for the final project. Possibilities include:

  • Acceleration methods for adjoints of differential equations

  • Improved methods for Physics-Informed Neural Networks

  • New applications of neural differential equations

  • Parallelized implicit ODE solvers for large ODE systems

  • GPU-parallelized ODE/SDE solvers for small systems

Final project topics must be declared by October 30th with a 1 page extended abstract.

\ No newline at end of file diff --git a/notes/02-Optimizing_Serial_Code/index.html b/notes/02-Optimizing_Serial_Code/index.html index d028b1c6..429bbf7c 100644 --- a/notes/02-Optimizing_Serial_Code/index.html +++ b/notes/02-Optimizing_Serial_Code/index.html @@ -10,7 +10,7 @@ end @btime inner_rows!(C,A,B)
-16.500 μs (0 allocations: 0 bytes)
+21.600 μs (0 allocations: 0 bytes)
 
 function inner_cols!(C,A,B)
   for j in 1:100, i in 1:100
@@ -19,7 +19,7 @@
 end
 @btime inner_cols!(C,A,B)
 
-7.933 μs (0 allocations: 0 bytes)
+9.500 μs (0 allocations: 0 bytes)
 

Lower Level View: The Stack and the Heap

Locally, memory is composed of a stack and a heap. The stack requires a static allocation: it is ordered. Because it's ordered, it is very clear where things are in the stack, and therefore accesses are very quick (think instantaneous). However, because this is static, it requires that the size of the variables is known at compile time (to determine all of the variable locations). Since that is not possible with all variables, there exists the heap. The heap is essentially a stack of pointers to objects in memory. When heap variables are needed, their values are pulled up the cache chain and accessed.

Heap Allocations and Speed

Heap allocations are costly because they involve this pointer indirection, so stack allocation should be done when sensible (it's not helpful for really large arrays, but for small values like scalars it's essential!).

 function inner_alloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -29,7 +29,7 @@
 end
 @btime inner_alloc!(C,A,B)
 
-314.598 μs (10000 allocations: 625.00 KiB)
+363.501 μs (10000 allocations: 625.00 KiB)
 
 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -39,7 +39,7 @@
 end
 @btime inner_noalloc!(C,A,B)
 
-7.533 μs (0 allocations: 0 bytes)
+8.800 μs (0 allocations: 0 bytes)
 

Why does the array here get heap-allocated? The compiler isn't able to prove/guarantee at compile-time that the array's size will always be a given value, and thus it allocates the array on the heap. @btime tells us this allocation occurred and shows us the total heap memory that was taken. Meanwhile, the size of a Float64 number is known at compile-time (64-bits), and so this is stored onto the stack and given a specific location that the compiler will be able to directly address.

Note that one can use the StaticArrays.jl library to get statically-sized arrays and thus arrays which are stack-allocated:

 using StaticArrays
 function static_inner_alloc!(C,A,B)
@@ -50,7 +50,7 @@
 end
 @btime static_inner_alloc!(C,A,B)
 
-8.166 μs (0 allocations: 0 bytes)
+9.200 μs (0 allocations: 0 bytes)
 

Mutation to Avoid Heap Allocations

Many times you do need to write into an array, so how can you write into an array without performing a heap allocation? The answer is mutation. Mutation is changing the values of an already existing array. In that case, no free memory has to be found to put the array (and no memory has to be freed by the garbage collector).

In Julia, functions which mutate the first value are conventionally noted by a !. See the difference between these two equivalent functions:

 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -60,7 +60,7 @@
 end
 @btime inner_noalloc!(C,A,B)
 
-7.733 μs (0 allocations: 0 bytes)
+9.600 μs (0 allocations: 0 bytes)
 
 function inner_alloc(A,B)
   C = similar(A)
@@ -71,7 +71,7 @@
 end
 @btime inner_alloc(A,B)
 
-15.100 μs (2 allocations: 78.17 KiB)
+16.200 μs (2 allocations: 78.17 KiB)
 

To use this pattern effectively, the ! version assumes that the caller has already allocated the output array to pass as the output argument. If that is not true, then one would need to allocate it manually. The goal of this interface is to give the caller control over the allocations, allowing them to manually reduce the total number of heap allocations and thus increase the speed.
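
For example, the typical usage pattern is for the caller to allocate the output once and reuse it across many calls (a small sketch reusing inner_noalloc! and the A, B arrays from above):

  C = similar(A)                # one allocation up front, owned by the caller
  for k in 1:1000
      inner_noalloc!(C, A, B)   # each call reuses C, so no per-call heap allocation
  end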

Julia's Broadcasting Mechanism

Wouldn't it be nice to not have to write the loop there? In many high level languages this is simply called vectorization. In Julia, we will call it array vectorization to distinguish it from the SIMD vectorization which is common in lower level languages like C, Fortran, and Julia.

In Julia, if you use . on an operator it will transform it to the broadcasted form. Broadcast is lazy: it will build up an entire .'d expression and then call broadcast! on the composed expression. This is customizable and documented in detail. However, to a first approximation we can think of the broadcast mechanism as a mechanism for building fused expressions. For example, the Julia code:

 A .+ B .+ C;
 

under the hood lowers to something like:
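
(The lowered code itself is cut off by the diff hunk below; roughly speaking, the fused expression behaves like a single elementwise kernel along the lines of:)

  map((a, b, c) -> a + b + c, A, B, C);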

@@ -86,29 +86,29 @@
 end
 @btime unfused(A,B,C);
 
-31.800 μs (4 allocations: 156.34 KiB)
+31.200 μs (4 allocations: 156.34 KiB)
 
 fused(A,B,C) = A .+ B .+ C
 @btime fused(A,B,C);
 
-18.400 μs (2 allocations: 78.17 KiB)
+18.600 μs (2 allocations: 78.17 KiB)
 

Note that we can also fuse the output by using .=. This is essentially the vectorized version of a ! function:

 D = similar(A)
 fused!(D,A,B,C) = (D .= A .+ B .+ C)
 @btime fused!(D,A,B,C);
 
-10.800 μs (0 allocations: 0 bytes)
+10.500 μs (0 allocations: 0 bytes)
 

Note on Broadcasting Function Calls

Julia allows for broadcasting the call () operator as well. .() will call the function element-wise on all arguments, so sin.(A) will be the elementwise sine function. This will fuse like the other operators.

Note on Vectorization and Speed

In articles on MATLAB, Python, R, etc., this is where you will be told to vectorize your code. Notice from above that there isn't a performance difference between writing loops and using vectorized broadcasts. This is not abnormal! The reason why you are told to vectorize code in these other languages is because they have a high per-operation overhead (which will be discussed further down). This means that every call, like +, is costly in these languages. To get around this issue and make the language usable, someone wrote and compiled the loop for the C/Fortran function that does the broadcasted form (see numpy's Github repo). Thus A .+ B's MATLAB/Python/R equivalents are calling a single C function to generally avoid the cost of function calls and thus are faster.

But this is not an intrinsic property of vectorization. Vectorization isn't "fast" in these languages, it's just close to the correct speed. The reason vectorization is recommended is because looping is slow in these languages. Because looping isn't slow in Julia (or C, C++, Fortran, etc.), loops and vectorization generally have the same speed. So use the one that works best for your code without a care about performance.

(As a small side effect, these high level languages tend to allocate a lot of temporary variables since the individual C kernels are written for specific numbers of inputs and thus don't naturally fuse. Julia's broadcast mechanism is just generating and JIT compiling Julia functions on the fly, and thus it can accommodate the combinatorial explosion in the number of choices by only compiling the combinations that are necessary for a specific code.)

Heap Allocations from Slicing

It's important to note that slices in Julia produce copies instead of views. Thus for example:

 A[50,50]
 
-0.6987883598884902
+0.1874936774122129
 

allocates a new output. This is for safety, since if it pointed to the same array then writing to it would change the original array. We can demonstrate this by asking for a view instead of a copy.

 @show A[1]
 E = @view A[1:5,1:5]
 E[1] = 2.0
 @show A[1]
 
-A[1] = 0.49711339210357286
+A[1] = 0.6237197114754515
 A[1] = 2.0
 2.0
 

However, this means that @view A[1:5,1:5] did not allocate an array. It does allocate a pointer if the escape analysis is unable to prove that it can be elided: in small loops there will be no allocation, while if the view is returned from a function, for example, it will allocate the pointer (~80 bytes) but not the memory of the array. This means that it is O(1) in cost, but with a relatively small constant.
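
To see the difference concretely, one can benchmark a slice against a view (a small sketch; exact timings and the ~80 byte view allocation will vary by setup):

  using BenchmarkTools
  A = rand(100, 100)
  @btime A[1:5, 1:5];        # slicing copies: allocates a new 5×5 array
  @btime @view A[1:5, 1:5];  # a view does not copy the array's data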

Asymptotic Cost of Heap Allocations

Heap allocations have to locate and prepare a space in RAM that is proportional to the amount of memory being allocated, which means that the cost of a heap allocation for an array is O(n), with a large constant. As RAM begins to fill up, this cost dramatically increases. If you run out of RAM, your computer may begin to use swap, which is essentially RAM simulated on your hard drive. Generally when you hit swap your performance is so dead that you may think that your computation froze, but if you check your resource use you will notice that it has actually just filled the RAM and started to use the swap.

But think of it as O(n) with a large constant factor. This means that for operations which only touch the data once, heap allocations can dominate the computational cost:

@@ -130,7 +130,7 @@
 plot(ns,alloc,label="=",xscale=:log10,yscale=:log10,legend=:bottomright,
      title="Micro-optimizations matter for BLAS1")
 plot!(ns,noalloc,label=".=")
However, when the computation takes O(n^3), like in matrix multiplications, the high constant factor only comes into play when the matrices are sufficiently small:

 using LinearAlgebra, BenchmarkTools
 function alloc_timer(n)
     A = rand(n,n)
@@ -149,7 +149,7 @@
 plot(ns,alloc,label="*",xscale=:log10,yscale=:log10,legend=:bottomright,
      title="Micro-optimizations only matter for small matmuls")
 plot!(ns,noalloc,label="mul!")
Though using the mutating form is never bad and is always a little bit better.

Optimizing Memory Use Summary

Julia's Type Inference and the Compiler

Many people think Julia is fast because it is JIT compiled. That is simply not true (we've already shown examples where Julia code isn't fast, but it's always JIT compiled!). Instead, the reason why Julia is fast is because of the combination of two ideas:

  • Type inference

  • Type specialization in functions

These two features naturally give rise to Julia's core design feature: multiple dispatch. Let's break down these pieces.

Type Inference

At the core level of the computer, everything has a type. Some languages are more explicit about said types, while others try to hide the types from the user. A type tells the compiler how to store and interpret the memory of a value. For example, if the compiled code knows that the value in the register is supposed to be interpreted as a 64-bit floating point number, then it understands how to interpret that slab of memory.

Importantly, it will know what to do for function calls. If the code tells it to add two floating point numbers, it will send them as inputs to the Floating Point Unit (FPU) which will give the output.

If the types are not known, then what? One cannot actually compute until the types are known, since otherwise it's impossible to interpret the memory. In languages like C, the programmer has to declare the types of variables in the program:

void add(double *a, double *b, double *c, size_t n){
   size_t i;
   for(i = 0; i < n; ++i) {
     c[i] = a[i] + b[i];
@@ -172,7 +172,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define i64 @julia_f_2897(i64 signext %0, i64 signext %1) #0 {
+define i64 @julia_f_2893(i64 signext %0, i64 signext %1) #0 {
 top:
 ; ┌ @ int.jl:87 within `+`
    %2 = add i64 %1, %0
@@ -184,7 +184,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define double @julia_f_2899(double %0, double %1) #0 {
+define double @julia_f_2895(double %0, double %1) #0 {
 top:
 ; ┌ @ float.jl:408 within `+`
    %2 = fadd double %0, %1
@@ -204,7 +204,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define i64 @julia_g_2901(i64 signext %0, i64 signext %1) #0 {
+define i64 @julia_g_2897(i64 signext %0, i64 signext %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 6 within `g`
@@ -249,7 +249,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `f`
-define double @julia_f_3427(double %0, i64 signext %1) #0 {
+define double @julia_f_3423(double %0, i64 signext %1) #0 {
 top:
 ; ┌ @ promotion.jl:410 within `+`
 ; │┌ @ promotion.jl:381 within `promote`
@@ -290,7 +290,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define double @julia_g_3430(double %0, i64 signext %1) #0 {
+define double @julia_g_3426(double %0, i64 signext %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 5 within `g`
@@ -360,7 +360,7 @@
 0.4
 

The + function in Julia is just defined as +(a,b), and we can actually point to that code in the Julia distribution:

 @which +(2.0,5)
+(x::Number, y::Number) in Base at promotion.jl:410

To control at a higher level, Julia uses abstract types. For example, Float64 <: AbstractFloat, meaning Float64s are a subtype of AbstractFloat. We also have that Int <: Integer, while both AbstractFloat <: Number and Integer <: Number.
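
These relationships can be queried directly with the subtyping operator; a quick check in the REPL (Real sits between AbstractFloat and Number in the hierarchy):

 Float64 <: AbstractFloat   # true
 supertype(Float64)         # AbstractFloat
 supertype(AbstractFloat)   # Real
 Real <: Number             # true
 Int <: Integer             # true
 isabstracttype(Number)     # true
 isconcretetype(Float64)    # true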

Julia allows the user to define dispatches at a higher level, and the version that is called is the most strict version that is correct. For example, right now with ff we will get a MethodError if we call it between a Int and a Float64 because no such method exists:

 ff(2.0,5)
 
 ERROR: MethodError: no method matching ff(::Float64, ::Int64)
@@ -381,7 +381,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `ff`
-define double @julia_ff_3551(double %0, i64 signext %1) #0 {
+define double @julia_ff_3547(double %0, i64 signext %1) #0 {
 top:
 ; ┌ @ promotion.jl:410 within `+`
 ; │┌ @ promotion.jl:381 within `promote`
@@ -537,7 +537,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define void @julia_g_3690([2 x double]* noalias nocapture noundef nonnull s
+define void @julia_g_3686([2 x double]* noalias nocapture noundef nonnull s
 ret([2 x double]) align 8 dereferenceable(16) %0, [2 x double]* nocapture n
 oundef nonnull readonly align 8 dereferenceable(16) %1, [2 x double]* nocap
 ture noundef nonnull readonly align 8 dereferenceable(16) %2) #0 {
@@ -653,7 +653,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define [2 x float] @julia_g_3702([2 x float]* nocapture noundef nonnull rea
+define [2 x float] @julia_g_3698([2 x float]* nocapture noundef nonnull rea
 donly align 4 dereferenceable(8) %0, [2 x float]* nocapture noundef nonnull
  readonly align 4 dereferenceable(8) %1) #0 {
 top:
@@ -754,7 +754,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `g`
-define void @julia_g_3721([2 x {}*]* noalias nocapture noundef nonnull sret
+define void @julia_g_3717([2 x {}*]* noalias nocapture noundef nonnull sret
 ([2 x {}*]) align 8 dereferenceable(16) %0, [2 x {}*]* nocapture noundef no
 nnull readonly align 8 dereferenceable(16) %1, [2 x {}*]* nocapture noundef
  nonnull readonly align 8 dereferenceable(16) %2) #0 {
@@ -788,22 +788,22 @@
    store {}** %13, {}*** %12, align 8
    %14 = bitcast {}*** %pgcstack to {}***
    store {}** %gcframe2.sub, {}*** %14, align 8
-   call void @"j_+_3723"([2 x {}*]* noalias nocapture noundef nonnull sret(
-[2 x {}*]) %9, [2 x {}*]* nocapture nonnull readonly %1, i64 signext 4) #0
+   call void @"j_+_3719"([2 x {}*]* noalias nocapture noundef nonnull sret(
+[2 x {}*]) %5, [2 x {}*]* nocapture nonnull readonly %1, i64 signext 4) #0
 ; └
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 6 within `g`
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
 :2 within `f`
-   call void @"j_+_3724"([2 x {}*]* noalias nocapture noundef nonnull sret(
-[2 x {}*]) %5, i64 signext 2, [2 x {}*]* nocapture readonly %9) #0
+   call void @"j_+_3720"([2 x {}*]* noalias nocapture noundef nonnull sret(
+[2 x {}*]) %9, i64 signext 2, [2 x {}*]* nocapture readonly %5) #0
 ; └
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 7 within `g`
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd
 :2 within `f`
-   call void @"j_+_3725"([2 x {}*]* noalias nocapture noundef nonnull sret(
-[2 x {}*]) %7, [2 x {}*]* nocapture readonly %5, [2 x {}*]* nocapture nonnu
+   call void @"j_+_3721"([2 x {}*]* noalias nocapture noundef nonnull sret(
+[2 x {}*]) %7, [2 x {}*]* nocapture readonly %9, [2 x {}*]* nocapture nonnu
 ll readonly %2) #0
 ; └
   %15 = bitcast [2 x {}*]* %0 to i8*
@@ -836,28 +836,28 @@
 b = MyComplex(2.0,1.0)
 @btime g(a,b)
 
-22.189 ns (1 allocation: 32 bytes)
+29.548 ns (1 allocation: 32 bytes)
 MyComplex(9.0, 2.0)
 
 a = MyParameterizedComplex(1.0,1.0)
 b = MyParameterizedComplex(2.0,1.0)
 @btime g(a,b)
 
-22.088 ns (1 allocation: 32 bytes)
+26.835 ns (1 allocation: 32 bytes)
 MyParameterizedComplex{Float64}(9.0, 2.0)
 
 a = MySlowComplex(1.0,1.0)
 b = MySlowComplex(2.0,1.0)
 @btime g(a,b)
 
-130.643 ns (5 allocations: 96 bytes)
+141.829 ns (5 allocations: 96 bytes)
 MySlowComplex(9.0, 2.0)
 
 a = MySlowComplex2(1.0,1.0)
 b = MySlowComplex2(2.0,1.0)
 @btime g(a,b)
 
-871.875 ns (14 allocations: 288 bytes)
+931.034 ns (14 allocations: 288 bytes)
 MySlowComplex2(9.0, 2.0)
 

Note on Julia

Note that, because of these properties (type specialization, value types, etc.), the number types, even ones such as Int, Float64, and Complex, are all themselves implemented in pure Julia! Thus even basic pieces can be implemented in Julia with full performance, given one uses the features correctly.

Note on isbits

Note that a type which is a mutable struct will not be isbits. This means that a mutable struct will be a pointer to a heap-allocated object, unless it's short-lived and the compiler can elide its construction. Also note that isbits types compile down to bit operations in pure Julia, which means that these types can directly compile to GPU kernels through CUDAnative without modification.
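
As a quick check of the isbits distinction, here is a small sketch (the struct names are made up for illustration):

 struct ImmutablePoint        # immutable with isbits fields => isbits type
     x::Float64
     y::Float64
 end
 mutable struct MutablePoint  # mutable => a heap-allocated reference type
     x::Float64
     y::Float64
 end
 isbitstype(ImmutablePoint)        # true
 isbitstype(MutablePoint)          # false
 isbits(ImmutablePoint(1.0,2.0))   # true: the value is just a bag of bits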

Function Barriers

Since functions automatically specialize on their input types in Julia, we can use this to our advantage in order to make an inner loop fully inferred. For example, take the code from above but with a loop:

 function r(x)
@@ -872,7 +872,7 @@
 end
 @btime r(x)
 
-6.140 μs (300 allocations: 4.69 KiB)
+6.725 μs (300 allocations: 4.69 KiB)
 604.0
 

Here, the loop variables are not inferred and thus this is really slow. However, we can force a function call in the middle so that specialization kicks in and the inner loop becomes type-stable:

 s(x) = _s(x[1],x[2])
@@ -888,7 +888,7 @@
 end
 @btime s(x)
 
-309.829 ns (1 allocation: 16 bytes)
+332.200 ns (1 allocation: 16 bytes)
 604.0
 

Notice that this algorithm still doesn't infer:

 @code_warntype s(x)
@@ -920,7 +920,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `fff`
-define i64 @julia_fff_3875(i64 signext %0) #0 {
+define i64 @julia_fff_3871(i64 signext %0) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 8 within `fff`
@@ -934,7 +934,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `fff`
-define double @julia_fff_3877(double %0) #0 {
+define double @julia_fff_3873(double %0) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 8 within `fff`
@@ -949,7 +949,7 @@
   C[i,j] = A[i,j] + B[i,j]
 end
 
-804.197 μs (30000 allocations: 468.75 KiB)
+1.011 ms (30000 allocations: 468.75 KiB)
 

This is very slow because the types of A, B, and C cannot be inferred. Why can't they be inferred? At any time in the dynamic REPL scope, one could do something like C = "haha now a string!", so the compiler cannot specialize on the types currently existing in the REPL (since asynchronous changes could also occur), and therefore it defaults back to doing a type check at every single function call, which slows it down. The moral of the story: Julia functions are fast, but the global scope is too dynamic to be optimized.
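
As a sketch of the fix, wrapping the same loop in a function (so that the argument types are known when the function is compiled) removes the per-operation type checks; the name add_columns! here is illustrative:

 function add_columns!(C,A,B)
     for j in 1:100, i in 1:100
         C[i,j] = A[i,j] + B[i,j]
     end
     C
 end
 A = rand(100,100); B = rand(100,100); C = similar(A)
 using BenchmarkTools
 @btime add_columns!($C,$A,$B)  # microseconds and zero allocations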

Summary

  • Julia is not fast because of its JIT, it's fast because of function specialization and type inference

  • Type stable functions allow inference to fully occur

  • Multiple dispatch works within the function specialization mechanism to create overhead-free compile time controls

  • Julia will specialize the generic functions

  • Making sure values are concretely typed in inner loops is essential for performance

Overheads of Individual Operations

Now let's dig even a little deeper. Everything the processor does has a cost. A great chart to keep in mind is this classic one. A few things should immediately jump out to you:

  • Simple arithmetic, like floating point additions, is super cheap: ~1 clock cycle, i.e. well under a nanosecond on a modern multi-GHz CPU.

  • Processors do branch prediction on if statements. If the code goes down the predicted route, the if statement costs ~1-2 clock cycles. If it goes down the wrong route, then it will take ~10-20 clock cycles. This means that predictable branches, like ones with clear patterns or usually the same output, are much cheaper (almost free) than unpredictable branches (a small sketch of this effect follows this list).

  • Function calls are expensive: 15-60 clock cycles!

  • Reads from main memory (RAM) are very expensive, while reads from the lower cache levels are much less expensive.
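
As an illustration of the branch prediction point above, here is a hedged sketch: the same loop run over sorted data (predictable branches) and shuffled data (unpredictable branches). Note that LLVM may turn this simple branch into a branchless select, in which case the two timings will match; the effect is machine- and compiler-dependent.

 using BenchmarkTools, Random
 function count_big(xs)
     c = 0
     @inbounds for x in xs
         if x > 0.5       # a data-dependent branch
             c += 1
         end
     end
     c
 end
 xs = rand(10^6)
 @btime count_big($(sort(xs)))     # branch outcomes are highly predictable
 @btime count_big($(shuffle(xs)))  # branch outcomes are essentially random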

Bounds Checking

Let's check the LLVM IR on one of our earlier loops:

 function inner_noalloc!(C,A,B)
   for j in 1:100, i in 1:100
@@ -961,7 +961,7 @@
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `inner_noalloc!`
-define nonnull {}* @"japi1_inner_noalloc!_3886"({}* %0, {}** noalias nocapt
+define nonnull {}* @"japi1_inner_noalloc!_3882"({}* %0, {}** noalias nocapt
 ure noundef readonly %1, i32 %2) #0 {
 top:
   %3 = alloca {}**, align 8
@@ -1104,7 +1104,7 @@
   br i1 %.not18, label %L36, label %L2
 
 L36:                                              ; preds = %L25
-  ret {}* inttoptr (i64 139639611547656 to {}*)
+  ret {}* inttoptr (i64 139979581812744 to {}*)
 
 oob:                                              ; preds = %L5.us.us.postl
 oop, %L2.split.us.L2.split.us.split_crit_edge, %L2
@@ -1229,17 +1229,17 @@
 end
 @btime inner_noalloc!(C,A,B)
 
-7.600 μs (0 allocations: 0 bytes)
+8.500 μs (0 allocations: 0 bytes)
 
 @btime inner_noalloc_ib!(C,A,B)
 
-5.016 μs (0 allocations: 0 bytes)
+5.450 μs (0 allocations: 0 bytes)
 

SIMD

Now let's inspect the LLVM IR again:

 @code_llvm inner_noalloc_ib!(C,A,B)
 
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `inner_noalloc_ib!`
-define nonnull {}* @"japi1_inner_noalloc_ib!_3922"({}* %0, {}** noalias noc
+define nonnull {}* @"japi1_inner_noalloc_ib!_3918"({}* %0, {}** noalias noc
 apture noundef readonly %1, i32 %2) #0 {
 top:
   %3 = alloca {}**, align 8
@@ -1374,7 +1374,7 @@
   br i1 %.not.not10, label %L36, label %L2
 
 L36:                                              ; preds = %L25
-  ret {}* inttoptr (i64 139639611547656 to {}*)
+  ret {}* inttoptr (i64 139979581812744 to {}*)
 }
 

If you look closely, you will see things like:

%wide.load24 = load <4 x double>, <4 x double> addrspace(13)* %46, align 8
 ; └
@@ -1383,7 +1383,7 @@
 @code_llvm fma(2.0,5.0,3.0)
 
 ;  @ floatfuncs.jl:426 within `fma`
-define double @julia_fma_3923(double %0, double %1, double %2) #0 {
+define double @julia_fma_3919(double %0, double %1, double %2) #0 {
 common.ret:
 ; ┌ @ floatfuncs.jl:421 within `fma_llvm`
    %3 = call double @llvm.fma.f64(double %0, double %1, double %2)
@@ -1430,7 +1430,7 @@ 

Inlining

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 4 within `qinline`
-define double @julia_qinline_3926(double %0, double %1) #0 {
+define double @julia_qinline_3922(double %0, double %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 7 within `qinline`
@@ -1467,17 +1467,17 @@ 

Inlining

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 11 within `qnoinline`
-define double @julia_qnoinline_3928(double %0, double %1) #0 {
+define double @julia_qnoinline_3924(double %0, double %1) #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 14 within `qnoinline`
-  %2 = call double @j_fnoinline_3930(double %0, i64 signext 4) #0
+  %2 = call double @j_fnoinline_3926(double %0, i64 signext 4) #0
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 15 within `qnoinline`
-  %3 = call double @j_fnoinline_3931(i64 signext 2, double %2) #0
+  %3 = call double @j_fnoinline_3927(i64 signext 2, double %2) #0
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 16 within `qnoinline`
-  %4 = call double @j_fnoinline_3932(double %3, double %1) #0
+  %4 = call double @j_fnoinline_3928(double %3, double %1) #0
   ret double %4
 }
 
@@ -1496,7 +1496,7 @@

Inlining

-22.390 ns (1 allocation: 16 bytes)
+27.839 ns (1 allocation: 16 bytes)
 9.0
 
@@ -1508,7 +1508,7 @@

Inlining

-26.004 ns (1 allocation: 16 bytes)
+31.690 ns (1 allocation: 16 bytes)
 9.0
 
@@ -1536,7 +1536,7 @@

Note on Benchmarking

-1.699 ns (0 allocations: 0 bytes)
+1.900 ns (0 allocations: 0 bytes)
 9.0
 
@@ -1553,7 +1553,7 @@

Note on Benchmarking

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture02/optimizing.jmd:
 2 within `cheat`
-define double @julia_cheat_3960() #0 {
+define double @julia_cheat_3956() #0 {
 top:
   ret double 9.000000e+00
 }
@@ -1578,7 +1578,7 @@ 

Discussion Questions

diff --git a/notes/03-Introduction_to_Scientific_Machine_Learning_through_Physics-Informed_Neural_Networks/index.html b/notes/03-Introduction_to_Scientific_Machine_Learning_through_Physics-Informed_Neural_Networks/index.html index 2c7f4797..fe4cad7b 100644 --- a/notes/03-Introduction_to_Scientific_Machine_Learning_through_Physics-Informed_Neural_Networks/index.html +++ b/notes/03-Introduction_to_Scientific_Machine_Learning_through_Physics-Informed_Neural_Networks/index.html @@ -19,11 +19,11 @@ simpleNN(rand(10))
 5-element Vector{Float64}:
- -2.4350849734604516
-  4.348579464751774
- -0.430223629075539
-  1.7699710631897965
- -8.170048057983601
+ -1.3383331584519713
+ -2.39034673226155
+  5.476769960460295
+ -4.383289115950564
+ -8.639151272106952
 

This is our direct definition of a neural network. Notice that we choose to use tanh as our activation function between the layers.

Defining Neural Networks with Flux.jl

One of the main deep learning libraries in Julia is Flux.jl. Flux is an interesting library for scientific machine learning because it is built on top of language-wide automatic differentiation libraries, giving rise to a programming paradigm known as differentiable programming, which means that one can write a program in a manner that it has easily accessible fast derivatives. Because it is built on a differentiable programming base, the underlying functionality is simply standard Julia code.

To learn how to use the library, consult the documentation. A Google search will bring up the Flux.jl Github repository. From there, the blue link on the README brings you to the package documentation. This is common through Julia so it's a good habit to learn!

In the documentation you will find that the way a neural network is defined is through a Chain of layers. A Dense layer is the kind we defined above, which is given by an input size, an output size, and an activation function. For example, the following recreates the neural network that we had above:

 using Flux
 NN2 = Chain(Dense(10 => 32,tanh),
@@ -32,11 +32,11 @@
 NN2(rand(10))
 
 5-element Vector{Float64}:
- -0.29078475578534135
- -0.36515252241118434
-  0.18666332556964638
-  0.27894106921878215
- -0.07838097075616664
+ -0.367383275136521
+  0.09404428112599642
+  0.24593293916044112
+ -0.19206701177467855
+ -0.1616706778424914
 

Notice that Flux.jl as a library is written in pure Julia, which means that every piece of this syntax is just sugar over some Julia code that we can specialize ourselves (this is the advantage of having a language fast enough for the implementation of the library and the use of the library!)

For example, the activation function is just a scalar Julia function. If we wanted to replace it by something like the quadratic function, we can just use an anonymous function to define the scalar function we would like to use:

 NN3 = Chain(Dense(10 => 32,x->x^2),
             Dense(32 => 32,x->max(0,x)),
@@ -44,11 +44,11 @@
 NN3(rand(10))
 
 5-element Vector{Float64}:
- -0.09374245725516782
-  0.36412359730517746
- -0.04188060346619738
- -0.0063866127237920105
- -0.07180727833605427
+ -0.0635693658681912
+  0.10343361412920299
+  0.23396297818125808
+ -0.0655890950970476
+  0.05933084866332009
 

The second activation function there is what's known as a relu. A relu can be good to use because it's an exceptionally fast operation and satisfies a form of the universal approximation theorem (UAT). However, a downside is that its derivative is not continuous, which can impact the numerical properties of some algorithms. It's widely used throughout standard machine learning, but we'll see reasons why it may be disadvantageous in some cases in scientific machine learning.

Digging into the Construction of a Neural Network Library

Again, as mentioned before, this neural network NN2 is simply a function:

 simpleNN(x) = W[3]*tanh.(W[2]*tanh.(W[1]*x + b[1]) + b[2]) + b[3]
 
@@ -96,26 +96,26 @@
 denselayer_f(rand(32))
 
 32-element Vector{Float64}:
-  0.6502745532344867
- -0.2764078547672906
- -0.33756397356649603
-  0.2905378129827325
-  0.4829203037778629
-  0.49259637695301317
- -0.7244575524152954
-  0.6201698112871467
-  0.43852802909767524
- -0.29660901905762604
+  0.5742544502155338
+ -0.33380176214476615
+  0.542810141740359
+  0.14469442202480834
+ -0.29914058859543896
+  0.6716839940944329
+  0.2881809336509002
+  0.5692583753772852
+ -0.5698291775734202
+ -0.6771694953190442
   ⋮
- -0.2999804882299587
-  0.6165336309867477
-  0.29528760817159105
- -0.4957092213910054
-  0.7795838428322517
- -0.4247540893917004
- -0.42115911678801005
- -0.5614876585119257
- -0.48468637337209475
+  0.06555175730264437
+ -0.7822997657577385
+ -0.2509639764291457
+ -0.14439865793581114
+ -0.057541402830386
+  0.5126940597261566
+ -0.6583375064922741
+  0.19901747735535336
+ -0.11878186145036329
 

So okay, Dense objects are just functions that have weight and bias matrices inside of them. Now what does Chain do?

 @which Chain(1,2,3)
 
Chain(xs...) in Flux at /home/runner/.julia/packages/Flux/ZdbJr/src/layers/basic.jl:39

Again, for our explanations here we will look at the slightly simpler code from an earlier version of the Flux package:
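
The heart of that older code is a callable struct that folds the input through a tuple of layers. Here is a minimal sketch in that spirit (MyChain is a stand-in name, not the actual Flux source):

 struct MyChain{T<:Tuple}
     layers::T
 end
 MyChain(layers...) = MyChain(layers)
 applychain(::Tuple{}, x) = x   # no layers left: return the input
 applychain(fs::Tuple, x) = applychain(Base.tail(fs), first(fs)(x))
 (c::MyChain)(x) = applychain(c.layers, x)

 c = MyChain(x -> 2x, x -> x .+ 1)
 c([1.0, 2.0])  # [3.0, 5.0]

Recursing on the tuple of layers (rather than looping over a Vector of arbitrary objects) keeps every intermediate call type-inferable.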

@@ -169,57 +169,53 @@
 loss() = sum(abs2,sum(abs2,NN(rand(10)).-1) for i in 1:100)
 loss()
 
-4550.070342185175
+4762.879890603975
 

This loss function takes 100 random points in $[0,1]^{10}$ and then computes the output of the neural network minus 1 on each of the values, and sums up the squared values (abs2). Why the squared values? This means that every computed loss value is positive, so decreasing the loss means that, on average, our neural network outputs are closer to 1. What are the weights? Since we're using the Flux callable struct style from above, the weights are those inside of the NN chain object, which we can inspect:

 NN[1].weight # The W matrix of the first layer
 
 32×10 Matrix{Float32}:
-  0.206907   -0.293996     0.246735   …  -0.316413    -0.344967    0.140159
- -0.234722    0.0792494   -0.255654       0.304856     0.301727   -0.15234
- -0.131264   -0.300535     0.34379        0.137062    -0.355142    0.292703
- -0.165884   -0.184028    -0.244061      -0.0888335   -0.0501927  -0.364835
-  0.273653   -0.27581     -0.165154       0.107254    -0.0985882  -0.131832
- -0.0837413   0.0814385    0.193289   …   0.114349    -0.0310933   0.343392
- -0.108227   -0.110772     0.155364       0.177525     0.160025   -0.0177611
-  0.131718    0.00724105  -0.223872      -0.00875761   0.112146    0.245469
- -0.173363    0.105232    -0.331788       0.224498     0.0817328   0.163695
- -0.129662    0.0193645    0.225084      -0.131568     0.124624    0.106667
-  ⋮                                   ⋱                           
-  0.366625    0.215874    -0.284587       0.257149     0.181714   -0.244675
-  0.134637   -0.280037     0.24618       -0.276576     0.0496992   0.0262466
-  0.0804886  -0.0138646    0.056448   …  -0.336282    -0.244829   -0.33495
- -0.0437822  -0.260398    -0.190927       0.287319    -0.192932    0.053829
- -0.0672412   0.283508     0.192685      -0.105615    -0.115523    0.0374398
-  0.0563317  -0.317537     0.356511       0.136938     0.349309   -0.187046
-  0.266178    0.125742     0.0387179     -0.322464     0.10805     0.268939
- -0.14143     0.315573     0.308718   …   0.357034    -0.30481     0.0634837
- -0.293272    0.181026    -0.101116      -0.135126    -0.249626    0.0221629
+ -0.042315   -0.152674    0.248828   -0.00480644  …  -0.165152   -0.0683321
+  0.0196488  -0.355363   -0.0511883  -0.275463       -0.345056    0.247659
+  0.0696147   0.198262   -0.0505652   0.208592        0.349459    0.160529
+  0.229841    0.181453   -0.206514   -0.165194       -0.262046   -0.123437
+ -0.298059   -0.320777    0.161661   -0.0647406       0.293637   -0.325253
+ -0.28259    -0.0159905  -0.227372   -0.26533     …  -0.159701    0.215338
+ -0.0417327   0.0246552   0.282349   -0.0145864      -0.277505    0.0595062
+  0.106574    0.0655952   0.11508    -0.0328105      -0.185263    0.12242
+  0.316739   -0.147705    0.088275   -0.0220919      -0.180979   -0.24828
+  0.223117   -0.0728504   0.0867307  -0.349231        0.223369   -0.147801
+  ⋮                                               ⋱              
+  0.156      -0.302055    0.14692    -0.189734        0.363814    0.366614
+  0.31773     0.32263    -0.375842   -0.306479        0.26718     0.257999
+  0.0447154   0.0959551  -0.0754782   0.32307     …   0.0750746   0.320953
+ -0.120086   -0.0858004  -0.29501    -0.0595151       0.278634    0.225435
+ -0.0773527   0.0215229   0.276879   -0.297569        0.179297    0.00241095
+ -0.242057    0.116857   -0.0286312   0.12309         0.342328   -0.0437305
+ -0.0715863  -0.0341185   0.0922671  -0.0341884      -0.0671414  -0.0955032
+ -0.118132    0.0959598   0.148726    0.331655    …  -0.0611473   0.377659
+ -0.36258     0.312327   -0.368223    0.118171       -0.28799    -0.102674
 

Now let's grab all of the parameters together:

 p = Flux.params(NN)
 
-Params([Float32[0.20690748 -0.2939962 … -0.34496742 0.14015937; -0.23472181
- 0.07924941 … 0.30172688 -0.15234008; … ; -0.14143002 0.31557292 … -0.30481
-017 0.06348374; -0.2932724 0.18102595 … -0.24962649 0.022162892], Float32[0
-.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.0, 0.0
-, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.03256267 -0.13882013 … 0.19279614 0.
-05722884; -0.0030363456 0.07537417 … 0.022959018 -0.2238231; … ; 0.29075414
- 0.27948645 … -0.20741504 0.052605283; 0.21510808 -0.21531041 … -0.28893065
- -0.087447755], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …
-  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.03591951 0.0
-95983386 … 0.3207747 -0.08298867; -0.10454989 0.12266854 … 0.25205672 -0.02
-537506; … ; 0.12059488 0.0905982 … 0.059288114 0.2102544; 0.07917695 -0.169
-73555 … -0.2598032 -0.35143456], Float32[0.0, 0.0, 0.0, 0.0, 0.0]])
+Params([Float32[-0.042314984 -0.15267445 … -0.16515188 -0.06833213; 0.01964
+8807 -0.35536322 … -0.34505555 0.2476594; … ; -0.11813201 0.09595979 … -0.0
+61147317 0.37765864; -0.36258012 0.31232706 … -0.28799024 -0.10267412], Flo
+at32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  0.0, 0.0, 0.0, 0.
+0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[0.20425878 -0.076422386 … 0.08237
+911 -0.27405605; -0.28111216 0.16626641 … -0.29174885 -0.16367172; … ; 0.27
+852747 -0.23803714 … 0.2357961 0.14744177; 0.26311168 -0.2878293 … -0.03472
+6076 0.2109272], Float32[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  
+…  0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], Float32[-0.104982175 
+-0.12251709 … -0.17302983 -0.039281193; 0.3006238 0.08044029 … 0.017479276 
+0.10239558; … ; 0.1948438 -0.26296428 … -0.2930175 0.0440955; 0.02304553 -0
+.34583634 … 0.19846451 0.38075408], Float32[0.0, 0.0, 0.0, 0.0, 0.0]])
 

That's a helper function on Chain which recursively gathers all of the defining parameters. Let's now find the optimal values p which cause the neural network to be the constant 1 function:

 Flux.train!(loss, p, Iterators.repeated((), 10000), ADAM(0.1))
 

Now let's check the loss:

 loss()
 
-6.637833380399612e-5
+5.824328915289363e-9
 

This means that NN(x) is now a very good function approximator to f(x) = ones(5)!

So Why Machine Learning? Why Neural Networks?

All we did was find parameters that made NN(x) act like a function f(x). How does that relate to machine learning? Well, in any case where one is acting on data (x,y), the idea is to assume that there exists some underlying mathematical model f(x) = y. If we had perfect knowledge of what f is, then from only the information of x we can then predict what y would be. The inference problem is to then figure out what function f should be. Therefore, machine learning on data is simply this problem of finding an approximator to some unknown function!

So why neural networks? Neural networks satisfy two properties. The first of which is known as the Universal Approximation Theorem (UAT), which in simple non-mathematical language means that, for any ϵ of accuracy, if your neural network is large enough (has enough layers, the weight matrices are large enough), then it can approximate any (nice) function f within that ϵ. Therefore, we can reduce the problem of finding missing functions, the problem of machine learning, to a problem of finding the weights of neural networks, which is a well-defined mathematical optimization problem.

Why neural networks specifically? That's a fairly good question, since there are many other functions with this property. For example, you will have learned from analysis that polynomials $a_0 + a_1 x + a_2 x^2 + \ldots$ can be used to approximate any analytic function (this is the Taylor series). Similarly, a Fourier series

\[ f(x) = a_0 + \sum_k b_k \cos(kx) + c_k \sin(kx) \]

can approximate any continuous function f (discontinuous functions can also converge in various senses; these are the details of a harmonic analysis course).

That's all for one dimension. How about two dimensional functions? It turns out it's not difficult to prove that tensor products of universal approximators will give higher dimensional universal approximators. So for example, tensoring together two polynomials:

\[ a_0 + a_1 x + a_2 y + a_3 x y + a_4 x^2 y + a_5 x y^2 + a_6 x^2 y^2 + \ldots \]

will give a two-dimensional function approximator. But notice how we have to resolve every combination of terms. This means that if we use n coefficients in each of d dimensions, building a d-dimensional universal approximator out of one-dimensional objects requires $n^d$ coefficients. This exponential growth is known as the curse of dimensionality.
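
To get a feel for these numbers (a rough back-of-the-envelope count, not a statement about any particular basis):

 n, d = 10, 10  # 10 basis functions per dimension, 10 dimensions
 n^d            # 10_000_000_000 coefficients in the tensor-product approximator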

The second property of neural networks that makes them applicable to machine learning is that they overcome the curse of dimensionality. The proofs in this area can be a little difficult to parse, but what they boil down to is showing, in many cases, that the size of the neural network required to sufficiently approximate a d-dimensional function grows as a polynomial of d, rather than exponentially. This means that there's some dimensional cutoff where for $d > \text{cutoff}$ it is more efficient to use a neural network. This can be problem-specific, but generally it tends to be the case at least by 8 or 10 dimensions.

Neural networks have a few other properties to consider as well:

  1. The assumptions of the neural network can be encoded into the neural architectures. A neural network where the last layer has an activation function x->x^2 is a neural network where all outputs are nonnegative (a small sketch of this appears after this list). This means that if you want to find a nonnegative function, you can make the optimization easier by enforcing this constraint. A lot of other constraints can be enforced: tanh activation functions can make the neural network a smooth (all derivatives finite) function, while other activations can allow finite numbers of learnable discontinuities.

  2. Generating higher dimensional forms from one dimensional forms does not have good symmetry. For example, the two-dimensional tensor Fourier basis does not have a good way to represent $\sin(xy)$. This property of the approximator is called (non)isotropy, and more detail can be found in this wonderful talk about function approximation for multidimensional integration (cubature). Neural networks are naturally not aligned to a basis.

  3. Neural networks are "easy" to compute. There's good software for them, GPU-acceleration, and all other kinds of tooling that make them particularly simple to use.

  4. There are proofs that in many scenarios for neural networks the local minima are the global minima, meaning that local optimization is sufficient for training a neural network. Global optimization (which we will cover later in the course) is much more expensive than local methods like gradient descent, and thus this can be a good property to abuse for faster computation.
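
As a small sketch of point 1 above (layer sizes are arbitrary, chosen only for illustration): squaring in the last layer forces every output to be nonnegative, no matter what the weights are.

 using Flux
 NNpos = Chain(Dense(10 => 32, tanh),
               Dense(32 => 5, x -> x^2))  # last activation squares the outputs
 all(NNpos(rand(10)) .>= 0)  # true for any input and any parameters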

From Machine Learning to Scientific Machine Learning: Structure and Science

This understanding of a neural network and their libraries directly bridges to the understanding of scientific machine learning and the computation done in the field. In scientific machine learning, neural networks and machine learning are used as the basis to solve problems in scientific computing. Scientific computing, as a discipline also known as Computational Science, is a field of study which focuses on scientific simulation, using tools such as differential equations to investigate physical, biological, and other phenomena.

What we wish to do in scientific machine learning is use these properties of neural networks to improve the way that we investigate our scientific models.

Aside: Why Differential Equations?

Why do differential equations come up so often as the model in the scientific context? This is a deep question with quite a simple answer. Essentially, all scientific experiments have to test how things change. For example, you take a system now, you change it, and your measurement is how the changes you made caused changes in the system. This boils down to gathering information about how, for some arbitrary system $y = f(x)$, $\Delta x$ is related to $\Delta y$. Thus what you learn from scientific experiments, what is codified as scientific laws, is not "the answer", but the answer to how things change. Writing down equations by precisely describing how things change gives differential equations.

Solving ODEs with Neural Networks: The Physics-Informed Neural Network

Now let's get to our first true SciML application: solving ordinary differential equations with neural networks. The process of solving a differential equation with a neural network, or using a differential equation as a regularizer in the loss function, is known as a physics-informed neural network, since this allows for physical equations to guide the training of the neural network in circumstances where data might be lacking.

Background: A Method for Solving Ordinary Differential Equations with Neural Networks

This is a result first due to Lagaris et al. from 1998. The idea is to solve differential equations using neural networks by representing the solution by a neural network and training the resulting network to satisfy the conditions required by the differential equation.

Let's say we want to solve a system of ordinary differential equations

\[ u' = f(u,t) \]

with $t \in [0,1]$ and a known initial condition $u(0)=u_0$. To solve this, we approximate the solution by a neural network:

\[ NN(t) \approx u(t) \]

If $NN(t)$ was the true solution, then it would hold that $NN'(t) = f(NN(t),t)$ for all $t$. Thus we turn this condition into our loss function. This motivates the loss function:

\[ L(p) = \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

The choice of $t_i$ could be done in many ways: it can be random, it can be a grid, etc. In any case, when this loss function is minimized (gradients computed with standard reverse-mode automatic differentiation), we have that $\frac{dNN(t_i)}{dt} \approx f(NN(t_i),t_i)$ and thus $NN(t)$ approximately solves the differential equation.

Note that we still have to handle the initial condition. One simple way to do this is to add an initial condition term to the cost function. This would look like:

\[ L(p) = (NN(0) - u_0)^2 + \sum_i \left(\frac{dNN(t_i)}{dt} - f(NN(t_i),t_i) \right)^2 \]

While that would work, it can be more efficient to encode the initial condition into the function itself so that it's trivially satisfied for any possible set of parameters. For example, instead of directly using a neural network, we can use:

\[ g(t) = u_0 + tNN(t) \]

as our solution. Notice that $g(t)$ is then a universal approximator for all continuous functions such that $g(0)=u_0$ (this is a property one should prove!). Since $g(t)$ always satisfies the initial condition, we can train $g(t)$ to satisfy the differential equation itself, and then it is automatically a solution with the correct initial condition. In this sense, we can use the loss function:

\[ L(p) = \sum_i \left(\frac{dg(t_i)}{dt} - f(g(t_i),t_i) \right)^2 \]

where $p$ are the parameters that define $g$, which in turn are the parameters which define the neural network $NN$ that define $g$. Thus this reduces down, once again, to simply finding weights which minimize a loss function!

Coding Up the Method

Now let's implement this method with Flux. Let's define a neural network to be the NN(t) above. To make the problem easier, let's look at the ODE:

\[ u' = \cos 2\pi t \]

and approximate it with the neural network from a scalar to a scalar:

 using Flux
 NNODE = Chain(x -> [x], # Take in a scalar and transform it into an array
@@ -228,7 +224,7 @@
            first) # Take first value, i.e. return a scalar
 NNODE(1.0)
 
-0.12358128803653629
+-0.07905010348616885
 

Instead of using the neural network directly as the approximate solution, we will use the transformed function that is forced to satisfy the initial condition. Using u0=1.0, we have the function:

 g(t) = t*NNODE(t) + 1f0
 
@@ -252,23 +248,23 @@
 display(loss())
 Flux.train!(loss, Flux.params(NNODE), data, opt; cb=cb)
 
-0.5292029324358418
-0.4932957105006129
-0.4516506740086385
-0.3061094653058115
-0.07083152746013988
-0.011665014535397188
-0.006347690360824062
-0.005509980046802497
-0.005214355730566553
-0.0049850620896091675
-0.004787041823734979
+0.5178025866113664
+0.5003380507674955
+0.4817250368610114
+0.4072179526120902
+0.1709448358222078
+0.020230873113184646
+0.005706788834646396
+0.004160672851654196
+0.0038591849808713866
+0.003736976081002845
+0.003611986041281208
 

How well did this do? Well if we take the integral of both sides of our differential equation, we see it's fairly trivial:

\[ \int g' = g = \int \cos 2\pi t = C + \frac{\sin 2\pi t}{2\pi} \]

where we defined $C = 1$. Let's take a bunch of (input,output) pairs from the neural network and plot it against the analytical solution to the differential equation:

 using Plots
 t = 0:0.001:1.0
 plot(t,g.(t),label="NN")
 plot!(t,1.0 .+ sin.(2π.*t)/2π, label = "True Solution")
We see that it matches very well, and we can keep improving this fit by increasing the size of the neural network, using more training points, and training for more iterations.

Example: Harmonic Oscillator Informed Training

Using this idea, differential equations encoding physical laws can be utilized inside of loss functions for terms which we have some basis to believe should approximately follow some physical system. Let's investigate this last step by looking at how to inform the training of a neural network using the harmonic oscillator.

Let's assume that we are taking measurements of (position,force) in some real one-dimensional spring pushing and pulling against a wall.

But instead of the simple spring, let's assume we had a more complex spring; for example, let's say $F(x) = -kx + 0.1\sin(x)$, where this extra term is due to some deformities in the metal (assume mass=1). Then by Newton's law of motion we have a second order ordinary differential equation:

\[ x'' = -kx + 0.1 \sin(x) \]

We can use the DifferentialEquations.jl package to solve this differential equation and see what this system looks like:

 using DifferentialEquations
 k = 1.0
 force(dx,x,k,t) = -k*x + 0.1sin(x)
@@ -305,7 +301,7 @@
 loss() = sum(abs2,NNForce(position_data[i]) - force_data[i] for i in 1:length(position_data))
 loss()
 
-0.004090519383696386
+0.0010967192704571457
 

Our random parameters do not do so well, so let's train!

 opt = Flux.Descent(0.01)
 data = Iterators.repeated((), 5000)
@@ -319,24 +315,24 @@
 display(loss())
 Flux.train!(loss, Flux.params(NNForce), data, opt; cb=cb)
 
-0.004090519383696386
-0.003165046027128867
-0.002641747350681186
-0.0022070107178882607
-0.0018447603635215674
-0.0015422621040588331
-0.0012892905773686783
-0.001077546330758725
-0.0009002280237856551
-0.0007517186517138613
-0.000627355617761891
+0.0010967192704571457
+0.0008617590463802304
+0.0007290146954419487
+0.0006163647557874953
+0.0005208026932904165
+0.00043977568374691406
+0.00037111415967897563
+0.0003129703126082548
+0.0002637671607014535
+0.00022216044374859072
+0.00018700348642827335
 

The neural network almost exactly matched the dataset, but how well did it actually learn the real force function? Let's plot it to see:

 learned_force_plot = NNForce.(positions_plot)
 
 plot(plot_t,force_plot,xlabel="t",label="True Force")
 plot!(plot_t,learned_force_plot,label="Predicted Force")
 scatter!(t,force_data,label="Force Measurements")
Ouch. The problem is that a neural network can approximate any function, so it approximated a function that fits the data, but not the correct function. We somehow need to have more data... but where can we get more data?

Well, even a first year undergrad in physics will know Hooke's law, which is that the idealized spring should satisfy $F(x) = -kx$. This is a decent assumption for the evolution of the system:

 force2(dx,x,k,t) = -k*x
 prob_simplified = SecondOrderODEProblem(force2,1.0,0.0,(0.0,10.0),k)
 sol_simplified = solve(prob_simplified)
@@ -347,7 +343,7 @@
 loss_ode() = sum(abs2,NNForce(x) - (-k*x) for x in random_positions)
 loss_ode()
 
-14.500465286856848
+6.600899833173732
 

If this term is zero, then $F(x) = -kx$, which is approximately true. So now let's put these together:

 λ = 0.1
 composed_loss() = loss() + λ*loss_ode()
@@ -372,15 +368,15 @@
 plot!(plot_t,learned_force_plot,label="Predicted Force")
 scatter!(t,force_data,label="Force Measurements")
 
-1.4506738843034468
-0.0006670716572459478
-0.000629829423817155
-0.000596875962324285
-0.000567402212010099
-0.0005408107593652083
-0.0005166549010322651
-0.0004945863481240524
-0.00047432515093941667
-0.00045564658413633173
-0.0004383653530088632
-

\ No newline at end of file
+0.6602769868038015
+0.0008962173483139916
+0.0008395519776902406
+0.0007893063297842377
+0.0007444563419638081
+0.0007041903913019822
+0.0006678489876113494
+0.000634894748184695
+0.0006048844663758801
+0.0005774476005671365
+0.0005522700329561084
+

And there we go: we have used knowledge of physics to help inform our neural network training process!

Conclusion

In this lecture we motivated machine learning not as a process of predicting from data but as a process for learning arbitrary nonlinear functions. Neural networks were just one choice of possible function. We then demonstrated how differential equations could be solved using this function approximation technique and then put together these two domains, solving differential equations and approximating data, into a single process to allow for physical knowledge to be embedded into the training process of a neural network, thus arriving at a physics-informed neural network. This is just one method in scientific machine learning which we will be exploring in more detail, demonstrating how we can utilize scientific knowledge to improve fits and allow for data-efficient machine learning.

\ No newline at end of file diff --git a/notes/04-How_Loops_Work-An_Introduction_to_Discrete_Dynamics/index.html b/notes/04-How_Loops_Work-An_Introduction_to_Discrete_Dynamics/index.html index 1326e3a5..7922ae64 100644 --- a/notes/04-How_Loops_Work-An_Introduction_to_Discrete_Dynamics/index.html +++ b/notes/04-How_Loops_Work-An_Introduction_to_Discrete_Dynamics/index.html @@ -113,7 +113,7 @@ end @time solve_system_save(lorenz,[1.0,0.0,0.0],p,1000)
-0.000072 seconds (1.00 k allocations: 86.062 KiB)
+0.000075 seconds (1.00 k allocations: 86.062 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -138,7 +138,7 @@
 
 @time solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-0.021052 seconds (4.63 k allocations: 344.371 KiB, 99.51% compilation tim
+0.019902 seconds (4.63 k allocations: 344.371 KiB, 99.49% compilation tim
 e)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
@@ -164,7 +164,7 @@
 

The first timing includes Julia compiling the function; the second is a straight call.

 @time solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-0.000108 seconds (1.01 k allocations: 99.984 KiB)
+0.000122 seconds (1.01 k allocations: 99.984 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -190,7 +190,7 @@
 using BenchmarkTools
 @btime solve_system_save(lorenz,[1.0,0.0,0.0],p,1000)
 
-39.100 μs (1001 allocations: 86.06 KiB)
+43.100 μs (1001 allocations: 86.06 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -215,7 +215,7 @@
 
 @btime solve_system_save_push(lorenz,[1.0,0.0,0.0],p,1000)
 
-51.200 μs (1006 allocations: 99.98 KiB)
+52.500 μs (1006 allocations: 99.98 KiB)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -248,7 +248,7 @@
 end
 @btime solve_system_save_matrix(lorenz,[1.0,0.0,0.0],p,1000)
 
-78.799 μs (2001 allocations: 179.66 KiB)
+98.700 μs (2001 allocations: 179.66 KiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -265,7 +265,7 @@
 end
 @btime solve_system_save_matrix_view(lorenz,[1.0,0.0,0.0],p,1000)
 
-49.800 μs (1002 allocations: 101.61 KiB)
+58.601 μs (1002 allocations: 101.61 KiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -282,7 +282,7 @@
 end
 @btime solve_system_save_matrix_resize(lorenz,[1.0,0.0,0.0],p,1000)
 
-2.913 ms (2318 allocations: 11.65 MiB)
+2.857 ms (2318 allocations: 11.65 MiB)
 3×1000 Matrix{Float64}:
  1.0  0.8   0.752    0.80096   0.920338   …   1.98201    1.67886    1.4744
  0.0  0.56  0.9968   1.39785   1.81805        0.466287   0.656559   0.85300
@@ -388,7 +388,7 @@
 

which would compute f and then take the values of du and update u with them, but that's three more operations than required, whereas u,du = du,u changes u to point to the updated memory and leaves du as an "empty" cache array that we can refill (this decreases the computational cost by ~33%). Let's see what the cost is with this newest version:

 @btime solve_system(lorenz,[1.0,0.0,0.0],p,1000)
 
-37.600 μs (1000 allocations: 78.12 KiB)
+42.501 μs (1000 allocations: 78.12 KiB)
 3-element Vector{Float64}:
   1.4744010677851374
   0.8530017039412324
@@ -396,7 +396,7 @@
 
 @btime solve_system_mutate(lorenz,[1.0,0.0,0.0],p,1000)
 
-6.775 μs (3 allocations: 240 bytes)
+8.067 μs (3 allocations: 240 bytes)
 3-element Vector{Float64}:
   1.4744010677851374
   0.8530017039412324
@@ -445,7 +445,7 @@
 
 @btime solve_system_save(lorenz,@SVector[1.0,0.0,0.0],p,1000)
 
-6.825 μs (2 allocations: 23.48 KiB)
+8.200 μs (2 allocations: 23.48 KiB)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -511,7 +511,7 @@
 
 @btime solve_system_save(lorenz,@SVector[1.0,0.0,0.0],p,1000)
 
-6.020 μs (2 allocations: 23.48 KiB)
+6.600 μs (2 allocations: 23.48 KiB)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -536,7 +536,7 @@
 

And we can get down to non-allocating for the loop:

 @btime solve_system(lorenz,@SVector([1.0,0.0,0.0]),p,1000)
 
-5.167 μs (1 allocation: 32 bytes)
+5.700 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
   1.4744010677851374
   0.8530017039412324
@@ -552,7 +552,7 @@
 u = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000)
 @btime solve_system_save!(u,lorenz,@SVector([1.0,0.0,0.0]),p,1000)
 
-5.350 μs (0 allocations: 0 bytes)
+6.360 μs (0 allocations: 0 bytes)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -574,4 +574,4 @@
  [1.9820054139405763, 0.46628657468365653, 22.964748583050085]
  [1.6788616460891923, 0.6565587545689172, 21.758445642263496]
  [1.4744010677851374, 0.8530017039412324, 20.62004063423844]
It is important to note that this single allocation does not seem to affect the timing of the result in this case when run serially. However, when parallelism or embedded applications get involved, this can be a significant effect.

Discussion Questions

  1. What are some ways to compute steady states? Periodic orbits?

  2. When using the mutating algorithms, what are the data dependencies between different solves if they were to happen simultaneously?

  3. We saw that there is a connection between delayed systems and multivariable systems. How deep does that go? Is every delayed system also a multivariable system and vice versa? Is this a useful idea to explore?

\ No newline at end of file diff --git a/notes/05-The_Basics_of_Single_Node_Parallel_Computing/index.html b/notes/05-The_Basics_of_Single_Node_Parallel_Computing/index.html index 92f46e3b..f492b307 100644 --- a/notes/05-The_Basics_of_Single_Node_Parallel_Computing/index.html +++ b/notes/05-The_Basics_of_Single_Node_Parallel_Computing/index.html @@ -27,7 +27,7 @@ u = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000) @btime solve_system_save!(u,lorenz,@SVector([1.0,0.0,0.0]),p,1000)
-4.750 μs (0 allocations: 0 bytes)
+6.580 μs (0 allocations: 0 bytes)
 1000-element Vector{SVector{3, Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -74,7 +74,7 @@
 u = [Vector{Float64}(undef,3) for i in 1:1000]
 @btime solve_system_save_iip!(u,lorenz!,[1.0,0.0,0.0],p,1000)
 
-6.460 μs (1 allocation: 80 bytes)
+8.400 μs (1 allocation: 80 bytes)
 1000-element Vector{Vector{Float64}}:
  [1.0, 0.0, 0.0]
  [0.8, 0.56, 0.0]
@@ -127,7 +127,7 @@
 u = [Vector{Float64}(undef,3) for i in 1:1000]
 @btime solve_system_save_iip!(u,lorenz_mt!,[1.0,0.0,0.0],p,1000);
 
-1.690 ms (6994 allocations: 671.28 KiB)
+2.001 ms (6994 allocations: 671.28 KiB)
 

Parallelism doesn't always make things faster. There are two costs associated with this code. For one, we had to go to the slower heap+mutation version, so its implementation starting point is slower. But secondly, and more importantly, the cost of spinning up a new thread is non-negligible. In fact, here we can see that it even needs to make a small allocation for the new context. The total cost is on the order of 50ns: not huge, but something to take note of. So what we've done is taken almost free calculations and made them cost ~50ns each by running each in a different thread, instead of just having one thread with one call stack.

The moral of the story is that you need to make sure that there's enough work per thread in order to effectively accelerate a program with parallelism.

Data-Parallel Problems

So not every setup is amenable to parallelism. Dynamical systems are notorious for being quite difficult to parallelize because the dependency of the future time step on the previous time step is clear, meaning that one cannot easily "parallelize through time" (though it is possible, which we will study later).

However, one common way that these systems are generally parallelized is in their inputs. The following questions allow for independent simulations:

  • What steady state does an input u0 go to for some list/region of initial conditions?

  • How does the solution vary when I use a different p?

The problem has a few descriptions. For one, it's called an embarrassingly parallel problem since the algorithm can remain largely unchanged while being parallelized. To solve this, we can use the exact same solve_system_save_iip!, and just change how we are calling it. Secondly, this is called a data parallel problem, since it is parallelized by splitting up the input data (here, the possible u0 or ps) and acting on each piece independently.

Multithreaded Parameter Searches

Now let's multithread our parameter search. Let's say we wanted to compute the mean of the values in the trajectory. For a single input pair, we can compute that like:

 using Statistics
 function compute_trajectory_mean(u0,p)
@@ -137,7 +137,7 @@
 end
 @btime compute_trajectory_mean(@SVector([1.0,0.0,0.0]),p)
 
-7.933 μs (3 allocations: 23.52 KiB)
+8.400 μs (3 allocations: 23.52 KiB)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -151,7 +151,7 @@
 end
 @btime compute_trajectory_mean2(@SVector([1.0,0.0,0.0]),p)
 
-6.850 μs (3 allocations: 112 bytes)
+7.950 μs (3 allocations: 112 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -165,7 +165,7 @@
 end
 @btime compute_trajectory_mean3(@SVector([1.0,0.0,0.0]),p)
 
-7.300 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -178,7 +178,7 @@
 compute_trajectory_mean4(u0,p) = _compute_trajectory_mean4(_u_cache,u0,p)
 @btime compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p)
 
-6.775 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -187,50 +187,50 @@
 ps = [(0.02,10.0,28.0,8/3) .* (1.0,rand(3)...) for i in 1:1000]
 
 1000-element Vector{NTuple{4, Float64}}:
- (0.02, 3.5506238069563842, 25.156388865912405, 1.2944671907763334)
- (0.02, 4.905102015647083, 20.798912517739257, 1.42455210067359)
- (0.02, 4.049777415039721, 15.634106342204387, 1.0689250599984783)
- (0.02, 0.617703562486901, 6.650265362134062, 0.9001192510844916)
- (0.02, 7.728479101055119, 4.283032277473355, 0.3814247162007132)
- (0.02, 8.69335001681721, 10.46660511452242, 0.009482281358365558)
- (0.02, 1.3817524075453613, 21.49817664634096, 2.228092866290097)
- (0.02, 3.6730466076349777, 13.370353460873327, 2.6306870107737916)
- (0.02, 4.757259184310949, 3.3848400213645222, 1.155402147677568)
- (0.02, 3.426685349820837, 0.40559752571178276, 2.1334711357331244)
+ (0.02, 3.037620156858795, 14.552719997603571, 1.5510247426254855)
+ (0.02, 4.936696155756138, 20.507022795818497, 2.183378859929042)
+ (0.02, 8.131977339125193, 7.7700882850491215, 0.48969639712480506)
+ (0.02, 1.755851672936397, 6.216686394026905, 1.7054311072661656)
+ (0.02, 8.13099723484067, 0.7588409559739153, 1.085714992546022)
+ (0.02, 1.7427316470389997, 12.0532270800079, 1.8453944444264936)
+ (0.02, 0.7583476249041621, 0.38895059916641284, 1.342153386465824)
+ (0.02, 7.723642173433803, 24.870938762602417, 0.6149070924838658)
+ (0.02, 4.354945546654296, 17.371545500497078, 0.44694951653095283)
+ (0.02, 1.5358900244982832, 9.084952881006156, 0.015530076994568986)
  ⋮
- (0.02, 8.332190592332246, 0.6269286088808745, 0.5604011320047804)
- (0.02, 5.476210596389, 10.370091055241664, 2.630030085241478)
- (0.02, 9.585550213838584, 18.44496697411735, 2.3643328335046734)
- (0.02, 4.343650227443543, 8.928297401416833, 2.1487711036406507)
- (0.02, 4.197305238186817, 10.260410200606863, 0.7413293451348505)
- (0.02, 2.0801528934988553, 9.917167676210104, 2.3442057783156334)
- (0.02, 3.7927870826229384, 24.445671325659124, 1.3108338858226898)
- (0.02, 7.342872394586884, 19.618883234188363, 1.5924503611012586)
- (0.02, 0.5667773787942509, 14.12384188972344, 0.506252200199163)
+ (0.02, 9.4927953946179, 19.43338395266777, 2.417035074069966)
+ (0.02, 1.2325741898934928, 3.235845837538059, 1.9310880212250672)
+ (0.02, 2.283079380936549, 15.943802804296585, 1.046984974419047)
+ (0.02, 3.9404703099197147, 1.1138370645453657, 0.5821002324991641)
+ (0.02, 0.2390881608605666, 16.14427395660405, 2.220052350252045)
+ (0.02, 1.3196303788806696, 24.830822721813846, 0.21431493909461025)
+ (0.02, 3.9863134178487747, 21.062045162769856, 1.8628923280400778)
+ (0.02, 4.425030472036934, 17.732299157775685, 1.0584427777088066)
+ (0.02, 5.923649789057791, 26.526499645471556, 1.5436207861637943)
 

And let's get the mean of the trajectory for each of the parameters.

 serial_out = map(p -> compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p),ps)
 
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Now let's do this with multithreading:

 function tmap(f,ps)
   out = Vector{typeof(@SVector([1.0,0.0,0.0]))}(undef,1000)
@@ -243,26 +243,26 @@
 threaded_out = tmap(p -> compute_trajectory_mean4(@SVector([1.0,0.0,0.0]),p),ps)
 
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Let's check the output:

 serial_out - threaded_out
 
@@ -296,7 +296,7 @@
 end
 @btime compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p)
 
-7.300 μs (1 allocation: 32 bytes)
+7.900 μs (1 allocation: 32 bytes)
 3-element SVector{3, Float64} with indices SOneTo(3):
  -0.3114996234648468
  -0.30974901748976497
@@ -330,53 +330,53 @@
 
 @btime serial_out = map(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-7.912 ms (3 allocations: 23.50 KiB)
+7.906 ms (3 allocations: 23.50 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 
 @btime threaded_out = tmap(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-7.875 ms (9 allocations: 24.12 KiB)
+7.906 ms (9 allocations: 24.12 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Hierarchical Task-Based Multithreading and Dynamic Scheduling

The major change in Julia v1.3 is that Julia's Tasks, which are traditionally its green threads interface, are now the basis of its multithreading infrastructure. This means that independent tasks can be run in parallel, and a new interface for multithreading exists that works by spawning tasks.

This implementation follows Go's goroutines and the classic multithreading interface of Cilk. There is a Julia-level scheduler that handles the multithreading to put different tasks on different vCPU threads. A benefit from this is hierarchical multithreading. Since Julia's tasks can spawn tasks, what can happen is a task creates tasks which create tasks, and so on. In Julia (/Go/Cilk), this is then seen as a single pool of tasks which it can schedule, and thus it will still make sure only N are running at a time (as opposed to the naive implementation where the total number of running threads multiplies at each level of nesting). This is essential for numerical performance because running multiple compute threads on a single CPU thread requires constant context switching between the threads, which will slow down the computations.

To directly use the task-based interface, simply use Threads.@spawn to spawn new tasks. For example:

 function tmap2(f,ps)
   tasks = [Threads.@spawn f(ps[i]) for i in 1:1000]
@@ -385,51 +385,51 @@
 threaded_out = tmap2(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

However, if we check the timing we see:

 @btime tmap2(p -> compute_trajectory_mean5(@SVector([1.0,0.0,0.0]),p),ps)
 
-8.648 ms (6005 allocations: 562.70 KiB)
+8.594 ms (6005 allocations: 562.70 KiB)
 1000-element Vector{SVector{3, Float64}}:
- [1.300771053925321, 1.2535219713772234, 21.921820081229736]
- [1.6308717261795633, 1.6037306854298525, 18.15225347468799]
- [1.7374617596652808, 1.7617277647285448, 13.601745113059051]
- [2.216634823089521, 2.3182370473551135, 5.409143685835168]
- [0.9306213595722924, 0.9315560984103936, 3.037663155931778]
- [0.16460696835639138, 0.15885544565187876, 17.921719196723085]
- [6.597795994389806, 6.806146848303711, 19.81586165018462]
- [5.0545716205734665, 5.123177741500901, 10.877586029568883]
- [1.6226045441626067, 1.6295488156533835, 2.2529107787132308]
- [0.024424343928562804, 0.009833040449289765, 0.0005146829566867803]
+ [-0.8995493442373456, -0.9545822583205469, 11.09763927433081]
+ [-0.2738640024613979, -0.2306122941828055, 16.92499769658805]
+ [0.21179044146551448, 0.20702430632781724, 5.759297153864314]
+ [2.911850465163551, 2.9683134505875706, 4.990515699186825]
+ [0.025022984412806943, 0.01888025302654529, 0.0009455939378620537]
+ [4.40004482736076, 4.500938337735848, 10.610355510012694]
+ [0.10352309459360956, 0.03764493928476319, 0.007957185983071241]
+ [0.25024346906854367, 0.25387664489025796, 24.14794855973064]
+ [0.38592086367787864, 0.3632040150421861, 16.004707391771415]
+ [0.17993985356714867, 0.14738543939576892, 11.347343162572958]
  ⋮
- [0.01597542267866004, 0.009975030135296149, 0.0009292432129033854]
- [4.550653603851785, 4.585199848662789, 8.568748261393232]
- [-1.554071305410045, -1.5875011180422711, 15.820131977465435]
- [3.9090248954460445, 3.9454434221744092, 7.437219335889599]
- [-0.9834730207162191, -0.9596995893927198, 8.307105651389753]
- [4.459468412015161, 4.545330459447452, 8.582678518797346]
- [0.8446566491898391, 0.8163840321561793, 21.572295069890302]
- [1.4417199630874142, 1.4184886481104566, 17.480873491994632]
- [2.5688108553476776, 2.707871623177261, 12.583059217222598]
+ [-0.0840117067670748, -0.0654512792833464, 16.857099406918405]
+ [1.999119397276924, 2.0428444387084164, 2.088984817566904]
+ [-1.6917818513472662, -1.677628394209012, 13.216976250124453]
+ [0.2624261419672769, 0.2530056461259637, 0.10257425482542186]
+ [5.295904745372588, 6.299305056930067, 14.345380093144529]
+ [-0.6455940789619545, -0.654050188127725, 22.961101218013503]
+ [-0.20563814866072042, -0.1853469928300127, 17.308722418188218]
+ [0.8349700059676074, 0.8703348079339471, 15.570853262114188]
+ [0.4191880800406462, 0.4178869606733214, 24.039132485781064]
 

Threads.@threads is built on the same multithreading infrastructure, so why is this so much slower? The reason is that Threads.@threads employs static scheduling while Threads.@spawn is using dynamic scheduling. Dynamic scheduling is the model of allowing the runtime to determine the ordering and scheduling of processes, i.e. what tasks will run where and when. Julia's task-based multithreading system has a thread scheduler which will automatically do this for you in the background, but because this is done at runtime it will have overhead. Static scheduling is the model of pre-determining where and when tasks will run, instead of allowing this to be determined at runtime. Threads.@threads is "quasi-static" in the sense that it cuts the loop so that it spawns only as many tasks as there are threads, essentially assigning one thread to each even chunk of the input data.

Does this lack of runtime overhead mean that static scheduling is "better"? No, it simply has trade-offs. Static scheduling assumes that the runtime of each block is the same. For this specific case where there is a fixed number of loop iterations for the dynamical systems, we know that every compute_trajectory_mean5 call costs exactly the same, and thus this will be more efficient. However, there are many cases where this might not be efficient. For example:

 function sleepmap_static()
   out = Vector{Int}(undef,24)
@@ -489,24 +489,24 @@
 A*B
 
 10000×10000 Matrix{Float64}:
- 2497.67  2491.1   2492.67  2489.33  …  2518.19  2454.25  2494.02  2505.91
- 2500.48  2487.5   2502.05  2504.62     2521.67  2485.92  2484.91  2500.74
- 2505.35  2492.7   2509.1   2517.35     2537.91  2482.31  2503.15  2530.22
- 2511.74  2497.84  2516.63  2507.37     2535.95  2493.89  2496.25  2518.27
- 2503.29  2496.67  2500.11  2503.93     2528.92  2469.13  2502.02  2521.17
- 2505.7   2494.62  2493.32  2504.15  …  2535.61  2482.82  2494.33  2511.56
- 2488.05  2475.19  2502.63  2499.82     2529.41  2471.18  2483.94  2508.2
- 2467.27  2471.25  2466.7   2474.89     2518.44  2446.91  2480.0   2485.43
- 2525.91  2513.54  2537.32  2523.61     2538.24  2503.12  2506.08  2528.71
- 2497.25  2500.94  2520.61  2517.3      2553.96  2483.94  2494.92  2547.87
+ 2495.47  2520.71  2530.22  2524.49  …  2503.61  2535.82  2511.56  2499.67
+ 2464.58  2466.29  2491.42  2449.62     2469.64  2499.86  2479.97  2463.95
+ 2452.32  2465.73  2482.12  2455.61     2460.24  2486.44  2472.71  2476.1
+ 2483.8   2485.67  2517.83  2485.45     2482.03  2513.04  2507.42  2497.89
+ 2485.62  2511.57  2516.88  2490.67     2482.53  2525.14  2514.93  2489.36
+ 2471.15  2479.17  2489.15  2474.51  …  2473.88  2505.56  2483.08  2466.69
+ 2469.96  2478.31  2491.2   2476.9      2480.98  2502.52  2488.43  2471.05
+ 2487.8   2501.18  2517.1   2493.89     2499.57  2517.53  2521.67  2506.1
+ 2512.49  2484.45  2511.12  2500.0      2498.04  2549.49  2534.7   2506.65
+ 2469.53  2475.42  2489.72  2443.06     2462.05  2506.42  2495.04  2473.58
     ⋮                                ⋱                             
- 2495.89  2485.45  2499.62  2504.25     2528.85  2476.86  2493.4   2492.0
- 2525.35  2499.89  2519.73  2515.7      2531.09  2495.28  2519.64  2534.39
- 2506.35  2505.91  2515.78  2527.56     2530.31  2481.43  2504.66  2515.94
- 2504.93  2490.24  2495.28  2521.8      2545.88  2468.61  2483.92  2509.99
- 2500.07  2478.29  2488.61  2500.22  …  2527.18  2488.51  2489.7   2500.94
- 2510.45  2498.04  2516.35  2512.16     2552.08  2484.41  2493.94  2530.27
- 2484.07  2482.82  2482.02  2505.73     2534.99  2455.2   2488.88  2498.46
- 2484.22  2466.34  2483.2   2490.26     2504.3   2463.77  2473.1   2493.58
- 2507.89  2483.82  2477.71  2499.6      2520.03  2463.15  2482.67  2500.59
-

If you are using a computer that has N cores, then this will use N cores. Try it and look at your resource usage!

Array-Based Parallelism

The simplest form of parallelism is array-based parallelism. The idea is that you use some construction of an array whose operations are already designed to be parallel under the hood. In Julia, some examples of this are:

  • DistributedArrays (Distributed Computing)

  • Elemental

  • MPIArrays

  • CuArrays (GPUs)

This is not a Julia specific idea either.
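
As a small illustration (assuming CUDA.jl, the successor to the CuArrays package listed above, is installed and a GPU is available), moving data into a GPU array makes ordinary broadcast operations run in parallel on the device with no other changes to the code:

 using CUDA
 A = CUDA.rand(10_000)
 B = CUDA.rand(10_000)
 C = A .* B .+ 2f0     # the fused broadcast kernel runs in parallel on the GPU
 Array(C)              # copy the result back to the CPU if needed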

BLAS and Standard Libraries

The basic linear algebra calls are all handled by a set of libraries which follow the same interface known as BLAS (Basic Linear Algebra Subroutines). It's divided into 3 portions:

  • BLAS1: Element-wise operations (O(n))

  • BLAS2: Matrix-vector operations (O(n^2))

  • BLAS3: Matrix-matrix operations (O(n^3))

BLAS implementations are highly optimized, like OpenBLAS and Intel MKL, so every numerical language and library essentially uses similar underlying BLAS implementations. Extensions to these, known as LAPACK, include operations like factorizations, and are included in these standard libraries. These are all multithreaded. The reason this is a good level to target is that the operation count is high enough that parallelism can be made efficient even when only targeting this level: a matrix multiplication can take on the order of seconds, minutes, hours, or even days, and these are all highly parallel operations. This means you can get a lot of parallel speedup just by parallelizing at this level, which happens to be a bottleneck for a lot of scientific computing codes.
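
For example, a sketch of hitting this level directly from Julia using LinearAlgebra's BLAS thread controls (available in recent Julia versions; exact timings will depend on your BLAS build and core count):

 using LinearAlgebra
 A = rand(2000, 2000); B = rand(2000, 2000); C = similar(A)
 BLAS.get_num_threads()      # how many threads the BLAS library is currently using
 @time mul!(C, A, B)         # BLAS3 (O(n^3)) matrix-matrix multiply, multithreaded under the hood
 BLAS.set_num_threads(1)     # restrict BLAS to a single thread for comparison
 @time mul!(C, A, B)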

This is also commonly the level at which GPU computing occurs in machine learning libraries for reasons which we will explain later.

MPI

Well, this is a big topic and we'll address this one later!

Conclusion

The easiest forms of parallelism are:

  • Embarrassingly parallel

  • Array-level parallelism (built into linear algebra)

Exploit these when possible.

\ No newline at end of file + 2517.45 2501.76 2535.21 2516.96 2501.37 2534.94 2526.12 2499.41 + 2501.8 2519.93 2519.24 2477.3 2481.52 2531.88 2522.3 2507.0 + 2490.03 2511.51 2519.28 2503.13 2480.05 2524.13 2531.29 2504.18 + 2469.77 2469.62 2506.22 2454.79 2458.06 2492.25 2478.63 2469.99 + 2486.49 2471.17 2492.96 2473.14 … 2465.24 2514.3 2502.17 2485.15 + 2452.35 2472.2 2484.65 2468.31 2463.58 2504.26 2496.91 2463.18 + 2504.85 2501.1 2517.51 2487.68 2487.3 2515.24 2513.68 2502.75 + 2470.61 2495.72 2498.27 2473.8 2469.82 2512.21 2500.38 2480.41 + 2477.44 2484.09 2505.52 2468.3 2479.17 2498.62 2492.54 2463.26 +

If you are using a computer that has N cores, then this will use N cores. Try it and look at your resource usage!

Array-Based Parallelism

The simplest form of parallelism is array-based parallelism. The idea is that you use some construction of an array whose operations are already designed to be parallel under the hood. In Julia, some examples of this are:

  • DistributedArrays (Distributed Computing)

  • Elemental

  • MPIArrays

  • CuArrays (GPUs)

This is not a Julia specific idea either.

BLAS and Standard Libraries

The basic linear algebra calls are all handled by a set of libraries which follow the same interface known as BLAS (Basic Linear Algebra Subroutines). It's divided into 3 portions:

  • BLAS1: Element-wise operations (O(n))

  • BLAS2: Matrix-vector operations (O(n^2))

  • BLAS3: Matrix-matrix operations (O(n^3))

BLAS implementations are highly optimized, like OpenBLAS and Intel MKL, so every numerical language and library essentially uses similar underlying BLAS implementations. Extensions to these, known as LAPACK, include operations like factorizations, and are included in these standard libraries. These are all multithreaded. The reason this is a good level to target is that the operation count is high enough that parallelism can be made efficient even when only targeting this level: a matrix multiplication can take on the order of seconds, minutes, hours, or even days, and these are all highly parallel operations. This means you can get a lot of parallel speedup just by parallelizing at this level, which happens to be a bottleneck for a lot of scientific computing codes.

This is also commonly the level at which GPU computing occurs in machine learning libraries for reasons which we will explain later.

MPI

Well, this is a big topic and we'll address this one later!

Conclusion

The easiest forms of parallelism are:

  • Embarrassingly parallel

  • Array-level parallelism (built into linear algebra)

Exploit these when possible.

\ No newline at end of file diff --git a/notes/06-The_Different_Flavors_of_Parallelism/index.html b/notes/06-The_Different_Flavors_of_Parallelism/index.html index 6fdaa0ba..367e866e 100644 --- a/notes/06-The_Different_Flavors_of_Parallelism/index.html +++ b/notes/06-The_Different_Flavors_of_Parallelism/index.html @@ -9,26 +9,26 @@ arr = [MyComplex(rand(),rand()) for i in 1:100]
 100-element Vector{MyComplex}:
- MyComplex(0.6878206973001487, 0.5944376681381183)
- MyComplex(0.876185978521746, 0.8834270064512657)
- MyComplex(0.7274511955405121, 0.9776333218870947)
- MyComplex(0.7860152352196623, 0.2246266273480778)
- MyComplex(0.42890688640506736, 0.15924501742288877)
- MyComplex(0.01060237947774434, 0.2262353380168034)
- MyComplex(0.7935771560827849, 0.1525403932328463)
- MyComplex(0.16117727903914125, 0.8243946095589878)
- MyComplex(0.5550542564876422, 0.3914851972778649)
- MyComplex(0.1343545153578216, 0.30599273978359387)
+ MyComplex(0.32516364337338777, 0.3550304843026272)
+ MyComplex(0.003144120188337096, 0.163179107928081)
+ MyComplex(0.07235316202378828, 0.5895002453826597)
+ MyComplex(0.2737074271530896, 0.3854369147402019)
+ MyComplex(0.14651058280465834, 0.25703921696007137)
+ MyComplex(0.9246961046559387, 0.3320769495992342)
+ MyComplex(0.7929175766889385, 0.8321792812407953)
+ MyComplex(0.8748941992144769, 0.3582692781165362)
+ MyComplex(0.94807059958605, 0.3693183290045081)
+ MyComplex(0.5393529457015298, 0.7326361499924592)
  ⋮
- MyComplex(0.42776724920957265, 0.7237887225861321)
- MyComplex(0.6342639668615949, 0.2563619040165326)
- MyComplex(0.07057459953946532, 0.3258356216180154)
- MyComplex(0.6917262339464759, 0.1363145249683415)
- MyComplex(0.23408343153685507, 0.8505351210651642)
- MyComplex(0.11196456763380669, 0.4193970173512319)
- MyComplex(0.6501836427783281, 0.058275727876870964)
- MyComplex(0.9806260355655791, 0.7003595452846337)
- MyComplex(0.01632409219317188, 0.9588968220373235)
+ MyComplex(0.5782671062296417, 0.4548938009032666)
+ MyComplex(0.8920956658422236, 0.028109929218517404)
+ MyComplex(0.27794806113432613, 0.9658640245583793)
+ MyComplex(0.6111486171406361, 0.7344804686656914)
+ MyComplex(0.09662923368940446, 0.5548190454939068)
+ MyComplex(0.1827962566614879, 0.41595303387734917)
+ MyComplex(0.5059171681027851, 0.7038191273302745)
+ MyComplex(0.7481225257229924, 0.6801002251820268)
+ MyComplex(0.1179925265666455, 0.30061080588876155)
 

is represented in memory as

[real1,imag1,real2,imag2,...]

while the struct of array formats are

@@ -43,18 +43,18 @@
-MyComplexes([0.5273266334521056, 0.6758175680585644, 0.166526088766354, 0.5
-535370802900049, 0.3706105595339403, 0.41554002467170703, 0.592171034860571
-3, 0.687445572159449, 0.4587923428284365, 0.2896374897304236  …  0.34479599
-43539092, 0.4601670959875128, 0.6729984392350403, 0.11784099516786106, 0.51
-75904664901904, 0.17071237273923245, 0.27028579620359694, 0.129621218441456
-35, 0.8376248105911108, 0.7568834682904622], [0.11843099277515678, 0.964674
-7111296081, 0.7932654641659607, 0.04037040139768633, 0.9438159545817943, 0.
-6272736660119286, 0.37233000892415713, 0.9339530622496862, 0.07322910120678
-683, 0.17352841715253697  …  0.9228774206720473, 0.28315860186168174, 0.788
-710859023613, 0.4871373335900553, 0.3812150269155198, 0.577326908234619, 0.
-7607815057072469, 0.6564758612650347, 0.3478709665140167, 0.327831535892906
-03])
+MyComplexes([0.7164779312690199, 0.5865544146333738, 0.90320198698556, 0.64
+28203752547009, 0.72203667868656, 0.9034505920162977, 0.8682221101684356, 0
+.7993643426972368, 0.10606753677087344, 0.6961824507525881  …  0.0390623120
+0873134, 0.3981890396653336, 0.8166985144117405, 0.7501680127601921, 0.8700
+763078059573, 0.8917956139913009, 0.21038830324164248, 0.26642517407150745,
+ 0.2841261157899654, 0.852858764291031], [0.30436828324609155, 0.3114792567
+638629, 0.32397797517215754, 0.30423234353914086, 0.5030364597671834, 0.284
+27331086202323, 0.38011862192567225, 0.9695695499725508, 0.0536101205990224
+8, 0.7453650930055286  …  0.7174599146800597, 0.3096056720083835, 0.4395626
+8897495865, 0.49109120743236956, 0.6499699516275391, 0.3152078820742673, 0.
+37054301749019425, 0.2592646036335612, 0.5410006630195909, 0.00057106714998
+11193])
 
@@ -73,7 +73,7 @@
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:5 within `average`
-define void @julia_average_13534([2 x double]* noalias nocapture noundef no
+define void @julia_average_13513([2 x double]* noalias nocapture noundef no
 nnull sret([2 x double]) align 8 dereferenceable(16) %0, {}* noundef nonnul
 l align 16 dereferenceable(40) %1) #0 {
 top:
@@ -107,21 +107,21 @@
 
 L8:                                               ; preds = %top
 ; ││││││││││ @ reduce.jl:427 within `_mapreduce`
-            store {}* inttoptr (i64 139639388780752 to {}*), {}** %.sub, al
+            store {}* inttoptr (i64 139979357674032 to {}*), {}** %.sub, al
 ign 8
             %7 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6
 4 1
-            store {}* inttoptr (i64 139639384143408 to {}*), {}** %7, align
+            store {}* inttoptr (i64 139979354801408 to {}*), {}** %7, align
  8
             %8 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6
 4 2
             store {}* %1, {}** %8, align 8
             %9 = getelementptr inbounds [4 x {}*], [4 x {}*]* %2, i64 0, i6
 4 3
-            store {}* inttoptr (i64 139639425355568 to {}*), {}** %9, align
+            store {}* inttoptr (i64 139979385976816 to {}*), {}** %9, align
  8
-            %10 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 1396393902
-75360 to {}*), {}** nonnull %.sub, i32 4, {}* inttoptr (i64 139637665489760
+            %10 = call nonnull {}* @ijl_invoke({}* inttoptr (i64 1399793752
+24496 to {}*), {}** nonnull %.sub, i32 4, {}* inttoptr (i64 139977605545056
  to {}*))
             call void @llvm.trap()
             unreachable
@@ -197,7 +197,7 @@
 L42:                                              ; preds = %L14
 ; ││││││││││ @ reduce.jl:442 within `_mapreduce`
 ; ││││││││││┌ @ reduce.jl:272 within `mapreduce_impl`
-             call void @j_mapreduce_impl_13536([2 x double]* noalias nocapt
+             call void @j_mapreduce_impl_13515([2 x double]* noalias nocapt
 ure noundef nonnull sret([2 x double]) %tmpcast, {}* nonnull %1, i64 signex
 t 1, i64 signext %6, i64 signext 1024) #0
 ; └└└└└└└└└└└
@@ -364,7 +364,7 @@ 

Next Level Up: Multithreading

-178.399 μs (7 allocations: 640 bytes)
+246.699 μs (7 allocations: 640 bytes)
 
@@ -377,7 +377,7 @@

Next Level Up: Multithreading

-180.099 μs (7 allocations: 640 bytes)
+213.501 μs (7 allocations: 640 bytes)
 
@@ -390,7 +390,7 @@

Next Level Up: Multithreading

-57.799 μs (7 allocations: 640 bytes)
+73.001 μs (7 allocations: 640 bytes)
 
@@ -403,7 +403,7 @@

Next Level Up: Multithreading

-2.499 ns (0 allocations: 0 bytes)
+3.500 ns (0 allocations: 0 bytes)
 
@@ -418,17 +418,17 @@

Next Level Up: Multithreading

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:26 within `h`
-define void @julia_h_13637() #0 {
+define void @julia_h_13616() #0 {
 top:
-  %.promoted = load i64, i64* inttoptr (i64 139638901468496 to i64*), align
- 16
+  %.promoted = load i64, i64* inttoptr (i64 139977816649728 to i64*), align
+ 4096
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:28 within `h`
   %0 = add i64 %.promoted, 10000
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:29 within `h`
 ; ┌ @ Base.jl within `setproperty!`
-   store i64 %0, i64* inttoptr (i64 139638901468496 to i64*), align 16
+   store i64 %0, i64* inttoptr (i64 139977816649728 to i64*), align 4096
 ; └
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:30 within `h`
@@ -454,7 +454,7 @@ 

Next Level Up: Multithreading

-2.799 ns (0 allocations: 0 bytes)
+3.200 ns (0 allocations: 0 bytes)
 
@@ -467,13 +467,13 @@

Next Level Up: Multithreading

 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:3 within `h2`
-define void @julia_h2_13644() #0 {
+define void @julia_h2_13623() #0 {
 top:
 ;  @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral
 lelism.jmd:6 within `h2`
 ; ┌ @ refvalue.jl:56 within `getindex`
 ; │┌ @ Base.jl:37 within `getproperty`
-    %0 = load i64, i64* inttoptr (i64 139638902028752 to i64*), align 16
+    %0 = load i64, i64* inttoptr (i64 139977816937680 to i64*), align 16
 ; └└
 ; ┌ @ range.jl:5 within `Colon`
 ; │┌ @ range.jl:397 within `UnitRange`
@@ -483,15 +483,15 @@ 

Next Level Up: Multithreading

br i1 %.inv, label %L18.preheader, label %L34 L18.preheader: ; preds = %top - %.promoted = load i64, i64* inttoptr (i64 139638901468496 to i64*), align - 16 + %.promoted = load i64, i64* inttoptr (i64 139977816649728 to i64*), align + 4096 ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:8 within `h2` %1 = add i64 %.promoted, %0 ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:7 within `h2` ; ┌ @ Base.jl within `setproperty!` - store i64 %1, i64* inttoptr (i64 139638901468496 to i64*), align 16 + store i64 %1, i64* inttoptr (i64 139977816649728 to i64*), align 4096 ; └ ; @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture06/styles_of_paral lelism.jmd:8 within `h2` @@ -522,7 +522,7 @@

Next Level Up: Multithreading

-114.188 ns (0 allocations: 0 bytes)
+160.279 ns (0 allocations: 0 bytes)
 
@@ -905,7 +905,7 @@

The Bait-and-switch: Parallelism is about Programming Models

diff --git a/notes/07-Ordinary_Differential_Equations-Applications_and_Discretizations/index.html b/notes/07-Ordinary_Differential_Equations-Applications_and_Discretizations/index.html index 8d1ba85c..27c90628 100644 --- a/notes/07-Ordinary_Differential_Equations-Applications_and_Discretizations/index.html +++ b/notes/07-Ordinary_Differential_Equations-Applications_and_Discretizations/index.html @@ -405,4 +405,4 @@ plot(sol)
 plot(sol, xscale=:log10, tspan=(1e-6, 60), layout=(3,1))
-

Geometric Properties

Linear Ordinary Differential Equations

The simplest ordinary differential equation is the scalar linear ODE, which is given in the form

\[ u' = \alpha u \]

We can solve this by noticing that $(e^{\alpha t})^\prime = \alpha e^{\alpha t}$ satisfies the differential equation and thus the general solution is:

\[ u(t) = u(0)e^{\alpha t} \]

From the analytical solution we have that:

  • If $Re(\alpha) > 0$ then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$

  • If $Re(\alpha) < 0$ then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If $Re(\alpha) = 0$ then $u(t)$ has a constant or periodic solution.

This theory can then be extended to multivariable systems in the same way as the discrete dynamics case. Let $u$ be a vector and have

\[ u' = Au \]

be a linear ordinary differential equation. Assuming $A$ is diagonalizable, we diagonalize $A = P^{-1}DP$ to get

\[ Pu' = DPu \]

and change coordinates $z = Pu$ so that we have

\[ z' = Dz \]

which decouples the equation into a system of linear ordinary differential equations which we solve individually. Thus we see that, similarly to the discrete dynamical system, we have that:

  • If all of the eigenvalues are negative, then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If any eigenvalue is positive, then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$
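
A short sketch of this decoupling in Julia (the matrix A below is an arbitrary illustrative example): diagonalize, exponentiate each eigenvalue, and transform back.

 using LinearAlgebra
 A = [-1.0  0.5;
       0.0 -2.0]
 u0 = [1.0, 1.0]
 λ, V = eigen(A)                           # A = V * Diagonal(λ) * inv(V)
 u(t) = real.(V * Diagonal(exp.(λ .* t)) * (V \ u0))
 u(1.0)                                    # both eigenvalues are negative, so u(t) → 0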

Nonlinear Ordinary Differential Equations

As with discrete dynamical systems, the geometric properties extend locally to the linearization of the continuous dynamical system as defined by:

\[ u' = \frac{df}{du} u \]

where $\frac{df}{du}$ is the Jacobian of the system. This is a consequence of the Hartman-Grobman Theorem.

Numerically Solving Ordinary Differential Equations

Euler's Method

To numerically solve an ordinary differential equation, one turns the continuous equation into a discrete equation by discretizing it. The simplest discretization is the Euler method. The Euler method can be thought of as a simple approximation replacing $dt$ with a small non-infinitesimal $\Delta t$. Thus we can approximate

\[ f(u,p,t) = u' = \frac{du}{dt} \approx \frac{\Delta u}{\Delta t} \]

and now since $\Delta u = u_{n+1} - u_n$ we have that

\[ \Delta t f(u,p,t) = u_{n+1} - u_n \]

We need to make a choice as to where we evaluate $f$. The simplest approximation is to evaluate it at $t_n$ with $u_n$ where we already have the data, and thus we re-arrange to get

\[ u_{n+1} = u_n + \Delta t f(u,p,t) \]

This is the Euler method.
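
As a minimal sketch (function and variable names are illustrative, not the lecture's code), Euler's method on the scalar linear test problem $u' = \alpha u$ looks like:

 function euler(f, u0, p, dt, n)
   u = u0
   t = 0.0
   us = [u0]
   for i in 1:n
     u = u + dt * f(u, p, t)   # u_{n+1} = u_n + Δt f(u_n, p, t_n)
     t += dt
     push!(us, u)
   end
   us
 end

 f(u, p, t) = p * u            # u' = αu with p = α
 euler(f, 1.0, -2.0, 0.1, 100) # decays toward zero since α < 0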

We can interpret it more rigorously by looking at the Taylor series expansion. First write out the Taylor series for the ODE's solution in the near future:

\[ u(t+\Delta t) = u(t) + \Delta t u'(t) + \frac{\Delta t^2}{2} u''(t) + \ldots \]

Recall that $u' = f(u,p,t)$ by the definition of the ODE system, and thus we have that

\[ u(t+\Delta t) = u(t) + \Delta t f(u,p,t) + \mathcal{O}(\Delta t^2) \]

This is a first order approximation because the error in our step can be expressed as an error in the derivative, i.e.

\[ \frac{u(t + \Delta t) - u(t)}{\Delta t} = f(u,p,t) + \mathcal{O}(\Delta t) \]

Higher Order Methods

We can use this analysis to extend our methods to higher order approximation by simply matching the Taylor series to a higher order. Intuitively, when we developed the Euler method we had to make a choice:

\[ u_{n+1} = u_n + \Delta t f(u,p,t) \]

where do we evaluate $f$? One may think that the best derivative approximation may come from the middle of the interval, in which case we might want to evaluate it at $t + \frac{\Delta t}{2}$. To do so, we can use the Euler method to approximate the value at $t + \frac{\Delta t}{2}$ and then use that value to approximate the derivative at $t + \frac{\Delta t}{2}$. This looks like:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ u_{n+1} = u_n + \Delta t k_2 \]

which we can also write as:

\[ u_{n+1} = u_n + \Delta t f(u_n + \frac{\Delta t}{2} f_n,p,t + \frac{\Delta t}{2}) \]

where $f_n = f(u_n,p,t)$. If we do the two-dimensional Taylor expansion we get:

\[ u_{n+1} = u_n + \Delta t f_n + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t)\\ + \frac{\Delta t^3}{6} (f_{tt} + 2f_{tu}f + f_{uu}f^2)(u_n,p,t) \]

which when we compare against the true Taylor series:

\[ u(t+\Delta t) = u_n + \Delta t f(u_n,p,t) + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t) + \frac{\Delta t^3}{6}(f_{tt} + 2f_{tu} + f_{uu}f^2 + f_t f_u + f_u^2 f)(u_n,p,t) \]

and thus we see that

\[ u(t + \Delta t) - u_n = \mathcal{O}(\Delta t^3) \]
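
A sketch of a single step of this explicit midpoint method (names are illustrative):

 function midpoint_step(f, u, p, t, dt)
   k1 = f(u, p, t)
   k2 = f(u + (dt / 2) * k1, p, t + dt / 2)
   u + dt * k2
 end

 f(u, p, t) = p * u
 midpoint_step(f, 1.0, -2.0, 0.0, 0.1)   # one second-order step of u' = -2u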

Runge-Kutta Methods

More generally, Runge-Kutta methods are of the form:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \Delta t (a_{21} k_1),p,t + \Delta t c_2)\\ k_3 = f(u_n + \Delta t (a_{31} k_1 + a_{32} k_2),p,t + \Delta t c_3)\\ \vdots \\ u_{n+1} = u_n + \Delta t (b_1 k_1 + \ldots + b_s k_s) \]

where $s$ is the number of stages. These can be expressed as a tableau:

The order of the Runge-Kutta method is simply the number of terms in the Taylor series that ends up being matched by the resulting expansion. For example, for the 4th order you can expand out and see that the following equations need to be satisfied:

The classic Runge-Kutta method is also known as RK4 and is the following 4th order method:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ k_3 = f(u_n + \frac{\Delta t}{2} k_2,p,t + \frac{\Delta t}{2})\\ k_4 = f(u_n + \Delta t k_3,p,t + \Delta t)\\ u_{n+1} = u_n + \frac{\Delta t}{6}(k_1 + 2 k_2 + 2 k_3 + k_4)\\ \]

While it's widely known and simple to remember, it's not necessarily good. The way to judge a Runge-Kutta method is by looking at the size of the coefficient of the next term in the Taylor series: if it's large then the true error can be larger, even if it matches another one asymptotically.
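
A sketch of one RK4 step exactly as written above (names are illustrative):

 function rk4_step(f, u, p, t, dt)
   k1 = f(u, p, t)
   k2 = f(u + (dt / 2) * k1, p, t + dt / 2)
   k3 = f(u + (dt / 2) * k2, p, t + dt / 2)
   k4 = f(u + dt * k3, p, t + dt)
   u + (dt / 6) * (k1 + 2k2 + 2k3 + k4)
 end

 f(u, p, t) = p * u
 rk4_step(f, 1.0, -2.0, 0.0, 0.1)   # ≈ exp(-0.2) to 4th order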

What Makes a Good Method?

Leading Truncation Coefficients

For explicit Runge-Kutta methods, lower bounds on the number of f evaluations (stages) required to achieve a given order are known:

While unintuitive, the best method to use is not necessarily the one that reduces the leading truncation coefficient the most. The reason is that what is attempted in ODE solving is precisely the opposite of the analysis. In the ODE analysis, we're looking at behavior as $\Delta t \rightarrow 0$. However, when efficiently solving ODEs, we want to use the largest $\Delta t$ which satisfies error tolerances.

The most widely used method is the Dormand-Prince 5th order Runge-Kutta method, whose tableau is represented as:

Notice that this method takes 7 calls to f for 5th order. The key to this method is that it has optimized leading truncation error coefficients, under some extra assumptions which allow for the analysis to be simplified.

Looking at the Effects of RK Method Choices and Code Optimizations

Pulling from the SciML Benchmarks, we can see the general effect of these different properties on a given set of Runge-Kutta methods:

Here, the order of the method is given in the name. We can see one immediate factor is that, as the requested error in the calculation decreases, the higher order methods become more efficient. This is because to decrease error, you decrease $\Delta t$, and thus the exponent difference with respect to $\Delta t$ has more of a chance to pay off for the extra calls to f. Additionally, we can see that order is not the only determining factor for efficiency: the Vern8 method seems to have a clear approximate 2.5x performance advantage over the whole span of the benchmark compared to the DP8 method, even though both are 8th order methods. This is because of the leading truncation terms: with a small enough $\Delta t$, the more optimized method (Vern8) will generally have low error in a step for the same $\Delta t$ because the coefficients in the expansion are generally smaller.

This is a factor which is generally ignored in high level discussions of numerical differential equations, but can lead to orders of magnitude differences! This is highlighted in the following plot:

Here we see ODEInterface.jl's ODEInterfaceDiffEq.jl wrapper into the SciML common interface for the standard dopri method from Fortran, and ODE.jl, the original ODE solvers in Julia, have a performance disadvantage compared to the DifferentialEquations.jl methods due in part to some of the coding performance pieces that we discussed in the first few lectures.

Specifically, a large part of this can be attributed to inlining of the higher order functions, i.e. ODEs are defined by a user function and then have to be called from the solver. If the solver code is compiled as a shared library ahead of time, like is commonly done in C++ or Fortran, then there can be a function call overhead that is eliminated by JIT compilation optimizing across the function call barriers (known as interprocedural optimization). This is one way in which a JIT system can outperform an AOT (ahead of time) compiled system in real-world code (for completeness, two other ways are full function specialization, which is generally not possible in AOT languages given that you cannot know all types ahead of time for a fully generic function, and the fact that calls through a C foreign function interface (c-ffi) can be optimized using the runtime information of the JIT compiler to outperform C itself!).

The other performance difference being shown here is due to optimization of the method. While a slightly different order, we can see a clear difference in the performance of RK4 vs the coefficient optimized methods. It's about the same order of magnitude as "highly optimized code differences", showing that both the Runge-Kutta coefficients and the code implementation can have a significant impact on performance.

Taking a look at what happens when interpreted languages get involved highlights some of the code challenges in this domain. Let's take a look at for example the results when simulating 3 ODE systems with the various RK methods:

We see that using interpreted languages introduces around a 50x-100x performance penalty. If you recall from the previous lecture, the discrete dynamical system that was being simulated was the 3-dimensional Lorenz equation discretized by Euler's method, meaning that the performance of that implementation is a good proxy for understanding the performance differences in this graph. Recall that in previous lectures we saw an approximately 5x performance advantage when specializing on the system function and size and around 10x by reducing allocations: these features account for the performance differences noticed between library implementations, which are then compounded by the use of different RK methods (note that R uses "call by copy" which even further increases the memory usage and makes standard usage of the language incompatible with mutating function calls!).

Stability of a Method

Simply having an order on the truncation error does not imply convergence of the method. The disconnect is that the errors at a given time point may not dissipate. What also needs to be checked is the asymptotic behavior of a disturbance. To see this, one can utilize the linear test problem:

\[ u' = \alpha u \]

and ask the question, does the discrete dynamical system defined by the discretized ODE end up going to zero? You would hope that the discretized dynamical system and the continuous dynamical system have the same properties in this simple case, and this is known as linear stability analysis of the method.

As an example, take a look at the Euler method. Recall that the Euler method was given by:

\[ u_{n+1} = u_n + \Delta t f(u_n,p,t) \]

When we plug in the linear test equation, we get that

\[ u_{n+1} = u_n + \Delta t \alpha u_n \]

If we let $z = \Delta t \alpha$, then we get the following:

\[ u_{n+1} = u_n + z u_n = (1+z)u_n \]

which is stable when $1+z$ is inside the unit circle, i.e. $|1+z| < 1$ (the unit circle shifted to be centered at $-1$). This means that, as a necessary condition, the step size $\Delta t$ needs to be small enough that $z$ satisfies this condition, placing a stepsize limit on the method.

If $\Delta t$ is ever too large, it will cause the equation to overshoot zero, which then causes oscillations that spiral out to infinity.

Thus the stability condition places a hard constraint on the allowed $\Delta t$ which will result in a realistic simulation.
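
A quick numerical check of this condition for the stable test problem with $\alpha = -10$ (the values of $\Delta t$ below are arbitrary illustrative choices):

 α = -10.0
 for dt in (0.05, 0.1, 0.19, 0.21)
   z = dt * α
   println("Δt = $dt: |1 + z| = ", abs(1 + z), abs(1 + z) < 1 ? "  (stable)" : "  (unstable)")
 end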

For reference, the stability regions of the 2nd and 4th order Runge-Kutta methods that we discussed are as follows:

Interpretation of the Linear Stability Condition

To interpret the linear stability condition, recall that the linearization of a system interprets the dynamics as locally being due to the Jacobian of the system. Thus

\[ u' = f(u,p,t) \]

is locally equivalent to

\[ u' = \frac{df}{du}u \]

You can understand the local behavior through diagonalizing this matrix. Therefore, the scalar linear stability analysis is effectively performed on the eigenvalues of the Jacobian. The method will be stable if the largest eigenvalues of df/du are all within the stability limit. This means that stability effects are different throughout the solution of a nonlinear equation and are generally understood locally (though different more comprehensive stability conditions exist!).

Implicit Methods

If instead of the Euler method we defined $f$ to be evaluated at the future point, we would receive a method like:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t+\Delta t) \]

in which case, for the stability calculation we would have that

\[ u_{n+1} = u_n + \Delta t \alpha u_n \]

or

\[ (1-z) u_{n+1} = u_n \]

which means that

\[ u_{n+1} = \frac{1}{1-z} u_n \]

which is stable for all $Re(z) < 0$, a property which is known as A-stability. It is also stable as $z \rightarrow \infty$, a property known as L-stability. This means that for equations with very ill-conditioned Jacobians, this method is still able to use reasonably large stepsizes and can thus be efficient.
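
For the linear test problem the implicit step can be solved in closed form, giving exactly the update $u_{n+1} = u_n/(1-z)$ derived above; a sketch with illustrative names:

 function implicit_euler_linear(α, u0, dt, n)
   u = u0
   for i in 1:n
     u = u / (1 - dt * α)    # u_{n+1} = u_n + Δt α u_{n+1}, solved for u_{n+1}
   end
   u
 end

 implicit_euler_linear(-1.0e6, 1.0, 0.1, 10)   # very stiff α with a large Δt still decays toward zero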

Stiffness and Timescale Separation

From this we see that there is a maximal stepsize whenever the eigenvalues of the Jacobian are sufficiently large. It turns out that's not an issue if the phenomena we see are fast, since then the total integration time tends to be small. However, if we have some equations with both fast modes and slow modes, like the Robertson equation, then it is very difficult because in order to resolve the slow dynamics over a long timespan, one needs to ensure that the fast dynamics do not diverge. This is a property known as stiffness. Stiffness can thus be approximated in some sense by the condition number of the Jacobian. The condition number of a matrix is its maximal eigenvalue divided by its minimal eigenvalue and gives a rough measure of the local timescale separations. If this value is large and one wants to resolve the slow dynamics, then explicit integrators, like the explicit Runge-Kutta methods described before, have issues with stability. In this case implicit integrators (or other forms of stabilized stepping) are required in order to efficiently reach the end time step.
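
As a rough illustration of the timescale-separation idea (the Jacobian here is a made-up example, not the Robertson equation):

 using LinearAlgebra
 J = [-1.0e5   0.0;
        0.0   -1.0]                  # one fast mode, one slow mode
 λ = eigvals(J)
 maximum(abs, λ) / minimum(abs, λ)   # ≈ 1e5: a rough indicator of stiffness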

Exploiting Continuity

So far, we have looked at ordinary differential equations as a $\Delta t \rightarrow 0$ formulation of a discrete dynamical system. However, continuous dynamics and discrete dynamics have very different characteristics which can be utilized in order to arrive at simpler models and faster computations.

Geometric Properties: No Jumping and the Poincaré–Bendixson theorem

In terms of geometric properties, continuity places a large constraint on the possible dynamics. This is because of the physical constraint on "jumping", i.e. flows of differential equations cannot jump over each other. If you are ever at some point in phase space and $f$ is not explicitly time-dependent, then the direction of $u'$ is uniquely determined (given reasonable assumptions on $f$), meaning that flow lines (solutions to the differential equation) can never cross.

A result from this is the Poincaré–Bendixson theorem, which states that, with any arbitrary (but nice) two dimensional continuous system, you can only have 3 behaviors:

  • Steady state behavior

  • Divergence

  • Periodic orbits

A simple proof by picture shows this.

\ No newline at end of file +

Geometric Properties

Linear Ordinary Differential Equations

The simplest ordinary differential equation is the scalar linear ODE, which is given in the form

\[ u' = \alpha u \]

We can solve this by noticing that $(e^{\alpha t})^\prime = \alpha e^{\alpha t}$ satisfies the differential equation and thus the general solution is:

\[ u(t) = u(0)e^{\alpha t} \]

From the analytical solution we have that:

  • If $Re(\alpha) > 0$ then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$

  • If $Re(\alpha) < 0$ then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If $Re(\alpha) = 0$ then $u(t)$ has a constant or periodic solution.

This theory can then be extended to multivariable systems in the same way as the discrete dynamics case. Let $u$ be a vector and have

\[ u' = Au \]

be a linear ordinary differential equation. Assuming $A$ is diagonalizable, we diagonalize $A = P^{-1}DP$ to get

\[ Pu' = DPu \]

and change coordinates $z = Pu$ so that we have

\[ z' = Dz \]

which decouples the equation into a system of linear ordinary differential equations which we solve individually. Thus we see that, similarly to the discrete dynamical system, we have that:

  • If all of the eigenvalues have negative real part, then $u(t) \rightarrow 0$ as $t \rightarrow \infty$

  • If any eigenvalue has positive real part, then $u(t) \rightarrow \infty$ as $t \rightarrow \infty$
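
As a quick numerical check of this decoupling (a sketch assuming only the LinearAlgebra standard library; note that Julia's `eigen` returns the factorization in the form $A = PDP^{-1}$, so the change of coordinates there is $z = P^{-1}u$):

 using LinearAlgebra

 A  = [-1.0 2.0; 0.0 -3.0]        # eigenvalues -1 and -3: all negative real parts
 u0 = [1.0, 1.0]
 t  = 2.0

 D, P = eigen(A)                  # A = P * Diagonal(D) * inv(P)
 z0 = P \ u0                      # decoupled coordinates z = P⁻¹ u
 zt = exp.(D .* t) .* z0          # each scalar mode solved analytically
 @show P * zt                     # map back to u(t)
 @show exp(A * t) * u0            # matrix-exponential solution for comparison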

Nonlinear Ordinary Differential Equations

As with discrete dynamical systems, the geometric properties extend locally to the linearization of the continuous dynamical system as defined by:

\[ u' = \frac{df}{du} u \]

where $\frac{df}{du}$ is the Jacobian of the system. This is a consequence of the Hartman–Grobman theorem.

Numerically Solving Ordinary Differential Equations

Euler's Method

To numerically solve an ordinary differential equation, one turns the continuous equation into a discrete equation by discretizing it. The simplest discretization is the Euler method. The Euler method can be thought of as a simple approximation replacing $dt$ with a small non-infinitesimal $\Delta t$. Thus we can approximate

\[ f(u,p,t) = u' = \frac{du}{dt} \approx \frac{\Delta u}{\Delta t} \]

and now since $\Delta u = u_{n+1} - u_n$ we have that

\[ \Delta t f(u,p,t) = u_{n+1} - u_n \]

We need to make a choice as to where we evaluate $f$. The simplest approximation is to evaluate it at $t_n$ with $u_n$, where we already have the data, and thus we re-arrange to get

\[ u_{n+1} = u_n + \Delta t f(u_n,p,t_n) \]

This is the Euler method.
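
As a concrete sketch, here is a direct (and deliberately naive) implementation of this stepping loop; the function name `euler` and the test problem are purely illustrative:

 # Forward Euler: repeatedly step u_{n+1} = u_n + Δt f(u_n, p, t_n)
 function euler(f, u0, p, tspan, dt)
     t, u = tspan[1], u0
     ts, us = [t], [u]
     while t < tspan[2] - 1e-12
         u = u + dt * f(u, p, t)
         t += dt
         push!(ts, t); push!(us, u)
     end
     ts, us
 end

 # Example: u' = -u, exact solution u(t) = exp(-t)
 f(u, p, t) = -u
 ts, us = euler(f, 1.0, nothing, (0.0, 1.0), 0.01)
 @show us[end]      # ≈ exp(-1) ≈ 0.3679, with O(Δt) error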

We can interpret it more rigorously by looking at the Taylor series expansion. First write out the Taylor series for the ODE's solution in the near future:

\[ u(t+\Delta t) = u(t) + \Delta t u'(t) + \frac{\Delta t^2}{2} u''(t) + \ldots \]

Recall that $u' = f(u,p,t)$ by the definition of the ODE system, and thus we have that

\[ u(t+\Delta t) = u(t) + \Delta t f(u,p,t) + \mathcal{O}(\Delta t^2) \]

This is a first order approximation because the error in our step can be expressed as an error in the derivative, i.e.

\[ \frac{u(t + \Delta t) - u(t)}{\Delta t} = f(u,p,t) + \mathcal{O}(\Delta t) \]

Higher Order Methods

We can use this analysis to extend our methods to higher order approximation by simply matching the Taylor series to a higher order. Intuitively, when we developed the Euler method we had to make a choice:

\[ u_{n+1} = u_n + \Delta t f(u,p,t) \]

where do we evaluate $f$? One may think that the best derivative approximation may come from the middle of the interval, in which case we might want to evaluate it at $t + \frac{\Delta t}{2}$. To do so, we can use the Euler method to approximate the value at $t + \frac{\Delta t}{2}$ and then use that value to approximate the derivative at $t + \frac{\Delta t}{2}$. This looks like:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ u_{n+1} = u_n + \Delta t k_2 \]

which we can also write as:

\[ u_{n+1} = u_n + \Delta t f(u_n + \frac{\Delta t}{2} f_n,p,t + \frac{\Delta t}{2}) \]

where $f_n = f(u_n,p,t)$. If we do the two-dimensional Taylor expansion we get:

\[ u_{n+1} = u_n + \Delta t f_n + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t)\\ + \frac{\Delta t^3}{8} (f_{tt} + 2f_{tu}f + f_{uu}f^2)(u_n,p,t) \]

which when we compare against the true Taylor series:

\[ u(t+\Delta t) = u_n + \Delta t f(u_n,p,t) + \frac{\Delta t^2}{2}(f_t + f_u f)(u_n,p,t) + \frac{\Delta t^3}{6}(f_{tt} + 2f_{tu}f + f_{uu}f^2 + f_t f_u + f_u^2 f)(u_n,p,t) \]

and thus we see that

\[ u(t + \Delta t) - u_{n+1} = \mathcal{O}(\Delta t^3) \]

Runge-Kutta Methods

More generally, Runge-Kutta methods are of the form:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \Delta t (a_{21} k_1),p,t + \Delta t c_2)\\ k_3 = f(u_n + \Delta t (a_{31} k_1 + a_{32} k_2),p,t + \Delta t c_3)\\ \vdots \\ u_{n+1} = u_n + \Delta t (b_1 k_1 + \ldots + b_s k_s) \]

where $s$ is the number of stages. These can be expressed as a tableau:

The order of the Runge-Kutta method is simply the number of terms in the Taylor series that ends up being matched by the resulting expansion. For example, for the 4th order you can expand out and see that the following equations need to be satisfied:

The classic Runge-Kutta method is also known as RK4 and is the following 4th order method:

\[ k_1 = f(u_n,p,t)\\ k_2 = f(u_n + \frac{\Delta t}{2} k_1,p,t + \frac{\Delta t}{2})\\ k_3 = f(u_n + \frac{\Delta t}{2} k_2,p,t + \frac{\Delta t}{2})\\ k_4 = f(u_n + \Delta t k_3,p,t + \Delta t)\\ u_{n+1} = u_n + \frac{\Delta t}{6}(k_1 + 2 k_2 + 2 k_3 + k_4)\\ \]

While it's widely known and simple to remember, it's not necessarily good. The way to judge a Runge-Kutta method is by looking at the size of the coefficient of the next term in the Taylor series: if it's large then the true error can be larger, even if it matches another one asymptotically.
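
As a sketch, a single step of this method transcribes directly from the formulas above (illustrative names, not an optimized implementation):

 # One step of the classic RK4 method for u' = f(u, p, t)
 function rk4_step(f, u, p, t, dt)
     k1 = f(u, p, t)
     k2 = f(u + dt/2 * k1, p, t + dt/2)
     k3 = f(u + dt/2 * k2, p, t + dt/2)
     k4 = f(u + dt   * k3, p, t + dt)
     u + dt/6 * (k1 + 2k2 + 2k3 + k4)
 end

 # u' = -u over one step of size 0.1: the per-step error is O(Δt⁵)
 @show rk4_step((u, p, t) -> -u, 1.0, nothing, 0.0, 0.1) - exp(-0.1)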

What Makes a Good Method?

Leading Truncation Coefficients

For given orders of explicit Runge-Kutta methods, lower bounds for the number of f evaluations (stages) required to receive a given order are known:

While unintuitive, the best method to use is not necessarily the one that reduces the leading coefficient the most. The reason is that what is attempted in ODE solving is precisely the opposite of the analysis: in the asymptotic analysis we look at behavior as $\Delta t \rightarrow 0$, but when efficiently solving ODEs we want to use the largest $\Delta t$ which still satisfies the error tolerances.

The most widely used method is the Dormand-Prince 5th order Runge-Kutta method, whose tableau is represented as:

Notice that this method takes 7 calls to f for 5th order. The key to this method is that it has optimized leading truncation error coefficients, under some extra assumptions which allow for the analysis to be simplified.

Looking at the Effects of RK Method Choices and Code Optimizations

Pulling from the SciML Benchmarks, we can see the general effect of these different properties on a given set of Runge-Kutta methods:

Here, the order of the method is given in the name. We can see one immediate factor is that, as the requested error in the calculation decreases, the higher order methods become more efficient. This is because to decrease error, you decrease $\Delta t$, and thus the exponent difference with respect to $\Delta t$ has more of a chance to pay off for the extra calls to f. Additionally, we can see that order is not the only determining factor for efficiency: the Vern8 method seems to have a clear approximate 2.5x performance advantage over the whole span of the benchmark compared to the DP8 method, even though both are 8th order methods. This is because of the leading truncation terms: with a small enough $\Delta t$, the more optimized method (Vern8) will generally have low error in a step for the same $\Delta t$ because the coefficients in the expansion are generally smaller.

This is a factor which is generally ignored in high level discussions of numerical differential equations, but can lead to orders of magnitude differences! This is highlighted in the following plot:

Here we see that ODEInterfaceDiffEq.jl, the wrapper of ODEInterface.jl into the SciML common interface for the standard dopri method from Fortran, and ODE.jl, the original ODE solvers in Julia, both have a performance disadvantage compared to the DifferentialEquations.jl methods, due in part to some of the coding performance pieces that we discussed in the first few lectures.

Specifically, a large part of this can be attributed to inlining of the higher order functions, i.e. ODEs are defined by a user function which then has to be called from the solver. If the solver code is compiled as a shared library ahead of time, as is commonly done in C++ or Fortran, then there is a function call overhead which JIT compilation eliminates by optimizing across the function call barriers (known as interprocedural optimization). This is one way in which a JIT system can outperform an AOT (ahead-of-time) compiled system in real-world code. (For completeness, two other ways are full function specialization, which is generally not possible in AOT languages since you cannot know all types ahead of time for a fully generic function, and the fact that calls into C itself, i.e. the C FFI (foreign function interface), can be optimized using the runtime information of the JIT compiler to outperform C!)

The other performance difference being shown here is due to optimization of the method. While a slightly different order, we can see a clear difference in the performance of RK4 vs the coefficient optimized methods. It's about the same order of magnitude as "highly optimized code differences", showing that both the Runge-Kutta coefficients and the code implementation can have a significant impact on performance.

Taking a look at what happens when interpreted languages get involved highlights some of the code challenges in this domain. Let's take a look at for example the results when simulating 3 ODE systems with the various RK methods:

We see that using interpreted languages introduces around a 50x-100x performance penalty. Recall from the previous lecture that the discrete dynamical system being simulated was the 3-dimensional Lorenz equation discretized by Euler's method, meaning that the performance of that implementation is a good proxy for understanding the performance differences in this graph. Recall that in previous lectures we saw an approximately 5x performance advantage when specializing on the system function and size and around 10x by reducing allocations: these features account for the performance differences noticed between library implementations, which are then compounded by the use of different RK methods (note that R uses "call by copy" which further increases the memory usage and makes standard usage of the language incompatible with mutating function calls!).

Stability of a Method

Simply having an order on the truncation error does not imply convergence of the method. The disconnect is that the errors at a given time point may not dissipate. What also needs to be checked is the asymptotic behavior of a disturbance. To see this, one can utilize the linear test problem:

\[ u' = \alpha u \]

and ask the question, does the discrete dynamical system defined by the discretized ODE end up going to zero? You would hope that the discretized dynamical system and the continuous dynamical system have the same properties in this simple case, and this is known as linear stability analysis of the method.

As an example, take a look at the Euler method. Recall that the Euler method was given by:

\[ u_{n+1} = u_n + \Delta t f(u_n,p,t) \]

When we plug in the linear test equation, we get that

\[ u_{n+1} = u_n + \Delta t \alpha u_n \]

If we let $z = \Delta t \alpha$, then we get the following:

\[ u_{n+1} = u_n + z u_n = (1+z)u_n \]

which is stable when $z$ lies in the shifted unit circle $|1+z| < 1$ (the unit disk centered at $-1$). This means that, as a necessary condition, the step size $\Delta t$ needs to be small enough that $z$ satisfies this condition, placing a stepsize limit on the method.

If $\Delta t$ is ever too large, it will cause the equation to overshoot zero, which then causes oscillations that spiral out to infinity.

Thus the stability condition places a hard constraint on the allowed $\Delta t$ which will result in a realistic simulation.
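
This overshoot can be seen directly by iterating $u_{n+1} = (1+z)u_n$ for values of $z$ inside and outside the stability region (a small sketch):

 # Iterate u_{n+1} = (1 + z) u_n for the linear test equation u' = α u, with z = Δt α
 growth(z, n) = (1 + z)^n

 α = -10.0
 @show growth(α * 0.05, 100)   # z = -0.5, |1+z| = 0.5 < 1: decays as expected
 @show growth(α * 0.25, 100)   # z = -2.5, |1+z| = 1.5 > 1: oscillates and blows up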

For reference, the stability regions of the 2nd and 4th order Runge-Kutta methods that we discussed are as follows:

Interpretation of the Linear Stability Condition

To interpret the linear stability condition, recall that the linearization of a system interprets the dynamics as locally being due to the Jacobian of the system. Thus

\[ u' = f(u,p,t) \]

is locally equivalent to

\[ u' = \frac{df}{du}u \]

You can understand the local behavior through diagonalizing this matrix. Therefore, the scalar $\alpha$ of the linear stability analysis plays the role of an eigenvalue of the Jacobian, and the method will be stable if $z = \Delta t \lambda$ lies within the stability region for every eigenvalue $\lambda$ of $\frac{df}{du}$. This means that stability effects are different throughout the solution of a nonlinear equation and are generally understood locally (though different, more comprehensive stability conditions exist!).

Implicit Methods

If instead of the Euler method we defined $f$ to be evaluated at the future point, we would receive a method like:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t+\Delta t) \]

in which case, for the stability calculation we would have that

\[ u_{n+1} = u_n + \Delta t \alpha u_{n+1} \]

or

\[ (1-z) u_{n+1} = u_n \]

which means that

\[ u_{n+1} = \frac{1}{1-z} u_n \]

which is stable for all $Re(z) < 0$, a property known as A-stability. It is also stable as $z \rightarrow \infty$, a property known as L-stability. This means that for equations with very ill-conditioned Jacobians, this method is still able to use reasonably large stepsizes and can thus be efficient.
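
A matching sketch for implicit Euler on the linear test equation, where the update is $u_{n+1} = u_n/(1-z)$, shows that even extremely large negative $z$ still decays:

 # Implicit (backward) Euler on u' = α u: u_{n+1} = u_n / (1 - Δt α)
 backward_euler(z, n) = (1 / (1 - z))^n

 α = -10.0
 @show backward_euler(α * 0.25, 100)   # z = -2.5: stable, decays monotonically
 @show backward_euler(α * 100.0, 10)   # huge step, z = -1000: still decays (L-stability)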

Stiffness and Timescale Separation

From this we see that there is a maximal stepsize whenever the eigenvalues of the Jacobian are sufficiently large. It turns out that's not an issue if the phenomena we see are fast, since then the total integration time tends to be small. However, if we have some equations with both fast modes and slow modes, like the Robertson equation, then it is very difficult because in order to resolve the slow dynamics over a long timespan, one needs to ensure that the fast dynamics do not diverge. This is a property known as stiffness. Stiffness can thus be approximated in some sense by the condition number of the Jacobian. The condition number of a matrix is the ratio of its largest to its smallest eigenvalue (in magnitude) and gives a rough measure of the local timescale separations. If this value is large and one wants to resolve the slow dynamics, then explicit integrators, like the explicit Runge-Kutta methods described before, have issues with stability. In this case implicit integrators (or other forms of stabilized stepping) are required in order to efficiently reach the end time.

Exploiting Continuity

So far, we have looked at ordinary differential equations as a $\Delta t \rightarrow 0$ formulation of a discrete dynamical system. However, continuous dynamics and discrete dynamics have very different characteristics which can be utilized in order to arrive at simpler models and faster computations.

Geometric Properties: No Jumping and the Poincaré–Bendixson theorem

In terms of geometric properties, continuity places a large constraint on the possible dynamics. This is because of the physical constraint on "jumping", i.e. flows of differential equations cannot jump over each other. If you are ever at some point in phase space and $f$ is not explicitly time-dependent, then the direction of $u'$ is uniquely determined (given reasonable assumptions on $f$), meaning that flow lines (solutions to the differential equation) can never cross.

A result from this is the Poincaré–Bendixson theorem, which states that, with any arbitrary (but nice) two dimensional continuous system, you can only have 3 behaviors:

  • Steady state behavior

  • Divergence

  • Periodic orbits

A simple proof by picture shows this.

\ No newline at end of file diff --git a/notes/08-Forward-Mode_Automatic_Differentiation_(AD)_via_High_Dimensional_Algebras/index.html b/notes/08-Forward-Mode_Automatic_Differentiation_(AD)_via_High_Dimensional_Algebras/index.html index b651dad2..c7433753 100644 --- a/notes/08-Forward-Mode_Automatic_Differentiation_(AD)_via_High_Dimensional_Algebras/index.html +++ b/notes/08-Forward-Mode_Automatic_Differentiation_(AD)_via_High_Dimensional_Algebras/index.html @@ -18,9 +18,9 @@ ϵ2 = (1+ϵ) - 1 (ϵ - ϵ2)
-ϵ = 9.831067687145973e-11
-1 + ϵ = 1.0000000000983107
--1.6048307731825555e-17
+ϵ = 7.470048420814885e-11
+1 + ϵ = 1.0000000000747005
+1.6174664895368224e-17
 

See how $\epsilon$ is only rebuilt at accuracy around $10^{-16}$ and thus we only keep around 6 digits of accuracy when it's generated at the size of around $10^{-10}$!

Finite Differencing and Numerical Stability

To start understanding how to compute derivatives on a computer, we start with finite differencing. For finite differencing, recall that the definition of the derivative is:

\[ f'(x) = \lim_{\epsilon \rightarrow 0} \frac{f(x+\epsilon)-f(x)}{\epsilon} \]

Finite differencing directly follows from this definition by choosing a small $\epsilon$. However, choosing a good $\epsilon$ is very difficult. If $\epsilon$ is too large then there is truncation error, since this definition is only exact in the limit. However, if $\epsilon$ is too small, you receive roundoff error. To understand why you would get roundoff error, recall that floating point error is relative, and a 64-bit float can essentially store 16 digits of accuracy. So let's say we choose $\epsilon = 10^{-6}$. Then $f(x+\epsilon) - f(x)$ is roughly the same in the first 6 digits, meaning that after the subtraction there are only 10 digits of accuracy, and then dividing by $10^{-6}$ simply brings those 10 digits back up to the correct relative size.

This means that we want to choose $\epsilon$ small enough that the $\mathcal{O}(\epsilon)$ truncation error is balanced by the $\mathcal{O}(E/\epsilon)$ roundoff error. Under some minor assumptions, one can argue that the best choice is approximately $\sqrt{E}$, where $E$ is the machine epsilon.

 @show eps(Float64)
 @show sqrt(eps(Float64))
@@ -85,7 +85,7 @@
 a, b, c, d = 1, 2, 3, 4
 @btime add($(Ref(a))[], $(Ref(b))[], $(Ref(c))[], $(Ref(d))[])
 
-3.699 ns (0 allocations: 0 bytes)
+4.200 ns (0 allocations: 0 bytes)
 (4, 6)
 
 a = Dual(1, 2)
@@ -95,17 +95,17 @@
 add(a, b)
 @btime add($(Ref(a))[], $(Ref(b))[])
 
-3.499 ns (0 allocations: 0 bytes)
+3.900 ns (0 allocations: 0 bytes)
 Dual{Int64}(4, 6)
 

It seems like we have lost no performance.

 @code_native add(1, 2, 3, 4)
 
 .text
 	.file	"add"
-	.globl	julia_add_16015                 # -- Begin function julia_add_16015
+	.globl	julia_add_15994                 # -- Begin function julia_add_15994
 	.p2align	4, 0x90
-	.type	julia_add_16015,@function
-julia_add_16015:                        # @julia_add_16015
+	.type	julia_add_15994,@function
+julia_add_15994:                        # @julia_add_15994
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture08/automatic_diff
 erentiation.jmd:2 within `add`
 	.cfi_startproc
@@ -126,7 +126,7 @@
 	.cfi_def_cfa %rsp, 8
 	retq
 .Lfunc_end0:
-	.size	julia_add_16015, .Lfunc_end0-julia_add_16015
+	.size	julia_add_15994, .Lfunc_end0-julia_add_15994
 	.cfi_endproc
 ; └
                                         # -- End function
@@ -136,10 +136,10 @@
 
 .text
 	.file	"add"
-	.globl	julia_add_16017                 # -- Begin function julia_add_16017
+	.globl	julia_add_15996                 # -- Begin function julia_add_15996
 	.p2align	4, 0x90
-	.type	julia_add_16017,@function
-julia_add_16017:                        # @julia_add_16017
+	.type	julia_add_15996,@function
+julia_add_15996:                        # @julia_add_15996
 ; ┌ @ /home/runner/work/SciMLBook/SciMLBook/_weave/lecture08/automatic_diff
 erentiation.jmd:5 within `add`
 	.cfi_startproc
@@ -160,7 +160,7 @@
 	.cfi_def_cfa %rsp, 8
 	retq
 .Lfunc_end0:
-	.size	julia_add_16017, .Lfunc_end0-julia_add_16017
+	.size	julia_add_15996, .Lfunc_end0-julia_add_15996
 	.cfi_endproc
 ; └
                                         # -- End function
@@ -323,4 +323,4 @@
 2-element SVector{2, Float64} with indices SOneTo(2):
  0.7071067811865476
  0.7071067811865476
-

Conclusion

To make derivative calculations efficient and correct, we can move to higher dimensional numbers. In multiple dimensions, these then allow for multiple directional derivatives to be computed simultaneously, giving a method for computing the Jacobian of a function $f$ on a single input. This is a direct application of using the compiler as part of a mathematical framework.

References

  • John L. Bell, An Invitation to Smooth Infinitesimal Analysis, http://publish.uwo.ca/~jbell/invitation%20to%20SIA.pdf

  • Bell, John L. A Primer of Infinitesimal Analysis

  • Nocedal & Wright, Numerical Optimization, Chapter 8

  • Griewank & Walther, Evaluating Derivatives

Many thanks to David Sanders for helping make these lecture notes.

\ No newline at end of file +

Conclusion

To make derivative calculations efficient and correct, we can move to higher dimensional numbers. In multiple dimensions, these then allow for multiple directional derivatives to be computed simultaneously, giving a method for computing the Jacobian of a function $f$ on a single input. This is a direct application of using the compiler as part of a mathematical framework.

References

  • John L. Bell, An Invitation to Smooth Infinitesimal Analysis, http://publish.uwo.ca/~jbell/invitation%20to%20SIA.pdf

  • Bell, John L. A Primer of Infinitesimal Analysis

  • Nocedal & Wright, Numerical Optimization, Chapter 8

  • Griewank & Walther, Evaluating Derivatives

Many thanks to David Sanders for helping make these lecture notes.

\ No newline at end of file diff --git a/notes/09-Solving_Stiff_Ordinary_Differential_Equations/index.html b/notes/09-Solving_Stiff_Ordinary_Differential_Equations/index.html index eed3a047..ccfe452a 100644 --- a/notes/09-Solving_Stiff_Ordinary_Differential_Equations/index.html +++ b/notes/09-Solving_Stiff_Ordinary_Differential_Equations/index.html @@ -1 +1 @@ - Solving Stiff Ordinary Differential Equations - MIT Parallel Computing and Scientific Machine Learning (SciML)

Solving Stiff Ordinary Differential Equations

Chris Rackauckas
October 14th, 2020

Youtube Video Link

We have previously shown how to solve non-stiff ODEs via optimized Runge-Kutta methods, but we ended by showing that there is a fundamental limitation of these methods when attempting to solve stiff ordinary differential equations. However, we can get around these limitations by using different types of methods, like implicit Euler. Let's now go down the path of understanding how to efficiently implement stiff ordinary differential equation solvers, and its interaction with other domains like automatic differentiation.

When one is solving a large-scale scientific computing problem with MPI, this is almost always the piece of code where all of the time is spent, so let's understand what it's doing.

Newton's Method and Jacobians

Recall that the implicit Euler method is the following:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t + \Delta t) \]

If we wanted to use this method, we would need to find out how to get the value $u_{n+1}$ when only knowing the value $u_n$. To do so, we can move everything to one side:

\[ u_{n+1} - \Delta t f(u_{n+1},p,t + \Delta t) - u_n = 0 \]

and now we have a problem

\[ g(u_{n+1}) = 0 \]

This is the classic rootfinding problem $g(x)=0$, find $x$. The way that we solve the rootfinding problem is, once again, by replacing this problem about a continuous function $g$ with a discrete dynamical system whose steady state is the solution to $g(x)=0$. There are many methods for this, but some choices of the rootfinding method affect the stability of the ODE solver itself, since we need to make sure that the steady state solution is a stable steady state of the iteration process; otherwise the rootfinding method will diverge (this will be explored in the homework).

Thus for example, fixed point iteration is not appropriate for stiff differential equations. Methods which are used in the stiff case are either Anderson Acceleration or Newton's method. Newton's method is by far the most common (and generally performs the best), so we can go down this route.

Let's use the syntax $g(x)=0$. Here we need some starting value $x_0$ as our first guess for $u_{n+1}$. The easiest guess is $u_{n}$, though additional information about the equation can be used to compute a better starting value (known as a step predictor). Once we have a starting value, we run the iteration:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

where $J(x_k)$ is the Jacobian of $g$ at the point $x_k$. However, the mathematical formulation is never the syntax that you should use for the actual application! Instead, numerically this is two stages:

  • Solve $Ja=g(x_k)$ for $a$

  • Update $x_{k+1} = x_k - a$

By doing this, we can turn the matrix inversion into a problem of a linear solve and then an update. The reasons for this are many, but one major reason is that the inverse of a sparse matrix can be dense, and this Jacobian is in many cases (e.g. PDE discretizations) a large sparse matrix whose dense inverse may not even fit into memory.
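
A minimal sketch of this solve-then-update Newton iteration applied to the implicit Euler residual (the helper names and the use of ForwardDiff.jl for the Jacobian are illustrative assumptions, not the solver's actual internals):

 using LinearAlgebra, ForwardDiff

 # Newton iteration for g(x) = 0 using "solve then update" instead of an explicit inverse
 function newton(g, x0; tol = 1e-10, maxiters = 25)
     x = copy(x0)
     for _ in 1:maxiters
         J = ForwardDiff.jacobian(g, x)   # Jacobian of the residual
         a = J \ g(x)                     # solve J a = g(x) (factorization, not inversion)
         x -= a
         norm(a) < tol && break
     end
     x
 end

 # Implicit Euler step for u' = f(u): solve g(u_next) = u_next - Δt f(u_next) - u_n = 0
 f(u) = -50.0 .* u                # a stiff-ish linear test problem
 un, dt = [1.0], 0.1
 g(u) = u .- dt .* f(u) .- un
 @show newton(g, un)              # ≈ un ./ (1 + 50 dt) = [1/6]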

Now let's break this down step by step.

Some Quick Notes

The Jacobian of $g$ can also be written as $J = I - \gamma \frac{df}{du}$ for the ODE $u' = f(u,p,t)$, where $\gamma = \Delta t$ for the implicit Euler method. This general form holds for all other (SDIRK) implicit methods, changing the value of $\gamma$. Additionally, the class of Rosenbrock methods solves a linear system with exactly the same $J$, meaning that essentially all implicit and semi-implicit ODE solvers have to do the same Newton iteration process on the same structure. This is the portion of the code that is generally the bottleneck.

Additionally, if one is solving a mass matrix ODE: $Mu' = f(u,p,t)$, exactly the same treatment can be had with $J = M - \gamma \frac{df}{du}$. This works even if $M$ is singular, a case known as a differential-algebraic equation or a DAE. A DAE for example can be an ODE with constraint equations, and these structures can be represented as an ODE where these constraints lead to a singularity in the mass matrix (a row of all zeros in the mass matrix corresponds to an equation with no derivative term, i.e. a constraint stating that the right-hand side equals zero!).

Generation of the Jacobian

Dense Finite Differences and Forward-Mode AD

Recall that the Jacobian is the matrix of $\frac{df_i}{dx_j}$ for $f$ a vector-valued function. The simplest way to generate the Jacobian is through finite differences. For $h_j = h e_j$, where $e_j$ is the basis vector of the $j$th axis and $h$ is some sufficiently small value, we can compute column $j$ of the Jacobian by:

\[ \frac{f(x+h_j)-f(x)}{h} \]

Thus $m+1$ applications of $f$ are required to compute the full Jacobian.

This can be improved by using forward-mode automatic differentiation. Recall that we can formulate a multidimensional dual number of the form

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

We can then seed the vectors $v_j = h_j$ so that the differentiation directions are along the basis vectors, and then the output dual is the result:

\[ f(d) = f(x) + J_1 \epsilon_1 + \ldots + J_m \epsilon_m \]

where $J_j$ is the $j$th column of the Jacobian. And thus with one calculation of the primal (f(x)) we have calculated the entire Jacobian.
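
This is, for instance, what ForwardDiff.jl's `jacobian` does in the dense case (a sketch assuming ForwardDiff is available): it seeds the dual partials along the basis directions (in chunks) and reads the columns back out of the output duals. The example function here is the one analyzed in the next section.

 using ForwardDiff

 f(x) = [x[1] + x[3], x[2] * x[3], x[1]]   # example system used below

 x = [1.0, 2.0, 3.0]
 J = ForwardDiff.jacobian(f, x)
 # 3×3 Matrix{Float64}:
 #  1.0  0.0  1.0
 #  0.0  3.0  2.0
 #  1.0  0.0  0.0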

Sparse Differentiation and Matrix Coloring

However, when the Jacobian is sparse we can compute it much faster. We can understand this by looking at the following system:

\[ f(x)=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}\\ x_{1} \end{array}\right] \]

Notice that in 3 differencing steps we can calculate:

\[ f(x+\epsilon e_{1})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}\\ x_{1}+\epsilon \end{array}\right] \]

\[ f(x+\epsilon e_{2})=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}+\epsilon x_{3}\\ x_{1} \end{array}\right] \]

\[ f(x+\epsilon e_{3})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}+\epsilon x_{2}\\ x_{1} \end{array}\right] \]

and thus:

\[ \frac{f(x+\epsilon e_{1})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ 0\\ 1 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 0\\ x_{3}\\ 0 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{3})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{2}\\ 0 \end{array}\right] \]

But notice that the calculation of $e_1$ and $e_2$ do not interact. If we had done:

\[ \frac{f(x+\epsilon e_{1}+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{3}\\ 1 \end{array}\right] \]

we would still get the correct value for every row because the $\epsilon$ terms do not collide (if they did collide, we would have the situation known as perturbation confusion). If we knew the sparsity pattern of the Jacobian included a 0 at (2,1), (1,2), and (3,2), then we would know that the compressed columns would have to be $[1\ 0\ 1]$ and $[0\ x_3\ 0]$, meaning that columns 1 and 2 can be computed simultaneously and then decompressed. This is the key to sparse differentiation.

With forward-mode automatic differentiation, recall that we calculate multiple dimensions simultaneously by using a multidimensional dual number seeded by the vectors of the differentiation directions, that is:

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

Instead of using the primitive differentiation directions $e_j$, we can instead replace this with the mixed values. For example, the Jacobian of the example function can be computed in one function call to $f$ with the dual number input:

\[ d = x + (e_1 + e_2) \epsilon_1 + e_3 \epsilon_2 \]

and performing the decompression via the sparsity pattern. Thus the sparsity pattern gives a direct way to optimize the construction of the Jacobian.

This idea of independent directions can be formalized as a matrix coloring. Take $S_{ij}$ to be the sparsity pattern of some Jacobian matrix $J_{ij}$. Define a graph on the nodes 1 through m where there is an edge between $i$ and $j$ if there is a row in which both columns $i$ and $j$ are non-zero. This graph is the column connectivity graph of the Jacobian. What we wish to do is find the smallest set of differentiation directions such that differentiating in the direction of $e_i$ does not collide with differentiation in the direction of $e_j$. The connectivity graph is set up so that this cannot be done if the two nodes are adjacent. If we let the subset of nodes differentiated together be a color, the question is, what is the smallest number of colors s.t. no adjacent nodes are the same color. This is the classic distance-1 coloring problem from graph theory. It is well-known that the problem of finding the chromatic number, the minimal number of colors for a graph, is generally NP-complete. However, there are heuristic methods for performing a distance-1 coloring quite quickly. For example, a greedy algorithm is as follows:

  • Pick a node at random to be color 1.

  • Make all nodes adjacent to that be the lowest color that they can be (in this step that will be 2).

  • Now look at all nodes adjacent to that. Make all nodes be the lowest color that they can be (either 1 or 3).

  • Repeat by looking at the next set of adjacent nodes and color as conservatively as possible.

This can be visualized as follows:

The result will color the entire connected component. While not giving an optimal result, it will still give a result that is a sufficient reduction in the number of differentiation directions (without solving an NP-complete problem) and thus can lead to a large computational saving.
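
A minimal sketch of such a greedy distance-1 coloring, built directly from a Boolean sparsity pattern (all names here are illustrative; packages such as SparseDiffTools.jl provide optimized routines for this):

 # Greedy distance-1 coloring of the column connectivity graph of a sparsity pattern S.
 # Columns i and j are adjacent if some row has nonzeros in both.
 function greedy_column_coloring(S::AbstractMatrix{Bool})
     m = size(S, 2)
     adjacent(i, j) = any(S[:, i] .& S[:, j])
     colors = zeros(Int, m)
     for i in 1:m
         used = Set(colors[j] for j in 1:m if j != i && colors[j] != 0 && adjacent(i, j))
         c = 1
         while c in used; c += 1; end   # smallest color not used by a neighbor
         colors[i] = c
     end
     colors
 end

 # Sparsity pattern of the example Jacobian: columns 1 and 2 can share a color
 S = Bool[1 0 1; 0 1 1; 1 0 0]
 @show greedy_column_coloring(S)   # [1, 1, 2]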

At the end, let $c_i$ be the vector of 1's and 0's, where it's 1 for every node that is color $i$ and 0 otherwise. Sparse automatic differentiation of the Jacobian is then computed with:

\[ d = x + c_1 \epsilon_1 + \ldots + c_k \epsilon_k \]

that is, the full Jacobian is computed with one dual number which consists of the primal calculation along with $k$ dual dimensions, where $k$ is the computed chromatic number of the connectivity graph on the Jacobian. Once this calculation is complete, the colored columns can be decompressed into the full Jacobian using the sparsity information, generating the original quantity that we wanted to compute.

For more information on the graph coloring aspects, find the paper titled "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives" by Gebremedhin.

Note on Sparse Reverse-Mode AD

Reverse-mode automatic differentiation can be thought of as a method for computing one row of a Jacobian per seed, as opposed to one column per seed given by forward-mode AD. Thus sparse reverse-mode automatic differentiation can be done by looking at the connectivity graph of the rows and using the resulting color vectors to seed the reverse accumulation process.

Linear Solving

After the Jacobian has been computed, we need to solve a linear equation $Ja=b$. While mathematically you can solve this by computing the inverse $J^{-1}$, this is not a good way to perform the calculation because even if $J$ is sparse, then $J^{-1}$ is in general dense and thus may not fit into memory (remember, this is $N^2$ as many terms, where $N$ is the size of the ordinary differential equation that is being solved, so if it's a large equation it is very feasible and common that the ODE is representable but its full Jacobian is not able to fit into RAM). Note that some may say that this is done for numerical stability reasons: that is incorrect. In fact, under reasonable assumptions for how the inverse is computed, it will be as numerically stable as other techniques we will mention.

Thus instead of generating the inverse, we can instead perform a matrix factorization. A matrix factorization is a transformation of the matrix into a form that is more amenable to certain analyses. For our purposes, a general Jacobian within a Newton iteration can be transformed via the LU-factorization (or LU-decomposition), i.e.

\[ J = LU \]

where $L$ is lower triangular and $U$ is upper triangular. If we write the linear equation in this form:

\[ LUa = b \]

then we see that we can solve it by first solving $Lw = b$ for the intermediate vector $w = Ua$. Since $L$ is lower triangular, this is done by forward substitution. That is, in a lower triangular form, we can solve for the first value since we have:

\[ L_{11} w_1 = b_1 \]

and thus by dividing we solve. For the next term, we have that

\[ L_{21} w_1 + L_{22} w_2 = b_2 \]

and thus we plug in the solution to $w_1$ and solve to get $w_2$. The lower triangular form allows this to continue. This occurs in 1+2+3+...+n operations, and is thus O(n^2). Next, we solve $Ua = w$, which is done by the analogous back substitution, i.e. working in the reverse direction. Together those two operations are O(n^2) and complete the solve of $LUa = b$.

So is this an O(n^2) algorithm for computing the solution of a linear system? No, because the computation of $LU$ itself is an O(n^3) calculation, and thus the true complexity of solving a linear system is still O(n^3). However, if we have already factorized $J$, then we can repeatedly use the same $LU$ factors to solve additional linear problems $Jv = u$ with different vectors. We can exploit this to accelerate the Newton method. Instead of doing the calculation:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

we can instead do:

\[ x_{k+1} = x_k - J(x_0)^{-1}g(x_k) \]

so that all of the Jacobians are the same. This means that a single O(n^3) factorization can be done, with multiple O(n^2) calculations using the same factorization. This is known as a Quasi-Newton method. While this makes the Newton method no longer quadratically convergent, it minimizes the large constant factor on the computational cost while retaining the same dynamical properties, i.e. the same steady state and thus the same overall solution. This makes sense for sufficiently large $n$, but requires sufficiently large $n$ because the loss of quadratic convergence means that it will take more steps to converge than before, and thus more $O(n^2)$ backsolves are required, meaning that the difference between factorizations and backsolves needs to be large enough in order to offset the cost of extra steps.
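
A sketch of this reuse pattern, where one `lu` factorization is amortized over many backsolves (names illustrative; the single Jacobian here is computed with ForwardDiff.jl as an assumption):

 using LinearAlgebra, ForwardDiff

 function quasi_newton(g, x0; tol = 1e-10, maxiters = 100)
     x = copy(x0)
     F = lu(ForwardDiff.jacobian(g, x0))   # one O(n³) factorization at x0...
     for _ in 1:maxiters
         a = F \ g(x)                      # ...reused for many O(n²) backsolves
         x -= a
         norm(a) < tol && break
     end
     x
 end

 g(x) = [x[1]^2 - 2, x[2]^2 - 3]    # roots at (√2, √3)
 @show quasi_newton(g, [1.5, 1.8])  # converges, just with more (cheap) iterations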

Note on Sparse Factorization

Note that LU-factorization, and other factorizations, have generalizations to sparse matrices where a symbolic factorization is utilized to compute a sparse storage of the values which then allow for a fast backsubstitution. More details are outside the scope of this course, but note that Julia and MATLAB will both use the library SuiteSparse in the background when lu is called on a sparse matrix.

Jacobian-Free Newton Krylov (JFNK)

An alternative method for solving the linear system is the Jacobian-Free Newton Krylov technique. This technique is broken into two pieces: the jvp calculation and the Krylov subspace iterative linear solver.

Jacobian-Vector Products as Directional Derivatives

We don't actually need to compute $J$ itself, since all that we actually need is the product $w = Jv$. Is it possible to compute the Jacobian-Vector Product, or the jvp, without producing the Jacobian?

To see how this is done let's take a look at what is actually calculated. Written out in the standard basis, we have that:

\[ w_i = \sum_{j}^{m} J_{ij} v_{j} \]

Now write out what $J$ means and we see that:

\[ w_i = \sum_j^{m} \frac{df_i}{dx_j} v_j = \nabla f_i(x) \cdot v \]

that is, the $i$th component of $Jv$ is the directional derivative of $f_i$ in the direction $v$. This means that in general, the jvp $Jv$ is actually just the directional derivative in the direction of $v$, that is:

\[ Jv = \nabla f \cdot v \]

and therefore it has another mathematical representation, that is:

\[ Jv = \lim_{\epsilon \rightarrow 0} \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

From this alternative form it is clear that we can always compute a jvp with a single computation. Using finite differences, a simple approximation is the following:

\[ Jv \approx \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

for non-zero $\epsilon$. Similarly, recall that in forward-mode automatic differentiation we can choose directions by seeding the dual part. Therefore, using the dual number with one partial component:

\[ d = x + v \epsilon \]

we get that

\[ f(d) = f(x) + Jv \epsilon \]

and thus a single application with a single partial gives the jvp.
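
Both jvp formulations take only a few lines (a sketch; computing the directional derivative through `ForwardDiff.derivative` of a scalar-parameterized path is one of several possible implementations, and the choice of $\epsilon$ below is a simple heuristic):

 using ForwardDiff, LinearAlgebra

 f(x) = [x[1] + x[3], x[2] * x[3], x[1]]
 x = [1.0, 2.0, 3.0]
 v = [1.0, 1.0, 0.0]

 # Finite-difference jvp: (f(x + εv) - f(x)) / ε for a small ε
 jvp_fd(f, x, v; ε = sqrt(eps()) * norm(x)) = (f(x .+ ε .* v) .- f(x)) ./ ε

 # Forward-mode jvp: directional derivative d/dε f(x + ε v) at ε = 0, one dual pass
 jvp_ad(f, x, v) = ForwardDiff.derivative(ε -> f(x .+ ε .* v), 0.0)

 @show jvp_fd(f, x, v)
 @show jvp_ad(f, x, v)   # exact: J*v = [1.0, 3.0, 1.0]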

Note on Reverse-Mode Automatic Differentiation

As noted earlier, reverse-mode automatic differentiation has its primitives compute rows of the Jacobian in the seeded direction. This means that the seeded reverse-mode call with the vector $v$ computes $v^T J$, that is the vector (transpose) Jacobian transpose, or vjp for short. When discussing parameter estimation and adjoints, this shorthand will be introduced as a way for using a traditionally machine learning tool to accelerate traditionally scientific computing tasks.

Krylov Subspace Methods For Solving Linear Systems

Basic Iterative Solver Methods

Now that we have direct access to quick calculations of $Jv$, how would we use this to solve the linear system $Jw = v$ quickly? This is done through iterative linear solvers. These methods replace the process of solving for a factorization with, you may have guessed it, a discrete dynamical system whose solution is $w$. To do this, what we want is some iterative process so that

\[ Jw - v = 0 \]

So now let's split $J = A - B$, then if we are iterating the vectors $w_k$ such that $w_k \rightarrow w$, then if we plug this into the previous (residual) equation we get

\[ A w_{k+1} = Bw_k + v \]

since when we plug in $w$ we get zero (the sequence must be Cauchy so the difference $w_{k+1} - w_k \rightarrow 0$). Thus if we can split our matrix $J$ into a component $A$ which is easy to invert and a part $B$ that is just everything else, then we would have a bunch of easy linear systems to solve. There are many different choices that we can do. If we let $J = L + D + U$, where $L$ is the lower portion of $J$, $D$ is the diagonal, and $U$ is the upper portion, then the following are well-known methods:

  • Richardson: $A = \omega I$ for some $\omega$

  • Jacobi: $A = D$

  • Damped Jacobi: $A = \omega D$

  • Gauss-Seidel: $A = D-L$

  • Successive Over Relaxation: $A = \omega D - L$

  • Symmetric Successive Over Relaxation: $A = \frac{1}{\omega (2 - \omega)}(D-\omega L)D^{-1}(D-\omega U)$

These decompositions are chosen since a diagonal matrix is easy to invert (it's just the inversion of the scalars of the diagonal) and it's easy to solve an upper or lower triangular linear system (once again, it's backsubstitution).

Since these methods give a linear dynamical system, we know that there is a unique steady state solution, which happens to be $Aw - Bw = Jw = v$. Thus we will converge to it as long as the steady state is stable. To see if it's stable, take the update equation

\[ w_{k+1} = A^{-1}(Bw_k + v) \]

and check the eigenvalues of the system: if they are within the unit circle then you have stability. Notice that this can always occur by bringing the eigenvalues of $A^{-1}$ closer to zero, which can be done by multiplying $A$ by a significantly large value, hence the $\omega$ quantities. While that always works, this essentially amounts to decreasing the stepsize of the iterative process and thus requiring more steps, thus making it take more computations. Thus the game is to pick the largest stepsize ($\omega$) for which the steady state is stable. We will leave that as outside the topic of this course.
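
As a sketch, the Jacobi splitting ($A = D$) applied to a small diagonally dominant system (illustrative names and problem):

 using LinearAlgebra

 # Jacobi iteration: split J = D + (L + U), iterate D w_{k+1} = v - (L + U) w_k
 function jacobi(J, v; iters = 50)
     D = Diagonal(J)
     R = J - D                 # the "everything else" part of the splitting
     w = zero(v)
     for _ in 1:iters
         w = D \ (v - R * w)
     end
     w
 end

 J = [4.0 1.0 0.0; 1.0 4.0 1.0; 0.0 1.0 4.0]   # diagonally dominant, so Jacobi converges
 v = [1.0, 2.0, 3.0]
 @show norm(jacobi(J, v) - J \ v)              # tiny residual after 50 sweeps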

Krylov Subspace Methods

While the classical iterative solver methods give the background for understanding an alternative to direct inversion or factorization of a matrix, the problem with that approach is that it requires the ability to split the matrix $J$, which we would like to avoid computing. Instead, we would like to develop an iterative solver technique which instead just uses the solution to $Jv$. Indeed there are such methods, and these are the Krylov subspace methods. A Krylov subspace is the space spanned by:

\[ \mathcal{K}_k = \text{span} \{v,Jv,J^2 v, \ldots, J^k v\} \]

There are a few nice properties about Krylov subspaces that can be exploited. For one, it is known that there is a finite maximum dimension of the Krylov subspace, that is, there is a value $r$ such that $J^{r+1} v \in \mathcal{K}_r$, which means that the complete Krylov subspace can be computed in finitely many jvps, since $J^2 v$ is just the jvp where the vector is the previous jvp. Indeed, one can show that the vectors $J^i v$ remain linearly independent up to that point, and thus the maximal value of $r$ is at most $m$, the dimension of the Jacobian. Therefore in at most $m$ jvps the solution is guaranteed to live in the Krylov subspace, giving a maximal computational cost and a proof of convergence if the vector chosen from the subspace is the "optimal in the space".

The most common method in the Krylov subspace family of methods is the GMRES method. Essentially, in step $i$ one computes $\mathcal{K}_i$ and finds the best approximation within the Krylov subspace, i.e. the $x \in \mathcal{K}_i$ such that the residual $\Vert Jx-v \Vert$ is minimized. At each step, it adds the new vector to the Krylov subspace after orthogonalizing it against the other vectors via Arnoldi iterations, leading to an orthogonal basis of $\mathcal{K}_i$ which makes it easy to express $x$.

While one has a guaranteed bound on the number of possible jvps in GMRES which is simply the number of ODEs (since that is what determines the size of the Jacobian and thus the total dimension of the problem), that bound is not necessarily a good one. For a large sparse matrix, it may be computationally impractical to ever compute 100,000 jvps. Thus one does not typically run the algorithm to conclusion, and instead stops when $\Vert Jx-v \Vert$ is sufficiently below some user-defined error tolerance.

Intermediate Conclusion

Let's take a step back and see what our intermediate conclusion is. In order to solve for the implicit step, it just boils down to doing Newton's method on some $g(x)=0$. If the Jacobian is small enough, one factorizes the Jacobian and uses Quasi-Newton iterations in order to utilize the stored LU-decomposition in multiple steps to reduce the computation cost. If the Jacobian is sparse, sparse automatic differentiation through matrix coloring is employed to directly fill the sparse matrix with less applications of $g$, and then this sparse matrix is factorized using a sparse LU factorization.

When the matrix is too large, then one resorts to using a Krylov subspace method, since this only requires being able to do $Jv$ calculations. In general, $Jv$ can be done matrix-free because it is simply the directional derivative in the direction of the vector $v$, which can be computed through either numerical or forward-mode automatic differentiation. This is then used in the GMRES iterative process to find the solution in the Krylov subspace which is closest to the solution, exiting early when the residual error is small enough. If this is converging too slowly, then preconditioning is used.

That's the basic algorithm, but what are the other important details for getting this right?

The Need for Speed

Preconditioning

However, the speed at which GMRES converges is dependent on the correlations between the vectors, which can be shown to be related to the condition number of the Jacobian matrix. A high condition number makes convergence slower (this is the case for the traditional iterative methods as well), which in turn is an issue because it is the high condition number on the Jacobian which leads to stiffness and causes one to have to use an implicit integrator in the first place!

To help speed up the convergence, a common technique is known as preconditioning. Preconditioning is the process of using an approximate, easily invertible version of the matrix in order to split the problem so that the iterative problem being solved has a smaller condition number. Mathematically, it involves decomposing $J = P_l A P_r$ where $P_l$ and $P_r$ are the left and right preconditioners which have simple inverses, and thus instead of solving $Jx=v$, we would solve:

\[ P_l A P_r x = v \]

or

\[ A P_r x = P_l^{-1}v \]

which then means that the Krylov subspace that needs to be solved for is that defined by $A$: $\mathcal{K} = \text{span}\{v,Av,A^2 v, \ldots\}$. There are many possible choices for these preconditioners, but they are usually problem dependent. For example, for ODEs which come from parabolic and elliptic PDE discretizations, the multigrid method, such as a geometric multigrid or an algebraic multigrid, is a preconditioner that can accelerate the iterative solving process. One generic preconditioner that can generally be used is to divide by the norm of the vector $v$, which is a scaling employed by both SUNDIALS CVODE and by DifferentialEquations.jl and can be shown to be almost always advantageous.

Jacobian Re-use

If the problem is small enough such that the factorization is used and a Quasi-Newton technique is employed, it then holds that for most steps $J$ is only approximate since it can be using an old LU-factorization. To push it even further, high performance codes allow for Jacobian reuse, which allows the same Jacobian to be reused between different timesteps. If the Jacobian is too incorrect, it can cause the Newton iterations to diverge, which is then when one would calculate a new Jacobian and compute a new LU-factorization.

Adaptive Timestepping

In simple cases, like partial differential equation discretizations of physical problems, the resulting ODEs are not too stiff and thus Newton's iteration generally works. However, in cases like stiff biological models, Newton's iteration can itself not always be stable enough to allow convergence. In fact, with many of the stiff biological models commonly used in benchmarks, no method is stable enough to pass without using adaptive timestepping! Thus one may need to adapt the timestep in order to improve the ability for the Newton method to converge (smaller timesteps increase the stability of the Newton stepping, see the homework).

This needs to be mixed with the Jacobian re-use strategy, since $J = I - \gamma \frac{df}{du}$ where $\gamma$ is dependent on $\Delta t$ (and $\gamma = \Delta t$ for implicit Euler) means that the Jacobian of the Newton method changes as $\Delta t$ changes. Thus one usually has a tiered algorithm for determining when to update the factorizations of $J$ vs when to compute a new $\frac{df}{du}$ and then refactorize. This is generally dependent on estimates of convergence rates to heuristically guess how far off $\frac{df}{du}$ is from the current true value.

So how does one perform adaptivity? This is generally done through a rejection sampling technique. First one needs some estimate of the error in a step. This is calculated through an embedded method, which is a method that can be calculated without any extra $f$ evaluations and that is (usually) one order different from the true method. The difference between the true and the embedded method is then an error estimate. If this is greater than a user-chosen tolerance, the step is rejected and re-run with a smaller $\Delta t$ (possibly refactorizing, etc.). If this is less than the user tolerance, the step is accepted and $\Delta t$ is adjusted (possibly enlarged).

There are many schemes for how one can change $\Delta t$. One of the most common is known as the P-control, which stands for the proportional controller which is used throughout control theory. In this case, the control is to change $\Delta t$ in proportion to the current error ratio from the desired tolerance. If we let

\[ q = \frac{\text{E}}{\max(u_k,u_{k+1}) \tau_r + \tau_a} \]

where $\tau_r$ is the relative tolerance and $\tau_a$ is the absolute tolerance, then $q$ is the ratio of the current error to the current tolerance. If $q<1$, then the error is less than the tolerance and the step is accepted, and vice versa for $q>1$. In either case, the proportional update chooses the new step size from $q$, e.g. $\Delta t_{new} = q^{-1/(p+1)} \Delta t$ for a method of order $p$, so that the step shrinks when the error is above tolerance and grows when it is below.
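
A sketch of this accept/reject decision with a proportional controller; the safety factor and the $1/(p+1)$ exponent are standard choices included here as assumptions:

 # One proportional-control decision: given an error estimate E and tolerances,
 # accept or reject the step and propose the next Δt.
 function p_control(E, u_prev, u_new, dt; reltol = 1e-3, abstol = 1e-6, order = 4)
     q = E / (max(abs(u_prev), abs(u_new)) * reltol + abstol)
     accept = q <= 1
     dt_new = 0.9 * dt * q^(-1 / (order + 1))   # shrink if q > 1, grow if q < 1
     accept, dt_new
 end

 @show p_control(1e-5, 1.0, 1.1, 0.1)   # error below tolerance: accept and enlarge Δt
 @show p_control(1e-2, 1.0, 1.1, 0.1)   # error above tolerance: reject and shrink Δt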

However, proportional error control has many known features that are undesirable. For example, it happens to work in a "bang bang" manner, meaning that it can drastically change its behavior from step to step. One step may multiply the step size by 10x, then the next by 2x. This is an issue because it affects the stability of the ODE solver method (since the stability is not a property of a single step, but rather it's a property of the global behavior over time)! Thus to smooth it out, one can use a PI-control, which modifies the control factor by a history value, i.e. the error in one step in the past. This of course also means that one can utilize a PID-controller for time stepping. And there are many other techniques that can be used, but many of the most optimized codes tend to use a PI-control mechanism.

Methodological Summary

Here's a quick summary of the methodologies in a hierarchical sense:

  • At the lowest level is the linear solve, either done by JFNK or (sparse) factorization. For large enough systems, this is the brunt of the work. This is thus the piece to computationally optimize as much as possible, and parallelize. For sparse factorizations, this can be done with a distributed sparse library implementation. For JFNK, the efficiency is simply due to the efficiency of your ODE function f.

  • An optional level for JFNK is the preconditioning level, where preconditioners can be used to decrease the total number of iterations required for Krylov subspace methods like GMRES to converge, and thus reduce the total number of f calls.

  • At the nonlinear solver level, different Newton-like techniques are utilized to minimize the number of factorizations/linear solves required, and maximize the stability of the Newton method.

  • At the ODE solver level, more efficient integrators and adaptive methods for stiff ODEs are used to reduce the cost by affecting the linear solves. Most of these calculations are dominated by the linear solve portion when it's in the regime of large stiff systems. Jacobian reuse techniques, partial factorizations, and IMEX methods come into play as ways to reduce the cost per factorization and reduce the total number of factorizations.

\ No newline at end of file + Solving Stiff Ordinary Differential Equations - MIT Parallel Computing and Scientific Machine Learning (SciML)

Solving Stiff Ordinary Differential Equations

Chris Rackauckas
October 14th, 2020

Youtube Video Link

We have previously shown how to solve non-stiff ODEs via optimized Runge-Kutta methods, but we ended by showing that there is a fundamental limitation of these methods when attempting to solve stiff ordinary differential equations. However, we can get around these limitations by using different types of methods, like implicit Euler. Let's now go down the path of understanding how to efficiently implement stiff ordinary differential equation solvers, and its interaction with other domains like automatic differentiation.

When one is solving a large-scale scientific computing problem with MPI, this is almost always the piece of code where all of the time is spent, so let's understand what it's doing.

Newton's Method and Jacobians

Recall that the implicit Euler method is the following:

\[ u_{n+1} = u_n + \Delta t f(u_{n+1},p,t + \Delta t) \]

If we wanted to use this method, we would need to find out how to get the value $u_{n+1}$ when only knowing the value $u_n$. To do so, we can move everything to one side:

\[ u_{n+1} - \Delta t f(u_{n+1},p,t + \Delta t) - u_n = 0 \]

and now we have a problem

\[ g(u_{n+1}) = 0 \]

This is the classic rootfinding problem $g(x)=0$, find $x$. The way that we solve the rootfinding problem is, once again, by replacing this problem about a continuous function $g$ with a discrete dynamical system whose steady state is the solution to $g(x)=0$. There are many methods for this, but some choices of the rootfinding method affect the stability of the ODE solver itself, since we need to make sure that the steady state solution is a stable steady state of the iteration process; otherwise the rootfinding method will diverge (this will be explored in the homework).

Thus for example, fixed point iteration is not appropriate for stiff differential equations. Methods which are used in the stiff case are either Anderson Acceleration or Newton's method. Newton's method is by far the most common (and generally performs the best), so we can go down this route.

Let's use the syntax $g(x)=0$. Here we need some starting value $x_0$ as our first guess for $u_{n+1}$. The easiest guess is $u_{n}$, though additional information about the equation can be used to compute a better starting value (known as a step predictor). Once we have a starting value, we run the iteration:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

where $J(x_k)$ is the Jacobian of $g$ at the point $x_k$. However, the mathematical formulation is never the syntax that you should use for the actual application! Instead, numerically this is two stages:

  • Solve $Ja=g(x_k)$ for $a$

  • Update $x_{k+1} = x_k - a$

By doing this, we can turn the matrix inversion into a problem of a linear solve and then an update. The reasons for this are many, but one major reason is that the inverse of a sparse matrix can be dense, and this Jacobian is in many cases (e.g. PDE discretizations) a large sparse matrix whose dense inverse may not even fit into memory.

Now let's break this down step by step.

Some Quick Notes

The Jacobian of $g$ can also be written as $J = I - \gamma \frac{df}{du}$ for the ODE $u' = f(u,p,t)$, where $\gamma = \Delta t$ for the implicit Euler method. This general form holds for all other (SDIRK) implicit methods, changing the value of $\gamma$. Additionally, the class of Rosenbrock methods solves a linear system with exactly the same $J$, meaning that essentially all implicit and semi-implicit ODE solvers have to do the same Newton iteration process on the same structure. This is the portion of the code that is generally the bottleneck.

Additionally, if one is solving a mass matrix ODE: $Mu' = f(u,p,t)$, exactly the same treatment can be had with $J = M - \gamma \frac{df}{du}$. This works even if $M$ is singular, a case known as a differential-algebraic equation or a DAE. A DAE for example can be an ODE with constraint equations, and these structures can be represented as an ODE where these constraints lead to a singularity in the mass matrix (a row of all zeros in the mass matrix corresponds to an equation with no derivative term, i.e. a constraint stating that the right-hand side equals zero!).

Generation of the Jacobian

Dense Finite Differences and Forward-Mode AD

Recall that the Jacobian is the matrix of $\frac{df_i}{dx_j}$ for $f$ a vector-valued function. The simplest way to generate the Jacobian is through finite differences. With $h_j = h e_j$, where $e_j$ is the basis vector of the $j$th axis and $h$ is some sufficiently small value, we can compute column $j$ of the Jacobian by:

\[ \frac{f(x+h_j)-f(x)}{h} \]

Thus $m+1$ applications of $f$ are required to compute the full Jacobian.
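As a sketch (with illustrative names), a dense finite-difference Jacobian can be written as:

```julia
# Finite-difference Jacobian: one evaluation of f(x) plus one per column,
# i.e. m+1 applications of f in total.
function fd_jacobian(f, x; h=sqrt(eps()))
    fx = f(x)
    J = zeros(length(fx), length(x))
    for j in eachindex(x)
        xh = copy(x)
        xh[j] += h
        J[:, j] = (f(xh) - fx) / h
    end
    return J
end
```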

This can be improved by using forward-mode automatic differentiation. Recall that we can formulate a multidimensional dual number of the form

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

We can then seed the vectors $v_j = h_j$ so that the differentiation directions are along the basis vectors, and then the output dual is the result:

\[ f(d) = f(x) + J_1 \epsilon_1 + \ldots + J_m \epsilon_m \]

where $J_j$ is the $j$th column of the Jacobian. And thus with one calculation of the primal (f(x)) we have calculated the entire Jacobian.
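In Julia this is, for example, what ForwardDiff.jl does under the hood; a one-liner like `x -> ForwardDiff.jacobian(f, x)` could serve as the `jac` argument in the Newton sketch above. The example function here is made up:

```julia
using ForwardDiff

# One dual-number pass of f computes the whole (dense) Jacobian.
f(x) = [x[1]*x[2], sin(x[1]), x[2]^2]
ForwardDiff.jacobian(f, [1.0, 2.0])   # 3×2 Jacobian
```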

Sparse Differentiation and Matrix Coloring

However, when the Jacobian is sparse we can compute it much faster. We can understand this by looking at the following system:

\[ f(x)=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}\\ x_{1} \end{array}\right] \]

Notice that in 3 differencing steps we can calculate:

\[ f(x+\epsilon e_{1})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}\\ x_{1}+\epsilon \end{array}\right] \]

\[ f(x+\epsilon e_{2})=\left[\begin{array}{c} x_{1}+x_{3}\\ x_{2}x_{3}+\epsilon x_{3}\\ x_{1} \end{array}\right] \]

\[ f(x+\epsilon e_{3})=\left[\begin{array}{c} x_{1}+x_{3}+\epsilon\\ x_{2}x_{3}+\epsilon x_{2}\\ x_{1} \end{array}\right] \]

and thus:

\[ \frac{f(x+\epsilon e_{1})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ 0\\ 1 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 0\\ x_{3}\\ 0 \end{array}\right] \]

\[ \frac{f(x+\epsilon e_{3})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{2}\\ 0 \end{array}\right] \]

But notice that the calculations with $e_1$ and $e_2$ do not interact. If we had done:

\[ \frac{f(x+\epsilon e_{1}+\epsilon e_{2})-f(x)}{\epsilon}=\left[\begin{array}{c} 1\\ x_{3}\\ 1 \end{array}\right] \]

we would still get the correct value for every row because the $\epsilon$ terms do not collide (when seed directions do collide, that situation is known as perturbation confusion). If we knew the sparsity pattern of the Jacobian included a 0 at (2,1), (1,2), and (3,2), then we would know that the vectors would have to be $[1 0 1]$ and $[0 x_3 0]$, meaning that columns 1 and 2 can be computed simultaneously and decompressed. This is the key to sparse differentiation.
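A small sketch of this compression for the example function, assuming the sparsity pattern is known in advance:

```julia
# Columns 1 and 2 share no row, so they can be perturbed together and then
# decompressed using the sparsity pattern.
f(x) = [x[1] + x[3], x[2]*x[3], x[1]]
x = [1.0, 2.0, 3.0]
ϵ = sqrt(eps())
d12 = (f(x + ϵ*[1, 1, 0]) - f(x)) / ϵ   # columns 1 and 2 compressed together
d3  = (f(x + ϵ*[0, 0, 1]) - f(x)) / ϵ   # column 3 on its own

J = zeros(3, 3)
J[1, 1] = d12[1]; J[3, 1] = d12[3]      # rows where column 1 is structurally non-zero
J[2, 2] = d12[2]                        # row where column 2 is structurally non-zero
J[:, 3] = d3                            # column 3 decompressed directly
```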

With forward-mode automatic differentiation, recall that we calculate multiple dimensions simultaneously by using a multidimensional dual number seeded by the vectors of the differentiation directions, that is:

\[ d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

Instead of using the primitive differentiation directions $e_j$, we can instead replace this with the mixed values. For example, the Jacobian of the example function can be computed in one function call to $f$ with the dual number input:

\[ d = x + (e_1 + e_2) \epsilon_1 + e_3 \epsilon_2 \]

and performing the decompression via the sparsity pattern. Thus the sparsity pattern gives a direct way to optimize the construction of the Jacobian.

This idea of independent directions can be formalized as a matrix coloring. Take $S_{ij}$ the sparsity pattern of some Jacobian matrix $J_{ij}$. Define a graph on the nodes 1 through m where there is an edge between $i$ and $j$ if there is a row where columns $i$ and $j$ are both non-zero. This graph is the column connectivity graph of the Jacobian. What we wish to do is find the smallest set of differentiation directions such that differentiating in the direction of $e_i$ does not collide with differentiation in the direction of $e_j$. The connectivity graph is set up so that this cannot be done if the two nodes are adjacent. If we let the subset of nodes differentiated together be a color, the question is: what is the smallest number of colors such that no adjacent nodes share the same color? This is the classic distance-1 coloring problem from graph theory. It is well-known that the problem of finding the chromatic number, the minimal number of colors for a graph, is generally NP-complete. However, there are heuristic methods for performing a distance-1 coloring quite quickly. For example, a greedy algorithm is as follows:

  • Pick a node at random to be color 1.

  • Make all nodes adjacent to that be the lowest color that they can be (in this step that will be 2).

  • Now look at all nodes adjacent to that. Make all nodes be the lowest color that they can be (either 1 or 3).

  • Repeat by looking at the next set of adjacent nodes and color as conservatively as possible.

This can be visualized as follows:

The result will color the entire connected component. While not giving an optimal result, it will still give a sufficient reduction in the number of differentiation directions (without solving an NP-complete problem) and thus can lead to a large computational saving.
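A simplified greedy coloring sketch (coloring nodes in index order rather than the breadth-first sweep described above) might look like the following, with `adj[i]` assumed to hold the neighbors of column `i` in the connectivity graph:

```julia
function greedy_coloring(adj)
    n = length(adj)
    colors = zeros(Int, n)
    for i in 1:n
        taken = Set(colors[j] for j in adj[i] if colors[j] != 0)
        c = 1
        while c in taken
            c += 1
        end
        colors[i] = c
    end
    return colors
end

# For the 3-variable example: columns 1 and 2 never share a row; both share a row with column 3.
greedy_coloring([[3], [3], [1, 2]])   # -> [1, 1, 2]
```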

At the end, let $c_i$ be the vector of 1's and 0's, where it's 1 for every node that is color $i$ and 0 otherwise. Sparse automatic differentiation of the Jacobian is then computed with:

\[ d = x + c_1 \epsilon_1 + \ldots + c_k \epsilon_k \]

that is, the full Jacobian is computed with one dual number which consists of the primal calculation along with $k$ dual dimensions, where $k$ is the computed chromatic number of the connectivity graph on the Jacobian. Once this calculation is complete, the colored columns can be decompressed into the full Jacobian using the sparsity information, generating the original quantity that we wanted to compute.

For more information on the graph coloring aspects, see the paper "What Color Is Your Jacobian? Graph Coloring for Computing Derivatives" by Gebremedhin, Manne, and Pothen.

Note on Sparse Reverse-Mode AD

Reverse-mode automatic differentiation can be thought of as a method for computing one row of a Jacobian per seed, as opposed to one column per seed given by forward-mode AD. Thus sparse reverse-mode automatic differentiation can be done by looking at the connectivity graph of the rows (equivalently, the column connectivity graph of the transposed Jacobian) and using the resulting color vectors to seed the reverse accumulation process.

Linear Solving

After the Jacobian has been computed, we need to solve a linear equation $Ja=b$. While mathematically you can solve this by computing the inverse $J^{-1}$, this is not a good way to perform the calculation because even if $J$ is sparse, $J^{-1}$ is in general dense and thus may not fit into memory (remember, a dense inverse has $N^2$ entries, where $N$ is the size of the ordinary differential equation that is being solved, so if it's a large equation it is very feasible and common that the ODE is representable but its full Jacobian inverse is not able to fit into RAM). Note that some may say that this is done for numerical stability reasons: that is incorrect. In fact, under reasonable assumptions for how the inverse is computed, it will be as numerically stable as the other techniques we will mention.

Thus instead of generating the inverse, we can instead perform a matrix factorization. A matrix factorization is a transformation of the matrix into a form that is more amenable to certain analyses. For our purposes, a general Jacobian within a Newton iteration can be transformed via the LU-factorization (or LU-decomposition), i.e.

\[ J = LU \]

where $L$ is lower triangular and $U$ is upper triangular. If we write the linear equation in this form:

\[ LUa = b \]

then we see that we can solve it by first solving $Lz = b$ for the intermediate vector $z = Ua$. Since $L$ is lower triangular, this is done by forward substitution (the lower-triangular analogue of backsubstitution). That is, in a lower triangular form, we can solve for the first value since we have:

\[ L_{11} z_1 = b_1 \]

and thus by dividing we solve. For the next term, we have that

\[ L_{21} z_1 + L_{22} z_2 = b_2 \]

and thus we plug in the solution for $z_1$ and solve to get $z_2$. The lower triangular form allows this to continue. This occurs in 1+2+3+...+n operations, and is thus O(n^2). Next, we solve $Ua = z$, which once again is done by a substitution algorithm, but now in the reverse direction (backsubstitution). Together those two operations are O(n^2) and complete the solve of the factorized system.
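A sketch of the two triangular solves (in practice Julia's `\` dispatches to specialized routines when given LowerTriangular/UpperTriangular wrappers; this is only to show the O(n^2) structure):

```julia
# Solve L*U*a = b via forward substitution (L z = b) then backsubstitution (U a = z).
function lu_solve(L, U, b)
    n = length(b)
    z = similar(b)
    for i in 1:n
        z[i] = (b[i] - sum((L[i, j]*z[j] for j in 1:i-1); init=0.0)) / L[i, i]
    end
    a = similar(b)
    for i in n:-1:1
        a[i] = (z[i] - sum((U[i, j]*a[j] for j in i+1:n); init=0.0)) / U[i, i]
    end
    return a
end
```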

So is this an O(n^2) algorithm for computing the solution of a linear system? No, because the computation of $LU$ itself is an O(n^3) calculation, and thus the true complexity of solving a linear system is still O(n^3). However, if we have already factorized $J$, then we can repeatedly use the same $LU$ factors to solve additional linear problems $Jv = u$ with different vectors. We can exploit this to accelerate the Newton method. Instead of doing the calculation:

\[ x_{k+1} = x_k - J(x_k)^{-1}g(x_k) \]

we can instead do:

\[ x_{k+1} = x_k - J(x_0)^{-1}g(x_k) \]

so that all of the Jacobians are the same. This means that a single O(n^3) factorization can be done, with multiple O(n^2) backsolves using the same factorization. This is known as a Quasi-Newton method. While this makes the Newton method no longer quadratically convergent, it minimizes the large constant factor on the computational cost while retaining the same dynamical properties, i.e. the same steady state and thus the same overall solution. This pays off for sufficiently large $n$: the loss of quadratic convergence means that it will take more steps to converge than before, and thus more O(n^2) backsolves are required, so the gap between the cost of a factorization and the cost of a backsolve needs to be large enough to offset the cost of the extra steps.
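A Quasi-Newton sketch of that reuse, with placeholder names and stopping criteria:

```julia
using LinearAlgebra

function quasi_newton(g, J0, x0; tol=1e-10, maxiters=100)
    F = lu(J0)                 # one O(n^3) factorization at x0
    x = copy(x0)
    for _ in 1:maxiters
        gx = g(x)
        norm(gx) < tol && break
        x -= F \ gx            # each step reuses the factors: only O(n^2) work
    end
    return x
end
```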

Note on Sparse Factorization

Note that LU-factorization, and other factorizations, have generalizations to sparse matrices where a symbolic factorization is utilized to compute a sparse storage of the values which then allow for a fast backsubstitution. More details are outside the scope of this course, but note that Julia and MATLAB will both use the library SuiteSparse in the background when lu is called on a sparse matrix.

Jacobian-Free Newton Krylov (JFNK)

An alternative method for solving the linear system is the Jacobian-Free Newton Krylov technique. This technique is broken into two pieces: the jvp calculation and the Krylov subspace iterative linear solver.

Jacobian-Vector Products as Directional Derivatives

We don't actually need to compute $J$ itself, since all that we actually need is the matrix-vector product $w = Jv$. Is it possible to compute the Jacobian-Vector Product, or the jvp, without producing the Jacobian?

To see how this is done let's take a look at what is actually calculated. Written out in the standard basis, we have that:

\[ w_i = \sum_{j}^{m} J_{ij} v_{j} \]

Now write out what $J$ means and we see that:

\[ w_i = \sum_j^{m} \frac{df_i}{dx_j} v_j = \nabla f_i(x) \cdot v \]

that is, the $i$th component of $Jv$ is the directional derivative of $f_i$ in the direction $v$. This means that in general, the jvp $Jv$ is actually just the directional derivative in the direction of $v$, that is:

\[ Jv = \nabla f \cdot v \]

and therefore it has another mathematical representation, that is:

\[ Jv = \lim_{\epsilon \rightarrow 0} \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

From this alternative form it is clear that we can always compute a jvp with a single computation. Using finite differences, a simple approximation is the following:

\[ Jv \approx \frac{f(x+v \epsilon) - f(x)}{\epsilon} \]

for non-zero $\epsilon$. Similarly, recall that in forward-mode automatic differentiation we can choose directions by seeding the dual part. Therefore, using the dual number with one partial component:

\[ d = x + v \epsilon \]

we get that

\[ f(d) = f(x) + Jv \epsilon \]

and thus a single application with a single partial gives the jvp.
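Both forms are a few lines in Julia; the second is a sketch assuming ForwardDiff.jl, and the example function is made up:

```julia
using ForwardDiff

jvp_fd(f, x, v; ϵ=sqrt(eps())) = (f(x + ϵ*v) - f(x)) / ϵ           # finite difference
jvp_ad(f, x, v) = ForwardDiff.derivative(t -> f(x + t*v), 0.0)      # single dual partial

f(x) = [x[1]*x[2], sin(x[1])]
jvp_ad(f, [1.0, 2.0], [1.0, 0.0])   # first column of the Jacobian: [2.0, cos(1.0)]
```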

Note on Reverse-Mode Automatic Differentiation

As noted earlier, reverse-mode automatic differentiation has its primitives compute rows of the Jacobian in the seeded direction. This means that the seeded reverse-mode call with the vector $v$ computes $v^T J$, that is the vector(-transpose)-Jacobian product, or vjp for short. When discussing parameter estimation and adjoints, this shorthand will be used as a way of applying a traditionally machine learning tool to accelerate traditionally scientific computing tasks.

Krylov Subspace Methods For Solving Linear Systems

Basic Iterative Solver Methods

Now that we have direct access to quick calculations of $Jv$, how would we use this to solve the linear system $Jw = b$ quickly? This is done through iterative linear solvers. These methods replace the process of solving for a factorization with, you may have guessed it, a discrete dynamical system whose steady state is $w$. To do this, what we want is some iterative process so that the residual satisfies

\[ Jw - b = 0 \]

Now let's split $J = A - B$. If we are iterating vectors $w_k$ such that $w_k \rightarrow w$, then plugging this into the previous (residual) equation suggests the iteration

\[ A w_{k+1} = Bw_k + b \]

since when we plug in $w$ we get zero (the sequence must be Cauchy so the difference $w_{k+1} - w_k \rightarrow 0$). Thus if we can split our matrix $J$ into a component $A$ which is easy to invert and a part $B$ that is just everything else, then we would have a bunch of easy linear systems to solve. There are many different choices of splitting. If we let $J = L + D + U$, where $L$ is the lower portion of $J$, $D$ is the diagonal, and $U$ is the upper portion, then the following are well-known methods:

  • Richardson: $A = \omega I$ for some $\omega$

  • Jacobi: $A = D$

  • Damped Jacobi: $A = \omega D$

  • Gauss-Seidel: $A = D-L$

  • Successive Over Relaxation: $A = \omega D - L$

  • Symmetric Successive Over Relaxation: $A = \frac{1}{\omega (2 - \omega)}(D-\omega L)D^{-1}(D-\omega U)$

These decompositions are chosen since a diagonal matrix is easy to invert (it's just the inversion of the scalars of the diagonal) and it's easy to solve an upper or lower triangular linear system (once again, it's backsubstitution).

Since these methods give a linear dynamical system, we know that there is a unique steady state solution, which happens to be where $Aw - Bw = Jw = b$. Thus we will converge to it as long as the steady state is stable. To see if it's stable, take the update equation

\[ w_{k+1} = A^{-1}(Bw_k + b) \]

and check the eigenvalues of the system: if they are within the unit circle then you have stability. Notice that this can always be achieved by bringing the eigenvalues of the iteration matrix $A^{-1}B$ closer to zero, which can be done by multiplying $A$ by a significantly large value, hence the $\omega$ quantities. While that always works, it essentially amounts to decreasing the stepsize of the iterative process and thus requiring more steps, making it take more computations. Thus the game is to pick the largest stepsize ($\omega$) for which the steady state is stable. We will leave further discussion of this outside the scope of this course.
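As a concrete sketch, the Jacobi splitting ($A = D$) gives an iteration that only ever inverts the diagonal; the names and the fixed iteration count are illustrative:

```julia
using LinearAlgebra

# Jacobi iteration for Jw = b with the splitting A = D, B = D - J.
function jacobi(J, b; iters=100)
    D = Diagonal(J)          # diagonal part of J
    R = J - D                # everything except the diagonal
    w = zero(b)
    for _ in 1:iters
        w = D \ (b - R*w)    # w_{k+1} = A^{-1}(B w_k + b)
    end
    return w
end
```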

Krylov Subspace Methods

While the classical iterative solver methods give the background for understanding an alternative to direct inversion or factorization of a matrix, the problem with that approach is that it requires the ability to split the matrix $J$, which we would like to avoid computing. Instead, we would like to develop an iterative solver technique which instead just uses the solution to $Jv$. Indeed there are such methods, and these are the Krylov subspace methods. A Krylov subspace is the space spanned by:

\[ \mathcal{K}_k = \text{span} \{v,Jv,J^2 v, \ldots, J^k v\} \]

There are a few nice properties about Krylov subspaces that can be exploited. For one, it is known that there is a finite maximum dimension of the Krylov subspace, that is there is a value $r$ such that $J^{r+1} v \in \mathcal{K}_r$, which means that the complete Krylov subspace can be computed in finitely many jvps, since $J^2 v$ is just the jvp where the vector is itself a previous jvp. Indeed, the vectors $J^i v$ remain linearly independent until that maximal dimension is reached, and the maximal value is at most $m$, the dimension of the Jacobian. Therefore in at most $m$ jvps the solution is guaranteed to live in the Krylov subspace, giving a bound on the computational cost and a proof of convergence, provided the chosen vector in the subspace is optimal within the space.

The most common method in the Krylov subspace family of methods is the GMRES method. Essentially, in step $i$ one computes $\mathcal{K}_i$ and finds the element of the Krylov subspace that best solves the linear system, i.e. finds the $x \in \mathcal{K}_i$ such that $\Vert Jx-v \Vert$ is minimized. At each step, it adds the new vector to the Krylov subspace after orthogonalizing it against the other vectors via Arnoldi iterations, leading to an orthogonal basis of $\mathcal{K}_i$ which makes it easy to express $x$.

While one has a guaranteed bound on the number of possible jvps in GMRES which is simply the number of ODEs (since that is what determines the size of the Jacobian and thus the total dimension of the problem), that bound is not necessarily a good one. For a large sparse matrix, it may be computationally impractical to ever compute 100,000 jvps. Thus one does not typically run the algorithm to conclusion, and instead stops when $\Vert Jx-v \Vert$ is sufficiently below some user-defined error tolerance.
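To make the structure concrete, here is a bare-bones GMRES sketch built only from jvp calls; production implementations (e.g. those in Krylov-method libraries) add restarts, Givens rotations, and preconditioning, so treat this purely as an illustration:

```julia
using LinearAlgebra

function gmres_sketch(Jv, b; k=20, tol=1e-8)
    n = length(b)
    Q = zeros(n, k + 1)                 # orthonormal Krylov basis
    H = zeros(k + 1, k)                 # Hessenberg matrix from Arnoldi
    β = norm(b)
    Q[:, 1] = b / β
    x = zero(b)
    for j in 1:k
        w = Jv(Q[:, j])                 # one jvp per iteration
        for i in 1:j                    # orthogonalize against the current basis
            H[i, j] = dot(Q[:, i], w)
            w -= H[i, j] * Q[:, i]
        end
        H[j+1, j] = norm(w)
        e1 = [β; zeros(j)]
        y = H[1:j+1, 1:j] \ e1          # small least-squares problem
        x = Q[:, 1:j] * y
        if norm(H[1:j+1, 1:j]*y - e1) < tol || H[j+1, j] < 1e-14
            break                       # residual small enough (or breakdown)
        end
        Q[:, j+1] = w / H[j+1, j]
    end
    return x
end

# Matrix-free usage with a made-up 3×3 system
A = [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]
gmres_sketch(v -> A*v, [1.0, 2.0, 3.0])
```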

Intermediate Conclusion

Let's take a step back and see what our intermediate conclusion is. In order to solve for the implicit step, it just boils down to doing Newton's method on some $g(x)=0$. If the Jacobian is small enough, one factorizes the Jacobian and uses Quasi-Newton iterations in order to utilize the stored LU-decomposition in multiple steps to reduce the computation cost. If the Jacobian is sparse, sparse automatic differentiation through matrix coloring is employed to directly fill the sparse matrix with fewer applications of $g$, and then this sparse matrix is factorized using a sparse LU factorization.

When the matrix is too large, then one resorts to using a Krylov subspace method, since this only requires being able to do $Jv$ calculations. In general, $Jv$ can be done matrix-free because it is simply the directional derivative in the direction of the vector $v$, which can be computed through either numerical or forward-mode automatic differentiation. This is then used in the GMRES iterative process to find the solution in the Krylov subspace which is closest to the solution, exiting early when the residual error is small enough. If this is converging too slowly, then preconditioning is used.

That's the basic algorithm, but what are the other important details for getting this right?

The Need for Speed

Preconditioning

However, the speed at which GMRES converges is dependent on the correlations between the vectors, which can be shown to be related to the condition number of the Jacobian matrix. A high condition number makes convergence slower (this is the case for the traditional iterative methods as well), which in turn is an issue because it is the high condition number on the Jacobian which leads to stiffness and causes one to have to use an implicit integrator in the first place!

To help speed up the convergence, a common technique is known as preconditioning. Preconditioning is the process of using an approximate, easily applied inverse of the matrix in order to split the matrix so that the iterative problem that is being solved is one that has a smaller condition number. Mathematically, it involves decomposing $J = P_l A P_r$ where $P_l$ and $P_r$ are the left and right preconditioners which have simple inverses, and thus instead of solving $Jx=v$, we would solve:

\[ P_l A P_r x = v \]

or

\[ A P_r x = P_l^{-1}v \]

which then means that the Krylov subspace that needs to be solved for is the one defined by $A$: $\mathcal{K} = \text{span}\{v,Av,A^2 v, \ldots\}$. There are many possible choices for these preconditioners, but they are usually problem dependent. For example, for ODEs which come from parabolic and elliptic PDE discretizations, the multigrid method, such as a geometric multigrid or an algebraic multigrid, is a preconditioner that can accelerate the iterative solving process. One generic preconditioner that can generally be used is to divide by the norm of the vector $v$, which is a scaling employed by both SUNDIALS CVODE and by DifferentialEquations.jl and can be shown to be almost always advantageous.

Jacobian Re-use

If the problem is small enough such that the factorization is used and a Quasi-Newton technique is employed, it then holds that for most steps $J$ is only approximate since it can be using an old LU-factorization. To push it even further, high performance codes allow for Jacobian reuse, which is allowing the same Jacobian to be reused between different timesteps. If the Jacobian is too incorrect, it can cause the Newton iterations to diverge, which is then when one would calculate a new Jacobian and compute a new LU-factorization.

Adaptive Timestepping

In simple cases, like partial differential equation discretizations of physical problems, the resulting ODEs are not too stiff and thus Newton's iteration generally works. However, in cases like stiff biological models, Newton's iteration can itself not always be stable enough to allow convergence. In fact, with many of the stiff biological models commonly used in benchmarks, no method is stable enough to pass without using adaptive timestepping! Thus one may need to adapt the timestep in order to improve the ability for the Newton method to converge (smaller timesteps increase the stability of the Newton stepping, see the homework).

This needs to be mixed with the Jacobian re-use strategy, since $J = I - \gamma \frac{df}{du}$ where $\gamma$ is dependent on $\Delta t$ (and $\gamma = \Delta t$ for implicit Euler) means that the Jacobian of the Newton method changes as $\Delta t$ changes. Thus one usually has a tiered algorithm for determining when to update the factorizations of $J$ vs when to compute a new $\frac{df}{du}$ and then refactorize. This is generally dependent on estimates of convergence rates to heuristically guess how far off $\frac{df}{du}$ is from the current true value.

So how does one perform adaptivity? This is generally done through a rejection sampling technique. First one needs some estimate of the error in a step. This is calculated through an embedded method, which is a secondary method that can be calculated without any extra $f$ evaluations and is (usually) one order different from the true method. The difference between the true and the embedded method is then an error estimate. If this is greater than a user chosen tolerance, the step is rejected and re-run with a smaller $\Delta t$ (possibly refactorizing, etc.). If this is less than the user tolerance, the step is accepted and $\Delta t$ is adjusted for the next step.

There are many schemes for how one can change $\Delta t$. One of the most common is known as P-control, which stands for the proportional controller used throughout control theory. In this case, the control is to change $\Delta t$ in proportion to the current error ratio from the desired tolerance. If we let

\[ q = \frac{\text{E}}{\max(u_k,u_{k+1}) \tau_r + \tau_a} \]

where $\tau_r$ is the relative tolerance and $\tau_a$ is the absolute tolerance, then $q$ is the ratio of the current error to the current tolerance. If $q<1$, then the error is less than the tolerance and the step is accepted, and vice versa for $q>1$. In either case, we let $\Delta t_{new} = \Delta t / q$ be the proportional update, shrinking the step when the error exceeds the tolerance and growing it when the error is well below the tolerance.
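A sketch of this controller, with placeholder names (`err` is the embedded error estimate for the proposed step, `u` and `unew` the current and proposed states):

```julia
using LinearAlgebra

function adapt_dt(dt, err, u, unew; reltol=1e-3, abstol=1e-6)
    q = err / (max(norm(u), norm(unew)) * reltol + abstol)
    accept = q <= 1
    return accept, dt / q      # shrink the step when q > 1, grow it when q < 1
end
```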

However, proportional error control has many known features that are undesirable. For example, it happens to work in a "bang bang" manner, meaning that it can drastically change its behavior from step to step. One step may multiply the step size by 10x, then the next by 2x. This is an issue because it affects the stability of the ODE solver method (since the stability is not a property of a single step, but rather it's a property of the global behavior over time)! Thus to smooth it out, one can use a PI-control, which modifies the control factor by a history value, i.e. the error in one step in the past. This of course also means that one can utilize a PID-controller for time stepping. And there are many other techniques that can be used, but many of the most optimized codes tend to use a PI-control mechanism.

Methodological Summary

Here's a quick summary of the methodologies in a hierarchical sense:

  • At the lowest level is the linear solve, either done by JFNK or (sparse) factorization. For large enough systems, this is the brunt of the work. This is thus the piece to computationally optimize as much as possible, and parallelize. For sparse factorizations, this can be done with a distributed sparse library implementation. For JFNK, the efficiency is simply due to the efficiency of your ODE function f.

  • An optional level for JFNK is the preconditioning level, where preconditioners can be used to decrease the total number of iterations required for Krylov subspace methods like GMRES to converge, and thus reduce the total number of f calls.

  • At the nonlinear solver level, different Newton-like techniques are utilized to minimize the number of factorizations/linear solves required, and maximize the stability of the Newton method.

  • At the ODE solver level, more efficient integrators and adaptive methods for stiff ODEs are used to reduce the cost by affecting the linear solves. Most of these calculations are dominated by the linear solve portion when it's in the regime of large stiff systems. Jacobian reuse techniques, partial factorizations, and IMEX methods come into play as ways to reduce the cost per factorization and reduce the total number of factorizations.

\ No newline at end of file diff --git a/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/index.html b/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/index.html index 52fb4082..c5a03fa2 100644 --- a/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/index.html +++ b/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/index.html @@ -1 +1 @@ - Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems - MIT Parallel Computing and Scientific Machine Learning (SciML)

Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Chris Rackauckas
October 22nd, 2020

Youtube Video Link

Have a model. Have data. Fit model to data.

This is a problem that goes under many different names: parameter estimation, inverse problems, training, etc. In this lecture we will go through the methods for how that's done, starting with the basics and bringing in the recent techniques from machine learning that can be used to improve the basic implementations.

The Shooting Method for Parameter Fitting

Assume that we have some model $u = f(p)$ where $p$ is our parameters, where we put in some parameters and receive our simulated data $u$. How should you choose $p$ such that $u$ best fits that data? The shooting method directly uses this high level definition of the model by putting a cost function on the output $C(p)$. This cost function is a user choice and is model-dependent. However, a common one is the L2-loss. If $y$ is our expected data, then the L2-loss function against the data is simply:

\[ C(p) = \Vert f(p) - y \Vert \]

where $C(p): \mathbb{R}^n \rightarrow \mathbb{R}$ is a function that returns a scalar. The shooting method then directly optimizes this cost function by having the optimizer generate new simulated data given new choices of $p$.

Methods for Optimization

There are many different nonlinear optimization methods which can be used for this purpose, and for a full survey one should look at packages like JuMP, Optim.jl, and NLopt.jl.

There are generally two sets of methods: global and local optimization methods. Local optimization methods attempt to find the best nearby extremum by finding a point where the gradient $\frac{dC}{dp} = 0$. Global optimization methods attempt to explore the whole space and find the best of the extrema. Global methods tend to employ a lot more heuristics and are extremely computationally difficult, and thus many studies focus on local optimization. We will focus strictly on local optimization, but one may want to look into global optimization for many applications of parameter estimation.

Most local optimizers make use of derivative information in order to accelerate the solver. The simplest of which is the method of gradient descent. In this method, given a set of parameters $p_i$, the next step of parameters one will try is:

\[ p_{i+1} = p_i - \alpha \frac{dC(p_i)}{dp} \]

that is, update $p_i$ by walking in the downward direction of the gradient. Instead of using just first order information, one may want to directly solve the rootfinding problem $\frac{dC}{dp} = 0$ using Newton's method. Newton's method in this case looks like:

\[ p_{i+1} = p_i - (\frac{d}{dp}\frac{dC}{dp})^{-1} \frac{dC}{dp} \]

But notice that the Jacobian of the gradient is the Hessian, and thus we can rewrite this as:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

where $H(p)$ is the Hessian matrix $H_{ij} = \frac{\partial^2 C}{\partial p_i \partial p_j}$. However, solving a system of equations which involves the Hessian can be difficult (just like the Jacobian, but now with another layer of differentiation!), and thus many optimization techniques attempt to avoid the Hessian. A commonly used technique that is somewhat in the middle is the BFGS technique, which is a gradient-based optimization method that attempts to approximate the Hessian along the way to modify its stepping behavior. It uses the history of previously calculated points in order to build this quick Hessian approximation. If one keeps only a constant length history, say the last 5 points, then one arrives at the l-BFGS technique, which is one of the most common large-scale optimization techniques.
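A gradient descent sketch for a shooting cost, using forward-mode AD for the gradient; the model, step size, and iteration count are all made up for illustration:

```julia
using ForwardDiff

function gradient_descent(C, p0; α=1e-2, iters=1000)
    p = copy(p0)
    for _ in 1:iters
        p -= α * ForwardDiff.gradient(C, p)
    end
    return p
end

# Fit y = a*x + b to data by minimizing an L2 cost
xs = 0:0.1:1
ys = 2 .* xs .+ 1
C(p) = sum(abs2, p[1] .* xs .+ p[2] .- ys)
gradient_descent(C, [0.0, 0.0])   # ≈ [2.0, 1.0]
```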

Connection Between Optimization and Differential Equations

There is actually a strong connection between optimization and differential equations. Let's say we wanted to follow the gradient of the cost towards a local minimum. That would mean that the flow that we would wish to follow is given by an ODE, specifically the ODE:

\[ p' = -\frac{dC}{dp} \]

If we apply the Euler method to this ODE, then we receive

\[ p_{n+1} = p_n - \alpha \frac{dC(p_n)}{dp} \]

and we thus recover the gradient descent method. Now assume that you want to use implicit Euler. Then we would have the system

\[ p_{n+1} = p_n - \alpha \frac{dC(p_{n+1})}{dp} \]

which we would then move to one side:

\[ p_{n+1} - p_n + \alpha \frac{dC(p_{n+1})}{dp} = 0 \]

and solve each step via a Newton method. For this Newton method, we need to take the Jacobian of this gradient function, and once again the Hessian arrives as the fundamental quantity.

Neural Network Training as a Shooting Method for Functions

A one layer dense neuron is traditionally written as the function:

\[ layer(x) = \sigma.(Wx + b) \]

where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$ and $\sigma$ is some choice of $\mathbb{R}\rightarrow\mathbb{R}$ nonlinear function, where the . is the Julia dot to signify element-wise operation.

A traditional neural network, feed-forward network, or multi-layer perceptron is a 3 layer function, i.e.

\[ NN(x) = W_3 \sigma_2.(W_2\sigma_1.(W_1x + b_1) + b_2) + b_3 \]

where the first layer is called the input layer, the second is called the hidden layer, and the final is called the output layer. This specific function was seen as desirable because of the Universal Approximation Theorem, which is formally stated as follows:

Let $\sigma$ be a nonconstant, bounded, and continuous function. Let $I_m = [0,1]^m$. The space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$. For any $\epsilon >0$ and any $f\in C(I_m)$, there exists an integer $N$, real constants $W_i$ and $b_i$ s.t.

\[ \Vert NN(x) - f(x) \Vert < \epsilon \]

for all $x \in I_m$. Equivalently, $NN$ given parameters is dense in $C(I_m)$.

However, it turns out that using only one hidden layer can require exponential growth in the size of said hidden layer, where the size is given by the number of rows in $W_1$. To counteract this, deep neural networks were developed to be in the form of the recurrence relation:

\[ v_{i+1} = \sigma_i.(W_i v_{i} + b_i) \]

\[ v_1 = x \]

\[ DNN(x) = v_{n} \]

for some $n$ where $n$ is the number of layers. Given a sufficient size of the hidden layers, this kind of function is a universal approximator (2017). Although it's not quite known yet, some results have shown that this kind of function is able to fit high dimensional functions without the curse of dimensionality, i.e. the number of parameters does not grow exponentially with the input size. More mathematical results in this direction are still being investigated.
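A small Julia sketch of this recurrence, with arbitrary layer sizes and $\sigma = \tanh$:

```julia
σ(x) = tanh(x)
layer(W, b, v) = σ.(W*v .+ b)

function dnn(x, p)
    v = x
    for (W, b) in p[1:end-1]
        v = layer(W, b, v)
    end
    Wout, bout = p[end]
    return Wout*v .+ bout            # linear output layer
end

# parameters: two hidden layers and a scalar output
p = [(randn(16, 2), randn(16)), (randn(16, 16), randn(16)), (randn(1, 16), randn(1))]
dnn([0.5, -1.0], p)
```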

However, this theory gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function $f$ one wishes to fit, one can place the cost function

\[ C(p) = \Vert DNN(x;p) - f(x) \Vert \]

where $DNN(x;p)$ signifies the deep neural network given by the parameters $p$, where the full set of parameters is the $W_i$ and $b_i$. To make the evaluation of that function be practical, we can instead say we wish to evaluate the difference at finitely many points:

\[ C(p) = \sum_k^N \Vert DNN(x_k;p) - f(x_k) \Vert \]

Training a neural network is machine learning speak for finding the $p$ which minimizes this cost function. Notice that this is then a shooting method problem, where a cost function is defined by direct evaluations of the model with some choice of parameters.

Recurrent Neural Networks

Recurrent neural networks are networks which are given by the recurrence relation:

\[ x_{k+1} = x_k + DNN(x_k,k;p) \]

Given our machinery, we can see this is equivalent to the Euler discretization with $\Delta t = 1$ on the neural ordinary differential equation defined by:

\[ x' = DNN(x,t;p) \]

Thus a recurrent neural network is a sequence of applications of a neural network (or possibly a neural network indexed by integer time).

Computing Gradients

This shows that many different problems, from training neural networks to fitting differential equations, all have the same underlying mathematical structure which requires the ability to compute the gradient of a cost function given model evaluations. However, this simply reduces to computing the gradient of the model's output given the parameters. To see this, let's take for example the L2 loss function, i.e.

\[ C(p) = \sum_i^N \Vert f(x_i;p) - y_i \Vert^2 \]

for some finite data points $y_i$. In the ODE model, $y_i$ are time series points. In the general neural network, $y_i = d(x_i)$ for the function we wish to fit $d$. In data science applications of machine learning, $y_i = d_i$ the discrete data points we wish to fit. In any of these cases, we see that by the chain rule we have

\[ \frac{dC}{dp} = \sum_i^N 2 \left(f(x_i;p) - y_i \right) \frac{df(x_i)}{dp} \]

and therefore, knowing how to efficiently compute $\frac{df(x_i)}{dp}$ is the essential question for shooting-based parameter fitting.

Forward-Mode Automatic Differentiation for Gradients

Let's recall the forward-mode method for computing gradients. For an arbitrary nonlinear function $f$ with scalar output, we can compute derivatives by putting a dual number in. For example, with

\[ d = d_0 + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

we have that

\[ f(d) = f(d_0) + f'(d_0)v_1 \epsilon_1 + \ldots + f'(d_0)v_m \epsilon_m \]

where $f'(d_0)v_i$ is the directional derivative in the direction of $v_i$. To compute the gradient with respect to the input, we thus need to make $v_i = e_i$.

However, in this case we now do not want to compute the derivative with respect to the input! Instead, now we have $f(x;p)$ and want to compute the derivatives with respect to $p$. This simply means that we want to take derivatives in the directions of the parameters. To do this, let:

\[ x = x_0 + 0 \epsilon_1 + \ldots + 0 \epsilon_k \]

\[ P = p + e_1 \epsilon_1 + \ldots + e_k \epsilon_k \]

where there are $k$ parameters. We then have that

\[ f(x;P) = f(x;p) + \frac{df}{dp_1} \epsilon_1 + \ldots + \frac{df}{dp_k} \epsilon_k \]

as the output, and thus a $k+1$-dimensional number computes the gradient of the function with respect to $k$ parameters.
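In Julia this is what, for example, ForwardDiff.gradient does when applied to a closure over the parameters; the model function below is a made-up example:

```julia
using ForwardDiff

f(x, p) = p[1]*sin(x) + p[2]*x^2
ForwardDiff.gradient(p -> f(1.0, p), [2.0, 3.0])   # [sin(1), 1]
```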

Can we do better?

The Adjoint Technique and Reverse Accumulation

The fast method for computing gradients goes under many names. The adjoint technique, backpropagation, and reverse-mode automatic differentiation are in some sense all equivalent phrases given to this method from different disciplines. To understand the adjoint technique, we will look at the multivariate chain rule on a computation graph. Recall that for $f(x(t),y(t))$ we have:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

We can visualize our direct dependences as the computation graph:

i.e. $t$ directly determines $x$ and $y$, which then determine $f$. To calculate $\frac{df}{dt}$, assume you've already evaluated $f(t)$. If this has been done, then you've already had to calculate $x$ and $y$. Thus given the function $f$, we can now calculate $\frac{df}{dx}$ and $\frac{df}{dy}$, and then calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$.

Now let's put another layer in the computation. Let's make $f(x(v(t),w(t)),y(v(t),w(t)))$. We can write out the full expression for the derivative. Notice that even with this additional layer, the statement we wrote above still holds:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

So given an evaluation of $f$, we can (still) directly calculate $\frac{df}{dx}$ and $\frac{df}{dy}$. But now, to calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$, we do the next step of the chain rule:

\[ \frac{dx}{dt} = \frac{dx}{dv}\frac{dv}{dt} + \frac{dx}{dw}\frac{dw}{dt} \]

and similarly for $y$. So plug it all in, and you see that our equations will grow wild if we actually try to write everything out! But it's clear that, to calculate $\frac{df}{dt}$, we can first calculate $\frac{df}{dx}$, and then multiply that by $\frac{dx}{dt}$. If we had more layers, we could calculate the sensitivity (the derivative) of the output to the last layer; then the sensitivity to the second layer back is the sensitivity of the last layer multiplied by that, and the third layer back has the sensitivity of the second layer multiplied to it!

Logistic Regression Example

To better see this structure, let's write out a simple example. Let our forward pass through our function be:

\[ \begin{align} z &= wx + b\\ y &= \sigma(z)\\ \mathcal{L} &= \frac{1}{2}(y-t)^2\\ \mathcal{R} &= \frac{1}{2}w^2\\ \mathcal{L}_{reg} &= \mathcal{L} + \lambda \mathcal{R}\end{align} \]

The formulation of the program here is called a Wengert list, tape, or graph. In this, $x$ and $t$ are inputs, $b$ and $w$ are parameters, $z$, $y$, $\mathcal{L}$, and $\mathcal{R}$ are intermediates, and $\mathcal{L}_{reg}$ is our output.

This is a simple univariate logistic regression model. To do logistic regression, we wish to find the parameters $w$ and $b$ which minimize the regularized loss $\mathcal{L}_{reg}$, which is done by computing derivatives.

Let's calculate the derivatives with respect to each quantity in reverse order. If our program is $f(x) = \mathcal{L}_{reg}$, then we have that

\[ \frac{df}{d\mathcal{L}_{reg}} = 1 \]

as the derivatives of the last layer. To computerize our notation, let's write

\[ \overline{\mathcal{L}_{reg}} = \frac{df}{d\mathcal{L}_{reg}} \]

for our computed values. For the derivatives of the second to last layer, we have that:

\[ \begin{align} \overline{\mathcal{R}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{R}}\\ &= \overline{\mathcal{L}_{reg}} \lambda \end{align} \]

\[ \begin{align} \overline{\mathcal{L}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{L}}\\ &= \overline{\mathcal{L}_{reg}} \end{align} \]

This was our observation from before that the derivative of the second layer is the partial derivative of the current values times the sensitivity of the final layer. And then we keep multiplying, so now for our next layer we have that:

\[ \begin{align} \overline{y} &= \overline{\mathcal{L}} \frac{d\mathcal{L}}{dy}\\ &= \overline{\mathcal{L}} (y-t) \end{align} \]

And notice that the chain rule holds since $\overline{\mathcal{L}}$ implicitly already has the multiplication by $\overline{\mathcal{L}_{reg}}$ inside of it. Then the next layer is:

\[ \begin{align} \overline{z} &= \overline{y} \frac{dy}{dz}\\ &= \overline{y} \sigma^\prime(z) \end{align} \]

Then the next layer. Notice that here, by the chain rule on $w$ we have that:

\[ \begin{align} \overline{w} &= \overline{z} \frac{\partial z}{\partial w} + \overline{\mathcal{R}} \frac{d \mathcal{R}}{dw}\\ &= \overline{z} x + \overline{\mathcal{R}} w\end{align} \]

\[ \begin{align} \overline{b} &= \overline{z} \frac{\partial z}{\partial b}\\ &= \overline{z} \end{align} \]

This completely calculates all derivatives. In conclusion, the rule is:

  • You sum terms from each outward arrow

  • Each arrow has the derivative term of the end times the partial of the current term.

  • Recurse backwards to build simple linear combination expressions.

You can thus think of the relations as a message passing relation in reverse to the forward pass:

Note that the reverse-pass has the values of the forward pass, like $x$ and $t$, embedded within it.
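A scalar Julia sketch of this forward and reverse pass for made-up input values, following the rules above:

```julia
σ(z) = 1 / (1 + exp(-z))
w, b, x, t, λ = 0.3, -0.1, 1.5, 1.0, 0.01

# forward pass (the Wengert list)
z = w*x + b
y = σ(z)
L = 0.5*(y - t)^2
R = 0.5*w^2
Lreg = L + λ*R

# reverse pass
L̄reg = 1.0
R̄ = L̄reg * λ
L̄ = L̄reg
ȳ = L̄ * (y - t)
z̄ = ȳ * σ(z) * (1 - σ(z))     # σ'(z) = σ(z)(1 - σ(z))
w̄ = z̄ * x + R̄ * w
b̄ = z̄
```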

Backpropagation of a Neural Network

Now let's look at backpropagation of a deep neural network. Before getting to it in the linear algebraic sense, let's write everything in terms of scalars. This means we can write a simple neural network as:

\[ \begin{align} z_i &= \sum_j W_{ij}^1 x_j + b_i^1\\ h_i &= \sigma(z_i)\\ y_i &= \sum_j W_{ij}^2 h_j + b_i^2\\ \mathcal{L} &= \frac{1}{2} \sum_k \left(y_k - t_k \right)^2 \end{align} \]

where I have chosen the L2 loss function. This is visualized by the computational graph:

Then we can do the same process as before to get:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y_i} &= \overline{\mathcal{L}} (y_i - t_i)\\ \overline{w_{ij}^2} &= \overline{y_i} h_j\\ \overline{b_i^2} &= \overline{y_i}\\ \overline{h_i} &= \sum_k (\overline{y_k}w_{ki}^2)\\ \overline{z_i} &= \overline{h_i}\sigma^\prime(z_i)\\ \overline{w_{ij}^1} &= \overline{z_i} x_j\\ \overline{b_i^1} &= \overline{z_i}\end{align} \]

just by examining the computation graph. Now let's write this in linear algebraic form.

The forward pass for this simple neural network was:

\[ \begin{align} z &= W_1 x + b_1\\ h &= \sigma(z)\\ y &= W_2 h + b_2\\ \mathcal{L} = \frac{1}{2} \Vert y-t \Vert^2 \end{align} \]

If we carefully decode our scalar expression, we see that we get the following:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y} &= \overline{\mathcal{L}}(y-t)\\ \overline{W_2} &= \overline{y}h^{T}\\ \overline{b_2} &= \overline{y}\\ \overline{h} &= W_2^T \overline{y}\\ \overline{z} &= \overline{h} .* \sigma^\prime(z)\\ \overline{W_1} &= \overline{z} x^T\\ \overline{b_1} &= \overline{z} \end{align} \]

We can thus decode the rules as:

  • Multiplying by the matrix going forwards means multiplying by the transpose going backwards. A term on the left stays on the left, and a term on the right stays on the right.

  • Element-wise operations give element-wise multiplication

Notice that the summation is then easily encoded into this rule by the transpose operation.
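A vectorized sketch of that forward and reverse pass with arbitrary sizes and $\sigma = \tanh$:

```julia
σ(z) = tanh.(z)
σ′(z) = 1 .- tanh.(z).^2
W1, b1 = randn(8, 4), randn(8)
W2, b2 = randn(3, 8), randn(3)
x, t = randn(4), randn(3)

# forward pass
z = W1*x + b1
h = σ(z)
y = W2*h + b2

# reverse pass: matrices transpose, element-wise ops multiply element-wise
ȳ  = y - t
W̄2 = ȳ * h'
b̄2 = ȳ
h̄  = W2' * ȳ
z̄  = h̄ .* σ′(z)
W̄1 = z̄ * x'
b̄1 = z̄
```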

We can write it in the general DNN form of:

\[ r_i = W_i v_{i} + b_i \]

\[ v_{i+1} = \sigma_i.(r_i) \]

\[ v_1 = x \]

\[ \mathcal{L} = \frac{1}{2} \Vert v_{n} - t \Vert^2 \]

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{v_n} &= \overline{\mathcal{L}}(v_n-t)\\ \overline{r_i} &= \overline{v_{i+1}} .* \sigma_i^\prime (r_i)\\ \overline{W_i} &= \overline{r_i}v_{i}^{T}\\ \overline{b_i} &= \overline{r_i}\\ \overline{v_{i}} &= W_{i}^{T} \overline{r_i} \end{align} \]

Reverse-Mode Automatic Differentiation and vjps

Backpropagation of a neural network is thus a different way of accumulating derivatives. If $f$ is a composition of $L$ functions:

\[ f = f^L \circ f^{L-1} \circ \ldots \circ f^1 \]

Then the Jacobian matrix satisfies:

\[ J = J_L J_{L-1} \ldots J_1 \]

A program is essentially a nice way of writing a function in composition form. Forward-mode automatic differentiation worked by propagating forward the actions of the Jacobians at every step of the program:

\[ Jv = J_L (J_{L-1} (\ldots (J_1 v) \ldots )) \]

effectively calculating the action of the Jacobian of the program by multiplying by each step's Jacobian in the order in which the program executes (innermost first). This means doing primitive $Jv$ calculations on each underlying operation, and pushing that calculation through.

But what about reverse accumulation? This can be isolated to the simple expression graph:

In backpropagation, we just showed that when doing reverse accumulation, the rule is that multiplication forwards is multiplication by the transpose backwards. So the way to compute the Jacobian action in reverse is to replace each matrix by its transpose:

We can either look at it as $J^T v$, or by transposing the equation $v^T J$. It's right there that we have a vector-transpose Jacobian product, or a vjp.

We can thus think of this as a different direction for the Jacobian accumulation. Reverse-mode automatic differentiation moves backwards through our composed Jacobian. For a value $v$ at the end, we can push it backwards:

\[ v^T J = (\ldots ((v^T J_L) J_{L-1}) \ldots ) J_1 \]

doing a vjp at every step of the way, which is simply doing reverse-mode AD of that function (and if it's linear, then simply doing the matrix multiplication). Thus reverse-mode AD is just a grouping of vjps into a single larger expression, instead of linearizing every single step.

Primitives of Reverse Mode

For forward-mode AD, we saw that we could define primitives in order to accelerate the calculation. For example, knowing that

\[ \exp(x+\epsilon) = \exp(x) + \exp(x)\epsilon \]

allows the program to skip autodifferentiating through the code for exp. This was simple with forward-mode since we could represent the operation on a Dual number. What's the equivalent for reverse-mode AD? The answer is the pullback function. If $y = [y_1,y_2,\ldots] = f(x_1,x_2, \ldots)$, then $[\overline{x_1},\overline{x_2},\ldots]=\mathcal{B}_f^x(\overline{y})$ is the pullback of $f$ at the point $x$, defined for a scalar loss function $L(y)$ as:

\[ \overline{x_i} = \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i} \]

Using the notation from earlier, $\overline{y} = \frac{\partial L}{\partial y}$ is the derivative of the cost function with respect to some intermediate, and thus

\[ \overline{x_i} = \sum_j \overline{y_j} \frac{\partial y_j}{\partial x_i} = \mathcal{B}_f^x(\overline{y}) \]

Note that $\mathcal{B}_f^x(\overline{y})$ is a function of $x$ because the reverse pass embeds values from the forward pass, and the values from the forward pass to use are those calculated during the evaluation of $f(x)$.

By the chain rule, if we don't have a primitive defined for $y_i(x)$, we can compute it by $\mathcal{B}_{y_i}(\overline{y})$, and recursively apply this process until we hit rules that we know. The rules to start with are the scalar derivative rules, which follow quite simply, and the multivariate rules which we derived above. For example, if $y=f(x)=Ax$, then

\[ \mathcal{B}_{f}^x(\overline{y}) = \overline{y}^T A \]

which is simply saying that the Jacobian of $f$ at $x$ is $A$, and so the vjp is to multiply the vector transpose by $A$.

Likewise, for element-wise operations, the Jacobian is diagonal, and thus the vjp is multiplying once again by a diagonal matrix against the derivative, deriving the same pullback as we had for backpropagation in a neural network. This then is a quicker encoding and derivation of backpropagation.
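A closure-based sketch of such a primitive for $f(x)=Ax$: the forward pass returns the value together with a function mapping $\overline{y}$ to $\overline{x} = A^T \overline{y}$ (the vjp), which mirrors how reverse-mode AD systems define their rules:

```julia
function mul_pullback(A, x)
    y = A * x
    back(ȳ) = A' * ȳ      # the vjp: multiply by the transpose going backwards
    return y, back
end

A = [1.0 2.0; 3.0 4.0]
y, back = mul_pullback(A, [1.0, -1.0])
back([1.0, 0.0])          # gradient of y[1] w.r.t. x, i.e. the first row of A
```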

Multivariate Derivatives from Reverse Mode

Since the primitive of reverse mode is the vjp, we can understand its behavior by looking at a large primitive. In our simplest case, the function $f(x)=Ax$ outputs a vector value, to which we apply our loss function $L(y) = \Vert y-t \Vert^2$ to get a scalar. Thus we seed the scalar output $v=1$, and in the first step backwards we have a vector-to-scalar function, so the first pullback transforms from $1$ to the vector $v_2 = 2(y-t)$. Then we take that vector and multiply it like $v_2^T A$ to get the derivatives w.r.t. $x$.

Now let $L(y)$ be a vector function, i.e. we output a vector instead of a scalar from our loss function. Then $v$ is the seed to this process. Let's assume that $v = e_i$, one of the basis vectors. Then

\[ v_i^T J = e_i^T J \]

computes a row of the Jacobian. Therefore, if we had a vector function $y=f(x)$, the pullback $\mathcal{B}_f^x(e_i)$ is the $i$th row of the Jacobian $f'(x)$. Concatenating these is thus a way to build a full Jacobian. The gradient is thus a special case where $y$ is scalar, and thus the resulting Jacobian is just a single row, and therefore we set the seed equal to $1$ to compute the unscaled gradient.

Multi-Seeding

Similarly to forward-mode having a dual number with multiple simultaneous derivatives through partials $d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m$, one can see that multi-seeding is an option in reverse-mode AD by pulling back a matrix instead of a single row vector, where each row is a direction. Thus the matrix $A = [v_1 v_2 \ldots v_n]^T$ evaluated as $\mathcal{B}_f^x(A)$ is the equivalent operation to the forward-mode $f(d)$ for generalized multivariate multiseeded reverse-mode automatic differentiation. One should take care to recognize the Jacobian as a generalized linear operator in this case and ensure that the shapes in the program correctly handle this storage of the reverse seed. When linear, this will automatically make use of BLAS3 operations, making it an efficient form for neural networks.

Sparse Reverse Mode AD

Since the Jacobian is built row-by-row with reverse mode AD, the sparse differentiation discussion from forward-mode AD applies similarly but to the transpose. Therefore, in order to perform sparse reverse mode automatic differentiation, one would build up a connectivity graph of the rows (i.e. the columns of the transposed Jacobian), and perform a coloring algorithm on this graph. The seeds of the reverse call, $v_i$, would then be the color vectors, which would compute compressed rows that are then decompressed similarly to the forward-mode case.

Forward Mode vs Reverse Mode

Notice that a pullback of a single scalar gives the gradient of a function, while the pushforward using forward-mode of a dual gives a directional derivative. Forward mode computes columns of a Jacobian, while reverse mode computes gradients (rows of a Jacobian). Therefore, the relative efficiency of the two approaches is based on the size of the Jacobian. If $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$, then the Jacobian is of size $m \times n$. If $m$ is much smaller than $n$, then computing by each row will be faster, and thus use reverse mode. In the case of a gradient, $m=1$ while $n$ can be large, leading to this phenomenon. Likewise, if $n$ is much smaller than $m$, then computing by each column will be faster. We will see shortly that reverse mode AD has a high overhead with respect to forward mode, and thus if the values are relatively equal (or $n$ and $m$ are small), forward mode is more efficient.

However, since optimization needs gradients, reverse-mode definitely has a place in the standard toolchain which is why backpropagation is so central to machine learning.

Side Note on Mixed Mode

Interestingly, one can find cases where mixing the forward and reverse mode results would give an asymptotically better result. For example, if a Jacobian was non-zero in only the first 3 rows and first 3 columns, then sparse forward mode would still require N partials and reverse mode would require M seeds. However, one forward mode call of 3 partials and one reverse mode call of 3 seeds would calculate all three rows and columns with $\mathcal{O}(1)$ work, as opposed to $\mathcal{O}(N)$ or $\mathcal{O}(M)$. Exactly how to make use of this insight in an automated manner is an open research question.

Forward-Over-Reverse and Hessian-Free Products

Using this knowledge, we can also develop quick ways for computing the Hessian. Recall from earlier in the discussion that Hessians are the Jacobian of the gradient. So let's say for a scalar function $f$ we want to compute the Hessian. To compute the gradient, we use the reverse-mode AD pullback $\nabla f(x) = \mathcal{B}_f^x(1)$. Recall that the pullback is a function of $x$ since that is the value at which the values from the forward pass are taken. Then since the Jacobian of the gradient vector is $n \times n$ (as many terms in the gradient as there are inputs!), it holds that we want to use forward-mode AD for this Jacobian. Therefore, using the dual number $x = x_0 + e_1 \epsilon_1 + \ldots + e_n \epsilon_n$, the reverse mode gradient function computes the full Hessian in one forward-over-reverse pass. What this amounts to is pushing the dual number forward sensitivities through when building the pullback, and then, when doing the pullback, the dual portions will hold the columns of the Hessian.

Similarly, Hessian-vector products without computing the Hessian can be computed using the Jacobian-vector product trick on the function defined by the gradient. Here, $Hv$ is equivalent to the dual part of

\[ \nabla f(x+v\epsilon) = \mathcal{B}_f^{x+v\epsilon}(1) \]

This means that our Newton method for optimization:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

can be treated similarly to that for the nonlinear solving problem, where the linear system can be solved using Hessian-free vector products to build a Krylov subspace, giving rise to the Hessian-free Newton Krylov method for optimization.
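A sketch of a Hessian-vector product without forming the Hessian, here taking the directional derivative of a ForwardDiff gradient (forward-over-forward for simplicity rather than the forward-over-reverse described above; the test function is made up):

```julia
using ForwardDiff

hvp(f, x, v) = ForwardDiff.derivative(t -> ForwardDiff.gradient(f, x + t*v), 0.0)

f(x) = x[1]^2 + 3x[1]*x[2]
hvp(f, [1.0, 2.0], [1.0, 0.0])   # first column of the Hessian: [2.0, 3.0]
```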

References

We thank Roger Grosse's lecture notes for the amazing tikz graphs.

\ No newline at end of file + Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems - MIT Parallel Computing and Scientific Machine Learning (SciML)

Basic Parameter Estimation, Reverse-Mode AD, and Inverse Problems

Chris Rackauckas
October 22nd, 2020

Youtube Video Link

Have a model. Have data. Fit model to data.

This is a problem that goes under many different names: parameter estimation, inverse problems, training, etc. In this lecture we will go through the methods for how that's done, starting with the basics and bringing in the recent techniques from machine learning that can be used to improve the basic implementations.

The Shooting Method for Parameter Fitting

Assume that we have some model $u = f(p)$ where $p$ is our parameters, where we put in some parameters and receive our simulated data $u$. How should you choose $p$ such that $u$ best fits that data? The shooting method directly uses this high level definition of the model by putting a cost function on the output $C(p)$. This cost function is dependent on a user-choice and it's model-dependent. However, a common one is the L2-loss. If $y$ is our expected data, then the L2-loss function against the data is simply:

\[ C(p) = \Vert f(p) - y \Vert \]

where $C(p): \mathbb{R}^n \rightarrow \mathbb{R}$ is a function that returns a scalar. The shooting method then directly optimizes this cost function by having the optimizer generate new simulated data given new choices of $p$.

Methods for Optimization

There are many different nonlinear optimization methods which can be used for this purpose, and for a full survey one should look at packages like JuMP, Optim.jl, and NLopt.jl.

There are generally two sets of methods: global and local optimization methods. Local optimization methods attempt to find the best nearby extrema by finding a point where the gradient $\frac{dC}{dp} = 0$. Global optimization methods attempt to explore the whole space and find the best of the extrema. Global methods tend to employ a lot more heuristics and are extremely computationally difficult, and thus many studies focus on local optimization. We will focus strictly on local optimization, but one may want to look into global optimization for many applications of parameter estimation.

Most local optimizers make use of derivative information in order to accelerate the solver. The simplest of which is the method of gradient descent. In this method, given a set of parameters $p_i$, the next step of parameters one will try is:

\[ p_{i+1} = p_i - \alpha \frac{dC(p_i)}{dp} \]

that is, update $p_i$ by walking in the downward direction of the gradient. Instead of using just first order information, one may want to directly solve the rootfinding problem $\frac{dC}{dp} = 0$ using Newton's method. Newton's method in this case looks like:

\[ p_{i+1} = p_i - (\frac{d}{dp}\frac{dC}{dp})^{-1} \frac{dC}{dp} \]

But notice that the Jacobian of the gradient is the Hessian, and thus we can rewrite this as:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

where $H(p)$ is the Hessian matrix $H_{ij} = \frac{d^2C}{dp_i dp_j}$. However, solving a system of equations which involves the Hessian can be difficult (just like the Jacobian, but now with another layer of differentiation!), and thus many optimization techniques attempt to avoid computing the Hessian. A commonly used technique that is somewhat in the middle is BFGS, a gradient-based optimization method that approximates the Hessian along the way in order to modify its stepping behavior. It uses the history of previously calculated points to build this cheap Hessian approximation. If one keeps only a constant-length history, say the last 5 iterates, then one arrives at the l-BFGS technique, which is one of the most common large-scale optimization techniques.
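As a concrete illustration, here is a minimal sketch of gradient descent and a plain Newton iteration on a simple quadratic-like cost, using ForwardDiff.jl to supply the gradient and Hessian (the cost function, step size, and iteration counts are arbitrary choices for the example):

using ForwardDiff

C(p) = (p[1] - 2)^2 + 2*(p[2] + 1)^2 + p[1]*p[2]     # example cost function

function gradient_descent(C, p, α, iters)
    for _ in 1:iters
        p = p - α * ForwardDiff.gradient(C, p)       # p_{i+1} = p_i - α dC/dp
    end
    p
end

function newton_opt(C, p, iters)
    for _ in 1:iters
        g = ForwardDiff.gradient(C, p)
        H = ForwardDiff.hessian(C, p)
        p = p - H \ g                                 # p_{i+1} = p_i - H⁻¹ dC/dp
    end
    p
end

gradient_descent(C, [0.0, 0.0], 0.05, 1000)
newton_opt(C, [0.0, 0.0], 10)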

Connection Between Optimization and Differential Equations

There is actually a strong connection between optimization and differential equations. Let's say we wanted to follow the negative gradient of the cost downhill towards a local minimum. That would mean that the flow we wish to follow is given by an ODE, specifically the ODE:

\[ p' = -\frac{dC}{dp} \]

If we apply the Euler method with step size $\alpha$ to this ODE, then we receive

\[ p_{n+1} = p_n - \alpha \frac{dC(p_n)}{dp} \]

and we thus recover the gradient descent method. Now assume that you want to use implicit Euler. Then we would have the system

\[ p_{n+1} = p_n - \alpha \frac{dC(p_{n+1})}{dp} \]

which we would then move to one side:

\[ p_{n+1} - p_n + \alpha \frac{dC(p_{n+1})}{dp} = 0 \]

and solve each step via a Newton method. For this Newton method, we need to take the Jacobian of this gradient function, and once again the Hessian arrives as the fundamental quantity.

Neural Network Training as a Shooting Method for Functions

A one-layer dense neural network (a single layer of neurons) is traditionally written as the function:

\[ layer(x) = \sigma.(Wx + b) \]

where $x \in \mathbb{R}^n$, $W \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$ and $\sigma$ is some choice of $\mathbb{R}\rightarrow\mathbb{R}$ nonlinear function, where the . is the Julia dot to signify element-wise operation.
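As a minimal sketch in Julia (with an assumed activation $\sigma = \tanh$ and arbitrary sizes), such a layer is just:

W = randn(32, 10); b = randn(32)       # m = 32 outputs, n = 10 inputs
σ = tanh
layer(x) = σ.(W*x .+ b)                # element-wise activation via the Julia dot
layer(randn(10))                       # returns a length-32 vector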

A traditional neural network, feed-forward network, or multi-layer perceptron is a 3 layer function, i.e.

\[ NN(x) = W_3 \sigma_2.(W_2\sigma_1.(W_1x + b_1) + b_2) + b_3 \]

where the first layer is called the input layer, the second is called the hidden layer, and the final is called the output layer. This specific function was seen as desirable because of the Universal Approximation Theorem, which is formally stated as follows:

Let $\sigma$ be a nonconstant, bounded, and continuous function. Let $I_m = [0,1]^m$. The space of real-valued continuous functions on $I_m$ is denoted by $C(I_m)$. For any $\epsilon >0$ and any $f\in C(I_m)$, there exists an integer $N$, real constants $W_i$ and $b_i$ s.t.

\[ \Vert NN(x) - f(x) \Vert < \epsilon \]

for all $x \in I_m$. Equivalently, $NN$ given parameters is dense in $C(I_m)$.

However, it turns out that using only one hidden layer can require exponential growth in the size of said hidden layer, where the size is given by the number of columns in $W_1$. To counteract this, deep neural networks were developed to be in the form of the recurrence relation:

\[ v_{i+1} = \sigma_i.(W_i v_{i} + b_i) \]

\[ v_1 = x \]

\[ DNN(x) = v_{n} \]

for some $n$ where $n$ is the number of layers. Given a sufficient size of the hidden layers, this kind of function is a universal approximator (2017). Although the theory is not yet complete, some results have shown that this kind of function is able to fit high dimensional functions without the curse of dimensionality, i.e. the number of parameters does not grow exponentially with the input size. More mathematical results in this direction are still being investigated.

However, this theory gives a direct way to transform the fitting of an arbitrary function into a parameter shooting problem. Given an unknown function $f$ one wishes to fit, one can place the cost function

\[ C(p) = \Vert DNN(x;p) - f(x) \Vert \]

where $DNN(x;p)$ signifies the deep neural network given by the parameters $p$, where the full set of parameters is the $W_i$ and $b_i$. To make the evaluation of that function be practical, we can instead say we wish to evaluate the difference at finitely many points:

\[ C(p) = \sum_k^N \Vert DNN(x_k;p) - f(x_k) \Vert \]

Training a neural network is machine learning speak for finding the $p$ which minimizes this cost function. Notice that this is then a shooting method problem, where a cost function is defined by direct evaluations of the model with some choice of parameters.
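A minimal sketch of this cost function in Julia, assuming a small hand-rolled network with parameters collected in a named tuple (the target function, layer sizes, and sample points are arbitrary choices for illustration):

f(x) = sin(2x)                                     # unknown function we wish to fit (assumed)
xs  = range(-1, 1, length = 50)                    # finitely many evaluation points

# a small network with two hidden layers and parameters p
DNN(x, p) = p.W3 * tanh.(p.W2 * tanh.(p.W1 * [x] .+ p.b1) .+ p.b2) .+ p.b3

p = (W1 = randn(16, 1),  b1 = randn(16),
     W2 = randn(16, 16), b2 = randn(16),
     W3 = randn(1, 16),  b3 = randn(1))

C(p) = sum(abs2, DNN(x, p)[1] - f(x) for x in xs)  # shooting cost over the sample points
C(p)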

Recurrent Neural Networks

Recurrent neural networks are networks which are given by the recurrence relation:

\[ x_{k+1} = x_k + DNN(x_k,k;p) \]

Given our machinery, we can see this is equivalent to the Euler discretization with $\Delta t = 1$ on the neural ordinary differential equation defined by:

\[ x' = DNN(x,t;p) \]

Thus a recurrent neural network is a sequence of applications of a neural network (or possibly a neural network indexed by integer time).

Computing Gradients

This shows that many different problems, from training neural networks to fitting differential equations, all have the same underlying mathematical structure which requires the ability to compute the gradient of a cost function given model evaluations. However, this simply reduces to computing the derivative of the model's output with respect to the parameters. To see this, let's take for example the L2 loss function, i.e.

\[ C(p) = \sum_{i=1}^N \Vert f(x_i;p) - y_i \Vert^2 \]

for some finite data points $y_i$. In the ODE model, $y_i$ are time series points. In the general neural network, $y_i = d(x_i)$ for the function we wish to fit $d$. In data science applications of machine learning, $y_i = d_i$ the discrete data points we wish to fit. In any of these cases, we see that by the chain rule we have

\[ \frac{dC}{dp} = \sum_i^N 2 \left(f(x_i;p) - y_i \right) \frac{df(x_i)}{dp} \]

and therefore, knowing how to efficiently compute $\frac{df(x_i)}{dp}$ is the essential question for shooting-based parameter fitting.

Forward-Mode Automatic Differentiation for Gradients

Let's recall the forward-mode method for computing gradients. For an arbitrary nonlinear function $f$ with scalar output, we can compute derivatives by putting a dual number in. For example, with

\[ d = d_0 + v_1 \epsilon_1 + \ldots + v_m \epsilon_m \]

we have that

\[ f(d) = f(d_0) + f'(d_0)v_1 \epsilon_1 + \ldots + f'(d_0)v_m \epsilon_m \]

where $f'(d_0)v_i$ is the directional derivative in the direction of $v_i$. To compute the gradient with respect to the input, we thus need to make $v_i = e_i$.

However, in this case we now do not want to compute the derivative with respect to the input! Instead, now we have $f(x;p)$ and want to compute the derivatives with respect to $p$. This simply means that we want to take derivatives in the directions of the parameters. To do this, let:

\[ x = x_0 + 0 \epsilon_1 + \ldots + 0 \epsilon_k \]

\[ P = p + e_1 \epsilon_1 + \ldots + e_k \epsilon_k \]

where there are $k$ parameters. We then have that

\[ f(x;P) = f(x;p) + \frac{df}{dp_1} \epsilon_1 + \ldots + \frac{df}{dp_k} \epsilon_k \]

as the output, and thus a $k+1$-dimensional number computes the gradient of the function with respect to $k$ parameters.
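For instance, a minimal sketch with ForwardDiff.jl, which implements exactly this multi-partial dual number approach (the model and data here are arbitrary stand-ins):

using ForwardDiff

x = [1.0, 2.0]                       # fixed inputs (no dual parts)
y = [0.5]                            # data
f(x, p) = [p[1]*x[1] + p[2]*x[2]^2]  # toy model with parameters p
C(p) = sum(abs2, f(x, p) - y)        # cost as a function of p only

ForwardDiff.gradient(C, [0.1, 0.2])  # seeds p with e_i ε_i and returns dC/dp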

Can we do better?

The Adjoint Technique and Reverse Accumulation

The fast method for computing gradients goes under many names. The adjoint technique, backpropagation, and reverse-mode automatic differentiation are in some sense all equivalent phrases given to this method from different disciplines. To understand the adjoint technique, we will look at the multivariate chain rule on a computation graph. Recall that for $f(x(t),y(t))$ we have:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

We can visualize our direct dependences as the computation graph:

i.e. $t$ directly determines $x$ and $y$ which then determine $f$. To calculate $\frac{df}{dt}$, assume you've already evaluated $f(t)$. If this has been done, then you've already had to calculate $x$ and $y$. Thus given the function $f$, we can now calculate $\frac{df}{dx}$ and $\frac{df}{dy}$, and then calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$.

Now let's put another layer in the computation. Let's make $f(x(v(t),w(t)),y(v(t),w(t)))$. We can write out the full expression for the derivative. Notice that even with this additional layer, the statement we wrote above still holds:

\[ \frac{df}{dt} = \frac{df}{dx}\frac{dx}{dt} + \frac{df}{dy}\frac{dy}{dt} \]

So given an evaluation of $f$, we can (still) directly calculate $\frac{df}{dx}$ and $\frac{df}{dy}$. But now, to calculate $\frac{dx}{dt}$ and $\frac{dy}{dt}$, we do the next step of the chain rule:

\[ \frac{dx}{dt} = \frac{dx}{dv}\frac{dv}{dt} + \frac{dx}{dw}\frac{dw}{dt} \]

and similar for $y$. So plug it all in, and you see that our equations will grow wild if we actually try to plug it in! But it's clear that, to calculate $\frac{df}{dt}$, we can first calculate $\frac{df}{dx}$, and then multiply that to $\frac{dx}{dt}$. If we had more layers, we could calculate the sensitivity (the derivative) of the output to the last layer, and then the sensitivity to the second layer back is the sensitivity of the last layer multiplied to that, and the third layer back has the sensitivity of the second layer multiplied to it!

Logistic Regression Example

To better see this structure, let's write out a simple example. Let our forward pass through our function be:

\[ \begin{align} z &= wx + b\\ y &= \sigma(z)\\ \mathcal{L} &= \frac{1}{2}(y-t)^2\\ \mathcal{R} &= \frac{1}{2}w^2\\ \mathcal{L}_{reg} &= \mathcal{L} + \lambda \mathcal{R}\end{align} \]

The formulation of the program here is called a Wengert list, tape, or graph. In this, $x$ and $t$ are inputs, $w$ and $b$ are parameters, $z$, $y$, $\mathcal{L}$, and $\mathcal{R}$ are intermediates, and $\mathcal{L}_{reg}$ is our output.

This is a simple univariate logistic regression model. To do logistic regression, we wish to find the parameters $w$ and $b$ which minimize $\mathcal{L}_{reg}$, the regularized distance of the prediction $y$ from the desired output $t$, which is done by computing derivatives.

Let's calculate the derivatives with respect to each quantity in reverse order. If our program is $f(x) = \mathcal{L}_{reg}$, then we have that

\[ \frac{df}{d\mathcal{L}_{reg}} = 1 \]

as the derivatives of the last layer. To computerize our notation, let's write

\[ \overline{\mathcal{L}_{reg}} = \frac{df}{d\mathcal{L}_{reg}} \]

for our computed values. For the derivatives of the second to last layer, we have that:

\[ \begin{align} \overline{\mathcal{R}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{R}}\\ &= \overline{\mathcal{L}_{reg}} \lambda \end{align} \]

\[ \begin{align} \overline{\mathcal{L}} &= \frac{df}{d\mathcal{L}_{reg}} \frac{d\mathcal{L}_{reg}}{d\mathcal{L}}\\ &= \overline{\mathcal{L}_{reg}} \end{align} \]

This was our observation from before that the derivative of the second layer is the partial derivative of the current values times the sensitivity of the final layer. And then we keep multiplying, so now for our next layer we have that:

\[ \begin{align} \overline{y} &= \overline{\mathcal{L}} \frac{d\mathcal{L}}{dy}\\ &= \overline{\mathcal{L}} (y-t) \end{align} \]

And notice that the chain rule holds since $\overline{\mathcal{L}}$ implicitly already has the multiplication by $\overline{\mathcal{L}_{reg}}$ inside of it. Then the next layer is:

\[ \begin{align} \overline{z} &= \overline{y} \frac{dy}{dz}\\ &= \overline{y} \sigma^\prime(z) \end{align} \]

Then the next layer. Notice that here, by the chain rule on $w$ we have that:

\[ \begin{align} \overline{w} &= \overline{z} \frac{\partial z}{\partial w} + \overline{\mathcal{R}} \frac{d \mathcal{R}}{dw}\\ &= \overline{z} x + \overline{\mathcal{R}} w\end{align} \]

\[ \begin{align} \overline{b} &= \overline{z} \frac{\partial z}{\partial b}\\ &= \overline{z} \end{align} \]

This completely calculates all derivatives. In conclusion, the rule is:

  • You sum terms from each outward arrow

  • Each arrow has the derivative term of the end times the partial of the current term.

  • Recurse backwards to build simple linear combination expressions.

You can thus think of the relations as a message passing relation in reverse to the forward pass:

Note that the reverse-pass has the values of the forward pass, like $x$ and $t$, embedded within it.
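Here is a minimal sketch of this forward pass and hand-written reverse pass in Julia, following the computation above (the input, target, and parameter values are arbitrary):

σ(z)  = 1 / (1 + exp(-z))
σ′(z) = σ(z) * (1 - σ(z))

x, t, w, b, λ = 1.5, 1.0, 0.3, -0.2, 0.1

# forward pass (the Wengert list)
z    = w*x + b
y    = σ(z)
L    = (y - t)^2 / 2
R    = w^2 / 2
Lreg = L + λ*R

# reverse pass: each bar variable is df/d(quantity) with f = Lreg
L̄reg = 1.0
R̄    = L̄reg * λ
L̄    = L̄reg
ȳ    = L̄ * (y - t)
z̄    = ȳ * σ′(z)
w̄    = z̄ * x + R̄ * w      # two outward arrows: through z and through R
b̄    = z̄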

Backpropagation of a Neural Network

Now let's look at backpropagation of a deep neural network. Before getting to it in the linear algebraic sense, let's write everything in terms of scalars. This means we can write a simple neural network as:

\[ \begin{align} z_i &= \sum_j W_{ij}^1 x_j + b_i^1\\ h_i &= \sigma(z_i)\\ y_i &= \sum_j W_{ij}^2 h_j + b_i^2\\ \mathcal{L} &= \frac{1}{2} \sum_k \left(y_k - t_k \right)^2 \end{align} \]

where I have chosen the L2 loss function. This is visualized by the computational graph:

Then we can do the same process as before to get:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y_i} &= \overline{\mathcal{L}} (y_i - t_i)\\ \overline{w_{ij}^2} &= \overline{y_i} h_j\\ \overline{b_i^2} &= \overline{y_i}\\ \overline{h_i} &= \sum_k (\overline{y_k}w_{ki}^2)\\ \overline{z_i} &= \overline{h_i}\sigma^\prime(z_i)\\ \overline{w_{ij}^1} &= \overline{z_i} x_j\\ \overline{b_i^1} &= \overline{z_i}\end{align} \]

just by examining the computation graph. Now let's write this in linear algebraic form.

The forward pass for this simple neural network was:

\[ \begin{align} z &= W_1 x + b_1\\ h &= \sigma(z)\\ y &= W_2 h + b_2\\ \mathcal{L} &= \frac{1}{2} \Vert y-t \Vert^2 \end{align} \]

If we carefully decode our scalar expression, we see that we get the following:

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{y} &= \overline{\mathcal{L}}(y-t)\\ \overline{W_2} &= \overline{y}h^{T}\\ \overline{b_2} &= \overline{y}\\ \overline{h} &= W_2^T \overline{y}\\ \overline{z} &= \overline{h} .* \sigma^\prime(z)\\ \overline{W_1} &= \overline{z} x^T\\ \overline{b_1} &= \overline{z} \end{align} \]

We can thus decode the rules as:

  • Multiplying by the matrix going forwards means multiplying by the transpose going backwards. A term on the left stays on the left, and a term on the right stays on the right.

  • Element-wise operations give element-wise multiplication

Notice that the summation is then easily encoded into this rule by the transpose operation.

We can write it in the general DNN form of:

\[ r_i = W_i v_{i} + b_i \]

\[ v_{i+1} = \sigma_i.(r_i) \]

\[ v_1 = x \]

\[ \mathcal{L} = \frac{1}{2} \Vert v_{n} - t \Vert^2 \]

\[ \begin{align} \overline{\mathcal{L}} &= 1\\ \overline{v_n} &= \overline{\mathcal{L}}(v_n-t)\\ \overline{r_i} &= \overline{v_{i+1}} .* \sigma_i^\prime (r_i)\\ \overline{W_i} &= \overline{r_i}v_{i}^{T}\\ \overline{b_i} &= \overline{r_i}\\ \overline{v_{i}} &= W_{i}^{T} \overline{r_i} \end{align} \]
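A minimal sketch of the two-layer case in Julia, checking the hand-derived reverse pass against the forward definitions (the sizes and values are arbitrary, and $\sigma$ is taken to be tanh):

σ(z)  = tanh(z)
σ′(z) = 1 - tanh(z)^2

W1, b1 = randn(16, 4), randn(16)
W2, b2 = randn(3, 16), randn(3)
x, t   = randn(4), randn(3)

# forward pass
z = W1*x + b1
h = σ.(z)
y = W2*h + b2
L = sum(abs2, y - t) / 2

# reverse pass, following the rules above
ȳ  = y - t                  # L̄ = 1
W̄2 = ȳ * h'                 # multiplying by the matrix forwards → transpose backwards
b̄2 = ȳ
h̄  = W2' * ȳ
z̄  = h̄ .* σ′.(z)            # element-wise ops give element-wise multiplication
W̄1 = z̄ * x'
b̄1 = z̄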

Reverse-Mode Automatic Differentiation and vjps

Backpropagation of a neural network is thus a different way of accumulating derivatives. If $f$ is a composition of $L$ functions:

\[ f = f^L \circ f^{L-1} \circ \ldots \circ f^1 \]

Then the Jacobian matrix satisfies:

\[ J = J_L J_{L-1} \ldots J_1 \]

A program is essentially a nice way of writing a function in composition form. Forward-mode automatic differentiation worked by propagating forward the actions of the Jacobians at every step of the program:

\[ Jv = J_L (J_{L-1} (\ldots (J_1 v) \ldots )) \]

effectively calculating the Jacobian of the program by multiplying by the Jacobians from left to right at each step of the way. This means doing primitive $Jv$ calculations on each underlying problem, and pushing that calculation through.

But what about reverse accumulation? This can be isolated to the simple expression graph:

In backpropagation, we just showed that when doing reverse accumulation, the rule is that multiplication forwards is multiplication by the transpose backwards. So the way to compute this piece of the Jacobian accumulation in reverse is to replace the matrix by its transpose:

We can either look at it as $J^T v$, or, by transposing the equation, as $v^T J$. Right there we have a vector-transpose-Jacobian product, or a vjp.

We can thus think of this as a different direction for the Jacobian accumulation. Reverse-mode automatic differentiation moves backwards through our composed Jacobian. For a value $v$ at the end, we can push it backwards:

\[ v^T J = (\ldots ((v^T J_L) J_{L-1}) \ldots ) J_1 \]

doing a vjp at every step of the way, which is simply doing reverse-mode AD of that function (and if it's linear, then simply doing the matrix multiplication). Thus reverse-mode AD is just a grouping of vjps into a single larger expression, instead of linearizing every single step.

Primitives of Reverse Mode

For forward-mode AD, we saw that we could define primitives in order to accelerate the calculation. For example, knowing that

\[ exp(x+\epsilon) = exp(x) + exp(x)\epsilon \]

allows the program to skip autodifferentiating through the code for exp. This was simple with forward-mode since we could represent the operation on a Dual number. What's the equivalent for reverse-mode AD? The answer is the pullback function. If $y = [y_1,y_2,\ldots] = f(x_1,x_2, \ldots)$, then $[\overline{x_1},\overline{x_2},\ldots]=\mathcal{B}_f^x(\overline{y})$ is the pullback of $f$ at the point $x$, defined for a scalar loss function $L(y)$ as:

\[ \overline{x_i} = \frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i} \]

Using the notation from earlier, $\overline{y} = \frac{\partial L}{\partial y}$ is the derivative of the cost function w.r.t. some intermediate value, and thus

\[ \overline{x_i} = \sum_j \overline{y_j} \frac{\partial y_j}{\partial x_i} = \mathcal{B}_f^x(\overline{y}) \]

Note that $\mathcal{B}_f^x(\overline{y})$ is a function of $x$ because the reverse pass embeds values from the forward pass, and the forward-pass values to use are those calculated during the evaluation of $f(x)$.

By the chain rule, if we don't have a primitive defined for $y_i(x)$, we can compute that by $\mathcal{B}_{y_i}(\overline{y})$, and recursively apply this process until we hit rules that we know. The rules to start with are the scalar derivative rules, which follow quite simply, and the multivariate rules which we derived above. For example, if $y=f(x)=Ax$, then

\[ \mathcal{B}_{f}^x(\overline{y}) = \overline{y}^T A \]

which is simply saying that the Jacobian of $f$ at $x$ is $A$, and so the vjp is to multiply the vector transpose by $A$.

Likewise, for element-wise operations, the Jacobian is diagonal, and thus the vjp is multiplying once again by a diagonal matrix against the derivative, deriving the same pullback as we had for backpropagation in a neural network. This then is a quicker encoding and derivation of backpropagation.

Multivariate Derivatives from Reverse Mode

Since the primitive of reverse mode is the vjp, we can understand its behavior by looking at a large primitive. In our simplest case, the function $f(x)=Ax$ outputs a vector value, to which we apply our loss function $L(y) = \Vert y-t \Vert^2$ to get a scalar. Thus we seed the scalar output with $v=1$, and in the first step backwards we have a vector-to-scalar function, so the first pullback transforms $1$ into the vector $v_2 = 2(y-t)$. Then we take that vector and multiply it like $v_2^T A$ to get the derivatives w.r.t. $x$.

Now let $L(y)$ be a vector function, i.e. we output a vector instead of a scalar from our loss function. Then $v$ is the seed to this process. Let's assume that $v = e_i$, one of the basis vectors. Then

\[ v^T J = e_i^T J \]

computes a row of the Jacobian. Therefore, if we had a vector function $y=f(x)$, the pullback $\mathcal{B}_f^x(e_i)$ is the $i$th row of the Jacobian $f'(x)$. Concatenating these is thus a way to build a full Jacobian. The gradient is then a special case where $y$ is scalar: the resulting Jacobian is just a single row, and therefore we set the seed equal to $1$ to compute the unscaled gradient.
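A minimal sketch of this with Zygote.jl (the function and sizes are arbitrary; Zygote.pullback returns the primal value together with the pullback closure $\mathcal{B}_f^x$):

using Zygote

f(x) = [x[1]^2 + x[2], 3x[2], x[1]*x[2]]     # f : R^2 -> R^3
x = [1.0, 2.0]

y, Bf = Zygote.pullback(f, x)                # forward pass + pullback closure B_f^x

e(i) = [j == i ? 1.0 : 0.0 for j in 1:3]     # basis seed vectors
rows = [Bf(e(i))[1] for i in 1:3]            # each pullback gives one row of the Jacobian
J = permutedims(hcat(rows...))               # 3×2 Jacobian assembled row-by-row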

Multi-Seeding

Similarly to forward-mode having a dual number with multiple simultaneous derivatives through partials $d = x + v_1 \epsilon_1 + \ldots + v_m \epsilon_m$, one can see that multi-seeding is an option in reverse-mode AD by pulling back a matrix instead of a single row vector, where each row is a seed direction. Thus the matrix $A = [v_1 v_2 \ldots v_n]^T$ evaluated as $\mathcal{B}_f^x(A)$ is the equivalent operation to the forward-mode $f(d)$ for generalized multivariate multiseeded reverse-mode automatic differentiation. One should take care to recognize the Jacobian as a generalized linear operator in this case and ensure that the shapes in the program correctly handle this storage of the reverse seed. When linear, this will automatically make use of BLAS3 operations, making it an efficient form for neural networks.

Sparse Reverse Mode AD

Since the Jacobian is built row-by-row with reverse mode AD, the sparse differentiation discussion from forward-mode AD applies similarly but to the transpose. Therefore, in order to perform sparse reverse mode automatic differentiation, one would build up a connectivity graph of the rows, and perform a coloring algorithm on this graph. The seeds of the reverse call, $v_i$, would then be the color vectors, which would compute compressed rows that are then decompressed similarly to the forward-mode case.

Forward Mode vs Reverse Mode

Notice that a pullback of a single scalar gives the gradient of a function, while the pushforward using forward-mode of a dual gives a directional derivative. Forward mode computes columns of a Jacobian, while reverse mode computes gradients (rows of a Jacobian). Therefore, the relative efficiency of the two approaches is based on the size of the Jacobian. If $f:\mathbb{R}^n \rightarrow \mathbb{R}^m$, then the Jacobian is of size $m \times n$. If $m$ is much smaller than $n$, then computing by each row will be faster, and thus one should use reverse mode. In the case of a gradient, $m=1$ while $n$ can be large, leading to this phenomenon. Likewise, if $n$ is much smaller than $m$, then computing by each column will be faster. We will see shortly that reverse-mode AD has a high overhead with respect to forward mode, and thus if the dimensions are relatively equal (or $n$ and $m$ are small), forward mode is more efficient.

However, since optimization needs gradients, reverse-mode definitely has a place in the standard toolchain which is why backpropagation is so central to machine learning.

Side Note on Mixed Mode

Interestingly, one can find cases where mixing the forward and reverse mode results would give an asymptotically better result. For example, if a Jacobian was non-zero in only the first 3 rows and first 3 columns, then sparse forward mode would still require N partials and reverse mode would require M seeds. However, one forward mode call of 3 partials and one reverse mode call of 3 seeds would calculate all three rows and columns with $\mathcal{O}(1)$ work, as opposed to $\mathcal{O}(N)$ or $\mathcal{O}(M)$. Exactly how to make use of this insight in an automated manner is an open research question.

Forward-Over-Reverse and Hessian-Free Products

Using this knowledge, we can also develop quick ways for computing the Hessian. Recall from earlier in the discussion that Hessians are the Jacobian of the gradient. So let's say for a scalar function $f$ we want to compute the Hessian. To compute the gradient, we use the reverse-mode AD pullback $\nabla f(x) = \mathcal{B}_f^x(1)$. Recall that the pullback is a function of $x$ since that is the value at which the values from the forward pass are taken. Then since the Jacobian of the gradient vector is $n \times n$ (as many terms in the gradient as there are inputs!), it holds that we want to use forward-mode AD for this Jacobian. Therefore, using the dual number $x = x_0 + e_1 \epsilon_1 + \ldots + e_n \epsilon_n$, a single call to the reverse-mode gradient function computes the full Hessian. What this amounts to is pushing the dual-number forward sensitivities through while building the pullback, and then, when the pullback is evaluated, the dual portions hold the columns of the Hessian.

Similarly, Hessian-vector products without computing the Hessian can be computed using the Jacobian-vector product trick on the function defined by the gradient. Here, $Hv$ is equivalent to the dual part of

\[ \nabla f(x+v\epsilon) = \mathcal{B}_f^{x+v\epsilon}(1) \]

This means that our Newton method for optimization:

\[ p_{i+1} = p_i - H(p_i)^{-1} \frac{dC(p_i)}{dp} \]

can be treated similarly to that for the nonlinear solving problem, where the linear system can be solved using Hessian-free vector products to build a Krylov subspace, giving rise to the Hessian-free Newton Krylov method for optimization.
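A minimal sketch of the Hessian-vector product idea (here using forward-over-forward with ForwardDiff.jl for simplicity; the forward-over-reverse variant described above would swap the inner ForwardDiff.gradient call for a reverse-mode gradient such as Zygote's):

using ForwardDiff

f(x) = x[1]^2 * x[2] + sin(x[2])
x = [1.0, 2.0]
v = [0.5, -1.0]

# Hv = d/dε ∇f(x + εv) at ε = 0, i.e. the dual part of the gradient seeded with v
Hv = ForwardDiff.derivative(ε -> ForwardDiff.gradient(f, x .+ ε .* v), 0.0)

# check against the full Hessian
ForwardDiff.hessian(f, x) * v ≈ Hv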

References

We thank Roger Grosse's lecture notes for the amazing tikz graphs.

\ No newline at end of file diff --git a/notes/11-Differentiable_Programming_and_Neural_Differential_Equations/index.html b/notes/11-Differentiable_Programming_and_Neural_Differential_Equations/index.html index 3497c65e..cd1c7f92 100644 --- a/notes/11-Differentiable_Programming_and_Neural_Differential_Equations/index.html +++ b/notes/11-Differentiable_Programming_and_Neural_Differential_Equations/index.html @@ -33,4 +33,4 @@ y = meanpool(data(x), pdims; kw...) y, Δ -> (nobacksies(:meanpool, NNlib.∇meanpool(data.((Δ, y, x))..., pdims; kw...)), nothing) end -


where the derivative makes use of not only x, but also y so that the meanpool does not need to be re-calculated.

Using this style, Tracker.jl moves forward, building up the value and closures for the backpass and then recursively pulls back the input Δ to receive the derivative.

Source-to-Source AD

Given our previous discussions on performance, you should be horrified with how this approach handles scalar values. Each TrackedReal holds a Tracked{T} which holds a Call, not a Call{F,As<:Tuple}, and thus it's not strictly typed. Because it's not strictly typed, this implies that every single operation is going to cause heap allocations. If you measure this in PyTorch, TensorFlow Eager, Tracker, etc. you get around 500ns-2ms of overhead. This means that a 2ns + operation becomes... >500ns! Oh my!

This is not the only issue with tracing. Another issue is that the trace is value-dependent, meaning that every new value can build a new trace. Thus one cannot easily JIT compile a trace because it'll be different for every gradient calculation (you can compile it, but you better make sure the compile times are short!). Lastly, the Wengert list can be much larger than the code itself. For example, if you trace through a loop that is for i in 1:100000, then the trace will be huge, even if the function is relatively simple. This is directly demonstrated in the JAX "how it works" slide:

To avoid these issues, another version of reverse-mode automatic differentiation is source-to-source transformations. In order to do source code transformations, you need to know how to transform all language constructs via the reverse pass. This can be quite difficult (what is the "adjoint" of lock?), but when worked out this has a few benefits. First of all, you do not have to track values, meaning stack-allocated values can stay on the stack. Additionally, you can JIT compile one backpass because you have a single function used for all backpasses. Lastly, you don't need to unroll your loops! Instead, with each branch you'd need to insert some data structure to recall the values used from the forward pass (in order to invert in the right directions). However, that can be much more lightweight than a tracking pass.

This can be a difficult problem to do on a general programming language. In general it needs a strong programmatic representation to use as a compute graph. Google's engineers did an analysis when choosing Swift for TensorFlow and narrowed it down to either Swift or Julia due to their internal graph structures. Thus, it should be no surprise that the modern source-to-source AD systems are Zygote.jl for Julia, and Swift for TensorFlow in Swift. Additionally, older AD systems, like Tapenade, ADIFOR, and TAF, all for Fortran, were source-to-source AD systems.

Derivation of Reverse Mode Rules: Adjoints and Implicit Function Theorem

In order to require the least amount of work from our AD system, we need to be able to derive the adjoint rules at the highest level possible. Here are a few well-known cases to start understanding. These next examples are from Steven Johnson's resource.

Adjoint of Linear Solve

Let's say we have the function $A(p)x=b(p)$, i.e. this is the function that is given by the linear solving process, and we want to calculate the gradients of a cost function $g(x,p)$. To evaluate the gradient directly, we'd calculate:

\[ \frac{dg}{dp} = g_p + g_x x_p \]

where $x_p$ is the derivative of each value of $x$ with respect to each parameter $p$, and thus it's an $M \times P$ matrix (a Jacobian). Since $g$ is a small cost function, $g_p$ and $g_x$ are easy to compute, but $x_p$ is given by:

\[ x_{p_i} = A^{-1}(b_{p_i}-A_{p_i}x) \]

and so this is $P$ $M \times M$ linear solves, which is expensive! However, if we multiply by

\[ \lambda^{T} = g_x A^{-1} \]

then we obtain

\[ \frac{dg}{dp}\vert_{f=0} = g_p - \lambda^T f_p = g_p - \lambda^T (A_p x - b_p) \]

which is an alternative formulation of the derivative at the solution value. However, in this case there is no computational benefit to this reformulation.
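As a small numerical sketch of this identity (the parameterized $A(p)$, $b(p)$, and cost $g(x) = \operatorname{sum}(x)$ below are arbitrary choices for illustration, so $g_p = 0$), the adjoint formula can be checked against finite differences:

using LinearAlgebra

A(p) = [p[1] 1.0; 0.0 p[2]]                   # assumed parameterized matrix
b(p) = [p[1], 2p[2]]                          # assumed parameterized right-hand side
g(x) = sum(x)                                 # cost depending only on x, so g_p = 0

p = [2.0, 3.0]
x = A(p) \ b(p)                               # forward pass: solve A(p)x = b(p)

gx = ones(length(x))                          # gₓᵀ for g(x) = sum(x)
λ  = A(p)' \ gx                               # Aᵀλ = gₓᵀ, i.e. λᵀ = gₓA⁻¹ (one extra solve)

# A_p x - b_p for each parameter, written out by hand for this A and b
Ap = ([1.0 0.0; 0.0 0.0], [0.0 0.0; 0.0 1.0])
bp = ([1.0, 0.0], [0.0, 2.0])
dgdp = [-(λ' * (Ap[i]*x - bp[i])) for i in 1:2]

# finite-difference check of the same gradient
ε = 1e-6
fd = [(g(A(p .+ ε .* (1:2 .== i)) \ b(p .+ ε .* (1:2 .== i))) - g(x)) / ε for i in 1:2]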

Adjoint of Nonlinear Solve

Now let's look at some $f(x,p)=0$ nonlinear solving. Differentiating by $p$ gives us:

\[ f_x x_p + f_p = 0 \]

and thus $x_p = -f_x^{-1}f_p$. Therefore, using our cost function we write:

\[ \frac{dg}{dp} = g_p + g_x x_p = g_p - g_x \left(f_x^{-1} f_p \right) \]

or

\[ \frac{dg}{dp} = g_p - \left(g_x f_x^{-1} \right) f_p \]

Since $g_x$ is $1 \times M$, $f_x^{-1}$ is $M \times M$, and $f_p$ is $M \times P$, this grouping changes the order of operations so that the size $M \times P$ intermediate $x_p$ is never formed.

As is normal with backpasses, we solve for $x$ through the forward pass however we like, and then for the backpass solve for

\[ f_x^T \lambda = g_x^T \]

to obtain

\[ \frac{dg}{dp}\vert_{f=0} = g_p - \lambda^T f_p \]

which does the calculation without ever building the size $M \times P$ term.

Adjoint of Ordinary Differential Equations

We wish to solve for some cost function $G(u,p)$ evaluated throughout the differential equation, i.e.:

\[ G(u,p) = G(u(p)) = \int_{t_0}^T g(u(t,p))dt \]

To derive this adjoint, introduce the Lagrange multiplier $\lambda$ to form:

\[ I(p) = G(p) - \int_{t_0}^T \lambda^\ast (u^\prime - f(u,p,t))dt \]

Since $u^\prime = f(u,p,t)$, this is the mathematician's trick of adding zero, so then we have that

\[ \frac{dG}{dp} = \frac{dI}{dp} = \int_{t_0}^T (g_p + g_u s)dt - \int_{t_0}^T \lambda^\ast (s^\prime - f_u s - f_p)dt \]

for $s$ being the sensitivity, $s = \frac{du}{dp}$. After applying integration by parts to $\lambda^\ast s^\prime$, we get that:

\[ \int_{t_{0}}^{T}\lambda^{\ast}\left(s^{\prime}-f_{u}s-f_{p}\right)dt =\int_{t_{0}}^{T}\lambda^{\ast}s^{\prime}dt-\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

\[ =|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}-\int_{t_{0}}^{T}\lambda^{\ast\prime}sdt-\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

To see where we ended up, let's re-arrange the full expression now:

\[ \frac{dG}{dp} =\int_{t_{0}}^{T}(g_{p}+g_{u}s)dt+|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}-\int_{t_{0}}^{T}\lambda^{\ast\prime}sdt-\int_{t_{0}}^{T}\lambda^{\ast}\left(f_{u}s+f_{p}\right)dt \]

\[ =\int_{t_{0}}^{T}(g_{p}+\lambda^{\ast}f_{p})dt+|\lambda^{\ast}(t)s(t)|_{t_{0}}^{T}-\int_{t_{0}}^{T}\left(\lambda^{\ast\prime}+\lambda^\ast f_{u}-g_{u}\right)sdt \]

That was just a re-arrangement. Now, let's require that

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda + \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

This means that the boundary term of the integration by parts is zero, and also one of those integral terms are perfectly zero. Thus, if $\lambda$ satisfies that equation, then we get:

\[ \frac{dG}{dp} = \lambda^\ast(t_0)\frac{dG}{du}(t_0) + \int_{t_0}^T \left(g_p + \lambda^\ast f_p \right)dt \]

which gives us our adjoint derivative relation.

If $G$ is discrete, then it can be represented via the Dirac delta:

\[ G(u,p) = \int_{t_0}^T \sum_{i=1}^N \Vert d_i - u(t_i,p)\Vert^2 \delta(t_i - t)dt \]

in which case

\[ g_u(t_i) = 2(d_i - u(t_i,p)) \]

at the data points $(t_i,d_i)$. Therefore, the derivative of an ODE solution with respect to a cost function is given by solving for $\lambda^\ast$ using an ODE for $\lambda^T$ in reverse time, and then using that to calculate $\frac{dG}{dp}$. Note that $\frac{dG}{dp}$ can be calculated simultaneously by appending a single value to the reverse ODE, since we can simply define the new ODE term as $g_p + \lambda^\ast f_p$, which would then calculate the integral on the fly (ODE integration is just... integration!).

Complexities of Implementing ODE Adjoints

The image below explains the dilemma:

Essentially, the whole problem is that we need to solve the ODE

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda - \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

in reverse, but $\frac{df}{du}$ is defined by $u(t)$ which is a value only computed in the forward pass (the forward pass is embedded within the backpass!). Thus we need to be able to retrieve the value of $u(t)$ to get the Jacobian on-demand. There are three ways which this can be done:

  1. If you solve the reverse ODE $u^\prime = f(u,p,t)$ backwards in time, mathematically it'll give equivalent values. Computation-wise, this means that you can append $u(t)$ to $\lambda(t)$ (to $\frac{dG}{dp}$) to calculate all terms at the same time with a single reverse pass ODE. However, numerically this is unstable and thus not always recommended (ODEs are reversible, but ODE solver methods are not necessarily going to generate the same exact values or trajectories in reverse!)

  2. If you solve the forward ODE and receive a continuous solution $u(t)$, you can interpolate it to retrieve the values at any time the reverse pass needs the $\frac{df}{du}$ Jacobian. This is fast but memory-intensive.

  3. Every time you need a value $u(t)$ during the backpass, you re-solve the forward ODE to $u(t)$. This is expensive! Thus one can instead use checkpoints, i.e. save at finitely many time points during the forward pass, and use those as starting points for the $u(t)$ calculation.

Alternative strategies can be investigated, such as an interpolation which stores values in a compressed form.

The vjp and Neural Ordinary Differential Equations

It is here that we can note that, if $f$ is a function defined by a neural network, we arrive at the neural ordinary differential equation. This adjoint method is thus the backpropagation method for the neural ODE. However, the backpass

\[ \lambda^\prime = -\frac{df}{du}^\ast \lambda - \left(\frac{dg}{du} \right)^\ast \]

\[ \lambda(T) = 0 \]

can be improved by noticing $\frac{df}{du}^\ast \lambda$ is a vjp, and thus it can be calculated using $\mathcal{B}_f^{u(t)}(\lambda^\ast)$, i.e. reverse-mode AD on the function $f$. If $f$ is a neural network, this means that the reverse ODE is defined through successive backpropagation passes of that neural network. The result is a derivative with respect to the cost function of the parameters defining $f$ (either a model or a neural network), which can then be used to fit the data ("train").

Alternative "Training" Strategies

Those are the "brute force" training methods which simply use $u(t,p)$ evaluations to calculate the cost. However, it is worth noting that there are a few better strategies that one can employ in the case of dynamical models.

Multiple Shooting Techniques

Instead of shooting just from the beginning, one can instead shoot from multiple points in time:

Of course, one won't know what the "initial condition in the future" is, but one can instead make that a parameter. By doing so, each interval can be solved independently, and one can then add to the cost function that the end of one interval must match up with the beginning of the other. This can make the integration more robust, since shooting with incorrect parameters over long time spans can give massive gradients which makes it hard to hone in on the correct values.

Collocation Methods

If the data is dense enough, one can fit a curve through the points, such as a spline:

If that's the case, one can use the fit spline in order to estimate the derivative at each point. Since the ODE is defined as $u^\prime = f(u,p,t)$, one can then use the cost function

\[ C(p) = \sum_{i=1}^N \Vert\tilde{u}^{\prime}(t_i) - f(\tilde{u}(t_i),p,t_i)\Vert \]

where $\tilde{u}^{\prime}(t_i)$ is the estimated derivative at the time point $t_i$. Then one can fit the parameters to ensure this holds. This method can be extremely fast since the ODE doesn't ever have to be solved! However, note that this is not able to compensate for error accumulation, and thus early errors are not accounted for in the later parts of the data. This means that the integration won't necessarily match the data, even if this fit is "good", if the data points are too far apart, a problem that shooting-based fitting does not have. Thus, this is usually done as part of a two-stage method, where the starting stage uses collocation to get initial parameters which is then completed with a shooting method.
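A minimal sketch of the collocation cost (using a crude central-difference estimate of $\tilde{u}^{\prime}$ in place of a true spline fit, and an arbitrary scalar model generated synthetically, purely for illustration):

# data assumed to come from u' = p*u with p = -0.7 (generated synthetically here)
ts = 0.0:0.1:2.0
data = exp.(-0.7 .* ts)                          # pretend measurements of u(t)

# crude derivative estimates at interior points (a spline fit would be smoother)
du = (data[3:end] .- data[1:end-2]) ./ (ts[3:end] .- ts[1:end-2])
um = data[2:end-1]
tm = ts[2:end-1]

f(u, p, t) = p[1] * u                            # candidate model
C(p) = sum(abs2, du .- f.(um, Ref(p), tm))       # collocation cost: no ODE solves needed

C([-0.7]), C([-0.2])                             # the true parameter gives a much smaller cost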

\ No newline at end of file diff --git a/notes/13-GPU_programming/index.html b/notes/13-GPU_programming/index.html index 109bc3a7..0a476099 100644 --- a/notes/13-GPU_programming/index.html +++ b/notes/13-GPU_programming/index.html @@ -127,4 +127,4 @@ c = vifelse(mask, b, a) # merge results vstore(c, A, i) end -


GPU (implicit vectorized)

Instead of using explicit vectorization, GPUs change the programming model so that the programmer writes a kernel which operates over each element of the data. In effect the programmer is writing a program that is executed for each vector lane. It is important to remember that the hardware itself still operates on vectors (CUDA calls this warp-size and it is 32 elements).
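For instance, a minimal sketch of such a per-element kernel with CUDA.jl (the array sizes and launch configuration are arbitrary; each thread handles one index):

using CUDA

function vadd!(c, a, b)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x   # this thread's element
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return nothing
end

a, b = CUDA.rand(1024), CUDA.rand(1024)
c = similar(a)
@cuda threads=256 blocks=cld(1024, 256) vadd!(c, a, b)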

At this point please refer to the lecture slides

\ No newline at end of file diff --git a/notes/14-PDEs_Convolutions_and_the_Mathematics_of_Locality/index.html b/notes/14-PDEs_Convolutions_and_the_Mathematics_of_Locality/index.html index 1d77a13f..fcee3dd8 100644 --- a/notes/14-PDEs_Convolutions_and_the_Mathematics_of_Locality/index.html +++ b/notes/14-PDEs_Convolutions_and_the_Mathematics_of_Locality/index.html @@ -125,7 +125,7 @@

What SciComp can learn from ML: Moderate Generalizations to Partial Differen diff --git a/notes/15-Mixing_Differential_Equations_and_Neural_Networks_for_Physics-Informed_Learning/index.html b/notes/15-Mixing_Differential_Equations_and_Neural_Networks_for_Physics-Informed_Learning/index.html index 2e43ee0e..e4462867 100644 --- a/notes/15-Mixing_Differential_Equations_and_Neural_Networks_for_Physics-Informed_Learning/index.html +++ b/notes/15-Mixing_Differential_Equations_and_Neural_Networks_for_Physics-Informed_Learning/index.html @@ -44,4 +44,4 @@ cb() Flux.train!(loss_adjoint, ps, data, opt, cb = cb) -

DiffEqFlux.jl supports the wide gamut of possible universal differential equations with combinations of stiffness, delays, stochasticity, etc. It does so by using Julia's language-wide AD tooling, such as ReverseDiff.jl, Tracker.jl, ForwardDiff.jl, and Zygote.jl, along with specializations available whenever adjoint methods are known (and the choice between the two is given to the user).

Many of the methods below can be encapsulated as a choice of a universal differential equation and trained with higher order, adaptive, and more efficient methods with DiffEqFlux.jl.

Deep BSDE Methods for High Dimensional Partial Differential Equations

The key paper on deep BSDE methods is this article from PNAS by Jiequn Han, Arnulf Jentzen, and Weinan E. Follow up papers like this one have identified a larger context in the sense of forward-backwards SDEs for a large class of partial differential equations.

Understanding the Setup for Terminal PDEs

While this setup may seem a bit contrived given the "very specific" partial differential equation form (you know the end value? You have some parabolic form?), it turns out that there is a large class of problems in economics and finance that satisfy this form. The reason is that in these problems you may know the value of something at the end, when you're going to sell it, and you want to evaluate it right now. The classic example is in options pricing. An option is a contract to be able to sell a stock at a given value. The simplest case is a contract that can only be executed at a pre-determined time in the future. Let's say we have an option to sell a stock at 100 no matter what. This means that, if the stock at the strike time (the time the option can be sold) is 70, we will make 30 from this option, and thus the option itself is worth 30. The question is, if I have this option today, the strike time is 3 months in the future, and the stock price is currently 70, how much should I value the option today?

To solve this, we need to put a model on how we think the stock price will evolve. One simple version is a linear stochastic differential equation, i.e. the stock price will evolve with a constant interest rate $r$ with some volatility (randomness) $\sigma$, in which case:

\[ dX_t = r X_t dt + \sigma X_t dW_t. \]

From this model, we can evaluate the probability that the stock is going to be at given values, which then gives us the probability that the option is worth a given value, which then gives us the expected (or average) value of the option. This is the Black-Scholes problem. However, a more direct way of calculating this result is writing down a partial differential equation for the evolution of the value of the option $V$ as a function of time $t$ and the current stock price $x$. At the final time point, if we know the stock price then we know the value of the option, and thus we have a terminal condition $V(T,x) = g(x)$ for some known value function $g(x)$. The question is, given this value at time $T$, what is the value of the option at time $t=0$ given that the stock currently has a value $x = \zeta$. Why is this interesting? This will tell you what you think the option is currently valued at, and thus if it's cheaper than that, you can gain money by buying the option right now! This means that the "solution" to the PDE is the value $V(0,\zeta)$, where we know the final points $V(T,x) = g(x)$. This is precisely the type of problem that is solved by the deep BSDE method.
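To make the stochastic model concrete, here is a minimal Euler-Maruyama simulation of the linear SDE above (the rate, volatility, horizon, and the Monte-Carlo payoff estimate, which ignores discounting, are illustrative choices only):

using Random, Statistics

r, σ, X0 = 0.05, 0.2, 70.0
T, N = 0.25, 1000                       # 3 months, N Euler-Maruyama steps
dt = T / N

function simulate_path(rng)
    X = X0
    for _ in 1:N
        dW = sqrt(dt) * randn(rng)      # Brownian increment
        X += r*X*dt + σ*X*dW            # dX = rX dt + σX dW
    end
    X
end

rng = MersenneTwister(1234)
paths = [simulate_path(rng) for _ in 1:10_000]
payoff = mean(max.(100 .- paths, 0))    # average payoff of the option to sell at 100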

The Deep BSDE Method

Consider the class of semilinear parabolic PDEs, in finite time $t\in[0, T]$ and $d$-dimensional space $x\in\mathbb R^d$, that have the form

\[ \begin{align} \frac{\partial u}{\partial t}(t,x) &+\frac{1}{2}\text{trace}\left(\sigma\sigma^{T}(t,x)\left(\text{Hess}_{x}u\right)(t,x)\right)\\ &+\nabla u(t,x)\cdot\mu(t,x) \\ &+f\left(t,x,u(t,x),\sigma^{T}(t,x)\nabla u(t,x)\right)=0,\end{align} \]

with a terminal condition $u(T,x)=g(x)$. In this equation, $\text{trace}$ is the trace of a matrix, $\sigma^T$ is the transpose of $\sigma$, $\nabla u$ is the gradient of $u$, and $\text{Hess}_x u$ is the Hessian of $u$ with respect to $x$. Furthermore, $\mu$ is a vector-valued function, $\sigma$ is a $d \times d$ matrix-valued function and $f$ is a nonlinear function. We assume that $\mu$, $\sigma$, and $f$ are known. We wish to find the solution at initial time, $t=0$, at some starting point, $x = \zeta$.

Let $W_{t}$ be a Brownian motion and take $X_t$ to be the solution to the stochastic differential equation

\[ dX_t = \mu(t,X_t) dt + \sigma (t,X_t) dW_t \]

with initial condition $X(0)=\zeta$. Previous work has shown that the solution satisfies the following BSDE:

\[ \begin{align} u(t, &X_t) - u(0,\zeta) = \\ & -\int_0^t f(s,X_s,u(s,X_s),\sigma^T(s,X_s)\nabla u(s,X_s)) ds \\ & + \int_0^t \left[\nabla u(s,X_s) \right]^T \sigma (s,X_s) dW_s,\end{align} \]

with terminal condition $u(T,X_T) = g(X_T)$.

At this point, the authors approximate $\left[\nabla u(s,X_s) \right]^T \sigma (s,X_s)$ and $u(0,\zeta)$ as neural networks. Using the Euler-Maruyama discretization of the stochastic differential equation system, one arrives at a recurrent neural network:

(Figure: the deep BSDE discretization unrolled as a recurrent neural network.)
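Concretely (a sketch of that discretization, using the notation above), with time points $0 = t_0 < t_1 < \cdots < t_N = T$, step sizes $\Delta t_n = t_{n+1} - t_n$, and Brownian increments $\Delta W_n$ (mean zero, variance $\Delta t_n$), the unrolled system is

\[ \begin{align} X_{n+1} &= X_n + \mu(t_n,X_n)\Delta t_n + \sigma(t_n,X_n)\Delta W_n,\\ u_{n+1} &= u_n - f\left(t_n,X_n,u_n,Z_n\right)\Delta t_n + Z_n \cdot \Delta W_n, \end{align} \]

where $Z_n$ is a neural network approximation of $\sigma^T(t_n,X_n)\nabla u(t_n,X_n)$ at each step, $u_0 \approx u(0,\zeta)$ is itself a trained quantity, and training minimizes the mismatch between the terminal value $u_N$ and $g(X_N)$. Reusing the networks at every time step is what makes the whole scheme a recurrent neural network.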

Julia Implementation

A Julia implementation for the deep BSDE method can be found at NeuralPDE.jl. The examples considered below are part of the standard test suite.

Financial Applications of Deep BSDEs: Nonlinear Black-Scholes

Now let's look at a few applications which have PDEs that are solved by this method. One set of problems that is solved, given our setup, is Black-Scholes-type equations. Unlike a lot of previous literature, this works for a wide class of nonlinear extensions to Black-Scholes with large portfolios. Here, the dimension of the PDE for $V(t,x)$ is the dimension of $x$, where the dimension is the number of stocks in the portfolio that we want to consider. If we want to track 1000 stocks, this means our PDE is 1000 dimensional! Traditional PDE solvers would need around $N^{1000}$ points (for $N$ grid points per dimension) evolving over time in order to arrive at the solution, which is completely impractical.

One example of a nonlinear Black-Scholes equation in this form is the Black-Scholes equation with default risk. Here we are adding to the standard model the idea that the companies that we are buying stocks for can default, and thus our valuation has to take into account this default probability as the option will thus become value-less. The PDE that is arrived at is:

\[ \frac{\partial u}{\partial t}(t,x) + \bar{\mu}\cdot \nabla u(t, x) + \frac{\bar{\sigma}^{2}}{2} \sum_{i=1}^{d} \left |x_{i} \right |^{2} \frac{\partial^2 u}{\partial {x_{i}}^2}(t,x) \\ - (1 -\delta )Q(u(t,x))u(t,x) - Ru(t,x) = 0 \]

with terminating condition $g(x) = \min_{i} x_i$ for $x = (x_{1}, \ldots, x_{100}) \in \mathbb{R}^{100}$, where $\delta \in [0, 1)$, $R$ is the interest rate of the risk-free asset, and $Q$ is a piecewise linear function of the current value with three regions $(\upsilon^{h} < \upsilon^{l}, \gamma^{h} > \gamma^{l})$,

\[ \begin{align} Q(y) &= \mathbb{1}_{(-\infty,\upsilon^{h})}(y)\gamma ^{h} + \mathbb{1}_{[\upsilon^{l},\infty)}(y)\gamma ^{l} \\ &+ \mathbb{1}_{[\upsilon^{h},\upsilon^{l}]}(y) \left[ \frac{(\gamma ^{h} - \gamma ^{l})}{(\upsilon ^{h}- \upsilon ^{l})} (y - \upsilon ^{h}) + \gamma ^{h} \right ]. \end{align} \]
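As code, the piecewise default rate reads as follows (the threshold and rate values are made-up placeholders, only the shape matters):

 # Piecewise linear default intensity Q(y) from the equation above
 # (υh < υl and γh > γl; the numbers are illustrative only).
 function Q(y; υh = 50.0, υl = 70.0, γh = 0.2, γl = 0.02)
     y < υh && return γh
     y ≥ υl && return γl
     (γh - γl) / (υh - υl) * (y - υh) + γh     # linear ramp on [υh, υl]
 end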

This PDE can be cast into the form of the deep BSDE method by setting:

\[ \begin{align} \mu &= \overline{\mu} X_{t} \\ \sigma &= \overline{\sigma} \text{diag}(X_{t}) \\ f &= -(1 -\delta )Q(u(t,x))u(t,x) - R u(t,x) \end{align} \]

The Julia code for this exact problem in 100 dimensions can be found in the NeuralPDE.jl test suite.

Stochastic Optimal Control as a Deep BSDE Application

Another type of problem that fits into this terminal PDE form is the stochastic optimal control problem. This problem is a generalization of the setting that motivated us before. In this case, there is a set of agents which evolve according to some known stochastic model. What we want to do is apply some control (push them in some direction) at every single timepoint towards some goal. For example, we have the physics for the dynamics of drone flight, but there's randomness in the wind condition, and so we want to control the engine speeds to move in a certain direction. However, there is a cost associated with controlling, and thus the question is how to best balance the use of controls with the natural stochastic evolution.

It turns out this is in the same form as the Black-Scholes problem. There is a model evolving forwards, and when we get to the end we know how much everything "cost" because we know if the drone got to the right location and how much energy it took. So in the same sense as Black-Scholes, we can know the value at the end and try and propagate it backwards given the current state of the system $x$, to find out $u(0,\zeta)$, i.e. how should we control right now given the current system is in the state $x = \zeta$. It turns out that the solution of $u(t,x)$ where $u(T,x)=g(x)$ and we want to find $u(0,\zeta)$ is given by a partial differential equation which is known as the Hamilton-Jacobi-Bellman equation, which is one of these terminal PDEs that is representable by the deep BSDE method.

Take the classical linear-quadratic Gaussian (LQG) control problem in 100 dimensions

\[ dX_t = 2\sqrt{\lambda} c_t dt + \sqrt{2} dW_t \]

with $t\in [0,T]$, $X_0 = x$, and with a cost function

\[ C(c_t) = \mathbb{E}\left[\int_0^T \Vert c_t \Vert^2 dt + g(X_T) \right] \]

where $X_t$ is the state we wish to control, $\lambda$ is the strength of the control, and $c_t$ is the control process. Minimizing the cost over all controls leads to the Hamilton–Jacobi–Bellman equation:

\[ \frac{\partial u}{\partial t}(t,x) + \Delta u(t,x) - \lambda \Vert \nabla u(t,x) \Vert^2 = 0 \]

has a solution $u(t,x)$ which at $t=0$ represents the optimal cost of starting from $x$.

This PDE can be rewritten into the canonical form of the deep BSDE method by setting:

\[ \begin{align} \mu &= 0, \\ \sigma &= \overline{\sigma} I, \\ f &= -\alpha \left \| \sigma^T(s,X_s)\nabla u(s,X_s) \right \|^{2}, \end{align} \]

where $\overline{\sigma} = \sqrt{2}$, $T = 1$, and $X_0 = (0,\ldots,0) \in \mathbb{R}^{100}$.
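As a quick check on this mapping (a derivation added here, following the general form above): with $\sigma = \sqrt{2} I$ the trace term becomes $\frac{1}{2}\text{trace}\left(2 I \cdot \text{Hess}_x u\right) = \Delta u$, and since $\Vert \sigma^T \nabla u \Vert^2 = 2 \Vert \nabla u \Vert^2$, matching the $-\lambda \Vert \nabla u \Vert^2$ term of the Hamilton–Jacobi–Bellman equation requires $f = -\frac{\lambda}{2}\Vert \sigma^T \nabla u \Vert^2$, i.e. the constant $\alpha$ above is $\lambda/2$.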

The Julia code for solving this exact problem in 100 dimensions can be found in the NeuralPDE.jl test suite.

Connections of Reservoir Computing to Scientific Machine Learning

Reservoir computing techniques are an alternative to the "full" neural network techniques we have previously discussed. However, the process of training neural networks has a few caveats which can cause difficulties in real systems:

  1. The tangent space diverges exponentially fast when the system is chaotic, meaning that results of both forward and reverse automatic differentiation techniques (and the related adjoints) are divergent on these kinds of systems.

  2. It is hard for neural networks to represent stiff systems. There are many reasons for this, one being that neural networks tend to drop high frequency behavior.

There are ways being investigated to alleviate these issues. For example, shadow adjoints can give a non-divergent average sense of a derivative on ergodic chaotic systems, but are significantly more expensive than traditional adjoints.

To get around these caveats, some research teams have investigated alternatives which do not require gradient-based optimization. The clear frontrunner in this field is a type of architecture called echo state networks. A simplified formulation of an echo state network essentially fixes a neural network that defines a reservoir, i.e.

\[ x_{n+1} = \sigma(W x_n + W_{fb} y_n) \]

\[ y_n = g(W_{out} x_n) \]

where $W$ and $W_{fb}$ are fixed random matrices that are chosen before the training process, $x_n$ is called the reservoir state, and $y_n$ is the output state for the observables. The idea is to find a projection $W_{out}$ from the high dimensional random reservoir $x$ to model the timeseries by $y$. If the reservoir is a big enough and nonlinear enough random system, there should in theory exist a projection from that random system that matches any potential timeseries. Indeed, one can prove that echo state networks are universal adaptive filters under certain conditions.

If $g$ is invertible (and in many cases $g$ is taken to be the identity), then one can directly apply the inversion of $g$ to the data. This turns the training of $W_{out}$, the only non-fixed portion, into a standard least squares regression between the reservoir and the observation series. This is then solved by classical means like SVD factorizations which can be stable in ill-conditioned cases.
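A minimal sketch of this procedure is below (not the ReservoirComputing.jl API; the sizes, scalings, and ridge parameter are arbitrary illustrative choices, and the "timeseries" is just random data standing in for real observations):

 using LinearAlgebra

 n_res, n_obs, T = 300, 3, 1000
 W    = randn(n_res, n_res)                 # fixed random reservoir matrix
 W  .*= 0.9 / maximum(abs.(eigvals(W)))     # rescale spectral radius below 1
 W_fb = 0.5 .* randn(n_res, n_obs)          # fixed random feedback matrix

 y = randn(n_obs, T)                        # stand-in for a training timeseries
 x = zeros(n_res, T)                        # reservoir states
 for n in 1:T-1
     x[:, n+1] = tanh.(W * x[:, n] + W_fb * y[:, n])   # σ = tanh here
 end

 # With g = identity, fitting W_out is a linear least squares problem;
 # a small ridge term keeps the solve stable when x*x' is ill-conditioned.
 β = 1e-6
 W_out = (y * x') / (x * x' + β * I)

 y_pred = W_out * x                         # one-step reconstructions of y

Only `W_out` is learned; the random reservoir matrices are never touched, which is why no gradient-based optimization is needed.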

Echo state networks have been shown to accurately reproduce chaotic attractors that standard RNNs are notoriously hard to train against. A demonstration via ReservoirComputing.jl clearly highlights this prediction ability.

However, this methodology still is not tailored to the continuous nature of dynamical systems found in scientific computing. Recent work has extended this methodology to allow for a continuous reservoir, i.e. a continuous-time echo state network. It is shown that using the adaptive time points of a stiff ODE integrator gives a non-uniform sampling in time that makes it easier to learn stiff equations from fewer training points, and demonstrates the ability to learn equations where standard physics-informed neural network (PINN) training techniques fail.

This area of research is still far less developed than PINNs and neural differential equations but shows promise to more easily learn highly stiff and chaotic systems which are seemingly out of reach for these other methods.

Automated Equation Discovery: Outputting LaTeX for Dynamical Systems from Data

The SINDy algorithm enables data-driven discovery of governing equations from data. It leverages the fact that most physical systems have only a few relevant terms that define the dynamics, making the governing equations sparse in a high-dimensional nonlinear function space. Given a set of observations

\[ \begin{array}{c} \mathbf{X}=\left[\begin{array}{c} \mathbf{x}^{T}\left(t_{1}\right) \\ \mathbf{x}^{T}\left(t_{2}\right) \\ \vdots \\ \mathbf{x}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} x_{1}\left(t_{1}\right) & x_{2}\left(t_{1}\right) & \cdots & x_{n}\left(t_{1}\right) \\ x_{1}\left(t_{2}\right) & x_{2}\left(t_{2}\right) & \cdots & x_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ x_{1}\left(t_{m}\right) & x_{2}\left(t_{m}\right) & \cdots & x_{n}\left(t_{m}\right) \end{array}\right] \\ \end{array} \]

and a set of derivative observations

\[ \begin{array}{c} \dot{\mathbf{X}}=\left[\begin{array}{c} \dot{\mathbf{x}}^{T}\left(t_{1}\right) \\ \dot{\mathbf{x}}^{T}\left(t_{2}\right) \\ \vdots \\ \dot{\mathbf{x}}^{T}\left(t_{m}\right) \end{array}\right]=\left[\begin{array}{cccc} \dot{x}_{1}\left(t_{1}\right) & \dot{x}_{2}\left(t_{1}\right) & \cdots & \dot{x}_{n}\left(t_{1}\right) \\ \dot{x}_{1}\left(t_{2}\right) & \dot{x}_{2}\left(t_{2}\right) & \cdots & \dot{x}_{n}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots \\ \dot{x}_{1}\left(t_{m}\right) & \dot{x}_{2}\left(t_{m}\right) & \cdots & \dot{x}_{n}\left(t_{m}\right) \end{array}\right] \end{array} \]

we can evaluate the observations in a basis $\Theta(X)$:

\[ \Theta(\mathbf{X})=\left[\begin{array}{llllllll} 1 & \mathbf{X} & \mathbf{X}^{P_{2}} & \mathbf{X}^{P_{3}} & \cdots & \sin (\mathbf{X}) & \cos (\mathbf{X}) & \cdots \end{array}\right] \]

where $X^{P_i}$ stands for all $P_i$th order polynomial terms. For example,

\[ \mathbf{X}^{P_{2}}=\left[\begin{array}{cccccc} x_{1}^{2}\left(t_{1}\right) & x_{1}\left(t_{1}\right) x_{2}\left(t_{1}\right) & \cdots & x_{2}^{2}\left(t_{1}\right) & \cdots & x_{n}^{2}\left(t_{1}\right) \\ x_{1}^{2}\left(t_{2}\right) & x_{1}\left(t_{2}\right) x_{2}\left(t_{2}\right) & \cdots & x_{2}^{2}\left(t_{2}\right) & \cdots & x_{n}^{2}\left(t_{2}\right) \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ x_{1}^{2}\left(t_{m}\right) & x_{1}\left(t_{m}\right) x_{2}\left(t_{m}\right) & \cdots & x_{2}^{2}\left(t_{m}\right) & \cdots & x_{n}^{2}\left(t_{m}\right) \end{array}\right] \]

Using these matrices, SINDy finds the sparse basis $\mathbf{\Xi}$ over a given candidate library $\mathbf{\Theta}$ by solving the sparse regression problem $\dot{\mathbf{X}} =\mathbf{\Theta}\mathbf{\Xi}$ with $L_1$ regularization, i.e. minimizing the objective function $\left\Vert \mathbf{\dot{X}} - \mathbf{\Theta}\mathbf{\Xi} \right\Vert_2 + \lambda \left\Vert \mathbf{\Xi}\right\Vert_1$. This method and other variants of SINDy, along with specialized optimizers for the LASSO $L_1$ optimization problem, have been implemented in packages like DataDrivenDiffEq.jl and pysindy. The result of these methods is LaTeX for the missing dynamical system.
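One standard way to approximate this sparse regression is sequentially thresholded least squares (STLSQ); the sketch below (synthetic data, made-up threshold, not the DataDrivenDiffEq.jl API) recovers $\dot{x} = -2x$ from a small candidate library:

 using LinearAlgebra

 # Sequentially thresholded least squares: alternate a least squares fit
 # with zeroing of coefficients smaller than the threshold λ.
 function stlsq(Θ, Ẋ; λ = 0.1, iters = 10)
     Ξ = Θ \ Ẋ
     for _ in 1:iters
         small = abs.(Ξ) .< λ
         Ξ[small] .= 0
         for k in 1:size(Ẋ, 2)                 # refit the surviving terms
             big = .!small[:, k]
             Ξ[big, k] = Θ[:, big] \ Ẋ[:, k]
         end
     end
     Ξ
 end

 # Recover ẋ = -2x from noisy derivative data with the library [1, x, x²]
 x = collect(range(-1, 1, length = 200))
 Ẋ = -2 .* x .+ 0.01 .* randn(200)
 Θ = [ones(200) x x .^ 2]
 stlsq(Θ, reshape(Ẋ, :, 1))                    # ≈ [0, -2, 0]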

Notice that to use this method, derivative data $\dot{X}$ is required. While in most publications on the subject this information is assumed to be given, in practice $\dot{X}$ must be estimated: it is calculated directly from the time series $X$ by fitting a cubic spline and taking the approximated derivatives at the observation points. However, for this estimation to be stable one needs a fairly dense timeseries for the interpolation. To alleviate this issue, the universal differential equations work estimates terms of partially described models and then uses the neural network as an oracle for the derivative values to learn from subsets of the dynamical system. This allows the neural network's training to smooth out the derivative estimate between points while incorporating extra scientific information.
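For a dense, uniformly sampled timeseries, even a simple finite-difference estimate (used here as a stand-in for the spline fit described above) conveys the idea:

 # Second-order central differences on a uniform time grid, with one-sided
 # differences at the boundaries; a stand-in for the spline-based estimate.
 function estimate_derivative(t, X)
     dt = t[2] - t[1]
     Ẋ = similar(X)
     Ẋ[2:end-1] = (X[3:end] .- X[1:end-2]) ./ (2dt)
     Ẋ[1]       = (X[2] - X[1]) / dt
     Ẋ[end]     = (X[end] - X[end-1]) / dt
     Ẋ
 end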

Other ways are being investigated for incorporating deep learning into the model discovery process. For example, extensions have been investigated where elements are defined by neural networks representing a basis of the Koopman operator. Additionally, much work is going on in improving the efficiency of the symbolic regression methods themselves, and making the methods implicit and parallel.

Surrogate Acceleration Methods

Another approach for mixing neural networks with differential equations is as a surrogate method. These methods are mathematically simpler than the previous ideas, but can still achieve interesting results. A full example is explained in this video.

Say we have some function $g(p)$ which depends on a solution to a differential equation $u(t;p)$ and choices of parameters $p$. Computationally, we evaluate this function by doing the following:

  • Solve the differential equation with parameters $p$

  • Evaluate $g$ on the numerical solution for $u$

However, this process is computationally expensive since it requires the numerical solution of $u$ for every evaluation. Thus, one can look at this setup and see $g(p)$ itself is a nonlinear function. The idea is to train a neural network to be the function $g(p)$, i.e. directly put in $p$ and return the appropriate value without ever solving the differential equation.

The video highlights an important fact about this method: it can be computationally expensive to train this kind of surrogate since many data points $(p,g(p))$ are required, in fact many more simulations than you would ever run directly. However, after training, the surrogate network for $g(p)$ can be a lot faster than the original simulation-based approach. This means that this is a method for accelerating real-time solutions by doing upfront computations. The total compute time will always be more, but in some sense the cost is amortized or shifted to be done beforehand, so that the model does not need to be simulated on the fly. This can allow for things like computationally expensive models of drone flight to be used in a real-time controller.
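A tiny sketch of the workflow (a made-up scalar ODE, with a polynomial fit standing in for the neural network surrogate, since the point is only the expensive-upfront/cheap-online split):

 using OrdinaryDiffEq

 # g(p): a quantity that requires an ODE solve for every evaluation.
 # The decaying ODE u' = -p*u is a made-up stand-in for an expensive model.
 function g(p)
     prob = ODEProblem((u, q, t) -> -q[1] * u, 1.0, (0.0, 1.0), [p])
     solve(prob, Tsit5())[end]
 end

 # Expensive upfront phase: generate (p, g(p)) pairs by actually simulating
 ps = range(0.1, 5.0, length = 200)
 ys = g.(ps)

 # Cheap surrogate: here a degree-6 polynomial least squares fit; in practice
 # a neural network would be trained on the same (p, g(p)) data.
 V = [p^k for p in ps, k in 0:6]
 c = V \ ys
 ĝ(p) = sum(c[k+1] * p^k for k in 0:6)

 ĝ(0.7), g(0.7)    # the surrogate evaluation needs no ODE solve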

This technique goes a long way back, but some recent examples of this have been shown. For example, there's this paper which "accelerated" the solution of the 3-body problem using a neural network surrogate trained over a few days to get a 1 million times acceleration (after generating many points beforehand of course! In the paper, notice that it took 10 days to generate the training dataset). Additionally, there is this deep learning trebuchet example which showcased that inverse problems, i.e. control or finding parameters, can be completely encapsulated as a $g(p)$ and learned with sufficient data.

\ No newline at end of file
diff --git a/notes/16-From_Optimization_to_Probabilistic_Programming/index.html b/notes/16-From_Optimization_to_Probabilistic_Programming/index.html
index a0b10f6d..ba963516 100644
--- a/notes/16-From_Optimization_to_Probabilistic_Programming/index.html
+++ b/notes/16-From_Optimization_to_Probabilistic_Programming/index.html
@@ -24,7 +24,7 @@
 prob1 = ODEProblem(lotka_volterra,u0,tspan,)
 sol = solve(prob1,Tsit5())
 plot(sol)


and from which we can get an ensemble of solutions:

 prob_func = function (prob,i,repeat)
   remake(prob,p=rand.(θ))
 end
@@ -34,21 +34,21 @@
 
 using DiffEqBase.EnsembleAnalysis
 plot(EnsembleSummary(sol))

From just a few variables having probabilities, every variable has an induced probability: there is a probability distribution on the integrator states, the output at time t_i, etc.

Bayesian Estimation with Point Estimates: Bayes' Rule, Maximum Likelihood, and MAP

Recall from our previous studies that the difficult part of modeling is not necessarily the forward modeling approach, rather it's the incorporation of data or the estimation problem that is difficult. When your variables are now random distributions, how do you "fit" them?

The answer comes from Bayes' rule, which is the following. Assume you had a prior distribution $p(\theta)$ for the probability that $X$ is a given value $\theta$. Then the posterior probability distribution, $p(\theta|D)$, or the distribution which is updated to include data, is given by:

\[ p(\theta|D) = \frac{p(D|\theta)p(\theta)}{\int_\Omega p(D|\theta)p(\theta)d\theta} \]

The scaling factor on the denominator is simply a constant to make the distribution integrate to 1 (so that the resulting function is a probability distribution!). The numerator is simply the prior distribution multiplied by the likelihood of seeing the data given the value of the random variable. The prior distribution must be given, but notice that the likelihood has another name: the likelihood is the model.

The reason why it's the same thing is because the model is what tells you the expected outcomes given a value of the random variable, and your data is on an expected outcome! However, the likelihood encodes a little bit more information in that it again is a distribution and not a point estimate. We need to make a choice for our measurement distribution on our model's results.

Quick Question: Why is this referred to as measurement noise? Why is it not process noise?

A common choice for the measurement distribution is the Normal distribution. This comes from the Central Limit Theorem (CLT) which essentially states that, given enough interacting mechanisms, the average values of things "tend to become normally distributed". The true statement of the CLT is much more complex, but that is a decent working definition for practical use. The normal distribution is defined by two parameters, $\mu$ and $\sigma$, and is given by the following function:

\[ f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(x-\mu)^2}{2\sigma^2}) \]

This is a bell curve centered at $\mu$ with a standard deviation of $\sigma$ (variance $\sigma^2$). Our best guess for the output, i.e. the model's prediction, should be the average measurement, meaning that $\mu$ is the result from the simulator. $\sigma$ is a parameter for how much measurement error we expect (some intuition on $\sigma$ will come soon).

Let's return to thinking about the ODE example. In this case we have $\theta$ as a vector of random variables. This means that $u(t;\theta)$ is a random variable for the ODE $u'= ...$'s solution at a given point in time $t$. If we have a measurement at a time $t_i$ and assume our measurement noise is normally distributed with some constant measurement noise $\sigma$, then the likelihood of our data would be $f(x_i;u(t_i;\theta),\sigma)$ at each data point $(t_i,x_i)$. From probability we know that seeing the composition of events is given by the multiplication of probabilities, so the probability of seeing the full dataset given observations $D = (t_i,x_i)$ along the timeseries is:

\[ p(D|\theta) = \prod_i f(x_i;u(t_i;\theta),\sigma) \]

This can be read as: solve the model with the given parameters, and the probability of having seen the measurement is thus given by a product of normal distribution calculations. Note that in many cases the product is not numerically stable (and grows exponentially), and so the likelihood is transformed to the log-likelihood. To get this expression, we take the log of both sides and notice that the product becomes a summation, and thus:

\[ \begin{align} \log p(D|\theta) &= \sum_i \log f(x_i;u(t_i;\theta),\sigma)\\ &= -N \log\left(\sqrt{2\pi}\sigma\right) - \frac{1}{2\sigma^2} \sum_i \left(x_i - u(t_i;\theta)\right)^2 \end{align} \]

Notice that maximizing this log-likelihood is equivalent to minimizing the L2 norm of the difference between the solution and the data! Thus we can see a few things:

  1. Previous parameter estimation by minimizing a norm against data can be seen as maximum likelihood with some measurement distribution. L2 norm corresponds to assuming measurement noise is normally distributed and all of the measurements have the same error variance.

  2. By the same derivation, having different error variances with normally distributed errors is equivalent to doing weighted L2 estimation.

This reformulation (generalization?) to likelihoods of probability distributions is known as maximum likelihood estimation (MLE), but is equivalent to our previous forms of parameter estimation using point estimates against data. However, this calculation is ignoring Bayes' rule, and is thus not finding the parameters which have the highest probability. To do that, we need to go back to Bayes' rule which states that:

\[ \log p(\theta|D) = \log p(D|\theta) + \log p(\theta) - C \]

Thus, maximizing the log-likelihood is "almost" the same as finding the most probable parameters, except that we need to add the weight $\log p(\theta)$ given by our prior distribution! If we assume our prior distribution is flat, like a uniform distribution, then we have a non-informative prior and the maximum posterior point matches that of the maximum likelihood estimation. However, this formulation allows us to get point estimates in a way that takes into account prior knowledge, and is called maximum a posteriori estimation (MAP).
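A minimal sketch of this MAP objective for an ODE model follows (everything here, i.e. the decay model, the synthetic data, the noise level, and the prior, is a made-up illustration, not a general implementation):

 using OrdinaryDiffEq, Distributions

 t_data = 0.0:0.5:5.0
 x_data = exp.(-1.3 .* t_data) .+ 0.05 .* randn(length(t_data))  # synthetic data
 σ = 0.05                                   # assumed measurement noise
 prior = Normal(1.0, 1.0)                   # prior p(θ)

 # log p(θ|D) up to the normalization constant: log-likelihood + log-prior
 function log_posterior(θ)
     prob = ODEProblem((u, p, t) -> -p[1] * u, 1.0, (0.0, 5.0), [θ])
     sol = solve(prob, Tsit5(), saveat = t_data)
     loglik = sum(logpdf(Normal(sol[i], σ), x_data[i]) for i in eachindex(t_data))
     loglik + logpdf(prior, θ)
 end

 log_posterior(1.3) > log_posterior(3.0)    # parameters near the truth score higher

Maximizing `log_posterior` gives the MAP point estimate; the sampling methods below instead explore this function to recover the full posterior distribution.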

Bayesian Estimation of Posterior Distributions with Monte Carlo

The previous discussion still solely focused on getting point estimates for the most probable parameters. However, what if we wanted to find the distributions of the parameters, i.e. the full posterior $p(\theta|D)$? Outside of very few small models, this cannot be done analytically and is thus the basic problem of probabilistic programming. There are two general approaches:

  1. Sampling-based approaches. Sample parameters $\theta_i$ in such a manner that the array $[\theta_i]$ converges to an array sampled from the true distribution, and thus with enough samples one can capture the distribution numerically.

  2. Variational inference. Find some way to represent the probability distribution and push forward the distributions at every step of the program.

Recovering Distributions from Sampled Points

It's clear from above that if you have a distribution, like Normal(5,1), that you can sample from the distribution to get an array of values which follow the distribution. However, in order for the following sampling approaches to make sense, we need to see how to recover a distribution from discrete samples. So let's say you had a bunch of normally distributed points:

 X = Normal(5,1)
 x = [rand(X) for i in 1:100]
 scatter(x,[1 for i in 1:100])

Notice that there are more points in the areas of higher probability. Thus the density of sampled points gives us an estimate for the probability of having points in a given area. We can then count the number of points in a bin and divide by the total number of points in order to get the probability of being in a specific region. This is depicted by a histogram:

 histogram(x)

and we see this converges when we get more points:

 histogram([rand(X) for i in 1:10000],normed=true)
 using StatsPlots
 plot!(X,lw=5)

A continuous form of this is the kernel density estimate, which is essentially a smoothed binning approach.

 using KernelDensity
 plot(kde([rand(X) for i in 1:10000]),lw=5)
 plot!(X,lw=5)
-

Thus, for the sampling-based approaches, we simply need to arrive at an array which is sampled according to the distribution that we want to estimate, and from that array we can recover the distribution.

Sampling Distributions with the Metropolis Hastings Algorithm

The Metropolis-Hastings algorithm is the simplest form of Markov Chain Monte Carlo (MCMC) which gives a way of sampling the $\theta$ distribution. To see how this algorithm works, let's understand the ratio between two points in the posterior probability. If we have $x_i$ and $x_j$, the ratio of the two probabilities would be given by:

\[ \frac{p(x_i|D)}{p(x_j|D)} = \frac{p(D|x_i)p(x_i)}{p(D|x_j)p(x_j)} \]

(notice that the integration constant cancels). This motivates the idea that all we have to do is ensure we only go to a point $x_j$ from $x_i$ with probability difference that matches that ratio, and over time if we do this between "all points" we will have the right number of "each point" in the distribution (quotes because it's continuous). With a bit more rigour we arrive at the following algorithm:

  1. Starting at $x_i$, take $x_{i+1}$ from a sampling algorithm $g(x_{i+1}|x_i)$.

  2. Calculate $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})g(x_i|x_{i+1})}{p(D|x_i)p(x_i)g(x_{i+1}|x_i)}\right)$. Notice that if we require $g$ to be symmetric, then this simplifies to the probability ratio $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})}{p(D|x_i)p(x_i)}\right)$

  3. Use a random number to accept the step with a probability $A$. Go back to step 1, incrementing $i$ if accepted, otherwise just repeat.

I.e, we just walk around the space biasing the acceptance of a step by the factor $\frac{p(x_i|D)}{p(x_j|D)}$ and sooner or later we will have spent the right amount of time in each area, giving the correct distribution.

(This can be rigorously proven, and those details are left out.)

The Cost of Bayesian Estimation

Let's take a quick moment to understand the high cost of Bayesian posterior estimations. While before we were getting point estimates, now we are trying to recover a full probability distribution, and each accept/reject probability calculation requires evaluating the likelihood at some point. Remember, the likelihood is generated by our simulator, and thus every evaluation here is an ODE solver call or a neural network forward pass! This means that to get good distributions, we are solving the ODE hundreds of thousands of times, i.e. even more than when doing parameter estimation! This is something to keep in mind.

However, notice that this process is trivially parallelizable. We can just have parallel chains on going, i.e. start 16 processes all doing Metropolis-Hastings, and in the end they are all sampling from the same distribution, so the final array can simply be the pooled results of each chain.

Hamiltonian Monte Carlo

Metropolis-Hastings is easy to motivate and implement. However, it does not do well in high dimensional spaces because it searches in all directions. For example, it's common for the sampling distribution $g$ to be a multivariable distribution (i.e. normal in all directions). However, high dimensional objects commonly sit on low dimensional manifolds (known as the manifold hypothesis). If that's the case, the most probable set of parameters is something that is low dimensional. For example, parameters may compensate for one another, and so $\theta_1^2 + \theta_2^2 + \theta_3^2 = 1$ might be the manifold on which all of the most probable choices for $\theta$ lie, in which case we need sample on the sphere instead of all of $\mathbb{R}^3$.

However, it's quick to see that this will give Metropolis-Hastings some trouble, since it will use a normal distribution around the current point, and thus even if we start on the sphere, it will have a high chance of trying a point not on the sphere in the next round! This can be depicted as:

Recall that every single rejection is still evaluating the likelihood (since it's calculating an acceptance probability, finding it near zero, rejecting and starting again), and every likelihood call is calling our simulator, and so this is sllllllooooooooooooooowwwwwwwww in high dimensions!

What we need to do instead is ensure that we walk along the path of high probability. What we want to do is thus build a vector field that matches our high probability regions

and follow said vector field (following a vector field is solving what kind of equation?). The first idea one might have is to use the gradient. However, while this idea has the right intentions, the issue is that the gradient of the probability will average out all of the possible probabilities, and will thus flow towards the mode of the distribution:

To overcome this issue, we look to physical systems and see that a satellite orbiting a planet always nicely stays on some manifold instead of following the gradient:

The reason why it does is because it has momentum. Recall from basic physics that one way to describe a physical system is through Hamiltonian mechanics, where $H(x,p)$ is the energy associated with the state $(x,p)$ (normally $x$ is location and $p$ is momentum). Due to conservation of energy, the solution of the dynamical equations leads to $H(x,p)$ being constant, and thus the dynamics follow the level sets of $H$. From the Hamiltonian the dynamics of the system are:

\[ \begin{align} \frac{dx}{dt} &= \frac{dH}{dp}\\ \frac{dp}{dt} &= -\frac{dH}{dx} \end{align} \]

Here we want our Hamiltonian to be our posterior probability, so that way we stay on the manifold of high probability. This means:

\[ H(x,p) = - \log \pi(x,p) \]

where $\pi(x,p) = \pi(p|x)\pi(x)$ (where I am now using $\pi$ for probability since $p$ is momentum!). So to lift from a probability over parameters to one that includes momentum, we simply need to choose a conditional distribution $\pi(p|x)$. This would mean that

\[ \begin{align} H(x,p) &= -\log \pi(p|x) - \log \pi(x)\\ &= K(p,x) + V(x) \end{align} \]

where $K$ is the kinetic energy and $V$ is the potential. Thus the potential energy is directly given by the posterior calculation, and the kinetic energy is thus a choice that is used to build the correct Hamiltonian. Hamiltonian Monte Carlo methods then dig into good ways to choose the kinetic energy function. This is done at the start (along with the choice of ODE solver time step) in such a way that it maximizes acceptance probabilities.

Connections to Differentiable Programming

\[ -\frac{dH}{dx} \]

Computing this term requires calculating the gradient of the likelihood function with respect to the parameters, so we are once again using the gradient of our simulator! This means that all of our previous discussion on automatic differentiation and differentiable programming applies to the Hamiltonian Monte Carlo context.

There's another thread to follow here: transformations of probability distributions are pushforwards through the Jacobian of the transformation (via the change of variables formula for integrals), and this is used when doing variational inference.

Symplectic and Geometric Integration

One way to integrate the ODEs which result from the Hamiltonian is to simply treat them as a generic system of first order ODEs and solve with a standard method. However, this loses information (the symplectic structure) and can result in drift. This is demonstrated by looking at the long time solution of the pendulum:

+

Thus, for the sampling-based approaches, we simply need to arrive at an array which is sampled according to the distribution that we want to estimate, and from that array we can recover the distribution.

Sampling Distributions with the Metropolis Hastings Algorithm

The Metropolis-Hastings algorithm is the simplest form of Markov Chain Monte Carlo (MCMC) which gives a way of sampling the $\theta$ distribution. To see how this algorithm works, let's understand the ratio between two points in the posterior probability. If we have $x_i$ and $x_j$, the ratio of the two probabilities would be given by:

\[ \frac{p(x_i|D)}{p(x_j|D)} = \frac{p(D|x_i)p(x_i)}{p(D|x_j)p(x_j)} \]

(notice that the integration constant cancels). This motivates the idea that all we have to do is ensure we only move from a point $x_i$ to a point $x_j$ with a probability that matches that ratio, and over time if we do this between "all points" we will have the right number of "each point" in the distribution (quotes because it's continuous). With a bit more rigour we arrive at the following algorithm:

  1. Starting at $x_i$, take $x_{i+1}$ from a sampling algorithm $g(x_{i+1}|x_i)$.

  2. Calculate $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})g(x_i|x_{i+1})}{p(D|x_i)p(x_i)g(x_{i+1}|x_i)}\right)$. Notice that if we require $g$ to be symmetric, then this simplifies to the probability ratio $A = \min\left(1,\frac{p(D|x_{i+1})p(x_{i+1})}{p(D|x_i)p(x_i)}\right)$

  3. Use a random number to accept the step with a probability $A$. Go back to step 1, incrementing $i$ if accepted, otherwise just repeat.

I.e., we just walk around the space, biasing the acceptance of a step by the factor $\frac{p(x_i|D)}{p(x_j|D)}$, and sooner or later we will have spent the right amount of time in each area, giving the correct distribution.

(This can be rigorously proven, and those details are left out.)
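
To make the algorithm concrete, here is a minimal sketch of a random-walk Metropolis-Hastings sampler in plain Julia (an illustration, not the implementation from any particular package). The log-posterior function logpost is a hypothetical stand-in for $\log p(D|\theta) + \log p(\theta)$ (e.g. built from an ODE solve), and a symmetric Gaussian proposal is assumed so that the $g$ ratio cancels:

# Minimal random-walk Metropolis-Hastings sketch with a symmetric Gaussian proposal.
# `logpost` is a hypothetical stand-in for log p(D|θ) + log p(θ).
function metropolis_hastings(logpost, θ0; iters = 10_000, stepsize = 0.1)
    θ = copy(θ0)
    lp = logpost(θ)
    chain = [copy(θ)]
    for i in 1:iters
        θnew = θ .+ stepsize .* randn(length(θ))   # symmetric proposal g(x_{i+1}|x_i)
        lpnew = logpost(θnew)
        if log(rand()) < lpnew - lp                # accept with probability A = min(1, ratio)
            θ, lp = θnew, lpnew
        end
        push!(chain, copy(θ))                      # a rejected step repeats the current point
    end
    return chain
end

chain = metropolis_hastings(θ -> -sum(abs2, θ) / 2, zeros(2))   # hypothetical standard-normal target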

The Cost of Bayesian Estimation

Let's take a quick moment to understand the high cost of Bayesian posterior estimations. While before we were getting point estimates, now we are trying to recover a full probability distribution, and each accept/reject probability calculation requires evaluating the likelihood at some point. Remember, the likelihood is generated by our simulator, and thus every evaluation here is an ODE solver call or a neural network forward pass! This means that to get good distributions, we are solving the ODE hundreds of thousands of times, i.e. even more than when doing parameter estimation! This is something to keep in mind.

However, notice that this process is trivially parallelizable. We can just have parallel chains going, i.e. start 16 processes all doing Metropolis-Hastings, and in the end they are all sampling from the same distribution, so the final array can simply be the pooled results of each chain.
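
As a sketch of that parallelism (reusing the hypothetical metropolis_hastings sketch from above), the chains are completely independent, so they can be run with multithreading and simply pooled afterwards:

# Run independent chains in parallel and pool the samples (assumes `metropolis_hastings` from above).
nchains = 16
chains = Vector{Vector{Vector{Float64}}}(undef, nchains)
Threads.@threads for c in 1:nchains
    chains[c] = metropolis_hastings(θ -> -sum(abs2, θ) / 2, randn(2))   # independent starting points
end
pooled = reduce(vcat, chains)   # every chain targets the same posterior, so pooling is valid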

Hamiltonian Monte Carlo

Metropolis-Hastings is easy to motivate and implement. However, it does not do well in high dimensional spaces because it searches in all directions. For example, it's common for the sampling distribution $g$ to be a multivariate normal (i.e. proposing equally in all directions). However, high dimensional objects commonly sit on low dimensional manifolds (known as the manifold hypothesis). If that's the case, the most probable set of parameters is something that is low dimensional. For example, parameters may compensate for one another, and so $\theta_1^2 + \theta_2^2 + \theta_3^2 = 1$ might be the manifold on which all of the most probable choices for $\theta$ lie, in which case we need to sample on the sphere instead of all of $\mathbb{R}^3$.

However, it's quick to see that this will give Metropolis-Hastings some trouble, since it will use a normal distribution around the current point, and thus even if we start on the sphere, it will have a high chance of trying a point not on the sphere in the next round! This can be depicted as:

Recall that every single rejection is still evaluating the likelihood (since it's calculating an acceptance probability, finding it near zero, rejecting and starting again), and every likelihood call is calling our simulator, and so this is sllllllooooooooooooooowwwwwwwww in high dimensions!

What we need to do instead is ensure that we walk along the path of high probability. What we want to do is thus build a vector field that matches our high probability regions

and follow said vector field (following a vector field is solving what kind of equation?). The first idea one might have is to use the gradient. However, while this idea has the right intentions, the issue is that the gradient of the probability will average out all of the possible probabilities, and will thus flow towards the mode of the distribution:

To overcome this issue, we look to physical systems and see that a satellite orbiting a planet always nicely stays on some manifold instead of following the gradient:

The reason why it does is because it has momentum. Recall from basic physics that one way to describe a physical system is through Hamiltonian mechanics, where $H(x,p)$ is the energy associated with the state $(x,p)$ (normally $x$ is location and $p$ is momentum). Due to conservation of energy, the solution of the dynamical equations leads to $H(x,p)$ being constant, and thus the dynamics follow the level sets of $H$. From the Hamiltonian the dynamics of the system are:

\[ \begin{align} \frac{dx}{dt} &= \frac{dH}{dp}\\ \frac{dp}{dt} &= -\frac{dH}{dx} \end{align} \]
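
To see why the dynamics stay on the level sets of $H$, differentiate $H$ along a trajectory and substitute the equations above:

\[ \frac{dH}{dt} = \frac{dH}{dx}\frac{dx}{dt} + \frac{dH}{dp}\frac{dp}{dt} = \frac{dH}{dx}\frac{dH}{dp} - \frac{dH}{dp}\frac{dH}{dx} = 0 \]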

Here we want our Hamiltonian to be our posterior probability, so that way we stay on the manifold of high probability. This means:

\[ H(x,p) = - \log \pi(x,p) \]

where $\pi(x,p) = \pi(p|x)\pi(x)$ (where I am now using $\pi$ for probability since $p$ is momentum!). So to lift from a probability over parameters to one that includes momentum, we simply need to choose a conditional distribution $\pi(p|x)$. This would mean that

\[ \begin{align} H(x,p) &= -\log \pi(p|x) - \log \pi(x)\\ &= K(p,x) + V(x) \end{align} \]

where $K$ is the kinetic energy and $V$ is the potential. Thus the potential energy is directly given by the posterior calculation, and the kinetic energy is thus a choice that is used to build the correct Hamiltonian. Hamiltonian Monte Carlo methods then dig into good ways to choose the kinetic energy function. This is done at the start (along with the choice of ODE solver time step) in such a way that it maximizes acceptance probabilities.
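
As a rough sketch of how a single HMC transition looks in code (an illustration, not any library's implementation), one resamples the momentum, integrates Hamilton's equations with a leapfrog scheme using the gradient of the log-posterior, and then applies a Metropolis correction. Here logp and gradlogp are hypothetical stand-ins for $\log \pi(x)$ and its gradient (e.g. computed via AD), and a standard Gaussian kinetic energy is assumed:

# One Hamiltonian Monte Carlo transition: leapfrog integration plus a Metropolis correction.
# `logp` and `gradlogp` are hypothetical stand-ins for the log-posterior and its gradient.
function hmc_step(logp, gradlogp, x; ϵ = 0.1, L = 20)
    p = randn(length(x))                       # resample momentum from π(p|x) = N(0, I)
    H0 = -logp(x) + sum(abs2, p) / 2           # H = V(x) + K(p)
    xnew, pnew = copy(x), copy(p)
    pnew .+= (ϵ / 2) .* gradlogp(xnew)         # half step in momentum
    for l in 1:L
        xnew .+= ϵ .* pnew                     # full step in position
        if l != L
            pnew .+= ϵ .* gradlogp(xnew)       # full step in momentum
        end
    end
    pnew .+= (ϵ / 2) .* gradlogp(xnew)         # final half step in momentum
    H1 = -logp(xnew) + sum(abs2, pnew) / 2
    return log(rand()) < H0 - H1 ? xnew : x    # accept with probability min(1, exp(H0 - H1))
end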

Connections to Differentiable Programming

\[ -\frac{dH}{dx} \]

Computing this term requires calculating the gradient of the likelihood function with respect to the parameters, so we are once again using the gradient of our simulator! This means that all of our previous discussion on automatic differentiation and differentiable programming applies to the Hamiltonian Monte Carlo context.

There's another thread to follow here: transformations of probability distributions are pushforwards through the Jacobian of the transformation (via the change of variables formula for integrals), and this is used when doing variational inference.

Symplectic and Geometric Integration

One way to integrate the ODEs which result from the Hamiltonian is to simply treat them as a generic system of first order ODEs and solve with a standard method. However, this loses information (the symplectic structure) and can result in drift. This is demonstrated by looking at the long time solution of the pendulum:

 using ParameterizedFunctions
 u0 = [1.,0.]
 harmonic! = @ode_def HarmonicOscillator begin
@@ -63,4 +63,4 @@
 plot(sol,vars=(1,2))
 
 plot(sol)
-

What should be an oscillatory system slowly loses energy and falls inward towards the center. To avoid this issue, we can do a few things:

  1. Project back to the manifold after steps. That can be costly (but might only need to happen every once in a while!)

  2. Use a symplectic integrator.

A symplectic integrator is an integrator whose solution lives on a symplectic manifold, i.e. it preserves area in the $(x,p)$ ellipses as it numerically approximates the flow. This means that:

  • Long-time integrations are truly cyclic with only floating point drift.

  • Steps preserve area. In the sense of Hamiltonian Monte Carlo, this means preserving probability and thus increasing the acceptance rate.

These properties are demonstrated in the Kepler problem demo. However, note that while the solution lives on a symplectic manifold, it isn't necessarily the correct symplectic manifold. The shift in the manifold is $\mathcal{O}(\Delta t^k)$ where $k$ is the order of the method. For more information on symplectic integration, consult this StackOverflow response which goes into depth.

Application: Bayesian Estimation of Differential Equation Parameters

For a full demo of probabilistic programming on a differential equation system, see this tutorial on Bayesian inference of pendulum parameters utilizing DifferentialEquations.jl and DiffEqBayes.jl.

Bayesian Estimation of Posterior Distributions with Variational Inference

Instead of using sampling, one can use variational inference to push through probability distributions. There are many ways to do variational inference, but a lot of the methods can be very model-specific. However, a recent change to probabilistic programming has been the development of Automatic Differentiation Variational Inference (ADVI): a general variational inference method which is not model-specific and instead uses AD. This has allowed for large expensive models to get effective distributional estimation, something that wasn't previously possible with HMC. In this section we will build up this methodology and understand its performance characteristics.

ADVI as Optimization

In this form of variational inference, we wish to directly estimate the posterior distribution. To do so, we pick a functional form to represent the solution $q(\theta; \phi)$ where $\phi$ are latent variables. We want our resulting distribution to fit the posterior, and thus we enforce that:

\[ \phi^\ast = \text{argmin}_{\phi} \text{KL} \left(q(\theta; \phi) \Vert p(\theta | D)\right) \]

where KL is the KL-divergence. The KL-divergence acts as a (non-symmetric) distance-like measure between probability distributions, and so this is simply a cost function over the distance between a chosen distribution and a desired distribution, where when $\phi$ are good we will have $q$ as a good approximation to the posterior.

However, the KL divergence lacks an analytical form because it requires knowing the posterior, the quantity we are trying to numerically estimate. It turns out, though, that we can instead maximize the Evidence Lower Bound (ELBO):

\[ \mathcal{L}(\phi) = \mathbb{E}_{q}[\log p(x,\theta)] - \mathbb{E}_q [\log q(\theta; \phi)] \]

The ELBO is equivalent to the negative KL divergence up to a constant $\log p(x)$, which means that maximizing this is equivalent to minimizing the KL divergence.

One last detail is necessary in order for this problem to be tractable. To know the set of possible values to optimize over, we assume that the support of $q$ is a subset of the support of the prior. This means that our prior has to cover the support of the posterior, which makes sense and matches Cromwell's rule for MCMC.

At this point we now assume that $q$ is Gaussian. When we rewrite the ELBO in terms of a standard Gaussian (the reparameterization trick), we receive an expectation that is automatically differentiable, so calculating gradients is done with AD. Using only one or a few solves (samples of that expectation) gives a noisy gradient estimate, which a stochastic optimizer can then use to hone in on the best latent variables $\phi$.

A Note on Implementation of Optimization for Probabilistic Programming

Variable domains can be constrained. For example, you may require a positive value. This can be handled by a transformation. For example, if $y$ must be positive, then one can optimize implicitly using $\exp(y)$ at every point, thus allowing $y$ to be any real value while $\exp(y)$ is positive. This turns the problem into an unconstrained optimization over the real numbers, and similar transformations can be done with any of the standard probability distributions' support functions.

Citation

For Hamiltonian Monte Carlo, the images were taken from A Conceptual Introduction to Hamiltonian Monte Carlo by Michael Betancourt.

\ No newline at end of file +

What should be an oscillatory system slowly loses energy and falls inward towards the center. To avoid this issue, we can do a few things:

  1. Project back to the manifold after steps. That can be costly (but might only need to happen every once in a while!)

  2. Use a symplectic integrator.

A symplectic integrator is an integrator whose solution lives on a symplectic manifold, i.e. it preserves area in the $(x,p)$ ellipses as it numerically approximates the flow. This means that:

  • Long-time integrations are truly cyclic with only floating point drift.

  • Steps preserve area. In the sense of Hamiltonian Monte Carlo, this means preserving probability and thus increasing the acceptance rate.

These properties are demonstrated in the Kepler problem demo. However, note that while the solution lives on a symplectic manifold, it isn't necessarily the correct symplectic manifold. The shift in the manifold is $\mathcal{O}(\Delta t^k)$ where $k$ is the order of the method. For more information on symplectic integration, consult this StackOverflow response which goes into depth.
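
As a minimal hand-written illustration of the difference (a sketch, not the Kepler demo itself), compare explicit Euler with symplectic Euler on the pendulum Hamiltonian $H(x,p) = \frac{p^2}{2} - \cos(x)$: the explicit scheme's energy drifts, while the symplectic scheme's energy error stays bounded.

# Hand-written comparison of explicit Euler vs symplectic Euler on the pendulum.
function compare_integrators(; dt = 0.1, nsteps = 10_000)
    H(x, p) = p^2 / 2 - cos(x)                      # pendulum Hamiltonian
    dHdx(x) = sin(x)
    xe, pe = 1.0, 0.0                               # explicit Euler state
    xs, ps = 1.0, 0.0                               # symplectic Euler state
    for i in 1:nsteps
        xe, pe = xe + dt * pe, pe - dt * dHdx(xe)   # explicit Euler: both updates use old values
        ps -= dt * dHdx(xs)                         # symplectic Euler: update momentum first...
        xs += dt * ps                               # ...then position with the new momentum
    end
    return H(xe, pe) - H(1.0, 0.0), H(xs, ps) - H(1.0, 0.0)   # energy drift of each scheme
end
compare_integrators()   # the first drift grows with nsteps, the second stays small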

Application: Bayesian Estimation of Differential Equation Parameters

For a full demo of probabilistic programming on a differential equation system, see this tutorial on Bayesian inference of pendulum parameters utilizing DifferentialEquations.jl and DiffEqBayes.jl.

Bayesian Estimation of Posterior Distributions with Variational Inference

Instead of using sampling, one can use variational inference to push through probability distributions. There are many ways to do variational inference, but a lot of the methods can be very model-specific. However, a recent change to probabilistic programming has been the development of Automatic Differentiation Variational Inference (ADVI): a general variational inference method which is not model-specific and instead uses AD. This has allowed for large expensive models to get effective distributional estimation, something that wasn't previously possible with HMC. In this section we will build up this methodology and understand its performance characteristics.

ADVI as Optimization

In this form of variational inference, we wish to directly estimate the posterior distribution. To do so, we pick a functional form to represent the solution $q(\theta; \phi)$ where $\phi$ are latent variables. We want our resulting distribution to fit the posterior, and thus we enforce that:

\[ \phi^\ast = \text{argmin}_{\phi} \text{KL} \left(q(\theta; \phi) \Vert p(\theta | D)\right) \]

where KL is the KL-divergence. The KL-divergence acts as a (non-symmetric) distance-like measure between probability distributions, and so this is simply a cost function over the distance between a chosen distribution and a desired distribution, where when $\phi$ are good we will have $q$ as a good approximation to the posterior.

However, the KL divergence lacks an analytical form because it requires knowing the posterior, the quantity we are trying to numerically estimate. It turns out, though, that we can instead maximize the Evidence Lower Bound (ELBO):

\[ \mathcal{L}(\phi) = \mathbb{E}_{q}[\log p(x,\theta)] - \mathbb{E}_q [\log q(\theta; \phi)] \]

The ELBO is equivalent to the negative KL divergence up to a constant $\log p(x)$, which means that maximizing this is equivalent to minimizing the KL divergence.
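
To see where that constant comes from, expand the KL divergence and use $p(\theta|x) = \frac{p(x,\theta)}{p(x)}$:

\[ \text{KL}\left(q(\theta;\phi) \Vert p(\theta|x)\right) = \mathbb{E}_q[\log q(\theta;\phi)] - \mathbb{E}_q[\log p(x,\theta)] + \log p(x) = -\mathcal{L}(\phi) + \log p(x) \]

Since $\log p(x)$ does not depend on $\phi$, maximizing $\mathcal{L}$ is the same as minimizing the KL divergence.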

One last detail is necessary in order for this problem to be tractable. To know the set of possible values to optimize over, we assume that the support of $q$ is a subset of the support of the prior. This means that our prior has to cover the support of the posterior, which makes sense and matches Cromwell's rule for MCMC.

At this point we now assume that $q$ is Gaussian. When we rewrite the ELBO in terms of a standard Gaussian (the reparameterization trick), we receive an expectation that is automatically differentiable, so calculating gradients is done with AD. Using only one or a few solves (samples of that expectation) gives a noisy gradient estimate, which a stochastic optimizer can then use to hone in on the best latent variables $\phi$.
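
As a rough sketch of those mechanics (a toy illustration, not the ADVI implementation of any probabilistic programming library), take a one-dimensional Gaussian $q(\theta;\phi) = \mathcal{N}(\mu, e^{2\omega})$, write $\theta = \mu + e^{\omega} z$ with $z \sim \mathcal{N}(0,1)$, and ascend a Monte Carlo estimate of the ELBO with AD. Here logjoint is a hypothetical stand-in for $\log p(x,\theta)$:

using ForwardDiff

# Monte Carlo ELBO for q(θ) = N(μ, exp(2ω)) via the reparameterization θ = μ + exp(ω) z.
function elbo(logjoint, ϕ, zs)
    μ, ω = ϕ
    sum(logjoint(μ + exp(ω) * z) for z in zs) / length(zs) + ω + 0.5 * log(2π * ℯ)   # E_q[log p(x,θ)] + entropy of q
end

function advi(logjoint; iters = 2000, nsamples = 10, η = 0.01)
    ϕ = [0.0, 0.0]                              # initial (μ, ω)
    for i in 1:iters
        zs = randn(nsamples)                    # fresh base samples give a noisy gradient
        g = ForwardDiff.gradient(ϕ -> elbo(logjoint, ϕ, zs), ϕ)
        ϕ .+= η .* g                            # stochastic gradient ascent on the ELBO
    end
    return ϕ
end

μ, ω = advi(θ -> -(θ - 3)^2 / 2)                # hypothetical unnormalized N(3, 1) target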

A Note on Implementation of Optimization for Probabilistic Programming

Variable domains can be constrained. For example, you may require a positive value. This can be handled by a transformation. For example, if $y$ must be positive, then one can optimize implicitly using $\exp(y)$ at every point, thus allowing $y$ to be any real value while $\exp(y)$ is positive. This turns the problem into an unconstrained optimization over the real numbers, and similar transformations can be done with any of the standard probability distributions' support functions.
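
For instance, here is a tiny sketch of this trick (with a made-up objective and plain gradient descent rather than a real optimizer): optimize over the unconstrained $z = \log y$ so that $y = \exp(z)$ is positive at every iterate.

using ForwardDiff

# Optimize a function of a positive quantity y by working with the unconstrained z = log(y).
function fit_positive(loss; iters = 1000, η = 0.05)
    z = 0.0                                        # unconstrained variable, y = exp(z) > 0
    for _ in 1:iters
        z -= η * ForwardDiff.derivative(z -> loss(exp(z)), z)
    end
    return exp(z)                                  # positive by construction
end

y = fit_positive(y -> (log(y) - 1)^2 + y / 10)     # hypothetical objective defined only for y > 0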

Citation

For Hamiltonian Monte Carlo, the images were taken from A Conceptual Introduction to Hamiltonian Monte Carlo by Michael Betancourt.

\ No newline at end of file diff --git a/notes/17-Global_Sensitivity_Analysis/index.html b/notes/17-Global_Sensitivity_Analysis/index.html index f63973b1..b003581f 100644 --- a/notes/17-Global_Sensitivity_Analysis/index.html +++ b/notes/17-Global_Sensitivity_Analysis/index.html @@ -9,4 +9,4 @@ using LatinHypercubeSampling p = LHCoptim(120,2,1000) scatter(p[1][:,1],p[1][:,2]) -

For a reference library with many different quasi-Monte Carlo samplers, check out QuasiMonteCarlo.jl.

Fourier Amplitude Sensitivity Testing (FAST) and eFAST

The FAST method is a change to the Sobol method to allow for faster convergence. First transform the variables $x_i$ onto the space $[0,1]$. Then, instead of the linear decomposition, one decomposes into a Fourier basis:

\[ f(x_1,x_2,\ldots,x_n) = \sum_{m_1 = -\infty}^{\infty} \ldots \sum_{m_n = -\infty}^{\infty} C_{m_1m_2\ldots m_n}\exp\left(2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) \]

where

\[ C_{m_1m_2\ldots m_n} = \int_0^1 \ldots \int_0^1 f(x_1,x_2,\ldots,x_n) \exp\left(-2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) dx_1 \ldots dx_n \]

The ANOVA-like decomposition is thus

\[ f_0 = C_{0\ldots 0} \]

\[ f_j = \sum_{m_j \neq 0} C_{0\ldots 0 m_j 0 \ldots 0} \exp (2\pi i m_j x_j) \]

\[ f_{jk} = \sum_{m_j \neq 0} \sum_{m_k \neq 0} C_{0\ldots 0 m_j 0 \ldots m_k 0 \ldots 0} \exp \left(2\pi i (m_j x_j + m_k x_k)\right) \]

The first order conditional variance is thus:

\[ V_j = \int_0^1 f_j^2 (x_j) dx_j = \sum_{m_j \neq 0} |C_{0\ldots 0 m_j 0 \ldots 0}|^2 \]

or

\[ V_j = 2\sum_{m_j = 1}^\infty \left(A_{m_j}^2 + B_{m_j}^2 \right) \]

where $C_{0\ldots 0 m_j 0 \ldots 0} = A_{m_j} + i B_{m_j}$. By Fourier series we know this to be:

\[ A_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\cos(2\pi m_j x_j)dx \]

\[ B_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\sin(2\pi m_j x_j)dx \]

Implementation via the Ergodic Theorem

Define

\[ X_j(s) = \frac{1}{2\pi} (\omega_j s \mod 2\pi) \]

By the ergodic theorem, if the $\omega_j$ are irrational numbers, then the dynamical system will never repeat values and thus it will create a solution that is dense in the plane (we will prove this a bit later). As an animation:

(here, $\omega_1 = \pi$ and $\omega_2 = 7$)

This means that:

\[ A_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\sin(m_j \omega_j s)ds \]

i.e. the multidimensional integral can be approximated by the integral over a single line.

One can relax this requirement to get a simpler form for the integral. Using integer $\omega_j$, the integrand is periodic in $s$ and so only integrating over a period of $2\pi$ is required. This would mean that:

\[ A_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\sin(m_j \omega_j s)ds \]

It's only approximate since the sequence cannot be dense. For example, with $\omega_1 = 11$ and $\omega_2 = 7$:

A higher period thus gives a better fill of the space and thus a better approximation, but may require more points. However, this transformation makes the true integrals simple one dimensional quadratures which can be efficiently computed.

To get the total index from this method, one can calculate the total contribution of the complementary set, i.e. $V_{c_i} = \sum_{j \neq i} V_j$, normalize by the total variance to get $S_{c_i}$, and then

\[ S_{T_i} = 1 - S_{c_i} \]

Note that this then is a fast measure for the total contribution of variable $i$, including all higher-order nonlinear interactions, all from one-dimensional integrals! (This extension is called extended FAST or eFAST)

Proof of the Ergodic Theorem

Look at the map $x_{n+1} = x_n + \alpha \pmod{1}$, where $\alpha$ is irrational. This is the irrational rotation map that corresponds to our problem. We wish to prove that in any interval $I$, there is a point of our orbit in this interval.

First let's prove a useful result: our points get arbitrarily close. Assume that for some finite $\epsilon$ no two points of the orbit are within $\epsilon$ of each other. Then the points are spaced at least $\epsilon$ apart, and thus we have at most $\frac{2\pi}{\epsilon}$ points (rounded up). A finite orbit must eventually repeat, so the orbit is periodic, meaning there is a $p$ such that

\[ x_{n+p} = x_n \]

which means that $p \alpha$ is an integer, i.e. $\alpha = \frac{k}{p}$ for some integer $k$, which is a contradiction since $\alpha$ is irrational.

Thus for every $\epsilon$ there are two points of the orbit which are less than $\epsilon$ apart. Now take any arbitrary interval $I$. Let $\epsilon < d/2$ where $d$ is the length of the interval. We have just shown that there are points $x_{n+m}$ and $x_{n+k}$ which are $<\epsilon$ apart. Assuming WLOG $m>k$, this means that $m-k$ rotations takes one from $x_{n+k}$ to $x_{n+m}$, and so $m-k$ rotations is a rotation by less than $\epsilon$. If we do $\frac{1}{\epsilon}$ (rounded up) of these rotations, we will then cover the space with intervals of length less than $\epsilon$, each containing one point of the orbit. Since $\epsilon < d/2$, one of those intervals is completely encapsulated in $I$, which means there is at least one point of our orbit that is in $I$.

Thus for every interval we have at least one point in our orbit that lies in it, proving that the orbit of the rotation map with irrational $\alpha$ is dense. Note that during the proof we essentially showed as well that if $\alpha$ is rational, then the map is periodic, with period given by the denominator of $\alpha$ in its reduced form.

A Quick Note on Parallelism

Very quick note: all of these methods are embarrassingly parallel, since they do the same calculation for each parameter or trajectory, and each calculation is long. For quasi-Monte Carlo, after generating "good enough" trajectories, one can evaluate the model at all points in parallel, and then simply do the GSA index measurement. For FAST, one can do each quadrature in parallel.

\ No newline at end of file +

For a reference library with many different quasi-Monte Carlo samplers, check out QuasiMonteCarlo.jl.

Fourier Amplitude Sensitivity Testing (FAST) and eFAST

The FAST method is a change to the Sobol method to allow for faster convergence. First transform the variables $x_i$ onto the space $[0,1]$. Then, instead of the linear decomposition, one decomposes into a Fourier basis:

\[ f(x_1,x_2,\ldots,x_n) = \sum_{m_1 = -\infty}^{\infty} \ldots \sum_{m_n = -\infty}^{\infty} C_{m_1m_2\ldots m_n}\exp\left(2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) \]

where

\[ C_{m_1m_2\ldots m_n} = \int_0^1 \ldots \int_0^1 f(x_1,x_2,\ldots,x_n) \exp\left(-2\pi i (m_1 x_1 + \ldots + m_n x_n)\right) dx_1 \ldots dx_n \]

The ANOVA-like decomposition is thus

\[ f_0 = C_{0\ldots 0} \]

\[ f_j = \sum_{m_j \neq 0} C_{0\ldots 0 m_j 0 \ldots 0} \exp (2\pi i m_j x_j) \]

\[ f_{jk} = \sum_{m_j \neq 0} \sum_{m_k \neq 0} C_{0\ldots 0 m_j 0 \ldots m_k 0 \ldots 0} \exp \left(2\pi i (m_j x_j + m_k x_k)\right) \]

The first order conditional variance is thus:

\[ V_j = \int_0^1 f_j^2 (x_j) dx_j = \sum_{m_j \neq 0} |C_{0\ldots 0 m_j 0 \ldots 0}|^2 \]

or

\[ V_j = 2\sum_{m_j = 1}^\infty \left(A_{m_j}^2 + B_{m_j}^2 \right) \]

where $C_{0\ldots 0 m_j 0 \ldots 0} = A_{m_j} + i B_{m_j}$. By Fourier series we know this to be:

\[ A_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\cos(2\pi m_j x_j)dx \]

\[ B_{m_j} = \int_0^1 \ldots \int_0^1 f(x)\sin(2\pi m_j x_j)dx \]

Implementation via the Ergodic Theorem

Define

\[ X_j(s) = \frac{1}{2\pi} (\omega_j s \mod 2\pi) \]

By the ergodic theorem, if the $\omega_j$ are irrational numbers, then the dynamical system will never repeat values and thus it will create a solution that is dense in the plane (we will prove this a bit later). As an animation:

(here, $\omega_1 = \pi$ and $\omega_2 = 7$)

This means that:

\[ A_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} = \lim_{T\rightarrow \infty} \frac{1}{2T} \int_{-T}^T f(x)\sin(m_j \omega_j s)ds \]

i.e. the multidimensional integral can be approximated by the integral over a single line.

One can relax this requirement to get a simpler form for the integral. Using integer $\omega_j$, the integrand is periodic in $s$ and so only integrating over a period of $2\pi$ is required. This would mean that:

\[ A_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\cos(m_j \omega_j s)ds \]

\[ B_{m_j} \approx \frac{1}{2\pi} \int_{-\pi}^\pi f(x)\sin(m_j \omega_j s)ds \]

It's only approximate since the sequence cannot be dense. For example, with $\omega_1 = 11$ and $\omega_2 = 7$:

A higher period thus gives a better fill of the space and thus a better approximation, but may require more points. However, this transformation makes the true integrals simple one dimensional quadratures which can be efficiently computed.

To get the total index from this method, one can calculate the total contribution of the complementary set, i.e. $V_{c_i} = \sum_{j \neq i} V_j$, normalize by the total variance to get $S_{c_i}$, and then

\[ S_{T_i} = 1 - S_{c_i} \]

Note that this then is a fast measure for the total contribution of variable $i$, including all higher-order nonlinear interactions, all from one-dimensional integrals! (This extension is called extended FAST or eFAST)
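
Here is a from-scratch sketch of the whole procedure on a toy function (an illustration, not the implementation of any GSA library; the test function, frequencies, harmonic count, and number of quadrature points are arbitrary choices): evaluate $f$ along the search curve with integer frequencies, compute $A_{m_j}$ and $B_{m_j}$ by one-dimensional quadrature over $[-\pi,\pi]$, and form the first-order indices from the resulting variances.

# From-scratch FAST sketch for a toy function on [0,1]^2.
f(x) = x[1] + 2 * x[2] + x[1] * x[2]           # stand-in model

ω = [11, 7]                                    # integer frequencies, one per variable
N = 10_000                                     # quadrature points along the search curve
s = range(-π, π, length = N)
curve(si) = [mod(w * si, 2π) / (2π) for w in ω]    # x_j(s) ∈ [0, 1]
ys = [f(curve(si)) for si in s]

V = sum(abs2, ys .- sum(ys) / N) / N           # total variance along the curve
function first_order_variance(j; M = 4)        # sum over the first M harmonics of ω_j
    Vj = 0.0
    for m in 1:M
        A = sum(ys .* cos.(m * ω[j] .* s)) / N # (1/2π) ∫ f cos(m ω_j s) ds
        B = sum(ys .* sin.(m * ω[j] .* s)) / N
        Vj += 2 * (A^2 + B^2)
    end
    return Vj
end
S1 = [first_order_variance(j) / V for j in 1:2]    # first-order sensitivity indices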

Proof of the Ergodic Theorem

Look at the map $x_{n+1} = x_n + \alpha \pmod{1}$, where $\alpha$ is irrational. This is the irrational rotation map that corresponds to our problem. We wish to prove that in any interval $I$, there is a point of our orbit in this interval.

First let's prove a useful result: our points get arbitrarily close. Assume that for some finite $\epsilon$ no two points of the orbit are within $\epsilon$ of each other. Then the points are spaced at least $\epsilon$ apart, and thus we have at most $\frac{2\pi}{\epsilon}$ points (rounded up). A finite orbit must eventually repeat, so the orbit is periodic, meaning there is a $p$ such that

\[ x_{n+p} = x_n \]

which means that $p \alpha$ is an integer, i.e. $\alpha = \frac{k}{p}$ for some integer $k$, which is a contradiction since $\alpha$ is irrational.

Thus for every $\epsilon$ there are two points of the orbit which are less than $\epsilon$ apart. Now take any arbitrary interval $I$. Let $\epsilon < d/2$ where $d$ is the length of the interval. We have just shown that there are points $x_{n+m}$ and $x_{n+k}$ which are $<\epsilon$ apart. Assuming WLOG $m>k$, this means that $m-k$ rotations takes one from $x_{n+k}$ to $x_{n+m}$, and so $m-k$ rotations is a rotation by less than $\epsilon$. If we do $\frac{1}{\epsilon}$ (rounded up) of these rotations, we will then cover the space with intervals of length less than $\epsilon$, each containing one point of the orbit. Since $\epsilon < d/2$, one of those intervals is completely encapsulated in $I$, which means there is at least one point of our orbit that is in $I$.

Thus for every interval we have at least one point in our orbit that lies in it, proving that the orbit of the rotation map with irrational $\alpha$ is dense. Note that during the proof we essentially showed as well that if $\alpha$ is rational, then the map is periodic, with period given by the denominator of $\alpha$ in its reduced form.
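
A quick numerical check of this (an illustrative snippet, not part of the original notes): iterate the map with an irrational versus a rational $\alpha$ and count how many of the 100 equal subintervals of $[0,1)$ are ever visited.

# Fraction of 100 equal bins of [0,1) hit by the orbit of x_{n+1} = x_n + α (mod 1).
coverage(α; n = 10_000) = length(unique(floor(Int, 100 * mod(k * α, 1)) for k in 1:n)) / 100

coverage(sqrt(2))   # irrational α: ≈ 1.0, the orbit visits every bin
coverage(3 // 7)    # rational α: only 7 of the 100 bins are ever visited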

A Quick Note on Parallelism

Very quick note: all of these methods are embarrassingly parallel, since they do the same calculation for each parameter or trajectory, and each calculation is long. For quasi-Monte Carlo, after generating "good enough" trajectories, one can evaluate the model at all points in parallel, and then simply do the GSA index measurement. For FAST, one can do each quadrature in parallel.
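
A minimal sketch of that quasi-Monte Carlo pattern (with hypothetical stand-ins for the model and the design; the real model would be an ODE solve): evaluate the model at every design point in parallel with threads, then run the cheap index calculation on the stored outputs.

expensive_model(p) = sum(abs2, p)                 # hypothetical stand-in for an ODE-solve-based output
design = [rand(3) for _ in 1:10_000]              # hypothetical stand-in for a Sobol/LHS design

outputs = Vector{Float64}(undef, length(design))
Threads.@threads for i in eachindex(design)
    outputs[i] = expensive_model(design[i])       # each evaluation is independent
end
# ...then feed `outputs` into the (cheap) GSA index calculation.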

\ No newline at end of file diff --git a/notes/18-Code_Profiling_and_Optimization/index.html b/notes/18-Code_Profiling_and_Optimization/index.html index 012f5251..22ba292f 100644 --- a/notes/18-Code_Profiling_and_Optimization/index.html +++ b/notes/18-Code_Profiling_and_Optimization/index.html @@ -100,4 +100,4 @@ @profile for i in 1:10000 sol = solve(prob,Tsit5(),save_everystep=false) end
 Juno.profiler()
-

Now that this looks like a fairly good profile, we can use this to dig in and find out what lines need to be optimized!

\ No newline at end of file +

Now that this looks like a fairly good profile, we can use this to dig in and find out what lines need to be optimized!

\ No newline at end of file diff --git a/notes/19-Uncertainty_Programming-Generalized_Uncertainty_Quantification/index.html b/notes/19-Uncertainty_Programming-Generalized_Uncertainty_Quantification/index.html index 84c20b9b..1226fb10 100644 --- a/notes/19-Uncertainty_Programming-Generalized_Uncertainty_Quantification/index.html +++ b/notes/19-Uncertainty_Programming-Generalized_Uncertainty_Quantification/index.html @@ -198,4 +198,4 @@ plot(sim,vars=(0,1),linealpha=0.4)
 ERROR: UndefVarError: `AdaptiveProbIntsUncertainty` not defined
-

Notice that while an interval estimate would have grown to allow all extremes together, this form keeps the trajectories alive, allowing them to fall back to the mode, which decreases the true uncertainty. This is thus a good explanation as to why general methods will overestimate uncertainty.

Adjoints of Uncertainty and the Koopman Operator

Everything that we've demonstrated here so far can be thought of as "forward mode uncertainty quantification". For every example we have constructed a method such that, for a known probability distribution in $x$, we build the probability distribution of the output of the program, and then compute quantities from that. On a dynamical system this pushforward of a measure is denoted by the Frobenius-Perron operator. With a pushforward operator $P$ and an initial uncertainty density $f$, we can represent calculating the expected value of some cost function on the solution via:

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_{S(A)} P f(x) g(x) dx \]

where $S$ is the program, i.e. $S(A)$ is the total set of points obtained by pushing every value of $A$ through our program, and $P f(x)$ is the pushforward operator applied to the probability distribution. What this means is that, to calculate an expectation on the output of our program, such as the mean value of the ODE's solution given uncertainty in the parameters, we can push forward the probability distribution to construct $Pf$ and then calculate the expected value of some cost function $g$ against this pushed-forward distribution.

The problem, as seen earlier, is that pushing forward entire probability distributions is a fairly expensive process. We can instead think about doing the adjoint to this cost function, i.e. pulling back the cost function and computing it on the initial density. In terms of inner product notation, this would be doing:

\[ \langle Pf,g \rangle = \langle f, Ug \rangle \]

meaning $U$ is the adjoint operator to the pushforward $P$. This operator is known as the Koopman operator. There are many useful properties of the Koopman operator, one special property being that it is a linear operator on the space of observables, but it also gives a nice expression for computing uncertainty expectations. Using the Koopman operator, we can rewrite the expectation as:

\[ \mathbb{E}[g(x)|X \sim Pf] = \mathbb{E}[Ug(x)|X \sim f] \]

or perform the integral on the pullback of the cost function, i.e.

\[ \mathbb{E}[Ug(x)|X \sim f] = \int_A Ug(x) f(x) dx \]

In images it looks like:

Koopman vs FP

This expression gives us a fast way to compute expectations on the program output without having to compute the full uncertainty distribution on the output. This can thus be used for optimization under uncertainty, i.e. the optimization of loss functions with respect to expectations of the program's output under the assumption of given input uncertainty distributions. For more information, see The Koopman Expectation: An Operator Theoretic Method for Efficient Analysis and Optimization of Uncertain Hybrid Dynamical Systems.

\ No newline at end of file +

Notice that while an interval estimate would have grown to allow all extremes together, this form keeps the trajectories alive, allowing them to fall back to the mode, which decreases the true uncertainty. This is thus a good explanation as to why general methods will overestimate uncertainty.

Adjoints of Uncertainty and the Koopman Operator

Everything that we've demonstrated here so far can be thought of as "forward mode uncertainty quantification". For every example we have constructed a method such that, for a known probability distribution in $x$, we build the probability distribution of the output of the program, and then compute quantities from that. On a dynamical system this pushforward of a measure is denoted by the Frobenius-Perron operator. With a pushforward operator $P$ and an initial uncertainty density $f$, we can represent calculating the expected value of some cost function on the solution via:

\[ \mathbb{E}[g(x)|X \sim Pf] = \int_{S(A)} P f(x) g(x) dx \]

where $S$ is the program, i.e. $S(A)$ is the total set of points obtained by pushing every value of $A$ through our program, and $P f(x)$ is the pushforward operator applied to the probability distribution. What this means is that, to calculate an expectation on the output of our program, such as the mean value of the ODE's solution given uncertainty in the parameters, we can push forward the probability distribution to construct $Pf$ and then calculate the expected value of some cost function $g$ against this pushed-forward distribution.

The problem, as seen earlier, is that pushing forward entire probability distributions is a fairly expensive process. We can instead think about doing the adjoint to this cost function, i.e. pulling back the cost function and computing it on the initial density. In terms of inner product notation, this would be doing:

\[ \langle Pf,g \rangle = \langle f, Ug \rangle \]

meaning $U$ is the adjoint operator to the pushforward $P$. This operator is known as the Koopman operator. There are many useful properties of the Koopman operator, one special property being that it is a linear operator on the space of observables, but it also gives a nice expression for computing uncertainty expectations. Using the Koopman operator, we can rewrite the expectation as:

\[ \mathbb{E}[g(x)|X \sim Pf] = \mathbb{E}[Ug(x)|X \sim f] \]

or perform the integral on the pullback of the cost function, i.e.

\[ \mathbb{E}[Ug(x)|X \sim f] = \int_A Ug(x) f(x) dx \]

In images it looks like:

Koopman vs FP

This expression gives us a fast way to compute expectations on the program output without having to compute the full uncertainty distribution on the output. This can thus be used for optimization under uncertainty, i.e. the optimization of loss functions with respect to expectations of the program's output under the assumption of given input uncertainty distributions. For more information, see The Koopman Expectation: An Operator Theoretic Method for Efficient Analysis and Optimization of Uncertain Hybrid Dynamical Systems.
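
As a toy illustration of the pullback computation (a hand-written sketch, not the implementation from the referenced paper): for the scalar ODE $u' = -p u$, $u(0) = 1$, with uncertain $p \sim \mathcal{N}(0.5, 0.1^2)$ and cost $g(u) = u(1)^2$, the Koopman expectation evaluates $g(S(p))$ on quadrature points of the input density instead of constructing the full output distribution.

S(p) = exp(-p)                             # solution map u(1) of u' = -p*u, u(0) = 1
g(u) = u^2                                 # cost function on the program output
f(p) = exp(-((p - 0.5) / 0.1)^2 / 2) / (0.1 * sqrt(2π))   # input density, N(0.5, 0.1²)

ps = range(0.0, 1.0, length = 10_001)      # simple quadrature grid covering (essentially) the support
Eg = step(ps) * sum(g(S(p)) * f(p) for p in ps)   # E[g(u(1))] = ∫ Ug(p) f(p) dp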

\ No newline at end of file diff --git a/notes/index.html b/notes/index.html index 08f89df1..a7c8883d 100644 --- a/notes/index.html +++ b/notes/index.html @@ -1 +1 @@ - Notes Overview - MIT Parallel Computing and Scientific Machine Learning (SciML) \ No newline at end of file + Notes Overview - MIT Parallel Computing and Scientific Machine Learning (SciML) \ No newline at end of file diff --git a/package-lock.json b/package-lock.json index ac341b25..f6c04c1d 100644 --- a/package-lock.json +++ b/package-lock.json @@ -3,9 +3,9 @@ "lockfileVersion": 1, "dependencies": { "highlight.js": { - "version": "11.8.0", - "resolved": "https://registry.npmjs.org/highlight.js/-/highlight.js-11.8.0.tgz", - "integrity": "sha512-MedQhoqVdr0U6SSnWPzfiadUcDHfN/Wzq25AkXiQv9oiOO/sG0S7XkvpFIqWBl9Yq1UYyYOOVORs5UW2XlPyzg==" + "version": "11.9.0", + "resolved": "https://registry.npmjs.org/highlight.js/-/highlight.js-11.9.0.tgz", + "integrity": "sha512-fJ7cW7fQGCYAkgv4CPfwFHrfd/cLS4Hau96JuJ+ZTOWhjnhoeN1ub1tFmALm/+lW5z4WCAuAV9bm05AP0mS6Gw==" } } } diff --git a/sitemap.xml b/sitemap.xml index a76d1bbd..f4392131 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -3,283 +3,283 @@ https://book.sciml.ai/homework/01/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/11-Differentiable_Programming_and_Neural_Differential_Equations/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/14-PDEs_Convolutions_and_the_Mathematics_of_Locality/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/homework/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/06-The_Different_Flavors_of_Parallelism/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/course/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/13-GPU_programming/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/19-Uncertainty_Programming-Generalized_Uncertainty_Quantification/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/lectures/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/07-Ordinary_Differential_Equations-Applications_and_Discretizations/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/12-Description_of_MPI_and_MPI/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/09-Solving_Stiff_Ordinary_Differential_Equations/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/17-Global_Sensitivity_Analysis/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/homework/03/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/18-Code_Profiling_and_Optimization/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/homework/02/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/404/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/03-Introduction_to_Scientific_Machine_Learning_through_Physics-Informed_Neural_Networks/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/02-Optimizing_Serial_Code/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/10-Basic_Parameter_Estimation-Reverse-Mode_AD-and_Inverse_Problems/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 
https://book.sciml.ai/notes/05-The_Basics_of_Single_Node_Parallel_Computing/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/08-Forward-Mode_Automatic_Differentiation_%28AD%29_via_High_Dimensional_Algebras/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/15-Mixing_Differential_Equations_and_Neural_Networks_for_Physics-Informed_Learning/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/16-From_Optimization_to_Probabilistic_Programming/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/notes/04-How_Loops_Work-An_Introduction_to_Discrete_Dynamics/index.html - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/homework02/hw2/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture04/dynamical_systems/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture07/discretizing_odes/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture19/uncertainty_programming/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture11/adjoints/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture03/sciml/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture17/global_sensitivity/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture06/styles_of_parallelism/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture18/code_profiling/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture10/estimation_identification/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture09/stiff_odes/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture15/diffeq_machine_learning/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture02/optimizing/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture05/parallelism_overview/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture16/probabilistic_programming/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/homework03/hw3/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture13/gpus/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture08/automatic_differentiation/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/lecture14/pdes_and_convolutions/ - 2023-07-15 + 2023-10-29 monthly 0.5 https://book.sciml.ai/_weave/homework01/hw1/ - 2023-07-15 + 2023-10-29 monthly 0.5