Merge pull request #144 from patelvyom/patch-2

minor typos

ChrisRackauckas authored Jul 26, 2024
2 parents 37e762a + 9cfbe0a commit 66bdde3
Showing 1 changed file with 9 additions and 9 deletions.
18 changes: 9 additions & 9 deletions _weave/lecture16/probabilistic_programming.jmd
@@ -55,7 +55,7 @@
the repeat process of:
1. Sample variables
2. Compute output

-Doing this repeatedly then produces samples of $f(x)$ from which a numerical
+Repeatedly doing this produces samples of $f(x)$ from which a numerical
representation of the distribution can be had. From there, going to a multivariable
linear model like $f(x) = Ax$ is the same idea. Going to $f(x)$ where $f$ is an
arbitrary program is still the same idea: sample every variable in the program,
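The sample-then-compute loop can be sketched in a few lines (a hypothetical Python illustration; the lecture's own code is Julia, and the model $f(x) = ax$ with a random slope $a$ is my own toy choice):

```python
import random
import statistics

random.seed(0)

def sample_f(n=100_000):
    """Repeatedly sample the random variable a, then compute f(x) = a*x at x = 2."""
    x = 2.0
    samples = []
    for _ in range(n):
        a = random.gauss(1.0, 0.1)  # step 1: sample the variable, a ~ Normal(1, 0.1)
        samples.append(a * x)       # step 2: compute the output for this sample
    return samples

samples = sample_f()
# The empirical distribution of f(x): mean near 2, standard deviation near 0.2
print(statistics.mean(samples), statistics.stdev(samples))
```

The array `samples` is the numerical representation of the distribution of $f(x)$; any summary (mean, quantiles, histogram) can be read off from it.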
@@ -154,14 +154,14 @@
that is a decent working definition for practical use. The normal distribution
is defined by two parameters, $\mu$ and $\sigma$, and is given by the following
function:

-$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp(\frac{-(x-\mu)^2}{2\sigma^2})$$
+$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right)$$

This is a bell curve centered at $\mu$ with a standard deviation of $\sigma$. Our best guess
for the output, i.e. the model's prediction, should be the average measurement,
meaning that $\mu$ is the result from the simulator. $\sigma$ is a parameter for
how much measurement error we expect (some intuition on $\sigma$ will come soon).
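The density is straightforward to evaluate directly; a quick sanity check (hypothetical snippet, function name my own):

```python
import math

def normal_pdf(x, mu, sigma):
    """Evaluate f(x; mu, sigma), the Normal(mu, sigma) density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The bell curve peaks at x = mu with height 1/(sigma*sqrt(2*pi))
print(normal_pdf(0.0, 0.0, 1.0))  # ≈ 0.3989
print(normal_pdf(0.0, 0.0, 1.0) > normal_pdf(1.0, 0.0, 1.0))  # the peak is the maximum
```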

-Let's return to thinking about the ODE example. In this case we have $\theta$ as a
+Let's return to thinking about the ODE example. In this case, we have $\theta$ as a
vector of random variables. This means that $u(t;\theta)$ is a random variable for the
ODE $u'= ...$'s solution at a given point in time $t$. If we have a
measurement at a time $t_i$ and assume our measurement noise is normally distributed
@@ -182,7 +182,7 @@
product becomes a summation, and thus:

$$\begin{align}
\log p(D|\theta) &= \sum_i \log f(x_i;u(t_i;\theta),\sigma)\\
-&= \frac{N}{\sqrt{2\pi}\sigma} + \frac{1}{2\sigma^2} \sum_i -(x-\mu)^2
+&= -N\log\left(\sqrt{2\pi}\sigma\right) + \frac{1}{2\sigma^2} \sum_i -(x_i - u(t_i; \theta))^2
\end{align}$$

Notice that **maximizing this log-likelihood is equivalent to minimizing the L2
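For fixed $\sigma$, the log-likelihood differs from the scaled, negated sum of squared errors only by a constant, so both are extremized at the same $\theta$. A numerical check with a toy model $u(t;\theta) = \theta t$ of my own (not the lecture's ODE):

```python
import math

def log_likelihood(theta, data, sigma=0.5):
    """Gaussian log-likelihood of measurements x_i at times t_i for u(t; theta) = theta*t."""
    ll = 0.0
    for t, x in data:
        mu = theta * t
        ll += -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)
    return ll

def sse(theta, data):
    """Sum of squared errors against the data: the L2 loss."""
    return sum((x - theta * t) ** 2 for t, x in data)

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
thetas = [i / 100 for i in range(100, 300)]
best_ll = max(thetas, key=lambda th: log_likelihood(th, data))
best_sse = min(thetas, key=lambda th: sse(th, data))
print(best_ll, best_sse)  # the maximizer and minimizer coincide
```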
@@ -271,7 +271,7 @@
Thus, for the sampling-based approaches, we simply need to arrive at an array
which is sampled according to the distribution that we want to estimate, and
from that array we can recover the distribution.

-### Sampling Distributions with the Metropolis Hastings Algorithm
+### Sampling Distributions with the Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm is the simplest form of *Markov Chain Monte Carlo*
(MCMC) which gives a way of sampling the $\theta$ distribution. To see how this
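A minimal random-walk Metropolis-Hastings sketch (the standard textbook form, not the lecture's code; the unnormalized Normal(3, 1) stand-in for $p(\theta|D)$ is my own):

```python
import math
import random

random.seed(1)

def log_target(theta):
    """Unnormalized log-density: a Normal(3, 1) posterior stands in for p(theta|D)."""
    return -0.5 * (theta - 3.0) ** 2

def metropolis_hastings(log_p, theta0=0.0, n=50_000, step=1.0):
    samples = []
    theta = theta0
    lp = log_p(theta)
    for _ in range(n):
        proposal = theta + random.gauss(0.0, step)  # symmetric random-walk proposal
        lp_new = log_p(proposal)
        # Accept with probability min(1, p(proposal)/p(theta))
        if math.log(random.random()) < lp_new - lp:
            theta, lp = proposal, lp_new
        samples.append(theta)  # on rejection, the old theta is recorded again
    return samples

samples = metropolis_hastings(log_target)
burned = samples[10_000:]  # discard burn-in before the chain reaches the target
print(sum(burned) / len(burned))  # ≈ 3
```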
@@ -460,7 +460,7 @@
information on symplectic integration, consult

For a full demo of probabilistic programming on a differential equation system,
see
-[this tutorial on Bayesian inference of pendulum parameteres](https://tutorials.juliadiffeq.org/html/models/06-pendulum_bayesian_inference.html)
+[this tutorial on Bayesian inference of pendulum parameters](https://tutorials.juliadiffeq.org/html/models/06-pendulum_bayesian_inference.html)
utilizing DifferentialEquations.jl and DiffEqBayes.jl.

## Bayesian Estimation of Posterior Distributions with Variational Inference
@@ -472,15 +472,15 @@
probabilistic programming has been the development of *Automatic Differentiation
Variational Inference (ADVI)*: a general variational inference method which
is not model-specific and instead uses AD. This has allowed for large expensive
models to get effective distributional estimation, something that wasn't
-previously possible with HMC. In this section we will build up this methodology
+previously possible with HMC. In this section, we will build up this methodology
and understand its performance characteristics.

### ADVI as Optimization

In this form of variational inference, we wish to directly estimate the posterior
distribution. To do so, we pick a functional form to represent the solution
$q(\theta; \phi)$ where $\phi$ are latent variables. We want our resulting
-distribution to fit the posterior, and tus we enforce that:
+distribution to fit the posterior, and thus we enforce that:

$$\phi^\ast = \text{argmin}_{\phi} \text{KL} \left(q(\theta; \phi) \Vert p(\theta | D)\right)$$
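A toy one-dimensional version of this optimization: take $q(\theta; \phi) = \mathcal{N}(\mu, e^{\omega})$ with latent variables $\phi = (\mu, \omega)$, and ascend a Monte Carlo estimate of the ELBO via the reparameterization $\theta = \mu + e^{\omega}\varepsilon$. This is a hypothetical sketch of the idea only: a hand-written gradient stands in for AD, and the unnormalized Normal(3, 1) target is my own choice:

```python
import math
import random

random.seed(2)

def grad_log_p(theta):
    """Gradient of the unnormalized log-posterior, here log p = -(theta - 3)^2 / 2."""
    return -(theta - 3.0)

def advi(mu=0.0, omega=0.0, iters=2000, lr=0.05, n_mc=32):
    """Stochastic gradient ascent on the ELBO for q = Normal(mu, exp(omega))."""
    for _ in range(iters):
        g_mu, g_omega = 0.0, 0.0
        sigma = math.exp(omega)
        for _ in range(n_mc):
            eps = random.gauss(0.0, 1.0)
            theta = mu + sigma * eps  # reparameterization: theta as a function of phi
            g = grad_log_p(theta)
            g_mu += g
            g_omega += g * sigma * eps
        g_mu /= n_mc
        g_omega = g_omega / n_mc + 1.0  # +1 from the entropy term d(log sigma)/d(omega)
        mu += lr * g_mu
        omega += lr * g_omega
        lr *= 0.999  # slowly decay the noisy-gradient step size
    return mu, math.exp(omega)

mu, sigma = advi()
print(mu, sigma)  # should approach the true posterior Normal(3, 1)
```

Each iteration uses only a handful of samples, so the gradient is noisy, exactly as described above; the decaying step size averages that noise out.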

@@ -504,7 +504,7 @@
is a subset of the support of the prior. This means that our prior has to cover
the probability distribution, which makes sense and matches Cromwell's rule for
MCMC.

-At this point we now assume that $q$ is Gaussian. When we rewrite the ELBO in
+At this point, we now assume that $q$ is Gaussian. When we rewrite the ELBO in
terms of the standard Gaussian, we receive an expectation that is automatically
differentiable. Calculating gradients is thus done with AD. Using only one
or a few solves gives a noisy gradient to sample and optimize the latent variables
