
Normalizing Flows



  • Articles
    • Research Publications
    • Tutorials
    • Short Blog Articles
    • Software Implementation
    • Video Lectures
    • Course Materials
    • Others
  • Applications in Astronomy
    • Gravitational Wave Related
    • Cosmological Inference
  • Tools
    • PyTorch Based
    • JAX based
    • Research code

Notes

Basics

Eric Jang's Tutorials (Basics)

  • NF is a type of generative model, similar to GANs and VAEs.

  • A machine-learning model learned from data can be used to:

    1. Generate new data (sampling).
    2. Evaluate the likelihood of data.
    3. Find the conditional relationship between variables.
    4. Score the algorithm.
    • NF focuses on 2, 3, & 4.
  • Distribution: Gaussian:

    • Normal Distribution’s ease-of-use makes it a very popular choice for many generative modeling and reinforcement learning algorithms
    • In Reinforcement Learning - especially continuous control tasks such as robotics - policies are often modeled as multivariate Gaussians with diagonal covariance matrices
    • Shortcomings of the Gaussian: in addition to bad symmetry assumptions, Gaussians have most of their density concentrated at the edges in high dimensions and are not robust to rare events.
    • NF is a way to overcome these shortcomings: it can build arbitrarily complex high-dimensional distributions while keeping the Gaussian's advantages of easy sampling, easy density estimation, and re-parameterizable samples.
  • Change of variables, Change of volume:

    • Mathematically, this locally-linear change in volume is $\left|\operatorname{det}\left(J\left(f^{-1}(x)\right)\right)\right|$, the absolute value of the determinant of the Jacobian of the inverse function.
    • We can see determinants as the local, linearized rate of volume change of a transformation. "The absolute value of the Jacobian determinant at $p$ gives us the factor by which the function $f$ expands or shrinks volumes near $p$"
    • The determinant | Essence of linear algebra by 3Blue1Brown

$$ \log p(y)=\log p\left(f^{-1}(y)\right)+\log \left|\operatorname{det}\left(J\left(f^{-1}(y)\right)\right)\right| $$
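
As a quick numeric sanity check of this formula (a sketch, not from the original tutorial): take $f(z) = \exp(z)$ with $z \sim \mathcal{N}(0, 1)$, so $p(x)$ should match the standard log-normal density.

```python
# Numeric check of the change-of-variables formula for x = f(z) = exp(z), z ~ N(0, 1).
# NumPy/SciPy sketch; the closed-form reference is the log-normal density.
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0])
z = np.log(x)                                      # f^{-1}(x)
# log p(x) = log p_z(f^{-1}(x)) + log|det J(f^{-1})(x)|, with d f^{-1}/dx = 1/x
log_p = stats.norm.logpdf(z) + np.log(np.abs(1.0 / x))
assert np.allclose(log_p, stats.lognorm.logpdf(x, s=1.0))
```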

  • A TransformedDistribution in TensorFlow is specified by a base distribution object that we will transform, and a Bijector object that implements

    • A forward transformation
    • Its inverse transformation
    • The log determinant of the Jacobian of the inverse transformation
  • If bijector.forward is a differentiable function, then Y = bijector.forward(x) is a re-parameterizable distribution with respect to samples x = base_distribution.sample().

  • This means that normalizing flows can be used as a drop-in replacement for variational posteriors in a VAE.

  • TensorFlow Distributions makes normalizing flows easy to implement and automatically accumulates all the Jacobian determinants in a chain for us in a clean, highly readable way.
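
A minimal sketch of these pieces, assuming TensorFlow Probability is available as `tensorflow_probability` (the log-normal example is illustrative, not from the wiki's sources):

```python
# A base distribution transformed by a bijector; here exp(Normal) gives a log-normal.
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

base = tfd.Normal(loc=0., scale=1.)                        # simple base distribution
log_normal = tfd.TransformedDistribution(distribution=base, bijector=tfb.Exp())

samples = log_normal.sample(5)            # uses bijector.forward on base samples
log_probs = log_normal.log_prob(samples)  # uses the inverse and its log det Jacobian
```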

  • Normalizing Flows and Learning Flexible Bijectors

    • We can chain any number of bijectors together, much like we chain layers together in a neural network; this is exactly what a normalizing flow does (see the training sketch after this list).
    • If a bijector has tunable parameters, it can be learned (by gradient descent on the transformed distribution's log_prob) to transform our base distribution to suit arbitrary densities.
    • Each bijector functions as a learnable “layer”, and you can use an optimizer to learn the parameters of the transformation to suit the data distribution we are trying to model.
    • We compute and optimize over log probabilities rather than probabilities for numerical stability reasons.
    • See Shakir Mohamed and Danilo Rezende’s UAI talk (slides)
    • There’s a connection between Normalizing Flows and GANs via encoder-decoder GAN architectures that learn the inverse of the generator
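
A minimal training sketch along these lines, assuming TensorFlow and TensorFlow Probability (the simple affine chain and variable names are illustrative, not a full flow):

```python
# Fit a chain of learnable bijectors by maximizing log-likelihood of the data.
import tensorflow as tf
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

shift = tf.Variable([0.0, 0.0])
scale = tf.Variable([1.0, 1.0])          # must stay non-zero for invertibility
flow = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=[0., 0.]),
    bijector=tfb.Chain([tfb.Shift(shift), tfb.Scale(scale)]),  # each bijector = a "layer"
)

optimizer = tf.keras.optimizers.Adam(1e-2)

def train_step(x_batch):
    with tf.GradientTape() as tape:
        nll = -tf.reduce_mean(flow.log_prob(x_batch))   # optimize log probabilities
    grads = tape.gradient(nll, [shift, scale])
    optimizer.apply_gradients(zip(grads, [shift, scale]))
    return nll
```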
  • Downside: computing the determinant of an arbitrary $N \times N$ Jacobian matrix has runtime complexity $O(N^3)$, which is very expensive to put in a neural network.

    • TensorFlow provides a structured affine transformation whose determinant can be computed more efficiently: parameterized as a lower triangular matrix $M$ plus a low rank update: $M+V \cdot D \cdot V^{T}$
    • Then use the matrix determinant lemma to compute its determinant: $\operatorname{det}\left(\mathbf{A}+\mathbf{u v}^{\top}\right)=\operatorname{det}(\mathbf{A})+\mathbf{v}^{\top} \operatorname{adj}(\mathbf{A}) \mathbf{u}$
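
The lemma is easy to verify numerically; the check below uses the equivalent form $\operatorname{det}(\mathbf{A}+\mathbf{u v}^{\top}) = (1+\mathbf{v}^{\top}\mathbf{A}^{-1}\mathbf{u})\operatorname{det}(\mathbf{A})$ (equivalent because $\operatorname{adj}(\mathbf{A}) = \operatorname{det}(\mathbf{A})\mathbf{A}^{-1}$), with illustrative sizes:

```python
# Numeric check of the matrix determinant lemma for a rank-1 update (NumPy).
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N)) + N * np.eye(N)   # a well-conditioned matrix
u = rng.normal(size=(N, 1))
v = rng.normal(size=(N, 1))

lhs = np.linalg.det(A + u @ v.T)                                        # direct O(N^3) determinant
rhs = (1.0 + (v.T @ np.linalg.solve(A, u)).item()) * np.linalg.det(A)   # determinant lemma
assert np.isclose(lhs, rhs)
```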
  • How to achieve non-linear transformation: need an invertible nonlinearity in order to express non-linear functions (otherwise the chain of affine bijectors remains affine).

    • Sigmoid / tanh may seem like good choices, but they are incredibly unstable to invert
    • ReLU is stable, but not invertible for x<0
    • PReLU (parameterized ReLU) works: it is like Leaky ReLU but with a learnable slope in the negative regime, which keeps it invertible (see the sketch below).
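
A PReLU-style bijector is simple to write down; the sketch below (NumPy, with an assumed fixed slope `alpha`) shows the forward map, its inverse, and the log determinant of its diagonal Jacobian:

```python
# PReLU as an invertible element-wise nonlinearity (alpha would be learnable in practice).
import numpy as np

alpha = 0.25   # must be > 0 for invertibility

def prelu_forward(x):
    return np.where(x >= 0, x, alpha * x)

def prelu_inverse(y):
    return np.where(y >= 0, y, y / alpha)

def prelu_log_det_jacobian(x):
    # Element-wise transform => diagonal Jacobian; log|det J| = sum_i log|f'(x_i)|.
    return np.sum(np.where(x >= 0, 0.0, np.log(alpha)), axis=-1)
```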
  • Train the model:

    • Note: For low dimensional distributions, this MLP is a very poor choice of a normalizing flow

Lilian Weng's Tutorials (Basics)

  • GANs and VAEs do not explicitly learn the probability density function of the real data, because the marginalization over the latent variables is intractable.

  • Goal of NF: A good estimation of $p(x)$ makes it possible to efficiently complete many downstream tasks: sample unobserved but realistic new data points (data generation), predict the rareness of future events (density estimation), infer latent variables, and fill in incomplete data samples.

  • A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution $p(x)$, and therefore the loss function is simply the negative log-likelihood.

    • Definition: A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions.
  • Backpropagation in deep learning models requires the embedded probability distribution to be simple enough that its derivatives can be calculated easily and efficiently (e.g. a Gaussian).

  • About determinant:

    • The absolute value of the determinant can be thought of as a measure of “how much multiplication by the matrix expands or contracts space”.
    • The determinant of a square matrix $M$ detects whether it is invertible: if $\operatorname{det}\left(M\right) = 0$ then $M$ is not invertible (a singular matrix with linearly dependent rows or columns, or a row or column that is all zeros).
    • $\operatorname{det}(A B)=\operatorname{det}(A) \operatorname{det}(B)$
    • $\operatorname{det}\left(M^{-1}\right)=(\operatorname{det}(M))^{-1}$
  • A transformation function should satisfy two properties

    1. It is easily invertible.
    2. Its Jacobian determinant is easy to compute. (Not always the case.)
  • Normalizing flow:

    • Starting with an initial variable $\mathbf{z}_{0}$ drawn from a simple base distribution $\pi_{0}$
    • After $K$ steps of "flow", it becomes $\mathbf{x}$.

$$ \begin{aligned} \mathbf{x}=\mathbf{z}_{K} &=f_{K} \circ f_{K-1} \circ \cdots \circ f_{1}\left(\mathbf{z}_{0}\right) \\ \log p(\mathbf{x})=\log \pi_{K}\left(\mathbf{z}_{K}\right) &=\log \pi_{K-1}\left(\mathbf{z}_{K-1}\right)-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\ &=\log \pi_{K-2}\left(\mathbf{z}_{K-2}\right)-\log \left|\operatorname{det} \frac{d f_{K-1}}{d \mathbf{z}_{K-2}}\right|-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\ &=\ldots \\ &=\log \pi_{0}\left(\mathbf{z}_{0}\right)-\sum_{i=1}^{K} \log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right| \end{aligned} $$
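
This accumulation of log-determinant terms is easy to mirror in code. The sketch below (NumPy/SciPy, with two toy invertible maps standing in for learned bijectors) follows the recursion above:

```python
# Accumulate log|det| terms through a chain of simple invertible maps.
import numpy as np
from scipy import stats

def affine(z):        # f_1(z) = 2 z + 1
    return 2.0 * z + 1.0, np.log(np.abs(2.0)) * np.ones_like(z)   # log|det df_1/dz|

def exp_flow(z):      # f_2(z) = exp(z)
    return np.exp(z), z                                           # log|det df_2/dz| = z

z = np.array([-0.3, 0.1, 1.2])
log_p = stats.norm.logpdf(z)              # log pi_0(z_0)
for flow in (affine, exp_flow):
    z, log_det = flow(z)
    log_p = log_p - log_det               # log pi_i(z_i) = log pi_{i-1}(z_{i-1}) - log|det|
# `z` now holds x = f_2(f_1(z_0)) and `log_p` holds log p(x).
```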

  • Each step is based on the change-of-variables theorem:

$$ \begin{aligned} \mathbf{z} & \sim \pi(\mathbf{z}), \mathbf{x}=f(\mathbf{z}), \mathbf{z}=f^{-1}(\mathbf{x}) \\ p(\mathbf{x}) &=\pi(\mathbf{z})\left|\operatorname{det} \frac{d \mathbf{z}}{d \mathbf{x}}\right|=\pi\left(f^{-1}(\mathbf{x})\right)\left|\operatorname{det} \frac{d f^{-1}}{d \mathbf{x}}\right| \end{aligned} $$

  • Apply the inverse function theorem and the property of Jacobians of invertible functions (the determinant of the inverse of an invertible matrix is the inverse of the determinant), and work with $\log$ probabilities.

$$ \log p_{i}\left(\mathbf{z}_{i}\right)=\log p_{i-1}\left(\mathbf{z}_{i-1}\right)-\log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right| $$

  • Gaussians are also used, and often prove too simple, as the PDF for latent variables in Variational Autoencoders (VAEs)

Different Flow Models

  • $\mathbf{z} \in \mathbb{R}^{d}$: random variable
  • $f: \mathbb{R}^{d} \mapsto \mathbb{R}^{d}$: an invertible smooth mapping. Use $f$ to transform $\mathbf{z} \sim q(\mathbf{z})$.
  • Resulting in random variable: $\mathbf{y}=f(\mathbf{z})$
  • $\mathbf{y}$ has the probability density

$$ q_{y}(\mathbf{y})=q(\mathbf{z})\left|\operatorname{det} \frac{\partial f^{-1}}{\partial \mathbf{z}}\right|=q(\mathbf{z})\left|\operatorname{det} \frac{\partial f}{\partial \mathbf{z}}\right|^{-1} $$

  • The main requirement for a flow is that the determinant of the Jacobian be easy to calculate.

Simple Flows

Planar Flow

$$ f(\mathbf{z})=\mathbf{z}+\mathbf{u} h\left(\mathbf{w}^{T} \mathbf{z}+b\right) $$

  • Where $h$ is a smooth element-wise non-linearity and $\psi(\mathbf{z})=h^{\prime}\left(\mathbf{w}^{T} \mathbf{z}+b\right) \mathbf{w}$

  • Determinant:

$$ \left|\operatorname{det} \frac{\partial f}{\partial \mathbf{z}}\right|=\left|1+\mathbf{u}^{T} \psi(\mathbf{z})\right| $$

  • We can think of it as slicing the $z$-space with straight lines (or hyperplanes), where each line contracts or expands the space around it
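
A minimal sketch of a single planar-flow step (NumPy, with $h=\tanh$; the parameter values are random and purely illustrative):

```python
# Planar flow f(z) = z + u * h(w^T z + b), with log|det df/dz| = log|1 + u^T psi(z)|,
# where psi(z) = h'(w^T z + b) * w.
import numpy as np

rng = np.random.default_rng(1)
d = 2
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.5

def planar_forward(z):
    return z + u * np.tanh(w @ z + b)

def planar_log_det(z):
    psi = (1.0 - np.tanh(w @ z + b) ** 2) * w   # h'(a) * w with h = tanh
    return np.log(np.abs(1.0 + u @ psi))

z = rng.normal(size=d)
x = planar_forward(z)
log_det = planar_log_det(z)
```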
Radial Flow

$$ f(\mathbf{z})=\mathbf{z}+\beta h(\alpha, r)\left(\mathbf{z}-\mathbf{z}_{0}\right) $$

  • Where $r=\left\|\mathbf{z}-\mathbf{z}_{0}\right\|_{2}$ and $h(\alpha, r)=\frac{1}{\alpha+r}$

  • Similarly to planar flows, radial flows introduce spheres in the $z$-space.

Discussion
  • These simple flows are useful only for low dimensional spaces, since each transformation affects only a small volume in the original space.

  • Each mapping behaves as a hidden layer of a neural network with one hidden unit and a skip connection. Since a single hidden unit is not very expressive, we need a lot of transformations.

  • It is not easy to enhance expressivity, since NF requires a Jacobian determinant that can be computed easily.

Autoregressive Flows

  • We can introduce dependencies between different dimensions of the latent variable, and still end up with a tractable Jacobian.

    • Namely, if after a transformation, the dimension $i$ of the resulting variable depends only on dimensions 1:$i$ of the input variable, then the Jacobian of this transformation is triangular.
    • A determinant of a triangular matrix is equal to the product of the terms on the diagonal.
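
A quick numerical illustration of this triangularity (NumPy, with a toy autoregressive transform whose functional forms are arbitrary):

```python
# The Jacobian of an autoregressive transform is (lower) triangular,
# so its determinant is the product of the diagonal entries.
import numpy as np

def autoregressive_transform(z):
    # Output dimension i depends only on input dimensions 1..i.
    y = np.empty_like(z)
    y[0] = np.tanh(z[0])
    y[1] = z[1] * np.exp(z[0])
    y[2] = z[2] + z[0] * z[1]
    return y

z0 = np.array([0.3, -0.7, 1.1])
eps = 1e-6
J = np.column_stack([
    (autoregressive_transform(z0 + eps * e) - autoregressive_transform(z0 - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(J, np.tril(J), atol=1e-6)                   # triangular
assert np.isclose(np.linalg.det(J), np.prod(np.diag(J)))       # det = product of diagonal
```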
  • Autoregressive model

    • An autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes
    • The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term); thus the model is in the form of a stochastic difference equation. $$ X_{t}=c+\sum_{i=1}^{p} \varphi_{i} X_{t-i}+\varepsilon_{t} $$
    • $\varphi_{i}$ are the parameters of the model.
    • $\varepsilon_{t}$ is white noise.
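
For concreteness, a tiny AR(2) simulation of this difference equation (NumPy; the coefficients are illustrative):

```python
# Simulate X_t = c + phi_1 X_{t-1} + phi_2 X_{t-2} + eps_t with Gaussian white noise.
import numpy as np

rng = np.random.default_rng(0)
c, phi = 0.5, np.array([0.6, -0.2])
x = np.zeros(200)
for t in range(2, len(x)):
    x[t] = c + phi @ x[t - 2:t][::-1] + rng.normal(scale=0.1)
```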
[Autoregressive Transformation]

Publication in Astronomy

  • Using a very flexible model for a color-magnitude population density would reduce potential model-dependent bias from entering distance estimates.

  • Deep learning can also be applied to many optimization problems besides regression, such as density estimation, which is what it is used for in this paper.

    • Density estimation is the problem of fitting a function that models an unknown probability distribution and can be approached with deep learning using a model called a normalizing flow.
  • Deep learning can be thought of as a recursive generalized linear regression - you repeatedly compute linear regression (fitting a hyperplane) on an input, following each regression with an element-wise nonlinearity (such as converting negative values to zero).

    • In the case of normalizing flows, as we will see, we also apply a mask over the linear regression weights to make the function have a triangular Jacobian matrix
  • Why use NF here: The path of stellar evolution has many sharp turns and discontinuities, so, while the spread of stars about a single isochrone may be Gaussian-like due to a Gaussian spread in ages and metallicities, the overall CMD is better modelled with a highly flexible model that can express sharp turns. This is where deep learning can be extremely helpful, as GMMs perform poorly at modeling the contours of the CMD.

  • Normalizing flows are not yet popular for density estimation in astrophysics or the natural sciences in general, although they are being used for some likelihood-free inference applications derived from Papamakarios et al. (2018) in Cosmology and Particle Physics, for example in the MadMiner package for LHC data Brehmer et al. (2019), and PyDELFI for Cosmology Alsing et al. (2019).

Model design
  • Assumptions:

    • For a given star, there will be other stars with similar photometry, as is assumed for any regression-type optimization problem
  • This architecture models the joint posterior density for Gaia magnitudes: $P(g, b p-r p, b p-g)$

  • This posterior models the density of stars in Gaia DR2 and weights stars observed in Gaia DR2 by $\sigma_{P}^{-2}$, where $\sigma_{P}^{2}=\sigma_{g}^{2}+\sigma_{b p-r p}^{2}+\sigma_{b p-g}^{2}$.

    • Which is a metric for the signal-to-noise ratio of data points, much like when calculating a mean from uncertain measurements, one weights each measurement by the inverse variance
  • During training, we sample 32 parallaxes for each Gaia DR2 data point from the truncated normal distribution:

$$ P(\varpi) \propto\left\{\begin{array}{cl} \exp \left(-\frac{\left(\varpi-\varpi_{\mathrm{obs}}\right)^{2}}{2 \sigma^{2}}\right), & \varpi>0 \\ 0, & \varpi \leq 0, \end{array}\right. $$
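
A sketch of this sampling step for a single star (SciPy's truncnorm; `plx_obs` and `plx_err` are illustrative values, not from the paper):

```python
# Draw 32 parallax samples from a normal truncated at zero, then convert to distances.
import numpy as np
from scipy import stats

plx_obs, plx_err = 1.5, 0.4                       # observed parallax and uncertainty (mas)
a, b = (0.0 - plx_obs) / plx_err, np.inf          # truncation bounds in standardized units
plx_samples = stats.truncnorm.rvs(a, b, loc=plx_obs, scale=plx_err, size=32)
distances = 1.0 / plx_samples                     # kpc for parallaxes in mas
```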

  • Training and evaluation take different approaches due to the computational expense. During training, we calculate a distance as the inverse of each of the 32 parallax samples drawn from the above equation, and take the mean of these distance samples as the current best-estimate distance.

    • For every best-estimate distance, we query Bayestar to get a reddening correction.
  • Next, using the current model for $P(g, b p-r p, b p-g)$, we calculate the probability of each $(g, b p-r p, b p-g)$.

  • These likelihood values are treated as weights, and the weights are used to calculate a new best-estimate for distance via a weighted sum:

$$ d_{\text {best }}=\frac{\sum_{i} d_{i} P\left(g_{i}, b p-r p, b p-g\right)}{\sum_{i} P\left(g_{i}, b p-r p, b p-g\right)} $$

  • This $d_{\rm best}$ is then fed back into the loop and a new reddening is found using Bayestar. This iteration is repeated 5 times during training, due to the computational expense, but 10 times during evaluation
  • Use the best distance to get the best dust correction. Then calculate the new color and magnitude.
    • We do not use $d_{\rm best}$ to calculate the final $g$ since this could create a feedback loop for the probability density and create unphysical artifacts, which we experimentally observed.
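
A rough sketch of this training-time iteration (NumPy; `query_bayestar_reddening` and `flow_prob` are hypothetical placeholders for the Bayestar dust-map query and the current flow density, not the paper's actual code):

```python
# Iterate: best distance -> reddening -> per-sample likelihood weights -> new best distance.
import numpy as np

def iterate_distance(distance_samples, coords, flow_prob, query_bayestar_reddening, n_iter=5):
    d_best = distance_samples.mean()                 # initial best-estimate distance
    for _ in range(n_iter):                          # 5 iterations in training, 10 in evaluation
        reddening = query_bayestar_reddening(coords, d_best)
        # Likelihood of the (dereddened) colour-magnitude point for each distance sample.
        weights = np.array([flow_prob(d, reddening) for d in distance_samples])
        d_best = np.sum(distance_samples * weights) / np.sum(weights)   # weighted mean
    return d_best
```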
  • The final standard deviations:

$$ \begin{aligned} \sigma_{g}^{2} &=\operatorname{Var}\left(g_{1}, \ldots, g_{32}\right)+\sigma_{G, \text { Bayestar }}^{2} \\ \sigma_{b p-r p}^{2} &=\sigma_{b p-r p, \text { Bayestar }}^{2} \\ \sigma_{b p-g}^{2} &=\sigma_{b p-g, \text { Bayestar }}^{2} \end{aligned} $$

  • The final loss function is:

$$ -\frac{1}{\sigma_{P}^{2}} \log \left\{P(g, b p-r p, b p-g)_{\text {best }}\right\} $$

  • We sum this over all stars in the DR2 catalog, after calculating the best-estimate color-magnitude points, and minimize.
Model architecture
  • A normalizing flow can roughly be thought of as a Gaussian that has been parametrically warped by an invertible neural network.
  • Uses the Masked Autoregressive Flow (MAF):

$$ \begin{aligned} y_{1} &=\mu_{1}+\sigma_{1} z_{1} \\ \forall i>1: y_{i} &=\mu\left(y_{1: i-1}\right)+\sigma\left(y_{1: i-1}\right) z_{i}, \end{aligned} $$

  • This equation is similar to the common matrix multiply followed by vector addition that is found in neural networks, but with a particular mask applied

$$ \forall i>1: z_{i}=\frac{y_{i}-\mu\left(y_{1: i-1}\right)}{\sigma\left(y_{1: i-1}\right)} $$

which transforms from our data variables (y) to a latent space (z) where we set the Gaussian.
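
A minimal sketch of this forward/inverse pair (NumPy; `mu_fn` and `sigma_fn` stand in for the masked neural networks, and the first dimension uses $\mu_1=0$, $\sigma_1=1$ for simplicity):

```python
# MAF-style transform: y_i = mu(y_{1:i-1}) + sigma(y_{1:i-1}) * z_i, and its inverse.
import numpy as np

def mu_fn(y_prev):    return 0.1 * np.sum(y_prev)
def sigma_fn(y_prev): return np.exp(0.05 * np.sum(y_prev))   # keep sigma positive

def maf_forward(z):
    """Latent z -> data y; inherently sequential."""
    y = np.empty_like(z)
    y[0] = z[0]
    for i in range(1, len(z)):
        y[i] = mu_fn(y[:i]) + sigma_fn(y[:i]) * z[i]
    return y

def maf_inverse(y):
    """Data y -> latent z; each z_i depends only on y_{1:i}, so density evaluation is fast."""
    z = np.empty_like(y)
    z[0] = y[0]
    for i in range(1, len(y)):
        z[i] = (y[i] - mu_fn(y[:i])) / sigma_fn(y[:i])
    return z

z = np.array([0.3, -1.2, 0.8])
assert np.allclose(maf_inverse(maf_forward(z)), z)
```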

  • This transform can be repeated to form a complex flow.
  • Autoregressive models can also be used to exactly marginalize over inputs in the case of missing data (such as missing photometry)
  • The model uses a sequence of blocks of MADE → BatchNorm → Reverse
    • MADE: Masked Autoencoder for Distribution Estimation model defined in Germain et al. (2015)
      • The MADE model essentially applies three densely-connected neural network layers with a mask (hence, a masked autoencoder) applied to the weights at each layer to satisfy properties of the neural flow.
      • One can think of this transform as parametrizing an arbitrary bijective vector field, where the vectors show the flow of points from distribution to distribution.
    • BatchNorm is a batch norm-like layer in Dinh et al. (2016)
      • The BatchNorm is the equivalent of a batch normalization for normalizing flows
    • Reverse is the reversing layer found in Dinh et al. (2016)
      • Reverse permutes the order of the probability variables since the MADE’s mask treats each slightly differently
    • The BatchNorm and Reverse layers help regularize training.
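
One common way to assemble such a MADE → BatchNorm → Reverse stack, assuming TensorFlow Probability (the class names are TFP's, but the hyperparameters here are illustrative and not the paper's):

```python
# A MAF built from repeated MADE -> BatchNorm -> Permute(reverse) blocks.
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

ndim, n_blocks = 3, 4
bijectors = []
for _ in range(n_blocks):
    made = tfb.AutoregressiveNetwork(params=2, hidden_units=[32, 32])  # masked dense layers
    bijectors.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made))
    bijectors.append(tfb.BatchNormalization())
    bijectors.append(tfb.Permute(permutation=list(range(ndim))[::-1]))  # "Reverse" layer

maf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(0., 1.), sample_shape=[ndim]),
    bijector=tfb.Chain(bijectors),
)
```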
Model optimization
  • We conduct a hyperparameter search for the normalizing flow over 80 different models, finding the best model using Bayesian optimization with summed-log-likelihood as an optimization metric.
  • The Gaia data are processed in mini-batches of size 2048.
    • A smaller batch size leads to artifacts in the density map; much larger batch sizes result in early convergence to less accurate models.