
Normalizing Flows



  • Articles
    • Research Publications
    • Tutorials
    • Short Blog Articles
    • Software Implementation
    • Video Lectures
    • Course Materials
    • Others
  • Applications in Astronomy
    • Gravitational Wave Related
    • Cosmological Inference
  • Tools
    • PyTorch Based
    • JAX based
    • Research code

Notes

Basics

Eric Jang's Tutorials (Basics)

  • NF is a type of generative model, similar to GANs and VAEs.

  • A machine-learning model learned from data can be used to:

    1. Generate new data (sampling).
    2. Evaluate the likelihood of data.
    3. Find the conditional relationship between variables.
    4. Score the algorithm.
    • NF focuses on 2, 3, & 4.
  • Distribution: Gaussian:

    • Normal Distribution’s ease-of-use makes it a very popular choice for many generative modeling and reinforcement learning algorithms
    • In Reinforcement Learning - especially continuous control tasks such as robotics - policies are often modeled as multivariate Gaussians with diagonal covariance matrices
    • Shortcomings of the Gaussian: in addition to bad symmetry assumptions, Gaussians have most of their density concentrated at the edges in high dimensions and are not robust to rare events.
    • NF is a way to overcome these shortcomings: it can build arbitrarily complex high-dimensional distributions while keeping the Gaussian's advantages of easy sampling, easy density estimation, and re-parameterizable samples.
  • Change of variables, Change of volume:

    • Mathematically, this locally-linear change in volume is $\left|\operatorname{det}\left(J\left(f^{-1}(x)\right)\right)\right|$, the absolute value of the determinant of the Jacobian of the inverse function.
    • We can see determinants as the local, linearized rate of volume change of a transformation. "The absolute value of the Jacobian determinant at $p$ gives us the factor by which the function $f$ expands or shrinks volumes near $p$"
    • The determinant | Essence of linear algebra by 3Blue1Brown

$$ \log p(y)=\log p\left(f^{-1}(y)\right)+\log \left|\operatorname{det}\left(J\left(f^{-1}(y)\right)\right)\right| $$
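
As a quick numeric sanity check of this formula (a sketch, not from the original tutorial): take $f(z) = \exp(z)$ with $z \sim \mathcal{N}(0, 1)$, so $p(x)$ should match the standard log-normal density.

```python
# Numeric check of the change-of-variables formula for x = f(z) = exp(z), z ~ N(0, 1).
# NumPy/SciPy sketch; the closed-form reference is the log-normal density.
import numpy as np
from scipy import stats

x = np.array([0.5, 1.0, 2.0])
z = np.log(x)                                      # f^{-1}(x)
# log p(x) = log p_z(f^{-1}(x)) + log|det J(f^{-1})(x)|, with d f^{-1}/dx = 1/x
log_p = stats.norm.logpdf(z) + np.log(np.abs(1.0 / x))
assert np.allclose(log_p, stats.lognorm.logpdf(x, s=1.0))
```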

  • A TransformedDistribution in TensorFlow is specified by a base distribution object that we will transform, and a Bijector object that implements

    • A forward transformation
    • Its inverse transformation
    • The log determinant of the Jacobian of the inverse transformation
  • If bijector.forward is a differentiable function, then Y = bijector.forward(x) is a re-parameterizable distribution with respect to samples x = base_distribution.sample().

  • This means that normalizing flows can be used as a drop-in replacement for variational posteriors in a VAE.

  • TensorFlow Distributions makes normalizing flows easy to implement and automatically accumulates all the Jacobian determinants in a chain for us in a clean, highly readable way.
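
A minimal sketch of these pieces, assuming TensorFlow Probability is available as `tensorflow_probability` (the log-normal example is illustrative, not from the wiki's sources):

```python
# A base distribution transformed by a bijector; here exp(Normal) gives a log-normal.
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

base = tfd.Normal(loc=0., scale=1.)                        # simple base distribution
log_normal = tfd.TransformedDistribution(distribution=base, bijector=tfb.Exp())

samples = log_normal.sample(5)            # uses bijector.forward on base samples
log_probs = log_normal.log_prob(samples)  # uses the inverse and its log det Jacobian
```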

  • Normalizing Flows and Learning Flexible Bijectors

    • We can chain any number of bijectors together, much like we chain layers together in a neural network; this is exactly what a normalizing flow does (see the training sketch after this list).
    • If a bijector has tunable parameters, it can be learned (by gradient descent on the transformed distribution's log_prob) to transform our base distribution to suit arbitrary densities.
    • Each bijector functions as a learnable “layer”, and you can use an optimizer to learn the parameters of the transformation to suit the data distribution we are trying to model.
    • We compute and optimize over log probabilities rather than probabilities for numerical stability reasons.
    • See Shakir Mohamed and Danilo Rezende’s UAI talk (slides)
    • There’s a connection between Normalizing Flows and GANs via encoder-decoder GAN architectures that learn the inverse of the generator
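
A minimal training sketch along these lines, assuming TensorFlow and TensorFlow Probability (the simple affine chain and variable names are illustrative, not a full flow):

```python
# Fit a chain of learnable bijectors by maximizing log-likelihood of the data.
import tensorflow as tf
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

shift = tf.Variable([0.0, 0.0])
scale = tf.Variable([1.0, 1.0])          # must stay non-zero for invertibility
flow = tfd.TransformedDistribution(
    distribution=tfd.MultivariateNormalDiag(loc=[0., 0.]),
    bijector=tfb.Chain([tfb.Shift(shift), tfb.Scale(scale)]),  # each bijector = a "layer"
)

optimizer = tf.keras.optimizers.Adam(1e-2)

def train_step(x_batch):
    with tf.GradientTape() as tape:
        nll = -tf.reduce_mean(flow.log_prob(x_batch))   # optimize log probabilities
    grads = tape.gradient(nll, [shift, scale])
    optimizer.apply_gradients(zip(grads, [shift, scale]))
    return nll
```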
  • Downside: computing the determinant of an arbitrary $N \times N$ Jacobian matrix has runtime complexity $O(N^3)$, which is very expensive to put in a neural network.

    • TensorFlow provides a structured affine transformation whose determinant can be computed more efficiently: parameterized as a lower triangular matrix $M$ plus a low rank update: $M+V \cdot D \cdot V^{T}$
    • Then use the matrix determinant lemma to compute its determinant: $\operatorname{det}\left(\mathbf{A}+\mathbf{u v}^{\top}\right)=\operatorname{det}(\mathbf{A})+\mathbf{v}^{\top} \operatorname{adj}(\mathbf{A}) \mathbf{u}$
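
The lemma is easy to verify numerically; the check below uses the equivalent form $\operatorname{det}(\mathbf{A}+\mathbf{u v}^{\top}) = (1+\mathbf{v}^{\top}\mathbf{A}^{-1}\mathbf{u})\operatorname{det}(\mathbf{A})$ (equivalent because $\operatorname{adj}(\mathbf{A}) = \operatorname{det}(\mathbf{A})\mathbf{A}^{-1}$), with illustrative sizes:

```python
# Numeric check of the matrix determinant lemma for a rank-1 update (NumPy).
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N)) + N * np.eye(N)   # a well-conditioned matrix
u = rng.normal(size=(N, 1))
v = rng.normal(size=(N, 1))

lhs = np.linalg.det(A + u @ v.T)                                        # direct O(N^3) determinant
rhs = (1.0 + (v.T @ np.linalg.solve(A, u)).item()) * np.linalg.det(A)   # determinant lemma
assert np.isclose(lhs, rhs)
```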
  • How to achieve non-linear transformation: need an invertible nonlinearity in order to express non-linear functions (otherwise the chain of affine bijectors remains affine).

    • Sigmoid / tanh may seem like good choices, but they are incredibly unstable to invert
    • ReLU is stable, but not invertible for x<0
    • PReLU (parameterized ReLU) works: it is like Leaky ReLU but with a learnable slope in the negative regime, which keeps it invertible (see the sketch below).
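
A PReLU-style bijector is simple to write down; the sketch below (NumPy, with an assumed fixed slope `alpha`) shows the forward map, its inverse, and the log determinant of its diagonal Jacobian:

```python
# PReLU as an invertible element-wise nonlinearity (alpha would be learnable in practice).
import numpy as np

alpha = 0.25   # must be > 0 for invertibility

def prelu_forward(x):
    return np.where(x >= 0, x, alpha * x)

def prelu_inverse(y):
    return np.where(y >= 0, y, y / alpha)

def prelu_log_det_jacobian(x):
    # Element-wise transform => diagonal Jacobian; log|det J| = sum_i log|f'(x_i)|.
    return np.sum(np.where(x >= 0, 0.0, np.log(alpha)), axis=-1)
```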
  • Train the model:

    • Note: For low dimensional distributions, this MLP is a very poor choice of a normalizing flow

Lilian Weng's Tutorials (Basics)

  • GANs and VAEs do not explicitly learn the probability density function of the real data, because the marginalization over the latent variables is intractable.

  • Goal of NF: A good estimation of $p(x)$ makes it possible to efficiently complete many downstream tasks: sample unobserved but realistic new data points (data generation), predict the rareness of future events (density estimation), infer latent variables, and fill in incomplete data samples.

  • A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution $p(x)$, and therefore the loss function is simply the negative log-likelihood.

    • Definition: A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions.
  • Backpropagation in deep learning models requires the embedded probability distribution to be simple enough that its derivatives can be calculated easily and efficiently (e.g. a Gaussian).

  • About determinant:

    • The absolute value of the determinant can be thought of as a measure of “how much multiplication by the matrix expands or contracts space”.
    • The determinant of a square matrix $M$ detects whether it is invertible: if $\operatorname{det}\left(M\right) = 0$ then $M$ is not invertible (a singular matrix with linearly dependent rows or columns, or a row or column that is all zeros).
    • $\operatorname{det}(A B)=\operatorname{det}(A) \operatorname{det}(B)$
    • $\operatorname{det}\left(M^{-1}\right)=(\operatorname{det}(M))^{-1}$
  • A transformation function should satisfy two properties

    1. It is easily invertible.
    2. Its Jacobian determinant is easy to compute. (Not always the case.)
  • Normalizing flow:

    • Starting with an initial variable $\mathbf{z}_{0}$ drawn from a simple base distribution $\pi_{0}$
    • After $K$ steps of "flow", it becomes $\mathbf{x}$.

$$ \begin{aligned} \mathbf{x}=\mathbf{z}_{K} &=f_{K} \circ f_{K-1} \circ \cdots \circ f_{1}\left(\mathbf{z}_{0}\right) \\ \log p(\mathbf{x})=\log \pi_{K}\left(\mathbf{z}_{K}\right) &=\log \pi_{K-1}\left(\mathbf{z}_{K-1}\right)-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\ &=\log \pi_{K-2}\left(\mathbf{z}_{K-2}\right)-\log \left|\operatorname{det} \frac{d f_{K-1}}{d \mathbf{z}_{K-2}}\right|-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\ &=\ldots \\ &=\log \pi_{0}\left(\mathbf{z}_{0}\right)-\sum_{i=1}^{K} \log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right| \end{aligned} $$
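
This accumulation of log-determinant terms is easy to mirror in code. The sketch below (NumPy/SciPy, with two toy invertible maps standing in for learned bijectors) follows the recursion above:

```python
# Accumulate log|det| terms through a chain of simple invertible maps.
import numpy as np
from scipy import stats

def affine(z):        # f_1(z) = 2 z + 1
    return 2.0 * z + 1.0, np.log(np.abs(2.0)) * np.ones_like(z)   # log|det df_1/dz|

def exp_flow(z):      # f_2(z) = exp(z)
    return np.exp(z), z                                           # log|det df_2/dz| = z

z = np.array([-0.3, 0.1, 1.2])
log_p = stats.norm.logpdf(z)              # log pi_0(z_0)
for flow in (affine, exp_flow):
    z, log_det = flow(z)
    log_p = log_p - log_det               # log pi_i(z_i) = log pi_{i-1}(z_{i-1}) - log|det|
# `z` now holds x = f_2(f_1(z_0)) and `log_p` holds log p(x).
```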

  • Each step is based on the change-of-variables theorem:

$$ \begin{aligned} \mathbf{z} & \sim \pi(\mathbf{z}), \mathbf{x}=f(\mathbf{z}), \mathbf{z}=f^{-1}(\mathbf{x}) \\ p(\mathbf{x}) &=\pi(\mathbf{z})\left|\operatorname{det} \frac{d \mathbf{z}}{d \mathbf{x}}\right|=\pi\left(f^{-1}(\mathbf{x})\right)\left|\operatorname{det} \frac{d f^{-1}}{d \mathbf{x}}\right| \end{aligned} $$

  • Apply the inverse function theorem and the property of Jacobians of invertible functions (the determinant of the inverse of an invertible matrix is the inverse of the determinant), and work with $\log$ probabilities.

$$ \log p_{i}\left(\mathbf{z}_{i}\right)=\log p_{i-1}\left(\mathbf{z}_{i-1}\right)-\log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right| $$

  • Gaussians are also used, and often prove too simple, as the PDF for latent variables in Variational Autoencoders (VAEs)

Different Flow Models

  • $\mathbf{z} \in \mathbb{R}^{d}$: random variable
  • $f: \mathbb{R}^{d} \mapsto \mathbb{R}^{d}$: an invertible smooth mapping. Use $f$ to transform $\mathbf{z} \sim q(\mathbf{z})$.
  • Resulting in random variable: $\mathbf{y}=f(\mathbf{z})$
  • $\mathbf{y}$ has the probability density

$$ q_{y}(\mathbf{y})=q(\mathbf{z})\left|\operatorname{det} \frac{\partial f^{-1}}{\partial \mathbf{z}}\right|=q(\mathbf{z})\left|\operatorname{det} \frac{\partial f}{\partial \mathbf{z}}\right|^{-1} $$

  • The main requirement for a flow is that the determinant of the Jacobian be easy to calculate.

Simple Flows

Planar Flow

$$ f(\mathbf{z})=\mathbf{z}+\mathbf{u} h\left(\mathbf{w}^{T} \mathbf{z}+b\right) $$

  • Where $h$ is a smooth element-wise non-linearity and $\psi(\mathbf{z})=h^{\prime}\left(\mathbf{w}^{T} \mathbf{z}+b\right) \mathbf{w}$

  • Determinant:

$$ \left|\operatorname{det} \frac{\partial f}{\partial \mathbf{z}}\right|=\left|1+\mathbf{u}^{T} \psi(\mathbf{z})\right| $$

  • We can think of it as slicing the $z$-space with straight lines (or hyperplanes), where each line contracts or expands the space around it
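
A minimal sketch of a single planar-flow step (NumPy, with $h=\tanh$; the parameter values are random and purely illustrative):

```python
# Planar flow f(z) = z + u * h(w^T z + b), with log|det df/dz| = log|1 + u^T psi(z)|,
# where psi(z) = h'(w^T z + b) * w.
import numpy as np

rng = np.random.default_rng(1)
d = 2
u, w, b = rng.normal(size=d), rng.normal(size=d), 0.5

def planar_forward(z):
    return z + u * np.tanh(w @ z + b)

def planar_log_det(z):
    psi = (1.0 - np.tanh(w @ z + b) ** 2) * w   # h'(a) * w with h = tanh
    return np.log(np.abs(1.0 + u @ psi))

z = rng.normal(size=d)
x = planar_forward(z)
log_det = planar_log_det(z)
```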
Radial Flow

$$ f(\mathbf{z})=\mathbf{z}+\beta h(\alpha, r)\left(\mathbf{z}-\mathbf{z}_{0}\right) $$

  • Where $r=\left\|\mathbf{z}-\mathbf{z}_{0}\right\|_{2}$ and $h(\alpha, r)=\frac{1}{\alpha+r}$

  • Similarly to planar flows, radial flows introduce spheres in the $z$-space.

Discussion
  • These simple flows are useful only for low dimensional spaces, since each transformation affects only a small volume in the original space.

  • Each mapping behaves as a hidden layer of a neural network with one hidden unit and a skip connection. Since a single hidden unit is not very expressive, we need a lot of transformations.

  • It is not easy to enhance expressivity, since NF requires a Jacobian determinant that can be computed easily.

Autoregressive Flows

  • We can introduce dependencies between different dimensions of the latent variable, and still end up with a tractable Jacobian.

    • Namely, if after a transformation, the dimension $i$ of the resulting variable depends only on dimensions 1:$i$ of the input variable, then the Jacobian of this transformation is triangular.
    • A determinant of a triangular matrix is equal to the product of the terms on the diagonal.
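
A quick numerical illustration of this triangularity (NumPy, with a toy autoregressive transform whose functional forms are arbitrary):

```python
# The Jacobian of an autoregressive transform is (lower) triangular,
# so its determinant is the product of the diagonal entries.
import numpy as np

def autoregressive_transform(z):
    # Output dimension i depends only on input dimensions 1..i.
    y = np.empty_like(z)
    y[0] = np.tanh(z[0])
    y[1] = z[1] * np.exp(z[0])
    y[2] = z[2] + z[0] * z[1]
    return y

z0 = np.array([0.3, -0.7, 1.1])
eps = 1e-6
J = np.column_stack([
    (autoregressive_transform(z0 + eps * e) - autoregressive_transform(z0 - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
assert np.allclose(J, np.tril(J), atol=1e-6)                   # triangular
assert np.isclose(np.linalg.det(J), np.prod(np.diag(J)))       # det = product of diagonal
```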
  • Autoregressive model

    • An autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes
    • The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term); thus the model is in the form of a stochastic difference equation. $$ X_{t}=c+\sum_{i=1}^{p} \varphi_{i} X_{t-i}+\varepsilon_{t} $$
    • $\varphi_{i}$ are the parameters of the model.
    • $\varepsilon_{t}$ is white noise.
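
For concreteness, a tiny AR(2) simulation of this difference equation (NumPy; the coefficients are illustrative):

```python
# Simulate X_t = c + phi_1 X_{t-1} + phi_2 X_{t-2} + eps_t with Gaussian white noise.
import numpy as np

rng = np.random.default_rng(0)
c, phi = 0.5, np.array([0.6, -0.2])
x = np.zeros(200)
for t in range(2, len(x)):
    x[t] = c + phi @ x[t - 2:t][::-1] + rng.normal(scale=0.1)
```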
[Autoregressive Transformation]

Publication in Astronomy

  • Using a very flexible model for a color-magnitude population density would reduce potential model-dependent bias from entering distance estimates.

  • Deep learning can also be applied to many optimization problems besides regression, such as density estimation, which is what it is used for in this paper.

    • Density estimation is the problem of fitting a function that models an unknown probability distribution and can be approached with deep learning using a model called a normalizing flow.
  • Deep learning can be thought of as a recursive generalized linear regression - you repeatedly compute linear regression (fitting a hyperplane) on an input, following each regression with an element-wise nonlinearity (such as converting negative values to zero).

    • In the case of normalizing flows, as we will see, we also apply a mask over the linear regression weights to make the function have a triangular Jacobian matrix
  • Why use NF here: The path of stellar evolution has many sharp turns and discontinuities, so, while the spread of stars about a single isochrone may be Gaussian-like due to a Gaussian spread in ages and metallicities, the overall CMD is better modelled with a highly flexible model that can express sharp turns. This is where deep learning can be extremely helpful, as GMMs perform poorly at modeling the contours of the CMD.

  • Normalizing flows are not yet popular for density estimation in astrophysics or the natural sciences in general, although they are being used for some likelihood-free inference applications derived from Papamakarios et al. (2018) in Cosmology and Particle Physics, for example in the MadMiner package for LHC data Brehmer et al. (2019), and PyDELFI for Cosmology Alsing et al. (2019).

Model design
  • Assumptions:

    • For a given star, there will be other stars with similar photometry, as is assumed for any regression-type optimization problem
  • This architecture models the joint posterior density for Gaia magnitudes: $P(g, b p-r p, b p-g)$

  • This posterior models the density of stars in Gaia DR2 and weights stars observed in Gaia DR2 by $\sigma_{P}^{-2}$, where $\sigma_{P}^{2}=\sigma_{g}^{2}+\sigma_{b p-r p}^{2}+\sigma_{b p-g}^{2}$.

    • Which is a metric for the signal-to-noise ratio of data points, much like when calculating a mean from uncertain measurements, one weights each measurement by the inverse variance
  • During training, we sample 32 parallaxes for each Gaia DR2 data point from the truncated normal distribution:

$$ P(\varpi) \propto\left\{\begin{array}{cl} \exp \left(-\frac{\left(\varpi-\varpi_{\mathrm{obs}}\right)^{2}}{2 \sigma^{2}}\right), & \varpi>0 \\ 0, & \varpi \leq 0, \end{array}\right. $$
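
A sketch of this sampling step for a single star (SciPy's truncnorm; `plx_obs` and `plx_err` are illustrative values, not from the paper):

```python
# Draw 32 parallax samples from a normal truncated at zero, then convert to distances.
import numpy as np
from scipy import stats

plx_obs, plx_err = 1.5, 0.4                       # observed parallax and uncertainty (mas)
a, b = (0.0 - plx_obs) / plx_err, np.inf          # truncation bounds in standardized units
plx_samples = stats.truncnorm.rvs(a, b, loc=plx_obs, scale=plx_err, size=32)
distances = 1.0 / plx_samples                     # kpc for parallaxes in mas
```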

  • Training and evaluation take different approaches due to the computational expense. During training, we calculate a distance as the inverse of each of the 32 parallax samples drawn from the above equation, and take the mean of these distance samples as the current best-estimate distance.

    • For every best-estimate distance, we query Bayestar to get a reddening correction.
  • Next, using the current model for $P(g, b p-r p, b p-g)$, we calculate the probability of each $(g, b p-r p, b p-g)$.

  • These likelihood values are treated as weights, and the weights are used to calculate a new best-estimate for distance via a weighted sum:

$$ d_{\text {best }}=\frac{\sum_{i} d_{i} P\left(g_{i}, b p-r p, b p-g\right)}{\sum_{i} P\left(g_{i}, b p-r p, b p-g\right)} $$

  • This $d_{\rm best}$ is then fed back into the loop and a new reddening is found using Bayestar. This iteration is repeated 5 times during training, due to the computational expense, but 10 times during evaluation
  • Use the best distance to get the best dust correction. Then calculate the new color and magnitude.
    • We do not use $d_{\rm best}$ to calculate the final $g$ since this could create a feedback loop for the probability density and create unphysical artifacts, which we experimentally observed.
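
A rough sketch of this training-time iteration (NumPy; `query_bayestar_reddening` and `flow_prob` are hypothetical placeholders for the Bayestar dust-map query and the current flow density, not the paper's actual code):

```python
# Iterate: best distance -> reddening -> per-sample likelihood weights -> new best distance.
import numpy as np

def iterate_distance(distance_samples, coords, flow_prob, query_bayestar_reddening, n_iter=5):
    d_best = distance_samples.mean()                 # initial best-estimate distance
    for _ in range(n_iter):                          # 5 iterations in training, 10 in evaluation
        reddening = query_bayestar_reddening(coords, d_best)
        # Likelihood of the (dereddened) colour-magnitude point for each distance sample.
        weights = np.array([flow_prob(d, reddening) for d in distance_samples])
        d_best = np.sum(distance_samples * weights) / np.sum(weights)   # weighted mean
    return d_best
```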
  • The final standard deviations:

$$ \begin{aligned} \sigma_{g}^{2} &=\operatorname{Var}\left(g_{1}, \ldots, g_{32}\right)+\sigma_{G, \text { Bayestar }}^{2} \\ \sigma_{b p-r p}^{2} &=\sigma_{b p-r p, \text { Bayestar }}^{2} \\ \sigma_{b p-g}^{2} &=\sigma_{b p-g, \text { Bayestar }}^{2} \end{aligned} $$

  • The final loss function is:

$$ -\frac{1}{\sigma_{P}^{2}} \log \left\{P(g, b p-r p, b p-g)_{\text {best }}\right\} $$

  • We sum this over all stars in the DR2 catalog, after calculating the best-estimate color-magnitude points, and minimize.
Model architecture
  • A normalizing flow can roughly be thought of as a Gaussian that has been parametrically warped by an invertible neural network.
  • Uses the Masked Autoregressive Flow (MAF):

$$ \begin{aligned} y_{1} &=\mu_{1}+\sigma_{1} z_{1} \\ \forall i>1: y_{i} &=\mu\left(y_{1: i-1}\right)+\sigma\left(y_{1: i-1}\right) z_{i}, \end{aligned} $$

  • This equation is similar to the common matrix multiply followed by vector addition that is found in neural networks, but with a particular mask applied

$$ \forall i>1: z_{i}=\frac{y_{i}-\mu\left(y_{1: i-1}\right)}{\sigma\left(y_{1: i-1}\right)} $$

which transforms from our data variables (y) to a latent space (z) where we set the Gaussian.
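
A minimal sketch of this forward/inverse pair (NumPy; `mu_fn` and `sigma_fn` stand in for the masked neural networks, and the first dimension uses $\mu_1=0$, $\sigma_1=1$ for simplicity):

```python
# MAF-style transform: y_i = mu(y_{1:i-1}) + sigma(y_{1:i-1}) * z_i, and its inverse.
import numpy as np

def mu_fn(y_prev):    return 0.1 * np.sum(y_prev)
def sigma_fn(y_prev): return np.exp(0.05 * np.sum(y_prev))   # keep sigma positive

def maf_forward(z):
    """Latent z -> data y; inherently sequential."""
    y = np.empty_like(z)
    y[0] = z[0]
    for i in range(1, len(z)):
        y[i] = mu_fn(y[:i]) + sigma_fn(y[:i]) * z[i]
    return y

def maf_inverse(y):
    """Data y -> latent z; each z_i depends only on y_{1:i}, so density evaluation is fast."""
    z = np.empty_like(y)
    z[0] = y[0]
    for i in range(1, len(y)):
        z[i] = (y[i] - mu_fn(y[:i])) / sigma_fn(y[:i])
    return z

z = np.array([0.3, -1.2, 0.8])
assert np.allclose(maf_inverse(maf_forward(z)), z)
```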

  • This transform can be repeated to form a complex flow.
  • Autoregressive models can also be used to exactly marginalize over inputs in the case of missing data (such as missing photometry)
  • The model uses a sequence of blocks of MADE → BatchNorm → Reverse
    • MADE: Masked Autoencoder for Distribution Estimation model defined in Germain et al. (2015)
      • The MADE model essentially applies three densely-connected neural network layers with a mask (hence, a masked autoencoder) applied to the weights at each layer to satisfy properties of the neural flow.
      • One can think of this transform as parametrizing an arbitrary bijective vector field, where the vectors show the flow of points from distribution to distribution.
    • BatchNorm is a batch norm-like layer in Dinh et al. (2016)
      • The BatchNorm is the equivalent of a batch normalization for normalizing flows
    • Reverse is the reversing layer found in Dinh et al. (2016)
      • Reverse permutes the order of the probability variables since the MADE’s mask treats each slightly differently
    • The BatchNorm and Reverse layers help regularize training.
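
One common way to assemble such a MADE → BatchNorm → Reverse stack, assuming TensorFlow Probability (the class names are TFP's, but the hyperparameters here are illustrative and not the paper's):

```python
# A MAF built from repeated MADE -> BatchNorm -> Permute(reverse) blocks.
import tensorflow_probability as tfp
tfd, tfb = tfp.distributions, tfp.bijectors

ndim, n_blocks = 3, 4
bijectors = []
for _ in range(n_blocks):
    made = tfb.AutoregressiveNetwork(params=2, hidden_units=[32, 32])  # masked dense layers
    bijectors.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made))
    bijectors.append(tfb.BatchNormalization())
    bijectors.append(tfb.Permute(permutation=list(range(ndim))[::-1]))  # "Reverse" layer

maf = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(0., 1.), sample_shape=[ndim]),
    bijector=tfb.Chain(bijectors),
)
```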
Model optimization
  • We conduct a hyperparameter search for the normalizing flow over 80 different models, finding the best model using Bayesian optimization with summed-log-likelihood as an optimization metric.
  • The Gaia data are processed in mini-batches of size 2048.
    • A smaller batch size leads to artifacts in the density map; much larger batch sizes result in early convergence to less accurate models.