Normalizing Flows
- 2021-01-04
- #inference #machine_learning #statistics
- Has connection to ML - Simulation-based Inference
-
Starting Point:
- A list of awesome resources on normalizing flows: awesome-normalizing-flows
-
Normalizing Flows: An Introduction and Review of Current Methods
-
Short Notes on Divergence Measures by Danilo Jimenez Rezende
-
- Flow-based models are powerful tools for designing probabilistic models with tractable density. This paper introduces Convex Potential Flows (CP-Flow), a natural and efficient parameterization of invertible models inspired by the optimal transport (OT) theory.
-
Flows for simultaneous manifold learning and density estimation
- We introduce manifold-learning flows (M-flows), a new class of generative models that simultaneously learn the data manifold as well as a tractable probability density on that manifold
-
- Efficient gradient computation of the Jacobian determinant term is a core problem of the normalizing flow framework.
- We propose Self Normalizing Flows, a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer
-
By Eric Jang:
-
By Adam Kosiorek
-
By Lilian Weng
-
Content in Chinese:
-
Normalizing Flows - Introduction by pyro.ai
- This tutorial introduces Pyro's normalizing flow library
- Prerequisite: Tensor shapes in Pyro
-
Normalizing Flows Overview for PyMC3
- This notebook reveals some tips and tricks for using normalizing flows effectively in PyMC3.
-
By Laurent Dinh
-
By Ari Seff
-
By Marcus Brubaker
-
Deep Generative Models - CS236 - Fall 2019 at Stanford
- A simple visual explanation by Miles Cranmer on Twitter
-
Cranmer et al. (2019) - Modeling the Gaia Color-Magnitude Diagram with Bayesian Neural Flows to Constrain Distance Estimates
- We demonstrate an algorithm for learning a flexible color-magnitude diagram from noisy parallax and photometry measurements using a normalizing flow, a deep neural network capable of learning an arbitrary multi-dimensional probability distribution.
- Demo code can be found here: xd_vs_flow
- Based on pytorch-flows
-
Kodi Ramanah et al. (2020) - Dynamical mass inference of galaxy clusters with neural flows
- A key aspect of our novel algorithm is that it yields the probability density function of the mass of a particular cluster, thereby providing a principled way of quantifying uncertainties
-
Hortua et al. (2020) - Constraining the Reionization History using Bayesian Normalizing Flows
- We present the use of Bayesian Neural Networks (BNNs) to predict the posterior distribution for four astrophysical and cosmological parameters.
- We demonstrate the advantages of Normalizing Flows (NF) combined with BNNs, being able to model more complex output distributions and thus capture key information such as non-Gaussianities in the parameter conditional density distributions for astrophysical and cosmological datasets.
-
Reiman et al. (2020) - Fully probabilistic quasar continua predictions near Lyman-$\alpha$ with conditional neural spline flows
-
Wong et al. (2020) - Gravitational wave population inference with deep flow-based generative network
- We combine hierarchical Bayesian modeling with a flow-based deep generative network, in order to demonstrate that one can efficiently constrain numerical gravitational wave (GW) population models at a previously intractable complexity.
-
Green & Gair (2020) - Complete parameter inference for GW150914 using deep learning
-
Alsing et al. (2019) - Fast likelihood-free cosmology with neural density estimators and active learning
-
Alsing & Wandelt (2019) - Nuisance hardened data compression for fast likelihood-free inference
-
Diaz Rivero & Dvorkin (2020) - Flow-based likelihoods for non-Gaussian inference
-
pytorch-flows
- PyTorch implementations of algorithms for density estimation - Note: this is research code.
- PyTorch implementations of Masked Autoregressive Flow and some other invertible transformations from Glow: Generative Flow with Invertible 1x1 Convolutions and Density estimation using Real NVP.
-
manifold-flow
- Manifold-learning flows (ℳ-flows) - Note: this is research code
- In the paper Flows for simultaneous manifold learning and density estimation we introduce manifold-learning flows or ℳ-flows, a new class of generative models that simultaneously learn the data manifold as well as a tractable probability density on that manifold.
-
- Note: this is research code
- For the paper FFJORD: Free-form Continuous Dynamics for Scalable Reversible Generative Models
- Needs torchdiffeq - Differentiable ODE solvers with full GPU support
-
NF is a type of generative model, similar to GANs and VAEs.
-
A machine learning model learned from data can be used to:
1. Generate new data (sampling)
2. Evaluate the likelihood of data
3. Find the conditional relationship between variables
4. Score the algorithm
- NF focuses on 2, 3, and 4.
-
Distribution: Gaussian:
- Normal Distribution’s ease-of-use makes it a very popular choice for many generative modeling and reinforcement learning algorithms
- In Reinforcement Learning - especially continuous control tasks such as robotics - policies are often modeled as multivariate Gaussians with diagonal covariance matrices
- Shortcomings of the Gaussian distribution: In addition to bad symmetry assumptions, Gaussians have most of their density concentrated at the edges in high dimensions and are not robust to rare events.
- NF is one way to overcome these shortcomings: it can build sufficiently complex high-dimensional distributions, while keeping the Gaussian's advantages of easy sampling, easy density evaluation, and re-parameterizable samples.
-
Change of variables, Change of volume:
- Mathematically, this locally-linear change in volume is $\left|\operatorname{det}\left(J\left(f^{-1}(x)\right)\right)\right|$, the absolute value of the determinant of the Jacobian of the inverse function.
- We can see determinants as the local, linearized rate of volume change of a transformation. "The absolute value of the Jacobian determinant at $p$ gives us the factor by which the function $f$ expands or shrinks volumes near $p$" - The determinant | Essence of linear algebra by 3Blue1Brown
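For reference, the change-of-variables identity behind that volume factor, written with the same symbols (an invertible $f$ maps $z$ with base density $\pi$ to $x=f(z)$):
$$ p(x)=\pi\left(f^{-1}(x)\right)\left|\operatorname{det} J\left(f^{-1}(x)\right)\right| $$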
-
A TransformedDistribution in TensorFlow is specified by a base distribution object that we will transform, and a Bijector object that implements:
- A forward transformation
- Its inverse transformation
- The inverse log determinant of the Jacobian
-
If bijector.forward is a differentiable function, then Y = bijector.forward(x) is a re-parameterizable distribution with respect to samples x = base_distribution.sample().
-
This means that normalizing flows can be used as a drop-in replacement for variational posteriors in a VAE.
-
TensorFlow distributions makes normalizing flows easy to implement, and automatically accumulates all the Jacobian determinants in a chain for us in a way that is clean and highly readable -
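A minimal sketch of this API using TensorFlow Probability's tfd/tfb namespaces (the specific bijectors chosen here are illustrative; names can differ slightly across TFP versions):

```python
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

base = tfd.Normal(loc=0., scale=1.)
bijector = tfb.Chain([tfb.Shift(1.0), tfb.Scale(2.0)])  # y = 2 * x + 1
dist = tfd.TransformedDistribution(distribution=base, bijector=bijector)

x = base.sample(5)
y = bijector.forward(x)                                      # forward transformation
x_back = bijector.inverse(y)                                 # its inverse
ildj = bijector.inverse_log_det_jacobian(y, event_ndims=0)   # inverse log |det J|
log_p = dist.log_prob(y)                                     # = base.log_prob(x) + ildj
```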
Normalizing Flows and Learning Flexible Bijectors
- We can chain any number of bijectors together, much like we chain layers together in a neural network. This is exactly what a normalizing flow does.
- If a bijector has tunable parameters with respect to which bijector.log_prob is differentiable, then the bijector can actually be learned to transform our base distribution to suit arbitrary densities.
- Each bijector functions as a learnable "layer", and you can use an optimizer to learn the parameters of the transformation to suit the data distribution we are trying to model (see the sketch after this list).
- We compute and optimize over log probabilities rather than probabilities for numerical stability reasons
- See Shakir Mohamed and Danilo Rezende's UAI talk (slides)
- There’s a connection between Normalizing Flows and GANs via encoder-decoder GAN architectures that learn the inverse of the generator
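As a sketch of that idea (the toy data and the simple Shift/Scale chain here are illustrative assumptions, not from the original post), one can fit a chain of bijectors by minimizing the negative log-likelihood:

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

# Toy 1-D target: samples from a shifted, scaled Gaussian
data = 3.0 * tf.random.normal([2048]) + 2.0

shift = tf.Variable(0.0)
log_scale = tf.Variable(0.0)
opt = tf.keras.optimizers.Adam(0.05)

for _ in range(500):
    with tf.GradientTape() as tape:
        # A chain of two learnable bijector "layers": y = exp(log_scale) * z + shift
        flow = tfd.TransformedDistribution(
            distribution=tfd.Normal(0., 1.),
            bijector=tfb.Chain([tfb.Shift(shift), tfb.Scale(tf.exp(log_scale))]))
        nll = -tf.reduce_mean(flow.log_prob(data))  # optimize log probabilities
    grads = tape.gradient(nll, [shift, log_scale])
    opt.apply_gradients(zip(grads, [shift, log_scale]))
```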
-
Downside: computing the determinant of an arbitrary $N \times N$ Jacobian matrix has runtime complexity $O(N^3)$, which is very expensive to put in a neural network.
- TensorFlow provides a structured affine transformation whose determinant can be computed more efficiently: it is parameterized as a lower triangular matrix $M$ plus a low-rank update, $M+V \cdot D \cdot V^{T}$.
- Then use the matrix determinant lemma to compute its determinant: $\operatorname{det}\left(\mathbf{A}+\mathbf{u v}^{\top}\right)=\operatorname{det}(\mathbf{A})+\mathbf{v}^{\top} \operatorname{adj}(\mathbf{A}) \mathbf{u}$
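A quick NumPy check of the rank-one case of that lemma (the matrices here are arbitrary examples):

```python
import numpy as np

# det(A + u v^T) = det(A) + v^T adj(A) u, with adj(A) = det(A) * inv(A) for invertible A
A = np.array([[2.0, 0.0], [1.0, 3.0]])   # lower-triangular example, det(A) = 6
u = np.array([1.0, 2.0])
v = np.array([0.5, 1.0])

adj_A = np.linalg.det(A) * np.linalg.inv(A)
lhs = np.linalg.det(A + np.outer(u, v))
rhs = np.linalg.det(A) + v @ adj_A @ u
assert np.isclose(lhs, rhs)
```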
-
-
How to achieve non-linear transformation: need an invertible nonlinearity in order to express non-linear functions (otherwise the chain of affine bijectors remains affine).
- Sigmoid / tanh may seem like good choices, but they are incredibly unstable to invert
- ReLU is stable, but not invertible for x<0
- PReLU (parameterized ReLU) works: it is Leaky ReLU but with a learnable slope in the negative regime
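A sketch of PReLU as an invertible element-wise bijection (plain NumPy, with illustrative function names rather than any library API):

```python
import numpy as np

def prelu_forward(x, alpha):
    # y = x for x >= 0, alpha * x for x < 0; invertible as long as alpha > 0
    return np.where(x >= 0, x, alpha * x)

def prelu_inverse(y, alpha):
    return np.where(y >= 0, y, y / alpha)

def prelu_log_det_jacobian(x, alpha):
    # dy/dx is 1 for x >= 0 and alpha for x < 0; the Jacobian of an element-wise
    # map is diagonal, so log|det J| just sums log(alpha) over the negative entries
    return np.where(x >= 0, 0.0, np.log(alpha)).sum(axis=-1)
```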
-
Train the model:
- Note: For low dimensional distributions, this MLP is a very poor choice of a normalizing flow
-
GANs and VAEs do not explicitly learn the probability density function of real data, because marginalizing over the latent variables is intractable.
-
Goal of NF: A good estimation of
$p(x)$ makes it possible to efficiently complete many downstream tasks: sample unobserved but realistic new data points (data generation), predict the rareness of future events (density estimation), infer latent variables, fill in incomplete data samples -
A flow-based generative model is constructed by a sequence of invertible transformations. Unlike the other two, the model explicitly learns the data distribution $p(x)$ and therefore the loss function is simply the negative log-likelihood.
- Definition: A normalizing flow transforms a simple distribution into a complex one by applying a sequence of invertible transformation functions.
-
Backpropagation in deep learning models requires the embedded probability distribution to be simple enough that its derivatives can be calculated easily and efficiently (e.g. Gaussian).
-
About determinant:
- The absolute value of the determinant can be thought of as a measure of “how much multiplication by the matrix expands or contracts space”.
- The determinant of a square matrix $M$ detects whether it is invertible: if $\operatorname{det}(M)=0$ then $M$ is not invertible (a singular matrix with linearly dependent rows or columns; or any row or column is all 0)
- $\operatorname{det}(A B)=\operatorname{det}(A) \operatorname{det}(B)$
- $\operatorname{det}\left(M^{-1}\right)=(\operatorname{det}(M))^{-1}$
-
A transformation function should satisfy two properties
- It is easily invertible.
- Its Jacobian determinant is easy to compute. (Not always the case.)
-
Normalizing flow:
- Starting with an initial distribution $\mathbf{z}_{0}$
- After $K$ steps of "flow", it becomes $\mathbf{x}$.

$$
\begin{aligned}
\mathbf{x}=\mathbf{z}_{K} &=f_{K} \circ f_{K-1} \circ \cdots \circ f_{1}\left(\mathbf{z}_{0}\right) \\
\log p(\mathbf{x})=\log \pi_{K}\left(\mathbf{z}_{K}\right) &=\log \pi_{K-1}\left(\mathbf{z}_{K-1}\right)-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\
&=\log \pi_{K-2}\left(\mathbf{z}_{K-2}\right)-\log \left|\operatorname{det} \frac{d f_{K-1}}{d \mathbf{z}_{K-2}}\right|-\log \left|\operatorname{det} \frac{d f_{K}}{d \mathbf{z}_{K-1}}\right| \\
&=\ldots \\
&=\log \pi_{0}\left(\mathbf{z}_{0}\right)-\sum_{i=1}^{K} \log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right|
\end{aligned}
$$
- Each step is based on the change-of-variables theorem (a numerical check follows the equation below):
- Apply the inverse function theorem and the property of the Jacobian of an invertible function (the determinant of the inverse of an invertible matrix is the inverse of the determinant), and work with the $\log$ probability.
$$ \log p_{i}\left(\mathbf{z}_{i}\right)=\log p_{i-1}\left(\mathbf{z}_{i-1}\right)-\log \left|\operatorname{det} \frac{d f_{i}}{d \mathbf{z}_{i-1}}\right| $$
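A tiny numerical check of this bookkeeping with two hand-picked invertible maps (NumPy/SciPy; the maps and seed are arbitrary):

```python
import numpy as np
from scipy.stats import norm

# Two invertible maps applied to z0 ~ N(0, 1):
#   f1(z) = 2 * z   (log|det df1/dz| = log 2)
#   f2(z) = z + 1   (log|det df2/dz| = 0)
z0 = np.random.default_rng(0).normal(size=5)
x = 2.0 * z0 + 1.0

# log p(x) = log pi_0(z0) - sum_i log|det df_i / dz_{i-1}|
log_p = norm.logpdf(z0) - np.log(2.0) - 0.0

# Closed form for comparison: x ~ N(1, 2^2)
assert np.allclose(log_p, norm.logpdf(x, loc=1.0, scale=2.0))
```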
- Gaussians are also used, and often prove too simple, as the PDF for latent variables in Variational Autoencoders (VAEs)
-
$\mathbf{z} \in \mathbb{R}^{d}$ : random variable -
$f: \mathbb{R}^{d} \mapsto \mathbb{R}^{d}$: an invertible smooth mapping. Use $f$ to transform $\mathbf{z} \sim q(\mathbf{z})$.
- Resulting random variable: $\mathbf{y}=f(\mathbf{z})$
- $\mathbf{y}$ has the probability distribution $q(\mathbf{y})=q(\mathbf{z})\left|\operatorname{det} \frac{\partial f}{\partial \mathbf{z}}\right|^{-1}$
- The requirement for the flow is mainly an easy calculation of determinant of the Jacobian.
-
Where $h$ is an element-wise non-linearity -
Determinant:
- We can think of it as slicing the
$z$ -space with straight lines (or hyperplanes), where each line contracts or expands the space around it
-
Where $r=\left\|\mathbf{z}-\mathbf{z}_{0}\right\|_{2}$ and $h(\alpha, r)=\frac{1}{\alpha+r}$
-
Similarly to planar flows, radial flows introduce spheres in the
$z$ -space.
-
These simple flows are useful only for low dimensional spaces, since each transformation affects only a small volume in the original space.
-
Each mapping behaves as a hidden layer of a neural network with one hidden unit and a skip connection. Since a single hidden unit is not very expressive, we need a lot of transformations.
- But see: Sylvester Normalizing Flows for Variational Inference; Code: sylvester-flows
-
Not easy to enhance expressivity: NF requires a Jacobian determinant that can be easily computed.
-
We can introduce dependencies between different dimensions of the latent variable, and still end up with a tractable Jacobian.
- Namely, if after a transformation, dimension $i$ of the resulting variable depends only on dimensions $1{:}i$ of the input variable, then the Jacobian of this transformation is triangular.
- The determinant of a triangular matrix is equal to the product of its diagonal entries (see the sketch below).
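A small NumPy illustration of this triangular structure for an affine autoregressive transform (the conditioner functions are arbitrary placeholders):

```python
import numpy as np

# Affine autoregressive transform: y_i = mu(x_{<i}) + exp(s(x_{<i})) * x_i
def mu(prefix):
    return 0.1 * prefix.sum()            # placeholder conditioner

def log_s(prefix):
    return 0.05 * (prefix ** 2).sum()    # placeholder conditioner

x = np.random.default_rng(1).normal(size=4)
y = np.empty_like(x)
log_det = 0.0
for i in range(len(x)):
    s = log_s(x[:i])
    y[i] = mu(x[:i]) + np.exp(s) * x[i]
    # dy_i/dx_j = 0 for j > i, so the Jacobian is lower triangular and
    # log|det J| is just the sum of the log diagonal terms, i.e. the s values
    log_det += s
```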
-
- An autoregressive (AR) model is a representation of a type of random process; as such, it is used to describe certain time-varying processes
- The autoregressive model specifies that the output variable depends linearly on its own previous values and on a stochastic term (an imperfectly predictable term); thus the model is in the form of a stochastic difference equation. $$ X_{t}=c+\sum_{i=1}^{p} \varphi_{i} X_{t-i}+\varepsilon_{t} $$
-
$\varphi_{i}$ are the parameters of the model. -
$\varepsilon_{t}$ is white noise.
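A minimal simulation of this difference equation (an AR(2) with arbitrary coefficients, purely for illustration):

```python
import numpy as np

# X_t = c + phi_1 * X_{t-1} + phi_2 * X_{t-2} + eps_t
rng = np.random.default_rng(42)
c, phi1, phi2 = 0.5, 0.6, -0.2
x = np.zeros(200)
for t in range(2, len(x)):
    x[t] = c + phi1 * x[t - 1] + phi2 * x[t - 2] + rng.normal(scale=0.1)
```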
Modeling the Gaia Color-Magnitude Diagram with Bayesian Neural Flows to Constrain Distance Estimates
-
Using a very flexible model for a color-magnitude population density would reduce potential model-dependent bias from entering distance estimates.
-
Deep learning can also be applied to many optimization problems beyond regression, such as density estimation, which is what we use it for in this paper.
- Density estimation is the problem of fitting a function that models an unknown probability distribution and can be approached with deep learning using a model called a normalizing flow.
-
Deep learning can be thought of as a recursive generalized linear regression - you repeatedly compute linear regression (fitting a hyperplane) on an input, following each regression with an element-wise nonlinearity (such as converting negative values to zero).
- In the case of normalizing flows, as we will see, we also apply a mask over the linear regression weights to make the function have a triangular Jacobian matrix
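A toy NumPy illustration of masking the regression weights so that the resulting map has a triangular Jacobian (the weights and mask are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(3, 3))
b = rng.normal(size=3)
M = np.tril(np.ones((3, 3)))    # keep only lower-triangular weights

x = rng.normal(size=3)
y = (W * M) @ x + b             # masked "linear regression" layer
J = W * M                       # Jacobian w.r.t. x is triangular by construction
assert np.isclose(np.linalg.det(J), np.prod(np.diag(J)))
```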
-
Why using NF here: The path of stellar evolution has many sharp turns and discontinuities, so, while the spread of stars about a single isochrone may be Gaussian-like due to a Gaussian spread in ages and metallicities, the overall CMD is better modelled with a highly flexible model that can express sharp turns, which is where deep learning can be extremely helpful, as GMMs perform poorly at modeling the contours of the CMD
-
Normalizing flows are not yet popular for density estimation in astrophysics or the natural sciences in general, although they are being used for some likelihood-free inference applications derived from Papamakarios et al. (2018) in Cosmology and Particle Physics, for example in the MadMiner package for LHC data (Brehmer et al. 2019) and PyDELFI for Cosmology (Alsing et al. 2019).
-
Assumptions:
- For a given star, there will be other stars with similar photometry, as is assumed for any regression-type optimization problem
-
This architecture models the joint posterior density for Gaia magnitudes:
$P(g, b p-r p, b p-g)$ -
This posterior models the density of stars in Gaia DR2 and weights stars observed in Gaia DR2 by $\sigma_{P}^{-2}$, where $\sigma_{P}^{2}=\sigma_{g}^{2}+\sigma_{bp-rp}^{2}+\sigma_{bp-g}^{2}$.
- Which is a metric for the signal-to-noise ratio of data points, much like when calculating a mean from uncertain measurements, one weights each measurement by the inverse variance
-
During training, we sample 32 parallaxes for each Gaia DR2 data point from the truncated normal distribution:
-
Training and evaluation take different approaches due to the computational expense. During training, we calculate a distance by inverting each of the 32 parallax samples drawn above, and take the mean of these distance samples as the current best estimate of the distance (the full loop is sketched in code below).
- For every best-estimate distance, we query Bayestar to get a reddening correction.
-
Next, using the current model for $P(g, bp-rp, bp-g)$, we calculate the probability of each $(g, bp-rp, bp-g)$.
These likelihood values are treated as weights, and the weights are used to calculate a new best-estimate for distance via a weighted sum:
- This $d_{\rm best}$ is then fed back into the loop and a new reddening is found using Bayestar. This iteration is repeated 5 times during training, due to the computational expense, but 10 times during evaluation.
- Use the best distance to get the best dust correction, then calculate the new color and magnitude.
- We do not use $d_{\rm best}$ to calculate the final $g$, since this could create a feedback loop for the probability density and create unphysical artifacts, which we experimentally observed.
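A heavily hedged sketch of the iterative loop described above; flow_log_prob, query_bayestar, and dust_correct are hypothetical callables standing in for the trained flow, the Bayestar dust query, and the photometric correction, and the details differ from the paper's actual implementation:

```python
import numpy as np

def iterate_distance(parallax_mu, parallax_sigma, flow_log_prob, query_bayestar,
                     dust_correct, photometry, n_samples=32, n_iter=5, seed=0):
    """Sketch of the training-time distance / reddening loop (hypothetical API)."""
    rng = np.random.default_rng(seed)
    parallaxes = rng.normal(parallax_mu, parallax_sigma, size=n_samples)
    distances = 1.0 / parallaxes[parallaxes > 0]          # distance ~ 1 / parallax
    d_best = distances.mean()                             # initial best estimate
    for _ in range(n_iter):                               # 5 iterations in training, 10 in evaluation
        reddening = query_bayestar(d_best)                # dust query at the current distance
        colors_mags = dust_correct(photometry, distances, reddening)
        weights = np.exp(flow_log_prob(colors_mags))      # likelihoods under the current flow
        d_best = np.average(distances, weights=weights)   # weighted best-estimate distance
    return d_best
```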
- The final standard deviations:
- The final loss function is:
- We sum this over all stars in the DR2 catalog, after calculating the best-estimate color-magnitude points, and minimize.
- A normalizing flow can roughly be thought of as a Gaussian that has been parametrically warped by an invertible neural network.
- Uses the Masked Autoregressive Flow (MAF):
- This equation is similar to the common matrix multiply followed by vector addition that is found in neural networks, but with a particular mask applied, which transforms from our data variables ($y$) to a latent space ($z$) where we set the Gaussian.
- This transform can be repeated to form a complex flow.
- Autoregressive models can also be used to exactly marginalize over inputs in the case of missing data (such as missing photometry)
- Model uses a sequence of blocks of MADE --> BatchNorm --> Reverse (see the sketch after the layer descriptions below)
-
MADE: Masked Autoencoder for Distribution Estimation model defined in Germain et al. (2015)
- The MADE model essentially applies three densely-connected neural network layers with a mask (hence, a masked autoencoder) applied to the weights at each layer to satisfy properties of the neural flow.
- One can think of this transform as parametrizing an arbitrary bijective vector field, where the vectors show the flow of points from distribution to distribution.
-
BatchNorm is a batch norm-like layer in Dinh et al. (2016)
- The BatchNorm is the equivalent of a batch normalization for normalizing flows
-
Reverse is the reversing layer found in Dinh et al. (2016)
- Reverse permutes the order of the probability variables since the MADE's mask treats each slightly differently
-
- The BatchNorm and Reverse layers help regularize training.
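A hedged TFP-style sketch of this block structure (the dimensionality, layer sizes, and block count are illustrative, not the paper's; bijector names may differ across TFP versions):

```python
import tensorflow_probability as tfp

tfd, tfb = tfp.distributions, tfp.bijectors

dim, num_blocks = 3, 5                      # illustrative sizes
bijectors = []
for _ in range(num_blocks):
    made = tfb.AutoregressiveNetwork(params=2, hidden_units=[64, 64, 64])
    bijectors.append(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=made))  # MADE block
    bijectors.append(tfb.BatchNormalization())                                   # BatchNorm layer
    bijectors.append(tfb.Permute(permutation=list(range(dim))[::-1]))            # Reverse layer

flow = tfd.TransformedDistribution(
    distribution=tfd.Sample(tfd.Normal(0., 1.), sample_shape=[dim]),
    bijector=tfb.Chain(bijectors))
# flow.log_prob(batch_of_colors_and_magnitudes) would then be summed and maximized.
```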
-
- We conduct a hyperparameter search for the normalizing flow over 80 different models, finding the best model using Bayesian optimization with summed-log-likelihood as an optimization metric.
- The Gaia data are processed in mini-batches of size 2048
- A smaller batch size leads to artifacts in the density map; much larger batch sizes result in early convergence to less accurate models.