diff --git a/README.md b/README.md
index f9fbf66..5e7cac8 100644
--- a/README.md
+++ b/README.md
@@ -2,3 +2,89 @@
 [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://turinglang.github.io/NormalizingFlows.jl/dev/)
 [![Build Status](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml?query=branch%3Amain)
+
+
+A normalizing flow library for Julia.
+
+The purpose of this package is to provide a simple and flexible interface for variational inference (VI) and normalizing flows (NF) for Bayesian computation or generative modeling.
+The key focus is to ensure modularity and extensibility, so that users can easily
+construct (e.g., define customized flow layers) and combine various components
+(e.g., choose different VI objectives or gradient estimates)
+for variational approximation of general target distributions,
+without being tied to specific probabilistic programming frameworks or applications.
+
+See the [documentation](https://turinglang.org/NormalizingFlows.jl/dev/) for more.
+
+## Installation
+To install the package, run the following command in the Julia REPL:
+```julia
+] # enter Pkg mode
+(@v1.9) pkg> add git@github.com:TuringLang/NormalizingFlows.jl.git
+```
+Then simply run the following command to use the package:
+```julia
+using NormalizingFlows
+```
+
+## Quick recap of normalizing flows
+Normalizing flows transform a simple reference distribution $q_0$ (sometimes known as the base distribution) to
+a complex distribution $q$ using invertible functions.
+
+In more detail, given the base distribution, usually a standard Gaussian distribution, i.e., $q_0 = \mathcal{N}(0, I)$,
+we apply a series of parameterized invertible transformations (called flow layers), $T_{1, \theta_1}, \cdots, T_{N, \theta_N}$, yielding
+```math
+Z_N = T_{N, \theta_N} \circ \cdots \circ T_{1, \theta_1} (Z_0) , \quad Z_0 \sim q_0,\quad Z_N \sim q_{\theta},
+```
+where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learned, and $q_{\theta}$ is the variational distribution (flow distribution). This describes the **sampling procedure** of normalizing flows, which requires sending draws from the base distribution through a forward pass of these flow layers.
+
+Since all the transformations are invertible (technically [diffeomorphic](https://en.wikipedia.org/wiki/Diffeomorphism)), we can evaluate the density of a normalizing flow distribution $q_{\theta}$ by the change of variables formula:
+```math
+q_\theta(x)=\frac{q_0\left(T_1^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)}{\prod_{n=1}^N J_n\left(T_n^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)} \quad J_n(x)=\left|\operatorname{det} \nabla_x
+T_n(x)\right|.
+```
+Here we drop the subscript $\theta_n, n = 1, \dots, N$ for simplicity.
+Density evaluation of a normalizing flow requires computing the **inverse** and the
+**Jacobian determinant** of each flow layer.
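+
+For example, a minimal sketch of this workflow, built from a simple shift-and-scale transformation in `Bijectors.jl` (illustrative only), shows both operations: sampling via the forward pass, and density evaluation via the change of variables formula:
+```julia
+using Distributions, Bijectors, LinearAlgebra
+
+q₀ = MvNormal(zeros(2), I)                                 # base distribution
+layer = Bijectors.Shift([1.0, -1.0]) ∘ Bijectors.Scale([2.0, 0.5])
+flow = Bijectors.transformed(q₀, layer)                    # flow distribution q_θ
+
+x = rand(flow)      # sampling: draw z₀ ~ q₀ and push it through the flow layer
+logpdf(flow, x)     # density evaluation: uses the inverse map and the log-Jacobian
+```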
+
+Given the feasibility of i.i.d. sampling and density evaluation, normalizing flows can be trained by minimizing some statistical distance to the target distribution $p$.
+The typical choice of the statistical distance is the forward or reverse Kullback-Leibler (KL) divergence, which leads to the following optimization problems:
+```math
+\begin{aligned}
+\text{Reverse KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
+&= \argmin _{\theta} \mathbb{E}_{q_0}\left[\log \frac{q_\theta(T_N\circ \cdots \circ T_1(Z_0))}{p(T_N\circ \cdots \circ T_1(Z_0))}\right] \\
+&= \argmax _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ T_1(Z_0)\right)\right]
+\end{aligned}
+```
+and
+```math
+\begin{aligned}
+\text{Forward KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \\
+&= \argmax _{\theta} \mathbb{E}_{p}\left[\log q_\theta(Z)\right]
+\end{aligned}
+```
+Both problems can be solved via standard stochastic optimization algorithms,
+such as stochastic gradient descent (SGD) and its variants.
+
+Reverse KL minimization is typically used for **Bayesian computation**, where one
+wants to approximate a posterior distribution $p$ that is only known up to a
+normalizing constant.
+In contrast, forward KL minimization is typically used for **generative modeling**, where one is given a set of samples from the target distribution $p$ (e.g., images) and aims to learn the density or a generative process that outputs high-quality samples.
+
+## Current status and TODOs
+
+- [x] general interface development
+- [x] documentation
+- [ ] add more flow examples
+- [ ] GPU compatibility
+- [ ] benchmarking
+
+## Related packages
+- [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl): a package for defining bijective transformations, which can be used for defining customized flow layers.
+- [Flux.jl](https://fluxml.ai/Flux.jl/stable/)
+- [Optimisers.jl](https://github.com/FluxML/Optimisers.jl)
+- [AdvancedVI.jl](https://github.com/TuringLang/AdvancedVI.jl)
+
+
diff --git a/docs/.gitignore b/docs/.gitignore
new file mode 100644
index 0000000..da3d337
--- /dev/null
+++ b/docs/.gitignore
@@ -0,0 +1,2 @@
+build/
+site/
\ No newline at end of file
diff --git a/docs/make.jl b/docs/make.jl
index 7202aa7..e0c8764 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -10,7 +10,12 @@ makedocs(;
     repo="https://github.com/TuringLang/NormalizingFlows.jl/blob/{commit}{path}#{line}",
     sitename="NormalizingFlows.jl",
     format=Documenter.HTML(),
-    pages=["Home" => "index.md"],
+    pages=[
+        "Home" => "index.md",
+        "API" => "api.md",
+        "Example" => "example.md",
+        "Customize your own flow layer" => "customized_layer.md",
+    ],
 )
 
 deploydocs(; repo="github.com/TuringLang/NormalizingFlows.jl", devbranch="main")
diff --git a/docs/src/api.md b/docs/src/api.md
new file mode 100644
index 0000000..f8028b9
--- /dev/null
+++ b/docs/src/api.md
@@ -0,0 +1,93 @@
+## API
+
+```@index
+```
+
+
+## Main Function
+
+```@docs
+NormalizingFlows.train_flow
+```
+
+The flow object can be constructed with the `transformed` function from the `Bijectors.jl` package.
+For example, for Gaussian VI, we can construct the flow as follows:
+```@julia
+using Distributions, Bijectors
+T = Float32
+q₀ = MvNormal(zeros(T, 2), ones(T, 2))
+flow = Bijectors.transformed(q₀, Bijectors.Shift(zeros(T, 2)) ∘ Bijectors.Scale(ones(T, 2)))
+```
+To train the Gaussian VI targeting the distribution $p$ via ELBO maximization, we can run
+```@julia
+using NormalizingFlows, Optimisers
+
+sample_per_iter = 10
+flow_trained, stats, _ = train_flow(
+    elbo,
+    flow,
+    logp,            # log-density of the target distribution p
+    sample_per_iter;
+    max_iters=2_000,
+    optimiser=Optimisers.ADAM(0.01 * one(T)),
+)
+```
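+
+After training, `flow_trained` is again a `Bijectors.TransformedDistribution`, so it can be sampled and evaluated directly. A minimal sketch (assuming the training call above has been run):
+```@julia
+xs = rand(flow_trained, 100)        # 100 i.i.d. samples from the trained flow
+logpdf(flow_trained, xs[:, 1])      # density of the trained flow at one sample
+losses = map(x -> x.loss, stats)    # per-iteration objective values recorded in `stats`
+```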
+
+## Variational Objectives
+We have implemented two variational objectives, namely, the ELBO and the log-likelihood objective.
+Users can also define their own objective functions and pass them to the [`train_flow`](@ref) function.
+`train_flow` will optimize the flow parameters by maximizing `vo`.
+The objective function should take the following general form:
+```julia
+vo(rng, flow, args...)
+```
+where `rng` is the random number generator, `flow` is the flow object, and `args...` are the
+additional arguments that users can pass to the objective function.
+
+#### Evidence Lower Bound (ELBO)
+Maximizing the ELBO is equivalent to minimizing the reverse KL divergence between $q_\theta$ and $p$, i.e.,
+```math
+\begin{aligned}
+&\min _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \quad \text{(Reverse KL)}\\
+& = \max _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ
+T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ
+T_1(Z_0)\right)\right] \quad \text{(ELBO)}
+\end{aligned}
+```
+Reverse KL minimization is typically used for **Bayesian computation**,
+where one only has access to the log-(unnormalized) density of the target distribution $p$ (e.g., a Bayesian posterior distribution),
+and hopes to generate approximate samples from it.
+
+```@docs
+NormalizingFlows.elbo
+```
+#### Log-likelihood
+
+Maximizing the log-likelihood is equivalent to minimizing the forward KL divergence between $q_\theta$ and $p$, i.e.,
+```math
+\begin{aligned}
+& \min_{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \quad \text{(Forward KL)} \\
+& = \max_{\theta} \mathbb{E}_{p}\left[\log q_{\theta}(Z)\right] \quad \text{(Expected log-likelihood)}
+\end{aligned}
+```
+Forward KL minimization is typically used for **generative modeling**,
+where one is given a set of samples from the target distribution $p$ (e.g., images)
+and aims to learn the density or a generative process that outputs high-quality samples.
+
+```@docs
+NormalizingFlows.loglikelihood
+```
+
+
+## Training Loop
+
+```@docs
+NormalizingFlows.optimize
+```
+
+
+## Utility Functions for Taking Gradient
+```@docs
+NormalizingFlows.grad!
+NormalizingFlows.value_and_gradient!
+```
+
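+Because any callable with the `vo(rng, flow, args...)` signature can be passed to [`train_flow`](@ref), users may also supply their own objective. Below is a minimal sketch of a custom single-batch ELBO-style objective; it is illustrative only and not the package's implementation (`logp` and `n_samples` would be forwarded by `train_flow` as `args...`):
+```julia
+using Random, Statistics, Distributions, Bijectors
+
+# a hypothetical user-defined objective following the vo(rng, flow, args...) form
+function my_objective(rng::Random.AbstractRNG, flow::Bijectors.TransformedDistribution, logp, n_samples)
+    zs = rand(rng, flow.dist, n_samples)            # draws from the base distribution q₀
+    vals = map(eachcol(zs)) do z
+        x, logjac = with_logabsdet_jacobian(flow.transform, z)
+        logp(x) + logjac - logpdf(flow.dist, z)     # single-sample ELBO estimate
+    end
+    return mean(vals)
+end
+```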
diff --git a/docs/src/banana.png b/docs/src/banana.png
new file mode 100644
index 0000000..785e53b
Binary files /dev/null and b/docs/src/banana.png differ
diff --git a/docs/src/comparison.png b/docs/src/comparison.png
new file mode 100644
index 0000000..777f2fd
Binary files /dev/null and b/docs/src/comparison.png differ
diff --git a/docs/src/customized_layer.md b/docs/src/customized_layer.md
new file mode 100644
index 0000000..af9062c
--- /dev/null
+++ b/docs/src/customized_layer.md
@@ -0,0 +1,180 @@
+# Defining Your Own Flow Layer
+
+In practice, users might want to define their own normalizing flow layers.
+As briefly noted in [What are normalizing flows?](@ref), the key is to define a
+customized normalizing flow layer, including its transformation and inverse,
+as well as the log-determinant of the Jacobian of the transformation.
+`Bijectors.jl` offers a convenient interface to define a customized bijection.
+We refer users to [the documentation of
+`Bijectors.jl`](https://turinglang.org/Bijectors.jl/dev/transforms/#Implementing-a-transformation)
+for more details.
+`Flux.jl` is also a useful package, offering a convenient interface to define neural networks.
+
+
+In this tutorial, we demonstrate how to define a customized normalizing flow
+layer -- an `Affine Coupling Layer` (Dinh *et al.*, 2016) -- using `Bijectors.jl` and `Flux.jl`.
+
+## Affine Coupling Flow
+
+Given an input vector $\boldsymbol{x}$, the general *coupling transformation* splits it into two
+parts: $\boldsymbol{x}_{I_1}$ and $\boldsymbol{x}_{I\setminus I_1}$. Only one
+part (e.g., $\boldsymbol{x}_{I_1}$) undergoes a bijective transformation $f$, called the *coupling law*,
+based on the values of the other part (e.g., $\boldsymbol{x}_{I\setminus I_1}$), which remains unchanged.
+```math
+\begin{array}{llll}
+c_{I_1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d & c_{I_1}^{-1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d \\
+& \boldsymbol{x}_{I \backslash I_1} \mapsto \boldsymbol{x}_{I \backslash I_1} & & \boldsymbol{y}_{I \backslash I_1} \mapsto \boldsymbol{y}_{I \backslash I_1} \\
+& \boldsymbol{x}_{I_1} \mapsto f\left(\boldsymbol{x}_{I_1} ; \theta\left(\boldsymbol{x}_{I\setminus I_1}\right)\right) & & \boldsymbol{y}_{I_1} \mapsto f^{-1}\left(\boldsymbol{y}_{I_1} ; \theta\left(\boldsymbol{y}_{I\setminus I_1}\right)\right)
+\end{array}
+```
+Here $\theta$ can be an arbitrary function, e.g., a neural network.
+As long as $f(\cdot; \theta(\boldsymbol{x}_{I\setminus I_1}))$ is invertible, $c_{I_1}$ is invertible, and the
+Jacobian determinant of $c_{I_1}$ is easy to compute:
+```math
+\left|\text{det} \nabla_x c_{I_1}(x)\right| = \left|\text{det} \nabla_{x_{I_1}} f(x_{I_1}; \theta(x_{I\setminus I_1}))\right|
+```
+
+The affine coupling layer is a special case of the coupling transformation, where the coupling law $f$ is an affine function:
+```math
+\begin{aligned}
+\boldsymbol{x}_{I_1} &\mapsto \boldsymbol{x}_{I_1} \odot s\left(\boldsymbol{x}_{I\setminus I_1}\right) + t\left(\boldsymbol{x}_{I \setminus I_1}\right) \\
+\boldsymbol{x}_{I \backslash I_1} &\mapsto \boldsymbol{x}_{I \backslash I_1}
+\end{aligned}
+```
+Here, $s$ and $t$ are arbitrary functions (often neural networks) called the "scaling" and "translation" functions, respectively.
+They produce vectors of the
+same dimension as $\boldsymbol{x}_{I_1}$.
+
+
+## Implementing Affine Coupling Layer
+
+We start by defining a simple 3-layer multi-layer perceptron (MLP) using `Flux.jl`,
+which will be used to define the scaling function $s$ and the translation function $t$ in the affine coupling layer.
+```@example afc
+using Flux
+
+function MLP_3layer(input_dim::Int, hdims::Int, output_dim::Int; activation=Flux.leakyrelu)
+    return Chain(
+        Flux.Dense(input_dim, hdims, activation),
+        Flux.Dense(hdims, hdims, activation),
+        Flux.Dense(hdims, output_dim),
+    )
+end
+```
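+
+As a quick illustrative check (not part of the tutorial pipeline), the helper returns a `Flux.Chain` that maps vectors of length `input_dim` to vectors of length `output_dim`:
+```julia
+m = MLP_3layer(2, 16, 2)
+size(m(randn(Float32, 2)))   # (2,): the output has length `output_dim`
+```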
+
+#### Construct the Object
+
+Following the user interface of `Bijectors.jl`, we define a struct `AffineCoupling` as a subtype of `Bijectors.Bijector`.
+The functions `partition` and `combine` are used to partition a vector into 3 disjoint subvectors and to recombine them,
+and `PartitionMask` is used to store this partition rule.
+These three utilities are all defined in `Bijectors.jl`; see the [documentation](https://github.com/TuringLang/Bijectors.jl/blob/49c138fddd3561c893592a75b211ff6ad949e859/src/bijectors/coupling.jl#L3) for more details.
+
+```@example afc
+using Functors
+using Bijectors
+using Bijectors: partition, combine, PartitionMask
+
+struct AffineCoupling <: Bijectors.Bijector
+    dim::Int
+    mask::Bijectors.PartitionMask
+    s::Flux.Chain
+    t::Flux.Chain
+end
+
+# to apply functions to the parameters that are contained in AffineCoupling.s and AffineCoupling.t,
+# and to re-build the struct from the parameters, we use the functor interface of `Functors.jl`
+# see https://fluxml.ai/Flux.jl/stable/models/functors/#Functors.functor
+@functor AffineCoupling (s, t)
+
+function AffineCoupling(
+    dim::Int,                 # dimension of input
+    hdims::Int,               # dimension of hidden units for s and t
+    mask_idx::AbstractVector, # index of dimensions that one wants to apply transformations on
+)
+    cdims = length(mask_idx)  # dimension of parts used to construct coupling law
+    s = MLP_3layer(cdims, hdims, cdims)
+    t = MLP_3layer(cdims, hdims, cdims)
+    mask = PartitionMask(dim, mask_idx)
+    return AffineCoupling(dim, mask, s, t)
+end
+```
+By default, we define $s$ and $t$ using the `MLP_3layer` function, which is a
+3-layer MLP with leaky ReLU activation functions.
+
+#### Implement the Forward and Inverse Transformations
+
+
+```@example afc
+function Bijectors.transform(af::AffineCoupling, x::AbstractVector)
+    # partition vector using `af.mask::PartitionMask`
+    x₁, x₂, x₃ = partition(af.mask, x)
+    y₁ = x₁ .* af.s(x₂) .+ af.t(x₂)
+    return combine(af.mask, y₁, x₂, x₃)
+end
+
+function Bijectors.transform(iaf::Inverse{<:AffineCoupling}, y::AbstractVector)
+    af = iaf.orig
+    # partition vector using `af.mask::PartitionMask`
+    y_1, y_2, y_3 = partition(af.mask, y)
+    # inverse transformation
+    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
+    return combine(af.mask, x_1, y_2, y_3)
+end
+```
+
+#### Implement the Log-determinant of the Jacobian
+Notice that here we wrap the transformation and the log-determinant of the Jacobian into a single function, `with_logabsdet_jacobian`.
+
+```@example afc
+function Bijectors.with_logabsdet_jacobian(af::AffineCoupling, x::AbstractVector)
+    x_1, x_2, x_3 = Bijectors.partition(af.mask, x)
+    y_1 = af.s(x_2) .* x_1 .+ af.t(x_2)
+    logjac = sum(log ∘ abs, af.s(x_2))
+    return combine(af.mask, y_1, x_2, x_3), logjac
+end
+
+function Bijectors.with_logabsdet_jacobian(
+    iaf::Inverse{<:AffineCoupling}, y::AbstractVector
+)
+    af = iaf.orig
+    # partition vector using `af.mask::PartitionMask`
+    y_1, y_2, y_3 = partition(af.mask, y)
+    # inverse transformation
+    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
+    logjac = -sum(log ∘ abs, af.s(y_2))
+    return combine(af.mask, x_1, y_2, y_3), logjac
+end
+```
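+
+Before composing layers into a flow, one might sanity-check the implementation by confirming that the inverse map recovers the input and that the two log-Jacobians cancel (an illustrative check, assuming the definitions above):
+```julia
+af = AffineCoupling(4, 16, 1:2)
+x = randn(Float32, 4)
+y, logjac_fwd = with_logabsdet_jacobian(af, x)
+x_rec, logjac_bwd = with_logabsdet_jacobian(inverse(af), y)
+x_rec ≈ x && logjac_fwd ≈ -logjac_bwd   # should hold up to floating-point error
+```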
+
+#### Construct Normalizing Flow
+
+Now with all the above implementations, we are ready to use the `AffineCoupling` layers to build a normalizing flow
+by applying them to a base distribution $q_0$.
+
+```@example afc
+using Random, Distributions, LinearAlgebra
+dim = 4
+hdims = 10
+Ls = [
+    AffineCoupling(dim, hdims, 1:2),
+    AffineCoupling(dim, hdims, 3:4),
+    AffineCoupling(dim, hdims, 1:2),
+    AffineCoupling(dim, hdims, 3:4),
+]
+ts = reduce(∘, Ls)
+q₀ = MvNormal(zeros(Float32, dim), I)
+flow = Bijectors.transformed(q₀, ts)
+```
+We can now sample from the flow:
+```@example afc
+x = rand(flow, 10)
+```
+And evaluate the density of the flow:
+```@example afc
+logpdf(flow, x[:, 1])
+```
+
+
+## Reference
+Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2016. *Density estimation using Real NVP.*
+arXiv:1605.08803.
\ No newline at end of file
diff --git a/docs/src/elbo.png b/docs/src/elbo.png
new file mode 100644
index 0000000..2de495b
Binary files /dev/null and b/docs/src/elbo.png differ
diff --git a/docs/src/example.md b/docs/src/example.md
new file mode 100644
index 0000000..346c15a
--- /dev/null
+++ b/docs/src/example.md
@@ -0,0 +1,119 @@
+## Example: Using Planar Flow
+
+Here we provide a minimal demonstration of learning a synthetic 2d banana distribution
+using *planar flows* (Rezende *et al.*, 2015) by maximizing the [Evidence Lower Bound (ELBO)](@ref).
+To complete this task, the two key inputs are:
+- the log-density function of the target distribution,
+- the planar flow.
+
+#### The Target Distribution
+
+The `Banana` object is defined in `example/targets/banana.jl`; see the [source code](https://github.com/zuhengxu/NormalizingFlows.jl/blob/main/example/targets/banana.jl) for details.
+```julia
+p = Banana(2, 1.0f-1, 100.0f0)
+logp = Base.Fix1(logpdf, p)
+```
+Visualize the contour of the log-density and a scatter of samples from the target distribution:
+![Banana](banana.png)
+
+
+
+
+#### The Planar Flow
+
+The planar flow is defined by repeatedly applying a sequence of invertible
+transformations to a base distribution $q_0$. The building blocks of a planar flow
+of length $N$ are the following invertible transformations, called *planar layers*:
+```math
+\text{planar layers}:
+T_{n, \theta_n}(x)=x+u_n \cdot \tanh \left(w_n^T x+b_n\right), \quad n=1, \ldots, N,
+```
+where $\theta_n = (u_n, w_n, b_n), n=1, \dots, N$ are the parameters to be learned.
+Thankfully, [`Bijectors.jl`](https://github.com/TuringLang/Bijectors.jl)
+provides a nice framework to define a normalizing flow.
+Here we use the `PlanarLayer()` from `Bijectors.jl` to construct a
+20-layer planar flow, whose base distribution is a 2d standard Gaussian distribution.
+
+```julia
+using Bijectors, FunctionChains
+using Distributions, LinearAlgebra
+using Flux: f32
+
+function create_planar_flow(n_layers::Int, q₀)
+    d = length(q₀)
+    Ls = [f32(PlanarLayer(d)) for _ in 1:n_layers]
+    ts = fchain(Ls)
+    return transformed(q₀, ts)
+end
+
+# create a 20-layer planar flow
+flow = create_planar_flow(20, MvNormal(zeros(Float32, 2), I))
+flow_untrained = deepcopy(flow) # keep a copy of the untrained flow for comparison
+```
+*Notice that here the flow layers are chained together using the `fchain` function from [`FunctionChains.jl`](https://github.com/oschulz/FunctionChains.jl).
+Alternatively, one can do*
+```julia
+ts = reduce(∘, [f32(PlanarLayer(d)) for i in 1:20])
+```
+*However, we recommend using `fchain` to reduce the compilation time when the number of layers is large.
+See [this comment](https://github.com/TuringLang/NormalizingFlows.jl/blob/8f4371d48228adf368d851e221af076ff929f1cf/src/NormalizingFlows.jl#L52)
+for how the compilation time might be a concern.*
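+
+As a small illustrative aside (not needed for the rest of the example), each `PlanarLayer` is itself a `Bijectors` transformation, so it can be applied to points directly:
+```julia
+layer = PlanarLayer(2)                           # a single planar layer: T(x) = x + u * tanh(wᵀx + b)
+x = randn(Float32, 2)
+y = Bijectors.transform(layer, x)                # forward pass through the layer
+y2, logjac = with_logabsdet_jacobian(layer, x)   # forward pass plus log|det ∇T(x)|
+```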
+
+
+#### Flow Training
+Then we can train the flow by maximizing the ELBO using the [`train_flow`](@ref) function as follows:
+```julia
+using NormalizingFlows
+using ADTypes
+using Optimisers
+
+sample_per_iter = 10
+# callback function to track the number of samples used per iteration
+cb(iter, opt_stats, re, θ) = (sample_per_iter=sample_per_iter,)
+# stopping criterion: stop when the gradient norm is less than 1e-3
+checkconv(iter, stat, re, θ, st) = stat.gradient_norm < 1e-3
+flow_trained, stats, _ = train_flow(
+    elbo,
+    flow,
+    logp,
+    sample_per_iter;
+    max_iters=20_000,
+    optimiser=Optimisers.ADAM(),
+    callback=cb,
+    hasconverged=checkconv,
+    ADbackend=AutoZygote(), # using Zygote as the AD backend
+)
+```
+
+Examine the loss values during training:
+```julia
+using Plots
+
+losses = map(x -> x.loss, stats)
+plot(losses; xlabel = "#iteration", ylabel= "negative ELBO", label="", linewidth=2)
+```
+![elbo](elbo.png)
+
+## Evaluating Trained Flow
+Finally, we can evaluate the trained flow by sampling from it and comparing it with the target distribution.
+Since the flow is defined as a `Bijectors.TransformedDistribution`, one can
+easily sample from it using the `rand` function, or examine the density using the `logpdf` function.
+See the [documentation of `Bijectors.jl`](https://turinglang.org/Bijectors.jl/dev/distributions/) for details.
+```julia
+using Random, Distributions
+
+n_samples = 1000
+samples_trained = rand(flow_trained, n_samples) # 1000 iid samples from the trained flow
+samples_untrained = rand(flow_untrained, n_samples) # 1000 iid samples from the untrained flow
+samples_true = rand(p, n_samples) # 1000 iid samples from the target
+
+# plot
+scatter(samples_true[1, :], samples_true[2, :]; label="True Distribution", color=:blue, markersize=2, alpha=0.5)
+scatter!(samples_untrained[1, :], samples_untrained[2, :]; label="Untrained Flow", color=:red, markersize=2, alpha=0.5)
+scatter!(samples_trained[1, :], samples_trained[2, :]; label="Trained Flow", color=:green, markersize=2, alpha=0.5)
+plot!(title = "Comparison of Trained and Untrained Flow", xlabel = "X", ylabel= "Y", legend=:topleft)
+```
+![compare](comparison.png)
+
+
+## Reference
+
+- Rezende, D. and Mohamed, S., 2015. *Variational inference with normalizing flows*. International Conference on Machine Learning.
\ No newline at end of file
diff --git a/docs/src/index.md b/docs/src/index.md
index dfb7aee..dc84076 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -2,13 +2,84 @@
 CurrentModule = NormalizingFlows
 ```
 
-# NormalizingFlows
+# NormalizingFlows.jl
 
 Documentation for [NormalizingFlows](https://github.com/TuringLang/NormalizingFlows.jl).
 
-```@index
+
+The purpose of this package is to provide a simple and flexible interface for
+variational inference (VI) and normalizing flows (NF) for Bayesian computation and generative modeling.
+The key focus is to ensure modularity and extensibility, so that users can easily
+construct (e.g., define customized flow layers) and combine various components
+(e.g., choose different VI objectives or gradient estimates)
+for variational approximation of general target distributions,
+*without being tied to specific probabilistic programming frameworks or applications*.
+
+See the [documentation](https://turinglang.org/NormalizingFlows.jl/dev/) for more.
+
+## Installation
+To install the package, run the following command in the Julia REPL:
+```
+] # enter Pkg mode
+(@v1.9) pkg> add git@github.com:TuringLang/NormalizingFlows.jl.git
 ```
+Then simply run the following command to use the package:
+```julia
+using NormalizingFlows
+```
+
+## What are normalizing flows?
-```@autodocs
-Modules = [NormalizingFlows]
+Normalizing flows transform a simple reference distribution $q_0$ (sometimes known as the base distribution) to
+a complex distribution $q_\theta$ using invertible functions with trainable parameters $\theta$, aiming to approximate a target distribution $p$.
+The approximation is achieved by minimizing some statistical distance between $q_\theta$ and $p$.
+
+In more detail, given the base distribution, usually a standard Gaussian distribution, i.e., $q_0 = \mathcal{N}(0, I)$,
+we apply a series of parameterized invertible transformations (called flow layers), $T_{1, \theta_1}, \cdots, T_{N, \theta_N}$, yielding
+```math
+Z_N = T_{N, \theta_N} \circ \cdots \circ T_{1, \theta_1} (Z_0) , \quad Z_0 \sim q_0,\quad Z_N \sim q_{\theta},
+```
+where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learned,
+and $q_{\theta}$ is the transformed distribution (typically called the
+variational distribution or the flow distribution).
+This describes the **sampling procedure** of normalizing flows, which requires
+sending draws from the base distribution through a forward pass of these flow layers.
+
+Since all the transformations are invertible (technically [diffeomorphic](https://en.wikipedia.org/wiki/Diffeomorphism)),
+we can evaluate the density of a normalizing flow distribution $q_{\theta}$ by the change of variables formula:
+```math
+q_\theta(x)=\frac{q_0\left(T_1^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)}{\prod_{n=1}^N J_n\left(T_n^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)} \quad J_n(x)=\left|\operatorname{det} \nabla_x
+T_n(x)\right|.
+```
+Here we drop the subscript $\theta_n, n = 1, \dots, N$ for simplicity.
+Density evaluation of a normalizing flow requires computing the **inverse** and the
+**Jacobian determinant** of each flow layer.
+
+Given the feasibility of i.i.d. sampling and density evaluation, normalizing
+flows can be trained by minimizing some statistical distance to the target
+distribution $p$. The typical choice of the statistical distance is the forward
+or reverse Kullback-Leibler (KL) divergence, which leads to the following
+optimization problems:
+```math
+\begin{aligned}
+\text{Reverse KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
+&= \argmin _{\theta} \mathbb{E}_{q_0}\left[\log \frac{q_\theta(T_N\circ \cdots \circ T_1(Z_0))}{p(T_N\circ \cdots \circ T_1(Z_0))}\right] \\
+&= \argmax _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ T_1(Z_0)\right)\right]
+\end{aligned}
+```
+and
+```math
+\begin{aligned}
+\text{Forward KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \\
+&= \argmax _{\theta} \mathbb{E}_{p}\left[\log q_\theta(Z)\right]
+\end{aligned}
+```
+Both problems can be solved via standard stochastic optimization algorithms,
+such as stochastic gradient descent (SGD) and its variants.
+
+
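+Within this package, both objectives are available through a single entry point, [`train_flow`](@ref). A minimal sketch of the typical calling pattern (mirroring the API and Example pages; `flow` and `logp` are assumed to be constructed as described there):
+```julia
+using NormalizingFlows, Optimisers
+
+sample_per_iter = 10
+flow_trained, stats, _ = train_flow(
+    elbo,             # variational objective (reverse KL / ELBO)
+    flow,             # a Bijectors.TransformedDistribution
+    logp,             # log-density of the target, possibly unnormalized
+    sample_per_iter;
+    max_iters=2_000,
+    optimiser=Optimisers.ADAM(0.01),
+)
+```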
diff --git a/src/NormalizingFlows.jl b/src/NormalizingFlows.jl
index 2d70892..16efb89 100644
--- a/src/NormalizingFlows.jl
+++ b/src/NormalizingFlows.jl
@@ -21,19 +21,22 @@ Train the given normalizing flow `flow` by calling `optimize`.
 
 # Arguments
 - `rng::AbstractRNG`: random number generator
 - `vo`: variational objective
-- `flow`: normalizing flow to be trained
+- `flow`: normalizing flow to be trained; we recommend defining `flow` as `<:Bijectors.TransformedDistribution`
 - `args...`: additional arguments for `vo`
 
 # Keyword Arguments
 - `max_iters::Int=1000`: maximum number of iterations
 - `optimiser::Optimisers.AbstractRule=Optimisers.ADAM()`: optimiser to compute the steps
-- `ADbackend::ADTypes.AbstractADType=ADTypes.AutoZygote()`: automatic differentiation backend
-- `kwargs...`: additional keyword arguments for `optimize` (See `optimize`)
+- `ADbackend::ADTypes.AbstractADType=ADTypes.AutoZygote()`:
+  automatic differentiation backend, currently supports
+  `ADTypes.AutoZygote()`, `ADTypes.AutoForwardDiff()`, and `ADTypes.AutoReverseDiff()`.
+- `kwargs...`: additional keyword arguments for `optimize` (See [`optimize`](@ref) for details)
 
 # Returns
 - `flow_trained`: trained normalizing flow
-- `opt_stats`: statistics of the optimiser during the training process (See `optimize`)
+- `opt_stats`: statistics of the optimiser during the training process
+  (See [`optimize`](@ref) for details)
 - `st`: optimiser state for potential continuation of training
 """
 function train_flow(vo, flow, args...; kwargs...)
diff --git a/src/objectives/elbo.jl b/src/objectives/elbo.jl
index 30a491e..68545b5 100644
--- a/src/objectives/elbo.jl
+++ b/src/objectives/elbo.jl
@@ -15,13 +15,13 @@ Compute the ELBO for a batch of samples `xs` from the reference distribution `fl
 
 # Arguments
 - `rng`: random number generator
 - `flow`: variational distribution to be trained. In particular
-    "flow = transformed(q₀, T::Bijectors.Bijector)",
+    `flow = transformed(q₀, T::Bijectors.Bijector)`,
     q₀ is a reference distribution that one can easily sample and compute logpdf
 - `logp`: log-pdf of the target distribution (not necessarily normalized)
 - `xs`: samples from reference dist q₀
 - `n_samples`: number of samples from reference dist q₀
+
 """
-# ELBO based on multiple iid samples
 function elbo(flow::Bijectors.UnivariateTransformed, logp, xs::AbstractVector)
     elbo_values = map(x -> elbo_single_sample(flow, logp, x), xs)
     return mean(elbo_values)
diff --git a/src/objectives/loglikelihood.jl b/src/objectives/loglikelihood.jl
index 8564793..4097ae1 100644
--- a/src/objectives/loglikelihood.jl
+++ b/src/objectives/loglikelihood.jl
@@ -5,7 +5,13 @@ loglikelihood(flow::Bijectors.TransformedDistribution, xs::AbstractVecOrMat)
 
 Compute the log-likelihood for variational distribution flow at a batch of samples xs from
-the target distribution.
+the target distribution p.
+
+# Arguments
+- `flow`: variational distribution to be trained. In particular
+    `flow = transformed(q₀, T::Bijectors.Bijector)`,
+    q₀ is a reference distribution from which one can easily sample and compute the logpdf
+- `xs`: samples from the target distribution p.
 
 """
 function loglikelihood(
diff --git a/src/train.jl b/src/train.jl
index 5edd9a0..3a28635 100644
--- a/src/train.jl
+++ b/src/train.jl
@@ -33,7 +33,8 @@ The result is stored in `out`.
 
 # Arguments
 - `rng::AbstractRNG`: random number generator
-- `ad::ADTypes.AbstractADType`: automatic differentiation backend
+- `ad::ADTypes.AbstractADType`: automatic differentiation backend, currently supports
+  `ADTypes.AutoZygote()`, `ADTypes.AutoForwardDiff()`, and `ADTypes.AutoReverseDiff()`.
 - `vo`: variational objective
 - `θ_flat::AbstractVector{<:Real}`: flattened parameters of the normalizing flow
 - `reconstruct`: function that reconstructs the normalizing flow from the flattened parameters