diff --git a/README.md b/README.md
index f9fbf66..5e7cac8 100644
--- a/README.md
+++ b/README.md
@@ -2,3 +2,89 @@
 [![Dev](https://img.shields.io/badge/docs-dev-blue.svg)](https://turinglang.github.io/NormalizingFlows.jl/dev/)
 [![Build Status](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/TuringLang/NormalizingFlows.jl/actions/workflows/CI.yml?query=branch%3Amain)
+
+
+A normalizing flow library for Julia.
+
+The purpose of this package is to provide a simple and flexible interface for variational inference (VI) and normalizing flows (NF) for Bayesian computation or generative modeling.
+The key focus is to ensure modularity and extensibility, so that users can easily
+construct (e.g., define customized flow layers) and combine various components
+(e.g., choose different VI objectives or gradient estimates)
+for variational approximation of general target distributions,
+without being tied to specific probabilistic programming frameworks or applications.
+
+See the [documentation](https://turinglang.org/NormalizingFlows.jl/dev/) for more.
+
+## Installation
+To install the package, run the following command in the Julia REPL:
+```julia
+] # enter Pkg mode
+(@v1.9) pkg> add git@github.com:TuringLang/NormalizingFlows.jl.git
+```
+Then simply run the following command to use the package:
+```julia
+using NormalizingFlows
+```
+
+## Quick recap of normalizing flows
+Normalizing flows transform a simple reference distribution $q_0$ (sometimes known as the base distribution) to
+a complex distribution $q$ using invertible functions.
+
+In more detail, given the base distribution, usually a standard Gaussian distribution, i.e., $q_0 = \mathcal{N}(0, I)$,
+we apply a series of parameterized invertible transformations (called flow layers), $T_{1, \theta_1}, \cdots, T_{N, \theta_N}$, yielding
+```math
+Z_N = T_{N, \theta_N} \circ \cdots \circ T_{1, \theta_1} (Z_0) , \quad Z_0 \sim q_0,\quad Z_N \sim q_{\theta},
+```
+where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learned, and $q_{\theta}$ is the variational distribution (flow distribution). This describes the **sampling procedure** of normalizing flows, which requires sending draws from the base distribution through a forward pass of these flow layers.
+
+Since all the transformations are invertible (technically [diffeomorphic](https://en.wikipedia.org/wiki/Diffeomorphism)), we can evaluate the density of a normalizing flow distribution $q_{\theta}$ by the change of variables formula:
+```math
+q_\theta(x)=\frac{q_0\left(T_1^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)}{\prod_{n=1}^N J_n\left(T_n^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)} \quad J_n(x)=\left|\operatorname{det} \nabla_x
+T_n(x)\right|.
+```
+Here we drop the subscript $\theta_n, n = 1, \dots, N$ for simplicity.
+Density evaluation of a normalizing flow requires computing the **inverse** and the
+**Jacobian determinant** of each flow layer.
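+
+For example, a minimal sketch of this workflow, built from a simple shift-and-scale transformation in `Bijectors.jl` (illustrative only), shows both operations: sampling via the forward pass, and density evaluation via the change of variables formula:
+```julia
+using Distributions, Bijectors, LinearAlgebra
+
+q₀ = MvNormal(zeros(2), I)                                 # base distribution
+layer = Bijectors.Shift([1.0, -1.0]) ∘ Bijectors.Scale([2.0, 0.5])
+flow = Bijectors.transformed(q₀, layer)                    # flow distribution q_θ
+
+x = rand(flow)      # sampling: draw z₀ ~ q₀ and push it through the flow layer
+logpdf(flow, x)     # density evaluation: uses the inverse map and the log-Jacobian
+```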
+
+Given the feasibility of i.i.d. sampling and density evaluation, normalizing flows can be trained by minimizing some statistical distance to the target distribution $p$.
+The typical choice of the statistical distance is the forward or reverse Kullback-Leibler (KL) divergence, which leads to the following optimization problems:
+```math
+\begin{aligned}
+\text{Reverse KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
+&= \argmin _{\theta} \mathbb{E}_{q_0}\left[\log \frac{q_\theta(T_N\circ \cdots \circ T_1(Z_0))}{p(T_N\circ \cdots \circ T_1(Z_0))}\right] \\
+&= \argmax _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ T_1(Z_0)\right)\right]
+\end{aligned}
+```
+and
+```math
+\begin{aligned}
+\text{Forward KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \\
+&= \argmax _{\theta} \mathbb{E}_{p}\left[\log q_\theta(Z)\right]
+\end{aligned}
+```
+Both problems can be solved via standard stochastic optimization algorithms,
+such as stochastic gradient descent (SGD) and its variants.
+
+Reverse KL minimization is typically used for **Bayesian computation**, where one
+wants to approximate a posterior distribution $p$ that is only known up to a
+normalizing constant.
+In contrast, forward KL minimization is typically used for **generative modeling**, where one is given a set of samples from the target distribution $p$ (e.g., images) and aims to learn the density or a generative process that outputs high-quality samples.
+
+## Current status and TODOs
+
+- [x] general interface development
+- [x] documentation
+- [ ] add more flow examples
+- [ ] GPU compatibility
+- [ ] benchmarking
+
+## Related packages
+- [Bijectors.jl](https://github.com/TuringLang/Bijectors.jl): a package for defining bijective transformations, which can be used for defining customized flow layers.
+- [Flux.jl](https://fluxml.ai/Flux.jl/stable/)
+- [Optimisers.jl](https://github.com/FluxML/Optimisers.jl)
+- [AdvancedVI.jl](https://github.com/TuringLang/AdvancedVI.jl)
+
+
diff --git a/docs/.gitignore b/docs/.gitignore
new file mode 100644
index 0000000..da3d337
--- /dev/null
+++ b/docs/.gitignore
@@ -0,0 +1,2 @@
+build/
+site/
\ No newline at end of file
diff --git a/docs/make.jl b/docs/make.jl
index 7202aa7..e0c8764 100644
--- a/docs/make.jl
+++ b/docs/make.jl
@@ -10,7 +10,12 @@ makedocs(;
     repo="https://github.com/TuringLang/NormalizingFlows.jl/blob/{commit}{path}#{line}",
     sitename="NormalizingFlows.jl",
     format=Documenter.HTML(),
-    pages=["Home" => "index.md"],
+    pages=[
+        "Home" => "index.md",
+        "API" => "api.md",
+        "Example" => "example.md",
+        "Customize your own flow layer" => "customized_layer.md",
+    ],
 )
 
 deploydocs(; repo="github.com/TuringLang/NormalizingFlows.jl", devbranch="main")
diff --git a/docs/src/api.md b/docs/src/api.md
new file mode 100644
index 0000000..f8028b9
--- /dev/null
+++ b/docs/src/api.md
@@ -0,0 +1,93 @@
+## API
+
+```@index
+```
+
+
+## Main Function
+
+```@docs
+NormalizingFlows.train_flow
+```
+
+The flow object can be constructed with the `transformed` function from the `Bijectors.jl` package.
+For example, for Gaussian VI, we can construct the flow as follows:
+```@julia
+using Distributions, Bijectors
+T = Float32
+q₀ = MvNormal(zeros(T, 2), ones(T, 2))
+flow = Bijectors.transformed(q₀, Bijectors.Shift(zeros(T, 2)) ∘ Bijectors.Scale(ones(T, 2)))
+```
+To train the Gaussian VI targeting the distribution $p$ via ELBO maximization, we can run
+```@julia
+using NormalizingFlows, Optimisers
+
+sample_per_iter = 10
+flow_trained, stats, _ = train_flow(
+    elbo,
+    flow,
+    logp,            # log-density of the target distribution p
+    sample_per_iter;
+    max_iters=2_000,
+    optimiser=Optimisers.ADAM(0.01 * one(T)),
+)
+```
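+
+After training, `flow_trained` is again a `Bijectors.TransformedDistribution`, so it can be sampled and evaluated directly. A minimal sketch (assuming the training call above has been run):
+```@julia
+xs = rand(flow_trained, 100)        # 100 i.i.d. samples from the trained flow
+logpdf(flow_trained, xs[:, 1])      # density of the trained flow at one sample
+losses = map(x -> x.loss, stats)    # per-iteration objective values recorded in `stats`
+```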
+
+## Variational Objectives
+We have implemented two variational objectives, namely, the ELBO and the log-likelihood objective.
+Users can also define their own objective functions and pass them to the [`train_flow`](@ref) function.
+`train_flow` will optimize the flow parameters by maximizing `vo`.
+The objective function should take the following general form:
+```julia
+vo(rng, flow, args...)
+```
+where `rng` is the random number generator, `flow` is the flow object, and `args...` are the
+additional arguments that users can pass to the objective function.
+
+#### Evidence Lower Bound (ELBO)
+Maximizing the ELBO is equivalent to minimizing the reverse KL divergence between $q_\theta$ and $p$, i.e.,
+```math
+\begin{aligned}
+&\min _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \quad \text{(Reverse KL)}\\
+& = \max _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ
+T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ
+T_1(Z_0)\right)\right] \quad \text{(ELBO)}
+\end{aligned}
+```
+Reverse KL minimization is typically used for **Bayesian computation**,
+where one only has access to the log-(unnormalized) density of the target distribution $p$ (e.g., a Bayesian posterior distribution),
+and hopes to generate approximate samples from it.
+
+```@docs
+NormalizingFlows.elbo
+```
+#### Log-likelihood
+
+Maximizing the log-likelihood is equivalent to minimizing the forward KL divergence between $q_\theta$ and $p$, i.e.,
+```math
+\begin{aligned}
+& \min_{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \quad \text{(Forward KL)} \\
+& = \max_{\theta} \mathbb{E}_{p}\left[\log q_{\theta}(Z)\right] \quad \text{(Expected log-likelihood)}
+\end{aligned}
+```
+Forward KL minimization is typically used for **generative modeling**,
+where one is given a set of samples from the target distribution $p$ (e.g., images)
+and aims to learn the density or a generative process that outputs high-quality samples.
+
+```@docs
+NormalizingFlows.loglikelihood
+```
+
+
+## Training Loop
+
+```@docs
+NormalizingFlows.optimize
+```
+
+
+## Utility Functions for Taking Gradient
+```@docs
+NormalizingFlows.grad!
+NormalizingFlows.value_and_gradient!
+```
+
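+Because any callable with the `vo(rng, flow, args...)` signature can be passed to [`train_flow`](@ref), users may also supply their own objective. Below is a minimal sketch of a custom single-batch ELBO-style objective; it is illustrative only and not the package's implementation (`logp` and `n_samples` would be forwarded by `train_flow` as `args...`):
+```julia
+using Random, Statistics, Distributions, Bijectors
+
+# a hypothetical user-defined objective following the vo(rng, flow, args...) form
+function my_objective(rng::Random.AbstractRNG, flow::Bijectors.TransformedDistribution, logp, n_samples)
+    zs = rand(rng, flow.dist, n_samples)            # draws from the base distribution q₀
+    vals = map(eachcol(zs)) do z
+        x, logjac = with_logabsdet_jacobian(flow.transform, z)
+        logp(x) + logjac - logpdf(flow.dist, z)     # single-sample ELBO estimate
+    end
+    return mean(vals)
+end
+```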
diff --git a/docs/src/banana.png b/docs/src/banana.png
new file mode 100644
index 0000000..785e53b
Binary files /dev/null and b/docs/src/banana.png differ
diff --git a/docs/src/comparison.png b/docs/src/comparison.png
new file mode 100644
index 0000000..777f2fd
Binary files /dev/null and b/docs/src/comparison.png differ
diff --git a/docs/src/customized_layer.md b/docs/src/customized_layer.md
new file mode 100644
index 0000000..af9062c
--- /dev/null
+++ b/docs/src/customized_layer.md
@@ -0,0 +1,180 @@
+# Defining Your Own Flow Layer
+
+In practice, users might want to define their own normalizing flow layers.
+As briefly noted in [What are normalizing flows?](@ref), the key is to define a
+customized normalizing flow layer, including its transformation and inverse,
+as well as the log-determinant of the Jacobian of the transformation.
+`Bijectors.jl` offers a convenient interface to define a customized bijection.
+We refer users to [the documentation of
+`Bijectors.jl`](https://turinglang.org/Bijectors.jl/dev/transforms/#Implementing-a-transformation)
+for more details.
+`Flux.jl` is also a useful package, offering a convenient interface to define neural networks.
+
+
+In this tutorial, we demonstrate how to define a customized normalizing flow
+layer -- an `Affine Coupling Layer` (Dinh *et al.*, 2016) -- using `Bijectors.jl` and `Flux.jl`.
+
+## Affine Coupling Flow
+
+Given an input vector $\boldsymbol{x}$, the general *coupling transformation* splits it into two
+parts: $\boldsymbol{x}_{I_1}$ and $\boldsymbol{x}_{I\setminus I_1}$. Only one
+part (e.g., $\boldsymbol{x}_{I_1}$) undergoes a bijective transformation $f$, called the *coupling law*,
+based on the values of the other part (e.g., $\boldsymbol{x}_{I\setminus I_1}$), which remains unchanged.
+```math
+\begin{array}{llll}
+c_{I_1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d & c_{I_1}^{-1}(\cdot ; f, \theta): & \mathbb{R}^d \rightarrow \mathbb{R}^d \\
+& \boldsymbol{x}_{I \backslash I_1} \mapsto \boldsymbol{x}_{I \backslash I_1} & & \boldsymbol{y}_{I \backslash I_1} \mapsto \boldsymbol{y}_{I \backslash I_1} \\
+& \boldsymbol{x}_{I_1} \mapsto f\left(\boldsymbol{x}_{I_1} ; \theta\left(\boldsymbol{x}_{I\setminus I_1}\right)\right) & & \boldsymbol{y}_{I_1} \mapsto f^{-1}\left(\boldsymbol{y}_{I_1} ; \theta\left(\boldsymbol{y}_{I\setminus I_1}\right)\right)
+\end{array}
+```
+Here $\theta$ can be an arbitrary function, e.g., a neural network.
+As long as $f(\cdot; \theta(\boldsymbol{x}_{I\setminus I_1}))$ is invertible, $c_{I_1}$ is invertible, and the
+Jacobian determinant of $c_{I_1}$ is easy to compute:
+```math
+\left|\text{det} \nabla_x c_{I_1}(x)\right| = \left|\text{det} \nabla_{x_{I_1}} f(x_{I_1}; \theta(x_{I\setminus I_1}))\right|
+```
+
+The affine coupling layer is a special case of the coupling transformation, where the coupling law $f$ is an affine function:
+```math
+\begin{aligned}
+\boldsymbol{x}_{I_1} &\mapsto \boldsymbol{x}_{I_1} \odot s\left(\boldsymbol{x}_{I\setminus I_1}\right) + t\left(\boldsymbol{x}_{I \setminus I_1}\right) \\
+\boldsymbol{x}_{I \backslash I_1} &\mapsto \boldsymbol{x}_{I \backslash I_1}
+\end{aligned}
+```
+Here, $s$ and $t$ are arbitrary functions (often neural networks) called the "scaling" and "translation" functions, respectively.
+They produce vectors of the
+same dimension as $\boldsymbol{x}_{I_1}$.
+
+
+## Implementing Affine Coupling Layer
+
+We start by defining a simple 3-layer multi-layer perceptron (MLP) using `Flux.jl`,
+which will be used to define the scaling function $s$ and the translation function $t$ in the affine coupling layer.
+```@example afc
+using Flux
+
+function MLP_3layer(input_dim::Int, hdims::Int, output_dim::Int; activation=Flux.leakyrelu)
+    return Chain(
+        Flux.Dense(input_dim, hdims, activation),
+        Flux.Dense(hdims, hdims, activation),
+        Flux.Dense(hdims, output_dim),
+    )
+end
+```
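+
+As a quick illustrative check (not part of the tutorial pipeline), the helper returns a `Flux.Chain` that maps vectors of length `input_dim` to vectors of length `output_dim`:
+```julia
+m = MLP_3layer(2, 16, 2)
+size(m(randn(Float32, 2)))   # (2,): the output has length `output_dim`
+```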
+
+#### Construct the Object
+
+Following the user interface of `Bijectors.jl`, we define a struct `AffineCoupling` as a subtype of `Bijectors.Bijector`.
+The functions `partition` and `combine` are used to partition a vector into 3 disjoint subvectors and to recombine them,
+and `PartitionMask` is used to store this partition rule.
+These three utilities are all defined in `Bijectors.jl`; see the [documentation](https://github.com/TuringLang/Bijectors.jl/blob/49c138fddd3561c893592a75b211ff6ad949e859/src/bijectors/coupling.jl#L3) for more details.
+
+```@example afc
+using Functors
+using Bijectors
+using Bijectors: partition, combine, PartitionMask
+
+struct AffineCoupling <: Bijectors.Bijector
+    dim::Int
+    mask::Bijectors.PartitionMask
+    s::Flux.Chain
+    t::Flux.Chain
+end
+
+# to apply functions to the parameters that are contained in AffineCoupling.s and AffineCoupling.t,
+# and to re-build the struct from the parameters, we use the functor interface of `Functors.jl`
+# see https://fluxml.ai/Flux.jl/stable/models/functors/#Functors.functor
+@functor AffineCoupling (s, t)
+
+function AffineCoupling(
+    dim::Int,                 # dimension of input
+    hdims::Int,               # dimension of hidden units for s and t
+    mask_idx::AbstractVector, # index of dimensions that one wants to apply transformations on
+)
+    cdims = length(mask_idx)  # dimension of parts used to construct coupling law
+    s = MLP_3layer(cdims, hdims, cdims)
+    t = MLP_3layer(cdims, hdims, cdims)
+    mask = PartitionMask(dim, mask_idx)
+    return AffineCoupling(dim, mask, s, t)
+end
+```
+By default, we define $s$ and $t$ using the `MLP_3layer` function, which is a
+3-layer MLP with leaky ReLU activation functions.
+
+#### Implement the Forward and Inverse Transformations
+
+
+```@example afc
+function Bijectors.transform(af::AffineCoupling, x::AbstractVector)
+    # partition vector using `af.mask::PartitionMask`
+    x₁, x₂, x₃ = partition(af.mask, x)
+    y₁ = x₁ .* af.s(x₂) .+ af.t(x₂)
+    return combine(af.mask, y₁, x₂, x₃)
+end
+
+function Bijectors.transform(iaf::Inverse{<:AffineCoupling}, y::AbstractVector)
+    af = iaf.orig
+    # partition vector using `af.mask::PartitionMask`
+    y_1, y_2, y_3 = partition(af.mask, y)
+    # inverse transformation
+    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
+    return combine(af.mask, x_1, y_2, y_3)
+end
+```
+
+#### Implement the Log-determinant of the Jacobian
+Notice that here we wrap the transformation and the log-determinant of the Jacobian into a single function, `with_logabsdet_jacobian`.
+
+```@example afc
+function Bijectors.with_logabsdet_jacobian(af::AffineCoupling, x::AbstractVector)
+    x_1, x_2, x_3 = Bijectors.partition(af.mask, x)
+    y_1 = af.s(x_2) .* x_1 .+ af.t(x_2)
+    logjac = sum(log ∘ abs, af.s(x_2))
+    return combine(af.mask, y_1, x_2, x_3), logjac
+end
+
+function Bijectors.with_logabsdet_jacobian(
+    iaf::Inverse{<:AffineCoupling}, y::AbstractVector
+)
+    af = iaf.orig
+    # partition vector using `af.mask::PartitionMask`
+    y_1, y_2, y_3 = partition(af.mask, y)
+    # inverse transformation
+    x_1 = (y_1 .- af.t(y_2)) ./ af.s(y_2)
+    logjac = -sum(log ∘ abs, af.s(y_2))
+    return combine(af.mask, x_1, y_2, y_3), logjac
+end
+```
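+
+Before composing layers into a flow, one might sanity-check the implementation by confirming that the inverse map recovers the input and that the two log-Jacobians cancel (an illustrative check, assuming the definitions above):
+```julia
+af = AffineCoupling(4, 16, 1:2)
+x = randn(Float32, 4)
+y, logjac_fwd = with_logabsdet_jacobian(af, x)
+x_rec, logjac_bwd = with_logabsdet_jacobian(inverse(af), y)
+x_rec ≈ x && logjac_fwd ≈ -logjac_bwd   # should hold up to floating-point error
+```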
+
+#### Construct Normalizing Flow
+
+Now with all the above implementations, we are ready to use the `AffineCoupling` layers to build a normalizing flow
+by applying them to a base distribution $q_0$.
+
+```@example afc
+using Random, Distributions, LinearAlgebra
+dim = 4
+hdims = 10
+Ls = [
+    AffineCoupling(dim, hdims, 1:2),
+    AffineCoupling(dim, hdims, 3:4),
+    AffineCoupling(dim, hdims, 1:2),
+    AffineCoupling(dim, hdims, 3:4),
+]
+ts = reduce(∘, Ls)
+q₀ = MvNormal(zeros(Float32, dim), I)
+flow = Bijectors.transformed(q₀, ts)
+```
+We can now sample from the flow:
+```@example afc
+x = rand(flow, 10)
+```
+And evaluate the density of the flow:
+```@example afc
+logpdf(flow, x[:, 1])
+```
+
+
+## Reference
+Dinh, L., Sohl-Dickstein, J. and Bengio, S., 2016. *Density estimation using Real NVP.*
+arXiv:1605.08803.
\ No newline at end of file
diff --git a/docs/src/elbo.png b/docs/src/elbo.png
new file mode 100644
index 0000000..2de495b
Binary files /dev/null and b/docs/src/elbo.png differ
diff --git a/docs/src/example.md b/docs/src/example.md
new file mode 100644
index 0000000..346c15a
--- /dev/null
+++ b/docs/src/example.md
@@ -0,0 +1,119 @@
+## Example: Using Planar Flow
+
+Here we provide a minimal demonstration of learning a synthetic 2d banana distribution
+using *planar flows* (Rezende *et al.*, 2015) by maximizing the [Evidence Lower Bound (ELBO)](@ref).
+To complete this task, the two key inputs are:
+- the log-density function of the target distribution,
+- the planar flow.
+
+#### The Target Distribution
+
+The `Banana` object is defined in `example/targets/banana.jl`; see the [source code](https://github.com/zuhengxu/NormalizingFlows.jl/blob/main/example/targets/banana.jl) for details.
+```julia
+p = Banana(2, 1.0f-1, 100.0f0)
+logp = Base.Fix1(logpdf, p)
+```
+Visualize the contour of the log-density and a scatter of samples from the target distribution:
+![Banana](banana.png)
+
+
+
+
+#### The Planar Flow
+
+The planar flow is defined by repeatedly applying a sequence of invertible
+transformations to a base distribution $q_0$. The building blocks of a planar flow
+of length $N$ are the following invertible transformations, called *planar layers*:
+```math
+\text{planar layers}:
+T_{n, \theta_n}(x)=x+u_n \cdot \tanh \left(w_n^T x+b_n\right), \quad n=1, \ldots, N,
+```
+where $\theta_n = (u_n, w_n, b_n), n=1, \dots, N$ are the parameters to be learned.
+Thankfully, [`Bijectors.jl`](https://github.com/TuringLang/Bijectors.jl)
+provides a nice framework to define a normalizing flow.
+Here we use the `PlanarLayer()` from `Bijectors.jl` to construct a
+20-layer planar flow, whose base distribution is a 2d standard Gaussian distribution.
+
+```julia
+using Bijectors, FunctionChains
+using Distributions, LinearAlgebra
+using Flux: f32
+
+function create_planar_flow(n_layers::Int, q₀)
+    d = length(q₀)
+    Ls = [f32(PlanarLayer(d)) for _ in 1:n_layers]
+    ts = fchain(Ls)
+    return transformed(q₀, ts)
+end
+
+# create a 20-layer planar flow
+flow = create_planar_flow(20, MvNormal(zeros(Float32, 2), I))
+flow_untrained = deepcopy(flow) # keep a copy of the untrained flow for comparison
+```
+*Notice that here the flow layers are chained together using the `fchain` function from [`FunctionChains.jl`](https://github.com/oschulz/FunctionChains.jl).
+Alternatively, one can do*
+```julia
+ts = reduce(∘, [f32(PlanarLayer(d)) for i in 1:20])
+```
+*However, we recommend using `fchain` to reduce the compilation time when the number of layers is large.
+See [this comment](https://github.com/TuringLang/NormalizingFlows.jl/blob/8f4371d48228adf368d851e221af076ff929f1cf/src/NormalizingFlows.jl#L52)
+for how the compilation time might be a concern.*
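+
+As a small illustrative aside (not needed for the rest of the example), each `PlanarLayer` is itself a `Bijectors` transformation, so it can be applied to points directly:
+```julia
+layer = PlanarLayer(2)                           # a single planar layer: T(x) = x + u * tanh(wᵀx + b)
+x = randn(Float32, 2)
+y = Bijectors.transform(layer, x)                # forward pass through the layer
+y2, logjac = with_logabsdet_jacobian(layer, x)   # forward pass plus log|det ∇T(x)|
+```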
+
+
+#### Flow Training
+Then we can train the flow by maximizing the ELBO using the [`train_flow`](@ref) function as follows:
+```julia
+using NormalizingFlows
+using ADTypes
+using Optimisers
+
+sample_per_iter = 10
+# callback function to track the number of samples used per iteration
+cb(iter, opt_stats, re, θ) = (sample_per_iter=sample_per_iter,)
+# stopping criterion: stop when the gradient norm is less than 1e-3
+checkconv(iter, stat, re, θ, st) = stat.gradient_norm < 1e-3
+flow_trained, stats, _ = train_flow(
+    elbo,
+    flow,
+    logp,
+    sample_per_iter;
+    max_iters=20_000,
+    optimiser=Optimisers.ADAM(),
+    callback=cb,
+    hasconverged=checkconv,
+    ADbackend=AutoZygote(), # using Zygote as the AD backend
+)
+```
+
+Examine the loss values during training:
+```julia
+using Plots
+
+losses = map(x -> x.loss, stats)
+plot(losses; xlabel = "#iteration", ylabel= "negative ELBO", label="", linewidth=2)
+```
+![elbo](elbo.png)
+
+## Evaluating Trained Flow
+Finally, we can evaluate the trained flow by sampling from it and comparing it with the target distribution.
+Since the flow is defined as a `Bijectors.TransformedDistribution`, one can
+easily sample from it using the `rand` function, or examine the density using the `logpdf` function.
+See the [documentation of `Bijectors.jl`](https://turinglang.org/Bijectors.jl/dev/distributions/) for details.
+```julia
+using Random, Distributions
+
+n_samples = 1000
+samples_trained = rand(flow_trained, n_samples) # 1000 iid samples from the trained flow
+samples_untrained = rand(flow_untrained, n_samples) # 1000 iid samples from the untrained flow
+samples_true = rand(p, n_samples) # 1000 iid samples from the target
+
+# plot
+scatter(samples_true[1, :], samples_true[2, :]; label="True Distribution", color=:blue, markersize=2, alpha=0.5)
+scatter!(samples_untrained[1, :], samples_untrained[2, :]; label="Untrained Flow", color=:red, markersize=2, alpha=0.5)
+scatter!(samples_trained[1, :], samples_trained[2, :]; label="Trained Flow", color=:green, markersize=2, alpha=0.5)
+plot!(title = "Comparison of Trained and Untrained Flow", xlabel = "X", ylabel= "Y", legend=:topleft)
+```
+![compare](comparison.png)
+
+
+## Reference
+
+- Rezende, D. and Mohamed, S., 2015. *Variational inference with normalizing flows*. International Conference on Machine Learning.
\ No newline at end of file
diff --git a/docs/src/index.md b/docs/src/index.md
index dfb7aee..dc84076 100644
--- a/docs/src/index.md
+++ b/docs/src/index.md
@@ -2,13 +2,84 @@
 CurrentModule = NormalizingFlows
 ```
 
-# NormalizingFlows
+# NormalizingFlows.jl
 
 Documentation for [NormalizingFlows](https://github.com/TuringLang/NormalizingFlows.jl).
 
-```@index
+
+The purpose of this package is to provide a simple and flexible interface for
+variational inference (VI) and normalizing flows (NF) for Bayesian computation and generative modeling.
+The key focus is to ensure modularity and extensibility, so that users can easily
+construct (e.g., define customized flow layers) and combine various components
+(e.g., choose different VI objectives or gradient estimates)
+for variational approximation of general target distributions,
+*without being tied to specific probabilistic programming frameworks or applications*.
+
+See the [documentation](https://turinglang.org/NormalizingFlows.jl/dev/) for more.
+
+## Installation
+To install the package, run the following command in the Julia REPL:
+```
+] # enter Pkg mode
+(@v1.9) pkg> add git@github.com:TuringLang/NormalizingFlows.jl.git
 ```
+Then simply run the following command to use the package:
+```julia
+using NormalizingFlows
+```
+
+## What are normalizing flows?
-```@autodocs
-Modules = [NormalizingFlows]
+Normalizing flows transform a simple reference distribution $q_0$ (sometimes known as the base distribution) to
+a complex distribution $q_\theta$ using invertible functions with trainable parameters $\theta$, aiming to approximate a target distribution $p$.
+The approximation is achieved by minimizing some statistical distance between $q_\theta$ and $p$.
+
+In more detail, given the base distribution, usually a standard Gaussian distribution, i.e., $q_0 = \mathcal{N}(0, I)$,
+we apply a series of parameterized invertible transformations (called flow layers), $T_{1, \theta_1}, \cdots, T_{N, \theta_N}$, yielding
+```math
+Z_N = T_{N, \theta_N} \circ \cdots \circ T_{1, \theta_1} (Z_0) , \quad Z_0 \sim q_0,\quad Z_N \sim q_{\theta},
+```
+where $\theta = (\theta_1, \dots, \theta_N)$ are the parameters to be learned,
+and $q_{\theta}$ is the transformed distribution (typically called the
+variational distribution or the flow distribution).
+This describes the **sampling procedure** of normalizing flows, which requires
+sending draws from the base distribution through a forward pass of these flow layers.
+
+Since all the transformations are invertible (technically [diffeomorphic](https://en.wikipedia.org/wiki/Diffeomorphism)),
+we can evaluate the density of a normalizing flow distribution $q_{\theta}$ by the change of variables formula:
+```math
+q_\theta(x)=\frac{q_0\left(T_1^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)}{\prod_{n=1}^N J_n\left(T_n^{-1} \circ \cdots \circ
+T_N^{-1}(x)\right)} \quad J_n(x)=\left|\operatorname{det} \nabla_x
+T_n(x)\right|.
+```
+Here we drop the subscript $\theta_n, n = 1, \dots, N$ for simplicity.
+Density evaluation of a normalizing flow requires computing the **inverse** and the
+**Jacobian determinant** of each flow layer.
+
+Given the feasibility of i.i.d. sampling and density evaluation, normalizing
+flows can be trained by minimizing some statistical distance to the target
+distribution $p$. The typical choice of the statistical distance is the forward
+or reverse Kullback-Leibler (KL) divergence, which leads to the following
+optimization problems:
+```math
+\begin{aligned}
+\text{Reverse KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{q_{\theta}}\left[\log q_{\theta}(Z)-\log p(Z)\right] \\
+&= \argmin _{\theta} \mathbb{E}_{q_0}\left[\log \frac{q_\theta(T_N\circ \cdots \circ T_1(Z_0))}{p(T_N\circ \cdots \circ T_1(Z_0))}\right] \\
+&= \argmax _{\theta} \mathbb{E}_{q_0}\left[ \log p\left(T_N \circ \cdots \circ T_1(Z_0)\right)-\log q_0(Z_0)+\sum_{n=1}^N \log J_n\left(T_{n-1} \circ \cdots \circ T_1(Z_0)\right)\right]
+\end{aligned}
+```
+and
+```math
+\begin{aligned}
+\text{Forward KL:}\quad
+&\argmin _{\theta} \mathbb{E}_{p}\left[\log p(Z)-\log q_{\theta}(Z)\right] \\
+&= \argmax _{\theta} \mathbb{E}_{p}\left[\log q_\theta(Z)\right]
+\end{aligned}
+```
+Both problems can be solved via standard stochastic optimization algorithms,
+such as stochastic gradient descent (SGD) and its variants.
+
+
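+Within this package, both objectives are available through a single entry point, [`train_flow`](@ref). A minimal sketch of the typical calling pattern (mirroring the API and Example pages; `flow` and `logp` are assumed to be constructed as described there):
+```julia
+using NormalizingFlows, Optimisers
+
+sample_per_iter = 10
+flow_trained, stats, _ = train_flow(
+    elbo,             # variational objective (reverse KL / ELBO)
+    flow,             # a Bijectors.TransformedDistribution
+    logp,             # log-density of the target, possibly unnormalized
+    sample_per_iter;
+    max_iters=2_000,
+    optimiser=Optimisers.ADAM(0.01),
+)
+```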
diff --git a/src/NormalizingFlows.jl b/src/NormalizingFlows.jl
index 2d70892..16efb89 100644
--- a/src/NormalizingFlows.jl
+++ b/src/NormalizingFlows.jl
@@ -21,19 +21,22 @@ Train the given normalizing flow `flow` by calling `optimize`.
 
 # Arguments
 - `rng::AbstractRNG`: random number generator
 - `vo`: variational objective
-- `flow`: normalizing flow to be trained
+- `flow`: normalizing flow to be trained; we recommend defining `flow` as `<:Bijectors.TransformedDistribution`
 - `args...`: additional arguments for `vo`
 
 # Keyword Arguments
 - `max_iters::Int=1000`: maximum number of iterations
 - `optimiser::Optimisers.AbstractRule=Optimisers.ADAM()`: optimiser to compute the steps
-- `ADbackend::ADTypes.AbstractADType=ADTypes.AutoZygote()`: automatic differentiation backend
-- `kwargs...`: additional keyword arguments for `optimize` (See `optimize`)
+- `ADbackend::ADTypes.AbstractADType=ADTypes.AutoZygote()`:
+  automatic differentiation backend, currently supports
+  `ADTypes.AutoZygote()`, `ADTypes.AutoForwardDiff()`, and `ADTypes.AutoReverseDiff()`.
+- `kwargs...`: additional keyword arguments for `optimize` (See [`optimize`](@ref) for details)
 
 # Returns
 - `flow_trained`: trained normalizing flow
-- `opt_stats`: statistics of the optimiser during the training process (See `optimize`)
+- `opt_stats`: statistics of the optimiser during the training process
+  (See [`optimize`](@ref) for details)
 - `st`: optimiser state for potential continuation of training
 """
 function train_flow(vo, flow, args...; kwargs...)
diff --git a/src/objectives/elbo.jl b/src/objectives/elbo.jl
index 30a491e..68545b5 100644
--- a/src/objectives/elbo.jl
+++ b/src/objectives/elbo.jl
@@ -15,13 +15,13 @@ Compute the ELBO for a batch of samples `xs` from the reference distribution `fl
 
 # Arguments
 - `rng`: random number generator
 - `flow`: variational distribution to be trained. In particular
-    "flow = transformed(q₀, T::Bijectors.Bijector)",
+    `flow = transformed(q₀, T::Bijectors.Bijector)`,
     q₀ is a reference distribution that one can easily sample and compute logpdf
 - `logp`: log-pdf of the target distribution (not necessarily normalized)
 - `xs`: samples from reference dist q₀
 - `n_samples`: number of samples from reference dist q₀
+
 """
-# ELBO based on multiple iid samples
 function elbo(flow::Bijectors.UnivariateTransformed, logp, xs::AbstractVector)
     elbo_values = map(x -> elbo_single_sample(flow, logp, x), xs)
     return mean(elbo_values)
diff --git a/src/objectives/loglikelihood.jl b/src/objectives/loglikelihood.jl
index 8564793..4097ae1 100644
--- a/src/objectives/loglikelihood.jl
+++ b/src/objectives/loglikelihood.jl
@@ -5,7 +5,13 @@ loglikelihood(flow::Bijectors.TransformedDistribution, xs::AbstractVecOrMat)
 
 Compute the log-likelihood for variational distribution flow at a batch of samples xs from
-the target distribution.
+the target distribution p.
+
+# Arguments
+- `flow`: variational distribution to be trained. In particular
+    `flow = transformed(q₀, T::Bijectors.Bijector)`,
+    q₀ is a reference distribution from which one can easily sample and compute the logpdf
+- `xs`: samples from the target distribution p.
 
 """
 function loglikelihood(
diff --git a/src/train.jl b/src/train.jl
index 5edd9a0..3a28635 100644
--- a/src/train.jl
+++ b/src/train.jl
@@ -33,7 +33,8 @@ The result is stored in `out`.
 
 # Arguments
 - `rng::AbstractRNG`: random number generator
-- `ad::ADTypes.AbstractADType`: automatic differentiation backend
+- `ad::ADTypes.AbstractADType`: automatic differentiation backend, currently supports
+  `ADTypes.AutoZygote()`, `ADTypes.AutoForwardDiff()`, and `ADTypes.AutoReverseDiff()`.
 - `vo`: variational objective
 - `θ_flat::AbstractVector{<:Real}`: flattened parameters of the normalizing flow
 - `reconstruct`: function that reconstructs the normalizing flow from the flattened parameters