---
layout: post
title: Variational Lossy Autoencoder
author: Jiaming Song
tags:
---
In autoregressive structures, it is easy for the model to ignore the latent code by simply matching the approximate posterior to the prior distribution and putting the entire representational burden on the autoregressive decoder.
A VAE can be seen as a way to encode data in a two-part code: the latent code $$z$$, described by the prior $$p(z)$$, and the remaining information needed to reconstruct $$x$$ from $$z$$, described by the decoder $$p(x\lvert z)$$.
This gives us the naive encoding length:
$$
\mathcal{C}_{naive}(x) = \mathbb{E}_{x\sim p(x),\, z\sim q(z\lvert x)}[-\log p(z) - \log p(x\lvert z)]
$$
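To make the naive code length concrete, here is a minimal Monte-Carlo sketch (not from the original post; it assumes hypothetical `encoder`/`decoder` callables, a standard Gaussian prior, and a Bernoulli decoder over binarized pixels):

```python
import torch
import torch.distributions as D

def naive_code_length(x, encoder, decoder):
    """Estimate C_naive(x) = E_{z~q(z|x)}[-log p(z) - log p(x|z)] in nats.

    Assumed (hypothetical) interfaces:
      encoder(x) -> (mu, log_sigma) parametrizing the Gaussian q(z|x)
      decoder(z) -> Bernoulli logits with the same shape as x
    """
    mu, log_sigma = encoder(x)
    q = D.Normal(mu, log_sigma.exp())
    z = q.rsample()                                    # z ~ q(z|x)

    prior = D.Normal(torch.zeros_like(mu), torch.ones_like(mu))
    nll_prior = -prior.log_prob(z).sum(dim=-1)         # -log p(z)
    nll_recon = -D.Bernoulli(logits=decoder(z)).log_prob(x).sum(dim=-1)  # -log p(x|z)

    return (nll_prior + nll_recon).mean()              # average over the batch
```

This scheme wastes the randomness spent drawing $$z$$: the encoder entropy buys the sender nothing, which is exactly the inefficiency Bits-Back coding removes.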
Bits-Back Coding improves on this by noticing that the bits used to sample $$z$$ from the encoder distribution $$q(z\lvert x)$$ can carry a secondary message, which the receiver can recover after decoding $$x$$; those bits are "gotten back", reducing the effective code length to:
$$
\mathcal{C}_{BB}(x) = \mathbb{E}_{x\sim p(x),\, z\sim q(z\lvert x)}[\log q(z\lvert x) - \log p(z) - \log p(x\lvert z)] = \mathbb{E}_{x\sim p(x)}[-\mathcal{L}(x)]
$$
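Continuing the same hypothetical setup (and imports) as the sketch above, adding the $$\log q(z\lvert x)$$ term turns the naive length into the Bits-Back length, which is exactly the negative ELBO $$-\mathcal{L}(x)$$:

```python
def bits_back_code_length(x, encoder, decoder):
    """Estimate C_BB(x) = E_{z~q(z|x)}[log q(z|x) - log p(z) - log p(x|z)] = -ELBO."""
    mu, log_sigma = encoder(x)
    q = D.Normal(mu, log_sigma.exp())
    z = q.rsample()

    prior = D.Normal(torch.zeros_like(mu), torch.ones_like(mu))
    log_qzx = q.log_prob(z).sum(dim=-1)                               # log q(z|x)
    log_pz = prior.log_prob(z).sum(dim=-1)                            # log p(z)
    log_pxz = D.Bernoulli(logits=decoder(z)).log_prob(x).sum(dim=-1)  # log p(x|z)

    # The E_q[-log q(z|x)] nats spent sampling z are "gotten back" by the receiver.
    return (log_qzx - log_pz - log_pxz).mean()
```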
This view of the VAE as an efficient coding scheme allows us to reason about when the latent code will be used or ignored. Expanding the Bits-Back code length:
$$
\begin{aligned}
\mathcal{C}_{BB}(x) & = \mathbb{E}_{x\sim p(x),\, z\sim q(z\lvert x)}[\log q(z\lvert x) - \log p(z) - \log p(x\lvert z)] \\
& = \mathbb{E}_{x\sim p(x)} [-\log p(x) + \mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))] \\
& \geq \mathbb{E}_{x\sim p(x)} [-\log p_{data}(x) + \mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))] \\
& = \mathcal{H}(p_{data}(x)) + \mathbb{E}_{x\sim p(x)} [\mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))]
\end{aligned}
$$
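One step worth spelling out (my addition, not in the original derivation): the inequality in the third line holds because the model marginal $$p(x)$$ cannot, on average, assign shorter codes to the data than the data distribution itself,

$$
\mathbb{E}_{x\sim p_{data}(x)}[-\log p(x)] = \mathcal{H}(p_{data}(x)) + \mathcal{D}_{KL}(p_{data}(x) \rVert p(x)) \geq \mathbb{E}_{x\sim p_{data}(x)}[-\log p_{data}(x)],
$$

where the expectations written $$x\sim p(x)$$ above are likewise taken over the data distribution.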
Since the KL term in the last line is always non-negative, coding with the VAE's approximate posterior costs at least an extra $$\mathbb{E}_{x\sim p(x)}[\mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))]$$ nats on top of the optimal code length $$\mathcal{H}(p_{data}(x))$$. If the autoregressive decoder alone is powerful enough to model $$x$$, the cheapest solution is to put no information into $$z$$ at all, which is exactly the latent-ignoring behavior described above.
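In practice this shows up as posterior collapse, which is easy to monitor: with the Gaussian encoder assumed in the sketches above, the KL from $$q(z\lvert x)$$ to the prior has a closed form, and values near zero mean the decoder is ignoring $$z$$ (again a sketch of my own, not the paper's code):

```python
def kl_to_prior(x, encoder):
    """Closed-form KL(q(z|x) || N(0, I)) per example; near-zero values indicate
    that the latent code carries no information and is being ignored."""
    mu, log_sigma = encoder(x)
    var = (2 * log_sigma).exp()
    kl = 0.5 * (mu.pow(2) + var - 2 * log_sigma - 1.0)   # per-dimension KL
    return kl.sum(dim=-1)                                # nats per example
```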
There are two ways to utilize this observation:

- Use explicit information placement: restrict the receptive field of the autoregressive decoder so that it can only model local statistics, thereby forcing the model to use the latent code $$z$$, which is globally provided.
- Parametrize the prior distribution with an autoregressive model, showing that this type of autoregressive latent code can reduce the inefficiency in Bits-Back coding (a sketch follows below).
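For the second point, here is a minimal sketch of what an autoregressive prior looks like at density-evaluation time (my own illustration; `ar_net` stands for an assumed masked/causal network, e.g. MADE-style, whose output for dimension $$i$$ depends only on $$z_{<i}$$):

```python
import torch
import torch.distributions as D

def autoregressive_prior_log_prob(z, ar_net):
    """log p(z) = sum_i log p(z_i | z_{<i}) under a Gaussian autoregressive prior."""
    mu, log_sigma = ar_net(z)            # causal: (mu_i, log_sigma_i) depend only on z_{<i}
    p = D.Normal(mu, log_sigma.exp())
    return p.log_prob(z).sum(dim=-1)     # joint log-density of the latent code
```

The idea of the bullet above is that such a richer prior shrinks the gap that Bits-Back coding pays for as extra code length.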