---
layout: post
title: Variational Lossy Autoencoder
author: Jiaming Song
tags: reading
---

# Variational Lossy Autoencoder

When the decoder is autoregressive, it is easy for the model to ignore the latent code: $$q(z\lvert x)$$ collapses to the prior and the entire representation burden is placed on the decoder $$p(x\lvert z)$$, so that $$q(z\lvert x)$$ carries little information about $$x$$.

## Understanding VAE from Bits-Back Coding Perspective

A VAE can be seen as a way to encode data with a two-part code: first $$z$$ under $$p(z)$$, then $$x$$ under $$p(x\lvert z)$$.

This gives us the naive encoding length:

$$
\mathcal{C}_{naive}(x) = \mathbb{E}_{x\sim p(x), z\sim q(z\lvert x)}[-\log p(z) - \log p(x\lvert z)]
$$
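
As a concrete illustration, the naive code length can be estimated by Monte Carlo for a toy VAE. This is a minimal numpy sketch, assuming diagonal-Gaussian prior, encoder, and decoder; the functions `encoder`, `decoder` and all shapes and parameters are made up for the example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(x, mu, sigma):
    """Diagonal-Gaussian log density, summed over the last axis (nats)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((x - mu) / sigma) ** 2, axis=-1)

# Stand-ins for a trained encoder/decoder (parameters are arbitrary).
def encoder(x):                       # q(z|x), 2-dim latent
    return 0.5 * x[:, :2], 0.8

def decoder(z):                       # p(x|z), 4-dim observation
    return np.concatenate([z, z], axis=-1), 1.0

x = rng.normal(size=(1000, 4))                   # stand-in for data samples
mu_q, sig_q = encoder(x)
z = mu_q + sig_q * rng.normal(size=mu_q.shape)   # z ~ q(z|x)
mu_x, sig_x = decoder(z)

# Naive two-part code: send z under p(z), then x under p(x|z).
naive_code = -gaussian_logpdf(z, 0.0, 1.0) - gaussian_logpdf(x, mu_x, sig_x)
print("naive code length (avg nats per example):", naive_code.mean())
```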

Bits-Back Coding improves on this by noticing that the encoder distribution $$q(z\lvert x)$$ can itself be used to transmit additional information (up to $$\mathcal{H}(q(z\lvert x))$$ expected nats), which is subtracted from the naive code length:

$$
\mathcal{C}_{BB}(x) = \mathbb{E}_{x\sim p(x), z\sim q(z\lvert x)}[\log q(z\lvert x) - \log p(z) - \log p(x\lvert z)] = \mathbb{E}_{x\sim p(x)}[-\mathcal{L}(x)]
$$
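
Under the same kind of toy Gaussian setup as before (again, purely illustrative numbers), the bits-back code length is the naive length minus the encoder's code length, and a single-sample estimate of it coincides with the negative ELBO $$-\mathcal{L}(x)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_logpdf(x, mu, sigma):
    """Diagonal-Gaussian log density, summed over the last axis (nats)."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma ** 2)
                  - 0.5 * ((x - mu) / sigma) ** 2, axis=-1)

x = rng.normal(size=(1000, 4))                        # stand-in for data
mu_q, sig_q = 0.5 * x[:, :2], 0.8                     # q(z|x)
z = mu_q + sig_q * rng.normal(size=mu_q.shape)        # z ~ q(z|x)
mu_x, sig_x = np.concatenate([z, z], axis=-1), 1.0    # p(x|z)

log_qz   = gaussian_logpdf(z, mu_q, sig_q)
log_pz   = gaussian_logpdf(z, 0.0, 1.0)
log_px_z = gaussian_logpdf(x, mu_x, sig_x)

bits_back = log_qz - log_pz - log_px_z      # C_BB per example
elbo      = log_px_z + log_pz - log_qz      # single-sample ELBO estimate
print(np.allclose(bits_back, -elbo))        # True: C_BB(x) = -L(x)
print("bits-back code length (avg nats per example):", bits_back.mean())
```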

This view of the VAE as an efficient coding scheme allows us to reason about when the latent code $$z$$ will be used: only when the two-part code is an efficient code. Indeed,

$$
\begin{aligned}
\mathcal{C}_{BB}(x) &= \mathbb{E}_{x\sim p(x), z\sim q(z\lvert x)}[\log q(z\lvert x) - \log p(z) - \log p(x\lvert z)] \\
&= \mathbb{E}_{x\sim p(x)}[-\log p(x) + \mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))] \\
&\geq \mathbb{E}_{x\sim p(x)}[-\log p_{data}(x) + \mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))] \\
&= \mathcal{H}(p_{data}(x)) + \mathbb{E}_{x\sim p(x)}[\mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))]
\end{aligned}
$$

Since the KL term is always nonnegative, coding with a VAE costs at least $$\mathbb{E}_{x\sim p(x)}[\mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x))]$$ extra nats beyond the optimal code length $$\mathcal{H}(p_{data}(x))$$. If $$p(x\lvert z)$$ can model the data without using any information from $$z$$, then the latent code is not used at all, and the true posterior collapses to the prior $$p(z)$$. Hence, information that can be modeled locally by the decoding distribution $$p(x\lvert z)$$ without access to $$z$$ will be encoded locally, and only the remainder will be encoded in $$z$$.
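
To make the collapse argument explicit (this step is just Bayes' rule, not a claim from the paper): if the decoder ignores $$z$$, i.e. $$p(x\lvert z) = p(x)$$ for every $$z$$, then

$$
p(z\lvert x) = \frac{p(x\lvert z)\, p(z)}{p(x)} = p(z),
$$

so setting $$q(z\lvert x) = p(z)$$ makes $$\mathcal{D}_{KL}(q(z\lvert x) \rVert p(z\lvert x)) = 0$$: ignoring the latent code is exactly what minimizes the extra code length whenever the decoder can model the data on its own.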

## Variational Lossy Autoencoder

There are two ways to exploit this insight:

  1. Use explicit information placement to restrict the receptive field of the autoregressive decoder, thereby forcing the model to put global information into the latent code $$z$$ (a masked-convolution sketch of this idea follows the list).
  2. Parametrize the prior distribution with an autoregressive flow, showing that this type of autoregressive latent code can reduce the inefficiency of Bits-Back coding.
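
A minimal PyTorch sketch of the first idea, restricting the decoder's receptive field with PixelCNN-style masked convolutions. `MaskedConv2d`, `LocalARDecoder`, and all layer sizes here are illustrative assumptions, not the architecture used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """PixelCNN-style masked convolution: each output pixel only depends on
    pixels above it and to its left (plus itself for mask type 'B')."""
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        kH, kW = self.kernel_size
        mask = torch.zeros(1, 1, kH, kW)
        mask[:, :, :kH // 2, :] = 1           # rows strictly above the center
        mask[:, :, kH // 2, :kW // 2] = 1     # same row, strictly left of center
        if mask_type == "B":
            mask[:, :, kH // 2, kW // 2] = 1  # type B also sees the center pixel
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding)

class LocalARDecoder(nn.Module):
    """Shallow stack of small masked convolutions: the autoregressive decoder
    only sees a small local window, so global structure has to come from z."""
    def __init__(self, z_channels=4, hidden=32):
        super().__init__()
        self.in_conv = MaskedConv2d("A", 1, hidden, kernel_size=3, padding=1)
        self.z_proj  = nn.Conv2d(z_channels, hidden, kernel_size=1)  # z is not masked
        self.mid     = MaskedConv2d("B", hidden, hidden, kernel_size=3, padding=1)
        self.out     = nn.Conv2d(hidden, 256, kernel_size=1)  # logits over 256 pixel values

    def forward(self, x, z_map):
        h = torch.relu(self.in_conv(x) + self.z_proj(z_map))
        h = torch.relu(self.mid(h))
        return self.out(h)

x_img  = torch.rand(8, 1, 28, 28)          # grayscale images in [0, 1]
z_map  = torch.randn(8, 4, 28, 28)         # latent code broadcast to a spatial map
logits = LocalARDecoder()(x_img, z_map)    # (8, 256, 28, 28)
```

With only two small masked layers, the decoder's causal context is a few pixels wide, so anything beyond that local window can only be communicated through $$z$$; this is one concrete way to realize the explicit information placement described in point 1.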