\documentclass[11pt]{article}
\usepackage{geometry}
\geometry{letterpaper}
\usepackage{graphicx}
\usepackage{amssymb}
\usepackage{epstopdf}
\usepackage{amsmath}
\usepackage{filecontents}
\begin{filecontents}{\jobname.bib}
@article{Hinton:2002,
author = {Hinton, Geoffrey E.},
title = {Training Products of Experts by Minimizing Contrastive Divergence},
journal = {Neural Comput.},
issue_date = {August 2002},
volume = {14},
number = {8},
month = aug,
year = {2002},
issn = {0899-7667},
pages = {1771--1800},
numpages = {30},
url = {http://dx.doi.org/10.1162/089976602760128018},
doi = {10.1162/089976602760128018},
acmid = {639730},
publisher = {MIT Press},
address = {Cambridge, MA, USA},
}
@inproceedings{carreiraperpinan:2005,
author = {Carreira-Perpi{\~n}{\'a}n, Miguel A. and Hinton, Geoffrey E.},
title = {On Contrastive Divergence Learning},
booktitle = {Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS)},
address = {Barbados},
year = {2005},
}
\end{filecontents}
\begin{document}
\begin{center}
{\LARGE On Contrastive Divergence}
\end{center}
Contrastive Divergence (CD) is an approximate Maximum Likelihood (ML) learning algorithm proposed by Geoffrey Hinton \cite{Hinton:2002, carreiraperpinan:2005}.
% todo:
% - cleanup notation, e.g. average / expectation.
% - rewrite with gibbs dist and modified energy function p_\theta(x) = \frac{1}{Z(\theta)}\exp\[\mathcal{E}_\theta(x)\]
% - more precise relation to KL
% - add http://www.hongliangjie.com/2012/07/12/maximum-likelihood-as-minimize-kl-divergence/
% - add https://www.quora.com/Why-does-contrastive-divergence-minimize-the-difference-of-two-Kullback-Leibler-divergences
% - http://www.gatsby.ucl.ac.uk/~turner/Notes/ContrastiveDivergence/CDv3.pdf
\section{Why CD?}
We would like to model the probability of a data point $x$ using a function of the form $f_\theta(x)$, where $\theta$ denotes a vector of model parameters. The probability of $x$, $p_\theta(x)$, must integrate to $1$ over all $x$; therefore:
\begin{equation}
p_\theta(x) = \frac{1}{Z(\theta)}f_\theta(x)
\end{equation}
where $Z(\theta)$, known as the normalizing constant or partition function, is defined as
\begin{equation}
Z(\theta) = \int f_\theta(x) dx.
\label{eq:partition}
\end{equation}
Imagine that we are given a set of training data points $X=\{x_1,\ldots,x_N\}$. We can learn our model parameters $\theta$ by maximizing the likelihood of the training set $X$:
\begin{equation}
\mathcal{L} (\theta) = \prod_{i=1}^N \frac{1}{Z(\theta)} f_\theta(x_i),
\end{equation}
or, equivalently, by minimizing the negative $\log$ of $\mathcal{L}(\theta)$, averaged over the training set, which we shall call the energy:
\begin{equation}
\mathcal{E}_\theta(X) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log f_\theta(x_i).
\label{eq:energy}
\end{equation}
First, let us choose our probability model function $f_\theta(x)$ to be the pdf of a normal distribution $x\sim\mathcal{N}(\mu, \sigma)$ where $\theta = \{\mu,\sigma\}$:
\begin{equation}
f_\theta(x) = \mathcal{N}(x; \mu, \sigma).
\end{equation}
By Kolmogorov's second axiom of probability, $\int f_\theta(x) dx = 1$, so that $\log Z(\theta) = 0$. Differentiation of (\ref{eq:energy}) with respect to $\mu$ shows that the optimal $\mu$ is the mean of the training set $X$:
\begin{gather*}
\frac{\partial}{\partial\mu} \mathcal{E}_\theta(X) = \frac{\partial}{\partial\mu} \log \int f_\theta(x) dx - \frac{1}{N} \sum_{i=1}^N \frac{\partial}{\partial\mu} \log f_\theta(x_i)
\propto \frac{1}{2} \sum_{i=1}^N \frac{\partial}{\partial\mu} (x_i-\mu)^2,\\
\frac{\partial}{\partial\mu} \mathcal{E}_\theta(X)\big\rvert_{\hat{\mu}} = 0 \implies \hat{\mu} = \frac{1}{N} \sum_{i=1}^N x_i.
\end{gather*}
Similarly, differentiation of (\ref{eq:energy}) with respect to $\sigma$ shows that the optimal $\sigma$ is the square root of the variance of the training set $X$. Sometimes, as in this case, an optimization method exists that minimizes our particular energy function exactly. If we imagine our energy function to be an undulating landscape whose lowest point we wish to find, then we could draw the metaphor of standing in this field on a clear, sunny day, seeing the lowest point and walking straight to it.
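As a quick numerical check of these closed-form estimates, the following minimal sketch (the synthetic data set, its parameters, and the use of NumPy are illustrative assumptions, not part of the derivation) compares $\hat{\mu}$ and $\hat{\sigma}$ against the statistics of a training set:
\begin{verbatim}
# Minimal numerical check of the closed-form ML estimates for a
# single normal distribution (illustrative choices throughout).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=3.0, size=10000)  # synthetic training set

mu_hat = X.mean()              # optimal mu: mean of the training set
sigma_hat = np.sqrt(X.var())   # optimal sigma: sqrt of the variance
print(mu_hat, sigma_hat)       # both close to 2.0 and 3.0
\end{verbatim}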
Now, let us choose our probability model function $f_\theta(x)$ to be the sum of $K$ normal distributions where $\theta = \{(\mu_k, \sigma_k) : k=1,\ldots,K \}$:
\begin{equation}
f_\theta(x) = \sum_{k=1}^K \mathcal{N}(x; \mu_k, \sigma_k).
\label{eq:mixture}
\end{equation}
We usually refer to (\ref{eq:mixture}) as a mixture or sum-of-experts model. Again, using the fact that each normal distribution integrates to 1, we can inspect (\ref{eq:partition}) to see that $\log Z(\theta) = \log K$. However, differentiation of (\ref{eq:energy}) with respect to any one model parameter results in an equation that depends on the other model parameters, so we cannot calculate the optimal parameters directly, i.e. in closed form. Instead, we may use the partial derivatives together with a gradient descent method with line search to find a local minimum of the energy in the parameter space.
Returning to the metaphor of an undulating landscape, we could say that gradient descent with line search is equivalent to being in the field at night with a torch. We can either feel the gradient of the field at the point where we are standing, or estimate it by using the torch to see the relative height of the field a short distance in each direction from us (i.e. numerical differentiation using finite differences). Then, by shining the beam of the torch in our chosen direction of travel, we can see the lowest point in the field in that direction. We walk to that point, and iteratively choose a new direction and distance to walk.
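The metaphor can be made concrete with a minimal sketch (the data, initial parameters, tolerances, and the use of NumPy/SciPy are all illustrative assumptions): we minimize the energy of the $K=2$ mixture, with central finite differences standing in for the felt gradient and a crude backtracking line search standing in for the torch beam.
\begin{verbatim}
# Sketch: finite-difference gradient descent with backtracking line
# search on the mixture energy (log Z = log K is constant here).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 500),
                    rng.normal(3.0, 1.0, 500)])
K = 2

def energy(theta):
    # E(theta) = log K - (1/N) sum_i log f(x_i), f a sum of K normals
    mus, sigmas = theta[:K], np.abs(theta[K:])
    f = norm.pdf(X[:, None], mus, sigmas).sum(axis=1)
    return np.log(K) - np.log(f).mean()

def grad(theta, h=1e-5):
    # central finite differences: "feeling" the slope in each direction
    g = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        g[j] = (energy(theta + e) - energy(theta - e)) / (2.0 * h)
    return g

theta = np.array([-1.0, 1.0, 1.0, 1.0])  # [mu_1, mu_2, sigma_1, sigma_2]
for _ in range(200):
    g, step = grad(theta), 1.0
    while step > 1e-12 and energy(theta - step * g) >= energy(theta):
        step *= 0.5                      # crude backtracking line search
    theta -= step * g
print(theta)   # the means should settle near -2 and 3 for this run
\end{verbatim}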
Finally, let us choose our probability model function $f_\theta(x)$ to be the product of $K$ normal distributions where $\theta = \{(\mu_k, \sigma_k) : k=1,\ldots,K \}$:
\begin{equation}
f_\theta(x) = \prod_{k=1}^K \mathcal{N}(x; \mu_k, \sigma_k).
\label{eq:product}
\end{equation}
We usually refer to (\ref{eq:product}) as a product-of-experts model. Note that the partition function $Z(\theta)$ is no longer a constant. Consider $K=2$ normal distributions with fixed $\sigma_1=\sigma_2=1$. As $\mu_1\to-\infty$ and $\mu_2\to\infty$, $Z(\theta) \to 0$. If $\mu_1=\mu_2=0$, then $Z(\theta) = \frac{1}{2\sqrt{\pi}}$ (verified in the short computation below). While, in this case, it is still possible to compute the partition function exactly given $\theta$, let us imagine that the integration in (\ref{eq:partition}) is not algebraically tractable. Then we would need numerical integration to evaluate (\ref{eq:energy}), finite differences to compute the gradient in parameter space, and gradient descent to find a local energy minimum. For a high-dimensional data space, computing the integral numerically is expensive; a high-dimensional parameter space compounds the problem. This leads to a situation in which we are trying to minimize an energy function that we cannot even evaluate.
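The value $Z(\theta)=\frac{1}{2\sqrt{\pi}}$ quoted above follows from a standard Gaussian integral: the product of two standard normal densities centred at zero is $\frac{1}{2\pi}e^{-x^2}$, so
\begin{equation*}
Z(\theta) = \int \mathcal{N}(x;0,1)\,\mathcal{N}(x;0,1)\,dx = \frac{1}{2\pi}\int e^{-x^2}\,dx = \frac{\sqrt{\pi}}{2\pi} = \frac{1}{2\sqrt{\pi}}.
\end{equation*}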
This is where CD comes into play as a remedy. Even though we are not able to evaluate the energy function itself, CD provides a way to estimate the gradient of the energy function. Returning to our field metaphor one last time, we find ourselves in the field without any light whatsoever (i.e. we cannot evaluate the energy), so we cannot establish the height of any point in the field relative to our own. CD gives us a sense of balance which allows us to feel the gradient of the field under our feet. By taking very small steps in the direction of steepest descent, we will eventually find our way to a local minimum.
\section{How does CD work?}
Consider a probability distribution over a vector $x$ with model parameters $\theta$
\begin{equation}
p_\theta(x) = \frac{1}{Z(\theta)} f_\theta(x)
\end{equation}
where $Z(\theta)$, known as the normalizing constant or partition function, is defined as
\begin{equation}
Z(\theta) = \int f_\theta(x) dx.
\end{equation}
Maximum Likelihood (ML) learning of parameters $\theta$ given i.i.d. samples $X=\{x_1,\ldots,x_N\}$ may be done by gradient ascent:
\begin{equation}
\theta^{(\tau+1)} = \theta^{(\tau)} + \eta \frac{\partial\mathcal{L}(\theta)}{\partial \theta} \bigg\vert_{\theta^{(\tau)}}
\end{equation}
where $\eta$ denotes the learning rate and $\mathcal{L}(\theta)$ is known as the average log-likelihood
\begin{equation}
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i).
\end{equation}
We want to maximize the likelihood or, equivalently, minimize the negative of $\mathcal{L}(\theta)$, which we shall call the energy:
\begin{equation}
\mathcal{E}_\theta(X) = \log Z(\theta) - \frac{1}{N} \sum_{i=1}^N \log f_\theta(x_i).
\label{eq:energy2}
\end{equation}
To derive the learning rule, we first write down the partial derivative of (\ref{eq:energy2}):
\begin{align}
\frac{\partial\mathcal{E}_\theta(X)}{\partial\theta}
&= \frac{\partial\log Z(\theta)}{\partial\theta} - \frac{1}{N}\sum_{i=1}^N\frac{\partial\log f_\theta(x_i)}{\partial\theta}\\
&= \frac{\partial\log Z (\theta)}{\partial\theta} - \left\langle\frac{\partial\log f_\theta (x)}{\partial\theta}\right\rangle_X
\label{eq:gradient}
\end{align}
where $\langle\cdot\rangle_X$ is the expectation with respect to the data distribution of $X$. The first term on the right-hand side comes from the partition function $Z(\theta)$, which, as in equation (\ref{eq:partition}), involves an integration over $x$. Expanding its derivative, we get
\begin{align}
\frac{\partial\log Z(\theta)}{\partial\theta}
&= \frac{1}{Z(\theta)}\frac{\partial Z(\theta)}{\partial\theta}\\
&= \frac{1}{Z(\theta)}\frac{\partial}{\partial\theta}\int f_\theta(x)dx\\
&= \frac{1}{Z(\theta)}\int \frac{\partial f_\theta(x)}{\partial\theta} dx\\
&= \frac{1}{Z(\theta)}\int f_\theta(x)\frac{\partial\log f_\theta(x)}{\partial\theta}dx\\
&= \int p_\theta(x) \frac{\partial\log f_\theta(x)}{\partial\theta}dx\\
&= \left\langle \frac{\partial\log f_\theta (x)}{\partial\theta} \right\rangle_{p_\theta(x)}. \label{eq:logZ}
\end{align}
In general, this integration is algebraically intractable. However, in the form of (\ref{eq:logZ}), the expectation can be numerically approximated by drawing samples from the proposed distribution $p_\theta(x)$. Samples cannot be drawn directly from $p_\theta(x)$, since we cannot evaluate the partition function, but we may use many cycles of Markov Chain Monte Carlo (MCMC) sampling to transform our training data (drawn from the target distribution) into data drawn from the proposed distribution. This is possible because the transformation only involves the ratio of two probabilities, $p_\theta(x') / p_\theta(x)$, in which the partition function cancels out.
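To illustrate the cancellation, here is a minimal sketch of one Metropolis-style MCMC cycle (one common choice of sampler; the text does not prescribe a particular MCMC scheme, and the proposal width is an arbitrary assumption), applied independently to each training point:
\begin{verbatim}
# Sketch: one Metropolis MCMC cycle applied to each training point.
# Acceptance uses only log f(x') - log f(x), i.e. the ratio
# p(x')/p(x), so the partition function Z(theta) never appears.
import numpy as np

def mcmc_cycle(X, log_f, step=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    Xp = X + rng.normal(0.0, step, size=X.shape)  # propose x' near x
    accept = np.log(rng.uniform(size=X.shape)) < log_f(Xp) - log_f(X)
    return np.where(accept, Xp, X)                # X^{n+1} from X^n
\end{verbatim}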
Let $X^n$ denote the transformed training data using $n$ cycles of MCMC, such that $X^0 = X$. Putting this back into (\ref{eq:logZ}), we get:
\begin{equation}
\frac{\partial\mathcal{E}_\theta (X)}{\partial\theta} = \left\langle \frac{\partial\log f_\theta(x)}{\partial\theta} \right\rangle_{X^\infty} - \left\langle \frac{\partial\log f_\theta (x)}{\partial\theta} \right\rangle_{X^0}
\end{equation}
where $\langle\cdot\rangle_{X^\infty}$ denotes the expectation with respect to the model distribution $p_\infty(x) = p_\theta(x)$ and $\langle\cdot\rangle_{X^0}$ denotes the expectation with respect to the data distribution $p_0(x) = \frac{1}{N}\sum_{i=1}^N \delta ( x-x_i )$.
Lastly, we still face the computational hurdle that many MCMC cycles are required to compute an accurate estimate of the gradient. Hinton's assertion was that only a few MCMC cycles are needed to calculate an approximate gradient. The intuition behind this is that after a few cycles the data will have moved from the target distribution (i.e. that of the training data) towards the proposed distribution, and so indicate in which direction the proposed distribution should move to better model the training data. Empirically, Hinton found that even a single cycle of MCMC is often sufficient for the algorithm to converge to the ML answer.
As such, bearing in mind that we wish to go downhill in order to minimize our energy function, our parameter update equation can be written as:
\begin{equation}
\theta^{(\tau+1)} = \theta^{(\tau)} - \eta \left( \left\langle \frac{\partial\log f_\theta(x)}{\partial\theta} \right\rangle_{X^1} - \left\langle \frac{\partial\log f_\theta (x)}{\partial\theta} \right\rangle_{X^0} \right)
\end{equation}
where $\eta$ is the step size.
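Putting the pieces together for the product-of-experts example of the first section, here is a minimal CD-1 sketch (the toy data, proposal width, learning rate, and fixed $\sigma_k=1$ are all illustrative assumptions, not Hinton's setup); for this model, $\partial \log f_\theta(x) / \partial \mu_k = (x - \mu_k)/\sigma_k^2$:
\begin{verbatim}
# Sketch: CD-1 updates for the means of a product of K = 2 normals
# with sigma_k fixed to 1 (toy setup for illustration only).
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal(0.5, 0.4, size=1000)   # training data, X^0
mu = np.array([-1.0, 2.0])             # initial means
eta = 0.1                              # step size

def log_f(x, mu):
    # log of the unnormalized model: sum_k log N(x; mu_k, 1) + const
    return -0.5 * ((x[:, None] - mu) ** 2).sum(axis=1)

def avg_dlogf_dmu(x, mu):
    # <d log f / d mu_k> = <x - mu_k>, averaged over the samples
    return (x[:, None] - mu).mean(axis=0)

for _ in range(500):
    # one Metropolis cycle transforms X^0 into X^1
    Xp = X0 + rng.normal(0.0, 0.5, size=X0.shape)
    acc = np.log(rng.uniform(size=X0.shape)) < log_f(Xp, mu) - log_f(X0, mu)
    X1 = np.where(acc, Xp, X0)
    # downhill step along <dlogf>_{X^1} - <dlogf>_{X^0}
    mu -= eta * (avg_dlogf_dmu(X1, mu) - avg_dlogf_dmu(X0, mu))
print(mu, mu.mean())   # the model mean (mu_1 + mu_2)/2 drifts
                       # toward the data mean, roughly 0.5
\end{verbatim}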
\section{Why the name?}
Maximum likelihood learning minimizes the Kullback-Leibler divergence:
\begin{equation}
KL\left(p_0 \Vert p_\infty \right) = \int p_0(x) \log\frac{p_0(x)}{p_\infty(x)} dx.
\end{equation}
Contrastive divergence learning approximately follows the gradient of the difference of two divergences:
\begin{equation}
CD_n = KL\left(p_0 \Vert p_\infty \right) - KL\left(p_n \Vert p_\infty \right).
\end{equation}
In CD learning, we start the Markov chain at the data distribution $p_0$ and run the chain for only a small number $n$ of steps (e.g. $n=1$). This greatly reduces both the computation per gradient step and the variance of the estimated gradient.
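Differentiating $CD_n$ with respect to $\theta$ recovers the update rule of the previous section, up to one additional term, which Hinton reports to be negligible in simulations and therefore ignores (the form of the neglected term follows Hinton's derivation \cite{Hinton:2002}):
\begin{equation*}
\frac{\partial CD_n}{\partial\theta} = \left\langle \frac{\partial\log f_\theta(x)}{\partial\theta} \right\rangle_{X^n} - \left\langle \frac{\partial\log f_\theta(x)}{\partial\theta} \right\rangle_{X^0} - \frac{\partial p_n}{\partial\theta}\,\frac{\partial KL\left(p_n \Vert p_\infty\right)}{\partial p_n}.
\end{equation*}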
\bibliographystyle{plain}
\bibliography{\jobname}
\end{document}