\chapter{Chapter 2\newline Introduction to Bayesian $n$-gram Language Modelling}\label{chap:introblm}
\newthought{In the previous chapter we introduced language models} in which the word probabilities are based on the underlying fixed, but unknown, (joint) distributions from which we assume that we sample our training examples in a frequentist manner. Those distributions are estimated by simply counting $n$-grams. If the corpus is too small compared to the model complexity, this yields unreliable counts; smoothing is a technique to rectify these counts. The downside of most smoothing techniques is that they do not rely on a statistical model, but merely on heuristics such as discounting, interpolation, and back-off schemes.
In this chapter we introduce the Bayesian counterpart to language modelling, where we use the posterior distribution to estimate the underlying distribution. We use an approach that allows the model complexity to grow with the corpus size, so as to avoid the unreliable counts often found with MLE, and to avoid statistical artefacts caused by overfitting.
Bayesian language models have a basis that is firmly rooted in probability theory. The literature contains many metaphors to simplify the mathematical concepts. The goal of this chapter is twofold. First, we want to demystify these metaphors by introducing them in their simplest form, and from there extending them to the variants often found in the literature. Second, we show how one can build language models with these concepts, and why these concepts are a good basis for language models. Although this chapter requires some basic knowledge of probability theory, we provide explanations and references where this might be useful.
After reading this chapter, you should be familiar with the Chinese restaurant process, the Hoppe-P\'olya urn, and the stick-breaking process, and with how they can be used as priors in mixture models such as the Dirichlet process mixture model and the nested Pitman-Yor process mixture model, which is the Bayesian $n$-gram language model that we use throughout this thesis. We introduce a nested Pitman-Yor process language model in the next chapter, for which this introductory chapter serves as groundwork.
%
\marginnote{Central to this chapter is the theorem of Bayes, which is a powerful way to relate current belief to prior belief, e.g.~by using new evidence to update prior beliefs into current beliefs. In its simplest form, Bayes' theorem can be stated as:
\begin{align}
p(A|B) = \frac{p(B|A)p(A)}{p(B)}.
\end{align}
Here $A$ is the proposition and $B$ the evidence. $p(A)$ is the prior probability, the initial degree of belief in $A$, and $p(A|B)$ the posterior probability, the degree of belief in $A$ given the evidence $B$.}
In this chapter we look at the priors and processes underlying a Bayesian language model. Most studies on the topic consider them either from a bottom-up approach or from a top-down approach. The first is most useful for readers with a strong background in probability theory, whereas the latter is more convenient if the reader has already (if only vaguely) heard about the priors and processes involved. The approach of this chapter is to make the priors and processes understandable for the reader who has neither.
We set out to devise a language model within a non-parametric Bayesian (BNP) framework. The BNP approach is an alternative to parametric modelling and model selection.\footnote[][1em]{Model selection is the task of selecting a model from a set of candidate models. Given a data set, the goal is to select the model with the best explanatory power, and preferably also the simplest model. Well-known criteria are cross-validation\autocite{Kohavi1995A} and the Akaike information criterion\autocite{Bozdogan1987Model}. For a more general introduction see\autocite{Burnham2002Model}.}
In many statistical inference problems, we expect that structures or patterns continue to emerge as data accrue, perhaps even without limit. We want a modelling framework that supplies a growing, unbounded number of degrees of freedom to the data analysis. However, if we allow the degrees of freedom to accrue too quickly, we risk finding structures that are statistical artefacts; a typical example of such an artefact is overfitting. Overfitting is a serious problem,\footnote{Overfitting can also be used as a tool, for example in text generation where you may want to force certain language.} and it motivates the Bayesian aspect of non-parametrics. In many parametric paradigms, a number of parameters needs to be fixed in advance, typically chosen to optimise some criterion. However, more, or different, data might already require a different number of parameters.
Although Bayesian methods are by no means immune to overfitting, they provide a natural resilience to it. By using a model with an unbounded complexity, underfitting is mitigated, while computing, or approximating, the full posterior over parameters mitigates overfitting.
A common misconception about non-parametrics is that it means that there are no parameters. Instead, it means that the model is not parametric, which in turn means that the number of parameters is not fixed once and for all. If the amount of training data is fixed,\footnote{For example in the classic scenario where you only have one batch of training data.} there is no difference between parametric and non-parametric models. The difference emerges when new training data is added. The non-parametric model can adapt to the influx of new data points, while the parametric model does not have the flexibility to add parameters. Thus, the non-parametric framework is not opposed to parameters; to the contrary: the framework can be viewed as allowing an infinite\footnote{With infinite we henceforth mean countably infinite, or more precisely, a finite but unbounded number, unless stated otherwise.} number of parameters, or rather, as allowing a finite but unbounded number of parameters.
The chapter is set up as a continuous explanation of how to build a language model within a Bayesian non-parametric framework, organised along several subtopics. Many of the concepts are introduced in passing, but we have tried to separate them into sections as well. If a concept is explained in more detail in another section, we provide a reference. Although it is customary in the Bayesian literature to first specify a generative model, we start from an example and only arrive at the generative model at the end.
\section{A Bayesian Approach to Unigram Language Modelling}
Different from the standard frequentist approach to language modelling, which in essence is based on the maximum likelihood estimate of a pattern, in a Bayesian approach we assume that the pattern comes from a certain distribution and that the texts are generated according to some underlying stochastic process. This process is latent,\footnote{If data is generated by some process, but we cannot (directly) observe this process, we call the process a latent process.} and the same process can generate different texts. The frequentist approach is to model the observed texts, and to estimate the probabilities of unseen text based on these observations. The Bayesian way is to infer the underlying process and its pattern distributions. Of course any process could be posited, and if we choose a process, there is no guarantee that it is the best one. One of the most widely used processes to describe such underlying models for textual data is the Chinese restaurant process, which we introduce in the next section. We show that some of its properties differ from real-life observations, and how we can adapt the process to model this real-life behaviour.
Just as we need to have a unigram language model to build an $n$-gram language model, for our Bayesian approach we have to start with a unigram language model as well.
Assume we have a data set $\mathcal{D}$ of $M$ data points $x_1, x_2, \ldots, x_M$, with $W$ unique words comprising a vocabulary $V$; each word type is denoted $v_i$ for $0 \leq i < W$. The count of word type $v_i$ in $\mathcal{D}$ is denoted as $c_\mathcal{D}(v_i)$.\footnote{We will leave out the $\mathcal{D}$ if it is obvious from the context.} Our data set $\mathcal{D}$ consists of documents which in turn consist of words $x$. We can aggregate over these words by storing for each word type $v_i$ its frequency $c(v_i)$.\footnote{$c_\mathcal{D}(v_i) = \sum_{x\in\mathcal{D}} [x = v_i]$, with $[\cdot]$ being the indicator function, which returns $1$ if the expression evaluates to true, and returns $0$ otherwise.} In this abstraction we lose the sense of origin (the documents that contained the word) and we lose any positional information. Yet, for our unigram language model this is of no concern. We will store the words and their frequencies by means of the Chinese restaurant process.
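To make this aggregation concrete, here is a minimal Python sketch; the toy corpus and variable names are illustrative choices of ours, not part of the model.
\begin{verbatim}
from collections import Counter

# Illustrative toy corpus: a flat list of word tokens x_1, ..., x_M.
tokens = "a a b b a c d b a b d d".split()

counts = Counter(tokens)   # c(v_i) for every word type v_i
M = sum(counts.values())   # number of tokens: 12
W = len(counts)            # number of word types: 4
print(counts)              # Counter({'a': 4, 'b': 4, 'd': 3, 'c': 1})
\end{verbatim}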
\section{The Chinese Restaurant Process}\marginnote{The Chinese restaurant process describes a family of probability distributions over partitions.}
The Chinese restaurant process (CRP)\footnote{The restaurant metaphor was attributed to Jim Pitman by \cite{Aldous1985Exchangeability}} is one of the many metaphors to simplify difficult concepts and their underlying mathematical principles.
Imagine a restaurant with an infinite number of round tables, with enough space to host an infinite number of guests. Each table in the restaurant serves one dish, but multiple tables may serve the same dish.
Now imagine that each dish $\theta$ of the menu $\Theta$ represents one of the word types $v_i \in V$, and each customer is a data point $x_j\in\mathcal{D}$ that enters the restaurant and sits at a table that corresponds to its word type.\footnote{Note that since we have an infinite number of tables, but only a limited number of dishes $|\Theta|$, the same dish can be served at multiple tables.} In the case of our unigram language model we have $W$ dishes distributed over $K$ tables,\footnote{With $W\leq K$.} and customer $x_j$ represents an occurrence of $v_i$, so it sits at a table where $\theta = v_i$. Now if all customers $x\in\mathcal{D}$ do this, the tables serving $v_i$ together host as many customers as there are occurrences of that word: $\sum_{k=1}^K[z_k=v_i]\cdot|k| = c(v_i)$, with $z_k$ denoting the dish served at table $k$, and $|k|$ the number of customers sitting at table $k$.
Here we can already see the advantage of having a non-parametric framework. If we would have used a parametric model, we could not have dealt with additional data containing new word types: in a parametric model $W$ is fixed. Since we are using a non-parametric model, we now have the means to facilitate additional data without having to resort to another (parametrised) model.
What we have now is a partition\footnote{For a more rigorous and mathematical foundation of partitions and many other topics presented in this section, the interested reader is referred to \cite{Pitman2006Combinatorial}.} of the $M$ data points (word tokens) assigned to $K$ clusters.\footnote{Remember that a dish may be served at multiple tables, hence $K\geq W$.} Henceforth we will (interchangeably) call such a partition a seating arrangement $\mathcal{S}$, as the way people sit around the tables forms the partition, with each table being a cluster. More formally, the probability of a word occurrence being assigned to a cluster $k$ is $w_k$, for $k = 1, 2, \ldots, K$ and $\sum_{k=1}^K w_k = 1$, and additionally, each data point can only be in one cluster. The order of clusters in the partition is arbitrary, as is the order of customers within a cluster.\footnote{We come back to this property later when we introduce the notion of exchangeability.} Given that $M$ denotes the total number of customers in $\mathcal{D}$, we denote a partition of size $m$ as $\pi_{[m]}$.
For example, a restaurant that represents 4 word types and 12 word occurrences, might look like this:
\begin{equation}\label{eq:partitionexample}
%\pi_{[12]} = \{\{1,2,5,9\},\{3,4,8,10\},\{6\},\{7,11,12\}\}.
\pi_{[12]} = \{\tikz[baseline,remember picture]{\node[fill=blue!20,anchor=base] (t11) {$\{1,2,5,9\}$};},\tikz[baseline,remember picture]{\node[fill=red!20,anchor=base] (t12) {$\{3,4,8,10\}$};},\tikz[baseline,remember picture]{\node[fill=green!20,anchor=base] (t13) {$\{6\}$};},\tikz[baseline,remember picture]{\node[fill=magenta!20,anchor=base] (t14) {$\{7,11,12\}$};}\}.
\end{equation}
This restaurant might for instance represent the data set
\noindent \begin{center}%
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (ta1) {a};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (ta2) {a};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (tb1) {b};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (tb2) {b};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (ta3) {a};}
\tikz[baseline,remember picture]{\node[box,fill=green!20] (tc1) {c};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (td1) {d};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (tb3) {b};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (ta4) {a};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (tb4) {b};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (td2) {d};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (td3) {d};}
.\end{center}
It is easy to see that for each $m$ we have a different seating arrangement $\pi_{[m]}$. Each $m$ describes a different restaurant, so we call the CRP a family of distributions.
The \emph{process}\/ in CRP comes from the generative model. There the CRP is a discrete-time stochastic process of which the value at any positive-integer time $m$ is a partition $\pi_{[m]}$ of the set $\{1,2,\ldots,m\}$. This means that the CRP is a sequence of distributions, for $1\leq m \leq\infty$, and if we talk about a specific $m$, the CRP reduces to a Chinese restaurant distribution. The probability distribution is then determined as follows. At time $m=1$ the trivial partition $\pi_{[1]}=\{\{1\}\}$ is obtained with probability $1$.
Now at time $m+1$ customer $x_{m+1}$ sits at table $t$ with probability $\frac{|t|}{m+1}$, or sits at an empty table with probability $\frac{1}{m+1}$:
\begin{equation}\label{eq:crp-table}
p(x_{m+1}\text{ seats at }t|\pi_{[m]}) =
\begin{cases}
\frac{|t|}{m+1}, & \text{if }t\in\pi_{[m]},\\
\frac{1}{m+1}, & \text{otherwise}.
\end{cases}
\end{equation}
In \cref{eq:crp-table} the probability of joining a table $t$ is expressed. We can also compute the probability of $x_{m+1}$ sitting at $x_i$'s table. At time $m+1$, customer $x_{m+1}$ sits at the same table as customer $x_i$ with probability $\frac{1}{m+1}$ for each $i<(m+1)$,\footnote{This can be verified easily: for each of the $|t|$ customers sitting at table $t$, the chance of joining that customer is $\frac{1}{m+1}$, so joining the same table means having to sum for each customer at table $t$: $\sum_{i=1}^{|t|} \frac{1}{m+1} = \frac{|t|}{m+1}$.} or sits at an empty table with probability $\frac{1}{m+1}$.
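The sequential seating rule of \cref{eq:crp-table} is easy to simulate. The following Python sketch is illustrative only; it uses the standard \texttt{random} module, and the function name is our own.
\begin{verbatim}
import random

def sample_crp_partition(m, seed=0):
    # Seat m customers one by one following eq. (crp-table): customer
    # number i+1 joins an occupied table t with probability |t|/(i+1)
    # and starts a new table with probability 1/(i+1).
    rng = random.Random(seed)
    tables = []                                   # tables[k] = customers at table k
    for i in range(m):                            # i customers already seated
        weights = [len(t) for t in tables] + [1]  # existing tables + new table
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([])                     # open a new table
        tables[k].append(i + 1)                   # customer i+1 takes a seat
    return tables

print(sample_crp_partition(12))
\end{verbatim}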
An interesting property of the CRP is exchangeability. Although the CRP is specified using an ordering of the customers, exchangeability states that the distribution on partitions defined by the CRP is invariant to the ordering, in the sense that only the sizes of the clusters determine the probability of the partition. For example, consider the probability that customers 1 and 2 will be found sitting together at the same table after $M$ customers have entered the restaurant. This probability is $\frac{1}{2}$ because customer 1 sits at an arbitrary table with probability $1$, and customer 2 joins that table with probability $1\cdot\frac{1}{2}$. Now, by exchangeability, this probability does not change if the customers were to enter the restaurant in a different order. Put differently, the probability of any two customers $i$ and $j$ sitting at the same table is $\frac{1}{2}$: by exchangeability we may relabel any pair $i$ and $j$ as the first two customers to enter.
For example for the partition $\pi_{[12]}$ from \cref{eq:partitionexample}, the probability of this assignment is:
\begin{equation}\label{eq:partitionexampleprobability}
p(\pi_{[12]}) =
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (a1) {$\frac{1}{1}$};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (a2) {$\frac{1}{2}$};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (b3) {$\frac{1}{3}$};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (b4) {$\frac{1}{4}$};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (a5) {$\frac{2}{5}$};}
\tikz[baseline,remember picture]{\node[box,fill=green!20] (c6) {$\frac{1}{6}$};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (d7) {$\frac{1}{7}$};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (b8) {$\frac{2}{8}$};}
\tikz[baseline,remember picture]{\node[box,fill=blue!20] (a9) {$\frac{3}{9}$};}
\tikz[baseline,remember picture]{\node[box,fill=red!20] (b10) {$\frac{3}{10}$};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (d11) {$\frac{1}{11}$};}
\tikz[baseline,remember picture]{\node[box,fill=magenta!20] (d12) {$\frac{2}{12}$};}
\end{equation}
From these products, it also becomes clear that the order in which the customers enter does not matter, as long as they join the same tables. This also follows from the definition of exchangeability, which states that $X_1, X_2, \ldots$ is infinitely exchangeable if for any $m$, $p(X_1,\ldots,X_m)$ is invariant under permutation. More precisely: a finite sequence $(X_1, \ldots, X_m)$ of random variables is called exchangeable if $p(X_1,\ldots,X_m) = p(X_{\sigma(1)},\ldots,X_{\sigma(m)})$, for each permutation $\sigma$ of $\{1,\ldots,m\}$. An infinite sequence $(X_1,X_2,\ldots)$ is called exchangeable if $p(X_1,X_2,\ldots)=p(X_{\sigma(1)},X_{\sigma(2)},\ldots)$ for each finite permutation $\sigma$ of $\{1,2,\ldots\}$, that is, each permutation for which $\#\{i: \sigma(i)\neq i\} < \infty$.
The part on infinite exchangeability is relevant because the CRP defines a distribution over ``infinite partitions'', even when only a finite number of data points is observed. The CRP commits to the idea that there is an infinitely large partition. When observing only a finite number $m$ of data points, we essentially ask how much of the latent infinite partition we expect to observe if we only have $m$ data points.
We can generalise \cref{eq:partitionexampleprobability} as:
\begin{equation}\label{eq:partitionprobability}
p(\pi_{[m]}) = \frac{\prod_{t\in\pi_{[m]}} (|t|-1)!}{m!}
\end{equation}
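As a quick sanity check (not part of the derivation), the following Python fragment verifies that the sequential product of \cref{eq:partitionexampleprobability} and the closed form of \cref{eq:partitionprobability} agree for the example partition $\pi_{[12]}$:
\begin{verbatim}
from fractions import Fraction
from math import factorial

sizes = [4, 4, 1, 3]                       # table sizes of pi_[12]
m = sum(sizes)

closed_form = Fraction(1)                  # eq. (partitionprobability)
for s in sizes:
    closed_form *= factorial(s - 1)
closed_form /= factorial(m)

steps = [(1, 1), (1, 2), (1, 3), (1, 4), (2, 5), (1, 6),
         (1, 7), (2, 8), (3, 9), (3, 10), (1, 11), (2, 12)]
sequential = Fraction(1)                   # eq. (partitionexampleprobability)
for num, den in steps:
    sequential *= Fraction(num, den)

assert closed_form == sequential
print(closed_form)                         # 1/6652800
\end{verbatim}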
So now that we know about exchangeability, we can turn the argument around and recover the CRP update formula in \cref{eq:crp-table} with \cref{eq:partitionprobability}. We add customer $m+1$ to $\pi_{[m]}$ at table $T$, and we call the new partition $\pi'_{[m+1]}$. If $T$ is an existing table, we find that:
\begin{equation}
\begin{split}
p(x_{m+1}\text{ seats at }T|\pi_{[m]}) &= \frac{p(\pi'_{[m+1]})}{p(\pi_{[m]})} \\
&= \frac{m!}{(m+1)!}\frac{\prod_{t'\in\pi'_{[m+1]}}(|t'|-1)!}{\prod_{t\in\pi_{[m]}}(|t|-1)!} \\
&= \frac{1}{m+1}\frac{(|t_0|-1)! \cdot \ldots \cdot |T|! \cdot\ldots\cdot (|t_K|-1)!}{(|t_0|-1)! \cdot \ldots \cdot (|T|-1)! \cdot\ldots\cdot (|t_K|-1)!} \\
&= \frac{|T|}{m+1}
\end{split}
\end{equation}
%Here we use the $\infty$-sign colloquially.
The product $\prod_{t\in\pi_{[m]}}$ is a product over an unbounded number of tables. For the probability of $T$ being empty and $x_{m+1}$ starting a new table, we can do a similar exercise:
\begin{equation}
\begin{split}
p(x_{m+1}\text{ seats at a new table }T|\pi_{[m]}) &= \frac{p(\pi'_{[m+1]})}{p(\pi_{[m]})} \\
&= \frac{m!}{(m+1)!}\frac{\prod_{t'\in\pi'_{[m+1]}}(|t'|-1)!}{\prod_{t\in\pi_{[m]}}(|t|-1)!} \\
&= \frac{1}{m+1}\frac{(|t_0|-1)! \cdot\ldots\cdot (|t_K|-1)! \cdot (1-1)!}{(|t_0|-1)! \cdot\ldots\cdot (|t_K|-1)!} \\
&= \frac{1}{m+1}
\end{split}
\end{equation}
Some readers may notice that although we filled restaurants with customers, and let them take a seat at tables, we have not mentioned anything about their food yet. Strictly, serving the dishes is not part of the CRP, but in many studies serving customers by providing tables and providing dishes is considered a combined process, also called the CRP mixture model. In this example, and in the remainder of this thesis, we assume that the dishes come from a menu. This menu can be seen as a distribution over dishes, and selecting a dish from the menu as a draw from the distribution.
The idea is that we assign a parameter vector $\phi_t$ to each table $t\in\pi_{[M]}$, and table $t$ hosts $x_i$ if $i\in t$. Additionally, we assume that the data points at table $t$ are generated independently from a common probability distribution indexed by $\phi_t$. So now for each $i\in t$, we let $f(x_i|\phi_t)$ denote the probability density for generating data point $x_i$, and we take the product over $i\in t$ to obtain the total probability of generating the data associated with table $t$. The product over clusters and over data points within clusters is the overall conditional probability of the data:
\begin{equation}
p(x|\phi,\pi_{[M]}) = \prod_{t\in\pi_{[M]}}\prod_{i\in t} f(x_i|\phi_t),
\end{equation}
with $\phi=(\phi_1,\ldots,\phi_K)$ and $K$ the number of occupied tables. If we fix $x$ and regard this probability density as a function of $\phi$, it is known as the likelihood function.
This is almost a complete probability model; however, we still need to specify a distribution for the parameters $\phi$. We assume that these parameters are drawn independently from a distribution $G_0$:
\begin{align}
\pi_{[M]} &\sim \operatorname{CRP}(M)\label{eq:CRPMMpartition} \\
\phi_t | \pi_{[M]} &\overset{\text{iid}}{\sim} G_0 && \text{ for }t\in\pi_{[M]},\label{eq:CRPMMlatent} \\
x_i|\phi,\pi_{[M]} &\overset{\text{ind}}{\sim} F(\phi_t) && \text{ for }t\in\pi_{[M]}\text{ and }i\in t,\label{eq:CRPMMdatapoints}
\end{align}
with $F(\phi_t)$ being the distribution with density $f(\cdot|\phi_t)$. The linked conditional probabilities yield a joint probability distribution on the collection of variables $(x,\phi,\pi_{[M]})$. Naturally, we are mostly interested in the posterior probability of $\pi_{[M]}$ given $x$.
This model is called the CRP mixture model because in a mixture model each data point is generated from one of a set of mixture components, and the choice of mixture component is made randomly for each data point. In this case, $\phi$ defines the mixture components, and $\pi_{[M]}$ selects randomly among these mixture components by assigning a table to each data point, and each data point is generated from the selected mixture component with a draw from $F(\phi_t)$.
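As an illustration of the generative story in \cref{eq:CRPMMpartition}--\cref{eq:CRPMMdatapoints}, the following Python sketch draws data from a CRP mixture model; the Gaussian choices for $G_0$ and $F$ are ours, purely for illustration.
\begin{verbatim}
import random

def sample_crp_mixture(m, seed=0):
    # Generative sketch of the CRP mixture model: seat customers with the
    # CRP, draw a dish phi_t from G_0 for every new table, and draw each
    # data point from F(phi_t). G_0 = N(0, 10^2) and F(phi) = N(phi, 1)
    # are illustrative choices, not prescribed by the model.
    rng = random.Random(seed)
    tables, phis, data = [], [], []
    for i in range(m):
        weights = [len(t) for t in tables] + [1]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append([])
            phis.append(rng.gauss(0.0, 10.0))   # phi_t ~ G_0
        tables[k].append(i + 1)
        data.append(rng.gauss(phis[k], 1.0))    # x_i ~ F(phi_t)
    return tables, phis, data

tables, phis, data = sample_crp_mixture(20)
print([len(t) for t in tables])                 # cluster sizes
\end{verbatim}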
The expected number of tables grows logarithmically as $M \rightarrow \infty$. This can be seen as follows: let $\tau_i$ be the event that the $i$th customer starts a new table. The probability of this happening is $p(\tau_i = 1) = \frac{1}{(i-1)+1}$. We can denote the number of tables $K$ after $m$ customers as $k_m = \sum_{i=1}^m \tau_i$, whose expectation equals $\sum_{i=1}^m \frac{1}{(i-1)+1}$, which is the $m$th harmonic number\footnote{The harmonic series is defined as: $\sum_{m=1}^\infty \frac{1}{m} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots$. The $m$th partial sum is $H_m=\sum_{k=1}^m \frac{1}{k}$.} $H_m$, and thus grows logarithmically, as $O(\ln m)$.
\subsection{Stick-breaking}\marginnote{The stick-breaking process describes a probability distribution over partition sizes.}
In the previous section we defined a process to generate a distribution over partitions by means of a seating arrangement. In this section we introduce a new metaphor: the stick-breaking distribution, which defines the distribution of customers sitting at the tables, without generating an explicit seating arrangement.
The stick-breaking process\autocite{Ishwaran2001Gibbs} is a distribution on the infinite sequence $\xi = (\xi_1,\xi_2,\ldots)$; it is also known as the GEM distribution.\footnote{The GEM distribution is named after Griffiths, Engen, and McCloskey, cf.\ \cite{Pitman1997The}}
The metaphor is as follows: imagine you have a stick of length $1$. You generate $\xi_1$ by snapping the stick into two pieces. The length of one of the two pieces becomes the value $\xi_1$. To generate $\xi_2$ you then snap the other piece in two; again the length of one of these pieces becomes $\xi_2$. As you repeat this stick-breaking, the length of the stick that you have left approaches zero.
$\xi_k$ can then be interpreted as the length of the piece of unit-length stick assigned to the $k$th value.\footnote{$k$ can be considered to be the $k$th table in the CRP, with $\xi_k$ being the relative number of customers seated at $t_k$.} After the first $(k-1)$ values have their portions assigned, the remainder of the stick, $\prod_{i=1}^{k-1}(1-\xi'_i)$, is broken according to a sample $\xi'_k$ from a beta distribution, where $\xi'_k$ indicates the portion of the remainder to be assigned to the $k$th value:
\begin{align}\label{eq:stickbreaking}
\xi_k &= \xi'_k\prod_{i=1}^{k-1}(1-\xi'_i).
\end{align}
The beta distribution is defined as follows:
\begin{equation}
\operatorname{Beta}(x; \alpha, \beta) = \frac{1}{\operatorname{B}(\alpha,\beta)}x^{\alpha-1}(1-x)^{\beta-1}\label{eq:betad}
\end{equation}
with $\alpha,\beta >0$ and $\operatorname{B}(\alpha,\beta)$ being the beta function
\begin{equation}
\operatorname{B}(\alpha,\beta) = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}\label{eq:betaf}
\end{equation}
and $\Gamma(\cdot)$ denoting the gamma function, which for a positive integer $n$ reduces to
\begin{equation}
\Gamma(n) = (n-1)!\label{eq:gammaf}
\end{equation}
Just as in the CRP, where the $(m+1)$th customer chooses uniformly among the $m$ seated customers and one empty table, the breaks are uniform. This uniform distribution can be represented as a beta distribution with parameters $(1,1)$, such that
\begin{equation}
\operatorname{Beta}(x; 1,1)=\frac{1}{\operatorname{B}(1,1)}x^{0}(1-x)^{0} = \frac{\Gamma(2)}{\Gamma(1)\Gamma(1)} = 1.
\end{equation}
Now let $(\phi_1, \phi_2, \ldots)$ be a sequence of samples from $G_0$.\footnote{Compare this to \cref{eq:CRPMMlatent}, but now $t\rightarrow\infty$, so we do not have to use $\pi_{[M]}$ anymore, and its conditional independence can be dropped.} Consider the distribution given by the density
\begin{equation}
f(\phi) = \sum_{k=1}^\infty \xi_k \cdot \delta_{\phi_k}(\phi),
\end{equation}
with $\delta_{\phi_k}(\cdot)$ being the Dirac delta measure centred at $\phi_k$, used as an indicator function. It is clear from this construction that $f$ is a non-parametric random distribution, and that samples drawn from it are discrete: they can only take the values $\phi_k$.
\textcite{Sethuraman1994A} shows that for a sequence $\{\xi_i\}_{i=1}^\infty$ constructed from independent draws $\xi'_i\sim\operatorname{Beta}(\alpha,\beta)$ according to \cref{eq:stickbreaking}, the weights satisfy $\sum_{i} \xi_i = 1$ with probability one.
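The stick-breaking construction translates directly into code. The following Python sketch is illustrative only; with $a=b=1$ the breaks are the uniform ones used above.
\begin{verbatim}
import random

def stick_breaking(n_pieces, a=1.0, b=1.0, seed=0):
    # Break a unit-length stick n_pieces times; each break takes a
    # Beta(a, b)-distributed portion of what is left.
    rng = random.Random(seed)
    remaining, weights = 1.0, []
    for _ in range(n_pieces):
        xi_prime = rng.betavariate(a, b)      # portion of the remainder
        weights.append(xi_prime * remaining)  # xi_k
        remaining *= 1.0 - xi_prime
    return weights, remaining

weights, rest = stick_breaking(50)
print(sum(weights) + rest)  # pieces plus remainder sum to 1 (up to rounding)
\end{verbatim}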
\subsection{Urn Representation}
Another way to fill the Chinese restaurant is by means of drawing balls from an urn. Different from the CRP, where we assigned each customer to its table, with the urn representation we have a process where we first select a table, and then assign a customer to that table.
The urn works as follows. An urn contains objects of various types, where the objects are customarily called the balls, and each ball type can be distinguished by its colour. In general the content of an urn can change over time, subject to rules of placing balls into the urn, or drawing balls under predesignated replacement schemes. A simple example is when there is a particular distribution $P$ of balls in the urn, and you immediately replace the drawn ball after noting its colour; drawing a ball from the urn is then the same as drawing from distribution $P$. In this case $P$ is a fixed distribution. By adding or removing balls after each draw, the distribution can also be dynamic.
In the section on CRP mixture models we saw that the underlying distributions were far from static, hence we need a replacement scheme that matches this behaviour: the Hoppe-P\'olya urn.\footnote{The Hoppe-P\'olya urn is a distribution on sequences of random variables.} In the literature this scheme is often simply called the P\'olya urn.\footnote{Compare this simplified naming to the interchangeable use of CRP and the Dirichlet process CRP in the literature. We will resolve this ambiguity in a later section.} In this section we will introduce the P\'olya urn, the Hoppe-P\'olya urn, and the missing link between the two urn schemes: the multivariate P\'olya urn scheme.
The P\'olya urn scheme\autocite{Johnson1977Urn} is the simplest of the three schemes. The urn contains only white and black balls. For each ball that is drawn (randomly) from the urn, we observe its colour, re-place\footnote{Read: put it back into the urn.} it, and add an additional ball of the same colour to the urn.
A modest generalisation of the P\'olya urn scheme is the multivariate P\'olya urn scheme. Here the generalisation entails that the balls can have other colours as well, although each ball still has only one colour. Imagine a Chinese restaurant where the number of tables is limited to $\kappa$ and $\kappa$ unique dishes are served; each table has a unique dish. The restaurant hosts $\nu$ customers. We can translate this scenario to the multivariate P\'olya urn, where we have $\kappa$ different colours, and $\nu$ balls. In the CRP\footnote{And in our reduced example.} we modelled word types as tables, and word tokens as the customers sitting around the same table. In the urn metaphor, the colours are the types, and the balls of a colour are the tokens. Since we limited $K=\kappa$ in this example, customers cannot choose to join a new table. The chance of drawing any particular ball is now $\frac{1}{\nu}$, and the chance of drawing a ball of colour $h$ is $\frac{c(h)}{\nu}$, with $c(h)$ denoting the number of balls with colour $h$.
More interesting is a generalisation of the P\'olya urn scheme that can behave the same way as a CRP. With the multivariate P\'olya urn scheme, we came close, but lacked the means to generate new tables.\footnote{And thus also the means to generate new dishes.} With the Hoppe-P\'olya urn scheme\autocite{Hoppe1984Polya}, we can overcome this problem as follows. We start with an urn with one black\footnote{Read black as uncoloured, as we also do not take it into account when we count the number of different colours in the urn.} ball. When drawing from the urn, if we draw a black ball, we replace the ball, and alongside add a new ball of a new non-black colour randomly generated from a probability distribution. Otherwise, we put the ball back along with another ball of the same colour, as for the standard multivariate P\'olya urn scheme. The colours of an infinite sequence of draws from the Hoppe-P\'olya urn scheme follow a CRP. If instead of generating new colours (tables in the CRP), we draw a random value from a given base distribution such as $G_0$, we can model a CRPMM.\footnote{Compare this to \cref{eq:CRPMMlatent}. Also note that the P\'olya urn scheme models a beta-binomial distribution, with the parameters of the beta distribution depending on the initial layout of the urn. The multivariate P\'olya urn models a Dirichlet-multinomial distribution, where we set the Dirichlet parameter vector $\alpha=1$. The Hoppe-P\'olya urn scheme models the Dirichlet process, with $\alpha=1$. We will come back to this in a later section on the Dirichlet Process.} Similar to the CRP, the number of colours grows logarithmically to infinity, as the expected number of colours at time $n$ is $\sum_{i=1}^n i^{-1}$.
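A minimal simulation of the Hoppe-P\'olya urn scheme, again purely illustrative, looks as follows; the colour counts it produces follow the same law as the CRP table sizes.
\begin{verbatim}
import random

def hoppe_polya_urn(n_draws, seed=0):
    # Hoppe-Polya urn with a single black ball: drawing the black ball adds
    # a ball of a brand-new colour; drawing a coloured ball puts it back
    # together with one extra ball of the same colour.
    rng = random.Random(seed)
    counts = []                      # counts[c] = number of balls of colour c
    for _ in range(n_draws):
        total = sum(counts)
        r = rng.randrange(total + 1) # uniform over coloured balls + black ball
        if r == total:               # the black ball was drawn
            counts.append(1)         # a new colour enters the urn
        else:                        # find the coloured ball with index r
            c = 0
            while r >= counts[c]:
                r -= counts[c]
                c += 1
            counts[c] += 1           # duplicate its colour
    return counts

print(hoppe_polya_urn(12))           # colour counts after 12 draws
\end{verbatim}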
\textcite{Blackwell1973Ferguson} define the P\'olya urn scheme as follows. Let $\mu$ be any finite positive measure on a complete separable metric space $X$, and $(B_1,\dots,B_r)$ a finite partition of $X$. Now every sequence $\{X_n,n\geq 1\}$ of random variables with values in $X$ is a P\'olya sequence with parameter $\mu$ if for every $B\subset X$:
\begin{equation}
p(X_1\in B) = \frac{\mu(B)}{\mu(X)}
\end{equation}
and
\begin{equation}
p(X_{n+1}\in B | X_1,\ldots,X_n) = \frac{\mu_n(B)}{\mu_n(X)}.
\end{equation}
with $\mu_n = \mu + \sum_{i=1}^n \delta(X_i)$ and $\delta(x)$ denoting the unit measure concentrating at $x$. For every finite $X$, the sequence $\{X_n\}$ represents the results of successive draws from an urn where initially the urn has $\mu(x)$ balls of colour $x$, and after each draw, the ball drawn is replaced and another ball of its same colour is added to the urn.
The Hoppe-P\'olya stochastic process can be formalised as follows. Let $C = \{1,2,\ldots\}$ be the set of colours and $X_n$ the random variable which indicates the colour of the new ball added to the urn after the $n$th draw. Let $K$ be the random number of distinct colours that are in the urn at time $n$; then the urn can be described as a configuration of $n$ tokens in $K$ cells, with occupancy numbers $\{n_1, n_2, \ldots, n_K\}$ satisfying $\sum_{i=1}^K n_i = n$, which form a partition\footnote{A partition of an integer $n$ is one way of writing $n$ as the sum of positive integers where the order of the addends (terms being added) does not matter.} of the integer $n$. The sequence $\{X_j\}_{j=1}^n$ determines the random partition $\pi_{n}$.
Consider a partition with occupancy numbers $(n_1,n_2,\ldots,n_K)$ over $K$ distinct colours, and a sequence of draws $\pi_{n} = \{X_j\}_{j=1}^n$ generating it. The probability of obtaining this sequence is then:
\begin{equation}
p(\pi_n) = \frac{1^K}{n!}\prod_{j=1}^K (n_j-1)!
\end{equation}
The $\frac{1}{n!}$ collects the normalising terms of the $n$ successive draws: in order to get $K$ colours, the black ball must be drawn exactly $K$ times (contributing the $1^K$), and in order to get exactly the required partition, each colour $j$ must then be drawn exactly $n_j-1$ more times. Not surprisingly, but a confirmation nonetheless, this is the same probability as \cref{eq:partitionprobability}: the probability of a partition under the CRP.
\subsection{The Dirichlet Process: a Complete Prior}
In the Chinese restaurant process we observed a unigram language model from the perspective of the customers, representing word occurrences. There we saw that according to the CRP, they all have an equal probability of joining another customer, and the probability of joining an existing table was proportional to the number of customers already sitting at that table. In the Hoppe-P\'olya urn scheme, we watched the same process from the table point-of-view; here we saw customers joining tables or starting new tables. By means of the stick-breaking process, we observed a way to describe the distribution over number of customers sitting at tables.
Now, fixing the probability of a customer starting a new table to $\frac{1}{m+1}$ is a strong limitation. Therefore we introduce a parametrised generalisation of the previous priors, called the Dirichlet process, that encompasses all of the above. Historically the Dirichlet process was invented first, and the other processes were all invented to provide tractable methods for using the Dirichlet process.
To come to the Dirichlet process from the line of processes that we followed in this thesis, we introduce a concentration parameter $\alpha $. If we set $\alpha = 1$, we reduce the Dirichlet process to a process coherent with the CRP, stick-breaking process, and Hoppe-P\'olya urn scheme that we saw in the previous sections.
The Dirichlet Process is a way of assigning a probability distribution over probability distributions. It is similar to the CRP, the stick-breaking process, and the urn schemes, in a sense that it is often used as a prior distribution. In this section we will briefly introduce the Dirichlet Process, and show the relations and differences with the priors we introduced in the previous sections.
We will start with a somewhat informal introduction. The Dirichlet process is a stochastic process, and is often called the cornerstone of Bayesian nonparametric models. It represents a distribution over distributions, such that each draw from a Dirichlet process is a distribution itself. It is called a Dirichlet process because it has Dirichlet-distributed finite-dimensional marginal distributions. The distributions drawn from a Dirichlet process are discrete, but cannot be described using a finite number of parameters, hence the Dirichlet process is classified as a nonparametric model.
\subsection{Introduction to the Dirichlet Distribution}
Assume we have a corpus $\mathcal{C}$ with $W$ words and vocabulary $V$, and we want to compare two documents $\mathcal{D}_1$ and $\mathcal{D}_2$ from this corpus. In a Bayesian setting we assume some underlying probability mass function $q$ generating the documents, sampled from the random probability mass function $Q$, such that $q_1, q_2 \sim Q$. With the Dirichlet distribution we can model the randomness, or variability, between the two documents. If we assume that all words have an equal probability, then we can imagine a fair $W$-sided die. A document is then generated by rolling the die, where each outcome is a word type $v\in V$ yielding one word token. The die can be represented by a probability mass function of length $W$, with a uniform distribution. Now we have for both $\mathcal{D}_1$ and $\mathcal{D}_2$ a probability mass function, and we can use the Dirichlet distribution to capture the variability.
In practice the distribution of words is not uniform, and the probability mass function is then of length $W$ but normalised over the empirical word frequencies.
More formally, let $Q = [Q_1, Q_2, \ldots, Q_K]$ be a random probability mass function (pmf) with $K$ components, with $Q_i \geq 0$ for $i = 1,2,\ldots,K$ and $\sum_{i=1}^K Q_i = 1$. In addition, suppose that we have a parameter vector $\alpha = [\alpha_1,\alpha_2,\ldots,\alpha_K]$,\footnote{Although perhaps confusing, this is not necessarily the same $\alpha$ as the Dirichlet process concentration parameter, although in practice this is often the case. Here we just follow the convention of the literature, where $\alpha$ is also used to denote the concentration parameter of the Dirichlet distribution.} with $\alpha_i > 0$ for each $i$, and let $\hat{\alpha} = \sum_{i=1}^K \alpha_i$.\footnote{If $\alpha_1=\alpha_2=\ldots=\alpha_K$, then we call $\alpha$ symmetric, and the distribution a symmetric Dirichlet distribution.} Then $Q$ is said to have a Dirichlet distribution with parameter $\alpha$, which we denote by $Q \sim \operatorname{Dir}(\alpha)$:\footnote{See Section 2 of \cite{Ferguson1973A} for a more elaborate introduction, with many technicalities that we assume to be true; where not, we will indicate this.}
\begin{equation}
\operatorname{Dir}(q; \alpha) = \frac{\Gamma(\hat{\alpha})}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K q_i^{\alpha_i-1}
\end{equation}
with $\Gamma(\cdot)$ denoting the gamma function (see~\cref{eq:gammaf}).
The Dirichlet distribution can be thought of as a probability distribution over the $(K-1)$-dimensional probability simplex\footnote{The $k$-simplex is a geometrical $(k-1)$-dimensional object (the $k$th component follows from $q_k=1-\sum_{i=1}^{k-1} q_i$), which is a generalisation of the notion of a triangle to $k$ dimensions. It is defined as the point set consisting of the convex hull of a set of affinely independent points in $\mathbb{R}^k$, with its $k$ components being non-negative and summing to $1$: $\Delta_k = \{q\in\mathbb{R}^k|\sum_{i=1}^k q_i=1,q_i\geq0\text{ for }i=1,2,\ldots,k\}$.} $\Delta_K$; put otherwise, as a distribution over pmfs of length $K$. With $K=2$, the Dirichlet distribution reduces to a Beta distribution (see~\cref{eq:betad}), thus if $X\sim\operatorname{Beta}(a,b)$ then $Q=(X,1-X)\sim\operatorname{Dir}(\{a,b\})$, and vice versa.
With $\alpha = \{1\}_{i=1}^K$, the Dirichlet distribution reduces to the uniform distribution over the simplex. When the components of $\alpha$ are all greater than $1$, the density has a single mode somewhere in the interior of the simplex. With the components of $\alpha$ all less than $1$, the density has sharp peaks around the edges of the simplex. However, it should be noted that the support of the Dirichlet is open, and does not include the edges of the simplex; therefore a component drawn from a Dirichlet will never be $0$. Note the difference between $\alpha$ being only $1$s, and all $\alpha_i$ going to infinity. In the first case all distributions are equally likely,\footnote{The distribution over distributions is uniform.} whereas in the latter case only near-uniform distributions\footnote{The distribution over distributions is peaked around the uniform distribution.} are likely.
One of the properties of the Dirichlet distribution is that it is the conjugate prior for the multinomial distribution. The multinomial distribution is parametrised by an integer $n$ and a pmf $q = [q_1, q_2, \ldots, q_K]$. $n$ denotes the number of independent trials, and for each trial, the probability of outcome $i$ is $q_i$. The multinomial distribution then specifies the probability that outcome $i$ occurs $x_i$ times, for $i=1,2,\ldots,K$.\footnote{The multinomial distribution can model the probability of an $n$-sample empirical histogram, if each sample is drawn iid from $q$.} If $X\sim\operatorname{Mult}(n,q)$, then its probability mass function is given by
\begin{equation}
\operatorname{Mult}(x_1,x_2,\ldots,x_K|n, q_1,q_2,\ldots,q_K) = \frac{n!}{x_1!x_2!\ldots x_K!}\prod_{i=1}^K q_i^{x_i}
\end{equation}
when $\sum_{i=1}^K x_i = n$, and $0$ otherwise. We can also express this with gamma functions to show its correspondence to the Dirichlet distribution:
\begin{equation}
\operatorname{Mult}(x_1, x_2,\ldots,x_K|n, q_1,q_2,\ldots,q_K) = \frac{\Gamma(\sum_{i=1}^K x_i+1)}{\prod_{i=1}^K \Gamma(x_i+1)}\prod_{i=1}^K q_i^{x_i}.
\end{equation}
Multiplying this likelihood by the Dirichlet prior gives, up to normalisation, the posterior over $q$:
\begin{equation}
\left(\frac{\Gamma(\sum_{i=1}^K x_i+1)}{\prod_{i=1}^K \Gamma(x_i+1)}\prod_{i=1}^K q_i^{x_i}\right)\left(\frac{\Gamma(\hat{\alpha})}{\prod_{i=1}^K \Gamma(\alpha_i)} \prod_{i=1}^K q_i^{\alpha_i-1}\right) \propto \prod_{i=1}^K q_i^{x_i+\alpha_i-1},
\end{equation}
which is again of Dirichlet form: the posterior is $\operatorname{Dir}(\alpha_1+x_1,\ldots,\alpha_K+x_K)$. This is what is meant by the Dirichlet distribution being the conjugate prior of the multinomial.
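The conjugate update can be checked numerically. The following sketch assumes \texttt{numpy} is available; the prior parameters and counts are arbitrary illustrative values.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 1.0, 1.0, 0.5])  # Dirichlet prior parameters (arbitrary)
counts = np.array([7, 3, 1, 1])         # observed multinomial counts x

# Conjugacy: the posterior over q is Dir(alpha_1 + x_1, ..., alpha_K + x_K).
posterior_alpha = alpha + counts
print(posterior_alpha / posterior_alpha.sum())   # posterior mean of q

# Monte Carlo check: samples from the updated Dirichlet have the same mean.
samples = rng.dirichlet(posterior_alpha, size=50000)
print(samples.mean(axis=0))
\end{verbatim}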
Now consider an example in which we have $L$ documents, each assigned a pmf $q^{(1)}, \ldots, q^{(L)}$, each drawn iid from $\operatorname{Dir}(\alpha)$, a distribution over distributions, such that $q^{(i)} \overset{\text{iid}}{\sim} \operatorname{Dir}(\alpha)$. If we then draw $n_i$ iid samples from each pmf $q^{(i)}$
\begin{equation}
\operatorname{Dir}(\alpha) \overset{\text{iid}}{\rightarrow}
\begin{cases}
q^{(1)} \overset{\text{iid}}{\rightarrow} x_{1,1}, x_{1,2},\ldots,x_{1,n_1} \triangleq x_1\\
q^{(2)} \overset{\text{iid}}{\rightarrow} x_{2,1}, x_{2,2},\ldots,x_{2,n_2} \triangleq x_2 \\
\vdots \\
q^{(L)} \overset{\text{iid}}{\rightarrow} x_{L,1}, x_{L,2},\ldots,x_{L,n_L} \triangleq x_L \\
\end{cases}
\end{equation}
then the $\{x_i\}$ are realisations of a compound Dirichlet distribution, also known as the multivariate P\'olya distribution that we saw earlier, here with $K$ colours and $n_i$ draws for the $i$th document. Now the distribution over distributions is not uniform, but concentrated around certain distributions by means of the concentration parameter $\alpha$.
\subsection{Dirichlet Multinomial Compound Mixture Model}
By means of an example, we show that the multivariate P\'olya urn scheme indeed converges to a Dirichlet distribution in a parametric setting. We can do the same for the stick-breaking process, where we limit the number of breaks to $(k-1)$, which leaves us with $k$ pieces. For the CRP there seems to be no sensible parametric metaphor, so the next step is then to introduce these priors into a nonparametric setting. We will do this in the next section.
If we start with the urn scheme, then we want to generate a realisation $Q\sim\operatorname{Dir}(\alpha)$. That is, we put $\alpha_i$ balls of colour $i$ for $i=1,2,\ldots,K$ in an urn. Note that $\alpha_i>0$ but is not necessarily an integer. The process of drawing balls is similar to the multivariate P\'olya urn discussed earlier, but now the initial number of balls may differ per colour.\footnote{The distribution is not uniform anymore, but now models a Dirichlet-multinomial distribution with concentration parameter $\alpha$.} At each iteration, we draw one ball uniformly at random from the urn, and then place it back into the urn along with an additional ball of the same colour. As we iterate this process, the proportions of balls of each colour will converge to a pmf that is a sample from the distribution $\operatorname{Dir}(\alpha)$. Let $Q_n = (Q_{n1}, Q_{n2},\ldots,Q_{nK})$, where $Q_{ni}$ is the proportion of balls of colour $i$ after $n$ balls are in the urn. Then $Q_n \rightarrow_\text{d} Q \sim \operatorname{Dir}(\alpha)$ as $n\rightarrow\infty$, with $\rightarrow_\text{d}$ denoting convergence in distribution.\footnote{$p(Q_{n1}\leq z_1, Q_{n2}\leq z_2, \ldots, Q_{nK}\leq z_K) \rightarrow p(Q_1\leq z_1, Q_2\leq z_2,\ldots,Q_K\leq z_K)$ as $n\rightarrow\infty$ for all $(z_1,z_2,\ldots,z_K)$.}
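The convergence of the urn proportions to a Dirichlet sample can be illustrated with a small simulation (assuming \texttt{numpy}; the parameter values and run lengths are arbitrary).
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 1.0, 0.5])   # initial (possibly non-integer) colour masses

def polya_urn_proportions(alpha, n_draws, rng):
    # Draw a colour with probability proportional to its current mass,
    # then add one ball of that colour; return the final proportions.
    mass = np.array(alpha, dtype=float)
    for _ in range(n_draws):
        colour = rng.choice(len(mass), p=mass / mass.sum())
        mass[colour] += 1.0
    return mass / mass.sum()

# Each long run of the urn yields (approximately) one sample from Dir(alpha);
# the spread over runs matches direct Dirichlet samples.
urn_samples = np.array([polya_urn_proportions(alpha, 1000, rng)
                        for _ in range(200)])
dir_samples = rng.dirichlet(alpha, size=200)
print(urn_samples.mean(axis=0), dir_samples.mean(axis=0))
\end{verbatim}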
The corresponding finite mixture model can be described by:
\begin{align}
Q|\alpha &\sim \operatorname{Dir}\left(\frac{\alpha}{K},\ldots,\frac{\alpha}{K}\right) \\
z_i|Q&\sim\operatorname{Mult}(Q) \\
\phi_k | G_0 &\overset{\text{iid}}{\sim} G_0 \\
x_i|z_i,\{\phi_k\}&\overset{\text{ind}}{\sim} F\left(\phi_{z_i}\right).
\end{align}
\subsection{Formalising The Dirichlet Process}
Earlier we already made a remark that vocabularies are not necessarily of finite size. With the Dirichlet distribution we have to set $K$ upfront, because it is a parametrised model. To lift this restriction, we use the Dirichlet process.
For a random distribution $G$ to be distributed according to a Dirichlet process, its marginal distributions have to be Dirichlet distributed. Let the Dirichlet process be parametrised by a base distribution $G_0$ and a positive scaling parameter $\alpha$.\footnote{\cite{Ferguson1973A} parametrises the Dirichlet process with a base measure, which is $\alpha G_0$ in our notation.} Then for any finite partition $A_1,\ldots,A_k$ of $\{\phi\}$ the vector $(G(A_1),\ldots,G(A_k))$ is random since $G$ is random. We then say $G$ is Dirichlet process distributed with base distribution $G_0$ and concentration parameter $\alpha$, which we denote as $G\sim\operatorname{DP}(\alpha, G_0)$, if
\begin{align}
(G(A_1),\ldots,G(A_k)) \sim \operatorname{Dir}(\alpha G_0(A_1), \ldots,\alpha G_0(A_k)).\label{eq:dirtodp}
\end{align}
The base distribution $G_0$ is essentially the mean of the Dirichlet process, since for any measurable set $A\subset\{\phi\}$ we have $\mathbb{E}[G(A)]=G_0(A)$. The concentration parameter can be understood as an inverse variance: $\operatorname{Var}(G(A))= \frac{1-G_0(A)}{\alpha +1 }G_0(A)$. The larger $\alpha$ is, the smaller the variance, and the more the Dirichlet process concentrates its mass around the mean. $\alpha$ is also called the mass parameter, as this prior strength can be measured in units of mass of observations; sometimes it is called the strength parameter, referring to the strength of the prior when the Dirichlet process is used as a nonparametric prior.
For the Dirichlet distribution and the multivariate P\'olya urn scheme, we saw that if we kept on drawing and placing balls according to its replacement scheme, we approached the true underlying distribution. For the Dirichlet process we can do a similar exercise. Let $G$ be a random distribution $G\sim\operatorname{DP}(\alpha, G_0)$. Let $\theta_1,\ldots,\theta_n$ be a sequence of independent draws from $G$.
Now $G$ is not a fixed distribution, but a random distribution drawn from a distribution over distributions. The probability that a draw $\theta$ falls in $A_k$ is no longer proportional to $\frac{\alpha}{K}$ alone, with $K$ the number of colours, but to $\alpha G_0(A_k)$ plus the number of draws already observed in $A_k$. That means that instead of following $\operatorname{Dir}\left(\frac{\alpha}{K}, \ldots,\frac{\alpha}{K}\right)$, the marginals now follow $\operatorname{Dir}\left(\alpha G_0(A_1), \ldots, \alpha G_0(A_k)\right)$, which is also what we stated in \cref{eq:dirtodp}.
Recall the Chinese restaurant process. Now the process for a Dirichlet process CRP (DPCRP) is as follows: the first customer enters the restaurant and sits at the first table. Each $(m+1)$th customer either joins an already occupied table $t$ with a probability $\frac{|t|}{m+\alpha}$, or sits at a new table with a probability $\frac{\alpha}{m+\alpha}$:\footnote{Compare this to the probabilities for the CRP: \cref{eq:crp-table}.}
\begin{equation}\label{eq:dpcrp-table}
p(x_{m+1}\text{ seats at }t|\pi_{[m]}) =
\begin{cases}
\frac{|t|}{m+\alpha}, & \text{if }t\in\pi_{[m]},\\
\frac{\alpha}{m+\alpha}, & \text{otherwise}.
\end{cases}
\end{equation}
The properties on exchangeability still hold. In \cref{eq:partitionexampleprobability} we computed the probability of $\pi_{[12]}$ (\cref{eq:partitionexample}) as generated by a CRP:
\begin{equation}
\pi_{[12]} = \{\{1,2,5,9\},\{3,4,8,10\},\{6\},\{7,11,12\}\}.
\end{equation}
We can do the same for a DPCRP:
\begin{equation}\label{eq:dppartitionexampleprobability}
p(\pi_{[12]}) = \frac{\alpha}{\alpha}\frac{1}{1+\alpha}\frac{\alpha}{2+\alpha}\frac{1}{3+\alpha}\frac{2}{4+\alpha}\frac{\alpha}{5+\alpha}\frac{\alpha}{6+\alpha}\frac{2}{7+\alpha}\frac{3}{8+\alpha}\frac{3}{9+\alpha}\frac{1}{10+\alpha}\frac{2}{11+\alpha},
\end{equation}
which we can generalise as:
\begin{align}
p(\pi_{[m]}) = \frac{\alpha^K\prod_{t\in\pi_{[m]}}(|t|-1)!}{\alpha^{(m)}}
\end{align}
with $\alpha^{(x)} = \alpha(\alpha+1)(\alpha+2)\ldots(\alpha+x-1)$ denoting the rising factorial,\footnote{Also known as the Pochhammer factorial.} and $K$ the number of occupied tables; with $\alpha=1$ this reduces to \cref{eq:partitionprobability}. It is interesting to look at the expected number of tables after $m$ customers. Note that the probability of the $i$th customer starting a new table, $\frac{\alpha}{(i-1)+\alpha}$, is independent of the number of tables created by the previous customers. We use the same notation as before and denote the number of tables $K$ after $m$ customers as $k_m = \sum_i \tau_i$, whose expectation is equal to
\begin{align}
\sum_{i=1}^m \frac{\alpha}{(i-1)+\alpha} = \alpha(\psi(\alpha+m)-\psi(\alpha)),
\end{align}
with $\psi(\cdot)$ denoting the digamma function. It grows logarithmically, as $O(\alpha \log(1+\frac{m}{\alpha}))$. Note that $\alpha$ controls the number of clusters in a direct manner, with larger $\alpha$ implying a larger number of clusters a priori.
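This expectation is easy to verify empirically. The following sketch (assuming \texttt{scipy} is available for the digamma function; the parameter values are arbitrary) only simulates the new-table events, which is all that is needed.
\begin{verbatim}
import random
from scipy.special import digamma

def expected_tables_simulated(alpha, m, runs=2000, seed=0):
    # Customer i+1 (with i customers already seated) starts a new table
    # with probability alpha / (i + alpha), independently of the seating.
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        total += sum(rng.random() < alpha / (i + alpha) for i in range(m))
    return total / runs

alpha, m = 2.0, 1000
analytic = alpha * (digamma(alpha + m) - digamma(alpha))
print(expected_tables_simulated(alpha, m), analytic)   # both close to ~13
\end{verbatim}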
The stick-breaking process is also very similar to the unparametrised prior we saw earlier. Now we parametrise $\operatorname{Beta}$ as follows: $\operatorname{Beta}(1,\alpha)$. The process is then defined as:
\begin{align}
\xi'_k &\sim\operatorname{Beta}(1,\alpha) \\
\xi_k &= \xi'_k \prod_{i=1}^{k-1}(1-\xi'_i) \\
\phi_k&\sim G_0 \\
G &= \sum_{k=1}^\infty \xi_k\cdot\delta_{\phi_k}
\end{align}
Then $G\sim\operatorname{DP}(\alpha,G_0)$. The distribution over $\xi$ is also written as $\xi\sim\operatorname{GEM}(\alpha)$.
For the urn scheme the extension with a concentration parameter $\alpha$ is more straightforward. Recall the Hoppe-P\'olya urn. In the original model there was only one black ball. If you drew the black ball, you had to put the black ball back, alongside a ball of a new colour. For any other coloured ball that was drawn, two balls of the same colour were placed in the urn. Now in the parametrised version, the black ball is given mass $\alpha$ (equivalently, for integer $\alpha$, there are $\alpha$ black balls in the urn). Similar to the DPCRP, where the chance of starting a new table was proportional to $\alpha$, independent of what other customers did, the chance in the DP Hoppe-P\'olya urn of adding a new colour is proportional to $\alpha$.
\subsection{The Dirichlet Process Mixture Model}
The most common application of the Dirichlet process is in clustering data using mixture models\autocite{Antoniak1974Mixtures,Escobar1995Bayesian}. Here the nonparametric nature of the Dirichlet process translates to mixture models with a countably infinite number of components. We model a set of observations $\{x_1, x_2, \ldots, x_n\}$ using a set of latent variables $\{\theta_1,\theta_2,\ldots,\theta_n\}$. Each $\theta_i$ is drawn iid from $G$, while each $x_i$ has distribution $F(\theta_i)$ parametrised by $\theta_i$:
\begin{align}
x_i|\theta_i&\sim F(\theta_i) \\
\theta_i|G&\sim G \\
G|\alpha,G_0&\sim\operatorname{DP}(\alpha,G_0).
\end{align}
Because $G$ is discrete, multiple $\theta_i$s can take on the same value, and the model can be seen as a mixture model, where $x_i$'s with the same value of $\theta_i$ belong to the same cluster. Using the DP stick-breaking prior, we can rewrite the model in the more usual mixture-model form:
\begin{align}
\pi|\alpha&\sim\operatorname{GEM}(\alpha) \\
z_i|\pi&\sim\operatorname{Mult}(\pi) \\
\theta_k|G_0&\sim G_0 \\
x_i|z_i, \{\theta_k\}&\sim F(\theta_{z_i})
\end{align}
with $G = \sum_{k=1}^\infty \pi_k\delta_{\theta_k}$ and $\theta_i = \theta_{z_i}$; $\pi$ is then the mixing proportion, $\theta_k$ the cluster parameters, $F(\theta_k)$ the distribution over data in cluster $k$, and $G_0$ the prior over cluster parameters.
The Dirichlet process mixture model can take on a countably infinite number of clusters; it is therefore called an infinite mixture model. However, the $\pi_k$'s decrease exponentially quickly, hence a priori only a small (logarithmic) number of clusters will be used to model the data.
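To make the generative story concrete, the following sketch draws data from a truncated DP mixture of unit-variance Gaussians, reusing the \texttt{stick\_breaking\_dp} function from the earlier sketch; the Gaussian choices for $F$ and $G_0$ are our own illustrative assumptions.
\begin{verbatim}
import numpy as np

# Reuses stick_breaking_dp from the earlier stick-breaking sketch.
def sample_dp_mixture(n, alpha, truncation, rng):
    """Generate n observations from a truncated DP mixture of unit-variance Gaussians."""
    weights, means = stick_breaking_dp(alpha, lambda r: r.normal(0.0, 10.0),
                                       truncation, rng)
    weights = weights / weights.sum()               # renormalise the truncated weights
    z = rng.choice(truncation, size=n, p=weights)   # z_i ~ Mult(pi)
    x = rng.normal(means[z], 1.0)                   # x_i ~ F(theta_{z_i})
    return x, z

rng = np.random.default_rng(2)
x, z = sample_dp_mixture(n=500, alpha=1.0, truncation=50, rng=rng)
print(len(np.unique(z)))   # typically only a handful of clusters are used
\end{verbatim}
Even with a truncation level of $50$, typically only a few of the available components receive any data, in line with the logarithmic growth discussed above.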
\subsection{Pitman-Yor Process: The Process}
With the Dirichlet process we defined a prior distribution over distributions, where we used the concentration parameter $\alpha$ to assign a relative mass to choosing new tables in the DPCRP and new colours in the Hoppe-P\'olya urn. Our goal is to create a language model using Bayesian nonparametric priors. With the priors we have seen so far, the growth of the number of tables or colours, or, in the case of a language model, the number of word types, is only logarithmic. However, in language, many patterns seem to follow a power-law\footnote{A power law is a relationship between two quantities, where one quantity varies as a power of the other, typically of the form $f(x)=ax^k$.} growth.\autocite{Goldwater2006Interpolating}
To achieve power-law growth instead of the logarithmic growth of the Dirichlet process, we have to introduce yet another parameter: the discount parameter $0 \leq \gamma < 1$. The Pitman-Yor process is a generalisation of the Dirichlet process: if we set $\gamma = 0$, it reduces to a Dirichlet process.
The Pitman-Yor process is thus a distribution over distributions over a probability space $X$. It has three parameters: a discount parameter $0\leq\gamma<1$, a strength parameter $\alpha > -\gamma$, and a base distribution $G_0$ over $X$.
The Pitman-Yor process can also be described by a Chinese restaurant process (PYCRP). It describes the properties of the Pitman-Yor process in terms of distributions over draws from $F$, which correspond to the words whose distribution we wish to model. Let $x_1, x_2,\ldots$ be iid draws from $F$. The first customer enters the restaurant and sits at the first table. The $(m+1)$th customer sits at occupied table $t$ with a probability proportional to $|t|-\gamma$, and sits at an unoccupied table with a probability proportional to $\alpha+k\gamma$, where $k$ is the current number of occupied tables:
\begin{align}
p(x_{m+1}\text{ sits at }t|\pi_{[m]}) =
\begin{cases}
\frac{|t|-\gamma}{\alpha+m}, & \text{if }t\in\pi_{[m]},\\
\frac{\alpha+k\gamma}{\alpha+m}, & \text{otherwise}\\
\end{cases}.
\end{align}
Two characteristic properties can be observed easily from the description.
\begin{enumerate}
\item As more customers join a table, that table becomes a more likely choice for future customers too;
\item Irrespective of how many customers are in the restaurant, there is always a positive probability of joining an unoccupied table. This probability depends on the number of tables: as $k$ grows, the probability of starting table $(k+1)$ grows as well, at rate $\gamma$.
\end{enumerate}
Its conditional distribution is:
\begin{align}\label{eq:pycrpconddist}
x_{m+1}|x_1,\ldots,x_m, \pi_{[m]} \sim \sum_{i=1}^k \frac{|i|-\gamma}{\alpha+m}\delta_{\phi_i} + \frac{\alpha+k\gamma}{\alpha+m}G_0
\end{align}
The PYCRP can also be shown to be exchangeable, similar in argument to the CRP and DPCRP.
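The same kind of simulation as for the DPCRP illustrates the difference in growth behaviour. The sketch below (again with illustrative names of our own) seats customers according to the PYCRP rule above; with $\gamma>0$ the number of occupied tables grows as a power of $m$ rather than logarithmically.
\begin{verbatim}
import numpy as np

def simulate_pycrp_tables(m, alpha, gamma, rng):
    """Seat m customers according to the PYCRP; return the number of tables."""
    table_sizes = []
    for i in range(m):
        k = len(table_sizes)
        # Occupied table t has weight |t| - gamma; a new table has weight alpha + k*gamma.
        weights = np.array([size - gamma for size in table_sizes] + [alpha + k * gamma])
        choice = rng.choice(k + 1, p=weights / weights.sum())
        if choice == k:
            table_sizes.append(1)        # open table k+1
        else:
            table_sizes[choice] += 1     # join table `choice`
    return len(table_sizes)

rng = np.random.default_rng(3)
for m in (1000, 10000):
    print(m, simulate_pycrp_tables(m, alpha=1.0, gamma=0.8, rng=rng))
    # with gamma > 0 the table count grows roughly as a power of m
\end{verbatim}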
An explicit construction of draws from a Pitman-Yor process is given by the stick-breaking construction. It shows that, with probability one, a draw is a weighted sum of an infinite sequence of point masses. In the original stick-breaking process we used $\operatorname{Beta}(1,1)$, in the stick-breaking process for the Dirichlet process $\operatorname{Beta}(1,\alpha)$, and now, for the Pitman-Yor process, we use both parameters: $\operatorname{Beta}(1-\gamma, \alpha+k\gamma)$, where $k$ indexes the pieces of the stick. Let $\xi'_1,\xi'_2,\ldots$ and $\phi_1, \phi_2,\ldots$ be two sequences of independent random variables. The draw is then constructed as:
\begin{align}
\xi'_k &\sim \operatorname{Beta}(1-\gamma, \alpha+k\gamma)\\
\xi_k &= \xi'_k\prod_{i=1}^{k-1}(1-\xi'_i) \\
\phi_k &\sim G_0 \\
F &= \sum_{k=1}^\infty \xi_k \delta_{\phi_k}
\end{align}
then $F\sim\operatorname{PYP}(\alpha, \gamma, G_0)$.
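The truncated stick-breaking sketch we gave for the Dirichlet process carries over with only the Beta parameters changed; below is a minimal, illustrative variant for the Pitman-Yor case. Setting $\gamma=0$ recovers the Dirichlet process construction.
\begin{verbatim}
import numpy as np

def stick_breaking_pyp(alpha, gamma, base_sampler, truncation, rng):
    """Truncated stick-breaking draw F ~ PYP(alpha, gamma, G0)."""
    weights, atoms, remaining = [], [], 1.0
    for k in range(1, truncation + 1):
        beta_k = rng.beta(1.0 - gamma, alpha + k * gamma)  # xi'_k ~ Beta(1-gamma, alpha+k*gamma)
        weights.append(remaining * beta_k)                 # xi_k = xi'_k prod_{i<k} (1 - xi'_i)
        remaining *= 1.0 - beta_k
        atoms.append(base_sampler(rng))
    return np.array(weights), np.array(atoms)
\end{verbatim}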
\subsection{Pitman-Yor Process: The Bayesian Unigram Language Model}
Throughout this section we introduced two distributions over distributions: the Dirichlet process and the Pitman-Yor process. For both processes we saw three different perspectives from which they can be described: the Chinese restaurant process, the stick-breaking process, and the urn scheme.
All this led to the Pitman-Yor process, which shows power-law behaviour similar to patterns we find in language. In the remainder of this section we fulfil the promise from the beginning of this section to build a unigram language model, as this is the basis for the $n$-gram language model introduced in the next section.
Consider using the Pitman-Yor process as a prior for unigram word distributions. We use a uniform word distribution over our fixed vocabulary $V$ of $W$ words as the base distribution $G_0$, while the draw $F$ from the Pitman-Yor process is the desired unigram distribution over words. Repeating the setup from the beginning of this section: we have a training corpus $\mathcal{D}$ comprising $M$ datapoints with $c(w)$ occurrences of word $w\in V$. In the PYCRP this is reflected as $c(w)$ customers eating dish $w$, at any table. We also have the seating arrangement $\pi_{[M]}$ of the $M$ customers $x_1, \ldots, x_M$. Now let $t_w$ be the number of the $k$ occupied tables in the restaurant that serve dish $w$. The predictive probability of a new word given the seating arrangement is given by \cref{eq:pycrpconddist}, which evaluates to:
\begin{align}
p(x_{M+1} = w | \pi_{[M]}) = \frac{c(w)-t_w\gamma}{\alpha+M}+\frac{\alpha+k\gamma}{\alpha+M}G_0(w)
\end{align}
by aggregating the probabilities in \cref{eq:pycrpconddist} corresponding to each dish $w$.
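As a concrete illustration, the sketch below (the function and variable names are our own) computes this predictive probability from the word counts $c(w)$, the per-word table counts $t_w$, and a uniform base distribution over $W$ word types; the seating arrangement is assumed to be given.
\begin{verbatim}
def pyp_unigram_prob(word, counts, tables, alpha, gamma, vocab_size):
    """Predictive probability of `word` under a PYP unigram model.

    counts[w] : number of customers eating dish w, i.e. c(w)
    tables[w] : number of tables serving dish w, i.e. t_w
    """
    m = sum(counts.values())          # total number of customers M
    k = sum(tables.values())          # total number of occupied tables
    g0 = 1.0 / vocab_size             # uniform base distribution G0(w)
    c_w, t_w = counts.get(word, 0), tables.get(word, 0)
    return (c_w - t_w * gamma) / (alpha + m) + (alpha + k * gamma) / (alpha + m) * g0

# Toy seating arrangement: "the" occurs 3 times over 2 tables, "cat" once.
counts = {"the": 3, "cat": 1}
tables = {"the": 2, "cat": 1}
print(pyp_unigram_prob("the", counts, tables, alpha=1.0, gamma=0.5, vocab_size=10))
print(pyp_unigram_prob("dog", counts, tables, alpha=1.0, gamma=0.5, vocab_size=10))
\end{verbatim}
Note that for an unseen word both $c(w)$ and $t_w$ are zero, so all of its probability mass comes from the base distribution term.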
%http://cerebus.ift.ulaval.ca/~dallaire/talks/pyp_mixture.pdf
%http://www.auai.org/uai2014/proceedings/individuals/227.pdf
\section{The Bayesian Approach to $n$-gram Language Modelling}
We can achieve hierarchy in a Bayesian model in various ways. In this section, and in the next section, we will discuss two different hierarchical models. The hierarchical Pitman-Yor process will be discussed in the next section, and can be used to model domain knowledge. In the remainder of this section we introduce the nested Pitman-Yor process, which we can use to model $n$-grams.\footnote[][-24em]{The difference can be considered as modelling the hierarchy within data (the nested Pitman-Yor process), and hierarchy between data (the hierarchical Pitman-Yor process). However, note that in the literature the nested Pitman-Yor process as a language model is often referred to as the hierarchical Pitman-Yor language model.}
We introduce dependence between distributions through linear combinations of realisations of independent Pitman-Yor processes, which we call the nested Pitman-Yor process.\autocite{Teh2006AHierarchical,Teh2006ATechnical} There is also a body of similar work on nested Dirichlet processes.\autocite{Blei2010The,Rodriguez2008The}
Let $x_{in}$ be the $i$th observation within layer $n$, with $i = 1,\ldots,k_n$ and $n=1,\ldots,N$. In the context of our Bayesian language model, $n$ refers to the $n$-grams, and $N$ is the maximum $n$-gram length we model.\footnote{The sequence memoizer introduced by \textcite{Wood2009A} is a nested Pitman-Yor model with $N\rightarrow\infty$.} For each $n$-gram layer $n$ we thus have $k_n$ observations.
From the previous section on Pitman-Yor unigram language models we have a finite vocabulary $V$ of $W$ words. For each word $w\in V$, let $F(w)$ be the estimated probability of $w$. Now let $F=\{F(w)\}_{w\in V}$ be the vector of unigram probabilities, with a Pitman-Yor process prior whose mean is $G_0$:
\begin{align}\label{eq:pypcontext}
F&\sim\operatorname{PY}(\alpha, \gamma, G_0).
\end{align}
$G_0=\{G_0(w)\}_{w\in V}$ is the mean vector with the a priori probabilities of the words. In practice this distribution is usually set to the uniform distribution over the $W$ word types.
With the nested Pitman-Yor process we want to model the probabilities over the current word given various contexts, comprising up to $(n-1)$ words. Let $u$ be this context, and let $G_u(w)$ be the probability of the current word being word $w$. We now place a Pitman-Yor process prior on $G_u = \{G_u(w)\}_{w\in V}$:
\begin{align}\label{eq:npypcontext}
G_u&\sim\operatorname{PY}(\alpha_{|u|},\gamma_{|u|},G_{\pi(u)})
\end{align}
with $\pi(u)$ being the suffix of $u$ consisting of all but the earliest word. The concentration and discount parameters $\alpha$ and $\gamma$ are indexed by the context length $|u|$. The mean vector $G_{\pi(u)}$ contains the probabilities of the current word given all but the earliest word in the context $u$.
In this definition, where $|u|$ can take any value, we do not necessarily know $G_{\pi(u)}$ either, hence we place a recursive prior of the form of \cref{eq:npypcontext} over $G_{\pi(u)}$, now with parameters $\alpha_{|\pi(u)|}$, $\gamma_{|\pi(u)|}$, and mean vector $G_{\pi(\pi(u))}$. We repeat this process until we reach $G_\emptyset$, which is the same as $F$ from \cref{eq:pypcontext}.
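To make the recursion concrete, consider a trigram context $u = (w_1, w_2)$. Unrolling \cref{eq:npypcontext} down to the empty context gives
\begin{align*}
G_{(w_1,w_2)} &\sim \operatorname{PY}(\alpha_2, \gamma_2, G_{(w_2)}), \\
G_{(w_2)} &\sim \operatorname{PY}(\alpha_1, \gamma_1, G_{\emptyset}), \\
G_{\emptyset} &\sim \operatorname{PY}(\alpha_0, \gamma_0, G_0),
\end{align*}
so that the unigram distribution $G_\emptyset$ plays the role of $F$ in \cref{eq:pypcontext}.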
In the paper by \textcite{Teh2006ATechnical}, this process is called the hierarchical Chinese restaurant process. In the remainder of this thesis we will, however, call it the nested Chinese restaurant process.
\subsection{Nested Chinese Restaurant Process}
To get back to the metaphor of the Chinese restaurant process, think of a huge building with multiple floors, where each floor follows a different strategy. The first floor ($n=1$) serves only one dish, the second floor serves two dishes, and the $N$th floor serves $N$ dishes. Now recall that the dishes are word types. On the $n$th floor, you get to choose $n-1$ dishes, and the $n$th dish is a dessert. The order in which you order the $n-1$ dishes matters, just as it does in an $n$-gram.\footnote{Otherwise we would just have a bag of words.} Each floor consists of an unbounded number of rooms, each hosting a Chinese restaurant for a context $u$.
The idea is that since $G_u$ is Pitman-Yor process distributed, we can draw words from it using the PYCRP. Furthermore, a floor in the restaurant only depends directly on the floor beneath it: to draw words from the PYCRP for a context $u$, we only need to be able to draw words from its parent distribution $G_{\pi(u)}$. Since $G_{\pi(u)}$ is itself distributed as a Pitman-Yor process, we can recursively apply this step until we need draws from the global mean distribution $G_0$. This last step is easy, since it is a uniform distribution.
Now let $x_{u1},x_{u2},\ldots$ be the sequence of words for context $u$, drawn iid from $G_u$; let $y_{u1},y_{u2},\ldots$ be another sequence drawn iid from the parent distribution $G_{\pi(u)}$. We use $m$ to index draws from $G_u$, and $n$ to index the draws from $G_{\pi(u)}$.\footnote{Because $n$ is the short version of $m$.} Define $t_{uwn}=1$ if $y_{un}$ takes on the value $w$, and $0$ otherwise. For example, take the $4$-gram $\{w_1,w_2,w_3,w_4\}$ with context $u=\{w_1,w_2,w_3\}$ and focus word $w=w_4$: if the draw $y_{un}$ equals $w_4$, then $t_{uwn}=1$, whereas if it equals some other word $w_5$, then $t_{uwn}=0$. In other words, $t_{uwn}$ indicates whether the context $u$ and the draw $y_{un}$ together make up the $n$-gram with focus word $w$.
Each word $x_{um}$ is assigned to one of the draws $y_{un}$ from $G_{\pi(u)}$. In the restaurant metaphor this means that if you have a $4$-gram $\{w_1,w_2,w_3,w_4\}$, its context $\{w_1,w_2,w_3\}$ is assigned to another floor, i.e.~it will sit at a table on the trigram floor.\footnote{Together with the ``real'' trigrams, whose contexts will in turn be seated on the bigram floor, \&c.}
If $y_{un}$ takes on value $w$,\footnote{Remember that then $t_{uwn}=1$.} define $c_{uwn}$ as the number of words $x_{um}$ drawn from $G_u$ that are assigned to $y_{un}$, and otherwise let $c_{uwn} = 0$. In the restaurant metaphor, $c_{uwn}$ counts the customers in restaurant $u$ seated at table $n$, which serves dish $w$; the table itself corresponds to a single customer in the parent restaurant $\pi(u)$. In language-model terms, $c_{uwn}$ counts how many occurrences of word $w$ in context $u$ are attributed to the $n$th draw of $w$ from the lower-order distribution $G_{\pi(u)}$.
In order to avoid notational clutter, we introduce some abbreviations and represent marginal counts by dots. For example, $c_{u\cdot n}$ is the number of $x_{um}$'s assigned to the draw $y_{un}$; $c_{uw\cdot}$ is the number of $x_{um}$'s with value $w$, thus the count of the $n$-gram $uw$; and $t_{u\cdot\cdot}$ is the current number of draws $y_{un}$ from $G_{\pi(u)}$, i.e.~the current number of tables in restaurant $u$, each of which corresponds to a customer in the lower-order restaurant $\pi(u)$.
Put together, we have the following relationships among the $c_{uw\cdot}$'s and the $t_{uw\cdot}$'s:
\begin{align}
\begin{cases}
t_{uw\cdot} = 0 & \text{if }c_{uw\cdot}=0;\\
1\leq t_{uw\cdot}\leq c_{uw\cdot} & \text{if }c_{uw\cdot} > 0;
\end{cases} \label{eq:npyp1}\\
c_{uw\cdot} = \sum_{u':\pi(u')=u} t_{u'w\cdot}\label{eq:npyp2}
\end{align}
The first clause in \cref{eq:npyp1} states that if there are no tables in restaurant $u$ serving dish $w$, then there are also no customers eating dish $w$, hence the count is $0$. The second clause states that if there is at least one table, there must also be a positive number of customers; the number of tables, however, cannot exceed the number of customers.
\Cref{eq:npyp2} shows the marginalising step: the number of customers eating dessert $w$ in restaurant $u$ is the sum, over all child restaurants $u'$, of the number of tables serving $w$, i.e.~all the people who may have had a different main course, but chose the same dessert.
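Putting the pieces together: applying the PYCRP predictive rule within each restaurant and recursing through the parent restaurants yields a back-off-style computation of the probability of a word given its context. The sketch below is a minimal illustration under our own assumed data structures (dictionaries of customer counts $c_{uw\cdot}$ and table counts $t_{uw\cdot}$ per context, and per-level parameters $\alpha_{|u|}$, $\gamma_{|u|}$); it evaluates the predictive probability for a given, fixed seating arrangement and performs no inference over seating arrangements.
\begin{verbatim}
def nested_pyp_prob(word, context, counts, tables, alpha, gamma, vocab_size):
    """Predictive probability p(word | context) in a nested PY language model
    for a fixed seating arrangement.

    counts[u][w] : customers eating dish w in restaurant u  (c_uw.)
    tables[u][w] : tables serving dish w in restaurant u    (t_uw.)
    alpha[n], gamma[n] : strength and discount for contexts of length n
    context : tuple of words; () is the root (unigram) restaurant, which
              backs off to a uniform base distribution over vocab_size types.
    """
    if context is None:                     # past the root: the base distribution G0
        return 1.0 / vocab_size
    n = len(context)
    c_u = counts.get(context, {})
    t_u = tables.get(context, {})
    c_total = sum(c_u.values())
    t_total = sum(t_u.values())
    parent = context[1:] if context else None   # drop the earliest word
    p_parent = nested_pyp_prob(word, parent, counts, tables,
                               alpha, gamma, vocab_size)
    if c_total == 0:                        # empty restaurant: fall through to parent
        return p_parent
    c_w, t_w = c_u.get(word, 0), t_u.get(word, 0)
    return ((c_w - t_w * gamma[n]) / (alpha[n] + c_total)
            + (alpha[n] + t_total * gamma[n]) / (alpha[n] + c_total) * p_parent)

# Toy example with bigram contexts (N = 2).
counts = {(): {"the": 3, "cat": 1}, ("the",): {"cat": 2}}
tables = {(): {"the": 2, "cat": 1}, ("the",): {"cat": 1}}
alpha, gamma = {0: 1.0, 1: 1.0}, {0: 0.5, 1: 0.75}
print(nested_pyp_prob("cat", ("the",), counts, tables, alpha, gamma, vocab_size=10))
\end{verbatim}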
\section{The Bayesian Approach to Skipgram Language Modelling}
%essentially the same, but backoff refer to later chapter
In this chapter we introduced many concepts, such as the Chinese restaurant process, the stick-breaking process, and the related urn scheme. They are all closely connected to the Dirichlet process and, depending on the modelling properties you are after, provide more tractable representations or marginalisations of it. We also showed how unigram and $n$-gram language models fit in the paradigm of nonparametric Bayesian models, culminating in the Pitman-Yor process language models.
Each of the topics mentioned in this chapter is, and has been, a topic of research in its own right. This overview is therefore necessarily incomplete, but we hope that it serves as a round-up of literature that is otherwise fragmented, and muddled by the use of the same words for concepts that are similar, but not the same.
In the next chapter we continue with the notion of $n$-gram Pitman-Yor process language models; in particular, we add skipgrams as features.
% \section{The Bayesian Approach to Domain-dependent $n$-gram Language Modelling}
% \subsection{Chinese Restaurant Franchise}
% Recall the Chinese restaurant process, which is a metaphoric restaurant with an infinite number of round tables, with ample space to host an unbounded number of guests. Each table in the restaurant may serve only one dish, however, multiple tables may serve the same dish. We connected this to language modelling by saying that the customers are word tokens, the dishes the word types, and the number of customers eating a dish thus represent the occurrence count of that specific word.
% Each restaurant can then be considered a distribution over words, therefore representing a unigram language model. Now we want to build a hierarchical $n$-gram language model. To this end, we introduce the Chinese restaurant franchise. The layout of this section is similar to the previous section, and follows the same patterns, albeit now for a hierarchical model.
% A Chinese restaurant franchise is a collection of Chinese restaurants, which share a common menu with an infinite number of dishes. Each restaurant operates as an independent Chinese restaurant process with respect to the seating arrangement, but the dishes for each table are chosen by a different method, which depends on the global character of the franchise.
% Let $x_{i,j}$ denote the $j$th customer at the $i$th restaurant. When a customer enters, he chooses a table, either sharing it with other people, or an empty table. The probabilities of choosing either are the same as in the Chinese restaurant process.
% If the customers sits at an already occupied table, he is served the same dish as the other customers sitting at that table. If he seats at an unoccupied table, he must select a dish. Similar to choosing the table, the customer will also choose a dish with a probability proportional to its popularity. The popularity of a dish is determined by the number of tables across the entire franchise currently serving the dish. The customer can also decide to order a new dish.
% Let $K$ be the number of tables $t_1,\ldots,t_K$ in the restaurant serving $|D|$ dishes $D = (d_1,\ldots,d_D)$. $\tau_1,\ldots,\tau_K$ denote the dishes served, with table $i$ serving dish $d_i$. Furthermore, let $|d_i| = \sum_{t=1}^K \mathbb{1}[\tau_t = d_i]$ be the number of tables serving dish $d_i$. The distribution for choosing a dish $d$ at a new table is then:
% \begin{align}
% p(\theta_{K+1}=d | \tau_1, \ldots, \tau_K) =
% \begin{cases}
% \frac{|d|}{\alpha_0+K}, & \text{if }d\in D,\\
% \frac{\alpha}{\alpha_0+K}, & \text{otherwise.}
% \end{cases}
% \end{align}
% As for the customer, he has three options to consider:
% \begin{enumerate*}
% \item join an already occupied table and choose locally from the dishes already being served at restaurant $i$;
% \item start a new table and choose a dish from the global menu;
% \item start a new table and select a new dish.
% \end{enumerate*}
% Let $z_{ij}$ denote the dish chosen by the $j$th customer at restaurant $i$ and $c(d_s) = \sum_{n=1}^{j-1}\mathbb{1}[z_{in}=d_s]$ the number of customers eating dish $s$. The distribution on $z_{ij}$ is then:
% \begin{align}
% p(z_{ij} = d | \alpha_0, \alpha ) =
% \begin{cases}
% \frac{c(d)}{\alpha+(j-1)}+\frac{\alpha}{(\alpha+(j-1)}\frac{|d|}{\alpha_0 + K}, & \text{if }d\in D,\\
% \frac{\alpha}{\alpha+(j-1)}\frac{\alpha_0}{\alpha_0 + K}, & \text{otherwise.}
% \end{cases}
% \end{align}
% Put into words, dishes are chosesn according to their popularity, since the weight for a dish $d\in D$ is the sum of the number of customers at restaurant $i$ eating that dish plus $\alpha$ times the number of tables in the franchise serving that dish. The concentration parameter $\alpha$ specifies the relative importance of the local-level and global-level mixtures.
% To specify the Chinese restaurant franchise mixture model, we associate each dish with a class parameter drawn independently from a base measure $G_0$
% The conditional distribution is:
% \begin{align}
% a | b, \alpha_0, G_0 &\sim \sum_{}^{} \frac{}{(i-1)+\alpha_0}\delta_\phi(\phi) + \frac{\alpha_0}{(i-1)+\alpha_0}G_0
% \end{align}
% Since $G_0$ is drawn from another Dirichlet process $\operatorname{DP}(\alpha, H)$, we want to integrate it out. Notice that $G_0$ only appears in its role as the distribution of variables $\psi_{jt}$. To integrate $G_0$ out, we use the conditional probability of the Dirichlet process \cref{?}, and write the conditional distribution of $\psi_{jt}$ directly:
% \begin{align}
% % \psi_{jt} | \psi_{11},\psi_{12},\ldots,\psi_{21},\ldots,\psi_{j(t-1)},\alpha,H &\sim\sum_{k=1}^K\frac{m_k}{\alpha+\sum__{i=1}^K m_i}\delta_{\theta_k}+\frac{\alpha}{\alpha+\sum_{i=1}^K m_i}H.
% \end{align}
% \subsection{Tree Stick-breaking Process}
% \subsection{Urn?}
% \subsection{Hierarchical Dirichlet Process}
% \subsection{Hierarchical Pitman-Yor Process: The Bayesian $n$-gram Language Model}