fix quotes (huggingface#15)
patrickvonplaten authored Mar 19, 2020
1 parent 14d8aee commit 7ebac00
Showing 1 changed file with 21 additions and 21 deletions.
42 changes: 21 additions & 21 deletions how-to-generate.md
@@ -113,15 +113,15 @@ its next word: \\(w_t = argmax_{w}P(w | w_{1:t-1})\\) at each timestep

<img src="/blog/assets/02_how-to-generate/greedy_search.png" alt="greedy search" style="margin: auto; display: block;">

-Starting from the word \\(\text{"The"}\\), the algorithm greedily chooses
-the next word of highest probability \\(\text{"nice"}\\) and so on, so
+Starting from the word \\(\text{``The''}\\), the algorithm greedily chooses
+the next word of highest probability \\(\text{``nice''}\\) and so on, so
that the final generated word sequence is
\\(\text{"The", "nice", "woman"}\\) having an overall probability of
\\(\text{``The'', ``nice'', ``woman''}\\) having an overall probability of
\\(0.5 \times 0.4 = 0.2\\).

In the following we will generate word sequences using GPT2 on the
context
\\((\text{"I", "enjoy", "walking", "with", "my", "cute", "dog"})\\). Let's
\\((\text{``I'', ``enjoy'', ``walking'', ``with'', ``my'', ``cute'', ``dog''})\\). Let's
see how greedy search can be used in `transformers`:

</div>
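
The code cell itself is unchanged by this commit and collapsed in the diff. As a minimal hypothetical sketch of the kind of call meant here, assuming GPT-2 loaded through `transformers` with the PyTorch backend:

```python
# Hypothetical sketch, not part of this commit: greedy decoding with GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# encode the context used throughout the post
input_ids = tokenizer.encode("I enjoy walking with my cute dog", return_tensors="pt")

# with no sampling or beam arguments, generate() performs greedy search
greedy_output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```
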
@@ -165,10 +165,10 @@ The major drawback of greedy search though is that it misses high
probability words hidden behind a low probability word as can be seen in
our sketch above:

The word \\(\text{"has"}\\) with its high conditional probability of
\\(0.9\\) is hidden behind the word \\(\text{"dog"}\\), which has only the
The word \\(\text{``has''}\\) with its high conditional probability of
\\(0.9\\) is hidden behind the word \\(\text{``dog''}\\), which has only the
second-highest conditional probability, so that greedy search misses the
-word sequence \\(\text{"The"}, \text{"dog"}, \text{"has"}\\).
+word sequence \\(\text{``The''}, \text{``dog''}, \text{``has''}\\).

Thankfully, we have beam search to alleviate this problem\!

@@ -186,10 +186,10 @@ highest probability. Let's illustrate with `num_beams=2`:
<img src="/blog/assets/02_how-to-generate/beam_search.png" alt="beam search" style="margin: auto; display: block;">

At time step \\(1\\), besides the most likely hypothesis
\\(\text{"The", "woman"}\\), beam search also keeps track of the second
most likely one \\(\text{"The", "dog"}\\). At time step \\(2\\), beam search
finds that the word sequence \\(\text{"The", "dog", "has"}\\) has with
\\(0.36\\) a higher probability than \\(\text{"The", "nice", "woman"}\\),
\\(\text{``The'', ``woman''}\\), beam search also keeps track of the second
most likely one \\(\text{``The'', ``dog''}\\). At time step \\(2\\), beam search
finds that the word sequence \\(\text{``The'', ``dog'', ``has''}\\) has with
\\(0.36\\) a higher probability than \\(\text{``The'', ``nice'', ``woman''}\\),
which has \\(0.2\\). Great, it has found the most likely word sequence in
our toy example\!
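
The corresponding call is collapsed in this diff. A hypothetical sketch of beam search in `transformers`, reusing `model`, `tokenizer`, and `input_ids` from the greedy-search sketch above, with `num_beams=2` mirroring the toy example:

```python
# Hypothetical sketch, not part of this commit: beam search with two beams,
# assuming model, tokenizer and input_ids from the sketch above.
beam_output = model.generate(
    input_ids,
    max_length=50,
    num_beams=2,          # keep the two most likely hypotheses at every step
    early_stopping=True,  # stop once all beams have reached an EOS token
)

print(tokenizer.decode(beam_output[0], skip_special_tokens=True))
```
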

@@ -389,10 +389,10 @@ generation when sampling.
<img src="/blog/assets/02_how-to-generate/sampling_search.png" alt="sampling search" style="margin: auto; display: block;">

It becomes obvious that language generation using sampling is not
-*deterministic* anymore. The word \\(\text{"car"}\\) is sampled from the
-conditioned probability distribution \\(P(w | \text{"The"})\\), followed
-by sampling \\(\text{"drives"}\\) from
-\\(P(w | \text{"The"}, \text{"car"})\\).
+*deterministic* anymore. The word \\(\text{``car''}\\) is sampled from the
+conditioned probability distribution \\(P(w | \text{``The''})\\), followed
+by sampling \\(\text{``drives''}\\) from
+\\(P(w | \text{``The'', ``car''})\\).
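
As the following paragraph notes, this is done with `do_sample=True` while *Top-K* filtering is deactivated via `top_k=0`. A hypothetical sketch, reusing the setup from the greedy-search sketch above:

```python
# Hypothetical sketch, not part of this commit: pure sampling from the model's
# distribution, assuming model, tokenizer and input_ids from the sketch above.
import torch

torch.manual_seed(0)  # fix the seed so the sampled continuation is reproducible

sample_output = model.generate(
    input_ids,
    do_sample=True,  # sample the next word instead of taking the argmax
    max_length=50,
    top_k=0,         # 0 disables Top-K filtering, so we sample from the full vocabulary
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
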

In `transformers`, we set `do_sample=True` and deactivate *Top-K*
sampling (more on this later) via `top_k=0`. In the following, we will
@@ -454,7 +454,7 @@ look as follows.
<img src="/blog/assets/02_how-to-generate/sampling_search_with_temp.png" alt="sampling temp search" style="margin: auto; display: block;">

The conditional next word distribution of step \\(t=1\\) becomes much
-sharper leaving almost no chance for word \\(\text{"car"}\\) to be
+sharper leaving almost no chance for word \\(\text{``car''}\\) to be
selected.
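
A hypothetical sketch of such a call, reusing the earlier setup; `temperature=0.7` is only an illustrative value:

```python
# Hypothetical sketch, not part of this commit: sampling with temperature < 1
# sharpens the distribution; assumes model, tokenizer and input_ids from above.
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=0,
    temperature=0.7,  # values < 1 make high-probability words even more likely
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
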

Let's see how we can cool down the distribution in the library by
@@ -523,7 +523,7 @@ to 6 words. While the 6 most likely words, defined as
probability mass in the first step, it includes almost all of the
probability mass in the second step. Nevertheless, we see that it
successfully eliminates the rather weird candidates
\\(\text{"not", "the", "small", "told"}\\) in the second sampling step.
\\(\text{``not'', ``the'', ``small'', ``told''}\\) in the second sampling step.

Let's see how *Top-K* can be used in the library by setting `top_k=50`:
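
The actual cell is collapsed in this diff; a hypothetical sketch, reusing the earlier setup:

```python
# Hypothetical sketch, not part of this commit: Top-K sampling keeps only the
# 50 most likely next words; assumes model, tokenizer and input_ids from above.
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_k=50,  # redistribute probability mass over the 50 most likely words
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
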

@@ -573,9 +573,9 @@ others from a much more flat distribution (distribution on the left in
the graph above).

In step \\(t=1\\), *Top-K* eliminates the possibility to sample
\\(\text{"people", "big", "house", "cat"}\\), which seem like reasonable
\\(\text{``people'', ``big'', ``house'', ``cat''}\\), which seem like reasonable
candidates. On the other hand, in step \\(t=2\\) the method includes the
-arguably ill-fitted words \\(\text{"down", "a"}\\) in the sample pool of
+arguably ill-fitted words \\(\text{``down'', ``a''}\\) in the sample pool of
words. Thus, limiting the sample pool to a fixed size *K* could cause
the model to produce gibberish for sharp distributions and limit the
model's creativity for flat distributions. This intuition led [Ari
@@ -604,9 +604,9 @@ words to exceed together \\(p=92\%\\) of the probability mass, defined as
likely words, whereas it only has to pick the top 3 words in the second
example to exceed 92%. Quite simple actually\! It can be seen that it
keeps a wide range of words where the next word is arguably less
-predictable, *e.g.* \\(P(w | \text{"The"})\\), and only a few words when
+predictable, *e.g.* \\(P(w | \text{``The''})\\), and only a few words when
the next word seems more predictable, *e.g.*
\\(P(w | \text{"The", "car"})\\).
\\(P(w | \text{``The'', ``car''})\\).

Alright, time to check it out in `transformers`\! We activate *Top-p*
sampling by setting `0 < top_p < 1`:
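
The actual cell is collapsed in this diff; a hypothetical sketch, reusing the earlier setup, with `top_p=0.92` mirroring the \\(92\%\\) figure above:

```python
# Hypothetical sketch, not part of this commit: Top-p (nucleus) sampling keeps the
# smallest set of words whose cumulative probability exceeds p; assumes the setup above.
sample_output = model.generate(
    input_ids,
    do_sample=True,
    max_length=50,
    top_p=0.92,  # sample only from the smallest word set covering 92% of the mass
    top_k=0,     # disable Top-K so that only the Top-p filter applies
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))
```
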
