diff --git a/subtitles/zh-CN/06_transformer-models-decoders.srt b/subtitles/zh-CN/06_transformer-models-decoders.srt
index 4f81ed010..7a22241f2 100644
--- a/subtitles/zh-CN/06_transformer-models-decoders.srt
+++ b/subtitles/zh-CN/06_transformer-models-decoders.srt
@@ -5,13 +5,13 @@
 2
 00:00:07,140 --> 00:00:07,973
-一个例子
+一种流行的仅包含解码器架构
 An example

 3
 00:00:07,973 --> 00:00:11,338
-一种流行的解码器唯一架构是 GPT 两种。
-of a popular decoder only architecture is GPT two.
+的例子是 GPT-2。
+of a popular decoder only architecture is GPT-2.

 4
 00:00:11,338 --> 00:00:14,160
@@ -20,7 +20,7 @@ In order to understand how decoders work
 5
 00:00:14,160 --> 00:00:17,430
-我们建议你观看有关编码器的视频。
+我们建议您观看有关编码器的视频。
 we recommend taking a look at the video regarding encoders.

 6
@@ -35,7 +35,7 @@ One can use a decoder
 8
 00:00:21,210 --> 00:00:23,760
-对于大多数与编码器相同的任务
+执行与编码器相同的大多数任务
 for most of the same tasks as an encoder

 9
@@ -55,12 +55,12 @@ with the encoder to try
 12
 00:00:30,300 --> 00:00:32,670
-并了解架构差异
+并了解在编码器和解码器之间
 and understand the architectural differences

 13
 00:00:32,670 --> 00:00:34,803
-在编码器和解码器之间。
+的架构差异。
 between an encoder and decoder.

 14
@@ -70,12 +70,12 @@ We'll use a small example using three words.
 15
 00:00:38,910 --> 00:00:41,050
-我们通过他们的解码器传递它们。
+我们通过解码器传递它们。
 We pass them through their decoder.

 16
 00:00:41,050 --> 00:00:44,793
-我们检索每个单词的数字表示。
+我们检索每个单词的数值表示。
 We retrieve a numerical representation for each word.

 17
@@ -85,17 +85,17 @@ Here for example, the decoder converts the three words.
 18
 00:00:49,350 --> 00:00:53,545
-欢迎来到纽约,欢迎来到这三个数字序列。
+Welcome to NYC,这三个数字序列。
 Welcome to NYC, and these three sequences of numbers.

 19
 00:00:53,545 --> 00:00:56,040
-解码器只输出一个序列
+解码器针对每个输入词汇
 The decoder outputs exactly one sequence

 20
 00:00:56,040 --> 00:00:58,740
-每个输入词的数字。
+只输出一个数字序列。
 of numbers per input word.

 21
@@ -105,7 +105,7 @@ This numerical representation can also
 22
 00:01:00,630 --> 00:01:03,783
-称为特征向量或特征传感器。
+称为特征向量(feature vector)或特征传感器(feature sensor)。
 be called a feature vector or a feature sensor.

 23
@@ -115,32 +115,32 @@ Let's dive in this representation.
 24
 00:01:07,200 --> 00:01:08,490
-它包含一个向量
+它包含了每个通过解码器
 It contains one vector

 25
 00:01:08,490 --> 00:01:11,340
-每个通过解码器的单词。
+传递的单词的一个向量。
 per word that was passed through the decoder.

 26
 00:01:11,340 --> 00:01:14,250
-这些向量中的每一个都是一个数字表示
+这些向量中的每一个
 Each of these vectors is a numerical representation

 27
 00:01:14,250 --> 00:01:15,573
-有问题的词。
+都是相应单词的数值表示。
 of the word in question.

 28
 00:01:16,920 --> 00:01:18,562
-该向量的维度被定义
+这个向量的维度
 The dimension of that vector is defined

 29
 00:01:18,562 --> 00:01:20,703
-通过模型的架构。
+由模型的架构所决定。
 by the architecture of the model.

 30
@@ -150,12 +150,12 @@ Where the decoder differs from the encoder is principally
 31
 00:01:26,040 --> 00:01:28,200
-具有自我注意机制。
+具有自注意力机制。
 with its self attention mechanism.

 32
 00:01:28,200 --> 00:01:30,843
-它使用所谓的掩蔽自我关注。
+它使用所谓的掩蔽自注意力。
 It's using what is called masked self attention.

 33
@@ -165,27 +165,27 @@ Here, for example, if we focus on the word "to"
 34
 00:01:34,650 --> 00:01:37,620
-我们会看到 vector 是绝对未修改的
+我们会发现它的向量
 we'll see that is vector is absolutely unmodified

 35
 00:01:37,620 --> 00:01:39,690
-用纽约的话来说。
+完全没有被 NYC 这个词修改。
 by the NYC word.

 36
 00:01:39,690 --> 00:01:41,731
-那是因为右边所有的话,也都知道
+那是因为右边的所有单词,即
 That's because all the words on the right, also known

 37
 00:01:41,731 --> 00:01:45,276
-因为这个词的正确上下文被掩盖了
+单词的右侧上下文都被屏蔽了
 as the right context of the word is masked rather

 38
 00:01:45,276 --> 00:01:49,230
-而不是受益于左右所有的话。
+而没有从左侧和右侧的所有单词中受益。
 than benefiting from all the words on the left and right.
 39
@@ -205,32 +205,32 @@ which can be the left context or the right context.
 42
 00:01:59,539 --> 00:02:03,356
-Masked self attention 机制不同
+掩蔽自注意力机制不同于
 The masked self attention mechanism differs

 43
 00:02:03,356 --> 00:02:04,320
-来自 self attention 机制
+自注意力机制
 from the self attention mechanism

 44
 00:02:04,320 --> 00:02:07,110
-通过使用额外的掩码来隐藏上下文
+通过使用额外的掩码在单词的两边
 by using an additional mask to hide the context

 45
 00:02:07,110 --> 00:02:09,390
-在单词的两边
+来隐藏上下文
 on either side of the word

 46
 00:02:09,390 --> 00:02:12,810
-单词数值表示不会受到影响
+通过隐藏上下文中的单词
 the words numerical representation will not be affected

 47
 00:02:12,810 --> 00:02:14,853
-通过隐藏上下文中的单词。
+单词的数值表示就不会受到影响。
 by the words in the hidden context.

 48
@@ -245,7 +245,7 @@ Decoders like encoders can be used as standalone models
 50
 00:02:22,380 --> 00:02:25,020
-因为它们生成数字表示。
+因为它们生成数值表示。
 as they generate a numerical representation.

 51
@@ -265,42 +265,42 @@ A word can only have access to its left context
 54
 00:02:34,530 --> 00:02:36,690
-只能访问他们的左上下文。
+因为它只有左侧的上下文信息。
 having only access to their left context.

 55
 00:02:36,690 --> 00:02:39,120
-他们天生擅长文本生成
+它们天生擅长文本生成
 They're inherently good at text generation

 56
 00:02:39,120 --> 00:02:41,010
-生成单词的能力
+即在已知的词序列基础上生成一个单词
 the ability to generate a word

 57
 00:02:41,010 --> 00:02:45,000
-或给定已知单词序列的单词序列。
+或单词序列的能力。
 or a sequence of words given a known sequence of words.

 58
 00:02:45,000 --> 00:02:45,833
-这是众所周知的
+这被称为
 This is known

 59
 00:02:45,833 --> 00:02:49,083
-作为因果语言建模或自然语言生成。
+因果语言建模或自然语言生成。
 as causal language modeling or natural language generation.

 60
 00:02:50,430 --> 00:02:53,520
-这是因果语言建模如何工作的示例。
+下面是一个展示因果语言模型的工作原理的示例。
 Here's an example of how causal language modeling works.

 61
 00:02:53,520 --> 00:02:56,410
-我们从一个词开始,这是我的
+我们从一个词 my 开始,
 We start with an initial word, which is my

 62
@@ -320,52 +320,52 @@ and this vector contains information about the sequence
 65
 00:03:07,230 --> 00:03:08,733
-这是一个词。
+这里的序列是一个单词。
 which is here a single word.

 66
 00:03:09,780 --> 00:03:11,430
-我们应用一个小的转换
+我们对该向量
 We apply a small transformation

 67
 00:03:11,430 --> 00:03:13,110
-到那个向量,以便它映射
+应用一个小的转换
 to that vector so that it maps

 68
 00:03:13,110 --> 00:03:16,500
-到模型已知的所有单词,这是一个映射
+使其映射到模型已知的所有单词
 to all the words known by the model, which is a mapping

 69
 00:03:16,500 --> 00:03:19,890
-我们稍后会看到称为语言建模头。
+这个映射我们稍后会看到,称为语言建模头(language modeling head)
 that we'll see later called a language modeling head.

 70
 00:03:19,890 --> 00:03:21,930
-我们确定该模型相信
+我们发现模型认为
 We identify that the model believes

 71
 00:03:21,930 --> 00:03:25,053
-最有可能的后续单词是 name。
+接下来最有可能的单词是 “name”。
 that the most probable following word is name.

 72
 00:03:26,250 --> 00:03:28,710
-然后我们取那个新词并添加它
+然后我们把这个新单词加到原始的序列 my 后面
 We then take that new word and add it

 73
 00:03:28,710 --> 00:03:33,480
-到我的初始序列,我们现在以我的名字命名。
+我们得到了 my name。
 to the initial sequence from my, we are now at my name.

 74
 00:03:33,480 --> 00:03:36,870
-这就是自回归方面的用武之地。
+这就是自回归(auto-regressive)的作用所在。
 This is where the auto regressive aspect comes in.

 75
@@ -375,7 +375,7 @@ Auto regressive models.
 76
 00:03:38,490 --> 00:03:42,513
-我们使用他们过去的输出作为输入和以下步骤。
+我们在接下来的步骤中使用它们过去的输出作为输入。
 We use their past outputs as inputs and the following steps.

 77
@@ -395,7 +395,7 @@ and retrieve the most probable following word.
 80
 00:03:52,978 --> 00:03:57,978
-本例中就是 “是” 这个词,我们重复操作
+本例中就是 “is” 这个词,我们重复操作
 In this case, it is the word "is", we repeat the operation

 81
@@ -410,13 +410,13 @@ We've now generated a full sentence.
 83
 00:04:04,590 --> 00:04:07,890
-我们决定就此打住,但我们可以继续一段时间。
+我们决定就此打住,但我们也可以继续一段时间。
 We decide to stop there, but we could continue for a while.

 84
 00:04:07,890 --> 00:04:12,890
-例如,GPT 2 的最大上下文大小为 1,024。
-GPT two, for example, has a maximum context size of 1,024.
+例如,GPT-2 的最大上下文大小为 1,024。
+GPT-2, for example, has a maximum context size of 1,024.

 85
 00:04:13,170 --> 00:04:16,830
@@ -425,11 +425,11 @@ We could eventually generate up to a 1,024 words
 86
 00:04:16,830 --> 00:04:19,050
-并且解码器仍然会有一些记忆
+并且解码器仍然会对这个序列的前几个单词
 and the decoder would still have some memory

 87
 00:04:19,050 --> 00:04:21,003
-这个序列中的第一个单词。
+有一些记忆。
 of the first words in this sequence.
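The subtitles describe masked self attention only in words: the vector for "to" is unaffected by "NYC" because the right context is hidden. Below is a minimal PyTorch sketch of that idea; it is not part of the subtitle file or the course code, and the toy 3x3 score matrix and variable names are illustrative assumptions.

```python
# Minimal sketch of masked self attention: each position may only attend
# to itself and its left context, as described for "Welcome to NYC".
import torch

scores = torch.randn(3, 3)                         # toy attention scores for a 3-word sequence
causal_mask = torch.tril(torch.ones(3, 3)).bool()  # True = visible, False = hidden right context
weights = scores.masked_fill(~causal_mask, float("-inf")).softmax(dim=-1)

print(weights[1])  # row for "to": zero weight on "NYC", so its vector ignores the right context
```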
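The causal language modeling walkthrough in the transcript (start from "my", map the last feature vector through the language modeling head, keep the most probable word, append it, and repeat) can also be sketched in a few lines. This assumes the public "gpt2" checkpoint from the Hugging Face Hub and the transformers library; the prompt "My name" and the four-token budget are arbitrary choices for illustration, not part of the transcript.

```python
# Greedy autoregressive decoding, mirroring the loop described in the video.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(model.config.n_positions)  # 1024, the maximum context size mentioned above

input_ids = tokenizer("My name", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(4):                     # generate four more tokens
        logits = model(input_ids).logits   # language modeling head output, one row per input token
        next_id = logits[0, -1].argmax()   # most probable next token (greedy choice)
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))  # e.g. "My name is ..."
```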