diff --git a/subtitles/zh-CN/60_what-is-the-bleu-metric.srt b/subtitles/zh-CN/60_what-is-the-bleu-metric.srt
index e3fce657e..7de1b2163 100644
--- a/subtitles/zh-CN/60_what-is-the-bleu-metric.srt
+++ b/subtitles/zh-CN/60_what-is-the-bleu-metric.srt
@@ -20,52 +20,52 @@
 5
 00:00:07,650 --> 00:00:10,170
-对于许多 NLP 任务,我们可以使用通用指标
+对于许多 NLP 任务,我们可以使用常见指标
 For many NLP tasks we can use common metrics

 6
 00:00:10,170 --> 00:00:12,810
-比如准确性或 F1 分数,但你会做什么
+比如准确率或 F1 分数,
 like accuracy or F1 score, but what do you do

 7
 00:00:12,810 --> 00:00:14,340
-当你想衡量文本的质量时
+但是当你想衡量模型所翻译的文本的质量时
 when you wanna measure the quality of text

 8
 00:00:14,340 --> 00:00:16,560
-那是从模型翻译过来的?
+该如何评估呢?
 that's been translated from a model?

 9
 00:00:16,560 --> 00:00:18,750
-在本视频中,我们将了解一个广泛使用的指标
+在本视频中,我们将为大家介绍一个
 In this video, we'll take a look at a widely used metric

 10
 00:00:18,750 --> 00:00:20,613
-用于称为 BLEU 的机器翻译。
+广泛用于机器翻译的指标,叫做 BLEU。
 for machine translation called BLEU.

 11
 00:00:22,290 --> 00:00:23,940
-BLEU 背后的基本思想是分配
+BLEU 背后的基本逻辑是
 The basic idea behind BLEU is to assign

 12
 00:00:23,940 --> 00:00:26,250
-翻译的单一数字分数
+为每个翻译分配一个单一的数字评分
 a single numerical score to a translation

 13
 00:00:26,250 --> 00:00:27,450
-这告诉我们它有多好
+用于评估
 that tells us how good it is

 14
 00:00:27,450 --> 00:00:30,199
-与一个或多个参考翻译相比。
+它与一个或者多个参考翻译相比,其质量的优劣。
 compared to one or more reference translations.

 15
@@ -80,12 +80,12 @@ that has been translated into English by some model.
 17
 00:00:35,340 --> 00:00:37,170
-如果我们比较生成的翻译
+如果我们将生成的翻译
 If we compare the generated translation

 18
 00:00:37,170 --> 00:00:39,150
-一些参考人工翻译,
+与一些用于参考的人工翻译进行比较,
 to some reference human translations,

 19
@@ -100,47 +100,47 @@ but has made a common error.
 21
 00:00:43,260 --> 00:00:46,050
-西班牙语单词 tengo 在英语中的意思是,
+西班牙语单词 tengo 在英语中的意思是 have,
 The Spanish word tengo means have in English,

 22
 00:00:46,050 --> 00:00:48,700
-这种一对一的翻译不太自然。
+这种一一对应的直译不太自然。
 and this one-to-one translation is not quite natural.

 23
 00:00:49,890 --> 00:00:51,270
-那么我们如何衡量质量
+那么对于使用某种自动的方法生成的翻译
 So how can we measure the quality

 24
 00:00:51,270 --> 00:00:54,270
-以某种自动方式生成的翻译?
+我们如何来评估它的质量呢?
 of a generated translation in some automatic way?

 25
 00:00:54,270 --> 00:00:56,730
-BLEU 采用的方法是比较 n-gram
+BLEU 采用的方法是
 The approach that BLEU takes is to compare the n-grams

 26
 00:00:56,730 --> 00:00:58,550
-生成的 n-gram 翻译
+将所生成翻译的 n-gram 和参考翻译的 n-gram
 of the generated translation to the n-grams

 27
 00:00:58,550 --> 00:01:00,390
-在参考资料中。
+进行比较。
 in the references.

 28
 00:01:00,390 --> 00:01:02,400
-现在,n-gram 只是一种奇特的说法
+现在,n-gram 只是一种奇特的说法,
 Now, an n-gram is just a fancy way of saying

 29
 00:01:02,400 --> 00:01:03,960
-一大块 n 个单词。
+指的就是由 n 个单词组成的语块。
 a chunk of n words.

 30
@@ -150,32 +150,32 @@ So let's start with unigrams,
 31
 00:01:05,220 --> 00:01:08,020
-对应于句子中的单个单词。
+它对应于句子中的单个单词。
 which corresponds to the individual words in a sentence.

 32
 00:01:08,880 --> 00:01:11,250
-在此示例中,你可以看到其中四个单词
+在此示例中,你可以看到所生成的翻译中有四个单词
 In this example, you can see that four of the words

 33
 00:01:11,250 --> 00:01:13,140
-在生成的翻译中也发现
+在其中一条参考翻译中
 in the generated translation are also found

 34
 00:01:13,140 --> 00:01:14,990
-在其中一个参考翻译中。
+也出现了。
 in one of the reference translations.

 35
 00:01:16,350 --> 00:01:18,240
-一旦我们找到了我们的比赛,
+一旦我们找到了匹配项,
 And once we've found our matches,

 36
 00:01:18,240 --> 00:01:20,130
-一种给译文打分的方法
+给译文打分的一种方法
 one way to assign a score to the translation

 37
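
A minimal Python sketch of the unigram precision described in the cues above
(the sentences are the video's running example; whitespace tokenization is an
illustrative assumption, not the course's actual code):

    # Unigram precision: matching words divided by words in the generation.
    generation = "I have thirty six years".split()
    reference = "I am thirty six years old".split()

    # Count generated words that also appear in the reference.
    matches = sum(1 for word in generation if word in reference)
    precision = matches / len(generation)
    print(precision)  # 4 of 5 words match -> 0.8
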
@@ -185,22 +185,22 @@ is to compute the precision of the unigrams.
 38
 00:01:23,070 --> 00:01:25,200
-这意味着我们只计算匹配词的数量
+这意味着我们在生成的和参考的翻译中
 This means we just count the number of matching words

 39
 00:01:25,200 --> 00:01:27,360
-在生成的和参考的翻译中
+只计算匹配词的数量
 in the generated and reference translations

 40
 00:01:27,360 --> 00:01:29,660
-并通过除以单词数来归一化计数
+并且通过除以生成结果的单词数
 and normalize the count by dividing by the number of words

 41
 00:01:29,660 --> 00:01:30,753
-在这一代。
+来归一化计数值。
 in the generation.

 42
@@ -210,7 +210,7 @@ In this example, we found four matching words
 43
 00:01:34,080 --> 00:01:36,033
-而我们这一代人有五个字。
+而我们的生成结果中有五个单词。
 and our generation has five words.

 44
@@ -225,7 +225,7 @@ and higher precision scores mean a better translation.
 46
 00:01:44,160 --> 00:01:45,570
-但这并不是故事的全部
+但是到这里还没有结束
 But this isn't really the whole story

 47
@@ -235,17 +235,17 @@ because one problem with unigram precision
 48
 00:01:47,310 --> 00:01:49,140
-翻译模型有时会卡住吗
+翻译模型有时会陷入重复的模式中
 is that translation models sometimes get stuck

 49
 00:01:49,140 --> 00:01:51,330
-以重复的方式重复同一个词
+一遍又一遍地重复
 in repetitive patterns and just repeat the same word

 50
 00:01:51,330 --> 00:01:52,293
-几次。
+同一个单词。
 several times.

 51
@@ -260,12 +260,12 @@ we can get really high precision scores
 53
 00:01:56,370 --> 00:01:57,840
-虽然翻译很烂
+即使从人类的角度来看
 even though the translation is terrible

 54
 00:01:57,840 --> 00:01:59,090
-从人的角度来看!
+这个翻译很糟糕!
 from a human perspective!

 55
@@ -280,32 +280,32 @@ we get a perfect unigram precision score.
 57
 00:02:06,960 --> 00:02:09,930
-所以为了处理这个问题,BLEU 使用了修改后的精度
+所以为了解决这个问题,BLEU 使用了修改后的精度
 So to handle this, BLEU uses a modified precision

 58
 00:02:09,930 --> 00:02:12,210
-剪掉计算一个单词的次数,
+它会根据一个单词在参考翻译中
 that clips the number of times to count a word,

 59
 00:02:12,210 --> 00:02:13,680
-基于最大次数
+出现的最大次数
 based on the maximum number of times

 60
 00:02:13,680 --> 00:02:16,399
-它出现在参考翻译中。
+来限制该单词的计数次数。
 it appears in the reference translation.

 61
 00:02:16,399 --> 00:02:18,630
-在这个例子中,单词 six 只出现了一次
+在这个例子中,单词 six 只在参考翻译中出现了一次
 In this example, the word six only appears once

 62
 00:02:18,630 --> 00:02:21,360
-在参考中,所以我们把分子剪成一
+所以我们把分子限制为 1
 in the reference, so we clip the numerator to one

 63
@@ -335,27 +335,27 @@ the order in which the words appear in the translations.
 68
 00:02:33,900 --> 00:02:35,700
-例如,假设我们有 Yoda
+例如,假设我们有 Yoda 为我们
 For example, suppose we had Yoda

 69
 00:02:35,700 --> 00:02:37,410
-翻译我们的西班牙语句子,
+翻译西班牙语句子,
 translate our Spanish sentence,

 70
 00:02:37,410 --> 00:02:39,457
-那么我们可能会得到一些倒退的东西,比如,
+那么我们可能会得到一些语序颠倒的结果,
 then we might get something backwards like,

 71
 00:02:39,457 --> 00:02:42,450
-“我已经六十岁了。”
+比如,“Years sixty thirty have I.”
 "Years sixty thirty have I."

 72
 00:02:42,450 --> 00:02:44,670
-在这种情况下,修改后的 unigram 精度
+在这种情况下,修改后的 unigram 精度值
 In this case, the modified unigram precision

 73
@@ -370,12 +370,12 @@ So to deal with word ordering problems,
 75
 00:02:50,460 --> 00:02:52,020
-BLEU 实际计算精度
+BLEU 实际上计算几个不同的 n-gram 精度值,
 BLEU actually computes the precision

 76
 00:02:52,020 --> 00:02:55,410
-对于几个不同的 n-gram,然后对结果进行平均。
+然后对结果计算平均值。
 for several different n-grams and then averages the result.

 77
@@ -385,22 +385,22 @@ For example, if we compare 4-grams,
 78
 00:02:57,300 --> 00:02:58,830
-我们可以看到没有匹配的块
+我们可以看到翻译中
 we can see that there are no matching chunks

 79
 00:02:58,830 --> 00:03:01,020
-翻译中的四个词,
+没有匹配的四词语块,
 of four words in the translations,

 80
 00:03:01,020 --> 00:03:02,913
-所以 4 克精度为 0。
+所以 4-gram 精度为 0。
 and so the 4-gram precision is 0.

 81
 00:03:05,460 --> 00:03:07,560
-现在,计算数据集库中的 BLEU 分数
+现在,使用 Datasets 库计算 BLEU 分数
 Now, to compute BLEU scores in Datasets library

 82
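
A minimal sketch of the clipped (modified) unigram precision from the cues
above, under the same illustrative assumptions (whitespace tokenization, the
video's example sentences):

    from collections import Counter

    # Each word's count in the generation is clipped to the maximum
    # number of times it appears in the reference.
    generation = "six six six six six".split()
    reference = "I am thirty six years old".split()

    gen_counts = Counter(generation)
    ref_counts = Counter(reference)
    clipped = sum(min(n, ref_counts[word]) for word, n in gen_counts.items())
    precision = clipped / len(generation)
    print(precision)  # numerator clipped to 1 -> 1/5 = 0.2
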
@@ -420,12 +420,12 @@ provide your model's predictions with their references
 85
 00:03:13,290 --> 00:03:14,390
-你很高兴去!
+然后就一切就绪!
 and you're good to go!

 86
 00:03:16,470 --> 00:03:19,200
-输出将包含几个感兴趣的字段。
+输出将包含几个值得关注的字段。
 The output will contain several fields of interest.

 87
@@ -435,27 +435,27 @@ The precisions field contains
 88
 00:03:20,490 --> 00:03:23,133
-每个 n-gram 的所有单独精度分数。
+每个 n-gram 各自的精度分数。
 all the individual precision scores for each n-gram.

 89
 00:03:25,050 --> 00:03:26,940
-然后计算 BLEU 分数本身
+然后 BLEU 分数本身
 The BLEU score itself is then calculated

 90
 00:03:26,940 --> 00:03:30,090
-通过取精度分数的几何平均值。
+通过取精度分数的几何平均值进行计算。
 by taking the geometric mean of the precision scores.

 91
 00:03:30,090 --> 00:03:32,790
-默认情况下,所有四个 n-gram 精度的平均值
+默认情况下,所有四个 n-gram 精度的平均值都会输出,
 And by default, the mean of all four n-gram precisions

 92
 00:03:32,790 --> 00:03:35,793
-据报道,该指标有时也称为 BLEU-4。
+该指标有时也称为 BLEU-4。
 is reported, a metric that is sometimes also called BLEU-4.

 93
@@ -465,7 +465,7 @@ In this example, we can see the BLEU score is zero
 94
 00:03:38,880 --> 00:03:40,780
-因为 4 克精度为零。
+因为 4-gram 精度为零。
 because the 4-gram precision was zero.

 95
@@ -475,7 +475,7 @@ Now, the BLEU metric has some nice properties,
 96
 00:03:45,390 --> 00:03:47,520
-但这远非一个完美的指标。
+但它远非一个完美的评估指标。
 but it is far from a perfect metric.

 97
@@ -490,12 +490,12 @@ and it's widely used in research
 99
 00:03:50,970 --> 00:03:52,620
-这样你就可以将你的模型与其他模型进行比较
+这样你就可以在通用基准上将你的模型
 so you can compare your model against others

 100
 00:03:52,620 --> 00:03:54,630
-在共同的基准上。
+与其他模型进行比较。
 on common benchmarks.

 101
@@ -505,12 +505,12 @@ On the other hand, there are several big problems with BLEU,
 102
 00:03:56,670 --> 00:03:58,830
-包括它不包含语义的事实
+包括它并不考虑语义,
 including the fact it doesn't incorporate semantics

 103
 00:03:58,830 --> 00:04:01,920
-它在非英语语言上很挣扎。
+而且它在非英语语言上表现不佳。
 and it struggles a lot on non-English languages.

 104
@@ -525,17 +525,17 @@ is that it assumes the human translations
 106
 00:04:04,620 --> 00:04:05,820
-已经被代币化
+已经被词元化
 have already been tokenized

 107
 00:04:05,820 --> 00:04:07,320
-这使得比较模型变得困难
+这使得比较使用不同分词器的模型
 and this makes it hard to compare models

 108
 00:04:07,320 --> 00:04:08,820
-使用不同的分词器。
+变得困难。
 that use different tokenizers.

 109
@@ -560,7 +560,7 @@ is to use the SacreBLEU metric,
 113
 00:04:19,440 --> 00:04:22,830
-它解决了 BLEU 的标记化限制。
+它解决了 BLEU 的词元化限制。
 which addresses the tokenization limitations of BLEU.

 114
@@ -570,12 +570,12 @@ As you can see in this example,
 115
 00:04:24,360 --> 00:04:26,580
-计算 SacreBLEU 分数几乎相同
+计算 SacreBLEU 分数的方式与计算 BLEU 的方式
 computing the SacreBLEU score is almost identical

 116
 00:04:26,580 --> 00:04:28,020
-到 BLEU 一个。
+几乎完全相同。
 to the BLEU one.

 117
@@ -590,7 +590,7 @@ instead of a list of words to the translations,
 119
 00:04:32,640 --> 00:04:35,640
-SacreBLEU 负责底层的代币化。
+SacreBLEU 负责底层的词元化。
 and SacreBLEU takes care of the tokenization under the hood.

 120
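
For reference, a sketch of the metric call the cues above describe, using the
load_metric API that the Datasets library offered when this video was made
(newer code loads "sacrebleu" through the evaluate library instead); the
example strings are illustrative:

    from datasets import load_metric

    # SacreBLEU handles tokenization under the hood, so predictions
    # and references are plain strings.
    sacrebleu = load_metric("sacrebleu")
    predictions = ["I have thirty six years"]     # model outputs
    references = [["I am thirty-six years old"]]  # one list of references per prediction

    results = sacrebleu.compute(predictions=predictions, references=references)
    print(results["score"])  # corpus-level SacreBLEU score
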