diff --git a/.github/workflows/quality.yml b/.github/workflows/quality.yml
index 6b7afd288..06b296e3e 100644
--- a/.github/workflows/quality.yml
+++ b/.github/workflows/quality.yml
@@ -11,11 +11,11 @@ jobs:
     runs-on: ubuntu-latest
     steps:
     - uses: actions/checkout@v2
-    - name: Set up Python 3.6
+    - name: Set up Python 3.8
       uses: actions/setup-python@v2
       with:
-        python-version: 3.6
+        python-version: 3.8
     - name: Install Python dependencies
       run: pip install black
     - name: Run Quality check
-      run: make quality
\ No newline at end of file
+      run: make quality
diff --git a/subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt b/subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt
index 5aabd947f..b94c74449 100644
--- a/subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt
+++ b/subtitles/zh-CN/56_data-processing-for-masked-language-modeling.srt
@@ -5,12 +5,12 @@

 2
 00:00:05,250 --> 00:00:07,230
-- 让我们看看如何预处理我们的数据
+- 让我们看一下如何针对掩码语言建模
 - Let's see how we can preprocess our data

 3
 00:00:07,230 --> 00:00:08,703
-用于掩码语言建模。
+预处理我们的数据。
 for masked language modeling.

 4
@@ -20,7 +20,7 @@ As a reminder, masked language modeling

 5
 00:00:12,570 --> 00:00:15,333
-是当模型需要填补句子中的空白时。
+主要在模型需要填补句子中的空白时使用。
 is when a model needs to fill the blanks in a sentence.

 6
@@ -30,27 +30,27 @@ To do this, you just need texts, no labels,

 7
 00:00:19,650 --> 00:00:22,200
-因为这是一个自我监督的问题。
+因为这是一个自监督的问题。
 as this is a self-supervised problem.

 8
 00:00:22,200 --> 00:00:23,670
-要将其应用于你自己的数据,
+要将其应用于您自己的数据,
 To apply this on your own data,

 9
 00:00:23,670 --> 00:00:25,740
-只要确保你收集了所有的文本
+只要确保您在数据集的一列中
 just make sure you have all your texts gathered

 10
 00:00:25,740 --> 00:00:27,603
-在数据集的一列中。
+收集了所有的文本。
 in one column of your dataset.

 11
 00:00:28,440 --> 00:00:30,480
-在我们开始随机掩盖事物之前,
+在开始随机掩码处理之前,
 Before we start randomly masking things,

 12
@@ -60,7 +60,7 @@ we will need to somehow make all those texts the same length

 13
 00:00:33,090 --> 00:00:34,263
-将它们一起批处理。
+从而将它们一起批处理。
 to batch them together.

 14
@@ -70,7 +70,7 @@ The first way to make all the texts the same length

 15
 00:00:38,490 --> 00:00:40,590
-是我们在文本分类中使用的那个。
+和我们在文本分类中所使用的相同。
 is the one we used in text classification.

 16
@@ -95,27 +95,27 @@ this is all done by our tokenizer

 20
 00:00:49,923 --> 00:00:53,130
-具有正确的填充和截断选项。
+并且配置相应的填充和截断选项。
 with the right options for padding and truncation.

 21
 00:00:53,130 --> 00:00:56,100
-但是,这会使我们丢失很多文本
+如果与我们选择的上下文长度相比,
 This will however make us lose a lot of texts

 22
 00:00:56,100 --> 00:00:58,620
-如果我们数据集中的示例很长,
+我们数据集的示例很长,
 if the examples in our dataset are very long,

 23
 00:00:58,620 --> 00:01:00,960
-与我们选择的上下文长度相比。
+就会使我们丢失很多文本。
 compared to the context length we picked.

 24
 00:01:00,960 --> 00:01:03,393
-在这里,所有灰色部分都丢失了。
+在这里,所有标记为灰色的部分都丢失了。
 Here, all the portion in gray is lost.

 25
@@ -125,17 +125,17 @@ This is why a second way to generate samples of text

 26
 00:01:06,660 --> 00:01:08,820
-具有相同的长度是分块我们的文本
+具有相同的长度是为了在上下文长度中
 with the same length is to chunk our text

 27
 00:01:08,820 --> 00:01:10,560
-在上下文长度中,
+为我们的文本分块
 in pieces of context lengths,

 28
 00:01:10,560 --> 00:01:14,010
-而不是在第一个块之后丢弃所有内容。
+而不是在第一个数据块之后丢弃所有内容。
 instead of discarding everything after the first chunk.

 29
@@ -150,7 +150,7 @@ of length smaller than the context size,

 31
 00:01:17,700 --> 00:01:20,493
-我们可以选择保留和填充或忽略。
+我们可以选择保留并填充或者忽略。
 which we can choose to keep and pad or ignore.

 32
@@ -160,32 +160,32 @@ Here is how we can apply this in practice,

 33
 00:01:23,790 --> 00:01:26,460
-只需添加 return overflowing tokens 选项
+只需在我们调用分词器时添加 return overflowing tokens
 by just adding the return overflowing tokens option

 34
 00:01:26,460 --> 00:01:28,200
-在我们的分词器调用中。
+选项
 in our tokenizer call.

 35
 00:01:28,200 --> 00:01:30,243
-请注意这如何为我们提供更大的数据集!
+请注意这样会为我们提供更大的数据集!
 Note how this gives us a bigger dataset!

 36
 00:01:31,560 --> 00:01:34,260
-这第二种分块方式是理想的,如果你所有的文本
+如果你所有的文本很长,
 This second way of chunking is ideal if all your texts

 37
 00:01:34,260 --> 00:01:36,270
-很长,但行不通
+这里第二种分块方式是理想的,
 are very long, but it won't work

 38
 00:01:36,270 --> 00:01:39,900
-如果你的课文有不同的长度,那也不错。
+但如果你的文本有不同的长度,那么效果就不尽人意。
 as nicely if you have a variety of lengths in the texts.

 39
@@ -195,22 +195,22 @@ In this case,

 40
 00:01:41,040 --> 00:01:44,280
-最好的选择是连接所有标记化的文本
+最好的选择是将所有标记化的文本组合成为一个大的数据流
 the best option is to concatenate all your tokenized texts

 41
 00:01:44,280 --> 00:01:46,560
-在一个大流中,有一个特殊的标记
+附加一个特殊的标记
 in one big stream, with a special tokens

 42
 00:01:46,560 --> 00:01:49,800
-指示你何时从一份文件转到另一份文件,
+表明你何时从一份文件转到另一份文件,
 to indicate when you pass from one document to the other,

 43
 00:01:49,800 --> 00:01:52,503
-然后才将大流分成块。
+然后才将该数据流分成数据块。
 and only then split the big stream into chunks.

 44
@@ -230,32 +230,32 @@ and another one to chunk it.

 47
 00:02:00,780 --> 00:02:02,850
-注意它是如何减少样本数量的
+注意在我们这里的数据集中,
 Notice how it reduces the number of samples

 48
 00:02:02,850 --> 00:02:04,230
-在我们这里的数据集中,
+它是如何减少样本数量的
 in our dataset here,

 49
 00:02:04,230 --> 00:02:06,580
-一定有不少短条目!
+一定有大量短条目!
 there must have been quite a few short entries!

 50
 00:02:07,710 --> 00:02:11,130
-完成此操作后,掩码就很容易了。
+完成此操作后,掩码处理就很容易了。
 Once this is done, the masking is the easy part.

 51
 00:02:11,130 --> 00:02:13,400
-有专门为此设计的数据整理器
+在 Transformers 库中有专门为此设计的
 There is a data collator designed specifically for this

 52
 00:02:13,400 --> 00:02:15,540
-在变形金刚图书馆。
+数据整理器。
 in the Transformers library.

 53
@@ -265,7 +265,7 @@ You can use it directly in the Trainer,

 54
 00:02:17,700 --> 00:02:20,400
-或者将你的数据集转换为张量流数据集时
+或者将你的数据集转换为 tensorflow 数据集时
 or when converting your datasets to tensorflow datasets

 55
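The subtitles above describe chunking long texts by passing return overflowing tokens in the tokenizer call, so that every chunk of the context length is kept rather than only the first one. A minimal sketch of that idea, assuming a small in-memory dataset, the `bert-base-cased` checkpoint, and an arbitrary `context_length` of 16; all of these are placeholder choices, only the `return_overflowing_tokens` option itself comes from the subtitles.

```python
from datasets import Dataset
from transformers import AutoTokenizer

# Placeholder checkpoint and context length, chosen only for this sketch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
context_length = 16

# A tiny stand-in dataset with all texts gathered in one "text" column.
raw_dataset = Dataset.from_dict(
    {"text": ["This is a rather long document. " * 10, "A short one."]}
)

def tokenize_and_chunk(examples):
    # return_overflowing_tokens=True keeps every chunk of up to
    # `context_length` tokens instead of dropping everything after the first.
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=context_length,
        return_overflowing_tokens=True,
    )

# Removing the original columns lets the number of rows grow:
# one long text can now yield several samples, hence a bigger dataset.
chunked = raw_dataset.map(
    tokenize_and_chunk, batched=True, remove_columns=raw_dataset.column_names
)
print(len(raw_dataset), "->", len(chunked))
```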
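The later entries describe the other strategy, concatenating all tokenized texts into one stream before cutting it into chunks, and note that the random masking itself is handled by a dedicated data collator from the Transformers library. A sketch of that pipeline under the same placeholder assumptions (checkpoint, tiny dataset, `context_length`, and the common 15% masking probability), not a verbatim reproduction of the code shown in the video:

```python
from datasets import Dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder checkpoint
context_length = 16  # placeholder context length

raw_dataset = Dataset.from_dict(
    {"text": ["First document, fairly short.", "Second document, a little bit longer than the first one."]}
)

# Tokenize every text; the tokenizer's special tokens will mark where one
# document ends and the next begins once everything is concatenated.
tokenized = raw_dataset.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
    remove_columns=raw_dataset.column_names,
)

def group_texts(examples):
    # Concatenate each column into one long stream, then cut it into chunks of
    # exactly `context_length` tokens (the short leftover at the end is dropped
    # here; it could also be kept and padded).
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = (len(concatenated["input_ids"]) // context_length) * context_length
    return {
        k: [seq[i : i + context_length] for i in range(0, total_length, context_length)]
        for k, seq in concatenated.items()
    }

lm_dataset = tokenized.map(group_texts, batched=True)

# The masking is done on the fly by this collator; it can be passed to a
# Trainer as `data_collator=...` or used as the `collate_fn` when converting
# the dataset to a tf.data.Dataset with `to_tf_dataset`.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
batch = data_collator([lm_dataset[i] for i in range(len(lm_dataset))])
print({name: tensor.shape for name, tensor in batch.items()})
```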