42_training-a-new-tokenizer.srt
1
00:00:00,000 --> 00:00:02,667
(air whooshing)
2
00:00:05,310 --> 00:00:08,700
- In this video we will see together
3
00:00:08,700 --> 00:00:11,820
what is the purpose of training a tokenizer,
4
00:00:11,820 --> 00:00:14,400
what are the key steps to follow,
5
00:00:14,400 --> 00:00:16,953
and what is the easiest way to do it.
6
00:00:18,690 --> 00:00:20,677
You will ask yourself the question,
7
00:00:20,677 --> 00:00:23,040
"Should I train a new tokenizer?",
8
00:00:23,040 --> 00:00:25,773
when you plan to train a new model from scratch.
9
00:00:29,520 --> 00:00:34,020
A trained tokenizer would not be suitable for your corpus
10
00:00:34,020 --> 00:00:37,080
if your corpus is in a different language,
11
00:00:37,080 --> 00:00:42,060
uses new characters, such as accents or uppercase letters,
12
00:00:42,060 --> 00:00:47,060
has a specific vocabulary, for example medical or legal,
13
00:00:47,100 --> 00:00:49,050
or uses a different style,
14
00:00:49,050 --> 00:00:51,873
a language from another century for example.
15
00:00:56,490 --> 00:00:58,320
If I take the tokenizer trained on
16
00:00:58,320 --> 00:01:00,780
the bert-base-uncased model,
17
00:01:00,780 --> 00:01:03,213
and ignore its normalization step,
18
00:01:04,260 --> 00:01:07,650
then we can see that the tokenization operation
19
00:01:07,650 --> 00:01:09,277
on the English sentence,
20
00:01:09,277 --> 00:01:12,480
"Here is a sentence adapted to our tokenizer",
21
00:01:12,480 --> 00:01:15,600
produces a rather satisfactory list of tokens,
22
00:01:15,600 --> 00:01:18,510
in the sense that this sentence of eight words
23
00:01:18,510 --> 00:01:20,643
is tokenized into nine tokens.
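
A minimal sketch of that check with the Transformers library (the sentence is the one quoted above; the exact token list you get back may differ slightly across versions):

from transformers import AutoTokenizer

# Tokenizer trained for the bert-base-uncased model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The eight-word English sentence comes back as a short list of tokens,
# nine in the video's example.
print(tokenizer.tokenize("Here is a sentence adapted to our tokenizer"))
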
24
00:01:22,920 --> 00:01:26,340
On the other hand, if I use this same tokenizer
25
00:01:26,340 --> 00:01:29,370
on a sentence in Bengali, we see that
26
00:01:29,370 --> 00:01:33,690
either a word is divided into many subtokens,
27
00:01:33,690 --> 00:01:36,270
or that the tokenizer does not know one of
28
00:01:36,270 --> 00:01:39,873
the unicode characters and returns only an unknown token.
29
00:01:41,220 --> 00:01:44,970
The fact that a common word is split into many tokens
30
00:01:44,970 --> 00:01:47,910
can be problematic, because language models
31
00:01:47,910 --> 00:01:51,903
can only handle a sequence of tokens of limited length.
32
00:01:52,830 --> 00:01:55,830
A tokenizer that excessively splits your initial text
33
00:01:55,830 --> 00:01:58,503
may even impact the performance of your model.
34
00:01:59,760 --> 00:02:02,280
Unknown tokens are also problematic,
35
00:02:02,280 --> 00:02:04,530
because the model will not be able to extract
36
00:02:04,530 --> 00:02:07,563
any information from the unknown part of the text.
37
00:02:11,430 --> 00:02:13,440
In this other example, we can see that
38
00:02:13,440 --> 00:02:17,100
the tokenizer replaces words containing characters
39
00:02:17,100 --> 00:02:20,973
with accents and capital letters with unknown tokens.
40
00:02:22,050 --> 00:02:24,770
Finally, if we use this same tokenizer again
41
00:02:24,770 --> 00:02:28,170
to tokenize medical vocabulary, we see again that
42
00:02:28,170 --> 00:02:31,800
a single word is divided into many subtokens,
43
00:02:31,800 --> 00:02:34,803
four for paracetamol, and four for pharyngitis.
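
A quick way to check this yourself, along the same lines as the earlier sketch; the exact subword pieces depend on the vocabulary, but each word is expected to break into several subtokens:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Domain-specific words rarely exist as single tokens in a general-purpose
# vocabulary, so each one is split into several subword pieces.
print(tokenizer.tokenize("paracetamol"))
print(tokenizer.tokenize("pharyngitis"))
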
44
00:02:37,110 --> 00:02:39,360
Most of the tokenizers used by the current
45
00:02:39,360 --> 00:02:42,540
state-of-the-art language models need to be trained
46
00:02:42,540 --> 00:02:45,360
on a corpus that is similar to the one used
47
00:02:45,360 --> 00:02:47,463
to pre-train the language model.
48
00:02:49,140 --> 00:02:51,150
This training consists of learning rules
49
00:02:51,150 --> 00:02:53,250
to divide the text into tokens.
50
00:02:53,250 --> 00:02:56,160
And the way to learn these rules and use them
51
00:02:56,160 --> 00:02:58,233
depends on the chosen tokenizer model.
52
00:03:00,630 --> 00:03:04,590
Thus, to train a new tokenizer, it is first necessary
53
00:03:04,590 --> 00:03:07,653
to build a training corpus composed of raw texts.
54
00:03:08,910 --> 00:03:12,423
Then, you have to choose an architecture for your tokenizer.
55
00:03:13,410 --> 00:03:14,763
Here there are two options.
56
00:03:15,900 --> 00:03:19,710
The simplest is to reuse the same architecture as the one
57
00:03:19,710 --> 00:03:22,863
of a tokenizer used by another model already trained.
58
00:03:24,210 --> 00:03:25,980
Otherwise it is also possible
59
00:03:25,980 --> 00:03:28,560
to completely design your tokenizer.
60
00:03:28,560 --> 00:03:31,683
But it requires more experience and attention.
61
00:03:33,750 --> 00:03:36,660
Once the architecture is chosen, you can thus
62
00:03:36,660 --> 00:03:39,513
train this tokenizer on the corpus you have built.
63
00:03:40,650 --> 00:03:43,440
Finally, the last thing that you need to do is to save
64
00:03:43,440 --> 00:03:46,443
the learned rules to be able to use this tokenizer.
65
00:03:49,530 --> 00:03:51,330
Let's take an example.
66
00:03:51,330 --> 00:03:54,873
Let's say you want to train a GPT-2 model on Python code.
67
00:03:56,160 --> 00:03:59,640
Even if Python code is usually in English,
68
00:03:59,640 --> 00:04:02,386
this type of text is very specific,
69
00:04:02,386 --> 00:04:04,473
and deserves a tokenizer trained on it.
70
00:04:05,340 --> 00:04:07,980
To convince you of this, we will see at the end
71
00:04:07,980 --> 00:04:10,023
the difference produced on an example.
72
00:04:11,400 --> 00:04:13,747
For that we are going to use the method
73
00:04:13,747 --> 00:04:18,240
"train_new_from_iterator" that all the fast tokenizers
74
00:04:18,240 --> 00:04:20,040
of the library have and thus,
75
00:04:20,040 --> 00:04:22,503
in particular GPT2TokenizerFast.
76
00:04:23,880 --> 00:04:26,100
This is the simplest method in our case
77
00:04:26,100 --> 00:04:28,983
to have a tokenizer adapted to Python code.
78
00:04:30,180 --> 00:04:34,140
Remember, the first thing is to gather a training corpus.
79
00:04:34,140 --> 00:04:37,320
We will use a subpart of the CodeSearchNet dataset
80
00:04:37,320 --> 00:04:39,360
containing only Python functions
81
00:04:39,360 --> 00:04:42,360
from open source libraries on GitHub.
82
00:04:42,360 --> 00:04:43,650
It's good timing.
83
00:04:43,650 --> 00:04:46,980
This dataset is known to the datasets library
84
00:04:46,980 --> 00:04:49,203
and we can load it in two lines of code.
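
Those two lines could look like this, assuming the dataset is fetched from the Hub under the code_search_net identifier:

from datasets import load_dataset

# Load only the Python portion of the CodeSearchNet dataset.
raw_datasets = load_dataset("code_search_net", "python")
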
85
00:04:50,760 --> 00:04:55,230
Then, as the "train_new_from_iterator" method expects
86
00:04:55,230 --> 00:04:57,150
an iterator of lists of texts,
87
00:04:57,150 --> 00:04:59,970
we create the "get_training_corpus" function,
88
00:04:59,970 --> 00:05:01,743
which will return an iterator.
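
One possible implementation, assuming the raw function texts are stored in a column called whole_func_string (as in CodeSearchNet):

def get_training_corpus():
    # Yield the raw texts in slices of 1,000 examples so the whole
    # corpus never has to sit in memory as one giant list.
    dataset = raw_datasets["train"]
    for start in range(0, len(dataset), 1000):
        yield dataset[start : start + 1000]["whole_func_string"]
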
89
00:05:03,870 --> 00:05:05,430
Now that we have our iterator
90
00:05:05,430 --> 00:05:09,630
on our Python functions corpus, we can load
91
00:05:09,630 --> 00:05:12,351
the GPT-2 tokenizer architecture.
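
Loading that architecture is a single call; a sketch using AutoTokenizer, with the variable name old_tokenizer matching the one referred to just after:

from transformers import AutoTokenizer

# Reuse the GPT-2 tokenizer architecture (a fast tokenizer) as the starting point.
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")
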
92
00:05:12,351 --> 00:05:16,560
Here old_tokenizer is not adapted to our corpus.
93
00:05:16,560 --> 00:05:17,700
But we only need
94
00:05:17,700 --> 00:05:20,733
one more line to train it on our new corpus.
95
00:05:21,780 --> 00:05:24,720
An argument that is common to most of the tokenization
96
00:05:24,720 --> 00:05:28,980
algorithms used at the moment is the size of the vocabulary.
97
00:05:28,980 --> 00:05:31,773
Here we choose the value 52,000.
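
Put together, that one extra training line could look like this, with the vocabulary size passed as the second argument:

# Train a brand-new tokenizer, with the same architecture as GPT-2's,
# on our Python corpus, targeting a vocabulary of 52,000 tokens.
tokenizer = old_tokenizer.train_new_from_iterator(get_training_corpus(), 52000)
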
98
00:05:32,820 --> 00:05:35,760
Finally, once the training is finished,
99
00:05:35,760 --> 00:05:38,850
we just have to save our new tokenizer locally,
100
00:05:38,850 --> 00:05:41,730
or send it to the Hub to be able to reuse it
101
00:05:41,730 --> 00:05:43,593
very easily afterwards.
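
For example (the folder and repository name code-search-net-tokenizer is just an illustration):

# Save the trained tokenizer to a local folder...
tokenizer.save_pretrained("code-search-net-tokenizer")

# ...or push it to the Hugging Face Hub (this requires being logged in).
tokenizer.push_to_hub("code-search-net-tokenizer")
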
102
00:05:45,270 --> 00:05:48,990
Finally, let's see together on an example whether it was useful
103
00:05:48,990 --> 00:05:53,073
to retrain a tokenizer similar to the GPT-2 one.
104
00:05:55,110 --> 00:05:57,660
With the original tokenizer of GPT-2
105
00:05:57,660 --> 00:06:00,330
we see that all spaces are isolated,
106
00:06:00,330 --> 00:06:01,920
and the method name randn,
107
00:06:01,920 --> 00:06:04,833
relatively common in Python code, is split in two.
108
00:06:05,730 --> 00:06:09,060
With our new tokenizer, single and double indentations
109
00:06:09,060 --> 00:06:10,890
have been learned and the method randn
110
00:06:10,890 --> 00:06:13,770
is tokenized into one token.
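
A rough way to reproduce that comparison on a hypothetical snippet; the exact splits depend on the trained vocabulary, but the retrained tokenizer should produce a noticeably shorter sequence:

example = '''def mean_of_random_values(n):
    """Return the mean of n values drawn from a normal distribution."""
    values = np.random.randn(n)
    return values.mean()'''

# The original GPT-2 tokenizer isolates every space and splits "randn" in two.
print(len(old_tokenizer.tokenize(example)))

# The retrained tokenizer has learned indentation tokens and names such as
# "randn", so the same snippet is encoded with fewer tokens.
print(len(tokenizer.tokenize(example)))
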
111
00:06:13,770 --> 00:06:15,000
And with that,
112
00:06:15,000 --> 00:06:18,123
you now know how to train your very own tokenizers.
113
00:06:19,498 --> 00:06:22,165
(air whooshing)