13_word-based-tokenizers.srt
1
00:00:00,165 --> 00:00:01,416
(屏幕呼啸)
(screen whooshing)
2
00:00:01,416 --> 00:00:02,716
(贴纸弹出)
(sticker popping)
3
00:00:02,716 --> 00:00:03,549
(屏幕呼啸)
(screen whooshing)
4
00:00:03,549 --> 00:00:05,603
- 让我们来看看基于单词的分词。
[Translator's note: token, tokenization, and tokenizer are all rendered as 分词 here; in practice it is best to leave them untranslated.]
- Let's take a look at word-based tokenization.
5
00:00:07,650 --> 00:00:09,780
基于单词的分词化的想法是
Word-based tokenization is the idea
6
00:00:09,780 --> 00:00:11,940
将原始文本拆分成单词
of splitting the raw text into words
7
00:00:11,940 --> 00:00:14,673
通过按空格或其他特定规则拆分,
by splitting on spaces or other specific rules,
8
00:00:16,020 --> 00:00:17,163
比如标点符号。
like punctuation.
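As a rough illustration of these splitting rules, here is a minimal Python sketch (not the course's actual tokenizer) that splits on whitespace, and optionally also splits off punctuation:

```python
import re

text = "Let's do tokenization!"

# Splitting on spaces only keeps punctuation attached to the word.
print(text.split())
# ["Let's", 'do', 'tokenization!']

# A rule that also separates punctuation (illustrative regex, not from the course).
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
# ["Let's", 'do', 'tokenization', '!']
```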
9
00:00:18,900 --> 00:00:21,810
在这个算法中,每个单词都有一个特定的数字
In this algorithm, each word has a specific number
10
00:00:21,810 --> 00:00:23,463
或者说它对应的 ID。
or ID attributed to it.
11
00:00:24,360 --> 00:00:27,270
在这里,"let's" 的 ID 是 250,
Here, "let's" has the ID 250,
12
00:00:27,270 --> 00:00:30,150
"do" 是 861,并且分词化
"do" has 861, and "tokenization"
13
00:00:30,150 --> 00:00:33,393
后面跟感叹号的是 345。
followed by an exclamation mark has 345.
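A word-based tokenizer is then essentially a lookup table from each word to its ID. A minimal sketch using the IDs quoted above (the numbers are only illustrative; real IDs depend on how the tokenizer was built):

```python
# Toy vocabulary with the IDs from the example above.
vocab = {"Let's": 250, "do": 861, "tokenization!": 345}

def encode(text):
    # Split on spaces and look each word up in the vocabulary.
    return [vocab[word] for word in text.split()]

print(encode("Let's do tokenization!"))
# [250, 861, 345]
```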
14
00:00:34,380 --> 00:00:36,000
这个方法很有趣
This approach is interesting
15
00:00:36,000 --> 00:00:38,100
因为模型有表示
as the model has representations
16
00:00:38,100 --> 00:00:40,233
是基于整个单词的。
that are based on entire words.
17
00:00:42,720 --> 00:00:45,960
单个数字所承载的信息量很高,
The information held in a single number is high,
18
00:00:45,960 --> 00:00:48,240
因为一个词包含很多上下文
as a word contains a lot of contextual
19
00:00:48,240 --> 00:00:49,803
和语义信息。
and semantic information.
20
00:00:53,070 --> 00:00:55,473
然而,这种方法确实有其局限性。
However, this approach does have its limits.
21
00:00:56,610 --> 00:01:00,570
比如 dog 这个词和 dogs 这个词很相似
For example, the word dog and the word dogs are very similar
22
00:01:00,570 --> 00:01:01,923
他们的意思很接近。
and their meaning is close.
23
00:01:03,210 --> 00:01:05,550
然而,基于单词的分词化,
The word-based tokenization, however,
24
00:01:05,550 --> 00:01:08,520
会给这两个词赋予完全不同的 ID
will attribute entirely different IDs to these two words
25
00:01:08,520 --> 00:01:10,110
因此模型将学习
and the model will therefore learn
26
00:01:10,110 --> 00:01:12,930
这两个词的两个不同的嵌入。
two different embeddings for these two words.
27
00:01:12,930 --> 00:01:15,090
这很不幸,因为我们想要这个模型
This is unfortunate as we would like the model
28
00:01:15,090 --> 00:01:18,240
理解这些词是确实相关的,
to understand that these words are indeed related,
29
00:01:18,240 --> 00:01:21,483
而 dogs 只是 dog 这个词的复数形式。
and that dogs is simply the plural form of the word dog.
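For instance, in a purely hypothetical word-level vocabulary, "dog" and "dogs" receive unrelated IDs, and the embedding layer holds a separate, independently learned vector for each. A sketch assuming PyTorch, with made-up IDs and sizes:

```python
import torch

vocab = {"dog": 1, "dogs": 2}  # hypothetical IDs
embedding = torch.nn.Embedding(num_embeddings=3, embedding_dim=4)

# Each ID selects its own row of the embedding matrix, so nothing in the
# model ties the representation of "dogs" to the representation of "dog".
print(embedding(torch.tensor([vocab["dog"]])))
print(embedding(torch.tensor([vocab["dogs"]])))
```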
30
00:01:22,980 --> 00:01:24,480
这种方法的另一个问题是,
Another issue with this approach,
31
00:01:24,480 --> 00:01:28,050
语言中有很多不同的词。
is that there are a lot of different words in the language.
32
00:01:28,050 --> 00:01:29,490
如果我们想让我们的模型理解
If we want our model to understand
33
00:01:29,490 --> 00:01:32,160
该语言中所有可能的句子,
all possible sentences in that language,
34
00:01:32,160 --> 00:01:35,850
那么我们需要为每个不同的词设置一个 ID。
then we will need to have an ID for each different word.
35
00:01:35,850 --> 00:01:37,380
而词的总数,
And the total number of words,
36
00:01:37,380 --> 00:01:40,080
也称为词汇量大小,
which is also known as the vocabulary size,
37
00:01:40,080 --> 00:01:41,913
可以很快变得非常大。
can quickly become very large.
38
00:01:44,400 --> 00:01:47,640
这是一个问题,因为每个 ID 都映射到一个大向量
This is an issue because each ID is mapped to a large vector
39
00:01:47,640 --> 00:01:50,190
代表这个词的意思,
that represents the word's meaning,
40
00:01:50,190 --> 00:01:52,170
而维护这些映射
and keeping track of these mappings
41
00:01:52,170 --> 00:01:54,990
需要大量的权重
requires an enormous number of weights
42
00:01:54,990 --> 00:01:57,123
当词汇量很大时。
when the vocabulary size is very large.
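To see the scale of the problem, the embedding matrix alone holds vocabulary-size × embedding-dimension weights. A back-of-the-envelope sketch with made-up numbers:

```python
# Illustrative numbers, not taken from the course.
vocab_size = 500_000   # every distinct word in a large corpus
embedding_dim = 768    # size of the vector that represents each word

num_weights = vocab_size * embedding_dim
print(f"{num_weights:,} weights in the embedding matrix alone")
# 384,000,000 weights in the embedding matrix alone
```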
43
00:01:59,160 --> 00:02:00,960
如果我们希望我们的模型保持精简,
If we want our models to stay lean,
44
00:02:00,960 --> 00:02:04,440
我们可以选择让分词器忽略某些词
we can opt for our tokenizer to ignore certain words
45
00:02:04,440 --> 00:02:06,093
我们不一定需要的。
that we don't necessarily need.
46
00:02:08,400 --> 00:02:11,970
例如,在这里,当在文本上训练我们的分词器时,
For example, here, when training our tokenizer on a text,
47
00:02:11,970 --> 00:02:15,020
我们可能只想使用 10,000 个最常用的单词
we might want to take only the 10,000 most frequent words
48
00:02:15,020 --> 00:02:16,320
在该文本中。
in that text.
49
00:02:16,320 --> 00:02:18,600
而不是从该文本中提取所有单词
Rather than taking all the words in that text
50
00:02:18,600 --> 00:02:22,503
或所有语言的单词来创建我们的基本词汇。
or the words of all languages to create our base vocabulary.
51
00:02:23,790 --> 00:02:26,520
分词器将知道如何转换这 10,000 个单词
The tokenizer will know how to convert those 10,000 words
52
00:02:26,520 --> 00:02:29,370
转换成数字,但任何其他词都会被转换
into numbers, but any other word will be converted
53
00:02:29,370 --> 00:02:31,530
为词汇表外的词,
to the out-of-vocabulary word,
54
00:02:31,530 --> 00:02:33,783
或者像这里显示的那样,未知的词。
or like shown here, the unknown word.
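One way to picture this is a vocabulary built from word frequencies, capped at 10,000 entries, with everything else mapped to an unknown token. A minimal sketch; the "[UNK]" name and the helper functions are assumptions, not the course's actual implementation:

```python
from collections import Counter

def build_vocab(corpus, max_size=10_000):
    # Count word frequencies over the training text.
    counts = Counter(word for line in corpus for word in line.split())
    vocab = {"[UNK]": 0}  # reserved ID for out-of-vocabulary words
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Any word outside the kept 10,000 falls back to the unknown ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]
```

Every out-of-vocabulary word encodes to the same ID here, which is exactly the compromise described next: the model sees one identical representation for all of them.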
55
00:02:35,280 --> 00:02:37,440
不幸的是,这是一种妥协。
Unfortunately, this is a compromise.
56
00:02:37,440 --> 00:02:39,900
该模型将具有完全相同的表示
The model will have the exact same representation
57
00:02:39,900 --> 00:02:42,390
对于所有它不知道的单词,
for all words that it doesn't know,
58
00:02:42,390 --> 00:02:45,210
这可能会导致大量信息丢失
which can result in a lot of lost information
59
00:02:45,210 --> 00:02:47,664
如果存在许多未知单词。
if many unknown words are present.
60
00:02:47,664 --> 00:02:50,581
(屏幕呼啸)
(screen whooshing)