13_word-based-tokenizers.srt
1
00:00:00,165 --> 00:00:01,416
(屏幕呼啸)
(screen whooshing)
2
00:00:01,416 --> 00:00:02,716
(贴纸弹出)
(sticker popping)
3
00:00:02,716 --> 00:00:03,549
(屏幕呼啸)
(screen whooshing)
4
00:00:03,549 --> 00:00:05,603
- 让我们来看看基于单词的分词。
[Translator's note: token, tokenization, and tokenizer are all rendered as 分词 here; in practice it is best to leave them untranslated.]
- Let's take a look at word-based tokenization.
5
00:00:07,650 --> 00:00:09,780
基于单词的分词化的想法是
Word-based tokenization is the idea
6
00:00:09,780 --> 00:00:11,940
将原始文本拆分成单词
of splitting the raw text into words
7
00:00:11,940 --> 00:00:14,673
通过按空格或其他特定规则拆分,
by splitting on spaces or other specific rules,
8
00:00:16,020 --> 00:00:17,163
比如标点符号。
like punctuation.
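As a rough illustration of these splitting rules, here is a minimal Python sketch (not the course's actual tokenizer) that splits on whitespace, and optionally also splits off punctuation:

```python
import re

text = "Let's do tokenization!"

# Splitting on spaces only keeps punctuation attached to the word.
print(text.split())
# ["Let's", 'do', 'tokenization!']

# A rule that also separates punctuation (illustrative regex, not from the course).
print(re.findall(r"\w+(?:'\w+)?|[^\w\s]", text))
# ["Let's", 'do', 'tokenization', '!']
```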
9
00:00:18,900 --> 00:00:21,810
在这个算法中,每个单词都有一个特定的数字
In this algorithm, each word has a specific number
10
00:00:21,810 --> 00:00:23,463
或者说它对应的 ID。
or ID attributed to it.
11
00:00:24,360 --> 00:00:27,270
在这里,"let's" 的 ID 是 250,
Here, "let's" has the ID 250,
12
00:00:27,270 --> 00:00:30,150
"do" 是 861,并且分词化
"do" has 861, and "tokenization"
13
00:00:30,150 --> 00:00:33,393
后面跟感叹号的是 345。
followed by an exclamation mark has 345.
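A word-based tokenizer is then essentially a lookup table from each word to its ID. A minimal sketch using the IDs quoted above (the numbers are only illustrative; real IDs depend on how the tokenizer was built):

```python
# Toy vocabulary with the IDs from the example above.
vocab = {"Let's": 250, "do": 861, "tokenization!": 345}

def encode(text):
    # Split on spaces and look each word up in the vocabulary.
    return [vocab[word] for word in text.split()]

print(encode("Let's do tokenization!"))
# [250, 861, 345]
```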
14
00:00:34,380 --> 00:00:36,000
这个方法很有趣
This approach is interesting
15
00:00:36,000 --> 00:00:38,100
因为模型有表示
as the model has representations
16
00:00:38,100 --> 00:00:40,233
是基于整个单词的。
that are based on entire words.
17
00:00:42,720 --> 00:00:45,960
单个数字所承载的信息量很高,
The information held in a single number is high,
18
00:00:45,960 --> 00:00:48,240
因为一个词包含很多上下文
as a word contains a lot of contextual
19
00:00:48,240 --> 00:00:49,803
和语义信息。
and semantic information.
20
00:00:53,070 --> 00:00:55,473
然而,这种方法确实有其局限性。
However, this approach does have its limits.
21
00:00:56,610 --> 00:01:00,570
比如 dog 这个词和 dogs 这个词很相似
For example, the word dog and the word dogs are very similar
22
00:01:00,570 --> 00:01:01,923
他们的意思很接近。
and their meaning is close.
23
00:01:03,210 --> 00:01:05,550
然而,基于单词的分词化,
The word-based tokenization, however,
24
00:01:05,550 --> 00:01:08,520
会给这两个词赋予完全不同的 ID
will attribute entirely different IDs to these two words
25
00:01:08,520 --> 00:01:10,110
因此模型将学习
and the model will therefore learn
26
00:01:10,110 --> 00:01:12,930
这两个词的两个不同的嵌入。
two different embeddings for these two words.
27
00:01:12,930 --> 00:01:15,090
这很不幸,因为我们想要这个模型
This is unfortunate as we would like the model
28
00:01:15,090 --> 00:01:18,240
理解这些词是确实相关的,
to understand that these words are indeed related,
29
00:01:18,240 --> 00:01:21,483
而 dogs 只是 dog 这个词的复数形式。
and that dogs is simply the plural form of the word dog.
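For instance, in a purely hypothetical word-level vocabulary, "dog" and "dogs" receive unrelated IDs, and the embedding layer holds a separate, independently learned vector for each. A sketch assuming PyTorch, with made-up IDs and sizes:

```python
import torch

vocab = {"dog": 1, "dogs": 2}  # hypothetical IDs
embedding = torch.nn.Embedding(num_embeddings=3, embedding_dim=4)

# Each ID selects its own row of the embedding matrix, so nothing in the
# model ties the representation of "dogs" to the representation of "dog".
print(embedding(torch.tensor([vocab["dog"]])))
print(embedding(torch.tensor([vocab["dogs"]])))
```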
30
00:01:22,980 --> 00:01:24,480
这种方法的另一个问题是,
Another issue with this approach,
31
00:01:24,480 --> 00:01:28,050
语言中有很多不同的词。
is that there are a lot of different words in the language.
32
00:01:28,050 --> 00:01:29,490
如果我们想让我们的模型理解
If we want our model to understand
33
00:01:29,490 --> 00:01:32,160
该语言中所有可能的句子,
all possible sentences in that language,
34
00:01:32,160 --> 00:01:35,850
那么我们需要为每个不同的词设置一个 ID。
then we will need to have an ID for each different word.
35
00:01:35,850 --> 00:01:37,380
而词的总数,
And the total number of words,
36
00:01:37,380 --> 00:01:40,080
也称为词汇量大小,
which is also known as the vocabulary size,
37
00:01:40,080 --> 00:01:41,913
可以很快变得非常大。
can quickly become very large.
38
00:01:44,400 --> 00:01:47,640
这是一个问题,因为每个 ID 都映射到一个大向量
This is an issue because each ID is mapped to a large vector
39
00:01:47,640 --> 00:01:50,190
代表这个词的意思,
that represents the word's meaning,
40
00:01:50,190 --> 00:01:52,170
而维护这些映射
and keeping track of these mappings
41
00:01:52,170 --> 00:01:54,990
需要大量的权重
requires an enormous number of weights
42
00:01:54,990 --> 00:01:57,123
当词汇量很大时。
when the vocabulary size is very large.
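To see the scale of the problem, the embedding matrix alone holds vocabulary-size × embedding-dimension weights. A back-of-the-envelope sketch with made-up numbers:

```python
# Illustrative numbers, not taken from the course.
vocab_size = 500_000   # every distinct word in a large corpus
embedding_dim = 768    # size of the vector that represents each word

num_weights = vocab_size * embedding_dim
print(f"{num_weights:,} weights in the embedding matrix alone")
# 384,000,000 weights in the embedding matrix alone
```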
43
00:01:59,160 --> 00:02:00,960
如果我们希望我们的模型保持精简,
If we want our models to stay lean,
44
00:02:00,960 --> 00:02:04,440
我们可以选择让分词器忽略某些词
we can opt for our tokenizer to ignore certain words
45
00:02:04,440 --> 00:02:06,093
我们不一定需要的。
that we don't necessarily need.
46
00:02:08,400 --> 00:02:11,970
例如,在这里,当在文本上训练我们的分词器时,
For example, here, when training our tokenizer on a text,
47
00:02:11,970 --> 00:02:15,020
我们可能只想使用 10,000 个最常用的单词
we might want to take only the 10,000 most frequent words
48
00:02:15,020 --> 00:02:16,320
在该文本中。
in that text.
49
00:02:16,320 --> 00:02:18,600
而不是从该文本中提取所有单词
Rather than taking all the words in that text
50
00:02:18,600 --> 00:02:22,503
或所有语言的单词来创建我们的基本词汇。
or the words of all languages to create our base vocabulary.
51
00:02:23,790 --> 00:02:26,520
分词器将知道如何转换这 10,000 个单词
The tokenizer will know how to convert those 10,000 words
52
00:02:26,520 --> 00:02:29,370
转换成数字,但任何其他词都会被转换
into numbers, but any other word will be converted
53
00:02:29,370 --> 00:02:31,530
为词汇表外的词,
to the out-of-vocabulary word,
54
00:02:31,530 --> 00:02:33,783
或者像这里显示的那样,未知的词。
or like shown here, the unknown word.
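One way to picture this is a vocabulary built from word frequencies, capped at 10,000 entries, with everything else mapped to an unknown token. A minimal sketch; the "[UNK]" name and the helper functions are assumptions, not the course's actual implementation:

```python
from collections import Counter

def build_vocab(corpus, max_size=10_000):
    # Count word frequencies over the training text.
    counts = Counter(word for line in corpus for word in line.split())
    vocab = {"[UNK]": 0}  # reserved ID for out-of-vocabulary words
    for word, _ in counts.most_common(max_size):
        vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    # Any word outside the kept 10,000 falls back to the unknown ID.
    return [vocab.get(word, vocab["[UNK]"]) for word in text.split()]
```

Every out-of-vocabulary word encodes to the same ID here, which is exactly the compromise described next: the model sees one identical representation for all of them.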
55
00:02:35,280 --> 00:02:37,440
不幸的是,这是一种妥协。
Unfortunately, this is a compromise.
56
00:02:37,440 --> 00:02:39,900
该模型将具有完全相同的表示
The model will have the exact same representation
57
00:02:39,900 --> 00:02:42,390
对于所有它不知道的单词,
for all words that it doesn't know,
58
00:02:42,390 --> 00:02:45,210
这可能会导致大量信息丢失
which can result in a lot of lost information
59
00:02:45,210 --> 00:02:47,664
如果存在许多未知单词。
if many unknown words are present.
60
00:02:47,664 --> 00:02:50,581
(屏幕呼啸)
(screen whooshing)