1
00:00:00,227 --> 00:00:01,359
(air whooshing)
2
00:00:01,359 --> 00:00:02,610
(smiley clicking)
3
00:00:02,610 --> 00:00:05,550
(air whooshing)
4
00:00:05,550 --> 00:00:08,450
- Let's see how to preprocess a dataset for summarization.
5
00:00:09,750 --> 00:00:13,083
This is the task of, well, summarizing a long document.
6
00:00:14,040 --> 00:00:16,830
This video will focus on how to preprocess your dataset
7
00:00:16,830 --> 00:00:19,680
once you have managed to put it in the following format:
8
00:00:19,680 --> 00:00:21,510
one column for the long documents,
9
00:00:21,510 --> 00:00:23,610
and one for the summaries.
10
00:00:23,610 --> 00:00:24,930
Here is how we can achieve this
11
00:00:24,930 --> 00:00:27,573
with the Datasets library on the XSUM dataset.
12
00:00:28,650 --> 00:00:30,810
As long as you manage to have your data look like this,
13
00:00:30,810 --> 00:00:33,690
you should be able to follow the same steps.
14
00:00:33,690 --> 00:00:35,880
For once, our labels are not integers
15
00:00:35,880 --> 00:00:39,150
corresponding to some classes, but plain text.
16
00:00:39,150 --> 00:00:42,480
We will thus need to tokenize them, like our inputs.
17
00:00:42,480 --> 00:00:43,920
There is a small trap there though,
18
00:00:43,920 --> 00:00:45,360
as we need to tokenize our targets
19
00:00:45,360 --> 00:00:48,690
inside the as_target_tokenizer context manager.
20
00:00:48,690 --> 00:00:51,030
This is because the special tokens we add
21
00:00:51,030 --> 00:00:54,000
might be slightly different for the inputs and the targets,
22
00:00:54,000 --> 00:00:57,300
so the tokenizer has to know which one it is processing.
23
00:00:57,300 --> 00:00:59,550
Processing the whole dataset is then super easy
24
00:00:59,550 --> 00:01:01,290
with the map function.
25
00:01:01,290 --> 00:01:03,450
Since the summaries are usually much shorter
26
00:01:03,450 --> 00:01:05,400
than the documents, you should definitely pick
27
00:01:05,400 --> 00:01:08,880
different maximum lengths for the inputs and targets.
28
00:01:08,880 --> 00:01:11,730
You can choose to pad at this stage to that maximum length
29
00:01:11,730 --> 00:01:14,070
by setting padding=max_length.
30
00:01:14,070 --> 00:01:16,170
Here we'll show you how to pad dynamically,
31
00:01:16,170 --> 00:01:17,620
as it requires one more step.
32
00:01:18,840 --> 00:01:20,910
Your inputs and targets are all sentences
33
00:01:20,910 --> 00:01:22,620
of various lengths.
34
00:01:22,620 --> 00:01:24,960
We'll pad the inputs and targets separately
35
00:01:24,960 --> 00:01:27,030
as the maximum lengths of the inputs and targets
36
00:01:27,030 --> 00:01:28,280
are completely different.
37
00:01:29,130 --> 00:01:31,170
Then, we pad the inputs to the maximum length
38
00:01:31,170 --> 00:01:33,813
among the inputs, and the same for the targets.
39
00:01:34,860 --> 00:01:36,630
We pad the inputs with the pad token,
40
00:01:36,630 --> 00:01:39,000
and the targets with the -100 index
41
00:01:39,000 --> 00:01:40,980
to make sure they are not taken into account
42
00:01:40,980 --> 00:01:42,180
in the loss computation.
43
00:01:43,440 --> 00:01:45,180
The Transformers library provides us
44
00:01:45,180 --> 00:01:48,510
with a data collator to do this all automatically.
45
00:01:48,510 --> 00:01:51,690
You can then pass it to the Trainer with your datasets,
46
00:01:51,690 --> 00:01:55,710
or use it in the to_tf_dataset method before using model.fit
47
00:01:55,710 --> 00:01:56,823
on your current model.
48
00:01:58,339 --> 00:02:02,876
(air whooshing)