65_data-processing-for-question-answering.srt
1
00:00:05,580 --> 00:00:07,177
Let's study how to preprocess a dataset
2
00:00:07,177 --> 00:00:08,643
for question answering.
3
00:00:10,200 --> 00:00:11,640
Question answering is the task
4
00:00:11,640 --> 00:00:14,343
of finding answers to a question in some context.
5
00:00:15,270 --> 00:00:17,550
For example, we'll use the SQuAD dataset,
6
00:00:17,550 --> 00:00:19,860
in which we remove columns we won't use
7
00:00:19,860 --> 00:00:21,660
and just extract the information we will need
8
00:00:21,660 --> 00:00:22,950
for the labels:
9
00:00:22,950 --> 00:00:26,370
the start and the end of the answer in the context.
10
00:00:26,370 --> 00:00:28,690
If you have your own dataset for question answering,
11
00:00:28,690 --> 00:00:31,680
just make sure you clean your data to get to the same point,
12
00:00:31,680 --> 00:00:33,900
with one column containing the questions,
13
00:00:33,900 --> 00:00:35,940
one column containing the contexts,
14
00:00:35,940 --> 00:00:38,610
and one column for the indices of the start and end characters
15
00:00:38,610 --> 00:00:40,473
of the answer in the context.
16
00:00:41,610 --> 00:00:44,520
Note that the answer must be part of the context.
17
00:00:44,520 --> 00:00:47,160
If you want to perform generative question answering,
18
00:00:47,160 --> 00:00:50,160
have a look at one of the sequence-to-sequence videos linked below.
19
00:00:51,600 --> 00:00:53,430
Now, if we have a look at the tokens
20
00:00:53,430 --> 00:00:54,750
we will feed our model,
21
00:00:54,750 --> 00:00:58,320
we'll see the answer lies somewhere inside the context.
22
00:00:58,320 --> 00:01:01,080
For very long contexts, the answer may get truncated
23
00:01:01,080 --> 00:01:02,580
by the tokenizer.
24
00:01:02,580 --> 00:01:05,970
In this case, we won't have any proper labels for our model,
25
00:01:05,970 --> 00:01:07,680
so we should keep the truncated part
26
00:01:07,680 --> 00:01:10,203
as a separate feature instead of discarding it.
27
00:01:11,100 --> 00:01:12,990
The only thing we need to be careful with
28
00:01:12,990 --> 00:01:15,660
is to allow some overlap between separate chunks,
29
00:01:15,660 --> 00:01:17,670
so that the answer is not truncated
30
00:01:17,670 --> 00:01:19,920
and the feature containing the answer
31
00:01:19,920 --> 00:01:22,623
gets sufficient context to be able to predict it.
32
00:01:23,490 --> 00:01:26,040
Here is how it can be done by the tokenizer.
33
00:01:26,040 --> 00:01:29,370
We pass it the question and the context, set truncation
34
00:01:29,370 --> 00:01:33,240
for the context only, and padding to the maximum length.
35
00:01:33,240 --> 00:01:35,340
The stride argument is where we set the number
36
00:01:35,340 --> 00:01:36,900
of overlapping tokens,
37
00:01:36,900 --> 00:01:39,600
and return_overflowing_tokens=True
38
00:01:39,600 --> 00:01:42,630
means we don't want to discard the truncated part.
39
00:01:42,630 --> 00:01:45,210
Lastly, we also return the offset mappings (return_offsets_mapping=True)
40
00:01:45,210 --> 00:01:47,220
to be able to find the tokens corresponding
41
00:01:47,220 --> 00:01:48,693
to the answer start and end.
42
00:01:49,860 --> 00:01:52,290
We want those tokens because they will be the labels
43
00:01:52,290 --> 00:01:53,970
we pass to our model.
44
00:01:53,970 --> 00:01:56,870
In a one-hot encoded version, here is what they look like.
45
00:01:57,930 --> 00:02:00,480
If the context we have does not contain the answer,
46
00:02:00,480 --> 00:02:03,799
we set the two labels to the index of the CLS token.
47
00:02:03,799 --> 00:02:05,700
We also do this if the context
48
00:02:05,700 --> 00:02:07,713
only partially contains the answer.
49
00:02:08,580 --> 00:02:11,400
In terms of code, here is how we can do it.
50
00:02:11,400 --> 00:02:13,710
Using the sequence IDs of an input,
51
00:02:13,710 --> 00:02:17,220
we can determine the beginning and the end of the context.
52
00:02:17,220 --> 00:02:19,800
Then, we know if we have to default to the CLS position
53
00:02:19,800 --> 00:02:22,290
for the two labels, or determine the positions
54
00:02:22,290 --> 00:02:25,050
of the first and last tokens of the answer.
55
00:02:25,050 --> 00:02:27,800
We can check it works properly on our previous example.
56
00:02:28,680 --> 00:02:31,380
Putting it all together looks like this big function,
57
00:02:31,380 --> 00:02:34,233
which we can apply to our datasets with the map method.
58
00:02:35,310 --> 00:02:37,920
Since we applied padding during the tokenization,
59
00:02:37,920 --> 00:02:40,680
we can then use this directly in the Trainer,
60
00:02:40,680 --> 00:02:44,133
or apply the to_tf_dataset method to use Keras fit.