48_inside-the-question-answering-pipeline-(tensorflow).srt
1
00:00:00,000 --> 00:00:03,417
(轻过渡音乐)
(light transition music)
2
00:00:05,490 --> 00:00:08,440
- 让我们来看看问答 pipeline(管线) 的内部情况。
- Let's have a look inside the question answering pipeline.
3
00:00:09,780 --> 00:00:11,370
问答管线
The question answering pipeline
4
00:00:11,370 --> 00:00:13,710
可以提取问题的答案
can extract answers to questions
5
00:00:13,710 --> 00:00:16,020
来自给定的上下文或文本段落
from a given context or passage of text
6
00:00:16,020 --> 00:00:18,370
就像 Transformers 仓库 README 的这一部分。
like this part of the Transformers repo README.
7
00:00:19,290 --> 00:00:21,180
它也适用于很长的上下文,
It also works for very long contexts,
8
00:00:21,180 --> 00:00:24,720
即使答案靠后,就像这个例子一样。
even if the answer is at the very end, like in this example.
9
00:00:24,720 --> 00:00:26,223
在本视频中,我们将了解原因。
In this video, we'll see why.
10
00:00:27,840 --> 00:00:29,310
问答管线
The question answering pipeline
11
00:00:29,310 --> 00:00:32,130
遵循与其他管线相同的步骤。
follows the same steps as the other pipelines.
12
00:00:32,130 --> 00:00:35,550
问题和上下文被标记为一个句子对,
The question and context are tokenized as a sentence pair,
13
00:00:35,550 --> 00:00:38,463
提供给模型,然后应用一些后处理。
fed to the model then some post-processing is applied.
14
00:00:39,540 --> 00:00:42,840
所以分词化和模型步骤应该很熟悉。
So tokenization and model steps should be familiar.
15
00:00:42,840 --> 00:00:45,000
我们使用适合问答的 auto 类
We use the auto class suitable for question answering
16
00:00:45,000 --> 00:00:47,460
而不是序列分类,
instead of sequence classification,
17
00:00:47,460 --> 00:00:50,190
但与文本分类的一个关键区别
but one key difference with text classification
18
00:00:50,190 --> 00:00:52,380
是我们的模型输出两个张量
is that our model outputs two tensors
19
00:00:52,380 --> 00:00:55,230
命名 start logits 和 end logits。
named start logits and end logits.
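[Editor's note] The tokenization and model steps just described can be sketched as follows. The checkpoint name and example texts are assumptions (any TensorFlow question-answering checkpoint works):

```python
from transformers import AutoTokenizer, TFAutoModelForQuestionAnswering

# Checkpoint is an assumption; any QA-finetuned TF checkpoint works
checkpoint = "distilbert-base-cased-distilled-squad"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForQuestionAnswering.from_pretrained(checkpoint)

question = "Which deep learning libraries back Transformers?"
context = "Transformers is backed by the three most popular deep learning libraries: Jax, PyTorch and TensorFlow."

# The question and context are tokenized together as a sentence pair
inputs = tokenizer(question, context, return_tensors="tf")
outputs = model(**inputs)

# One logit per token, for the start and for the end of the answer
start_logits = outputs.start_logits  # shape (1, sequence_length)
end_logits = outputs.end_logits      # shape (1, sequence_length)
```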
20
00:00:55,230 --> 00:00:56,160
这是为什么?
Why is that?
21
00:00:56,160 --> 00:00:58,170
嗯,这就是模型找到答案的方式
Well, this is the way the model finds the answer
22
00:00:58,170 --> 00:00:59,043
对这个问题。
to the question.
23
00:01:00,090 --> 00:01:02,610
首先,让我们看一下模型输入。
First, let's have a look at the model inputs.
24
00:01:02,610 --> 00:01:05,850
它是与问题分词化相关的数字,
It's the numbers associated with the tokenization of the question,
26
00:01:05,850 --> 00:01:07,753
后面跟着上下文
followed by the context
27
00:01:07,753 --> 00:01:10,233
使用通常的 CLS 和 SEP 特殊 token 。
with the usual CLS and SEP special tokens.
28
00:01:11,130 --> 00:01:13,203
答案是那些 token 的一部分。
The answer is a part of those tokens.
29
00:01:14,040 --> 00:01:15,330
所以我们要求模型预测
So we ask the model to predict
30
00:01:15,330 --> 00:01:17,040
哪个 token 是答案的开始
which token starts the answer
31
00:01:17,040 --> 00:01:19,320
以及哪个 token 是答案的结束。
and which ends the answer.
32
00:01:19,320 --> 00:01:20,910
对于我们的两个 logit 输出,
For our two logit outputs,
33
00:01:20,910 --> 00:01:23,823
理论标签是粉色和紫色的向量。
the theoretical labels are the pink and purple vectors.
34
00:01:24,870 --> 00:01:26,700
要将这些 logits 转换为概率,
To convert those logits into probabilities,
35
00:01:26,700 --> 00:01:28,596
我们需要应用 SoftMax,
we will need to apply a SoftMax,
36
00:01:28,596 --> 00:01:31,020
就像在文本分类管线中一样。
like in the text classification pipeline.
37
00:01:31,020 --> 00:01:32,310
只不过在此之前,我们会先掩蔽
We just mask the tokens
38
00:01:32,310 --> 00:01:35,940
不属于上下文的那些 token,
that are not part of the context before doing that,
39
00:01:35,940 --> 00:01:38,310
不掩蔽初始 CLS token
leaving the initial CLS token unmasked
40
00:01:38,310 --> 00:01:40,773
我们用它来预测一个不可能的答案。
as we use it to predict an impossible answer.
41
00:01:41,940 --> 00:01:44,730
这就是它在代码方面的样子。
This is what it looks like in terms of code.
42
00:01:44,730 --> 00:01:47,340
我们使用一个大的负数作为掩码
We use a large negative number for the masking
43
00:01:47,340 --> 00:01:49,533
因为它的指数将为零。
since its exponential will then be zero.
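[Editor's note] The masking step can be sketched with NumPy. The logits and mask below are toy values; in the real pipeline the mask comes from the tokenizer's sequence ids:

```python
import numpy as np

# Toy start logits for a 6-token input: [CLS] question [SEP] context
start_logits = np.array([1.0, 0.5, 0.2, 0.1, 3.0, 2.0])

# True for tokens that are not part of the context; the initial [CLS]
# token (index 0) stays unmasked, as it predicts an impossible answer
mask = np.array([False, True, True, True, False, False])

# Large negative number: its exponential is effectively zero
masked_logits = np.where(mask, -10000.0, start_logits)

# SoftMax turns the masked logits into probabilities
probs = np.exp(masked_logits) / np.exp(masked_logits).sum()
```

Masked positions end up with probability zero, so only [CLS] and the context tokens can be picked as the start of the answer.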
44
00:01:50,850 --> 00:01:53,160
现在每个开始和结束位置的概率
Now the probability for each start and end position
45
00:01:53,160 --> 00:01:55,740
对应一个可能的答案
corresponding to a possible answer
46
00:01:55,740 --> 00:01:57,540
会给出一个分数,这个分数是
will give a score that is a product
47
00:01:57,540 --> 00:01:58,680
开始概率和结束概率的乘积
of the start probabilities and end probabilities
48
00:01:59,680 --> 00:02:00,873
在这些位置上。
at those positions.
49
00:02:01,920 --> 00:02:04,530
当然,开始索引大于结束索引
Of course, a start index greater than an end index
50
00:02:04,530 --> 00:02:06,330
对应一个不可能的答案。
corresponds to an impossible answer.
51
00:02:07,744 --> 00:02:09,510
这是找到最佳分数的代码
Here is the code to find the best score
52
00:02:09,510 --> 00:02:11,280
对一个可能的答案。
for a possible answer.
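[Editor's note] The search over valid (start, end) pairs can be sketched with NumPy (toy probabilities):

```python
import numpy as np

# Toy start/end probabilities for a 4-token context
start_probs = np.array([0.1, 0.2, 0.6, 0.1])
end_probs = np.array([0.1, 0.1, 0.2, 0.6])

# The score of each (start, end) pair is the product of the two
# probabilities at those positions
scores = start_probs[:, None] * end_probs[None, :]

# A start index greater than the end index is an impossible answer:
# keep only the upper triangle of the score matrix
scores = np.triu(scores)

# Pick the best-scoring valid pair
start_idx, end_idx = np.unravel_index(scores.argmax(), scores.shape)
```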
53
00:02:11,280 --> 00:02:13,830
一旦我们有了 token 的开始和结束位置,
Once we have the start and end position for the tokens,
54
00:02:13,830 --> 00:02:16,650
我们使用分词器提供的偏移量映射
we use the offset mappings provided by our tokenizer
55
00:02:16,650 --> 00:02:19,710
找到初始上下文中的字符范围,
to find the span of characters in the initial context,
56
00:02:19,710 --> 00:02:20,810
我们得到了答案。
and we get our answer.
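[Editor's note] Mapping the predicted token positions back to a character span might look like this. The offsets and context are made up; in practice they come from tokenizing with `return_offsets_mapping=True`:

```python
context = "The answer is forty-two."
# Hypothetical offset mapping: (start_char, end_char) for each token,
# with (0, 0) for the special tokens
offsets = [(0, 0), (0, 3), (4, 10), (11, 13), (14, 23), (23, 24), (0, 0)]

start_idx, end_idx = 4, 4  # token positions predicted by the model
start_char = offsets[start_idx][0]
end_char = offsets[end_idx][1]
answer = context[start_char:end_char]
```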
57
00:02:22,080 --> 00:02:23,700
现在,当上下文很长时,
Now, when the context is long,
58
00:02:23,700 --> 00:02:25,977
它可能会被分词器截断。
it might get truncated by the tokenizer.
59
00:02:26,834 --> 00:02:29,790
这可能会导致部分答案,或者更糟的是,
This might result in part of the answer, or worse,
60
00:02:29,790 --> 00:02:32,190
整个答案,被截断了。
the whole answer, being truncated.
61
00:02:32,190 --> 00:02:34,020
所以我们不丢弃截断的 token
So we don't discard the truncated tokens
62
00:02:34,020 --> 00:02:36,420
而是用它们构建新的特征。
but build new features with them.
63
00:02:36,420 --> 00:02:39,330
这些特征中的每一个都包含问题,
Each of those features contains the question,
64
00:02:39,330 --> 00:02:42,150
然后是上下文中的一大块文本。
then a chunk of text in the context.
65
00:02:42,150 --> 00:02:44,520
如果我们采用不相交的文本块,
If we take disjoint chunks of texts,
66
00:02:44,520 --> 00:02:45,840
我们可能会得到答案
we might end up with the answer
67
00:02:45,840 --> 00:02:47,733
被分成两个特征。
being split between two features.
68
00:02:48,720 --> 00:02:52,050
因此,我们取而代之的是重叠的文本块
So instead, we take overlapping chunks of text
69
00:02:52,050 --> 00:02:53,910
确保至少其中一个块
to make sure at least one of the chunks
70
00:02:53,910 --> 00:02:56,940
将完整包含问题的答案。
will fully contain the answer to the question.
71
00:02:56,940 --> 00:02:59,220
所以,分词器会自动为我们完成所有这些
So, the tokenizer does all of this for us automatically
72
00:02:59,220 --> 00:03:01,920
使用 return_overflowing_tokens 选项。
with the return_overflowing_tokens option.
73
00:03:01,920 --> 00:03:02,753
步幅参数
The stride argument
74
00:03:02,753 --> 00:03:04,830
控制重叠标记的数量。
controls the number of overlapping tokens.
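[Editor's note] The overlapping-chunk idea can be sketched with a hypothetical helper (in the real pipeline, the tokenizer's return_overflowing_tokens and stride options do this for you):

```python
def chunk_with_overlap(tokens, max_len, stride):
    """Split tokens into chunks of at most max_len, each overlapping
    the previous one by `stride` tokens (assumes stride < max_len)."""
    chunks = []
    step = max_len - stride
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# Consecutive chunks share `stride` tokens, so an answer that would
# straddle a boundary is fully contained in at least one chunk
chunks = chunk_with_overlap(list(range(10)), max_len=6, stride=2)
```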
75
00:03:05,940 --> 00:03:07,740
这是我们很长的上下文
Here is how our very long context
76
00:03:07,740 --> 00:03:10,323
被截断成两个有一些重叠的特征。
gets truncated in two features with some overlap.
77
00:03:11,160 --> 00:03:12,720
通过应用相同的后处理
By applying the same post-processing
78
00:03:12,720 --> 00:03:14,850
(即我们之前看到的)应用到每个特征上,
we saw before for each feature,
79
00:03:14,850 --> 00:03:17,970
我们得到每个分数的答案,
we get the answer with a score for each of them,
80
00:03:17,970 --> 00:03:19,920
我们选择得分最高的答案
and we take the answer with the best score
81
00:03:19,920 --> 00:03:21,303
作为最终解决方案。
as a final solution.
82
00:03:23,089 --> 00:03:26,506
(轻过渡音乐)
(light transition music)