01_the-pipeline-function.srt
1
00:00:00,069 --> 00:00:01,341
(screen whooshes)

2
00:00:01,341 --> 00:00:02,449
(face logo whooshes)

3
00:00:02,449 --> 00:00:05,880
(screen whooshes)

4
00:00:05,880 --> 00:00:07,080
- The pipeline function.

5
00:00:09,540 --> 00:00:12,020
The pipeline function is the most high-level API

6
00:00:12,020 --> 00:00:14,010
of the Transformers library.

7
00:00:14,010 --> 00:00:16,050
It regroups together all the steps

8
00:00:16,050 --> 00:00:18,873
to go from raw texts to usable predictions.

9
00:00:20,228 --> 00:00:22,980
The model used is at the core of a pipeline,

10
00:00:22,980 --> 00:00:24,390
but the pipeline also includes

11
00:00:24,390 --> 00:00:26,610
all the necessary pre-processing,

12
00:00:26,610 --> 00:00:30,240
since the model does not expect texts, but numbers,

13
00:00:30,240 --> 00:00:32,040
as well as some post-processing,

14
00:00:32,040 --> 00:00:34,533
to make the output of the model human-readable.

15
00:00:35,910 --> 00:00:37,593
Let's look at a first example

16
00:00:37,593 --> 00:00:39,693
with the sentiment analysis pipeline.
17
00:00:40,740 --> 00:00:44,670
This pipeline performs text classification on a given input

18
00:00:44,670 --> 00:00:46,953
and determines if it's positive or negative.

19
00:00:47,910 --> 00:00:51,750
Here, it attributed the positive label to the given text,

20
00:00:51,750 --> 00:00:54,413
with a confidence of 95%.

21
00:00:55,650 --> 00:00:58,470
You can pass multiple texts to the same pipeline,

22
00:00:58,470 --> 00:01:00,270
which will be processed and passed

23
00:01:00,270 --> 00:01:02,673
through the model together as a batch.

24
00:01:03,570 --> 00:01:05,970
The output is a list of individual results

25
00:01:05,970 --> 00:01:07,923
in the same order as the input texts.

26
00:01:08,790 --> 00:01:12,270
Here we find the same label and score for the first text,

27
00:01:12,270 --> 00:01:14,443
and the second text is judged negative

28
00:01:14,443 --> 00:01:17,243
with a confidence of 99.9%.
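The sentiment-analysis calls described above can be sketched as follows. This is a minimal sketch: it assumes the `transformers` library is installed and that the task's default checkpoint (downloaded on first use) is the usual English sentiment model; the exact default may change between library versions.

```python
from transformers import pipeline

# Build a sentiment-analysis pipeline with the task's default model.
classifier = pipeline("sentiment-analysis")

# A single text returns a one-element list of {label, score} dicts.
print(classifier("I've been waiting for a HuggingFace course my whole life."))

# Several texts are batched through the model together; the results
# come back in the same order as the inputs.
results = classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
])
for result in results:
    print(result["label"], round(result["score"], 4))
```

As the video notes, the first text gets the POSITIVE label and the second NEGATIVE, each with its confidence score.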
29
00:01:18,720 --> 00:01:20,700
The zero-shot classification pipeline

30
00:01:20,700 --> 00:01:23,610
is a more general text-classification pipeline;

31
00:01:23,610 --> 00:01:26,370
it allows you to provide the labels you want.

32
00:01:26,370 --> 00:01:29,850
Here we want to classify our input text along the labels

33
00:01:29,850 --> 00:01:32,643
education, politics, and business.

34
00:01:33,540 --> 00:01:35,580
The pipeline successfully recognizes

35
00:01:35,580 --> 00:01:38,280
it's more about education than the other labels,

36
00:01:38,280 --> 00:01:40,643
with a confidence of 84%.
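The zero-shot example might look like this. Note that the default zero-shot checkpoint is a fairly large model, so the first call triggers a sizeable download:

```python
from transformers import pipeline

# Zero-shot classification: we supply the candidate labels ourselves,
# so no fine-tuning on our specific labels is needed.
classifier = pipeline("zero-shot-classification")

result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
# The labels come back sorted by score, most likely first.
print(result["labels"], [round(s, 4) for s in result["scores"]])
```

The scores are normalized across the candidate labels, which is why they sum to 1.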
37
00:01:41,670 --> 00:01:43,110
Moving on to other tasks,

38
00:01:43,110 --> 00:01:45,030
the text generation pipeline will

39
00:01:45,030 --> 00:01:46,533
auto-complete a given prompt.

40
00:01:47,460 --> 00:01:49,980
The output is generated with a bit of randomness,

41
00:01:49,980 --> 00:01:52,800
so it changes each time you call the generator object

42
00:01:52,800 --> 00:01:53,763
on a given prompt.
43
00:01:54,990 --> 00:01:57,123
Up until now, we've used the pipeline API

44
00:01:57,123 --> 00:02:00,360
with the default model associated with each task,

45
00:02:00,360 --> 00:02:02,880
but you can use it with any model that has been pretrained

46
00:02:02,880 --> 00:02:04,263
or fine-tuned on this task.

47
00:02:06,540 --> 00:02:10,350
Going on the model hub, huggingface.co/models,

48
00:02:10,350 --> 00:02:13,350
you can filter the available models by task.

49
00:02:13,350 --> 00:02:17,190
The default model used in our previous example was gpt2,

50
00:02:17,190 --> 00:02:19,290
but there are many more models available,

51
00:02:19,290 --> 00:02:20,523
and not just in English.

52
00:02:21,450 --> 00:02:23,670
Let's go back to the text generation pipeline

53
00:02:23,670 --> 00:02:26,193
and load it with another model, distilgpt2.

54
00:02:27,060 --> 00:02:28,950
This is a lighter version of gpt2

55
00:02:28,950 --> 00:02:30,603
created by the Hugging Face team.

56
00:02:31,740 --> 00:02:34,110
When applying the pipeline to a given prompt,

57
00:02:34,110 --> 00:02:36,360
we can specify several arguments,

58
00:02:36,360 --> 00:02:39,240
such as the maximum length of the generated texts

59
00:02:39,240 --> 00:02:41,700
or the number of sentences we want to return,

60
00:02:41,700 --> 00:02:44,150
since there is some randomness in the generation.
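Putting those two ideas together, the generation example with an explicit checkpoint and generation arguments can be sketched like this (a sketch only: because sampling is random, the generated continuations differ on every run):

```python
from transformers import pipeline

# Load the text-generation pipeline with an explicit checkpoint,
# distilgpt2, instead of the task's default model.
generator = pipeline("text-generation", model="distilgpt2")

# max_length caps the total length of each generated text, and
# num_return_sequences asks for several samples, which makes sense
# because the generation is partly random.
outputs = generator(
    "In this course, we will teach you how to",
    max_length=30,
    num_return_sequences=2,
)
for output in outputs:
    print(output["generated_text"])
```

Each returned dict contains a `generated_text` that starts with the prompt and continues it.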
61
00:02:46,080 --> 00:02:48,750
Generating texts by guessing the next word in a sentence

62
00:02:48,750 --> 00:02:51,450
was the pretraining objective of GPT-2.

63
00:02:51,450 --> 00:02:55,140
The fill mask pipeline is the pretraining objective of BERT,

64
00:02:55,140 --> 00:02:57,363
which is to guess the value of a masked word.

65
00:02:58,260 --> 00:03:01,020
In this case, we ask for the two most likely values

66
00:03:01,020 --> 00:03:03,660
for the missing word, according to the model,

67
00:03:03,660 --> 00:03:07,053
and get mathematical or computational as possible answers.
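A fill-mask sketch of that example, assuming the task's default checkpoint (which uses the `<mask>` token; BERT-style checkpoints use `[MASK]` instead):

```python
from transformers import pipeline

# Fill-mask guesses the value of the masked word; top_k controls
# how many candidate completions are returned.
unmasker = pipeline("fill-mask")

results = unmasker(
    "This course will teach you all about <mask> models.",
    top_k=2,
)
for result in results:
    print(result["token_str"], round(result["score"], 4))
```

Each candidate comes with the token string that would fill the blank and the model's score for it.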
68
00:03:08,280 --> 00:03:10,170
Another task Transformers models can perform

69
00:03:10,170 --> 00:03:12,660
is to classify each word in the sentence

70
00:03:12,660 --> 00:03:14,970
instead of the sentence as a whole.

71
00:03:14,970 --> 00:03:18,390
One example of this is Named Entity Recognition,

72
00:03:18,390 --> 00:03:20,820
which is the task of identifying entities,

73
00:03:20,820 --> 00:03:25,323
such as persons, organizations or locations, in a sentence.

74
00:03:26,400 --> 00:03:30,570
Here, the model correctly finds the person, Sylvain,

75
00:03:30,570 --> 00:03:32,453
the organization, Hugging Face,

76
00:03:32,453 --> 00:03:35,010
as well as the location, Brooklyn,

77
00:03:35,010 --> 00:03:36,303
inside the input text.

78
00:03:37,661 --> 00:03:40,230
The grouped_entities=True argument used

79
00:03:40,230 --> 00:03:42,330
is there to make the pipeline group together

80
00:03:42,330 --> 00:03:44,790
the different words linked to the same entity,

81
00:03:44,790 --> 00:03:46,353
such as Hugging and Face here.
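The NER example from the video can be sketched like this, with `grouped_entities=True` merging sub-word pieces that belong to the same entity (newer library versions express the same thing as `aggregation_strategy="simple"`); the default NER checkpoint is fairly large:

```python
from transformers import pipeline

# grouped_entities=True merges tokens that belong to the same
# entity, e.g. "Hugging" + "Face" -> "Hugging Face".
ner = pipeline("ner", grouped_entities=True)

entities = ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
for entity in entities:
    # entity_group is PER, ORG, or LOC for this checkpoint.
    print(entity["entity_group"], entity["word"])
```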
82
00:03:48,270 --> 00:03:50,670
Another task available with the pipeline API

83
00:03:50,670 --> 00:03:52,920
is extractive question answering.

84
00:03:52,920 --> 00:03:55,380
Given a context and a question,

85
00:03:55,380 --> 00:03:58,290
the model will identify the span of text in the context

86
00:03:58,290 --> 00:04:00,190
containing the answer to the question.
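A question-answering sketch with the task's default checkpoint; since the task is extractive, the answer is a span copied out of the context, not freshly generated text:

```python
from transformers import pipeline

# Extractive question answering: the model points at a span of
# the provided context rather than generating new words.
question_answerer = pipeline("question-answering")

result = question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
)
# The result holds the answer span and the model's confidence.
print(result["answer"], round(result["score"], 4))
```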
87
00:04:01,650 --> 00:04:03,960
Getting short summaries of very long articles

88
00:04:03,960 --> 00:04:06,540
is also something the Transformers library can help with,

89
00:04:06,540 --> 00:04:08,140
with the summarization pipeline.
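A summarization sketch; the input text here is just an illustration (any long article works), and `max_length` bounds the size of the summary in tokens:

```python
from transformers import pipeline

# Summarization condenses a long text into a short one; the
# default checkpoint is downloaded on first use.
summarizer = pipeline("summarization")

article = (
    "The Transformers library provides thousands of pretrained models "
    "to perform tasks on texts such as classification, information "
    "extraction, question answering, summarization, translation, and "
    "text generation. Its pipeline API bundles the pre-processing, the "
    "model call, and the post-processing into a single function, so "
    "users can go from raw text to usable predictions in a few lines."
)
summary = summarizer(article, max_length=40)
print(summary[0]["summary_text"])
```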
90
00:04:09,480 --> 00:04:12,570
Finally, the last task supported by the pipeline API

91
00:04:12,570 --> 00:04:14,130
is translation.

92
00:04:14,130 --> 00:04:16,170
Here we use a French/English model

93
00:04:16,170 --> 00:04:17,460
found on the Model Hub

94
00:04:17,460 --> 00:04:19,893
to get the English version of our input text.

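A translation sketch using a French-to-English checkpoint from the Hub, as in the video (this particular model also requires the `sentencepiece` package to be installed):

```python
from transformers import pipeline

# Translation needs an explicit checkpoint; here a French-to-English
# model found on the Model Hub.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

result = translator("Ce cours est produit par Hugging Face.")
print(result[0]["translation_text"])
```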
95
00:04:21,600 --> 00:04:23,490
Here is a brief summary of all the tasks

96
00:04:23,490 --> 00:04:25,500
we've looked into in this video.

97
00:04:25,500 --> 00:04:27,390
Try them out through the inference widgets

98
00:04:27,390 --> 00:04:28,327
in the Model Hub.

99
00:04:30,459 --> 00:04:33,475
(screen whooshes)

100
00:04:33,475 --> 00:04:35,175
(logo whooshes)