
Add accurate English subtitles. #384

Merged
merged 1 commit on Nov 28, 2022
539 changes: 455 additions & 84 deletions subtitles/en/00_welcome-to-the-hugging-face-course.srt

Large diffs are not rendered by default.

668 changes: 445 additions & 223 deletions subtitles/en/01_the-pipeline-function.srt

Large diffs are not rendered by default.

582 changes: 581 additions & 1 deletion subtitles/en/02_the-carbon-footprint-of-transformers.srt

Large diffs are not rendered by default.

599 changes: 396 additions & 203 deletions subtitles/en/03_what-is-transfer-learning.srt

Large diffs are not rendered by default.

418 changes: 280 additions & 138 deletions subtitles/en/04_the-transformer-architecture.srt

Large diffs are not rendered by default.

678 changes: 454 additions & 224 deletions subtitles/en/05_transformer-models-encoders.srt

Large diffs are not rendered by default.

658 changes: 395 additions & 263 deletions subtitles/en/06_transformer-models-decoders.srt

Large diffs are not rendered by default.

944 changes: 621 additions & 323 deletions subtitles/en/07_transformer-models-encoder-decoders.srt

Large diffs are not rendered by default.

715 changes: 471 additions & 244 deletions subtitles/en/08_what-happens-inside-the-pipeline-function-(pytorch).srt

Large diffs are not rendered by default.

subtitles/en/09_what-happens-inside-the-pipeline-function-(tensorflow).srt

Large diffs are not rendered by default.

465 changes: 308 additions & 157 deletions subtitles/en/10_instantiate-a-transformers-model-(pytorch).srt

Large diffs are not rendered by default.

512 changes: 317 additions & 195 deletions subtitles/en/11_instantiate-a-transformers-model-(tensorflow).srt

Large diffs are not rendered by default.

137 changes: 99 additions & 38 deletions subtitles/en/12_tokenizers-overview.srt
@@ -1,38 +1,99 @@
-1
-00:00:03,840 --> 00:00:09,200
-In these few videos, we'll take a look at the
-tokenizers. In Natural Language Processing,
-
-2
-00:00:09,200 --> 00:00:14,880
-most of the data that we handle consists of raw
-text. However, machine learning models cannot read
-
-3
-00:00:14,880 --> 00:00:23,200
-and understand text in its raw form they can only
-work with numbers. The tokenizer's objective will
-
-4
-00:00:23,200 --> 00:00:30,080
-be to translate the text into numbers. There are
-several possible approaches to this conversion,
-
-5
-00:00:30,080 --> 00:00:33,120
-and the objective is to find the
-most meaningful representation.
-
-6
-00:00:36,000 --> 00:00:40,400
-We'll take a look at three distinct tokenization
-algorithms. We compare them one to one,
-
-7
-00:00:40,400 --> 00:00:44,880
-so we recommend you look at the videos
-in the following order: Word-based,
-
-8
-00:00:45,680 --> 00:00:55,680
-Character-based, and Subword-based.
+1
+00:00:00,450 --> 00:00:01,509
+(intro whooshing)
+
+2
+00:00:01,509 --> 00:00:02,720
+(smiley snapping)
+
+3
+00:00:02,720 --> 00:00:03,930
+(words whooshing)
+
+4
+00:00:03,930 --> 00:00:04,920
+- In the next few videos,
+
+5
+00:00:04,920 --> 00:00:06,720
+we'll take a look at the tokenizers.
+
+6
+00:00:07,860 --> 00:00:09,240
+In natural language processing,
+
+7
+00:00:09,240 --> 00:00:12,930
+most of the data that we
+handle consists of raw text.
+
+8
+00:00:12,930 --> 00:00:14,280
+However, machine learning models
+
+9
+00:00:14,280 --> 00:00:17,103
+cannot read or understand
+text in its raw form,
+
+10
+00:00:18,540 --> 00:00:20,253
+they can only work with numbers.
+
+11
+00:00:21,360 --> 00:00:23,220
+So the tokenizer's objective
+
+12
+00:00:23,220 --> 00:00:25,923
+will be to translate
+the text into numbers.
+
+13
+00:00:27,600 --> 00:00:30,240
+There are several possible
+approaches to this conversion,
+
+14
+00:00:30,240 --> 00:00:31,110
+and the objective
+
+15
+00:00:31,110 --> 00:00:33,453
+is to find the most
+meaningful representation.
+
+16
+00:00:36,240 --> 00:00:39,390
+We'll take a look at three
+distinct tokenization algorithms.
+
+17
+00:00:39,390 --> 00:00:40,530
+We compare them one to one,
+
+18
+00:00:40,530 --> 00:00:42,600
+so we recommend you take
+a look at the videos
+
+19
+00:00:42,600 --> 00:00:44,040
+in the following order.
+
+20
+00:00:44,040 --> 00:00:45,390
+First, "Word-based,"
+
+21
+00:00:45,390 --> 00:00:46,800
+followed by "Character-based,"
+
+22
+00:00:46,800 --> 00:00:48,877
+and finally, "Subword-based."
+
+23
+00:00:48,877 --> 00:00:51,794
+(outro whooshing)
