
Add accurate English subtitles. #384

Merged
merged 1 commit on Nov 28, 2022
539 changes: 455 additions & 84 deletions subtitles/en/00_welcome-to-the-hugging-face-course.srt

Large diffs are not rendered by default.

668 changes: 445 additions & 223 deletions subtitles/en/01_the-pipeline-function.srt

Large diffs are not rendered by default.

582 changes: 581 additions & 1 deletion subtitles/en/02_the-carbon-footprint-of-transformers.srt

Large diffs are not rendered by default.

599 changes: 396 additions & 203 deletions subtitles/en/03_what-is-transfer-learning.srt

Large diffs are not rendered by default.

418 changes: 280 additions & 138 deletions subtitles/en/04_the-transformer-architecture.srt

Large diffs are not rendered by default.

678 changes: 454 additions & 224 deletions subtitles/en/05_transformer-models-encoders.srt

Large diffs are not rendered by default.

658 changes: 395 additions & 263 deletions subtitles/en/06_transformer-models-decoders.srt

Large diffs are not rendered by default.

944 changes: 621 additions & 323 deletions subtitles/en/07_transformer-models-encoder-decoders.srt

Large diffs are not rendered by default.

715 changes: 471 additions & 244 deletions subtitles/en/08_what-happens-inside-the-pipeline-function-(pytorch).srt

Large diffs are not rendered by default.

subtitles/en/09_what-happens-inside-the-pipeline-function-(tensorflow).srt

Large diffs are not rendered by default.

465 changes: 308 additions & 157 deletions subtitles/en/10_instantiate-a-transformers-model-(pytorch).srt

Large diffs are not rendered by default.

512 changes: 317 additions & 195 deletions subtitles/en/11_instantiate-a-transformers-model-(tensorflow).srt

Large diffs are not rendered by default.

137 changes: 99 additions & 38 deletions subtitles/en/12_tokenizers-overview.srt
@@ -1,38 +1,99 @@
-1
-00:00:03,840 --> 00:00:09,200
-In these few videos, we'll take a look at the
-tokenizers. In Natural Language Processing,
-
-2
-00:00:09,200 --> 00:00:14,880
-most of the data that we handle consists of raw
-text. However, machine learning models cannot read
-
-3
-00:00:14,880 --> 00:00:23,200
-and understand text in its raw form they can only
-work with numbers. The tokenizer's objective will
-
-4
-00:00:23,200 --> 00:00:30,080
-be to translate the text into numbers. There are
-several possible approaches to this conversion,
-
-5
-00:00:30,080 --> 00:00:33,120
-and the objective is to find the
-most meaningful representation.
-
-6
-00:00:36,000 --> 00:00:40,400
-We'll take a look at three distinct tokenization
-algorithms. We compare them one to one,
-
-7
-00:00:40,400 --> 00:00:44,880
-so we recommend you look at the videos
-in the following order: Word-based,
-
-8
-00:00:45,680 --> 00:00:55,680
-Character-based, and Subword-based.
+1
+00:00:00,450 --> 00:00:01,509
+(intro whooshing)
+
+2
+00:00:01,509 --> 00:00:02,720
+(smiley snapping)
+
+3
+00:00:02,720 --> 00:00:03,930
+(words whooshing)
+
+4
+00:00:03,930 --> 00:00:04,920
+- In the next few videos,
+
+5
+00:00:04,920 --> 00:00:06,720
+we'll take a look at the tokenizers.
+
+6
+00:00:07,860 --> 00:00:09,240
+In natural language processing,
+
+7
+00:00:09,240 --> 00:00:12,930
+most of the data that we
+handle consists of raw text.
+
+8
+00:00:12,930 --> 00:00:14,280
+However, machine learning models
+
+9
+00:00:14,280 --> 00:00:17,103
+cannot read or understand
+text in its raw form,
+
+10
+00:00:18,540 --> 00:00:20,253
+they can only work with numbers.
+
+11
+00:00:21,360 --> 00:00:23,220
+So the tokenizer's objective
+
+12
+00:00:23,220 --> 00:00:25,923
+will be to translate
+the text into numbers.
+
+13
+00:00:27,600 --> 00:00:30,240
+There are several possible
+approaches to this conversion,
+
+14
+00:00:30,240 --> 00:00:31,110
+and the objective
+
+15
+00:00:31,110 --> 00:00:33,453
+is to find the most
+meaningful representation.
+
+16
+00:00:36,240 --> 00:00:39,390
+We'll take a look at three
+distinct tokenization algorithms.
+
+17
+00:00:39,390 --> 00:00:40,530
+We compare them one to one,
+
+18
+00:00:40,530 --> 00:00:42,600
+so we recommend you take
+a look at the videos
+
+19
+00:00:42,600 --> 00:00:44,040
+in the following order.
+
+20
+00:00:44,040 --> 00:00:45,390
+First, "Word-based,"
+
+21
+00:00:45,390 --> 00:00:46,800
+followed by "Character-based,"
+
+22
+00:00:46,800 --> 00:00:48,877
+and finally, "Subword-based."
+
+23
+00:00:48,877 --> 00:00:51,794
+(outro whooshing)
