diff --git a/subtitles/README.md b/subtitles/README.md index 002948954..833481ede 100644 --- a/subtitles/README.md +++ b/subtitles/README.md @@ -37,8 +37,8 @@ For example, in the `zh-CN` subtitles, each block has the following format: ``` 1 00:00:05,850 --> 00:00:07,713 -- 欢迎来到 Hugging Face 课程。 -- Welcome to the Hugging Face Course. +欢迎来到 Hugging Face 课程。 +Welcome to the Hugging Face Course. ``` To upload the SRT file to YouTube, we need the subtitle in monolingual format, i.e. the above block should read: @@ -46,7 +46,7 @@ To upload the SRT file to YouTube, we need the subtitle in monolingual format, i ``` 1 00:00:05,850 --> 00:00:07,713 -- 欢迎来到 Hugging Face 课程。 +欢迎来到 Hugging Face 课程。 ``` To handle this, we provide a script that converts the bilingual SRT files to monolingual ones. To perform the conversion, run: diff --git a/subtitles/en/metadata_tasks.csv b/subtitles/en/metadata_tasks.csv new file mode 100644 index 000000000..2af9c9a82 --- /dev/null +++ b/subtitles/en/metadata_tasks.csv @@ -0,0 +1,7 @@ +id,title,link,srt_filename +wVHdVlPScxA,🤗 Tasks: Token Classification,https://www.youtube.com/watch?v=wVHdVlPScxA&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=1,subtitles/en/tasks_00_🤗-tasks-token-classification.srt +ajPx5LwJD-I,🤗 Tasks: Question Answering,https://www.youtube.com/watch?v=ajPx5LwJD-I&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=2,subtitles/en/tasks_01_🤗-tasks-question-answering.srt +Vpjb1lu0MDk,🤗 Tasks: Causal Language Modeling,https://www.youtube.com/watch?v=Vpjb1lu0MDk&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=3,subtitles/en/tasks_02_🤗-tasks-causal-language-modeling.srt +mqElG5QJWUg,🤗 Tasks: Masked Language Modeling,https://www.youtube.com/watch?v=mqElG5QJWUg&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=4,subtitles/en/tasks_03_🤗-tasks-masked-language-modeling.srt +yHnr5Dk2zCI,🤗 Tasks: Summarization,https://www.youtube.com/watch?v=yHnr5Dk2zCI&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=5,subtitles/en/tasks_04_🤗-tasks-summarization.srt 
+1JvfrvZgi6c,🤗 Tasks: Translation,https://www.youtube.com/watch?v=1JvfrvZgi6c&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=6,subtitles/en/tasks_05_🤗-tasks-translation.srt diff --git a/subtitles/en/raw/tasks.md b/subtitles/en/raw/tasks.md new file mode 100644 index 000000000..a95d2429a --- /dev/null +++ b/subtitles/en/raw/tasks.md @@ -0,0 +1,77 @@ +Note: the following transcripts are associated with Merve Noyan's videos in the Hugging Face Tasks playlist: https://www.youtube.com/playlist?list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf
+
+Token Classification video
+
+Welcome to the Hugging Face tasks series! In this video we’ll take a look at the token classification task.
+Token classification is the task of assigning a label to each token in a sentence. There are various token classification tasks; the most common are Named Entity Recognition and Part-of-Speech Tagging.
+Let’s take a quick look at the Named Entity Recognition task. The goal of this task is to find the entities in a piece of text, such as a person, location, or organization. This task is formulated as labelling each token with one class for each entity, and another class for tokens that have no entity.
+Another token classification task is part-of-speech tagging. The goal of this task is to label each word with a particular part of speech, such as noun, pronoun, adjective, verb and so on. This task is formulated as labelling each token with its part of speech.
+Token classification models are evaluated on Accuracy, Recall, Precision and F1-Score. The metrics are calculated for each of the classes. We count true positives, false positives and false negatives to calculate precision and recall, and take their harmonic mean to get the F1-Score. We then calculate these metrics for every class and take the overall average to evaluate the model.
+An example dataset used for this task is CoNLL-2003. Here, each token belongs to a certain named entity class, denoted by its index in the list of labels.
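The evaluation recipe above can be made concrete with a short sketch. The toy Python snippet below is not part of the course materials (the tag names are invented for illustration); it computes per-class precision, recall and F1 at the token level, then averages the per-class scores:

```python
from collections import Counter

def per_class_f1(true_tags, pred_tags):
    """Per-class precision, recall and F1, plus their average across classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(true_tags, pred_tags):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the true class was different
            fn[t] += 1  # true class t was missed
    scores = {}
    for c in set(true_tags) | set(pred_tags):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        # F1 is the harmonic mean of precision and recall
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = f1
    macro = sum(scores.values()) / len(scores)
    return scores, macro

# Made-up example: one token was wrongly tagged B-LOC instead of O
tags_true = ["O", "B-PER", "O", "B-LOC", "O"]
tags_pred = ["O", "B-PER", "B-LOC", "B-LOC", "O"]
scores, macro = per_class_f1(tags_true, tags_pred)
```

Note that production NER evaluation (e.g. on CoNLL-style data) is usually done at the entity level rather than per token, but the per-class averaging idea is the same.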
+You can extract important information from invoices using named entity recognition models, such as the date, organization name or address.
+For more information about the Token classification task, check out the Hugging Face course.
+
+
+Question Answering video
+
+Welcome to the Hugging Face tasks series. In this video, we will take a look at the Question Answering task.
+Question answering is the task of extracting an answer from a given document.
+Question answering models take a context, which is the document you want to search in, and a question, and return an answer. Note that the answer is not generated, but extracted from the context. This type of question answering is called extractive.
+The task is evaluated on two metrics, exact match and F1-Score.
+As the name implies, exact match looks for an exact match between the predicted answer and the correct answer.
+A common metric used is the F1-Score, which gives credit for partially correct predictions at the token level. It is calculated as the harmonic mean of two metrics called precision and recall, which are widely used in classification problems.
+An example dataset used for this task is called SQuAD. This dataset contains contexts, questions and answers that are obtained from English Wikipedia articles.
+You can use question answering models to automatically answer the questions asked by your customers. You simply need a document containing information about your business, and you can query that document with the questions asked by your customers.
+For more information about the Question Answering task, check out the Hugging Face course.
+
+
+Causal Language Modeling video
+
+Welcome to the Hugging Face tasks series! In this video we’ll take a look at Causal Language Modeling.
+Causal language modeling is the task of predicting the next word in a sentence, given all the previous words. This task is very similar to the autocorrect function that you might have on your phone.
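As a toy illustration of next-word prediction (my own sketch, nothing like how transformer language models actually work), a simple bigram counter already does a crude version of the task:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the continuation most often seen after `word`, or None."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

# Tiny made-up corpus
corpus = [
    "i love hugging face",
    "i love transformers",
    "you love hugging face",
]
model = train_bigram(corpus)
print(predict_next(model, "love"))     # "hugging" (seen twice, vs "transformers" once)
print(predict_next(model, "hugging"))  # "face"
```

Real causal language models condition on the entire preceding sequence rather than just the last word, which is what makes them so much more capable than this counter.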
+These models take a sequence to be completed and output the completed sequence.
+Classification metrics can’t be used, as there’s no single correct answer for a completion. Instead, we evaluate the distribution of the text completed by the model.
+A common metric to do so is the cross-entropy loss. Perplexity is also a widely used metric, and it is calculated as the exponential of the cross-entropy loss.
+You can use any dataset with plain text and tokenize the text to prepare the data.
+Causal language models can be used to generate code.
+For more information about the Causal Language Modeling task, check out the Hugging Face course.
+
+
+Masked Language Modeling video
+
+Welcome to the Hugging Face tasks series! In this video we’ll take a look at Masked Language Modeling.
+Masked language modeling is the task of predicting which words should fill in the blanks of a sentence.
+These models take a masked text as the input and output the possible values for that mask.
+Masked language modeling is handy before fine-tuning your model for your task. For example, if you need to use a model in a specific domain, say, biomedical documents, models like BERT will treat your domain-specific words as rare tokens. If you train a masked language model on your biomedical corpus and then fine-tune your model on a downstream task, you will get better performance.
+Classification metrics can’t be used, as there’s no single correct answer for the mask values. Instead, we evaluate the distribution of the mask values.
+A common metric to do so is the cross-entropy loss. Perplexity is also a widely used metric, and it is calculated as the exponential of the cross-entropy loss.
+You can use any dataset with plain text and tokenize and mask the text to prepare the data.
+For more information about Masked Language Modeling, check out the Hugging Face course.
+
+
+Summarization video
+
+Welcome to the Hugging Face tasks series. In this video, we will take a look at the Text Summarization task.
+Summarization is the task of producing a shorter version of a document while preserving the relevant and important information in the document.
+Summarization models take a document to be summarized and output the summarized text.
+This task is evaluated on the ROUGE score. It’s based on the overlap between the produced sequence and the correct sequence.
+You might see this as ROUGE-1, which is the overlap of single tokens, and ROUGE-2, the overlap of consecutive token pairs. ROUGE-N refers to the overlap of n consecutive tokens. Here we see an example of how overlaps take place.
+An example dataset used for this task is called Extreme Summarization (XSum). This dataset contains texts and their summarized versions.
+You can use summarization models to summarize research papers, which would enable researchers to easily pick papers for their reading list.
+For more information about the Summarization task, check out the Hugging Face course.
+
+
+Translation video
+
+Welcome to the Hugging Face tasks series. In this video, we will take a look at the Translation task.
+Translation is the task of translating text from one language to another.
+These models take a text in the source language and output the translation of that text in the target language.
+The task is evaluated on the BLEU score.
+The score ranges from 0 to 1, where 1 means the translation matched perfectly and 0 means it did not match at all.
+BLEU is calculated over spans of consecutive tokens called n-grams. A unigram is a single token, a bigram is a pair of consecutive tokens, and an n-gram is n consecutive tokens.
+Machine translation datasets contain pairs of text in one language and the translation of that text in another language.
+These models can help you build conversational agents across different languages.
+One option is to translate the training data used for the chatbot and train a separate chatbot.
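Both ROUGE and BLEU described above are built on n-gram overlap. Here is a minimal sketch of that core idea (my own illustration, not the official metric implementations, which add clipping rules, brevity penalties and more):

```python
from collections import Counter

def ngrams(tokens, n):
    """All spans of n consecutive tokens (unigrams for n=1, bigrams for n=2, ...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_overlap(candidate, reference, n):
    """Fraction of reference n-grams that also appear in the candidate text."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    overlap = sum(min(cand[g], ref[g]) for g in ref)
    return overlap / max(sum(ref.values()), 1)

ref = "the cat sat on the mat"
cand = "the cat lay on the mat"
print(ngram_overlap(cand, ref, 1))  # 5 of 6 reference unigrams appear in the candidate
print(ngram_overlap(cand, ref, 2))  # only 3 of 5 reference bigrams survive the word swap
```

Note how a single changed word costs more at the bigram level than at the unigram level, which is why higher-order n-grams reward fluent, correctly ordered output.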
+Another option is to put translation models around your chatbot: translate user inputs from the user’s language into the language the chatbot was trained on, do intent classification and get the chatbot’s output, then translate that output back into the user’s language.
+For more information about the Translation task, check out the Hugging Face course. diff --git "a/subtitles/en/tasks_00_\360\237\244\227-tasks-token-classification.srt" "b/subtitles/en/tasks_00_\360\237\244\227-tasks-token-classification.srt" new file mode 100644 index 000000000..ee6e6f207 --- /dev/null +++ "b/subtitles/en/tasks_00_\360\237\244\227-tasks-token-classification.srt" @@ -0,0 +1,116 @@ +1 +00:00:04,520 --> 00:00:07,400 +Welcome to the Hugging Face tasks series! + +2 +00:00:07,400 --> 00:00:11,870 +In this video we’ll take a look at the token +classification task. + +3 +00:00:11,870 --> 00:00:17,900 +Token classification is the task of assigning +a label to each token in a sentence. + +4 +00:00:17,900 --> 00:00:23,310 +There are various token classification tasks +and the most common are Named Entity Recognition + +5 +00:00:23,310 --> 00:00:26,430 +and Part-of-Speech Tagging. + +6 +00:00:26,430 --> 00:00:31,640 +Let’s take a quick look at the Named Entity +Recognition task. + +7 +00:00:31,640 --> 00:00:38,400 +The goal of this task is to find the entities +in a piece of text, such as person, location, + +8 +00:00:38,400 --> 00:00:40,210 +or organization. + +9 +00:00:40,210 --> 00:00:45,250 +This task is formulated as labelling each +token with one class for each entity, and + +10 +00:00:45,250 --> 00:00:51,719 +another class for tokens that have no entity. + +11 +00:00:51,719 --> 00:00:55,670 +Another token classification task is part-of-speech +tagging. + +12 +00:00:55,670 --> 00:01:01,399 +The goal of this task is to label the words +for a particular part of a speech, such as + +13 +00:01:01,399 --> 00:01:05,900 +noun, pronoun, adjective, verb and so on.
+ +14 +00:01:05,900 --> 00:01:11,270 +This task is formulated as labelling each +token with parts of speech. + +15 +00:01:11,270 --> 00:01:19,659 +Token classification models are evaluated +on Accuracy, Recall, Precision and F1-Score. + +16 +00:01:19,659 --> 00:01:22,950 +The metrics are calculated for each of the +classes. + +17 +00:01:22,950 --> 00:01:28,040 +We calculate true positive, true negative +and false positives to calculate precision + +18 +00:01:28,040 --> 00:01:31,829 +and recall, and take their harmonic mean to +get F1-Score. + +19 +00:01:31,829 --> 00:01:42,329 +Then we calculate it for every class and take +the overall average to evaluate our model. + +20 +00:01:42,329 --> 00:01:45,680 +An example dataset used for this task is ConLL2003. + +21 +00:01:45,680 --> 00:01:51,750 +Here, each token belongs to a certain named +entity class, denoted as the indices of the + +22 +00:01:51,750 --> 00:01:55,380 +list containing the labels. + +23 +00:01:55,380 --> 00:02:00,720 +You can extract important information from +invoices using named entity recognition models, + +24 +00:02:00,720 --> 00:02:07,070 +such as date, organization name or address. + +25 +00:02:07,070 --> 00:02:16,840 +For more information about the Token classification +task, check out the Hugging Face course. diff --git "a/subtitles/en/tasks_01_\360\237\244\227-tasks-question-answering.srt" "b/subtitles/en/tasks_01_\360\237\244\227-tasks-question-answering.srt" new file mode 100644 index 000000000..6416fde12 --- /dev/null +++ "b/subtitles/en/tasks_01_\360\237\244\227-tasks-question-answering.srt" @@ -0,0 +1,87 @@ +1 +00:00:04,400 --> 00:00:06,480 +Welcome to the Hugging Face tasks series.   + +2 +00:00:07,200 --> 00:00:10,080 +In this video, we will take a look  +at the Question Answering task.  + +3 +00:00:13,120 --> 00:00:17,200 +Question answering is the task of  +extracting an answer in a given document.  
+ +4 +00:00:21,120 --> 00:00:25,600 +Question answering models take a context,  +which is the document you want to search in,   + +5 +00:00:26,240 --> 00:00:31,440 +and a question and return an answer.  +Note that the answer is not generated,   + +6 +00:00:31,440 --> 00:00:37,600 +but extracted from the context. This type  +of question answering is called extractive.  + +7 +00:00:42,320 --> 00:00:46,960 +The task is evaluated on two  +metrics, exact match and F1-Score.  + +8 +00:00:49,680 --> 00:00:52,320 +As the name implies, exact match looks for an   + +9 +00:00:52,320 --> 00:00:57,840 +exact match between the predicted  +answer and the correct answer.  + +10 +00:01:00,080 --> 00:01:05,520 +A common metric used is the F1-Score, which  +is calculated over tokens that are predicted   + +11 +00:01:05,520 --> 00:01:10,960 +correctly and incorrectly. It is calculated  +over the average of two metrics called   + +12 +00:01:10,960 --> 00:01:16,560 +precision and recall which are metrics that  +are used widely in classification problems.  + +13 +00:01:20,880 --> 00:01:28,240 +An example dataset used for this task is called  +SQuAD. This dataset contains contexts, questions   + +14 +00:01:28,240 --> 00:01:32,080 +and the answers that are obtained  +from English Wikipedia articles.  + +15 +00:01:35,440 --> 00:01:39,520 +You can use question answering models to  +automatically answer the questions asked   + +16 +00:01:39,520 --> 00:01:46,480 +by your customers. You simply need a document  +containing information about your business   + +17 +00:01:47,200 --> 00:01:53,840 +and query through that document with  +the questions asked by your customers.  + +18 +00:01:55,680 --> 00:02:06,160 +For more information about the Question Answering  +task, check out the Hugging Face course. 
diff --git "a/subtitles/en/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" "b/subtitles/en/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" new file mode 100644 index 000000000..06dc54e12 --- /dev/null +++ "b/subtitles/en/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" @@ -0,0 +1,63 @@ +1 +00:00:04,560 --> 00:00:06,640 +Welcome to the Hugging Face tasks series!   + +2 +00:00:07,200 --> 00:00:10,400 +In this video we’ll take a look  +at Causal Language Modeling.  + +3 +00:00:13,600 --> 00:00:16,880 +Causal language modeling is  +the task of predicting the next  + +4 +00:00:16,880 --> 00:00:21,920 +word in a sentence, given all the  +previous words. This task is very   + +5 +00:00:21,920 --> 00:00:29,920 +similar to the autocorrect function  +that you might have on your phone.  + +6 +00:00:29,920 --> 00:00:34,720 +These models take a sequence to be  +completed and outputs the complete sequence.  + +7 +00:00:38,640 --> 00:00:44,160 +Classification metrics can’t be used as there’s  +no single correct answer for completion.   + +8 +00:00:44,960 --> 00:00:49,280 +Instead, we evaluate the distribution  +of the text completed by the model.  + +9 +00:00:50,800 --> 00:00:55,440 +A common metric to do so is the  +cross-entropy loss. Perplexity is   + +10 +00:00:55,440 --> 00:01:01,280 +also a widely used metric and it is calculated  +as the exponential of the cross-entropy loss.  + +11 +00:01:05,200 --> 00:01:11,840 +You can use any dataset with plain text  +and tokenize the text to prepare the data.  + +12 +00:01:15,040 --> 00:01:18,240 +Causal language models can  +be used to generate code.  + +13 +00:01:22,480 --> 00:01:33,200 +For more information about the Causal Language  +Modeling task, check out the Hugging Face course. 
diff --git "a/subtitles/en/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt" "b/subtitles/en/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt" new file mode 100644 index 000000000..28f376b68 --- /dev/null +++ "b/subtitles/en/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt" @@ -0,0 +1,85 @@ +1 +00:00:04,660 --> 00:00:07,589 +Welcome to the Hugging Face tasks series! + +2 +00:00:07,589 --> 00:00:13,730 +In this video we’ll take a look at Masked +Language Modeling. + +3 +00:00:13,730 --> 00:00:20,720 +Masked language modeling is the task of predicting +which words should fill in the blanks of a + +4 +00:00:20,720 --> 00:00:23,500 +sentence. + +5 +00:00:23,500 --> 00:00:32,870 +These models take a masked text as the input +and output the possible values for that mask. + +6 +00:00:32,870 --> 00:00:37,550 +Masked language modeling is handy before fine-tuning +your model for your task. + +7 +00:00:37,550 --> 00:00:43,579 +For example, if you need to use a model in +a specific domain, say, biomedical documents, + +8 +00:00:43,579 --> 00:00:49,050 +models like BERT will treat your domain-specific +words as rare tokens. + +9 +00:00:49,050 --> 00:00:54,220 +If you train a masked language model using +your biomedical corpus and then fine tune + +10 +00:00:54,220 --> 00:01:02,929 +your model on a downstream task, you will +have a better performance. + +11 +00:01:02,929 --> 00:01:07,799 +Classification metrics can’t be used as +there’s no single correct answer to mask + +12 +00:01:07,799 --> 00:01:08,799 +values. + +13 +00:01:08,799 --> 00:01:12,900 +Instead, we evaluate the distribution of the +mask values. + +14 +00:01:12,900 --> 00:01:16,590 +A common metric to do so is the cross-entropy +loss. + +15 +00:01:16,590 --> 00:01:22,010 +Perplexity is also a widely used metric and +it is calculated as the exponential of the + +16 +00:01:22,010 --> 00:01:27,240 +cross-entropy loss. 
+ +17 +00:01:27,240 --> 00:01:35,680 +You can use any dataset with plain text and +tokenize the text to mask the data. + +18 +00:01:35,680 --> 00:01:44,710 +For more information about the Masked Language +Modeling, check out the Hugging Face course. diff --git "a/subtitles/en/tasks_04_\360\237\244\227-tasks-summarization.srt" "b/subtitles/en/tasks_04_\360\237\244\227-tasks-summarization.srt" new file mode 100644 index 000000000..0c16f7f85 --- /dev/null +++ "b/subtitles/en/tasks_04_\360\237\244\227-tasks-summarization.srt" @@ -0,0 +1,68 @@ +1 +00:00:04,560 --> 00:00:06,640 +Welcome to the Hugging Face tasks series.   + +2 +00:00:07,280 --> 00:00:10,720 +In this video, we will take a look  +at the Text Summarization task.  + +3 +00:00:13,680 --> 00:00:16,480 +Summarization is a task of  +producing a shorter version   + +4 +00:00:16,480 --> 00:00:21,600 +of a document while preserving the relevant  +and important information in the document.  + +5 +00:00:25,040 --> 00:00:29,840 +Summarization models take a document to be  +summarized and output the summarized text.  + +6 +00:00:33,360 --> 00:00:40,240 +This task is evaluated on the ROUGE score. It’s  +based on the overlap between the produced sequence   + +7 +00:00:40,240 --> 00:00:48,000 +and the correct sequence. +You might see this as ROUGE-1,   + +8 +00:00:48,000 --> 00:00:55,600 +which is the overlap of single tokens and ROUGE-2,  +the overlap of subsequent token pairs. ROUGE-N   + +9 +00:00:55,600 --> 00:01:02,960 +refers to the overlap of n subsequent tokens.  +Here we see an example of how overlaps take place.  + +10 +00:01:06,160 --> 00:01:11,280 +An example dataset used for this task is  +called Extreme Summarization, XSUM. This   + +11 +00:01:11,280 --> 00:01:14,480 +dataset contains texts and  +their summarized versions.  
+ +12 +00:01:17,680 --> 00:01:21,280 +You can use summarization models  +to summarize research papers which   + +13 +00:01:21,280 --> 00:01:25,680 +would enable researchers to easily  +pick papers for their reading list.  + +14 +00:01:29,040 --> 00:01:39,520 +For more information about the Summarization  +task, check out the Hugging Face course. diff --git "a/subtitles/en/tasks_05_\360\237\244\227-tasks-translation.srt" "b/subtitles/en/tasks_05_\360\237\244\227-tasks-translation.srt" new file mode 100644 index 000000000..ff491e24c --- /dev/null +++ "b/subtitles/en/tasks_05_\360\237\244\227-tasks-translation.srt" @@ -0,0 +1,96 @@ +1 +00:00:04,569 --> 00:00:07,529 +Welcome to the Hugging Face tasks series. + +2 +00:00:07,529 --> 00:00:11,840 +In this video, we will take a look at the +Translation task. + +3 +00:00:11,840 --> 00:00:19,420 +Translation is the task of translating text +from one language to another. + +4 +00:00:19,420 --> 00:00:24,420 +These models take a text in the source language +and output the translation of that text in + +5 +00:00:24,420 --> 00:00:28,609 +the target language. + +6 +00:00:28,609 --> 00:00:31,619 +The task is evaluated on the BLEU score. + +7 +00:00:31,619 --> 00:00:38,430 +The score ranges from 0 to 1, in which 1 means +the translation perfectly matched and 0 did + +8 +00:00:38,430 --> 00:00:40,110 +not match at all. + +9 +00:00:40,110 --> 00:00:45,320 +BLEU is calculated over subsequent tokens +called n-grams. + +10 +00:00:45,320 --> 00:00:51,629 +Unigram refers to a single token while bi-gram +refers to token pairs and n-grams refer to + +11 +00:00:51,629 --> 00:00:56,219 +n subsequent tokens. + +12 +00:00:56,219 --> 00:01:01,859 +Machine translation datasets contain pairs +of text in a language and translation of the + +13 +00:01:01,859 --> 00:01:05,910 +text in another language. + +14 +00:01:05,910 --> 00:01:11,290 +These models can help you build conversational +agents across different languages. 
+ +15 +00:01:11,290 --> 00:01:16,110 +One option is to translate the training data +used for the chatbot and train a separate + +16 +00:01:16,110 --> 00:01:19,970 +chatbot. + +17 +00:01:19,970 --> 00:01:24,950 +You can put one translation model from your +user’s language to the language your chatbot + +18 +00:01:24,950 --> 00:01:31,360 +is trained on, translate the user inputs and +do intent classification, take the output + +19 +00:01:31,360 --> 00:01:39,399 +of the chatbot and translate it from the language +your chatbot was trained on to the user’s + +20 +00:01:39,399 --> 00:01:40,850 +language. + +21 +00:01:40,850 --> 00:01:49,720 +For more information about the Translation +task, check out the Hugging Face course. diff --git a/subtitles/fr/metadata_tasks.csv b/subtitles/fr/metadata_tasks.csv new file mode 100644 index 000000000..c40e0d858 --- /dev/null +++ b/subtitles/fr/metadata_tasks.csv @@ -0,0 +1,7 @@ +id,title,link,srt_filename +wVHdVlPScxA,🤗 Tasks: Token Classification,https://www.youtube.com/watch?v=wVHdVlPScxA&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=1,subtitles/fr/tasks_00_🤗-tasks-token-classification.srt +ajPx5LwJD-I,🤗 Tasks: Question Answering,https://www.youtube.com/watch?v=ajPx5LwJD-I&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=2,subtitles/fr/tasks_01_🤗-tasks-question-answering.srt +Vpjb1lu0MDk,🤗 Tasks: Causal Language Modeling,https://www.youtube.com/watch?v=Vpjb1lu0MDk&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=3,subtitles/fr/tasks_02_🤗-tasks-causal-language-modeling.srt +mqElG5QJWUg,🤗 Tasks: Masked Language Modeling,https://www.youtube.com/watch?v=mqElG5QJWUg&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=4,subtitles/fr/tasks_03_🤗-tasks-masked-language-modeling.srt +yHnr5Dk2zCI,🤗 Tasks: Summarization,https://www.youtube.com/watch?v=yHnr5Dk2zCI&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=5,subtitles/fr/tasks_04_🤗-tasks-summarization.srt +1JvfrvZgi6c,🤗 Tasks: 
Translation,https://www.youtube.com/watch?v=1JvfrvZgi6c&list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf&index=6,subtitles/fr/tasks_05_🤗-tasks-translation.srt diff --git "a/subtitles/fr/tasks_00_\360\237\244\227-tasks-token-classification.srt" "b/subtitles/fr/tasks_00_\360\237\244\227-tasks-token-classification.srt" new file mode 100644 index 000000000..7120d4f6e --- /dev/null +++ "b/subtitles/fr/tasks_00_\360\237\244\227-tasks-token-classification.srt" @@ -0,0 +1,116 @@ +1 +00:00:04,520 --> 00:00:07,400 +Bienvenue dans la série de tâches Hugging Face ! + +2 +00:00:07,400 --> 00:00:11,870 +Dans cette vidéo, nous allons examiner la +tâche de classification des jetons. + +3 +00:00:11,870 --> 00:00:17,900 +La classification des jetons consiste à attribuer +une étiquette à chaque jeton dans une phrase. + +4 +00:00:17,900 --> 00:00:23,310 +Il existe diverses tâches de classification de jetons +et les plus courantes sont la reconnaissance d'entités nommées et le balisage de la + +5 +00:00:23,310 --> 00:00:26,430 +partie du discours. + +6 +00:00:26,430 --> 00:00:31,640 +Jetons un coup d'œil à la +tâche de reconnaissance d'entité nommée. + +7 +00:00:31,640 --> 00:00:38,400 +L'objectif de cette tâche est de trouver les entités +dans un morceau de texte, telles qu'une personne, un lieu + +8 +00:00:38,400 --> 00:00:40,210 +ou une organisation. + +9 +00:00:40,210 --> 00:00:45,250 +Cette tâche consiste à étiqueter chaque +jeton avec une classe pour chaque entité et + +10 +00:00:45,250 --> 00:00:51,719 +une autre classe pour les jetons qui n'ont pas d'entité. + +11 +00:00:51,719 --> 00:00:55,670 +Une autre tâche de classification des jetons est le balisage des parties du discours +. + +12 +00:00:55,670 --> 00:01:01,399 +Le but de cette tâche est d'étiqueter les mots +pour une partie particulière d'un discours, comme le + +13 +00:01:01,399 --> 00:01:05,900 +nom, le pronom, l'adjectif, le verbe et ainsi de suite. 
+ +14 +00:01:05,900 --> 00:01:11,270 +Cette tâche consiste à étiqueter chaque +jeton avec des parties du discours. + +15 +00:01:11,270 --> 00:01:19,659 +Les modèles de classification de jetons sont évalués +sur l'exactitude, le rappel, la précision et le score F1. + +16 +00:01:19,659 --> 00:01:22,950 +Les métriques sont calculées pour chacune des +classes. + +17 +00:01:22,950 --> 00:01:28,040 +Nous calculons les vrais positifs, les vrais négatifs +et les faux positifs pour calculer la précision + +18 +00:01:28,040 --> 00:01:31,829 +et le rappel, et prenons leur moyenne harmonique pour +obtenir le F1-Score. + +19 +00:01:31,829 --> 00:01:42,329 +Ensuite, nous le calculons pour chaque classe et prenons +la moyenne globale pour évaluer notre modèle. + +20 +00:01:42,329 --> 00:01:45,680 +Un exemple de jeu de données utilisé pour cette tâche est ConLL2003. + +21 +00:01:45,680 --> 00:01:51,750 +Ici, chaque jeton appartient à une certaine +classe d'entités nommées, désignées par les indices de la + +22 +00:01:51,750 --> 00:01:55,380 +liste contenant les étiquettes. + +23 +00:01:55,380 --> 00:02:00,720 +Vous pouvez extraire des informations importantes des +factures à l'aide de modèles de reconnaissance d'entités nommées, + +24 +00:02:00,720 --> 00:02:07,070 +telles que la date, le nom de l'organisation ou l'adresse. + +25 +00:02:07,070 --> 00:02:16,840 +Pour plus d'informations sur la tâche de classification des jetons +, consultez le cours Hugging Face. diff --git "a/subtitles/fr/tasks_01_\360\237\244\227-tasks-question-answering.srt" "b/subtitles/fr/tasks_01_\360\237\244\227-tasks-question-answering.srt" new file mode 100644 index 000000000..19ee1a8b8 --- /dev/null +++ "b/subtitles/fr/tasks_01_\360\237\244\227-tasks-question-answering.srt" @@ -0,0 +1,87 @@ +1 +00:00:04,400 --> 00:00:06,480 +Bienvenue dans la série de tâches Hugging Face. + +2 +00:00:07,200 --> 00:00:10,080 +Dans cette vidéo, nous allons examiner +la tâche de réponse aux questions. 
+ +3 +00:00:13,120 --> 00:00:17,200 +La réponse aux questions consiste à +extraire une réponse dans un document donné. + +4 +00:00:21,120 --> 00:00:25,600 +Les modèles de réponse aux questions prennent un contexte, +qui est le document dans lequel vous souhaitez effectuer une recherche, + +5 +00:00:26,240 --> 00:00:31,440 +et une question et renvoient une réponse. +Notez que la réponse n'est pas générée, + +6 +00:00:31,440 --> 00:00:37,600 +mais extraite du contexte. Ce type +de réponse aux questions est appelé extractif. + +7 +00:00:42,320 --> 00:00:46,960 +La tâche est évaluée sur deux +statistiques, la correspondance exacte et le score F1. + +8 +00:00:49,680 --> 00:00:52,320 +Comme son nom l'indique, la correspondance exacte recherche une + +9 +00:00:52,320 --> 00:00:57,840 +correspondance exacte entre la +réponse prédite et la bonne réponse. + +10 +00:01:00,080 --> 00:01:05,520 +Une métrique couramment utilisée est le F1-Score, qui +est calculé sur des jetons prédits + +11 +00:01:05,520 --> 00:01:10,960 +correctement et incorrectement. Il est calculé +sur la moyenne de deux métriques appelées + +12 +00:01:10,960 --> 00:01:16,560 +précision et rappel, qui sont des métriques +largement utilisées dans les problèmes de classification. + +13 +00:01:20,880 --> 00:01:28,240 +Un exemple d'ensemble de données utilisé pour cette tâche est appelé +SQuAD. Cet ensemble de données contient des contextes, des questions + +14 +00:01:28,240 --> 00:01:32,080 +et les réponses obtenues à +partir d'articles de Wikipédia en anglais. + +15 +00:01:35,440 --> 00:01:39,520 +Vous pouvez utiliser des modèles de questions-réponses pour +répondre automatiquement aux questions posées + +16 +00:01:39,520 --> 00:01:46,480 +par vos clients. Vous avez simplement besoin d'un document +contenant des informations sur votre entreprise + +17 +00:01:47,200 --> 00:01:53,840 +et interrogez ce document avec +les questions posées par vos clients. 
+ +18 +00:01:55,680 --> 00:02:06,160 +Pour plus d'informations sur la tâche Question Answering +, consultez le cours Hugging Face. diff --git "a/subtitles/fr/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" "b/subtitles/fr/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" new file mode 100644 index 000000000..f2a509484 --- /dev/null +++ "b/subtitles/fr/tasks_02_\360\237\244\227-tasks-causal-language-modeling.srt" @@ -0,0 +1,63 @@ +1 +00:00:04,560 --> 00:00:06,640 +Bienvenue dans la série de tâches Hugging Face ! + +2 +00:00:07,200 --> 00:00:10,400 +Dans cette vidéo, nous allons jeter un œil +à la modélisation du langage causal. + +3 +00:00:13,600 --> 00:00:16,880 +La modélisation du langage causal consiste à +prédire le + +4 +00:00:16,880 --> 00:00:21,920 +mot suivant dans une phrase, compte tenu de tous les +mots précédents. Cette tâche est très + +5 +00:00:21,920 --> 00:00:29,920 +similaire à la fonction de correction automatique +que vous pourriez avoir sur votre téléphone. + +6 +00:00:29,920 --> 00:00:34,720 +Ces modèles prennent une séquence à +compléter et génèrent la séquence complète. + +7 +00:00:38,640 --> 00:00:44,160 +Les statistiques de classification ne peuvent pas être utilisées, car il n'y a +pas de réponse correcte unique pour l'achèvement. + +8 +00:00:44,960 --> 00:00:49,280 +Au lieu de cela, nous évaluons la distribution +du texte complété par le modèle. + +9 +00:00:50,800 --> 00:00:55,440 +Une mesure courante pour ce faire est la +perte d'entropie croisée. La perplexité est + +10 +00:00:55,440 --> 00:01:01,280 +également une mesure largement utilisée et elle est calculée +comme l'exponentielle de la perte d'entropie croisée. + +11 +00:01:05,200 --> 00:01:11,840 +Vous pouvez utiliser n'importe quel ensemble de données avec du texte brut +et segmenter le texte pour préparer les données. + +12 +00:01:15,040 --> 00:01:18,240 +Les modèles de langage causal peuvent +être utilisés pour générer du code. 
+
+13
+00:01:22,480 --> 00:01:33,200
+Pour plus d'informations sur la tâche de
+modélisation du langage causal, consultez le cours Hugging Face.
diff --git "a/subtitles/fr/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt" "b/subtitles/fr/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt"
new file mode 100644
index 000000000..47686d9b0
--- /dev/null
+++ "b/subtitles/fr/tasks_03_\360\237\244\227-tasks-masked-language-modeling.srt"
@@ -0,0 +1,85 @@
+1
+00:00:04,660 --> 00:00:07,589
+Bienvenue dans la série de tâches Hugging Face !
+
+2
+00:00:07,589 --> 00:00:13,730
+Dans cette vidéo, nous allons jeter un œil à la
+modélisation du langage masqué.
+
+3
+00:00:13,730 --> 00:00:20,720
+La modélisation du langage masqué consiste à prédire
+quels mots doivent remplir les blancs d'une
+
+4
+00:00:20,720 --> 00:00:23,500
+phrase.
+
+5
+00:00:23,500 --> 00:00:32,870
+Ces modèles prennent un texte masqué en entrée
+et génèrent les valeurs possibles pour ce masque.
+
+6
+00:00:32,870 --> 00:00:37,550
+La modélisation du langage masqué est utile avant d'affiner
+votre modèle pour votre tâche.
+
+7
+00:00:37,550 --> 00:00:43,579
+Par exemple, si vous devez utiliser un modèle dans
+un domaine spécifique, par exemple des documents biomédicaux, des
+
+8
+00:00:43,579 --> 00:00:49,050
+modèles comme BERT traiteront vos mots spécifiques à un domaine
+comme des jetons rares.
+
+9
+00:00:49,050 --> 00:00:54,220
+Si vous entraînez un modèle de langage masqué à l'aide de
+votre corpus biomédical, puis affinez
+
+10
+00:00:54,220 --> 00:01:02,929
+votre modèle sur une tâche en aval, vous
+obtiendrez de meilleures performances.
+
+11
+00:01:02,929 --> 00:01:07,799
+Les métriques de classification ne peuvent pas être utilisées car
+il n'y a pas de réponse correcte unique aux
+
+12
+00:01:07,799 --> 00:01:08,799
+valeurs de masque.
+
+13
+00:01:08,799 --> 00:01:12,900
+Au lieu de cela, nous évaluons la distribution des
+valeurs de masque.
+
+14
+00:01:12,900 --> 00:01:16,590
+Une métrique courante pour ce faire est la
+perte d'entropie croisée.
+
+15
+00:01:16,590 --> 00:01:22,010
+La perplexité est également une métrique largement utilisée et
+elle est calculée comme l'exponentielle de la
+
+16
+00:01:22,010 --> 00:01:27,240
+perte d'entropie croisée.
+
+17
+00:01:27,240 --> 00:01:35,680
+Vous pouvez utiliser n'importe quel jeu de données avec du texte brut et
+segmenter le texte pour masquer les données.
+
+18
+00:01:35,680 --> 00:01:44,710
+Pour plus d'informations sur la
+modélisation du langage masqué, consultez le cours Hugging Face.
diff --git "a/subtitles/fr/tasks_04_\360\237\244\227-tasks-summarization.srt" "b/subtitles/fr/tasks_04_\360\237\244\227-tasks-summarization.srt"
new file mode 100644
index 000000000..a8bc6e3bd
--- /dev/null
+++ "b/subtitles/fr/tasks_04_\360\237\244\227-tasks-summarization.srt"
@@ -0,0 +1,68 @@
+1
+00:00:04,560 --> 00:00:06,640
+Bienvenue dans la série de tâches Hugging Face.
+
+2
+00:00:07,280 --> 00:00:10,720
+Dans cette vidéo, nous allons
+examiner la tâche de synthèse de texte.
+
+3
+00:00:13,680 --> 00:00:16,480
+Le résumé consiste à
+produire une version plus courte
+
+4
+00:00:16,480 --> 00:00:21,600
+d'un document tout en préservant les informations pertinentes
+et importantes du document.
+
+5
+00:00:25,040 --> 00:00:29,840
+Les modèles de synthèse prennent un document à
+résumer et génèrent le texte résumé.
+
+6
+00:00:33,360 --> 00:00:40,240
+Cette tâche est évaluée sur le score ROUGE. Il est
+basé sur le chevauchement entre la séquence produite
+
+7
+00:00:40,240 --> 00:00:48,000
+et la séquence correcte.
+Vous pouvez voir cela comme ROUGE-1,
+
+8
+00:00:48,000 --> 00:00:55,600
+qui est le chevauchement de jetons uniques et ROUGE-2,
+le chevauchement des paires de jetons consécutifs. ROUGE-N
+
+9
+00:00:55,600 --> 00:01:02,960
+fait référence au chevauchement de n jetons consécutifs.
+Ici, nous voyons un exemple de la façon dont les chevauchements ont lieu.
+
+10
+00:01:06,160 --> 00:01:11,280
+Un exemple d'ensemble de données utilisé pour cette tâche
+s'appelle Extreme Summarization, XSUM. Cet
+
+11
+00:01:11,280 --> 00:01:14,480
+ensemble de données contient des textes et
+leurs versions résumées.
+
+12
+00:01:17,680 --> 00:01:21,280
+Vous pouvez utiliser des modèles
+de synthèse pour résumer les articles de recherche, ce
+
+13
+00:01:21,280 --> 00:01:25,680
+qui permettrait aux chercheurs de choisir facilement des
+articles pour leur liste de lecture.
+
+14
+00:01:29,040 --> 00:01:39,520
+Pour plus d'informations sur la
+tâche de synthèse, consultez le cours Hugging Face.
diff --git "a/subtitles/fr/tasks_05_\360\237\244\227-tasks-translation.srt" "b/subtitles/fr/tasks_05_\360\237\244\227-tasks-translation.srt"
new file mode 100644
index 000000000..7473cadd6
--- /dev/null
+++ "b/subtitles/fr/tasks_05_\360\237\244\227-tasks-translation.srt"
@@ -0,0 +1,96 @@
+1
+00:00:04,569 --> 00:00:07,529
+Bienvenue dans la série de tâches Hugging Face.
+
+2
+00:00:07,529 --> 00:00:11,840
+Dans cette vidéo, nous allons jeter un œil à la
+tâche de traduction.
+
+3
+00:00:11,840 --> 00:00:19,420
+La traduction est la tâche de traduire un texte
+d'une langue à une autre.
+
+4
+00:00:19,420 --> 00:00:24,420
+Ces modèles prennent un texte dans la langue source
+et génèrent la traduction de ce texte dans
+
+5
+00:00:24,420 --> 00:00:28,609
+la langue cible.
+
+6
+00:00:28,609 --> 00:00:31,619
+La tâche est évaluée sur le score BLEU.
+
+7
+00:00:31,619 --> 00:00:38,430
+Le score varie de 0 à 1, où 1 signifie que
+la traduction correspond parfaitement et 0 ne
+
+8
+00:00:38,430 --> 00:00:40,110
+correspond pas du tout.
+
+9
+00:00:40,110 --> 00:00:45,320
+BLEU est calculé sur des jetons consécutifs
+appelés n-grammes.
+
+10
+00:00:45,320 --> 00:00:51,629
+Un unigramme fait référence à un seul jeton tandis qu'un bigramme
+fait référence à des paires de jetons et un n-gramme fait référence à
+
+11
+00:00:51,629 --> 00:00:56,219
+n jetons consécutifs.
+
+12
+00:00:56,219 --> 00:01:01,859
+Les ensembles de données de traduction automatique contiennent des paires
+de texte dans une langue et la traduction du
+
+13
+00:01:01,859 --> 00:01:05,910
+texte dans une autre langue.
+
+14
+00:01:05,910 --> 00:01:11,290
+Ces modèles peuvent vous aider à créer des
+agents conversationnels dans différentes langues.
+
+15
+00:01:11,290 --> 00:01:16,110
+Une option consiste à traduire les données d'entraînement
+utilisées pour le chatbot et à entraîner un
+
+16
+00:01:16,110 --> 00:01:19,970
+chatbot séparé.
+
+17
+00:01:19,970 --> 00:01:24,950
+Vous pouvez mettre un modèle de traduction de
+la langue de votre utilisateur vers la langue dans laquelle votre chatbot
+
+18
+00:01:24,950 --> 00:01:31,360
+est entraîné, traduire les entrées de l'utilisateur et
+effectuer une classification d'intention, prendre la sortie
+
+19
+00:01:31,360 --> 00:01:39,399
+du chatbot et la traduire de la langue dans laquelle
+votre chatbot a été entraîné vers la
+
+20
+00:01:39,399 --> 00:01:40,850
+langue de l'utilisateur.
+
+21
+00:01:40,850 --> 00:01:49,720
+Pour plus d'informations sur la
+tâche de traduction, consultez le cours Hugging Face.
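Every `.srt` file added above follows the same block layout: a numeric index, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then one or two text lines, with blocks separated by blank lines. A minimal sketch of a validator for that layout — not part of this repo; the `parse_srt` helper is hypothetical — could look like:

```python
import re

# Timing line of every SRT block: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")

def parse_srt(text: str) -> list[tuple[int, str, str]]:
    """Split an SRT document into (index, timing, text) tuples,
    raising ValueError on any malformed block."""
    blocks = []
    for raw in text.strip().split("\n\n"):
        lines = raw.splitlines()
        if len(lines) < 3 or not lines[0].isdigit() or not TIMING.match(lines[1]):
            raise ValueError(f"malformed SRT block: {raw!r}")
        blocks.append((int(lines[0]), lines[1], "\n".join(lines[2:])))
    return blocks

# Two blocks taken from the causal language modeling file above
sample = (
    "1\n00:00:04,560 --> 00:00:06,640\n"
    "Bienvenue dans la série de tâches Hugging Face !\n\n"
    "2\n00:00:07,200 --> 00:00:10,400\n"
    "Dans cette vidéo, nous allons jeter un œil\n"
    "à la modélisation du langage causal."
)
for index, timing, text in parse_srt(sample):
    print(index, timing, text.replace("\n", " / "))
```

Running such a check over `subtitles/fr/` before committing would catch truncated blocks or broken timestamps early.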
diff --git a/utils/generate_subtitles.py b/utils/generate_subtitles.py index 31dccbe2c..68592830f 100644 --- a/utils/generate_subtitles.py +++ b/utils/generate_subtitles.py @@ -1,31 +1,45 @@ import pandas as pd -from tqdm.auto import tqdm from youtube_transcript_api import YouTubeTranscriptApi from youtube_transcript_api.formatters import SRTFormatter from youtubesearchpython import Playlist from pathlib import Path import argparse -import sys +COURSE_VIDEOS_PLAYLIST = "https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o" +TASK_VIDEOS_PLAYLIST = "https://youtube.com/playlist?list=PLo2EIpI_JMQtyEr-sLJSy5_SnLCb4vtQf" +# These videos are not part of the course, but are part of the task playlist +TASK_VIDEOS_TO_SKIP = ["tjAIM7BOYhw", "WdAeKSOpxhw", "KWwzcmG98Ds", "TksaY_FDgnk", "leNG9fN9FQU", "dKE8SIt9C-w"] -def generate_subtitles(language: str, youtube_language_code: str = None): + +def generate_subtitles(language: str, youtube_language_code: str = None, is_task_playlist: bool = False): metadata = [] formatter = SRTFormatter() path = Path(f"subtitles/{language}") path.mkdir(parents=True, exist_ok=True) - playlist_videos = Playlist.getVideos("https://youtube.com/playlist?list=PLo2EIpI_JMQvWfQndUesu0nPBAtZ9gP1o") + if is_task_playlist: + playlist_videos = Playlist.getVideos(TASK_VIDEOS_PLAYLIST) + else: + playlist_videos = Playlist.getVideos(COURSE_VIDEOS_PLAYLIST) for idx, video in enumerate(playlist_videos["videos"]): video_id = video["id"] title = video["title"] title_formatted = title.lower().replace(" ", "-").replace(":", "").replace("?", "") id_str = f"{idx}".zfill(2) - srt_filename = f"subtitles/{language}/{id_str}_{title_formatted}.srt" + + if is_task_playlist: + srt_filename = f"{path}/tasks_{id_str}_{title_formatted}.srt" + else: + srt_filename = f"{path}/{id_str}_{title_formatted}.srt" # Skip course events if "Event Day" in title: continue + # Skip task videos that don't belong to the course + if video_id in TASK_VIDEOS_TO_SKIP: + continue + 
# Get transcript transcript_list = YouTubeTranscriptApi.list_transcripts(video_id) english_transcript = transcript_list.find_transcript(language_codes=["en", "en-US"]) @@ -51,10 +65,14 @@ def generate_subtitles(language: str, youtube_language_code: str = None): f.write("No transcript found for this video!") metadata.append({"id": video_id, "title": title, "link": video["link"], "srt_filename": srt_filename}) - break df = pd.DataFrame(metadata) - df.to_csv(f"subtitles/{language}/metadata.csv", index=False) + + if is_task_playlist: + df.to_csv(f"{path}/metadata_tasks.csv", index=False) + else: + df.to_csv(f"{path}/metadata.csv", index=False) + if __name__ == "__main__": @@ -62,5 +80,6 @@ def generate_subtitles(language: str, youtube_language_code: str = None): parser.add_argument("--language", type=str, help="Language to generate subtitles for") parser.add_argument("--youtube_language_code", type=str, help="YouTube language code") args = parser.parse_args() - generate_subtitles(args.language, args.youtube_language_code) + generate_subtitles(args.language, args.youtube_language_code, is_task_playlist=False) + generate_subtitles(args.language, args.youtube_language_code, is_task_playlist=True) print(f"All done! Subtitles stored at subtitles/{args.language}")
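The filenames recorded in `metadata_tasks.csv` come from the slug logic inside `generate_subtitles`: lowercase the title, replace spaces with hyphens, strip `:` and `?`, and zero-pad the playlist index. Pulled out on its own — the helper name `build_srt_filename` is ours, not part of the script — it can be sketched and checked against the CSV rows above:

```python
def build_srt_filename(language: str, idx: int, title: str, is_task_playlist: bool = False) -> str:
    # Same slug rules as generate_subtitles(): lowercase the title,
    # turn spaces into hyphens, drop ":" and "?", zero-pad the index.
    title_formatted = title.lower().replace(" ", "-").replace(":", "").replace("?", "")
    id_str = f"{idx}".zfill(2)
    prefix = "tasks_" if is_task_playlist else ""
    return f"subtitles/{language}/{prefix}{id_str}_{title_formatted}.srt"

print(build_srt_filename("en", 0, "🤗 Tasks: Token Classification", is_task_playlist=True))
# → subtitles/en/tasks_00_🤗-tasks-token-classification.srt
```

Note that the emoji survives untouched, which is why the committed files carry `🤗` in their names (and why git quotes them as `\360\237\244\227` in the diff headers above).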