Merge pull request #515 from tyisme614/review_ep35
docs(zh-cn): Reviewed 35_loading-a-custom-dataset.srt
xianbaoqian authored Apr 11, 2023
2 parents 5b077d8 + 3d45b8b commit f631679
Showing 1 changed file with 35 additions and 35 deletions.
70 changes: 35 additions & 35 deletions subtitles/zh-CN/35_loading-a-custom-dataset.srt
@@ -20,8 +20,8 @@

5
00:00:08,430 --> 00:00:09,750
尽管 Hugging Face Hub 主持
Although the Hugging Face Hub hosts
尽管 Hugging Face Hub 上承载了
Although the HuggingFace Hub hosts

6
00:00:09,750 --> 00:00:11,730
@@ -30,27 +30,27 @@ over a thousand public datasets,

7
00:00:11,730 --> 00:00:12,930
你经常需要处理数据
你可能仍然需要经常处理存储在你的笔记本电脑
you'll often need to work with data

8
00:00:12,930 --> 00:00:15,900
存储在你的笔记本电脑或某些远程服务器上
或存储在远程服务器上的数据
that is stored on your laptop or some remote server.

9
00:00:15,900 --> 00:00:18,060
在本视频中,我们将探讨数据集库如何
在本视频中,我们将探讨如何利用 Datasets 库
In this video, we'll explore how the Datasets library

10
00:00:18,060 --> 00:00:20,310
可用于加载不可用的数据集
加载 Hugging Face Hub 以外
can be used to load datasets that aren't available

11
00:00:20,310 --> 00:00:21,510
在 Hugging Face Hub 上
的数据集
on the Hugging Face Hub.

12
Expand All @@ -75,22 +75,22 @@ To load a dataset in one of these formats,

16
00:00:31,200 --> 00:00:32,730
你只需要提供格式的名称
你只需要向 load_dataset 函数
you just need to provide the name of the format

17
00:00:32,730 --> 00:00:34,350
到 load_dataset 函数
提供格式的名称
to the load_dataset function,

18
00:00:34,350 --> 00:00:35,790
连同 data_files 参数
并且连同 data_files 参数一起传入
along with a data_files argument

19
00:00:35,790 --> 00:00:37,610
指向一个或多个文件路径或 URL。
该参数指向一个或多个文件路径或 URL。
that points to one or more filepaths or URLs.
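
For reference, the loading pattern these cues describe is roughly the following; the file name is a placeholder, not one from the video:

```python
from datasets import load_dataset

# Name the format ("csv", "text", "json", ...) and point data_files
# at one or more file paths or URLs. "my_data.csv" is illustrative.
dataset = load_dataset("csv", data_files="my_data.csv")
```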

20
@@ -105,7 +105,7 @@ In this example, we first download a dataset

22
00:00:45,960 --> 00:00:48,963
关于来自 UCI 机器学习库的葡萄酒质量
该数据集是来自 UCI 机器学习库的葡萄酒质量数据
about wine quality from the UCI machine learning repository.

23
@@ -150,7 +150,7 @@ so here we've also specified

31
00:01:06,750 --> 00:01:09,030
分隔符是分号
分号作为分隔符
that the separator is a semi-colon.

32
@@ -165,7 +165,7 @@ is loaded automatically as a DatasetDict object,

34
00:01:13,020 --> 00:01:15,920
CSV 文件中的每一列都表示为一个特征
CSV 文件中的每一列都代表一个特征
with each column in the CSV file represented as a feature.
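
A sketch of the wine-quality call these cues describe; the UCI URL below is assumed rather than copied from the video:

```python
from datasets import load_dataset

# This CSV uses semicolons as delimiters, hence sep=";".
# The URL is an assumed stand-in for the file used in the video.
url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-white.csv"
)
wine = load_dataset("csv", data_files=url, sep=";")
print(wine["train"].features)  # each CSV column shows up as a feature
```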

35
@@ -175,7 +175,7 @@ If your dataset is located on some remote server like GitHub

36
00:01:20,280 --> 00:01:22,050
或其他一些存储库
或其他一些数据仓库
or some other repository,

37
@@ -205,12 +205,12 @@ This format is quite common in NLP,

42
00:01:35,100 --> 00:01:36,750
你通常会找到书籍和戏剧
你常常会发现书籍和戏剧
and you'll typically find books and plays

43
00:01:36,750 --> 00:01:39,393
只是一个包含原始文本的文件
只是一个包含原始文本的独立文件
are just a single file with raw text inside.

44
@@ -220,7 +220,7 @@ In this example, we have a text file of Shakespeare plays

45
00:01:43,020 --> 00:01:45,330
存储在 GitHub 存储库中
存储在 GitHub 仓库中
that's stored on a GitHub repository.

46
@@ -245,12 +245,12 @@ As you can see, these files are processed line-by-line,

50
00:01:55,110 --> 00:01:57,690
所以原始文本中的空行也被表示
所以原始文本中的空行
so empty lines in the raw text are also represented

51
00:01:57,690 --> 00:01:58,953
作为数据集中的一行
也按照数据集中的一行表示
as a row in the dataset.
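
A sketch of raw-text loading as described above; the Shakespeare URL is illustrative, not necessarily the file shown on screen:

```python
from datasets import load_dataset

# Text files are read line by line: every line, empty ones
# included, becomes a row with a single "text" feature.
url = (
    "https://raw.githubusercontent.com/karpathy/char-rnn/"
    "master/data/tinyshakespeare/input.txt"
)
plays = load_dataset("text", data_files=url)
print(plays["train"][0])  # {'text': <first line of the file>}
```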

52
@@ -270,12 +270,12 @@ where every row in the file is a separate JSON object.

55
00:02:09,510 --> 00:02:11,100
对于这些文件,你可以加载数据集
对于这些文件,你可以通过选择 JSON 加载脚本
For these files, you can load the dataset

56
00:02:11,100 --> 00:02:13,020
通过选择 JSON 加载脚本
来加载数据集
by selecting the JSON loading script

57
@@ -285,12 +285,12 @@ and pointing the data_files argument to the file or URL.

58
00:02:17,160 --> 00:02:19,410
在这个例子中,我们加载了一个 JSON 行文件
在这个例子中,我们加载了一个多行 JSON 的文件
In this example, we've loaded a JSON Lines file

59
00:02:19,410 --> 00:02:21,710
基于 Stack Exchange 问题和答案。
其内容基于 Stack Exchange 问题和答案。
based on Stack Exchange questions and answers.
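
A sketch of the JSON Lines case; the file name is hypothetical:

```python
from datasets import load_dataset

# In a JSON Lines file every line is a separate JSON object,
# so the "json" script loads each line as one row.
qa = load_dataset("json", data_files="stack_exchange.jsonl")
```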

60
@@ -310,27 +310,27 @@ so the load_dataset function allows you to specify

63
00:02:31,200 --> 00:02:32,733
要加载哪个特定密钥
要加载哪个特定关键词
which specific key to load.

64
00:02:33,630 --> 00:02:35,910
例如,用于问答的 SQuAD 数据集
例如,用于问答的 SQuAD 数据集有它的格式,
For example, the SQuAD dataset for question and answering

65
00:02:35,910 --> 00:02:38,340
有它的格式,我们可以通过指定来加载它
我们可以通过指定我们感兴趣的数据字段
has its format, and we can load it by specifying

66
00:02:38,340 --> 00:02:40,340
我们对数据字段感兴趣
我们对 data 字段感兴趣
that we're interested in the data field.
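
A sketch of loading nested JSON with the field argument; the SQuAD URL is an assumption for illustration:

```python
from datasets import load_dataset

# SQuAD keeps its examples under a top-level "data" key,
# so we tell load_dataset which field to read.
url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
squad = load_dataset("json", data_files=url, field="data")
```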

67
00:02:41,400 --> 00:02:42,780
最后一件事要提
最后要和大家分享的内容是
There is just one last thing to mention

68
@@ -340,7 +340,7 @@ about all of these loading scripts.

69
00:02:44,910 --> 00:02:46,410
你可以有不止一次分裂
你可以有不止一次数据切分
You can have more than one split,

70
@@ -350,7 +350,7 @@ you can load them by treating data files as a dictionary,

71
00:02:49,080 --> 00:02:52,140
并将每个拆分名称映射到其对应的文件
并将每个拆分的名称映射到其对应的文件
and map each split name to its corresponding file.

72
@@ -360,22 +360,22 @@ Everything else stays completely unchanged

73
00:02:53,970 --> 00:02:55,350
你可以看到一个加载的例子
你可以看到一个例子,
and you can see an example of loading

74
00:02:55,350 --> 00:02:58,283
SQuAD 的训练和验证拆分均在此处
加载此 SQuAD 的训练和验证分解步骤都在这里
both the training and validation splits for this SQuAD here.
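
A sketch of mapping split names to files, reusing the assumed SQuAD URLs from above:

```python
from datasets import load_dataset

# Passing data_files as a dict gives one dataset split per key.
data_files = {
    "train": "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json",
    "validation": "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json",
}
squad = load_dataset("json", data_files=data_files, field="data")
```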

75
00:02:59,550 --> 00:03:02,310
这样,你现在可以从笔记本电脑加载数据集
这样,你现在可以加载来自笔记本电脑的数据集,来自 Hugging Face Hub 的数据集
And with that, you can now load datasets from your laptop,

76
00:03:02,310 --> 00:03:04,653
Hugging Face Hub,或任何其他地方
或来自任何其他地方的数据集
the Hugging Face Hub, or anywhere else you want.

77