Commit 1e26793

Docs Revamp: Datasets docs (#146)
Add documentation for Datasets
1 parent 873424a commit 1e26793

9 files changed, +100 -0 lines changed

docs/assets/hub/datasets-gated.png

97.8 KB

docs/assets/hub/datasets-main.png

547 KB

docs/hub/_sections.yml

+3
@@ -9,3 +9,6 @@
 
 - local: spaces-main
   title: Spaces
+
+- local: datasets-main
+  title: Datasets

docs/hub/datasets-adding.md

+17
@@ -0,0 +1,17 @@
1+
---
2+
title: Adding New Datasets
3+
---
4+
5+
<h1>Adding new datasets</h1>
6+
7+
Any Hugging Face user can create a dataset! You can start by [creating your dataset repository](https://huggingface.co/new-dataset) and choosing one of the following methods to upload your dataset:
8+
9+
* [Add files manually to the repository through the UI](https://huggingface.co/docs/datasets/upload_dataset#upload-your-files)
10+
* [Push files with the `push_to_hub` method from 🤗 Datasets](https://huggingface.co/docs/datasets/upload_dataset#upload-from-python)
11+
* [Use Git to commit and push your dataset files](https://huggingface.co/docs/datasets/share#clone-the-repository)
12+
13+
While it's possible to add raw data to your dataset repo in a number of formats (JSON, CSV, Parquet, text, and images), for large datasets you may want to [create a loading script](https://huggingface.co/docs/datasets/dataset_script#create-a-dataset-loading-script). This script defines the different configurations and splits of your dataset, as well as how to download and process the data.
14+
15+
## Datasets outside a namespace
16+
17+
Datasets outside a namespace are maintained by the Hugging Face team on GitHub. Unlike the naming convention used for community datasets (`username/dataset_name` or `org/dataset_name`), datasets outside a namespace can be referenced directly by their name (e.g. [`glue`](https://huggingface.co/datasets/glue)). If you find that an improvement is needed, refer to the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/master/en/share#datasets-on-github-legacy) for an explanation of how to submit a PR on GitHub to propose edits.
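
The naming convention above can be illustrated with a small helper. This is a minimal sketch: `split_repo_id` is a hypothetical function written for this example and is not part of any Hugging Face library.

```python
from typing import Optional, Tuple


def split_repo_id(repo_id: str) -> Tuple[Optional[str], str]:
    """Split a dataset identifier into (namespace, name).

    Hypothetical helper for illustration only. The namespace is None for
    datasets that live outside a namespace, such as `glue`.
    """
    parts = repo_id.split("/")
    if len(parts) == 1 and parts[0]:
        # No slash: the dataset is referenced directly by its name.
        return None, parts[0]
    if len(parts) == 2 and all(parts):
        # One slash: `username/dataset_name` or `org/dataset_name`.
        return parts[0], parts[1]
    raise ValueError(f"Not a valid dataset identifier: {repo_id!r}")


print(split_repo_id("glue"))                   # (None, 'glue')
print(split_repo_id("username/dataset_name"))  # ('username', 'dataset_name')
```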

docs/hub/datasets-cards.md

+11
@@ -0,0 +1,11 @@
1+
---
2+
title: Dataset Cards
3+
---
4+
5+
<h1>Dataset Cards</h1>
6+
7+
Each dataset may be documented by the `README.md` file in the repository. This file is called a **dataset card**, and the Hugging Face Hub will render its contents on the dataset's main page. To inform users about how to responsibly use the data, it's a good idea to include information about any potential biases within the dataset. Generally, dataset cards help users understand the contents of the dataset and give context for how the dataset should be used.
8+
9+
The [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) gives an overview of what components are found in a dataset card. The process for creating a new dataset card is outlined in the [Create a dataset card](https://huggingface.co/docs/datasets/dataset_card) guide.
10+
11+
Reading through existing dataset cards, such as the [ELI5 dataset card](https://github.com/huggingface/datasets/blob/master/datasets/eli5/README.md), is a great way to familiarize yourself with the common conventions. There is also an [interactive dataset card builder](https://huggingface.co/datasets/card-creator/) that can guide you through creating your card.

docs/hub/datasets-gated.md

+23
@@ -0,0 +1,23 @@
1+
---
2+
title: Gated datasets
3+
---
4+
5+
<h1>Gated datasets</h1>
6+
7+
To give dataset creators more control over how their datasets are used, the Hub allows users to enable **User Access requests** through a dataset's **Settings** tab. Enabling this setting requires users to agree to share their contact information with Hugging Face in order to access the dataset. The contact information is stored in a database, and dataset owners are able to download a copy of the user access report.
8+
9+
The User Access request dialog can be modified to include additional text and checkbox fields in the prompt. To do this, add a YAML section to the dataset's `README.md` file (create one if it does not already exist) and add an `extra_gated_fields` property. Within this property, you can add as many custom fields as you like and specify whether each one is a `text` or `checkbox` field. An `extra_gated_prompt` property can also be included to add a customized text message.
10+
11+
```
---
extra_gated_prompt: "You agree to not attempt to determine the identity of individuals in this dataset"
extra_gated_fields:
  Company: text
  Country: text
  I agree to use this dataset for non-commercial use ONLY: checkbox
---
```
20+
21+
![A gated Dataset showing the User Access request dialog](/docs/assets/hub/datasets-gated.png)
22+
23+
The `README.md` file for a dataset is called a [Dataset Card](./datasets-cards). Visit the documentation to learn more about how to use it and to see the properties that you can configure.
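
If you maintain several gated datasets, the YAML section described above could be generated rather than typed by hand. The following is a minimal sketch: `render_gated_front_matter` and `ALLOWED_FIELD_TYPES` are hypothetical names introduced for this example, the helper only formats text and never contacts the Hub, and it assumes field labels are plain text without `:` characters.

```python
from typing import Dict

# The two field types mentioned in this guide.
ALLOWED_FIELD_TYPES = {"text", "checkbox"}


def render_gated_front_matter(prompt: str, fields: Dict[str, str]) -> str:
    """Build the YAML section for a gated dataset's README.md.

    Hypothetical helper for illustration only.
    """
    for label, kind in fields.items():
        if kind not in ALLOWED_FIELD_TYPES:
            raise ValueError(
                f"Field {label!r} must be 'text' or 'checkbox', got {kind!r}"
            )
    # Escape backslashes and double quotes so the prompt stays a valid
    # double-quoted YAML string.
    escaped = prompt.replace("\\", "\\\\").replace('"', '\\"')
    lines = ["---", f'extra_gated_prompt: "{escaped}"', "extra_gated_fields:"]
    # Nested keys are indented so YAML reads them as children of
    # extra_gated_fields.
    lines += [f"  {label}: {kind}" for label, kind in fields.items()]
    lines.append("---")
    return "\n".join(lines) + "\n"


print(render_gated_front_matter(
    "You agree to not attempt to determine the identity of individuals in this dataset",
    {"Company": "text", "Country": "text"},
))
```

The returned string is meant to be placed at the very top of `README.md`, before the rest of the dataset card.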

docs/hub/datasets-main.md

+16
@@ -0,0 +1,16 @@
1+
---
2+
title: Datasets
3+
---
4+
5+
<h1>Datasets</h1>
6+
7+
The Hugging Face Hub is home to a growing collection of datasets that span a variety of domains and tasks. These docs will guide you through interacting with the datasets on the Hub, uploading new datasets, and using datasets in your projects.
8+
9+
This documentation focuses on the datasets functionality in the Hugging Face Hub rather than the 🤗 Datasets library. For detailed information about 🤗 Datasets, visit the [🤗 Datasets documentation](https://huggingface.co/docs/datasets/index).
10+
## Contents
11+
12+
- [Datasets Overview](./datasets-overview)
13+
- [Dataset Cards](./datasets-cards)
14+
- [Gated Datasets](./datasets-gated)
15+
- [Using Datasets](./datasets-usage)
16+
- [Adding New Datasets](./datasets-adding)

docs/hub/datasets-overview.md

+21
@@ -0,0 +1,21 @@
1+
---
2+
title: Datasets Overview
3+
---
4+
5+
<h1>Datasets Overview</h1>
6+
7+
## Datasets on the Hub
8+
9+
The Hugging Face Hub hosts a [large number of community-curated datasets](https://huggingface.co/datasets) for a diverse range of tasks such as translation, automatic speech recognition, and image classification. Alongside the information contained in the [dataset card](./datasets-cards), many datasets, such as [GLUE](https://huggingface.co/datasets/glue), include a Dataset Preview to showcase the data. There is also a handy [Datasets Viewer](https://huggingface.co/datasets/viewer/), which displays the features of a dataset in addition to the preview.
10+
11+
Each dataset is a [Git repository](./repositories-main), equipped with the necessary scripts to download the data and generate splits for training, evaluation, and testing. For information on how a dataset repository is structured, refer to the [Structure your repository guide](https://huggingface.co/docs/datasets/repository_structure).
12+
13+
## Search for datasets
14+
15+
As with models and Spaces, you can search the Hub for datasets using the search bar in the top navigation or on the [main datasets page](https://huggingface.co/datasets). You can filter your results by language, task, and license to find a dataset that's right for you.
16+
17+
![Datasets search page on the Hugging Face Hub](/docs/assets/hub/datasets-main.png)
18+
19+
## Privacy
20+
21+
Since datasets are repositories, you can [toggle their visibility between private and public](./repositories-best-practices) through the Settings tab. If a dataset is owned by an [organization](TODO), the privacy settings apply to all the members of the organization.

docs/hub/datasets-usage.md

+9
@@ -0,0 +1,9 @@
1+
---
2+
title: Using 🤗 Datasets
3+
---
4+
5+
<h1>Using 🤗 Datasets</h1>
6+
7+
Once you've found an interesting dataset on the Hugging Face Hub, you can load the dataset using 🤗 Datasets. You can click on the **Use in dataset library** button to copy the code to load a dataset. Many datasets on the Hub contain a [loading script](https://huggingface.co/docs/datasets/dataset_script), which allows you to easily [load the dataset when you need it](https://huggingface.co/docs/datasets/load_hub).
8+
9+
Some datasets might not include a loading script, in which case the data is stored directly in the repository in formats such as CSV, JSON, and Parquet. 🤗 Datasets can [load those kinds of datasets](https://huggingface.co/docs/datasets/loading#hugging-face-hub) as well. For more information about using 🤗 Datasets, check out the [tutorials](https://huggingface.co/docs/datasets/tutorial) and [how-to guides](https://huggingface.co/docs/datasets/how_to) available in the 🤗 Datasets documentation.
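
As a rough sketch of how raw data files can be loaded without a loading script, the example below maps a file extension to one of the generic builders (`csv`, `json`, `parquet`, `text`) that `load_dataset` accepts, and passes the file through `data_files`. The names `BUILDER_BY_SUFFIX`, `guess_builder`, and `load_raw_file` are hypothetical helpers written for this example, and the file name `train.csv` is a placeholder.

```python
from pathlib import Path

# Builder names understood by `datasets.load_dataset` for raw data files.
BUILDER_BY_SUFFIX = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".txt": "text",
}


def guess_builder(filename: str) -> str:
    """Pick a builder name from a file extension (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in BUILDER_BY_SUFFIX:
        raise ValueError(f"No known builder for {filename!r}")
    return BUILDER_BY_SUFFIX[suffix]


def load_raw_file(path: str):
    """Load a single raw data file with the datasets library (hypothetical wrapper)."""
    # Imported lazily so that guess_builder works without the library installed.
    from datasets import load_dataset

    return load_dataset(guess_builder(path), data_files=path)


print(guess_builder("train.csv"))  # csv
```

Calling `load_raw_file("train.csv")` requires the `datasets` package to be installed and the file to exist locally.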
