Commit

* disallow video push_to_hub
* docs
* minor

Showing 5 changed files with 188 additions and 1 deletion.

# Create a video dataset

This guide will show you how to create a video dataset with `VideoFolder` and some metadata. This is a no-code solution for quickly creating a video dataset with several thousand videos.

<Tip>

You can control access to your dataset by requiring users to share their contact information first. Check out the [Gated datasets](https://huggingface.co/docs/hub/datasets-gated) guide for more information about how to enable this feature on the Hub.

</Tip>

## VideoFolder

The `VideoFolder` is a dataset builder designed to quickly load a video dataset with several thousand videos without requiring you to write any code.

<Tip>

💡 Take a look at the [Split pattern hierarchy](repository_structure#split-pattern-hierarchy) to learn more about how `VideoFolder` creates dataset splits based on your dataset repository structure.

</Tip>

`VideoFolder` automatically infers the class labels of your dataset based on the directory name. Store your dataset in a directory structure like:

```
folder/train/dog/golden_retriever.mp4
folder/train/dog/german_shepherd.mp4
folder/train/dog/chihuahua.mp4
folder/train/cat/maine_coon.mp4
folder/train/cat/bengal.mp4
folder/train/cat/birman.mp4
```

Then users can load your dataset by specifying `videofolder` in [`load_dataset`] and the directory in `data_dir`:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder")
```

You can also use `videofolder` to load datasets involving multiple splits. To do so, your dataset directory should have the following structure:

```
folder/train/dog/golden_retriever.mp4
folder/train/cat/maine_coon.mp4
folder/test/dog/german_shepherd.mp4
folder/test/cat/bengal.mp4
```
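
With this structure, loading the folder without a `split` argument returns a `DatasetDict` with one entry per split directory. Here is a quick sketch using the same placeholder path as above:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder")
>>> list(dataset.keys())  # one split per top-level directory
['train', 'test']
```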

<Tip warning={true}>

If all video files are contained in a single directory or if they are not on the same level of the directory structure, the `label` column won't be added automatically. If you need it, set `drop_labels=False` explicitly.

</Tip>

If there is additional information you'd like to include about your dataset, like text captions or bounding boxes, add it as a `metadata.csv` file in your folder. This lets you quickly create datasets for different computer vision tasks like text captioning or object detection. You can also use a JSONL file `metadata.jsonl`.

```
folder/train/metadata.csv
folder/train/0001.mp4
folder/train/0002.mp4
folder/train/0003.mp4
```

You can also zip your videos:

```
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
```

Your `metadata.csv` file must have a `file_name` column which links video files with their metadata:

```csv
file_name,additional_feature
0001.mp4,This is a first value of a text feature you added to your videos
0002.mp4,This is a second value of a text feature you added to your videos
0003.mp4,This is a third value of a text feature you added to your videos
```

or using `metadata.jsonl`:

```jsonl
{"file_name": "0001.mp4", "additional_feature": "This is a first value of a text feature you added to your videos"}
{"file_name": "0002.mp4", "additional_feature": "This is a second value of a text feature you added to your videos"}
{"file_name": "0003.mp4", "additional_feature": "This is a third value of a text feature you added to your videos"}
```
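
If your captions or other features already live in a Python object, you can write the `metadata.jsonl` file with a few lines of code. This is only a sketch; the `captions` dictionary below simply reuses the values from the example above:

```py
import json

# Captions keyed by video file name (reusing the example values above)
captions = {
    "0001.mp4": "This is a first value of a text feature you added to your videos",
    "0002.mp4": "This is a second value of a text feature you added to your videos",
    "0003.mp4": "This is a third value of a text feature you added to your videos",
}

with open("folder/train/metadata.jsonl", "w") as f:
    for file_name, caption in captions.items():
        # One JSON object per line, with the required "file_name" column
        f.write(json.dumps({"file_name": file_name, "additional_feature": caption}) + "\n")
```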

<Tip>

If metadata files are present, the inferred labels based on the directory name are dropped by default. To include those labels, set `drop_labels=False` in `load_dataset`.

</Tip>
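
For example, to keep the directory-based labels alongside your metadata columns:

```py
>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", drop_labels=False)
```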

### Video captioning

Video captioning datasets have text describing a video. An example `metadata.csv` may look like:

```csv
file_name,text
0001.mp4,This is a golden retriever playing with a ball
0002.mp4,A german shepherd
0003.mp4,One chihuahua
```

Load the dataset with `VideoFolder`, and it will create a `text` column for the video captions:

```py
>>> dataset = load_dataset("videofolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["text"]
"This is a golden retriever playing with a ball"
```

### Upload dataset to the Hub

Once you've created a dataset, you can share it to the Hub using `huggingface_hub`, for example. Make sure you have the [huggingface_hub](https://huggingface.co/docs/huggingface_hub/index) library installed and that you're logged in to your Hugging Face account (see the [Upload with Python tutorial](upload_dataset#upload-with-python) for more details).

Upload your dataset with `huggingface_hub.HfApi.upload_folder`:

```py
from huggingface_hub import HfApi
api = HfApi()

api.upload_folder(
    folder_path="/path/to/local/dataset",
    repo_id="username/my-cool-dataset",
    repo_type="dataset",
)
```
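
After the upload finishes, you (and anyone with access) should be able to load the dataset straight from the Hub using the same repo id as in the example above:

```py
>>> from datasets import load_dataset

>>> dataset = load_dataset("username/my-cool-dataset", split="train")
```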

## WebDataset

The [WebDataset](https://github.com/webdataset/webdataset) format is based on TAR archives and is suitable for big video datasets.
You can group your videos into TAR archives (e.g. 1GB of videos per TAR archive) and have thousands of TAR archives:

```
folder/train/00000.tar
folder/train/00001.tar
folder/train/00002.tar
...
```

In the archives, each example is made of files sharing the same prefix:

```
e39871fd9fd74f55.mp4
e39871fd9fd74f55.json
f18b91585c4d3f3e.mp4
f18b91585c4d3f3e.json
ede6e66b2fb59aab.mp4
ede6e66b2fb59aab.json
ed600d57fcee4f94.mp4
ed600d57fcee4f94.json
...
```
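
For illustration, here is one way you could pack such an archive yourself with Python's standard `tarfile` module; the local `videos/` directory holding the `.mp4`/`.json` pairs is hypothetical:

```py
import tarfile
from pathlib import Path

video_dir = Path("videos")  # hypothetical folder with <prefix>.mp4 and <prefix>.json pairs

with tarfile.open("folder/train/00000.tar", "w") as tar:
    for video_path in sorted(video_dir.glob("*.mp4")):
        metadata_path = video_path.with_suffix(".json")
        # Files sharing the same prefix end up in the same example
        tar.add(video_path, arcname=video_path.name)
        tar.add(metadata_path, arcname=metadata_path.name)
```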

You can provide labels, captions, or other features for your videos using JSON or text files, for example.

For more details on the WebDataset format and the Python library, please check the [WebDataset documentation](https://webdataset.github.io/webdataset).

Load your WebDataset and it will create one column per file suffix (here "mp4" and "json"):

```python
>>> from datasets import load_dataset

>>> dataset = load_dataset("webdataset", data_dir="/path/to/folder", split="train")
>>> dataset[0]["json"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
```