This repository contains a Fondant pipeline to load and filter the fondant-cc-25m dataset. The dataset contains more than 25 million images with a Creative Commons license, extracted from CommonCrawl.
You can either use the notebook to interactively build the pipeline, or follow along with the README below to use the CLI.
The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable components to load an image dataset from the Hugging Face Hub and download all images.

Pipeline Steps:
- Load from Hugging Face Hub: The pipeline begins by loading the image dataset from the Hugging Face Hub.
- Download Images: The download images component downloads the images and stores them in Parquet format.
- Filter Images: The filter images component filters the images based on their resolution.
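To make the filtering step more concrete, here is a minimal sketch of what a resolution filter can look like in plain Python. It is not the code of the Fondant component itself: the `ImageRecord` type, the `filter_by_resolution` function, and the `min_side` and `max_aspect_ratio` thresholds are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ImageRecord:
    """Hypothetical record describing one downloaded image."""
    url: str
    width: int
    height: int


def filter_by_resolution(
    records: Iterable[ImageRecord],
    min_side: int = 200,            # assumed threshold, not taken from the pipeline
    max_aspect_ratio: float = 3.0,  # assumed threshold, not taken from the pipeline
) -> List[ImageRecord]:
    """Keep images whose shortest side and aspect ratio are within bounds."""
    kept = []
    for record in records:
        short_side = min(record.width, record.height)
        long_side = max(record.width, record.height)
        # Drop images with invalid or too small dimensions.
        if short_side <= 0 or short_side < min_side:
            continue
        # Drop images that are too elongated.
        if long_side / short_side > max_aspect_ratio:
            continue
        kept.append(record)
    return kept


sample = [
    ImageRecord("https://example.com/a.jpg", 640, 480),   # kept
    ImageRecord("https://example.com/b.jpg", 100, 100),   # too small
    ImageRecord("https://example.com/c.jpg", 2000, 300),  # too elongated
]
filtered = filter_by_resolution(sample)
print([record.url for record in filtered])
```

In the actual pipeline this logic runs inside a reusable component over the downloaded data, so the thresholds would be passed as component arguments rather than hard-coded.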
Following the getting started documentation, you can go to the src folder and run the pipeline with the LocalRunner as follows:
fondant run local pipeline.py
Note: The 'load_from_hub' component accepts an argument that defines how many rows of the dataset to load. To load more images from the Hugging Face Hub, increase the value on this line:
"n_rows_to_load": 1000
After the pipeline has run successfully, you can explore the data with the Fondant data explorer:
fondant explore --base_path ./data-dir