Creative Commons licensed data pipeline

Introduction

This repository contains a Fondant pipeline to load and filter the fondant-cc-25m dataset. This dataset contains more than 25 million images with a Creative Commons license, extracted from CommonCrawl.

You can either use the notebook to interactively build the pipeline, or follow along with the README below to use the CLI.

Pipeline overview

The primary goal of this sample is to showcase how you can use a Fondant pipeline and reusable components to load an image dataset from Hugging Face Hub and download all images.

Pipeline steps:

Load from Hugging Face Hub: The pipeline begins by loading the image dataset from Hugging Face Hub.
Download Images: The download images component downloads the images and stores them in Parquet format.
Filter Images: The filter images component filters the images based on their resolution.
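To make the last step concrete, here is a minimal sketch of a resolution filter in plain Python. It is not the code of the Fondant component: the `ImageRecord` type, the `filter_by_resolution` function, the thresholds, and the example URLs are all hypothetical and only illustrate the idea of keeping images that meet a minimum width and height.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical record: an image URL with its pixel dimensions."""
    url: str
    width: int
    height: int


def filter_by_resolution(records, min_width, min_height):
    """Keep only records that meet both minimum dimensions (assumed rule)."""
    return [
        r for r in records
        if r.width >= min_width and r.height >= min_height
    ]


# Made-up sample data, for illustration only.
records = [
    ImageRecord("https://example.com/a.jpg", 1024, 768),
    ImageRecord("https://example.com/b.jpg", 200, 150),
    ImageRecord("https://example.com/c.jpg", 640, 480),
]

kept = filter_by_resolution(records, min_width=512, min_height=384)
print([r.url for r in kept])
```

With these sample values, only the second image falls below the thresholds and is dropped.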

Running the sample pipeline and exploring the data

As described in the getting started documentation, you can go to the src folder and run the pipeline with the LocalRunner as follows:

fondant run local pipeline.py

Note: The 'load_from_hub' component accepts an argument that sets the number of rows to load. To load more images from Hugging Face, change this line: "n_rows_to_load": 1000
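As a rough illustration of what such a row limit does, the sketch below caps a stream of records at `n_rows_to_load`. This is plain Python, not Fondant's implementation: the `load_rows` helper and the generated example records are hypothetical, and only the argument name and its value of 1000 come from the note above.

```python
from itertools import islice


def load_rows(source, n_rows_to_load=None):
    """Return at most n_rows_to_load rows; None means no limit (assumed)."""
    return list(islice(source, n_rows_to_load))


# Argument dictionary mirroring the line quoted in the note.
arguments = {"n_rows_to_load": 1000}

# Made-up stand-in for a much larger dataset.
dataset = ({"image_url": f"https://example.com/{i}.jpg"} for i in range(25_000))

rows = load_rows(dataset, arguments["n_rows_to_load"])
print(len(rows))
```

Raising the value loads more rows, at the cost of a longer download step.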

After the pipeline has succeeded, you can explore the data with the Fondant data explorer:

fondant explore --base_path ./data-dir

ml6team/fondant-usecase-filter-creative-commons

Folders and files

Latest commit

History

Repository files navigation

Creative commons licensed data pipeline

Introduction

Pipeline overview

Running the sample pipeline and explore the data

About

Topics

Resources

Stars

Watchers

Forks

Contributors 3

Languages