Vision Data Curation (VDC)

A lightweight framework for cleaning, filtering, and sampling large-scale image datasets. Built for computer vision researchers and practitioners who want higher-quality data with less manual effort.

At its core, VDC embraces a data-centric AI philosophy, aiming to enhance the quality and diversity of your datasets with minimal manual intervention. By providing a structured, iterative pipeline, VDC helps researchers and practitioners build higher-performing models by starting with better data.

Features

VDC provides modular tools for dataset cleanup:

Input validation - detect corrupt files, invalid formats, low resolution, or extreme aspect ratios
Rotation correction - correct 90°/180°/270° orientation errors
Example-based filtering - remove images similar to a set of unwanted examples
Image Quality Filtering - remove images based on aesthetic score or NSFW classification
Duplicate removal - identify and remove near-duplicate images from your dataset
Hierarchical K-Means sampling - select diverse, representative subsets from large datasets

The Curation Pipeline

flowchart LR
    A[Raw<br/>Dataset] --> V[Validation]
    V --> R[Rotation]
    R --> D[Dedup]
    D --> E[Example<br/>Filter]
    E --> Q[Quality Filter<br/>Aesthetic/NSFW]
    Q --> S[Cluster-based<br/>Sampling]
    S --> F[Curated<br/>Dataset]

    U[Unwanted<br/>Examples] --> E

Installation

From PyPI

pip install vision-data-curation

From Source

git clone https://gitlab.com/birder/vision-data-curation.git
cd vision-data-curation
pip install -e .

Developing directly from the project root allows for script and configuration execution as if fully installed.

Usage

Each step is a script under vdc.scripts.

Examples:

# Remove corrupt/invalid images
python -m vdc.scripts.sanitize_images data/raw_images/

# Filter based on "Unwanted examples"
python -m vdc.scripts.filter_by_examples data/embeddings.csv --examples bad_examples.csv

Run python -m vdc.scripts to see available scripts
Run python -m vdc.scripts.<script> --help for options

Model Downloads: Models required for certain features (e.g., SSCD, PE, CLIP, aesthetic/NSFW predictors) are automatically downloaded to and managed within the models/ directory (defined by vdc.conf.settings.MODELS_DIR).

Configuration

Default settings live in vdc/conf/config.json
A config.json in your project root will take precedence (or pass --config to any script)

Visualization & Exploration

VDC provides accompanying R Markdown notebooks (found in the notebooks/ directory) to visually inspect and understand the impact of various curation steps, aiding in data-driven decision making.

While these notebooks use the .Rmd extension, they are entirely written in Python. This choice was made for their text-based nature, which significantly improves version control by producing clean and meaningful Git diffs compared to traditional .ipynb files.

For a seamless interactive experience in a Jupyter environment, we recommend installing jupytext (pip install jupytext). With jupytext installed, you can open and run .Rmd files directly in Jupyter as if they were .ipynb notebooks.

Documentation

For detailed walkthroughs, examples, and in-depth guides, please refer to our full documentation.

License

This project is licensed under the Apache-2.0 License - see the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.vscode		.vscode
docs		docs
notebooks		notebooks
requirements		requirements
scripts		scripts
tests		tests
vdc		vdc
.bandit		.bandit
.flake8		.flake8
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
.pylintrc		.pylintrc
CHANGELOG.md		CHANGELOG.md
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
bumpver.toml		bumpver.toml
mypy.ini		mypy.ini
pyproject.toml		pyproject.toml
tasks.py		tasks.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Vision Data Curation (VDC)

Features

The Curation Pipeline

Installation

From PyPI

From Source

Usage

Configuration

Visualization & Exploration

Documentation

License

About

Uh oh!

Releases

Packages

Languages

License

birder-project/vision-data-curation

Folders and files

Latest commit

History

Repository files navigation

Vision Data Curation (VDC)

Features

The Curation Pipeline

Installation

From PyPI

From Source

Usage

Configuration

Visualization & Exploration

Documentation

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages