Modular dataset configuration #104

Open · jlamypoirier wants to merge 7 commits into main
Conversation

@jlamypoirier (Collaborator) commented Jan 6, 2025

✨ Description

A set of composable, dynamic dataset configuration classes that allows defining arbitrary dataset definition schemes.

Dynamic configuration classes: an experimental implementation in GPTDatasetConfig. If it works well, we could use it elsewhere, e.g. for defining plugins. It defines a class registry that is populated in __init_subclass__ and works as long as the subclass is imported; a minimal sketch follows.
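To illustrate the mechanism, here is a minimal sketch of such a registry; the registry name, the `type_` field, and the `from_dict` dispatch are illustrative assumptions, not the PR's actual API.

```python
# Minimal sketch of a dynamic config registry (illustrative, not the PR's actual API).
dataset_config_registry: dict[str, type["GPTDatasetConfig"]] = {}


class GPTDatasetConfig:
    # Concrete subclasses declare the type name used in config files (assumed field).
    type_: str | None = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register the subclass as soon as it is defined, i.e. as soon as it is imported.
        if cls.type_ is not None:
            dataset_config_registry[cls.type_] = cls

    @classmethod
    def from_dict(cls, config: dict) -> "GPTDatasetConfig":
        # Dispatch on the "type" key; works for any subclass that has been imported.
        config = dict(config)
        return dataset_config_registry[config.pop("type")](**config)


class MemmapDatasetConfig(GPTDatasetConfig):
    type_ = "memmap"

    def __init__(self, path: str):
        self.path = path


# Builds a MemmapDatasetConfig without naming the class explicitly.
config = GPTDatasetConfig.from_dict({"type": "memmap", "path": "/data/corpus"})
```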

The data config now has a dataset entry, which can be any dataset config. That dataset config may in turn contain further nested dataset configs, and so on. However, a dataset may be sampled or unsampled and split or unsplit, which constrains the valid nesting structures.

The types for now are:

  • memmap: A typical Megatron dataset, unsampled and unsplit.
  • concatenated: The concatenation of multiple indexed datasets (unsampled and unsplit, e.g. memmap) as if it were one. Currently unused.
  • split: Split an indexed dataset using the provided ratios.
  • blended: Blend sampled datasets (split or unsplit) according to the given probabilities.
  • dummy: Always returns the same sample. Only available as sampled and split.
  • legacy: Same as before this PR, kept for backward compatibility only. This is the only way to define a dataset from json files, which we aim to replace with a concatenated one anyway.

Dataset classes may include nested dataset definitions; a hypothetical configuration illustrating this is sketched below.
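To make the nesting concrete, here is a hypothetical configuration sketched as the Python dict a parsed YAML config might produce; the key names (`probabilities`, `datasets`, `ratios`, `path`) are assumptions and may not match the PR's exact schema.

```python
# Hypothetical nested dataset config (key names are assumptions, not the exact schema).
# A blend of two sampled datasets, each obtained by splitting an indexed dataset.
dataset_config = {
    "type": "blended",
    "probabilities": [0.7, 0.3],
    "datasets": [
        {
            "type": "split",
            "ratios": {"training": 0.98, "validation": 0.02},
            "dataset": {"type": "memmap", "path": "/data/corpus_a"},
        },
        {
            "type": "split",
            "ratios": {"training": 0.98, "validation": 0.02},
            # Concatenation makes several indexed datasets behave as one, so it can
            # sit anywhere a single unsampled, unsplit dataset is expected.
            "dataset": {
                "type": "concatenated",
                "datasets": [
                    {"type": "memmap", "path": "/data/corpus_b"},
                    {"type": "memmap", "path": "/data/corpus_c"},
                ],
            },
        },
    ],
}
```

Note how the sampled/unsampled and split/unsplit constraint shapes the tree: blended expects sampled children, split consumes an unsplit indexed dataset, and memmap or concatenated datasets sit at the leaves.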

Misc:

  • Serialize dict keys for configs (mostly for enum keys); a sketch follows this list.
  • Generalize the indexed dataset so the machinery can be used for other models.
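A minimal sketch of the dict-key serialization, assuming enum keys are the main case; `PhaseType` and `serialize_config` are illustrative names, not the PR's actual helpers.

```python
import enum


class PhaseType(str, enum.Enum):  # illustrative enum, not the PR's actual class
    training = "training"
    validation = "validation"


def serialize_config(value):
    # Recursively replace enum dict keys by their underlying values so the
    # result round-trips through YAML/JSON with plain string keys.
    if isinstance(value, dict):
        return {
            (k.value if isinstance(k, enum.Enum) else k): serialize_config(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [serialize_config(v) for v in value]
    return value


# Prints {'training': 0.98, 'validation': 0.02} with plain string keys.
print(serialize_config({PhaseType.training: 0.98, PhaseType.validation: 0.02}))
```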

Breaking change: the sample dataset source has been dropped, since it's not particularly relevant. Otherwise, configs are backward-compatible (for now).

For future work:

  • Define each split independently
  • Concatenate datasets from all files in a directory.
  • More features.
  • Simplify things?

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 7, 2025 20:25
@jlamypoirier jlamypoirier requested a review from tscholak January 7, 2025 20:25