Modular dataset configuration #104

Open · jlamypoirier wants to merge 7 commits into main
Conversation

@jlamypoirier (Collaborator) commented Jan 6, 2025

✨ Description

A set of composable, dynamic dataset configuration classes that allows defining arbitrary dataset definition schemes.

Dynamic configuration classes: an experimental implementation in GPTDatasetConfig. If it works well, we could use it elsewhere, e.g. for defining plugins. It defines a class registry that is populated in __init_subclass__ and works as long as the subclass is imported; a minimal sketch follows.
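To illustrate the mechanism, here is a minimal sketch of such a registry; the registry name, the `type_` field, and the `from_dict` dispatch are illustrative assumptions, not the PR's actual API.

```python
# Minimal sketch of a dynamic config registry (illustrative, not the PR's actual API).
dataset_config_registry: dict[str, type["GPTDatasetConfig"]] = {}


class GPTDatasetConfig:
    # Concrete subclasses declare the type name used in config files (assumed field).
    type_: str | None = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register the subclass as soon as it is defined, i.e. as soon as it is imported.
        if cls.type_ is not None:
            dataset_config_registry[cls.type_] = cls

    @classmethod
    def from_dict(cls, config: dict) -> "GPTDatasetConfig":
        # Dispatch on the "type" key; works for any subclass that has been imported.
        config = dict(config)
        return dataset_config_registry[config.pop("type")](**config)


class MemmapDatasetConfig(GPTDatasetConfig):
    type_ = "memmap"

    def __init__(self, path: str):
        self.path = path


# Builds a MemmapDatasetConfig without naming the class explicitly.
config = GPTDatasetConfig.from_dict({"type": "memmap", "path": "/data/corpus"})
```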

The data config now has a dataset entry, which can be any dataset config. That dataset config may in turn contain further nested dataset configs, and so on. However, a dataset may be sampled or unsampled and split or unsplit, which constrains the valid nesting structures.

The types for now are:

  • memmap: A typical Megatron dataset, unsampled and unsplit.
  • concatenated: The concatenation of multiple indexed datasets (unsampled and unsplit, e.g. memmap) as if it were one. Currently unused.
  • split: Split an indexed dataset using the provided ratios.
  • blended: Blend sampled datasets (split or unsplit) according to the given probabilities.
  • dummy: Always returns the same sample. Only available as sampled and split.
  • legacy: Same as before this PR, kept for backward compatibility only. This is the only way to define a dataset from json files, which we aim to replace with a concatenated one anyway.

Dataset classes may include nested dataset definitions; a hypothetical configuration illustrating this is sketched below.
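To make the nesting concrete, here is a hypothetical configuration sketched as the Python dict a parsed YAML config might produce; the key names (`probabilities`, `datasets`, `ratios`, `path`) are assumptions and may not match the PR's exact schema.

```python
# Hypothetical nested dataset config (key names are assumptions, not the exact schema).
# A blend of two sampled datasets, each obtained by splitting an indexed dataset.
dataset_config = {
    "type": "blended",
    "probabilities": [0.7, 0.3],
    "datasets": [
        {
            "type": "split",
            "ratios": {"training": 0.98, "validation": 0.02},
            "dataset": {"type": "memmap", "path": "/data/corpus_a"},
        },
        {
            "type": "split",
            "ratios": {"training": 0.98, "validation": 0.02},
            # Concatenation makes several indexed datasets behave as one, so it can
            # sit anywhere a single unsampled, unsplit dataset is expected.
            "dataset": {
                "type": "concatenated",
                "datasets": [
                    {"type": "memmap", "path": "/data/corpus_b"},
                    {"type": "memmap", "path": "/data/corpus_c"},
                ],
            },
        },
    ],
}
```

Note how the sampled/unsampled and split/unsplit constraint shapes the tree: blended expects sampled children, split consumes an unsplit indexed dataset, and memmap or concatenated datasets sit at the leaves.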

Misc:

  • Serialize dict keys for configs (mostly for enum keys); a sketch follows this list.
  • Generalize the indexed dataset so the machinery can be used for other models.
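A minimal sketch of the dict-key serialization, assuming enum keys are the main case; `PhaseType` and `serialize_config` are illustrative names, not the PR's actual helpers.

```python
import enum


class PhaseType(str, enum.Enum):  # illustrative enum, not the PR's actual class
    training = "training"
    validation = "validation"


def serialize_config(value):
    # Recursively replace enum dict keys by their underlying values so the
    # result round-trips through YAML/JSON with plain string keys.
    if isinstance(value, dict):
        return {
            (k.value if isinstance(k, enum.Enum) else k): serialize_config(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [serialize_config(v) for v in value]
    return value


# Prints {'training': 0.98, 'validation': 0.02} with plain string keys.
print(serialize_config({PhaseType.training: 0.98, PhaseType.validation: 0.02}))
```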

Breaking change: the sample dataset source has been dropped, since it's not particularly relevant. Otherwise, configs are backward-compatible (for now).

For future work:

  • Define each split independently
  • Concatenate datasets from all files in a directory.
  • More features.
  • Simplify things?

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 7, 2025 20:25
@jlamypoirier jlamypoirier requested a review from tscholak January 7, 2025 20:25