✨ Description
A set of composable, dynamic dataset configuration classes that allow defining arbitrary dataset definition schemes.
Dynamic configuration classes: experimental implementation in `GPTDatasetConfig`. If it works well we could use it elsewhere, e.g. for defining plugins. It defines a class registry that is populated in `__init_subclass__`, and works as long as the subclass is imported.
The data config now has a `dataset` entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc. But there is the constraint that a dataset may be sampled/unsampled and split/unsplit, which constrains the nesting structure.
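To illustrate the registry mechanism described above, here is a minimal sketch and not the PR's actual code. The class names, the `type_name` attribute, the `from_dict` helper and the `type`/`path`/`datasets` keys are all hypothetical; only the use of `__init_subclass__` to populate a registry comes from the description.

```python
from typing import ClassVar, Optional


class DatasetConfig:
    """Hypothetical base class standing in for the PR's GPTDatasetConfig."""

    # Maps a type name (e.g. "memmap") to the config class that handles it.
    _registry: ClassVar[dict] = {}
    # Subclasses set this to register themselves (assumed attribute name).
    type_name: ClassVar[Optional[str]] = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs when the subclass is defined, so the subclass's module must
        # have been imported for its type to be known.
        if cls.__dict__.get("type_name") is not None:
            DatasetConfig._registry[cls.type_name] = cls

    @classmethod
    def from_dict(cls, config: dict) -> "DatasetConfig":
        config = dict(config)  # avoid mutating the caller's dict
        type_name = config.pop("type")
        if type_name not in cls._registry:
            raise ValueError(f"Unknown dataset type: {type_name!r}")
        return cls._registry[type_name](**config)


class MemmapDatasetConfig(DatasetConfig):
    type_name = "memmap"

    def __init__(self, path: str):
        self.path = path


class ConcatenatedDatasetConfig(DatasetConfig):
    type_name = "concatenated"

    def __init__(self, datasets: list):
        # Nested entries are resolved through the same registry.
        self.datasets = [DatasetConfig.from_dict(d) for d in datasets]


config = DatasetConfig.from_dict(
    {
        "type": "concatenated",
        "datasets": [
            {"type": "memmap", "path": "a"},
            {"type": "memmap", "path": "b"},
        ],
    }
)
```

Because registration happens at class-definition time, a type defined in a module that is never imported would be rejected by `from_dict` as unknown, which matches the "works as long as the subclass is imported" caveat.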
The types for now are:
- `memmap`: A typical Megatron dataset, unsampled and unsplit.
- `concatenated`: The concatenation of multiple indexed datasets (unsampled and unsplit, e.g. `memmap`) as if it were one. Currently unused.
- `split`: Split an indexed dataset using the provided ratios.
- `blended`: Blend sampled datasets (split or unsplit) according to the given probabilities.
- `dummy`: Always returns the same sample. Only available as sampled and split.
- `legacy`: Same as before this PR, for backward compatibility only. This is the only way to define a dataset from JSON files, which we aim to replace with a concatenated one anyway.

Dataset classes may include nested dataset definitions
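As a rough illustration of nesting, the sketch below writes a `split` over a `concatenated` dataset of two `memmap` datasets as a plain Python dict. The key names (`type`, `dataset`, `datasets`, `ratios`, `path`), the ratio labels and the paths are assumptions made for this example and are not taken from the PR's schema.

```python
# Hypothetical nested dataset definition: split a concatenation of two
# memmap datasets. All key names and values are illustrative assumptions.
nested_config = {
    "type": "split",
    "ratios": {"training": 0.9, "validation": 0.1},
    "dataset": {
        "type": "concatenated",
        "datasets": [
            {"type": "memmap", "path": "shard_0"},
            {"type": "memmap", "path": "shard_1"},
        ],
    },
}


def nested_types(config: dict) -> list:
    """Return the dataset types in depth-first order."""
    types = [config["type"]]
    # A config may wrap a single dataset, a list of datasets, or neither.
    children = list(config.get("datasets", []))
    if "dataset" in config:
        children.insert(0, config["dataset"])
    for child in children:
        types.extend(nested_types(child))
    return types


types_in_order = nested_types(nested_config)
```

This combination stays within the constraints listed above: `concatenated` wraps unsampled, unsplit indexed datasets, and `split` is applied to the resulting indexed dataset.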
Misc:
Breaking change: The `sample` dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).
For future work:
🔍 Type of change
Select all that apply: