Split alpaca_dataset to alpaca + alpaca_cleaned #639
Merged
Context
One of the follow-ups mentioned in #637: we should have reasonable defaults set in builder functions, and configs should only have specific overrides.
alpaca_dataset was a strange case because almost all the time we were overriding use_clean=True while the default was False (which is reasonable, since the alpaca dataset should return the original dataset by default). It didn't make sense to make use_clean=True the default, because the cleaned version is essentially a different dataset. This is also supported by the fact that we've noticed differences in memory usage and performance when using the cleaned version compared to the original, likely due to different sample length distributions (thanks to @SLR722 for pointing this out). At this point, it makes sense to just have a separate builder for the cleaned dataset.
Changelog
Split alpaca into alpaca_dataset and alpaca_cleaned_dataset, and refactor callsites accordingly.
Test plan
pytest tests --with-integration
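Sketch of the split
For readers skimming the description, the change in the Changelog can be illustrated roughly as follows. This is a minimal sketch, not the actual code in this PR: the shared helper _build_alpaca, the loader argument, fake_loader, and both source identifiers are hypothetical placeholders. Only the builder names alpaca_dataset and alpaca_cleaned_dataset come from the description.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]
Loader = Callable[[str], List[Sample]]

# Placeholder identifiers for illustration only, not real dataset paths.
ORIGINAL_SOURCE = "alpaca-original"
CLEANED_SOURCE = "alpaca-cleaned"


def _build_alpaca(source: str, load: Loader) -> List[Sample]:
    """Hypothetical shared logic: fetch samples from the given source."""
    return load(source)


def alpaca_dataset(load: Loader) -> List[Sample]:
    """Builder for the original dataset; no use_clean flag is needed."""
    return _build_alpaca(ORIGINAL_SOURCE, load)


def alpaca_cleaned_dataset(load: Loader) -> List[Sample]:
    """Separate builder for the cleaned dataset, with its own default."""
    return _build_alpaca(CLEANED_SOURCE, load)


def fake_loader(source: str) -> List[Sample]:
    """Stand-in loader that records which source was requested."""
    return [{"source": source, "instruction": "example"}]


original = alpaca_dataset(fake_loader)
cleaned = alpaca_cleaned_dataset(fake_loader)
```

With two builders, a config picks the dataset by naming the builder, so no config has to override a boolean default.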