Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split alpaca_dataset to alpaca + alpaca_cleaned #639

Merged
merged 2 commits into from
Apr 2, 2024

Conversation

RdoubleA
Copy link
Contributor

@RdoubleA RdoubleA commented Apr 2, 2024

Context

One of the follow ups mentioned in #637 - we should have reasonable defaults set in builder functions and configs should only have specific overrides. alpaca_dataset was a strange case because almost all the time we were overriding use_clean=True and the default was False (which is reasonable since the alpaca dataset by default should return the original dataset). It didn't make sense to make use_clean=True as the default because that is essentially a different dataset. This is also supported by the fact that we've noticed differences in memory usage and performance when using the cleaned version compared to the original, likely due to different sample length distributions (thanks to @SLR722 for pointing this out). At this point, it makes sense to just have a separate builder for the cleaned dataset.

Changelog

Split alpaca into alpaca_dataset and alpaca_cleaned_dataset and refactor callsites accordingly

Test plan

pytest tests --with-integration

Copy link

pytorch-bot bot commented Apr 2, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/639

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aa1e385 with merge base 8183b42 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@RdoubleA RdoubleA requested a review from ebsmothers April 2, 2024 19:18
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 2, 2024
Copy link
Contributor

@ebsmothers ebsmothers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also add alpaca_cleaned_dataset to api_ref_datasets.rst? Otherwise looks good!

@RdoubleA RdoubleA merged commit e7e310a into main Apr 2, 2024
20 checks passed
@RdoubleA RdoubleA deleted the rafiayub/alpaca_cleaned branch April 2, 2024 21:36
tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants