Split alpaca_dataset to alpaca + alpaca_cleaned #639

RdoubleA · 2024-04-02T19:18:15Z

Context

One of the follow-ups mentioned in #637: builder functions should have reasonable defaults, and configs should only specify overrides. alpaca_dataset was a strange case because we were almost always overriding use_clean=True while the default was False (which is reasonable, since the alpaca dataset should return the original dataset by default). It didn't make sense to make use_clean=True the default, because the cleaned version is essentially a different dataset. This is also supported by the differences in memory usage and performance we've noticed between the cleaned and original versions, likely due to different sample length distributions (thanks to @SLR722 for pointing this out). At this point, it makes sense to just have a separate builder for the cleaned dataset.

Changelog

Split alpaca into alpaca_dataset and alpaca_cleaned_dataset and refactor callsites accordingly
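
As a rough illustration of the split described above, here is a minimal sketch that contrasts a flag-based builder with two dedicated builders. It is not torchtune's actual implementation: `DatasetSpec`, `legacy_alpaca_dataset`, and the zero-argument signatures are hypothetical stand-ins, and the two source strings are assumed to be the Hugging Face dataset IDs for the original and cleaned Alpaca data.

```python
from dataclasses import dataclass

# Assumed Hugging Face dataset IDs (not stated in this PR).
ALPACA_SOURCE = "tatsu-lab/alpaca"
ALPACA_CLEANED_SOURCE = "yahma/alpaca-cleaned"


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical stand-in for a built dataset; it only records the source."""

    source: str


def legacy_alpaca_dataset(use_clean: bool = False) -> DatasetSpec:
    """Old shape: one builder, with a flag that nearly every config overrode."""
    return DatasetSpec(ALPACA_CLEANED_SOURCE if use_clean else ALPACA_SOURCE)


def alpaca_dataset() -> DatasetSpec:
    """New shape: the default builder always returns the original dataset."""
    return DatasetSpec(ALPACA_SOURCE)


def alpaca_cleaned_dataset() -> DatasetSpec:
    """New shape: the cleaned dataset gets its own builder, so no override is needed."""
    return DatasetSpec(ALPACA_CLEANED_SOURCE)


if __name__ == "__main__":
    print(alpaca_dataset())
    print(alpaca_cleaned_dataset())
```

With this shape, a config can name the builder it wants directly instead of passing use_clean=True as an override.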

Test plan

pytest tests --with-integration

pytorch-bot · 2024-04-02T19:18:18Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/639

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit aa1e385 with merge base 8183b42:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ebsmothers

Should we also add alpaca_cleaned_dataset to api_ref_datasets.rst? Otherwise looks good!

add alpaca_cleaned

aa6d259

RdoubleA requested a review from ebsmothers April 2, 2024 19:18

facebook-github-bot added the CLA Signed label Apr 2, 2024

ebsmothers approved these changes Apr 2, 2024


update datasets directory in live docs

aa1e385

RdoubleA merged commit e7e310a into main Apr 2, 2024
20 checks passed

RdoubleA deleted the rafiayub/alpaca_cleaned branch April 2, 2024 21:36

tcapelle pushed a commit to tcapelle/torchtune that referenced this pull request Apr 5, 2024

Split alpaca_dataset to alpaca + alpaca_cleaned (pytorch#639)

0770781
