
[4/7] Refactor preference dataset with transforms design #1276

Merged
merged 26 commits into from
Aug 13, 2024

Conversation

@RdoubleA (Contributor) commented Aug 6, 2024

Context

Following the RFC in #1186, we will use the unified message_transform -> template -> tokenization data pipeline in all our datasets. This PR updates PreferenceDataset to follow this pipeline. We can't use SFTDataset here because chosen and rejected messages are treated separately.
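The unified pipeline described above can be sketched roughly as follows. This is a toy illustration of the idea, not torchtune's actual implementation: the function names, the role-tag "template", and the whitespace tokenizer are all stand-ins.

```python
# Illustrative sketch of the message_transform -> template -> tokenization
# pipeline for a preference sample. All names here are hypothetical.

def message_transform(row):
    # Convert a raw row into separate chosen/rejected message lists;
    # chosen and rejected must be handled independently, which is why
    # SFTDataset cannot be reused here.
    return {
        "chosen": [{"role": "user", "content": row["prompt"]},
                   {"role": "assistant", "content": row["chosen"]}],
        "rejected": [{"role": "user", "content": row["prompt"]},
                     {"role": "assistant", "content": row["rejected"]}],
    }

def apply_template(messages):
    # Prepend a simple role tag to each message (stand-in for a PromptTemplate).
    return [{**m, "content": f"{m['role'].capitalize()}: {m['content']}"}
            for m in messages]

def tokenize(messages):
    # Toy whitespace "tokenizer" standing in for a ModelTokenizer.
    return [tok for m in messages for tok in m["content"].split()]

row = {"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
transformed = message_transform(row)
sample = {k: tokenize(apply_template(v)) for k, v in transformed.items()}
```

The key point is that both the chosen and rejected conversations flow through the same three stages, just as separate message lists.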

Covers #1214

Changelog

  • Refactor PreferenceDataset and stack_exchange_paired_dataset
  • Add ChosenRejectedToMessages message transform
  • StackExchangedPairedTemplate -> QuestionAnswerTemplate
  • Update tests
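The template rename in the changelog amounts to a question/answer formatting transform. A minimal illustrative sketch follows; this is not the actual torchtune class, and the real format string may differ:

```python
# Minimal sketch in the spirit of QuestionAnswerTemplate; the real torchtune
# class may format differently.

class QATemplate:
    template = "Question: {content}\n\nAnswer: "

    def __call__(self, messages):
        # Wrap only user turns; assistant turns pass through unchanged.
        out = []
        for m in messages:
            if m["role"] == "user":
                out.append({"role": "user",
                            "content": self.template.format(content=m["content"])})
            else:
                out.append(dict(m))
        return out

msgs = [
    {"role": "user", "content": "How do I merge two dicts?"},
    {"role": "assistant", "content": "Unpack both into a new literal."},
]
templated = QATemplate()(msgs)
```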

Test plan

  • Added unit tests
  • DPO run with stack exchange paired compared against main; DPO run with hh_rlhf_helpful_dataset

pytorch-bot (bot) commented Aug 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1276

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8fd001d with merge base 0531dcb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 6, 2024
@codecov-commenter commented Aug 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.18%. Comparing base (6be89c0) to head (8fd001d).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1276      +/-   ##
==========================================
+ Coverage   68.57%   69.18%   +0.61%     
==========================================
  Files         258      262       +4     
  Lines       11972    12129     +157     
==========================================
+ Hits         8210     8392     +182     
+ Misses       3762     3737      -25     

☔ View full report in Codecov by Sentry.

@SalmanMohammadi (Collaborator) commented Aug 7, 2024

I've included a couple of examples of preference datasets below; see here for a good list, too.

Chat Format

Anthropic HH-RLHF (processed)
[image: example rows from the dataset]

Instruct-format*

[image: example rows from the dataset]

*This is actually a mixed dataset, but there are good examples of instruct-based prompts here.

These datasets are nice examples of processed datasets that are ready for consumption in a standard format. This is one of the things I was unclear about: it seems like we currently have a prompt template for supporting each specific dataset type OOTB? Do you think this scales in a sensible way, or are we just being deliberate with the datasets we provide OOTB to ensure there are sufficient examples for users to specify their own? For example, with preference datasets in standard chat/instruct formats as above, would it be possible for the user to specify a standard template we provide (such as ChatML?) rather than having to write one for their specific use case?
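For reference, ChatML-style rendering wraps each message in special start/end markers. The sketch below is generic (not a torchtune API); it just shows what a "standard template" option could look like for chat-format datasets:

```python
# Generic ChatML-style rendering sketch (not a torchtune API).

def to_chatml(messages):
    # Each message becomes: <|im_start|>role\ncontent<|im_end|>
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    )

rendered = to_chatml([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
```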

Continuing on from above RE the default datasets we provide, I would also prefer the default exemplar preference dataset to be chat/instruct style. Chat/instruct style preference datasets are the standard for open source preference/DPO-trained models; I think it's a far more common use case (the stack exchange paired dataset is pretty specific), and one of the motivators for this refactor is to unify chat+instruct preference datasets, right? I'd like to make sure there's at least one example of a chat/instruct dataset which allows users to replicate e.g. instruct models tuned with DPO in torchtune.

Sorry to make you explain probably obvious things here - I'm trying to get the full picture of how an end-user would configure one of these datasets with the current example.

Thank you so much for helping out with this, I'm really enjoying seeing your datasets refactor masterplan fall into place so neatly.

edit: will review a bit closer once I have a better grasp of your vision here.

@SalmanMohammadi (Collaborator) commented Aug 7, 2024

To further motivate my points above, my grand plan for alignment in torchtune is for a user to roughly follow the fine-tuning steps used by e.g. Llama2 or InstructGPT. The dataset in the PPO config I provide is the chat-style HH-RLHF dataset above - I want to demonstrate an end-to-end journey for what I anticipate will be the most common use case for someone using torchtune for alignment: from SFT, to reward modelling, and finally RLHF using PPO. If I were to write a tutorial for this, I would love to show users how to follow this process with comprehensible changes to our configs (in an ideal world, without modifying any code at all!).

@RdoubleA (Contributor, Author) commented Aug 7, 2024

@SalmanMohammadi Thanks for sharing these examples. I suppose I generalized a bit based on stack exchange paired, but as you mention it is rather niche, though already present in our library. I'm happy to add one or both of the datasets you recommended as builders; both seem to share the same message_transform, which might be more generalizable than the one I wrote for stack exchange.

@RdoubleA (Contributor, Author) commented Aug 7, 2024

> it seems like we currently have a prompt template for supporting each specific dataset type OOTB?

Not exactly, the prompt template is not required. A lot of the exemplar datasets have it just to show plenty of different examples of how you would configure a custom dataset. Users can use any or no prompt template when making their own builders.
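The "template is optional" point can be sketched as follows. This is a hypothetical builder, with illustrative names rather than torchtune's actual signatures: when no prompt template is configured, messages simply pass through untouched.

```python
from typing import Callable, Optional

# Hypothetical builder sketch showing that a prompt template is optional;
# names here are illustrative, not torchtune's actual API.

def build_preference_samples(
    rows,
    message_transform: Callable,
    prompt_template: Optional[Callable] = None,
):
    samples = []
    for row in rows:
        messages = message_transform(row)
        # Only apply a template if the user configured one.
        if prompt_template is not None:
            messages = prompt_template(messages)
        samples.append(messages)
    return samples

rows = [{"prompt": "hi", "response": "hello"}]
to_msgs = lambda r: [{"role": "user", "content": r["prompt"]},
                     {"role": "assistant", "content": r["response"]}]

no_template = build_preference_samples(rows, to_msgs)  # template omitted
with_template = build_preference_samples(
    rows, to_msgs,
    prompt_template=lambda ms: [{**m, "content": m["content"].upper()} for m in ms],
)
```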

@RdoubleA RdoubleA changed the title [4/n] Refactor preference dataset with transforms design [4/7] Refactor preference dataset with transforms design Aug 7, 2024
@felipemello1 (Contributor) previously approved these changes Aug 8, 2024 and left a comment:

Some minor comments. Changes make sense to me. I did not build the docs - have you had a chance to make sure they render well? There is also a docs build error.

Resolved review threads:
  • torchtune/data/_messages.py
  • torchtune/datasets/_preference.py (5 threads, 2 outdated)
@felipemello1 felipemello1 dismissed their stale review August 8, 2024 02:07

wait for tests to pass and solve conflicts

@ebsmothers (Contributor) left a comment:

A few comments but no major concerns, please make sure other folks' comments are addressed as well

Resolved review threads:
  • torchtune/data/_prompt_templates.py
  • torchtune/datasets/_preference.py (outdated)
Comment on lines 22 to 23
sourced from Hugging Face Hub, local files, or remote files. This class requires
the dataset to have "chosen" and "rejected" model responses. At a high level, this

Can consider giving an example raw data format here

Resolved review threads:
  • tests/torchtune/datasets/test_preference_dataset.py (outdated)
  • torchtune/datasets/_preference.py
  • torchtune/datasets/_stack_exchange_paired.py (outdated)
@SalmanMohammadi (Collaborator) commented:

> I'm happy to add one or both of the datasets you recommended as builders

Could we please add a builder for https://huggingface.co/datasets/RLHFlow/HH-RLHF-Helpful-standard?row=1*, which is a chat-style preference dataset?

Let me know if you'd rather see this as a followup.



def stack_exchange_paired_dataset(
tokenizer: ModelTokenizer,
A Collaborator commented on this code:

Why is this tokenizer: ModelTokenizer, but all other builders use model_transform: Transform?

@RdoubleA RdoubleA merged commit 6a7951f into pytorch:main Aug 13, 2024
20 checks passed
@RdoubleA RdoubleA deleted the merged_preference_dataset branch August 13, 2024 23:36