
[4/7] Refactor preference dataset with transforms design #1276

Merged
merged 26 commits into from
Aug 13, 2024

Conversation

@RdoubleA (Contributor) commented Aug 6, 2024

Context

Following the RFC in #1186, we will use the unified message_transform -> template -> tokenization data pipeline in all our datasets. This PR updates PreferenceDataset to follow this pipeline. We can't use SFTDataset here because chosen and rejected messages are treated separately.
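The unified pipeline described above can be sketched roughly as follows. This is a toy illustration of the idea, not torchtune's actual implementation: the function names, the role-tag "template", and the whitespace tokenizer are all stand-ins.

```python
# Illustrative sketch of the message_transform -> template -> tokenization
# pipeline for a preference sample. All names here are hypothetical.

def message_transform(row):
    # Convert a raw row into separate chosen/rejected message lists;
    # chosen and rejected must be handled independently, which is why
    # SFTDataset cannot be reused here.
    return {
        "chosen": [{"role": "user", "content": row["prompt"]},
                   {"role": "assistant", "content": row["chosen"]}],
        "rejected": [{"role": "user", "content": row["prompt"]},
                     {"role": "assistant", "content": row["rejected"]}],
    }

def apply_template(messages):
    # Prepend a simple role tag to each message (stand-in for a PromptTemplate).
    return [{**m, "content": f"{m['role'].capitalize()}: {m['content']}"}
            for m in messages]

def tokenize(messages):
    # Toy whitespace "tokenizer" standing in for a ModelTokenizer.
    return [tok for m in messages for tok in m["content"].split()]

row = {"prompt": "What is 2+2?", "chosen": "4", "rejected": "5"}
transformed = message_transform(row)
sample = {k: tokenize(apply_template(v)) for k, v in transformed.items()}
```

The key point is that both the chosen and rejected conversations flow through the same three stages, just as separate message lists.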

Covers #1214

Changelog

  • Refactor PreferenceDataset and stack_exchange_paired_dataset
  • Add ChosenRejectedToMessages message transform
  • StackExchangedPairedTemplate -> QuestionAnswerTemplate
  • Update tests
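The template rename in the changelog amounts to a question/answer formatting transform. A minimal illustrative sketch follows; this is not the actual torchtune class, and the real format string may differ:

```python
# Minimal sketch in the spirit of QuestionAnswerTemplate; the real torchtune
# class may format differently.

class QATemplate:
    template = "Question: {content}\n\nAnswer: "

    def __call__(self, messages):
        # Wrap only user turns; assistant turns pass through unchanged.
        out = []
        for m in messages:
            if m["role"] == "user":
                out.append({"role": "user",
                            "content": self.template.format(content=m["content"])})
            else:
                out.append(dict(m))
        return out

msgs = [
    {"role": "user", "content": "How do I merge two dicts?"},
    {"role": "assistant", "content": "Unpack both into a new literal."},
]
templated = QATemplate()(msgs)
```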

Test plan

  • Added unit tests
  • DPO run with stack exchange paired compared against main; DPO run with hh_rlhf_helpful_dataset

pytorch-bot (bot) commented Aug 6, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1276

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 8fd001d with merge base 0531dcb:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Aug 6, 2024
@codecov-commenter commented Aug 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.18%. Comparing base (6be89c0) to head (8fd001d).
Report is 5 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1276      +/-   ##
==========================================
+ Coverage   68.57%   69.18%   +0.61%     
==========================================
  Files         258      262       +4     
  Lines       11972    12129     +157     
==========================================
+ Hits         8210     8392     +182     
+ Misses       3762     3737      -25     

☔ View full report in Codecov by Sentry.

@SalmanMohammadi (Collaborator) commented Aug 7, 2024

I've included a couple of examples of preference datasets below; see here for a good list, too.

Chat Format

Anthropic HH-RLHF (processed)
[image: example rows from the dataset]

Instruct-format*

[image: example rows from the dataset]

*This is actually a mixed dataset, but there are good examples of instruct-based prompts here.

These datasets are nice examples of processed datasets that are ready for consumption in a standard format. This is one of the things I was unclear about: it seems like we currently have a prompt template for supporting each specific dataset type OOTB? Do you think this scales in a sensible way, or are we just being deliberate with the datasets we provide OOTB to ensure there are sufficient examples for users to specify their own? For example, with preference datasets in standard chat/instruct formats as above, would it be possible for the user to specify a standard template we provide (such as ChatML?) rather than having to write one for their specific use case?
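For reference, ChatML-style rendering wraps each message in special start/end markers. The sketch below is generic (not a torchtune API); it just shows what a "standard template" option could look like for chat-format datasets:

```python
# Generic ChatML-style rendering sketch (not a torchtune API).

def to_chatml(messages):
    # Each message becomes: <|im_start|>role\ncontent<|im_end|>
    return "\n".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    )

rendered = to_chatml([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
])
```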

Continuing on from above RE the default datasets we provide, I would also prefer the default exemplar preference dataset to be chat/instruct style. Chat/instruct style preference datasets are the standard for open source preference/DPO-trained models; I think it's a far more common use case (the stack exchange paired dataset is pretty specific), and one of the motivators for this refactor is to unify chat+instruct preference datasets, right? I'd like to make sure there's at least one example of a chat/instruct dataset which allows users to replicate e.g. instruct models tuned with DPO in torchtune.

Sorry to make you explain probably obvious things here - I'm trying to get the full picture of how an end-user would configure one of these datasets with the current example.

Thank you so much for helping out with this, I'm really enjoying seeing your datasets refactor masterplan fall into place so neatly.

edit: will review a bit closer once I have a better grasp of your vision here.

@SalmanMohammadi (Collaborator) commented Aug 7, 2024

To further motivate my points above, my grand plan for alignment in torchtune is for a user to roughly follow the fine-tuning steps used by e.g. Llama2 or InstructGPT. The dataset in the PPO config I provide is the chat-style HH-RLHF dataset above - I want to demonstrate an end-to-end journey for what I anticipate will be the most common use case for someone using torchtune for alignment: from SFT, to reward modelling, and finally RLHF using PPO. If I were to write a tutorial for this, I would love to show users how to follow this process with comprehensible changes to our configs (in an ideal world, without modifying any code at all!).

@RdoubleA (Contributor, Author) commented Aug 7, 2024

@SalmanMohammadi Thanks for sharing these examples. I suppose I generalized a bit based on stack exchange paired, but as you mention it is rather niche, though already present in our library. I'm happy to add one or both of the datasets you recommended as builders; both seem to share the same message_transform, which might be more generalizable than the one I wrote for stack exchange.

@RdoubleA (Contributor, Author) commented Aug 7, 2024

> it seems like we currently have a prompt template for supporting each specific dataset type OOTB?

Not exactly, the prompt template is not required. A lot of the exemplar datasets have it just to show plenty of different examples of how you would configure a custom dataset. Users can use any or no prompt template when making their own builders.
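The "template is optional" point can be sketched as follows. This is a hypothetical builder, with illustrative names rather than torchtune's actual signatures: when no prompt template is configured, messages simply pass through untouched.

```python
from typing import Callable, Optional

# Hypothetical builder sketch showing that a prompt template is optional;
# names here are illustrative, not torchtune's actual API.

def build_preference_samples(
    rows,
    message_transform: Callable,
    prompt_template: Optional[Callable] = None,
):
    samples = []
    for row in rows:
        messages = message_transform(row)
        # Only apply a template if the user configured one.
        if prompt_template is not None:
            messages = prompt_template(messages)
        samples.append(messages)
    return samples

rows = [{"prompt": "hi", "response": "hello"}]
to_msgs = lambda r: [{"role": "user", "content": r["prompt"]},
                     {"role": "assistant", "content": r["response"]}]

no_template = build_preference_samples(rows, to_msgs)  # template omitted
with_template = build_preference_samples(
    rows, to_msgs,
    prompt_template=lambda ms: [{**m, "content": m["content"].upper()} for m in ms],
)
```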

@RdoubleA RdoubleA changed the title [4/n] Refactor preference dataset with transforms design [4/7] Refactor preference dataset with transforms design Aug 7, 2024
@felipemello1 (Contributor) previously approved these changes Aug 8, 2024 and left a comment:

Some minor comments. Changes make sense to me. I did not build the docs - have you had a chance to make sure they render well? There is also a docs build error.

Resolved review threads:
  • torchtune/data/_messages.py
  • torchtune/datasets/_preference.py (5 threads, 2 outdated)
@felipemello1 felipemello1 dismissed their stale review August 8, 2024 02:07

wait for tests to pass and solve conflicts

@ebsmothers (Contributor) left a comment:

A few comments but no major concerns, please make sure other folks' comments are addressed as well

Resolved review threads:
  • torchtune/data/_prompt_templates.py
  • torchtune/datasets/_preference.py (outdated)
Comment on lines 22 to 23
sourced from Hugging Face Hub, local files, or remote files. This class requires
the dataset to have "chosen" and "rejected" model responses. At a high level, this

Can consider giving an example raw data format here

Resolved review threads:
  • tests/torchtune/datasets/test_preference_dataset.py (outdated)
  • torchtune/datasets/_preference.py
  • torchtune/datasets/_stack_exchange_paired.py (outdated)
@SalmanMohammadi (Collaborator) commented:

> I'm happy to add one or both of the datasets you recommended as builders

Could we please add a builder for https://huggingface.co/datasets/RLHFlow/HH-RLHF-Helpful-standard?row=1*, which is a chat-style preference dataset?

Let me know if you'd rather see this as a followup.



def stack_exchange_paired_dataset(
tokenizer: ModelTokenizer,
A Collaborator commented on this code:

Why is this tokenizer: ModelTokenizer, but all other builders use model_transform: Transform?

@RdoubleA RdoubleA merged commit 6a7951f into pytorch:main Aug 13, 2024
20 checks passed
@RdoubleA RdoubleA deleted the merged_preference_dataset branch August 13, 2024 23:36