Use any reward model for online methods #2276

qgallouedec · 2024-10-24T20:43:49Z

What does this PR do?

This PR allows any reward model to be used with Online DPO, i.e. it removes the requirement to have the same chat template and tokenizer.

The user must now provide reward_processing_class.

trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,  # <-
)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2024-10-24T20:48:03Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2024-10-25T08:29:19Z

trl/trainer/online_dpo_trainer.py

-            completions = self.processing_class.batch_decode(
-                prompt_completion_ids[:, context_length:], skip_special_tokens=True
-            )
-            completions = [completion.strip() for completion in completions]  # remove the leading space


I think we don't need strip

…test

qgallouedec · 2024-10-25T10:11:11Z

Results for a gemma reward model

accelerate launch examples/scripts/dpo_online.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --reward_model_path Ray2333/GRM-Gemma-2B-rewardmodel-ft \
    --dataset_name trl-lib/ultrafeedback-prompt \
    --learning_rate 5.0e-7 \
    --logging_steps 10 \
    --output_dir Qwen2-0.5B-OnlineDPO-GRM-Gemma \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --push_to_hub

https://wandb.ai/huggingface/huggingface/runs/520cnnjl

For ref, with Pair RM judge instead:

accelerate launch examples/scripts/dpo_online.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --judge pair_rm \
    --dataset_name trl-lib/ultrafeedback-prompt \
    --learning_rate 5.0e-7 \
    --logging_steps 10 \
    --output_dir Qwen2-0.5B-OnlineDPO-PairRM \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --warmup_ratio 0.1 \
    --push_to_hub

https://wandb.ai/huggingface/huggingface/runs/ffd4u5wa

lewtun

Awesome PR @qgallouedec - this will unlock so many combinations of policy with models on RewardBench 🔥 !

Have you done a test run of e.g. trying to optimise Qwen2.5-0.5B-Instruct with the 7B ArmoRM model?

lewtun · 2024-10-25T15:42:27Z

docs/source/online_dpo_trainer.md


  trainer = OnlineDPOTrainer(
      ...
 -     judge=judge,
 +     reward_model=reward_model,
+     reward_processing_class=reward_tokenizer,


Is the reason to use a processing class in case we want to support other modalities beyond text?

Possibly. And since the tokenizer is now called processing_class within trainers, I'd recommend always aligning with it (even if only the textual modality is supported). Unless you have a good reason not to.

lewtun · 2024-10-25T15:47:04Z

examples/scripts/dpo_online.py

@@ -93,8 +93,13 @@
            trust_remote_code=model_config.trust_remote_code,
            **model_kwargs,
        )
+        reward_tokenizer = AutoTokenizer.from_pretrained(
+            training_args.reward_model_path,
+            trust_remote_code=model_config.trust_remote_code,


Should we also set truncation=True along with truncation_side="left" to ensure the labels aren't lost on long inputs? We might also need to allow people to set max_length since the context window of the tokenizer might be different from the policy - would that be best stored in the ScriptArguments?

Ok for truncation, and truncation side. Not sure what's the best way to let the user set the max_length. Ok for doing this in a follow-up PR?

Sounds good to follow up in separate PR - it should generally be safe for RMs that define the max length implicitly in their config anyway

lewtun · 2024-10-25T15:49:44Z

trl/trainer/online_dpo_trainer.py

-            )
+            # The reward model may not have the same chat template or tokenizer as the model, so we need to use the
+            # raw data (string), apply the chat template (if needed), and tokenize it with the reward processing class.
+            prompts = 2 * prompts  # repeat the prompt: [prompt0, prompt1] -> [prompt0, prompt1, prompt0, prompt1]


Why do we do this? Is it to align a prompt with chosen/rejected?

At this point, we have:

prompts = ["What color is the sky?", "What's the capital of France?"] completions = ["Blue", "Lyon", "Green", "Paris"]

and later, we need to concat the prompts and the completions to compute the reward

prompt_completion_ids = torch.cat((prompts_ids, completions_ids), dim=1) _, scores, _ = get_reward( self.reward_model, prompt_completion_ids, self.reward_processing_class.pad_token_id, context_length )

so we need to repeat the prompt.

trl/trainer/online_dpo_trainer.py

qgallouedec · 2024-10-28T15:00:07Z

Have you done a test run of e.g. trying to optimise Qwen2.5-0.5B-Instruct with the 7B ArmoRM model?

ArmoRM is a custom classifier (its code for using it is not standard). So our get_reward function probably won't work for it. However, by modifying the code a little, I still manage to use it, and this is what I get:

https://wandb.ai/huggingface/huggingface/runs/merlfqgx (screenshot to come)

accelerate launch examples/scripts/dpo_online.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --reward_model_path RLHFlow/ArmoRM-Llama3-8B-v0.1 \
    --dataset_name trl-lib/ultrafeedback-prompt \
    --learning_rate 5.0e-7 \
    --logging_steps 10 \
    --output_dir Qwen2-0.5B-OnlineDPO-AutoRM \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 2 \
    --warmup_ratio 0.1 \
    --missing_eos_penalty 1.0 \
    --push_to_hub

Refactor reward processing in OnlineDPOTrainer

571b6f3

qgallouedec added 2 commits October 25, 2024 08:26

Refactor completion decoding and reward processing

370e010

remove strip

5e7a495

qgallouedec commented Oct 25, 2024

View reviewed changes

qgallouedec added 5 commits October 25, 2024 08:54

remove warning

c4455b1

Add reward_tokenizer to training script

12893e2

Add reward_tokenizer and reward_processing_class to OnlineDPOTrainer …

6e8ca96

…test

propagate to xpo and nash

86fd762

style

1dc15d3

qgallouedec marked this pull request as ready for review October 25, 2024 10:12

qgallouedec requested review from kashif, edbeeching and lewtun October 25, 2024 10:12

qgallouedec added 5 commits October 25, 2024 12:53

reduce memory requirement with inference_mode

34e0eaf

fix tests

9255c34

pairrm judge llmblender

6ee647b

setUpClass(cls)

52808af

Add setUpClass method to TestJudges class

a2192ee

qgallouedec mentioned this pull request Oct 25, 2024

Don't pass eval_dataset in to trainers when no eval strategy #2270

Merged

5 tasks

Merge branch 'main' into any_reward_model_online

bbcc129

lewtun approved these changes Oct 28, 2024

View reviewed changes

qgallouedec and others added 4 commits October 28, 2024 11:50

Merge branch 'main' into any_reward_model_online

68bd4b2

truncation left for reward tokenizer

1834770

don't logcompletion without eval dataset

a9d8b23

only eval when possible

414b90b

qgallouedec mentioned this pull request Oct 28, 2024

Refactor unit tests to use standard unittest assertion methods #2283

Merged

5 tasks

qgallouedec merged commit b269657 into main Oct 28, 2024
9 of 10 checks passed

qgallouedec deleted the any_reward_model_online branch October 28, 2024 15:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use any reward model for online methods #2276

Use any reward model for online methods #2276

qgallouedec commented Oct 24, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented Oct 24, 2024

qgallouedec Oct 25, 2024

qgallouedec commented Oct 25, 2024 •

edited

Loading

lewtun left a comment

lewtun Oct 25, 2024

qgallouedec Oct 28, 2024

lewtun Oct 25, 2024

qgallouedec Oct 28, 2024

qgallouedec Oct 28, 2024

lewtun Oct 28, 2024

lewtun Oct 25, 2024

qgallouedec Oct 28, 2024

qgallouedec commented Oct 28, 2024 •

edited

Loading

Use any reward model for online methods #2276

Use any reward model for online methods #2276

Conversation

qgallouedec commented Oct 24, 2024 • edited Loading

What does this PR do?

Before submitting

Who can review?

HuggingFaceDocBuilderDev commented Oct 24, 2024

Choose a reason for hiding this comment

qgallouedec commented Oct 25, 2024 • edited Loading

lewtun left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

qgallouedec commented Oct 28, 2024 • edited Loading

qgallouedec commented Oct 24, 2024 •

edited

Loading

qgallouedec commented Oct 25, 2024 •

edited

Loading

qgallouedec commented Oct 28, 2024 •

edited

Loading