Padding free dpo #2437
base: main
Conversation
Not really done yet, but for now here are some tasks still to be done:
Most of the stuff is done; just some small things left, like dealing with lists and converting them to tensors.
Hey @osanseviero, the main idea for using padding_free is mostly in place now, but there are still a few things that need to be done. It would be awesome if you could take a look at the code and let me know if there's anything else I should address or add. I've made it so the user can directly do this:

trainer = DPOTrainer(
    model=self.model,
    ref_model=None,
    args=training_args,
    tokenizer=self.tokenizer,
    padding_free=True,  # when True it will not use any padding
    train_dataset=dummy_dataset["train"],
    eval_dataset=dummy_dataset["test"],
)
tests/test_dpo_trainer.py
Outdated
with tempfile.TemporaryDirectory() as tmp_dir:
    training_args = DPOConfig(
        output_dir=tmp_dir,
        per_device_train_batch_size=2,
        max_steps=3,
        remove_unused_columns=False,
        gradient_accumulation_steps=4,
Would be good to have tests for this with gradient accumulation too, perhaps using pytest.mark.parametrize?
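A rough sketch of what such a parametrized test could look like, assuming the padding_free flag from this PR; the model, tokenizer, and dummy_dataset fixtures and the exact test layout are placeholders, not the actual test file:

# Sketch only: reuse the config from the hunk above and parametrize over
# gradient_accumulation_steps; the fixtures below are hypothetical.
import tempfile

import pytest

from trl import DPOConfig, DPOTrainer


@pytest.mark.parametrize("gradient_accumulation_steps", [1, 4, 8])
def test_dpo_trainer_padding_free(gradient_accumulation_steps, model, tokenizer, dummy_dataset):
    with tempfile.TemporaryDirectory() as tmp_dir:
        training_args = DPOConfig(
            output_dir=tmp_dir,
            per_device_train_batch_size=2,
            max_steps=3,
            remove_unused_columns=False,
            gradient_accumulation_steps=gradient_accumulation_steps,
        )
        trainer = DPOTrainer(
            model=model,
            ref_model=None,
            args=training_args,
            tokenizer=tokenizer,
            padding_free=True,
            train_dataset=dummy_dataset["train"],
            eval_dataset=dummy_dataset["test"],
        )
        trainer.train()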
All right, will do, thanks for reviewing 😎
trl/trainer/ppo_config.py
Outdated
@@ -53,6 +53,9 @@ class PPOConfig(OnPolicyConfig):
    Discount factor.
This modification shouldn't be here, right?
Oh, my bad, I'll fix it right now.
You still have modifications in the PPO files.
Here: trl/trainer/dpo_trainer.py, lines 1115 to 1123 (commit 6d4ed07)
After the flush left, we could remove the pad tokens and add position ids:

# Flush left to reduce the memory usage
# [[0, 0, x, x, x, x],  ->  [[x, x, x, x],
#  [0, x, x, x, 0, 0]]       [x, x, x, 0]]
for i in range(attention_mask.size(0)):
    first_one_idx = torch.nonzero(attention_mask[i])[0].item()
    input_ids[i] = torch.roll(input_ids[i], shifts=-first_one_idx)
    attention_mask[i] = torch.roll(attention_mask[i], shifts=-first_one_idx)
    loss_mask[i] = torch.roll(loss_mask[i], shifts=-first_one_idx)

if self.padding_free:
    # input =                input =              pos_ids =
    # [[x, x, x, x],   ->    [[x, x, x, x],  and  [[0, 1, 2, 3],
    #  [x, x, x, 0]]          [x, x, x]]           [0, 1, 2]]
    #
    # then flattened:  input = [x, x, x, x, x, x, x]  and  pos_ids = [0, 1, 2, 3, 0, 1, 2]
    ...  # code here
All right, awesome, actually this makes more sense 😭
Before I push my code again I want to benchmark this with padding and with padding_free, just to show the performance difference.
You can push it, no worries, we can still refine it after.
Thank you for your understanding! I wanted to let you know that I’m a bit tied up today and tomorrow, so I might not be able to push the code right away. I’ll try to get to it as soon as possible, but please feel free to let me know if there’s a hard deadline I should prioritize. Thanks for your patience!
No rush on our side :)
All right, so I think this does it. I checked whether we can train this on a single T4 GPU Colab notebook:

python trl/examples/scripts/dpo.py \
--dataset_name trl-lib/ultrafeedback_binarized \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--learning_rate 5.0e-6 \
--num_train_epochs 1 \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 8 \
--gradient_checkpointing \
--logging_steps 1 \
--output_dir Qwen2-0.5B-DPO \
--no_remove_unused_columns \
--use_peft \
--lora_r 32 \
--lora_alpha 16

Without padding_free it kept saying OOM; is this normal or what?
@osanseviero Just wanted to follow up on this PR and see if there’s any feedback so far. I’m happy to clarify anything or make updates if needed. Let me know whenever you get a chance. Thanks so much for your time! 🙌
You still need to revert the changes applied to the PPO files, and apply the pre-commit hooks.
…nto padding_free_dpo
The new push changes some code because of these problems. Training without padding_free:

{'loss': 0.6933, 'grad_norm': 28.669958114624023, 'learning_rate': 4.880000000000001e-06, 'rewards/chosen': -0.0004301070875953883, 'rewards/rejected': -0.00019319055718369782, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0002369165886193514, 'logps/chosen': -359.64996337890625, 'logps/rejected': -218.63882446289062, 'logits/chosen': -3.0950357913970947, 'logits/rejected': -2.8118934631347656, 'epoch': 0.02}

and training with padding_free:

{'loss': 0.6931, 'grad_norm': 197.54409790039062, 'learning_rate': 4.960000000000001e-06, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.0, 'rewards/margins': 0.0, 'logps/chosen': -972.31005859375, 'logps/rejected': -1276.14697265625, 'logits/chosen': -2.9294216632843018, 'logits/rejected': -2.5271308422088623, 'epoch': 0.01}
{'loss': 0.6791, 'grad_norm': 124.33696746826172, 'learning_rate': 4.92e-06, 'rewards/chosen': 0.008736038580536842, 'rewards/rejected': -0.019646836444735527, 'rewards/accuracies': 1.0, 'rewards/margins': 0.02838287316262722, 'logps/chosen': -656.6194458007812, 'logps/rejected': -656.0517578125, 'logits/chosen': -2.9904067516326904, 'logits/rejected': -2.4395949840545654, 'epoch': 0.02}
{'loss': 0.6568, 'grad_norm': 147.94912719726562, 'learning_rate': 4.880000000000001e-06, 'rewards/chosen': 0.020291520282626152, 'rewards/rejected': -0.05392017588019371, 'rewards/accuracies': 1.0, 'rewards/margins': 0.07421170175075531, 'logps/chosen': -653.7568359375, 'logps/rejected': -713.071044921875, 'logits/chosen': -3.067491292953491, 'logits/rejected': -2.60148286819458, 'epoch': 0.02}
{'loss': 0.6229, 'grad_norm': 213.63063049316406, 'learning_rate': 4.84e-06, 'rewards/chosen': 0.02659912034869194, 'rewards/rejected': -0.12024041265249252, 'rewards/accuracies': 1.0, 'rewards/margins': 0.14683951437473297, 'logps/chosen': -1233.0953369140625, 'logps/rejected': -761.375732421875, 'logits/chosen': -2.791830062866211, 'logits/rejected': -2.6060781478881836, 'epoch': 0.03}

The grad_norm is very high for padding_free, and rewards/accuracies is always 1.0, which is not correct.
What does this PR do?
New feature #2422
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
For now this is just a draft; I will continue working on it.
@osanseviero