
DPO with chatglm2-6b #701

Closed
valkryhx opened this issue Aug 27, 2023 · 5 comments

valkryhx commented Aug 27, 2023

When training a chatglm2-6b model (SFT'd with QLoRA) with DPO, the loss does not seem to go down.

Below is part of the training log. There is only a small jitter in the training loss, and the rewards/margins stay really, really small.

Is this the expected behavior?

Also, if I set ref_model=None in the DPOTrainer, then during eval the eval_loss, eval_rewards, etc. always stay the same over the course of training. That does not seem right...

{'loss': 0.6924, 'learning_rate': 5e-06, 'rewards/chosen': 0.008880767971277237, 'rewards/rejected': 0.007253284566104412, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': 0.0016274836380034685, 'logps/rejected': -82.50346374511719, 'logps/chosen': -81.14619445800781, 'logits/rejected': -2.220956802368164, 'logits/chosen': -2.227485179901123, 'epoch': 0.01}
{'loss': 0.6951, 'learning_rate': 1e-05, 'rewards/chosen': 0.006737555377185345, 'rewards/rejected': 0.010529976338148117, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0037924200296401978, 'logps/rejected': -98.62718200683594, 'logps/chosen': -75.55088806152344, 'logits/rejected': -2.2945332527160645, 'logits/chosen': -2.1720192432403564, 'epoch': 0.02}
{'loss': 0.6942, 'learning_rate': 9.9999992772835e-06, 'rewards/chosen': -0.0009587286040186882, 'rewards/rejected': 0.0010564230615273118, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0020151520147919655, 'logps/rejected': -86.17352294921875, 'logps/chosen': -69.18254089355469, 'logits/rejected': -2.2097086906433105, 'logits/chosen': -2.1535096168518066, 'epoch': 0.03}
{'loss': 0.6928, 'learning_rate': 9.999997109134203e-06, 'rewards/chosen': 0.0020176221150904894, 'rewards/rejected': 0.0011578561970964074, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 0.0008597659762017429, 'logps/rejected': -67.88219451904297, 'logps/chosen': -85.24455261230469, 'logits/rejected': -2.137803792953491, 'logits/chosen': -2.1222000122070312, 'epoch': 0.03}
{'loss': 0.687, 'learning_rate': 9.99999349555274e-06, 'rewards/chosen': 0.0029097364749759436, 'rewards/rejected': -0.00971378292888403, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 0.012623520568013191, 'logps/rejected': -88.10517883300781, 'logps/chosen': -79.19243621826172, 'logits/rejected': -2.200411081314087, 'logits/chosen': -2.149261951446533, 'epoch': 0.04}
{'loss': 0.688, 'learning_rate': 9.999988436540154e-06, 'rewards/chosen': 0.009169521741569042, 'rewards/rejected': -0.001259879907593131, 'rewards/accuracies': 0.75, 'rewards/margins': 0.010429401881992817, 'logps/rejected': -95.37321472167969, 'logps/chosen': -82.72525787353516, 'logits/rejected': -2.132511615753174, 'logits/chosen': -2.094295024871826, 'epoch': 0.05}
{'loss': 0.6936, 'learning_rate': 9.999981932097907e-06, 'rewards/chosen': 0.0041534616611897945, 'rewards/rejected': 0.0049835206009447575, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': -0.0008300594054162502, 'logps/rejected': -78.36058044433594, 'logps/chosen': -69.1476058959961, 'logits/rejected': -2.12066650390625, 'logits/chosen': -2.0973222255706787, 'epoch': 0.06}
{'loss': 0.6902, 'learning_rate': 9.99997398222788e-06, 'rewards/chosen': 0.011761991307139397, 'rewards/rejected': 0.00546670937910676, 'rewards/accuracies': 0.5, 'rewards/margins': 0.006295280996710062, 'logps/rejected': -70.21249389648438, 'logps/chosen': -70.36653137207031, 'logits/rejected': -2.2433230876922607, 'logits/chosen': -2.2556514739990234, 'epoch': 0.07}
{'loss': 0.6944, 'learning_rate': 9.99996458693237e-06, 'rewards/chosen': -0.0023219208233058453, 'rewards/rejected': 6.135962757980451e-05, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': -0.0023832800798118114, 'logps/rejected': -70.10619354248047, 'logps/chosen': -79.71769714355469, 'logits/rejected': -2.219357967376709, 'logits/chosen': -2.200735330581665, 'epoch': 0.08}
{'loss': 0.6943, 'learning_rate': 9.999953746214097e-06, 'rewards/chosen': 0.0026049234438687563, 'rewards/rejected': 0.004713707137852907, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0021087839268147945, 'logps/rejected': -72.87063598632812, 'logps/chosen': -83.99541473388672, 'logits/rejected': -2.2576162815093994, 'logits/chosen': -2.206639051437378, 'epoch': 0.09}
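
For reference (not part of the original report): a near-constant loss around 0.69 is exactly what the standard sigmoid DPO loss produces when the reward margins are near zero. A minimal sketch, assuming the usual formulation loss = -log σ(margin), where margin is the β-scaled log-ratio difference that DPOTrainer reports as rewards/margins:

```python
import math

def dpo_sigmoid_loss(margin: float) -> float:
    # Default (sigmoid) DPO loss for an already beta-scaled reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(dpo_sigmoid_loss(0.0))     # 0.6931... = log(2): policy and reference agree, nothing learned yet
print(dpo_sigmoid_loss(0.0016))  # ~0.6923: matches the first log line above (margins ~0.0016, loss 0.6924)
print(dpo_sigmoid_loss(1.0))     # ~0.3133: the range expected once chosen responses are clearly preferred
```

So a loss stuck near log(2) ≈ 0.693 together with tiny margins means the policy is barely moving away from the reference model.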

tannonk commented Aug 28, 2023

This looks like the same issue as #699.

valkryhx (Author)

chatglm2 trained with LoRA + DPO is normal: the loss gets lower and then converges,
but QLoRA + DPO has the issue above.
Tested yesterday.
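
For context, a minimal sketch of a QLoRA policy-model setup like the one described (assumed; the model path, quantization config, and LoRA hyperparameters are illustrative, not taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hypothetical QLoRA setup; the reporter's exact configs are not shown in the thread.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/chatglm2-6b",
    quantization_config=bnb_config,
    trust_remote_code=True,  # chatglm2 ships custom modeling code
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # chatglm2 attention projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # this PeftModel is then passed to DPOTrainer
```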

kashif commented Aug 29, 2023

@valkryhx can you kindly try again with the main branch of TRL? Thank you!

valkryhx commented Aug 30, 2023

Really quick fix! I will try the new code.
Besides, I have some questions:
I load a PEFT model (without doing merge_and_unload), the way introduced in https://github.com/huggingface/blog/blob/main/dpo-trl.md
So I guess I now have a trainable model = PeftModel object (base_model + adapter).

But then, when I set ref_model=None in the DPOTrainer,
the DPOTrainer code uses model.disable_adapters() etc. (https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py#L347)
to act as the ref_model.
But in this way the SFT adapter is disabled, so only the frozen base_model is treated as the ref_model for the forward pass? I thought the adapter + base_model initially loaded by AutoPeftModelForCausalLM was both the starting point for training and the ref_model, but as the code shows, the SFT adapter does not seem to be used; only the base_model remains as the ref_model.

Am I misunderstanding something about the code?
Or does the AutoPeftModelForCausalLM class in fact do the merge, so that (adapter + base_model) becomes a new base_model object, and the DPOTrainer creates a new adapter for that new base_model?
Thank you in advance. @kashif @tannonk
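
For context, a minimal sketch of the setup being asked about, assuming the TRL/PEFT APIs at the time of this thread (the adapter path, dataset, and hyperparameters are placeholders, not from the thread). With ref_model=None and a PeftModel, the trainer computes the reference log-probabilities with the adapters disabled, i.e. from the frozen base model alone, which is exactly the behavior questioned above:

```python
from datasets import Dataset
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# AutoPeftModelForCausalLM loads base_model + adapter as a PeftModel; it does not merge them.
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/chatglm2-sft-qlora-adapter",  # hypothetical SFT adapter checkpoint
    is_trainable=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Placeholder preference data with the prompt/chosen/rejected columns DPOTrainer expects.
train_dataset = Dataset.from_dict({
    "prompt": ["Hello"],
    "chosen": ["Hi, how can I help you?"],
    "rejected": ["Go away."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PeftModel, the adapters are disabled to get reference log-probs
    beta=0.1,
    args=TrainingArguments(output_dir="dpo-chatglm2", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```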

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this as completed Nov 1, 2023