
DPO with chatglm2-6b #701

Closed
valkryhx opened this issue Aug 27, 2023 · 5 comments

valkryhx commented Aug 27, 2023

When training a chatglm2-6b model (SFT'd with QLoRA) with DPO, the loss does not seem to go down.

Below is part of the training log. There is only a small jitter in the training loss, and the rewards/margins stay really, really small.

Is this the expected behavior?

Also, if I set ref_model=None in the DPOTrainer, then during eval the eval_loss, eval_rewards, etc. always stay the same over the course of training. That does not seem right...

{'loss': 0.6924, 'learning_rate': 5e-06, 'rewards/chosen': 0.008880767971277237, 'rewards/rejected': 0.007253284566104412, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': 0.0016274836380034685, 'logps/rejected': -82.50346374511719, 'logps/chosen': -81.14619445800781, 'logits/rejected': -2.220956802368164, 'logits/chosen': -2.227485179901123, 'epoch': 0.01}
{'loss': 0.6951, 'learning_rate': 1e-05, 'rewards/chosen': 0.006737555377185345, 'rewards/rejected': 0.010529976338148117, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0037924200296401978, 'logps/rejected': -98.62718200683594, 'logps/chosen': -75.55088806152344, 'logits/rejected': -2.2945332527160645, 'logits/chosen': -2.1720192432403564, 'epoch': 0.02}
{'loss': 0.6942, 'learning_rate': 9.9999992772835e-06, 'rewards/chosen': -0.0009587286040186882, 'rewards/rejected': 0.0010564230615273118, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0020151520147919655, 'logps/rejected': -86.17352294921875, 'logps/chosen': -69.18254089355469, 'logits/rejected': -2.2097086906433105, 'logits/chosen': -2.1535096168518066, 'epoch': 0.03}
{'loss': 0.6928, 'learning_rate': 9.999997109134203e-06, 'rewards/chosen': 0.0020176221150904894, 'rewards/rejected': 0.0011578561970964074, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 0.0008597659762017429, 'logps/rejected': -67.88219451904297, 'logps/chosen': -85.24455261230469, 'logits/rejected': -2.137803792953491, 'logits/chosen': -2.1222000122070312, 'epoch': 0.03}
{'loss': 0.687, 'learning_rate': 9.99999349555274e-06, 'rewards/chosen': 0.0029097364749759436, 'rewards/rejected': -0.00971378292888403, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 0.012623520568013191, 'logps/rejected': -88.10517883300781, 'logps/chosen': -79.19243621826172, 'logits/rejected': -2.200411081314087, 'logits/chosen': -2.149261951446533, 'epoch': 0.04}
{'loss': 0.688, 'learning_rate': 9.999988436540154e-06, 'rewards/chosen': 0.009169521741569042, 'rewards/rejected': -0.001259879907593131, 'rewards/accuracies': 0.75, 'rewards/margins': 0.010429401881992817, 'logps/rejected': -95.37321472167969, 'logps/chosen': -82.72525787353516, 'logits/rejected': -2.132511615753174, 'logits/chosen': -2.094295024871826, 'epoch': 0.05}
{'loss': 0.6936, 'learning_rate': 9.999981932097907e-06, 'rewards/chosen': 0.0041534616611897945, 'rewards/rejected': 0.0049835206009447575, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': -0.0008300594054162502, 'logps/rejected': -78.36058044433594, 'logps/chosen': -69.1476058959961, 'logits/rejected': -2.12066650390625, 'logits/chosen': -2.0973222255706787, 'epoch': 0.06}
{'loss': 0.6902, 'learning_rate': 9.99997398222788e-06, 'rewards/chosen': 0.011761991307139397, 'rewards/rejected': 0.00546670937910676, 'rewards/accuracies': 0.5, 'rewards/margins': 0.006295280996710062, 'logps/rejected': -70.21249389648438, 'logps/chosen': -70.36653137207031, 'logits/rejected': -2.2433230876922607, 'logits/chosen': -2.2556514739990234, 'epoch': 0.07}
{'loss': 0.6944, 'learning_rate': 9.99996458693237e-06, 'rewards/chosen': -0.0023219208233058453, 'rewards/rejected': 6.135962757980451e-05, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': -0.0023832800798118114, 'logps/rejected': -70.10619354248047, 'logps/chosen': -79.71769714355469, 'logits/rejected': -2.219357967376709, 'logits/chosen': -2.200735330581665, 'epoch': 0.08}
{'loss': 0.6943, 'learning_rate': 9.999953746214097e-06, 'rewards/chosen': 0.0026049234438687563, 'rewards/rejected': 0.004713707137852907, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0021087839268147945, 'logps/rejected': -72.87063598632812, 'logps/chosen': -83.99541473388672, 'logits/rejected': -2.2576162815093994, 'logits/chosen': -2.206639051437378, 'epoch': 0.09}
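
For reference (not part of the original report): a near-constant loss around 0.69 is exactly what the standard sigmoid DPO loss produces when the reward margins are near zero. A minimal sketch, assuming the usual formulation loss = -log σ(margin), where margin is the β-scaled log-ratio difference that DPOTrainer reports as rewards/margins:

```python
import math

def dpo_sigmoid_loss(margin: float) -> float:
    # Default (sigmoid) DPO loss for an already beta-scaled reward margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(dpo_sigmoid_loss(0.0))     # 0.6931... = log(2): policy and reference agree, nothing learned yet
print(dpo_sigmoid_loss(0.0016))  # ~0.6923: matches the first log line above (margins ~0.0016, loss 0.6924)
print(dpo_sigmoid_loss(1.0))     # ~0.3133: the range expected once chosen responses are clearly preferred
```

So a loss stuck near log(2) ≈ 0.693 together with tiny margins means the policy is barely moving away from the reference model.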

tannonk commented Aug 28, 2023

This looks like the same issue as #699.

valkryhx (Author)

chatglm2 trained with LoRA + DPO is normal: the loss gets lower and then converges,
but QLoRA + DPO has the issue above.
Tested yesterday.
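
For context, a minimal sketch of a QLoRA policy-model setup like the one described (assumed; the model path, quantization config, and LoRA hyperparameters are illustrative, not taken from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Hypothetical QLoRA setup; the reporter's exact configs are not shown in the thread.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/chatglm2-6b",
    quantization_config=bnb_config,
    trust_remote_code=True,  # chatglm2 ships custom modeling code
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # chatglm2 attention projection layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # this PeftModel is then passed to DPOTrainer
```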

kashif commented Aug 29, 2023

@valkryhx can you kindly try again with the main branch of TRL? Thank you!

valkryhx commented Aug 30, 2023

Really quick fix! I will try the new code.
Besides, I have some questions:
I load a PEFT model (without doing merge_and_unload), the way introduced in https://github.com/huggingface/blog/blob/main/dpo-trl.md
So I guess I now have a trainable model = PeftModel object (base_model + adapter).

But then, when I set ref_model=None in the DPOTrainer,
the DPOTrainer code uses model.disable_adapters() etc. (https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py#L347)
to act as the ref_model.
But in this way the SFT adapter is disabled, so only the frozen base_model is treated as the ref_model for the forward pass? I thought the adapter + base_model initially loaded by AutoPeftModelForCausalLM was both the starting point for training and the ref_model, but as the code shows, the SFT adapter does not seem to be used; only the base_model remains as the ref_model.

Am I misunderstanding something about the code?
Or does the AutoPeftModelForCausalLM class in fact do the merge, so that (adapter + base_model) becomes a new base_model object, and the DPOTrainer creates a new adapter for that new base_model?
Thank you in advance. @kashif @tannonk
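
For context, a minimal sketch of the setup being asked about, assuming the TRL/PEFT APIs at the time of this thread (the adapter path, dataset, and hyperparameters are placeholders, not from the thread). With ref_model=None and a PeftModel, the trainer computes the reference log-probabilities with the adapters disabled, i.e. from the frozen base model alone, which is exactly the behavior questioned above:

```python
from datasets import Dataset
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, TrainingArguments
from trl import DPOTrainer

# AutoPeftModelForCausalLM loads base_model + adapter as a PeftModel; it does not merge them.
model = AutoPeftModelForCausalLM.from_pretrained(
    "path/to/chatglm2-sft-qlora-adapter",  # hypothetical SFT adapter checkpoint
    is_trainable=True,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)

# Placeholder preference data with the prompt/chosen/rejected columns DPOTrainer expects.
train_dataset = Dataset.from_dict({
    "prompt": ["Hello"],
    "chosen": ["Hi, how can I help you?"],
    "rejected": ["Go away."],
})

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # with a PeftModel, the adapters are disabled to get reference log-probs
    beta=0.1,
    args=TrainingArguments(output_dir="dpo-chatglm2", per_device_train_batch_size=1),
    train_dataset=train_dataset,
    tokenizer=tokenizer,
)
```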

github-actions (bot)

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this as completed Nov 1, 2023