DPO with chatglm2-6b #701
Comments
This looks like the same issue as #699.
Training chatglm2 with LoRA + DPO is normal, and the loss can get lower and then converge.
@valkryhx can you kindly try now with the main branch of TRL? Thank you!
Really quick fix! I will try the new code. But what about when I set ref_model = None in the DPOTrainer? Do I misunderstand something about the code?
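For context, here is a minimal sketch of the two ways a reference model is usually given to the DPOTrainer. It is not the code from this issue: a small placeholder model is used so the snippet runs, whereas in this thread the policy is a chatglm2-6b QLoRA (PEFT) model, and keyword names follow the TRL versions from around this time (newer releases move some options into DPOConfig).

```python
# Minimal sketch, not the code from this issue. "gpt2" is a placeholder for
# the chatglm2-6b QLoRA policy; dataset contents are dummy values.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

model_name = "gpt2"  # placeholder for the chatglm2-6b QLoRA model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

policy = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny dummy preference dataset with the columns DPOTrainer expects.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is DPO?"],
    "chosen":   ["Direct Preference Optimization."],
    "rejected": ["No idea."],
})

args = TrainingArguments(output_dir="dpo-sketch",
                         per_device_train_batch_size=1,
                         report_to="none")

# Option 1: pass an explicit frozen reference model.
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
trainer = DPOTrainer(model=policy, ref_model=ref_model, args=args,
                     train_dataset=train_dataset, tokenizer=tokenizer)

# Option 2: ref_model=None. When the policy is a PeftModel (the QLoRA case
# in this issue), TRL uses the base weights with adapters disabled as the
# implicit reference; for other setups the behaviour depends on the TRL version.
trainer = DPOTrainer(model=policy, ref_model=None, args=args,
                     train_dataset=train_dataset, tokenizer=tokenizer)
```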
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
When training a chatglm2-6b model (SFT'd with QLoRA) with DPO, the loss does not seem to get lower.
Below is part of the training log. I find there is only a small jitter in the training loss, and the rewards/margins seem really, really small.
Is this the expected behavior?
Also, if I set ref_model=None in the DPOTrainer, then during eval the eval_loss, eval_rewards, etc. always stay the same throughout training. It does not seem right...
{'loss': 0.6924, 'learning_rate': 5e-06, 'rewards/chosen': 0.008880767971277237, 'rewards/rejected': 0.007253284566104412, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': 0.0016274836380034685, 'logps/rejected': -82.50346374511719, 'logps/chosen': -81.14619445800781, 'logits/rejected': -2.220956802368164, 'logits/chosen': -2.227485179901123, 'epoch': 0.01}
{'loss': 0.6951, 'learning_rate': 1e-05, 'rewards/chosen': 0.006737555377185345, 'rewards/rejected': 0.010529976338148117, 'rewards/accuracies': 0.5, 'rewards/margins': -0.0037924200296401978, 'logps/rejected': -98.62718200683594, 'logps/chosen': -75.55088806152344, 'logits/rejected': -2.2945332527160645, 'logits/chosen': -2.1720192432403564, 'epoch': 0.02}
{'loss': 0.6942, 'learning_rate': 9.9999992772835e-06, 'rewards/chosen': -0.0009587286040186882, 'rewards/rejected': 0.0010564230615273118, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0020151520147919655, 'logps/rejected': -86.17352294921875, 'logps/chosen': -69.18254089355469, 'logits/rejected': -2.2097086906433105, 'logits/chosen': -2.1535096168518066, 'epoch': 0.03}
{'loss': 0.6928, 'learning_rate': 9.999997109134203e-06, 'rewards/chosen': 0.0020176221150904894, 'rewards/rejected': 0.0011578561970964074, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': 0.0008597659762017429, 'logps/rejected': -67.88219451904297, 'logps/chosen': -85.24455261230469, 'logits/rejected': -2.137803792953491, 'logits/chosen': -2.1222000122070312, 'epoch': 0.03}
{'loss': 0.687, 'learning_rate': 9.99999349555274e-06, 'rewards/chosen': 0.0029097364749759436, 'rewards/rejected': -0.00971378292888403, 'rewards/accuracies': 0.699999988079071, 'rewards/margins': 0.012623520568013191, 'logps/rejected': -88.10517883300781, 'logps/chosen': -79.19243621826172, 'logits/rejected': -2.200411081314087, 'logits/chosen': -2.149261951446533, 'epoch': 0.04}
{'loss': 0.688, 'learning_rate': 9.999988436540154e-06, 'rewards/chosen': 0.009169521741569042, 'rewards/rejected': -0.001259879907593131, 'rewards/accuracies': 0.75, 'rewards/margins': 0.010429401881992817, 'logps/rejected': -95.37321472167969, 'logps/chosen': -82.72525787353516, 'logits/rejected': -2.132511615753174, 'logits/chosen': -2.094295024871826, 'epoch': 0.05}
{'loss': 0.6936, 'learning_rate': 9.999981932097907e-06, 'rewards/chosen': 0.0041534616611897945, 'rewards/rejected': 0.0049835206009447575, 'rewards/accuracies': 0.550000011920929, 'rewards/margins': -0.0008300594054162502, 'logps/rejected': -78.36058044433594, 'logps/chosen': -69.1476058959961, 'logits/rejected': -2.12066650390625, 'logits/chosen': -2.0973222255706787, 'epoch': 0.06}
{'loss': 0.6902, 'learning_rate': 9.99997398222788e-06, 'rewards/chosen': 0.011761991307139397, 'rewards/rejected': 0.00546670937910676, 'rewards/accuracies': 0.5, 'rewards/margins': 0.006295280996710062, 'logps/rejected': -70.21249389648438, 'logps/chosen': -70.36653137207031, 'logits/rejected': -2.2433230876922607, 'logits/chosen': -2.2556514739990234, 'epoch': 0.07}
{'loss': 0.6944, 'learning_rate': 9.99996458693237e-06, 'rewards/chosen': -0.0023219208233058453, 'rewards/rejected': 6.135962757980451e-05, 'rewards/accuracies': 0.6000000238418579, 'rewards/margins': -0.0023832800798118114, 'logps/rejected': -70.10619354248047, 'logps/chosen': -79.71769714355469, 'logits/rejected': -2.219357967376709, 'logits/chosen': -2.200735330581665, 'epoch': 0.08}
{'loss': 0.6943, 'learning_rate': 9.999953746214097e-06, 'rewards/chosen': 0.0026049234438687563, 'rewards/rejected': 0.004713707137852907, 'rewards/accuracies': 0.44999998807907104, 'rewards/margins': -0.0021087839268147945, 'logps/rejected': -72.87063598632812, 'logps/chosen': -83.99541473388672, 'logits/rejected': -2.2576162815093994, 'logits/chosen': -2.206639051437378, 'epoch': 0.09}
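For reference, a loss hovering around 0.69 is exactly what the standard sigmoid DPO objective gives when the chosen-vs-rejected reward margin is near zero, since -log sigmoid(0) = log 2 ≈ 0.693. Below is a small sketch of that loss (not the trainer's internal code); the reference log-probs are made-up values chosen only so the rewards match the first log line above.

```python
# Sketch of the standard sigmoid DPO loss, to show why the logged loss stays
# near log(2) ≈ 0.693 while rewards/margins hover around zero.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta * (policy log-prob - reference log-prob).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    margins = chosen_rewards - rejected_rewards
    loss = -F.logsigmoid(margins).mean()
    return loss, margins

# Policy log-probs taken from the first log line; reference log-probs are
# hypothetical values picked so the rewards roughly match that line.
loss, margins = dpo_loss(
    policy_chosen_logps=torch.tensor([-81.146]),
    policy_rejected_logps=torch.tensor([-82.503]),
    ref_chosen_logps=torch.tensor([-81.235]),
    ref_rejected_logps=torch.tensor([-82.576]),
)
print(margins.item(), loss.item())  # margin ≈ 0.0016, loss ≈ 0.692 (barely below log 2)
```

In other words, the near-constant loss and tiny rewards/margins mean the policy has barely moved away from the reference model, which is the symptom being reported here.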