Hi, I am confused: the loss function of ChatGPT's reward model takes the difference between the rewards of two responses and passes it through a sigmoid. However, the loss function in this repo takes only one response as input and uses its ranking score as the label for a cross-entropy (CE) loss. Is there an advantage to this?
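For concreteness, here is a minimal sketch of the two formulations being contrasted: the pairwise (InstructGPT-style) loss, which pushes the sigmoid of the reward difference between a preferred and a rejected response toward 1, versus a pointwise loss that scores a single sequence against a scalar or binned label. The function names, tensor shapes, and the choice of MSE vs. binned cross-entropy below are illustrative assumptions, not the actual API of either codebase.

```python
import torch
import torch.nn.functional as F

# Pairwise (InstructGPT-style) objective: score a preferred ("chosen") and a
# rejected response for the same prompt, then minimize -log(sigmoid(r_chosen - r_rejected)).
def pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # reward_chosen, reward_rejected: shape (batch,)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Pointwise objective, regression flavor: a single response carries a scalar
# reward label and the model is regressed against it.
def pointwise_mse_loss(predicted_reward: torch.Tensor, labeled_reward: torch.Tensor) -> torch.Tensor:
    # predicted_reward, labeled_reward: shape (batch,)
    return F.mse_loss(predicted_reward, labeled_reward)

# Pointwise objective, cross-entropy flavor: the ranking score is treated as a
# class index over discrete reward bins.
def pointwise_ce_loss(reward_logits: torch.Tensor, ranking_label: torch.Tensor) -> torch.Tensor:
    # reward_logits: shape (batch, num_ranking_bins); ranking_label: shape (batch,) of class indices
    return F.cross_entropy(reward_logits, ranking_label)
```

The practical difference is that the pairwise loss only needs relative human judgments ("A is better than B"), while the pointwise losses need an absolute reward value per sequence.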
@huzechuan i have to admit i haven't totally digested the way they derive their reward values for training
but at the moment, even if their reward is derived from a collection of sampled responses, this repository doesn't lock you into any one method, as you can do your second step (training the reward model) from any <sequence, reward value> pair, which you define
i guess i'll have to worry about this once i build out the application for sampling from some version of the model and collecting the ratings, so do let me know in detail the optimal approach they discovered. i just think there are other applications beyond text that this could be used for (rl, protein design) that do not necessarily need this sigmoid-of-difference approach
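To connect that with the question above: one way (an assumption on my part, not something the repo prescribes) to feed pairwise preference ratings into a per-sequence training interface is to expand each comparison into two <sequence, reward value> pairs, with the preferred response assigned the higher scalar. A rough sketch, where the 1.0/0.0 values are arbitrary placeholders:

```python
# Hypothetical glue: turn pairwise preference data into <sequence, reward value>
# pairs that a per-sequence reward-model trainer could consume.
def comparisons_to_pairs(comparisons):
    # comparisons: iterable of (prompt, chosen_response, rejected_response) strings
    pairs = []
    for prompt, chosen, rejected in comparisons:
        pairs.append((prompt + chosen, 1.0))    # preferred response -> higher reward
        pairs.append((prompt + rejected, 0.0))  # rejected response  -> lower reward
    return pairs
```

Whether this pointwise relabeling matches the behavior of the true pairwise objective is exactly the open question in this thread; it is one possible bridge, not a claim of equivalence.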