Hi, I am confused: the loss function of ChatGPT's reward model takes the difference between the rewards of two responses and passes it through a sigmoid. However, the loss function in this repo takes only one response as input and uses its ranking score as the label for a cross-entropy (CE) loss. Is there an advantage to this?
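For concreteness, here is a minimal sketch of the two formulations being contrasted: the pairwise (InstructGPT-style) loss, which pushes the sigmoid of the reward difference between a preferred and a rejected response toward 1, versus a pointwise loss that scores a single sequence against a scalar or binned label. The function names, tensor shapes, and the choice of MSE vs. binned cross-entropy below are illustrative assumptions, not the actual API of either codebase.

```python
import torch
import torch.nn.functional as F

# Pairwise (InstructGPT-style) objective: score a preferred ("chosen") and a
# rejected response for the same prompt, then minimize -log(sigmoid(r_chosen - r_rejected)).
def pairwise_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # reward_chosen, reward_rejected: shape (batch,)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Pointwise objective, regression flavor: a single response carries a scalar
# reward label and the model is regressed against it.
def pointwise_mse_loss(predicted_reward: torch.Tensor, labeled_reward: torch.Tensor) -> torch.Tensor:
    # predicted_reward, labeled_reward: shape (batch,)
    return F.mse_loss(predicted_reward, labeled_reward)

# Pointwise objective, cross-entropy flavor: the ranking score is treated as a
# class index over discrete reward bins.
def pointwise_ce_loss(reward_logits: torch.Tensor, ranking_label: torch.Tensor) -> torch.Tensor:
    # reward_logits: shape (batch, num_ranking_bins); ranking_label: shape (batch,) of class indices
    return F.cross_entropy(reward_logits, ranking_label)
```

The practical difference is that the pairwise loss only needs relative human judgments ("A is better than B"), while the pointwise losses need an absolute reward value per sequence.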
@huzechuan i have to admit i haven't totally digested the way they derive their reward values for training
but at the moment, even if their reward is derived from a collection of sampled responses, this repository doesn't lock you into any one method, as you can do your second step (training the reward model) from any <sequence, reward value> pair, which you define
i guess i'll have to worry about this once i build out the application for sampling from some version of the model and collecting the ratings, so do let me know in detail the optimal approach they discovered. i just think there are other applications beyond text that this could be used for (rl, protein design) that do not necessarily need this sigmoid-of-difference approach
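To connect that with the question above: one way (an assumption on my part, not something the repo prescribes) to feed pairwise preference ratings into a per-sequence training interface is to expand each comparison into two <sequence, reward value> pairs, with the preferred response assigned the higher scalar. A rough sketch, where the 1.0/0.0 values are arbitrary placeholders:

```python
# Hypothetical glue: turn pairwise preference data into <sequence, reward value>
# pairs that a per-sequence reward-model trainer could consume.
def comparisons_to_pairs(comparisons):
    # comparisons: iterable of (prompt, chosen_response, rejected_response) strings
    pairs = []
    for prompt, chosen, rejected in comparisons:
        pairs.append((prompt + chosen, 1.0))    # preferred response -> higher reward
        pairs.append((prompt + rejected, 0.0))  # rejected response  -> lower reward
    return pairs
```

Whether this pointwise relabeling matches the behavior of the true pairwise objective is exactly the open question in this thread; it is one possible bridge, not a claim of equivalence.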