I'm also interested in this! From your OpenR paper, I gather that a step is labeled + when its mc_value is greater than 0 (if I understand correctly), meaning the path can still lead to a correct answer. I'm not sure that's a good idea, though: other work [1] uses regression to predict the reward instead.
[1] Step-level Value Preference Optimization for Mathematical Reasoning
OmegaPRM and Math-Shepherd both seem to report that soft labels work better?
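To make the difference concrete, here is a rough sketch of the two labeling schemes being discussed. It is not OpenR's actual code: only `mc_value` comes from the thread, while the function names, the threshold parameter, and the sample values are made up for illustration. It assumes `mc_value` is the fraction of Monte Carlo rollouts from a step that reach a correct answer.

```python
import math

def hard_label(mc_value, threshold=0.0):
    # Hard labeling as described above: "+" (1) if any rollout succeeded.
    return 1 if mc_value > threshold else 0

def soft_label(mc_value):
    # Soft labeling: keep the Monte Carlo estimate itself as the target.
    return float(mc_value)

def binary_cross_entropy(pred, target, eps=1e-12):
    # BCE that accepts fractional targets; pred is clamped away from 0 and 1.
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

# Hypothetical mc_values for four steps (illustrative, not real data).
mc_values = [1.0, 0.5, 0.125, 0.0]
hard_labels = [hard_label(v) for v in mc_values]
soft_labels = [soft_label(v) for v in mc_values]

# A model that predicts 0.125 for the third step is penalized heavily
# under the hard label (target 1) but not under the soft label (target 0.125).
loss_hard = binary_cross_entropy(0.125, hard_labels[2])
loss_soft = binary_cross_entropy(0.125, soft_labels[2])
print(hard_labels, soft_labels, loss_hard, loss_soft)
```

With a threshold of 0, a step where only 1 in 8 rollouts succeeds gets the same "+" label as a step where every rollout succeeds, which is the information a soft or regression target would preserve.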