Enhancing Reward Function for MCTS in Marco-o1 #13

Open
johnhaofu opened this issue Nov 26, 2024 · 8 comments
@johnhaofu

The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities. While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations (a rough sketch of the scoring itself follows the list):

Local Optimality: Token-level probabilities may lead to paths that seem promising locally but fail to achieve global correctness.
Model Bias: The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
Context Insensitivity: The reward function does not evaluate the logical consistency of the tokens in the broader context of the reasoning path.
Lack of Task-Specificity: The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
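
For concreteness, here is a rough PyTorch sketch of the kind of token-level confidence scoring I mean. This is only my reading of the technical report, not the repository's actual code; the function name and the top-5 default are my own assumptions.

```python
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor, chosen_ids: torch.Tensor, top_k: int = 5) -> float:
    """Average per-token confidence of a reasoning path.

    logits: (seq_len, vocab_size) raw scores at each generated position.
    chosen_ids: (seq_len,) ids of the tokens that were actually generated.
    Each token's confidence is its probability renormalised against the
    top-k candidates at that step; the path reward is the mean over tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (seq_len, vocab)
    chosen_logp = log_probs.gather(1, chosen_ids.unsqueeze(1))   # (seq_len, 1)
    topk_logp, _ = log_probs.topk(top_k, dim=-1)                 # (seq_len, top_k)
    conf = chosen_logp.exp().squeeze(1) / topk_logp.exp().sum(dim=-1)
    return conf.mean().item()
```

A score like this rewards locally confident continuations, which is exactly why the limitations above can bite.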

@Sniper970119
Collaborator

Sniper970119 commented Nov 26, 2024

Thank you for your attention.

As you mention, and as we noted in our README, we have identified limitations in our current reward function, which is a significant constraint on the model's capabilities. From a test@k perspective, this has had a considerable impact on final performance.

Additionally, we are currently training our reward model. We believe that as the precision of the reward improves, the performance of our model will improve as well.

@johnhaofu
Author

Thank you for your detailed response and for acknowledging the limitations of the current reward function. It's great to know that you are already working on training a reward model to address this issue.

Given the impact of the reward function on the test@k performance, I believe that incorporating task-specific knowledge or logical rules into the reward evaluation could provide a significant boost. For example (a rough sketch follows the list):

Introducing global consistency checks to ensure reasoning paths align with task goals.
Integrating intermediate validation steps for complex tasks, such as math or multi-step reasoning problems, to evaluate sub-path correctness.
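
To make the second point concrete, here is a purely illustrative sketch of how a composite reward might blend the existing confidence score with a cheap task-specific check. The `check_arithmetic` helper and the blending weight `alpha` are hypothetical, not anything that exists in Marco-o1.

```python
import re

def composite_reward(confidence: float, reasoning_text: str, alpha: float = 0.5) -> float:
    """Blend model confidence with a task-specific consistency check."""
    return alpha * confidence + (1 - alpha) * check_arithmetic(reasoning_text)

def check_arithmetic(text: str) -> float:
    """Fraction of simple 'a op b = c' statements in the text that actually hold."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    stmts = re.findall(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)", text)
    if not stmts:
        return 1.0  # nothing to check; do not penalise the path
    correct = sum(ops[op](int(a), int(b)) == int(c) for a, op, b, c in stmts)
    return correct / len(stmts)
```

The same pattern could use a domain-specific validator for other tasks.
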
If possible, could you share more about the approach you are using to train the reward model? Are you focusing on supervised learning with labeled paths, or exploring reinforcement learning techniques? I'd love to hear your thoughts on how to balance model flexibility with precision in reward evaluation.

Looking forward to your insights and progress updates!

@Sniper970119
Collaborator

Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it.
Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL.
I hope this answer addresses your question.
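
Schematically, the idea is something like the sketch below: the ORM's outcome reward for each finished rollout is backed up along its search path, and the averaged per-step values become (noisy) labels for PRM training. This is only an illustration of the idea, not our actual training code; the `Node` fields and function names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    step_text: str                 # the reasoning step taken at this node
    parent: Optional["Node"] = None
    visits: int = 0
    value_sum: float = 0.0

def collect_prm_labels(scored_leaves: List[Tuple[Node, float]]) -> List[Tuple[str, float]]:
    """Back up the ORM outcome of each finished rollout along its path, so that
    every intermediate step inherits an averaged value usable as a PRM label."""
    touched = []
    for leaf, outcome in scored_leaves:       # outcome = ORM score of the full rollout
        node = leaf
        while node is not None:               # propagate the outcome up to the root
            node.visits += 1
            node.value_sum += outcome
            touched.append(node)
            node = node.parent
    # One label per distinct step: mean outcome of the rollouts passing through it.
    return [(n.step_text, n.value_sum / n.visits)
            for n in {id(n): n for n in touched}.values()]
```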

@johnhaofu
Author

Thank you for the clarification! Your plan of starting with ORM + MCTS and using tree search results as unsupervised labels for PRM training sounds solid. Excited to see how this develops!

@ywb2018

ywb2018 commented Nov 27, 2024

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

I'm curious how you use an ORM for tasks that don't have a standard answer (such as the translation task mentioned in your technical report).

@ccp123456789

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

How do you combine ORM and MCTS? Generally, MCTS requires a process reward and a judgment at each step.

@ZyangLee

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

> How do you combine ORM and MCTS? Generally, MCTS requires a process reward and a judgment at each step.

Maybe simply score the nodes of the Monte Carlo tree with UCB, as in existing work?
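
For example, the standard UCT rule, where each node's score is its mean backed-up reward plus an exploration bonus. Purely illustrative; it assumes a node object with `visits`, `value_sum`, and `children` fields, which are not from this repository.

```python
import math

def ucb_score(child, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCT: mean backed-up reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")                   # always try unvisited children first
    exploit = child.value_sum / child.visits  # e.g. mean ORM outcome of rollouts through this node
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent):
    """Descend the tree by picking the child with the highest UCB score."""
    return max(parent.children, key=lambda ch: ucb_score(ch, parent.visits))
```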

@ccp123456789

Why choose the base model Qwen2-7B-Instruct instead of Qwen2.5-7B-Instruct? I suspect it might be because Qwen2-7B-Instruct showed improvements in experiments, while Qwen2.5 did not show significant gains. I have also compared these two base models before, and Qwen2's base capabilities are significantly better than Qwen2.5's.
