Enhancing Reward Function for MCTS in Marco-o1 #13

Open
johnhaofu opened this issue Nov 26, 2024 · 8 comments
@johnhaofu

The current reward function in Marco-o1's MCTS implementation relies solely on token-level confidence scores derived from the model's output probabilities. While this method provides a straightforward way to evaluate reasoning paths, it has notable limitations (a rough sketch of the scoring itself follows the list):

Local Optimality: Token-level probabilities may lead to paths that seem promising locally but fail to achieve global correctness.
Model Bias: The model's inherent biases might result in overconfidence in certain common patterns, misguiding the search process.
Context Insensitivity: The reward function does not evaluate the logical consistency of the tokens in the broader context of the reasoning path.
Lack of Task-Specificity: The reward function is generic and does not incorporate domain-specific knowledge or logical rules pertinent to the task.
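
For concreteness, here is a rough PyTorch sketch of the kind of token-level confidence scoring I mean. This is only my reading of the technical report, not the repository's actual code; the function name and the top-5 default are my own assumptions.

```python
import torch
import torch.nn.functional as F

def confidence_reward(logits: torch.Tensor, chosen_ids: torch.Tensor, top_k: int = 5) -> float:
    """Average per-token confidence of a reasoning path.

    logits: (seq_len, vocab_size) raw scores at each generated position.
    chosen_ids: (seq_len,) ids of the tokens that were actually generated.
    Each token's confidence is its probability renormalised against the
    top-k candidates at that step; the path reward is the mean over tokens.
    """
    log_probs = F.log_softmax(logits, dim=-1)                    # (seq_len, vocab)
    chosen_logp = log_probs.gather(1, chosen_ids.unsqueeze(1))   # (seq_len, 1)
    topk_logp, _ = log_probs.topk(top_k, dim=-1)                 # (seq_len, top_k)
    conf = chosen_logp.exp().squeeze(1) / topk_logp.exp().sum(dim=-1)
    return conf.mean().item()
```

A score like this rewards locally confident continuations, which is exactly why the limitations above can bite.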

@Sniper970119
Collaborator

Sniper970119 commented Nov 26, 2024

Thank you for your attention.

As you mention, and as we noted in our README, we have identified limitations in our current reward function, which is a significant constraint on the model's capabilities. From a test@k perspective, this has had a considerable impact on final performance.

Additionally, we are currently training our reward model. We believe that as the precision of the reward improves, the performance of our model will improve as well.

@johnhaofu
Author

Thank you for your detailed response and for acknowledging the limitations of the current reward function. It's great to know that you are already working on training a reward model to address this issue.

Given the impact of the reward function on the test@k performance, I believe that incorporating task-specific knowledge or logical rules into the reward evaluation could provide a significant boost. For example (a rough sketch follows the list):

Introducing global consistency checks to ensure reasoning paths align with task goals.
Integrating intermediate validation steps for complex tasks, such as math or multi-step reasoning problems, to evaluate sub-path correctness.
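
To make the second point concrete, here is a purely illustrative sketch of how a composite reward might blend the existing confidence score with a cheap task-specific check. The `check_arithmetic` helper and the blending weight `alpha` are hypothetical, not anything that exists in Marco-o1.

```python
import re

def composite_reward(confidence: float, reasoning_text: str, alpha: float = 0.5) -> float:
    """Blend model confidence with a task-specific consistency check."""
    return alpha * confidence + (1 - alpha) * check_arithmetic(reasoning_text)

def check_arithmetic(text: str) -> float:
    """Fraction of simple 'a op b = c' statements in the text that actually hold."""
    ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
    stmts = re.findall(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)", text)
    if not stmts:
        return 1.0  # nothing to check; do not penalise the path
    correct = sum(ops[op](int(a), int(b)) == int(c) for a, op, b, c in stmts)
    return correct / len(stmts)
```

The same pattern could use a domain-specific validator for other tasks.
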
If possible, could you share more about the approach you are using to train the reward model? Are you focusing on supervised learning with labeled paths, or exploring reinforcement learning techniques? I'd love to hear your thoughts on how to balance model flexibility with precision in reward evaluation.

Looking forward to your insights and progress updates!

@Sniper970119
Collaborator

Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it.
Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL.
I hope this answer addresses your question.
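
Schematically, the idea is something like the sketch below: the ORM's outcome reward for each finished rollout is backed up along its search path, and the averaged per-step values become (noisy) labels for PRM training. This is only an illustration of the idea, not our actual training code; the `Node` fields and function names are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    step_text: str                 # the reasoning step taken at this node
    parent: Optional["Node"] = None
    visits: int = 0
    value_sum: float = 0.0

def collect_prm_labels(scored_leaves: List[Tuple[Node, float]]) -> List[Tuple[str, float]]:
    """Back up the ORM outcome of each finished rollout along its path, so that
    every intermediate step inherits an averaged value usable as a PRM label."""
    touched = []
    for leaf, outcome in scored_leaves:       # outcome = ORM score of the full rollout
        node = leaf
        while node is not None:               # propagate the outcome up to the root
            node.visits += 1
            node.value_sum += outcome
            touched.append(node)
            node = node.parent
    # One label per distinct step: mean outcome of the rollouts passing through it.
    return [(n.step_text, n.value_sum / n.visits)
            for n in {id(n): n for n in touched}.values()]
```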

@johnhaofu
Author

Thank you for the clarification! Your plan of starting with ORM + MCTS and using tree search results as unsupervised labels for PRM training sounds solid. Excited to see how this develops!

@ywb2018

ywb2018 commented Nov 27, 2024

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

I'm curious how you use an ORM for tasks that don't have a standard answer (such as the translation task mentioned in your technical report).

@ccp123456789

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

How do you combine ORM and MCTS? Generally, MCTS requires a process reward and a judgment at each step.

@ZyangLee

> Initially, we still plan to use ORM + MCTS, as this type of data is relatively easy to obtain and we already have a considerable amount of it. Meanwhile, these tree search data can be used as unsupervised labels to train the PRM. After collecting some data, we can train our PRM. Our ultimate goal is MCTS+PRM+RL. I hope this answer addresses your question.

> How do you combine ORM and MCTS? Generally, MCTS requires a process reward and a judgment at each step.

Maybe simply score the nodes of the Monte Carlo tree with UCB, as in existing work?
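
For example, the standard UCT rule, where each node's score is its mean backed-up reward plus an exploration bonus. Purely illustrative; it assumes a node object with `visits`, `value_sum`, and `children` fields, which are not from this repository.

```python
import math

def ucb_score(child, parent_visits: int, c: float = 1.41) -> float:
    """Standard UCT: mean backed-up reward plus an exploration bonus."""
    if child.visits == 0:
        return float("inf")                   # always try unvisited children first
    exploit = child.value_sum / child.visits  # e.g. mean ORM outcome of rollouts through this node
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

def select_child(parent):
    """Descend the tree by picking the child with the highest UCB score."""
    return max(parent.children, key=lambda ch: ucb_score(ch, parent.visits))
```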

@ccp123456789

Why choose the base model Qwen2-7B-Instruct instead of Qwen2.5-7B-Instruct? I suspect it might be because Qwen2-7B-Instruct showed improvements in experiments, while Qwen2.5 did not show significant gains. I have also compared these two base models before, and Qwen2's base capabilities are significantly better than Qwen2.5's.
