Reward verification and evaluation fixes #55
Conversation
You can return any value here: the model generates a group of completions for one prompt, these completions are rewarded, and the rewards are then normalised. If all completions get the same reward (because the gold for their common prompt is not parsable here), the normalisation outputs a tensor of 0s, so there is no gradient.
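As a minimal sketch (not the exact TRL implementation; the group size and epsilon value are made up), this is why identical rewards within a group give no learning signal under GRPO's group-wise normalisation:

import torch

# Every completion for this prompt received the same reward.
rewards = torch.tensor([1.0, 1.0, 1.0, 1.0])

# GRPO normalises rewards per group of completions sharing one prompt.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)  # tensor([0., 0., 0., 0.]) -> no gradient from this prompt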
Nice!!
I approve the GRPO part
One comment about abstracting away how the reward is calculated from the internals of the loop:
        reward = float(verify(answer, parse(sol)))
    except Exception:  # if it fails for any reason, return 0.0
        reward = 0.0
gold_parsed = parse(sol, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
I'd move l46-l73 to another method named get_reward
def get_reward(content: str, solution: str) -> float:
    """
    Calculate reward by comparing the parsed content with the ground truth solution.

    Args:
        content (str): The model's output content to evaluate
        solution (str): The ground truth solution to compare against

    Returns:
        float: 1.0 if the content matches the solution, 0.0 otherwise
    """
    gold_parsed = parse(solution, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])
    if len(gold_parsed) == 0:
        # Gold not parseable: give every completion the same reward so the
        # prompt contributes no gradient after group normalisation.
        return 1.0
    answer_parsed = parse(
        content,
        extraction_config=[
            LatexExtractionConfig(
                normalization_config=NormalizationConfig(
                    nits=False,
                    malformed_operators=False,
                    basic_latex=True,
                    equations=True,
                    boxed=True,
                    units=True,
                ),
                boxed_match_priority=0,
                try_extract_without_anchor=False,
            )
        ],
        extraction_mode="first_match",
    )
    return float(verify(answer_parsed, gold_parsed))
Then the loop on line 45 changes to:
def accuracy_reward(completions, solution, **kwargs):
    """Reward function that checks if the completion is the same as the ground truth."""
    contents = [completion[0]["content"] for completion in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        reward = get_reward(content, sol)
        rewards.append(reward)
    return rewards
This improves readability.
Think it's fine for now, let's move it once we add stuff for code and other rewards.
Thanks for the fix on AIME and improvement to GRPO! Both changes LGTM - feel free to merge with @qgallouedec's suggestion
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
1. The answer from the model must be clearly preceded by "answer is xxx" or be in a \boxed env (with the \boxed env tried first); see the sketch after this list.
2. No normalisation of malformed operators or nits is applied to the resulting LaTeX, meaning the model has to output correct LaTeX.
3. The gold must be in LaTeX format; this ensures that the extracted gold is exactly what we want to compare against.
4. Removed the try/except; it's handled in math-verify now.
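For illustration only (not code from this PR; the sample strings are made up and the config values mirror the snippet above), a minimal sketch of how math-verify behaves under these constraints:

from math_verify import LatexExtractionConfig, parse, verify

# The gold must be LaTeX so the extracted target is exactly what we compare against.
gold = parse(r"$\frac{1}{2}$", extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])

# The \boxed answer is tried first; a bare number in prose would not be extracted
# because try_extract_without_anchor is disabled.
completion = r"The probability is therefore \boxed{\frac{1}{2}}."
answer = parse(
    completion,
    extraction_mode="first_match",
    extraction_config=[LatexExtractionConfig(boxed_match_priority=0, try_extract_without_anchor=False)],
)

# No try/except needed: math-verify handles parse failures internally
# (parse simply returns an empty list when nothing can be extracted).
print(float(verify(answer, gold)))  # expected to print 1.0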
cc @qgallouedec, I am not sure whether returning 1 if the gold is not found is the correct way to do this. It assumes that if all rewards are 1, no gradient change is enforced; is that correct?