[Fix] Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards #271
Conversation
Code Review
This pull request addresses an issue where pass_at_k metrics were not calculated for the agent_loop flow because rewards were emitted at the token level. The introduced logic correctly distinguishes between per-trajectory and per-token reward scenarios, formatting the reward as a single float where appropriate so that the pass_at_k calculation can proceed. The implementation is sound and effectively resolves the problem. I've included one suggestion to refactor a loop into a more concise and readable single-line expression.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request effectively addresses the issue of pass_at_k metrics not being computed for certain reward structures. The change in skyrl_train/generators/skyrl_gym_generator.py correctly distinguishes between per-trajectory and per-token rewards by checking if intermediate rewards are None. This ensures that per-trajectory rewards are passed as a single float, enabling the metric calculation. The updated tests in skyrl-train/tests/cpu/generators/test_skyrl_gym_generator.py are well-designed, using parameterization to validate both scenarios and confirming the fix. The code is clear, correct, and well-tested.
Revert "Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards" (#300)

Reverts #271 as it causes errors. For more, see #299 (comment).
Fixes #311.

### PRs around this issue

- `pass_at_n` was no longer computed for multi-turn rollouts after #226
- #271 fixed this by introducing a `None` reward, which is ill-defined and was later reverted

### This PR

- Deems the last turn's reward the entire trajectory's reward (with reward > 0 signifying a "pass") for the purpose of computing `pass@N`
- Adds documentation about (per-turn) rewards, metrics, and per-token reward conversion (for better intuition) in `Creating a New Environment or Task`, for lack of a better place to put it
- Adds a unit test and more documentation to the metric util
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`, since our `BaseTextEnvStepOutput.reward` is `float`, not `Optional[float]`. Also adds corresponding documentation stating that environments should return `0.0` as the reward for intermediate turns when not using turn-level rewards
- Also adds a minor fix to the `pass_at_n` computation so that negative rewards are taken into account. See #317 (comment) for more

### Test

Ran `run_gsm8k.sh` with:

- orange: `batched=True` (previously working already, since the batched codepath does not convert to per-token rewards)
- green: `batched=False` (the codepath where `pass_at_n` was not computed prior to this PR)
- grey: baseline from a previous stable PR's run

<img width="1101" height="574" alt="image" src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7" />

### Rendered doc

<img width="1116" height="933" alt="image" src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0" />

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
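The pass@N change described above — using the last turn's reward as the trajectory reward and counting only strictly positive rewards as a "pass" — can be sketched as follows. The function name and signature here are illustrative assumptions, not the repo's actual metric util:

```python
from collections import defaultdict
from typing import Dict, List, Union


def compute_pass_at_n(
    rewards: List[Union[float, List[float]]],
    uids: List[str],
) -> float:
    """Hypothetical sketch of pass@N over N rollouts per prompt uid.

    For multi-turn rollouts (reward given as a list of per-turn rewards),
    the last turn's reward stands in for the whole trajectory. A prompt
    counts as "passed" if any of its rollouts has reward > 0, so zero or
    negative rewards do not count as a pass.
    """
    by_uid: Dict[str, List[float]] = defaultdict(list)
    for reward, uid in zip(rewards, uids):
        # Deem the final turn's reward the trajectory's reward.
        traj_reward = reward[-1] if isinstance(reward, list) else reward
        by_uid[uid].append(traj_reward)
    passed = sum(1 for rs in by_uid.values() if any(r > 0 for r in rs))
    return passed / len(by_uid)
```

Grouping by uid is what makes this pass@N rather than a mean reward: each prompt contributes one pass/fail outcome, no matter how many rollouts it has.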
Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards (NovaSky-AI#271)

In NovaSky-AI#226, we started building per-token rewards for the `agent_loop()` codepath to enable per-step rewards. However, `get_metrics_from_generator_output()` does not compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves this by still using a per-trajectory reward when all intermediate rewards are None.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
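A minimal sketch of the idea behind this (later reverted) fix, assuming a hypothetical helper name rather than the repo's actual code: collapse per-step rewards to a single float when only the final step carries a reward, so the metric util takes the per-trajectory branch.

```python
from typing import List, Optional, Union


def format_trajectory_reward(
    per_step_rewards: List[Optional[float]],
) -> Union[float, List[Optional[float]]]:
    """Collapse per-step rewards to a single float when all intermediate
    rewards are None (i.e. only the final step is rewarded).

    Per-trajectory rewards are returned as a float so that pass_at_n can
    still be computed; genuine per-step rewards are kept as a list, for
    which pass_at_n is skipped.
    """
    if all(r is None for r in per_step_rewards[:-1]):
        # Per-trajectory case: only the last step carries a reward.
        final = per_step_rewards[-1]
        return float(final) if final is not None else 0.0
    # Genuine per-step rewards: keep the list unchanged.
    return per_step_rewards
```

As #317 later argued, the `None` sentinel this approach relies on is ill-defined, which is why the follow-up instead deems the last turn's reward the trajectory reward.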