[Fix] Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards #271
Conversation
Code Review
This pull request addresses an issue where pass_at_k metrics were not calculated for the agent_loop flow because rewards were emitted at the token level. The introduced logic correctly distinguishes between per-trajectory and per-token reward scenarios, formatting the reward as a single float where appropriate so that the pass_at_k calculation can proceed. The implementation is sound and effectively resolves the problem. I've included one suggestion to refactor a loop into a more concise and readable single-line expression.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
/gemini review
Code Review
This pull request effectively addresses the issue of pass_at_k metrics not being computed for certain reward structures. The change in skyrl_train/generators/skyrl_gym_generator.py correctly distinguishes between per-trajectory and per-token rewards by checking if intermediate rewards are None. This ensures that per-trajectory rewards are passed as a single float, enabling the metric calculation. The updated tests in skyrl-train/tests/cpu/generators/test_skyrl_gym_generator.py are well-designed, using parameterization to validate both scenarios and confirming the fix. The code is clear, correct, and well-tested.
Revert "Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards" (#300)

Reverts #271 as it causes errors. For more, see #299 (comment).
Fixes #311.

### PRs around this issue

- `pass_at_n` was no longer computed for multi-turn rollouts after #226
- #271 fixed this by introducing a `None` reward, which is ill-defined and was later reverted

### This PR

- Deems the last turn's reward the entire trajectory's reward (with reward > 0 signifying a "pass") for the purpose of computing `pass@N`
- Adds documentation about (per-turn) rewards, metrics, and per-token reward conversion (for better intuition) in `Creating a New Environment or Task`, for lack of a better place to put it
- Adds a unit test and more documentation to the metric util
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`, since our `BaseTextEnvStepOutput.reward` is `float`, not `Optional[float]`. Also adds corresponding documentation stating that environments should return `0.0` as the reward for intermediate turns when not using turn-level rewards
- Also adds a minor fix to the `pass_at_n` computation so that negative rewards are taken into account. See #317 (comment) for more

### Test

Ran `run_gsm8k.sh` with:

- orange: `batched=True` (previously working already, since the batched codepath does not convert to per-token rewards)
- green: `batched=False` (the codepath where `pass_at_n` was not computed prior to this PR)
- grey: baseline from a previous stable PR's run

<img width="1101" height="574" alt="image" src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7" />

### Rendered doc

<img width="1116" height="933" alt="image" src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0" />

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
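The pass@N change described above — using the last turn's reward as the trajectory reward and counting only strictly positive rewards as a "pass" — can be sketched as follows. The function name and signature here are illustrative assumptions, not the repo's actual metric util:

```python
from collections import defaultdict
from typing import Dict, List, Union


def compute_pass_at_n(
    rewards: List[Union[float, List[float]]],
    uids: List[str],
) -> float:
    """Hypothetical sketch of pass@N over N rollouts per prompt uid.

    For multi-turn rollouts (reward given as a list of per-turn rewards),
    the last turn's reward stands in for the whole trajectory. A prompt
    counts as "passed" if any of its rollouts has reward > 0, so zero or
    negative rewards do not count as a pass.
    """
    by_uid: Dict[str, List[float]] = defaultdict(list)
    for reward, uid in zip(rewards, uids):
        # Deem the final turn's reward the trajectory's reward.
        traj_reward = reward[-1] if isinstance(reward, list) else reward
        by_uid[uid].append(traj_reward)
    passed = sum(1 for rs in by_uid.values() if any(r > 0 for r in rs))
    return passed / len(by_uid)
```

Grouping by uid is what makes this pass@N rather than a mean reward: each prompt contributes one pass/fail outcome, no matter how many rollouts it has.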
Fix pass_at_k missing for SkyRLGymGenerator.agent_loop flow due to token-level rewards (NovaSky-AI#271)

In NovaSky-AI#226, we started building per-token rewards for the `agent_loop()` codepath to enable per-step rewards. However, `get_metrics_from_generator_output()` does not compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves this by still using a per-trajectory reward when all intermediate rewards are None.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
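A minimal sketch of the idea behind this (later reverted) fix, assuming a hypothetical helper name rather than the repo's actual code: collapse per-step rewards to a single float when only the final step carries a reward, so the metric util takes the per-trajectory branch.

```python
from typing import List, Optional, Union


def format_trajectory_reward(
    per_step_rewards: List[Optional[float]],
) -> Union[float, List[Optional[float]]]:
    """Collapse per-step rewards to a single float when all intermediate
    rewards are None (i.e. only the final step is rewarded).

    Per-trajectory rewards are returned as a float so that pass_at_n can
    still be computed; genuine per-step rewards are kept as a list, for
    which pass_at_n is skipped.
    """
    if all(r is None for r in per_step_rewards[:-1]):
        # Per-trajectory case: only the last step carries a reward.
        final = per_step_rewards[-1]
        return float(final) if final is not None else 0.0
    # Genuine per-step rewards: keep the list unchanged.
    return per_step_rewards
```

As #317 later argued, the `None` sentinel this approach relies on is ill-defined, which is why the follow-up instead deems the last turn's reward the trajectory reward.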