[Generator] Support turn-level rewards in SkyRLGymGenerator#226
Merged
tyler-griggs merged 6 commits into NovaSky-AI:main (Sep 4, 2025)
Conversation
…kyrl_gym_generator.py, skyrl_train/generators/skyrl_gym_generator.py)
force-pushed from 7131a84 to 2f7f85f
SumanthRH reviewed on Sep 4, 2025
Member (Author)
Okay, now ready for review! I'll add a wandb screenshot shortly.
SumanthRH reviewed on Sep 4, 2025
```python
train_dataset = dataset["train"]
val_dataset = dataset["test"]

instruction_following = 'Let\'s think step by step and output the final answer after "####".'
```
Member
We can assume max_turns > 1 for this example, so the prompt can be:
Suggested change
```diff
-instruction_following = 'Let\'s think step by step and output the final answer after "####".'
+instruction_following = 'Let\'s think step by step and output a tentative numeric answer after "####".'
```
tyler-griggs pushed a commit that referenced this pull request on Sep 11, 2025
…to token-level rewards (#271)

In #226, we started building per-token rewards for the `agent_loop()` codepath to enable per-step rewards. However, in `get_metrics_from_generator_output()`, we do not compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves that by still using the per-trajectory reward when all intermediate rewards are None.

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
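The fallback can be sketched roughly as follows. `summarize_rewards` is a hypothetical, simplified stand-in for `get_metrics_from_generator_output()` (the real function takes a `GeneratorOutput`); it only illustrates reverting to per-trajectory reward when every intermediate reward is `None`:

```python
from collections import defaultdict
from typing import List, Optional, Tuple, Union

import numpy as np

Reward = Union[float, List[Optional[float]]]


def summarize_rewards(
    rewards: List[Reward], uids: List[str]
) -> Tuple[float, Optional[float]]:
    """Sketch: mean raw reward plus pass@N, with a per-trajectory fallback."""
    if isinstance(rewards[0], list):
        # Turn/token-level rewards: sum the non-None entries per sequence.
        scalar = [sum(r for r in seq if r is not None) for seq in rewards]
        mean_raw_reward = float(np.mean(scalar))
        # If only the final turn carries a reward, pass@N is well-defined
        # again: fall back to treating the last reward as the trajectory reward.
        if all(r is None for seq in rewards for r in seq[:-1]):
            trajectory = [seq[-1] if seq[-1] is not None else 0.0 for seq in rewards]
        else:
            return mean_raw_reward, None  # ill-defined for true token-level rewards
    else:
        scalar = trajectory = list(rewards)
        mean_raw_reward = float(np.mean(scalar))
    # pass@N: group the N samples by prompt uid; a prompt "passes" if any
    # of its samples has a strictly positive trajectory reward.
    by_uid = defaultdict(list)
    for uid, r in zip(uids, trajectory):
        by_uid[uid].append(r)
    pass_at_n = float(np.mean([any(r > 0 for r in rs) for rs in by_uid.values()]))
    return mean_raw_reward, pass_at_n
```

For example, two samples of the same prompt with rewards `[None, None, 1.0]` and `[None, None, 0.0]` yield a mean raw reward of 0.5 and pass@N of 1.0, while any non-None intermediate reward disables pass@N.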
This was referenced Sep 16, 2025
This was referenced Sep 18, 2025
CharlieFRuan added a commit that referenced this pull request on Sep 22, 2025
Fixes #311.

### PRs around this issue
- `pass_at_n` was no longer computed for multi-turn rollouts after #226.
- That was fixed by introducing a `None` reward, which is ill-defined and was later reverted: #271.

### This PR
- Treats the last turn's reward as the entire trajectory's reward (with > 0 signifying a "pass") for the purpose of computing `pass@N`.
- Adds documentation about (per-turn) rewards, metrics, and per-token reward conversion (for better intuition) in `Creating a New Environment or Task`, for lack of a better place.
- Adds a unit test and more documentation to the metric util.
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`, since our `BaseTextEnvStepOutput.reward` is `float`, not `Optional[float]`. Also adds corresponding documentation: return `0.0` as the reward for intermediate turns if not using turn-level rewards.
- Adds a minor fix to the pass_at_n computation so that negative rewards are taken into account. See this comment for more: #317 (comment)

### Test
Ran `run_gsm8k.sh` with:
- orange: `batched=True` (previously working already, since the batched codepath does not convert to per-token rewards)
- green: `batched=False` (the codepath where `pass_at_n` was not computed prior to this PR)
- grey: baseline from a previous stable PR's run

<img width="1101" height="574" alt="image" src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7" />

### Rendered doc
<img width="1116" height="933" alt="image" src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0" />

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
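The pass@N rule this commit describes can be illustrated with a small sketch. `pass_at_n` here is a hypothetical helper, not the actual metric util; it assumes each trajectory's rewards are already grouped per turn:

```python
from collections import defaultdict
from typing import Dict, List


def pass_at_n(turn_rewards: List[List[float]], uids: List[str]) -> float:
    """For multi-turn rollouts, treat the final turn's reward as the
    trajectory reward; only a strictly positive value counts as a pass."""
    by_uid: Dict[str, List[float]] = defaultdict(list)
    for uid, turns in zip(uids, turn_rewards):
        by_uid[uid].append(turns[-1])
    # A prompt passes if any of its N samples ends with reward > 0;
    # zero and negative final rewards both count as failures.
    return sum(any(r > 0 for r in rs) for rs in by_uid.values()) / len(by_uid)
```

Note the strict `> 0` comparison: this is where a negative final reward is treated as a failure rather than as a pass.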
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request on Sep 28, 2025
…to token-level rewards (#271)
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request on Sep 28, 2025
Fixes NovaSky-AI/SkyRL#311.
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request on Jan 3, 2026
…to token-level rewards (#271)
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request on Jan 3, 2026
Fixes NovaSky-AI/SkyRL#311.
dzorlu referenced this pull request in fleet-ai/SkyRL on Feb 4, 2026
Addressing issue #201
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request on Feb 4, 2026
…to token-level rewards (NovaSky-AI#271)
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request on Feb 4, 2026
…aSky-AI#317) Fixes NovaSky-AI#311.
Addressing issue #201

What does this PR do?
Add support for turn-level rewards in SkyRLGymGenerator. The trainer already supports token-level rewards, but previously the SkyRLGymGenerator ignored rewards returned by `step()` except on the final turn. This PR tracks per-`step()` rewards and builds a token-level reward list that associates each `step()` reward with the final token of the assistant response that produced it.

Tests
- Added CPU tests for the main functionality and edge cases (e.g., response truncation due to exceeding max length)
- e2e test with step-level rewards: WIP
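The mapping described above can be sketched roughly like this. `build_token_level_rewards` is a hypothetical helper (not the actual SkyRLGymGenerator code); it assumes each assistant turn's token count is known and that no turn was truncated away:

```python
from typing import List


def build_token_level_rewards(
    response_token_lens: List[int], step_rewards: List[float]
) -> List[float]:
    """Build one reward per response token: each step() reward lands on the
    final token of the assistant turn that earned it; all other tokens get 0.0."""
    rewards: List[float] = []
    for n_tokens, reward in zip(response_token_lens, step_rewards):
        rewards.extend([0.0] * (n_tokens - 1) + [reward])
    return rewards
```

For instance, two turns of 3 and 2 response tokens with step rewards `[0.0, 1.0]` produce `[0.0, 0.0, 0.0, 0.0, 1.0]`, so the trainer's existing token-level reward path can consume turn-level rewards unchanged.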