[Generator] Support turn-level rewards in SkyRLGymGenerator #226

Merged
tyler-griggs merged 6 commits into NovaSky-AI:main from tyler-griggs:tgriggs/step_rewards
Sep 4, 2025
Conversation


@tyler-griggs tyler-griggs commented Aug 30, 2025

Addressing issue #201

What does this PR do?

Add support for turn-level rewards in SkyRLGymGenerator. The trainer already supports token-level rewards, but previously the SkyRLGymGenerator ignored rewards returned by step() except for the final turn. This PR tracks per-step rewards and builds a token-level reward list that associates each step() reward with the final token in the assistant's response that led to the reward.
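The mapping described above can be sketched as follows. This is a minimal illustration of the idea, not the actual SkyRLGymGenerator code; `build_token_level_rewards` and `response_token_spans` are hypothetical names introduced here for clarity.

```python
from typing import List, Tuple

def build_token_level_rewards(
    step_rewards: List[float],
    response_token_spans: List[Tuple[int, int]],
    num_response_tokens: int,
) -> List[float]:
    """Place each step's reward on the last token of the assistant
    response that produced it; every other token gets 0.0."""
    token_rewards = [0.0] * num_response_tokens
    for reward, (start, end) in zip(step_rewards, response_token_spans):
        # Spans are (start, end) with `end` exclusive, so the final
        # token of the turn sits at index end - 1.
        token_rewards[end - 1] = reward
    return token_rewards
```

With two assistant turns covering tokens 0-2 and 3-5, the step rewards land on token indices 2 and 5, matching the trainer's existing token-level reward format.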

Tests

Added CPU tests for main functionality and edge cases (e.g., response truncation due to exceeding max length)

e2e test with step-level rewards: WIP

@tyler-griggs tyler-griggs marked this pull request as ready for review August 30, 2025 21:30
…kyrl_gym_generator.py, skyrl_train/generators/skyrl_gym_generator.py)
Member

to be filled?

Member Author

Yeah it's a WIP

@tyler-griggs
Member Author

Okay now ready for review! I'll add a wandb screenshot shortly.

train_dataset = dataset["train"]
val_dataset = dataset["test"]

instruction_following = 'Let\'s think step by step and output the final answer after "####".'
Member

We can assume max_turns > 1 for this example, so the prompt can be:

Suggested change
instruction_following = 'Let\'s think step by step and output the final answer after "####".'
instruction_following = 'Let\'s think step by step and output a tentative numeric answer after "####".'

Member Author

Good call, thanks

@SumanthRH SumanthRH (Member) left a comment

Left a nit, lgtm

@tyler-griggs tyler-griggs merged commit 922a01a into NovaSky-AI:main Sep 4, 2025
3 checks passed
tyler-griggs pushed a commit that referenced this pull request Sep 11, 2025
…to token-level rewards (#271)

In #226, we started building
per-token reward for the `agent_loop()` codepath to enable per-step
reward. However, in `get_metrics_from_generator_output()`, we do not
compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves this by still using per-trajectory reward when all
intermediate rewards are None.
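The fallback described above can be sketched as follows. This is an illustrative standalone version under the stated assumption (token-level reward lists where only the final entry may be set); `sequence_rewards_for_metrics` is a hypothetical name, not the repo's actual metric util.

```python
from typing import List, Optional

def sequence_rewards_for_metrics(
    rewards: List[List[Optional[float]]],
) -> Optional[List[float]]:
    """If every sequence carries only a final reward (all intermediate
    entries are None), collapse to per-trajectory rewards so pass@N
    stays well-defined; otherwise return None to skip the metric."""
    collapsed: List[float] = []
    for seq in rewards:
        if any(r is not None for r in seq[:-1]):
            # Genuine intermediate rewards: per-trajectory reward is
            # ill-defined, so pass@N is not computed.
            return None
        collapsed.append(seq[-1] if seq[-1] is not None else 0.0)
    return collapsed
```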

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
CharlieFRuan added a commit that referenced this pull request Sep 22, 2025
Fixes #311.

### PRs around this issue
- `pass_at_n` no longer computed for multi-turn rollouts after
#226
- This PR fixed it by introducing `None` reward, which is ill-defined
and later reverted: #271

### This PR
- Treats the last turn's reward as the entire trajectory's reward (with
reward > 0 signifying a "pass") for the purpose of computing `pass@N`
- Adds documentation about (per-turn) rewards, metrics, and per-token
reward conversion (for better intuition) in `Creating a New Environment
or Task`, for lack of a better place to put it
- Adds a unit test and more documentation to the metric util
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`,
since our `BaseTextEnvStepOutput.reward` is `float`, not
`Optional[float]`. Also adds corresponding documentation saying to
return `0.0` as the reward for intermediate turns when not using
turn-level rewards
- Also adds a minor fix to the `pass_at_n` computation so that negative
rewards are taken into account. See this comment for more:
#317 (comment)
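The pass@N semantics in the first bullet (reward > 0 counts as a pass, trajectories grouped by prompt uid) can be sketched as follows. This is a hypothetical standalone version for intuition, not the repo's `get_metrics_from_generator_output`.

```python
from collections import defaultdict
from typing import Dict, List

def pass_at_n(trajectory_rewards: List[float], uids: List[str]) -> float:
    """Fraction of prompts (grouped by uid) with at least one passing
    trajectory, where a pass means the trajectory's final-turn reward
    is strictly positive (so negative rewards never count as a pass)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for reward, uid in zip(trajectory_rewards, uids):
        groups[uid].append(reward)
    passed = sum(1 for rs in groups.values() if any(r > 0 for r in rs))
    return passed / len(groups)
```

Checking `r > 0` rather than comparing against the group's max is the minor fix mentioned above: it keeps a group of all-negative rewards from registering as a pass.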

### Test
- Ran `run_gsm8k.sh` with:
  - orange: `batched=True` (previously working already, since the batched path does not convert to per-token rewards)
  - green: `batched=False` (the `agent_loop()` codepath, where `pass_at_n` was not computed prior to this PR)
  - grey: baseline from a previous stable PR's run
<img width="1101" height="574" alt="image"
src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7"
/>


### Rendered doc
<img width="1116" height="933" alt="image"
src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0"
/>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
…to token-level rewards (#271)

ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
…to token-level rewards (#271)

SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…to token-level rewards (NovaSky-AI#271)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…aSky-AI#317)
