[Generator] Support turn-level rewards in SkyRLGymGenerator #226

Merged
tyler-griggs merged 6 commits into NovaSky-AI:main from tyler-griggs:tgriggs/step_rewards
Sep 4, 2025
Conversation


@tyler-griggs tyler-griggs commented Aug 30, 2025

Addressing issue #201

What does this PR do?

Add support for turn-level rewards in SkyRLGymGenerator. The trainer already supports token-level rewards, but previously the SkyRLGymGenerator ignored rewards returned by step() except for the final turn. This PR tracks per-step rewards and builds a token-level reward list that associates each step() reward with the final token in the assistant's response that led to the reward.
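The mapping described above can be sketched as follows. This is a minimal illustration of the idea, not the actual SkyRLGymGenerator code; `build_token_level_rewards` and `response_token_spans` are hypothetical names introduced here for clarity.

```python
from typing import List, Tuple

def build_token_level_rewards(
    step_rewards: List[float],
    response_token_spans: List[Tuple[int, int]],
    num_response_tokens: int,
) -> List[float]:
    """Place each step's reward on the last token of the assistant
    response that produced it; every other token gets 0.0."""
    token_rewards = [0.0] * num_response_tokens
    for reward, (start, end) in zip(step_rewards, response_token_spans):
        # Spans are (start, end) with `end` exclusive, so the final
        # token of the turn sits at index end - 1.
        token_rewards[end - 1] = reward
    return token_rewards
```

With two assistant turns covering tokens 0-2 and 3-5, the step rewards land on token indices 2 and 5, matching the trainer's existing token-level reward format.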

Tests

Added CPU tests for main functionality and edge cases (e.g., response truncation due to exceeding max length)

e2e test with step-level rewards: WIP

@tyler-griggs tyler-griggs marked this pull request as ready for review August 30, 2025 21:30
…kyrl_gym_generator.py, skyrl_train/generators/skyrl_gym_generator.py)
Member

to be filled?

Member Author

Yeah it's a WIP

@tyler-griggs
Member Author

Okay now ready for review! I'll add a wandb screenshot shortly.

train_dataset = dataset["train"]
val_dataset = dataset["test"]

instruction_following = 'Let\'s think step by step and output the final answer after "####".'
Member

We can assume max_turns > 1 for this example, so the prompt can be:

Suggested change
instruction_following = 'Let\'s think step by step and output the final answer after "####".'
instruction_following = 'Let\'s think step by step and output a tentative numeric answer after "####".'

Member Author

Good call, thanks

@SumanthRH SumanthRH (Member) left a comment

Left a nit, lgtm

@tyler-griggs tyler-griggs merged commit 922a01a into NovaSky-AI:main Sep 4, 2025
3 checks passed
tyler-griggs pushed a commit that referenced this pull request Sep 11, 2025
…to token-level rewards (#271)

In #226, we started building
per-token reward for the `agent_loop()` codepath to enable per-step
reward. However, in `get_metrics_from_generator_output()`, we do not
compute pass_at_n for token-level rewards:

```python
def get_metrics_from_generator_output(
    generator_output: GeneratorOutput, uids: List[str]
) -> Tuple[float, Optional[float]]:
    ...
    if isinstance(rewards[0], list):
        # We just compute mean over sequence reward.
        # TODO: We should make metrics customizable by the environment
        mean_raw_reward = float(np.mean([sum(seq_rewards) for seq_rewards in rewards]))
        pass_at_n = None  # not computed for token-level rewards since it's ill-defined
    else:
        ...
```

This PR resolves this by still using per-trajectory reward when all
intermediate rewards are None.
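The fallback described above can be sketched as follows. This is an illustrative standalone version under the stated assumption (token-level reward lists where only the final entry may be set); `sequence_rewards_for_metrics` is a hypothetical name, not the repo's actual metric util.

```python
from typing import List, Optional

def sequence_rewards_for_metrics(
    rewards: List[List[Optional[float]]],
) -> Optional[List[float]]:
    """If every sequence carries only a final reward (all intermediate
    entries are None), collapse to per-trajectory rewards so pass@N
    stays well-defined; otherwise return None to skip the metric."""
    collapsed: List[float] = []
    for seq in rewards:
        if any(r is not None for r in seq[:-1]):
            # Genuine intermediate rewards: per-trajectory reward is
            # ill-defined, so pass@N is not computed.
            return None
        collapsed.append(seq[-1] if seq[-1] is not None else 0.0)
    return collapsed
```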

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
CharlieFRuan added a commit that referenced this pull request Sep 22, 2025
Fixes #311.

### PRs around this issue
- `pass_at_n` no longer computed for multi-turn rollouts after
#226
- This PR fixed it by introducing `None` reward, which is ill-defined
and later reverted: #271

### This PR
- Treats the last turn's reward as the entire trajectory's reward (with
reward > 0 signifying a "pass") for the purpose of computing `pass@N`
- Adds documentation about (per-turn) rewards, metrics, and per-token
reward conversion (for better intuition) in `Creating a New Environment
or Task`, for lack of a better place to put it
- Adds a unit test and more documentation to the metric util
- Removes the `Optional[float]` annotation in `skyrl_gym_generator.py`,
since our `BaseTextEnvStepOutput.reward` is `float`, not
`Optional[float]`. Also adds corresponding documentation saying to
return `0.0` as the reward for intermediate turns when not using
turn-level rewards
- Also adds a minor fix to the `pass_at_n` computation so that negative
rewards are taken into account. See this comment for more:
#317 (comment)
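The pass@N semantics in the first bullet (reward > 0 counts as a pass, trajectories grouped by prompt uid) can be sketched as follows. This is a hypothetical standalone version for intuition, not the repo's `get_metrics_from_generator_output`.

```python
from collections import defaultdict
from typing import Dict, List

def pass_at_n(trajectory_rewards: List[float], uids: List[str]) -> float:
    """Fraction of prompts (grouped by uid) with at least one passing
    trajectory, where a pass means the trajectory's final-turn reward
    is strictly positive (so negative rewards never count as a pass)."""
    groups: Dict[str, List[float]] = defaultdict(list)
    for reward, uid in zip(trajectory_rewards, uids):
        groups[uid].append(reward)
    passed = sum(1 for rs in groups.values() if any(r > 0 for r in rs))
    return passed / len(groups)
```

Checking `r > 0` rather than comparing against the group's max is the minor fix mentioned above: it keeps a group of all-negative rewards from registering as a pass.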

### Test
- Ran `run_gsm8k.sh` with:
  - orange: `batched=True` (previously working already, since the batched path does not convert to per-token rewards)
  - green: `batched=False` (the `agent_loop()` codepath, where `pass_at_n` was not computed prior to this PR)
  - grey: baseline from a previous stable PR's run
<img width="1101" height="574" alt="image"
src="https://github.com/user-attachments/assets/eca0ddae-8c64-457f-af49-a2cd4aaeb2f7"
/>


### Rendered doc
<img width="1116" height="933" alt="image"
src="https://github.com/user-attachments/assets/ecf1e58e-3d49-4251-9cd4-76fe59c758f0"
/>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Tyler Griggs <131809874+tyler-griggs@users.noreply.github.com>
ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
…to token-level rewards (#271)

ztcanddota added a commit to ztcanddota/skyagent that referenced this pull request Sep 28, 2025
SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
…to token-level rewards (#271)

SungjunlaLee added a commit to SungjunlaLee/SkyRL that referenced this pull request Jan 3, 2026
dzorlu referenced this pull request in fleet-ai/SkyRL Feb 4, 2026
dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…to token-level rewards (NovaSky-AI#271)

dzorlu pushed a commit to fleet-ai/SkyRL that referenced this pull request Feb 4, 2026
…aSky-AI#317)
